The complete guide to C strings

Almost any program will need to work with text at some point. Variables storing text are commonly referred to as "strings". But while most programming languages have a dedicated data type for strings, C is much older and its approach to string handling is fairly different.

C strings

The C language has no built-in string type, only char, which has a size of 1 byte. The ASCII standard maps 128 values to English letters, digits, punctuation and control characters. (A byte typically has 8 bits and could hold 256 values, but ASCII is a 7-bit code, leaving only 2^7 = 128 defined characters.)


In order to store strings, C simply uses arrays of type char ending in a null byte '\0' to signal the end of the string. Helper functions starting with the prefixes str or mem are available through the string.h standard library header.
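
To make the "array of char ending in a null byte" idea concrete, here is a minimal sketch. The helper names count_until_null and manual_array_is_string are made up for this illustration; the first one does by hand roughly what strlen does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the bytes before the terminating '\0'. */
size_t count_until_null(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}

/* Hypothetical demo: a hand-built char array is a valid string
   as long as it ends in '\0'. Returns 1 if all checks hold. */
int manual_array_is_string(void) {
    char word[] = {'H', 'i', '\0'}; /* 3 bytes of storage, 2 visible characters */
    return sizeof word == 3
        && count_until_null(word) == 2
        && strlen(word) == 2;
}
```

Note that the array needs one byte more than the visible text, because the null byte is part of the storage but not part of the length.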


There are two common ways to create a string from a string literal:

char s[] = "Hello";
char *s = "Hello";

The two lines look similar but do different things. The first creates an array of 6 bytes (5 characters plus the null byte) and copies the literal into it. Declared inside a function, that array lives on the stack and can be modified. The second creates only a pointer to the string literal itself, which exists for the whole run of the program and must be treated as read-only: writing to it is undefined behavior. The pointer syntax is more common, as all string functions in the standard library take pointer arguments exclusively.


Since string literals are read-only and a stack array's lifetime is limited to the block it was declared in, their use is limited. If you need a string that you can both alter and return from functions, you typically need a heap-allocated string.
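
The difference between an array initialized from a literal and a pointer to a literal can be sketched as follows. The function names are hypothetical and only exist for this illustration.

```c
#include <stddef.h>

/* Hypothetical demo: an array initialized from a literal is a private,
   writable copy of the text. */
char edit_array_copy(void) {
    char arr[] = "Hello"; /* 6 bytes on the stack: 5 characters + '\0' */
    arr[0] = 'J';         /* fine: arr is this function's own storage */
    return arr[0];
}

/* Hypothetical demo: sizeof on the array yields the full storage size. */
size_t array_size(void) {
    char arr[] = "Hello";
    return sizeof arr;
}

/* Hypothetical demo: returning a pointer to a literal is valid, because the
   literal outlives the function. Never write through this pointer. */
const char *greeting(void) {
    return "Hello";
}
```

Declaring pointers to literals as const char* lets the compiler reject accidental writes.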


You can either allocate heap memory manually and then copy a string into it:

char *s = calloc(10, sizeof(char));
if(!s){
  // memory allocation error
}
strncpy(s, "Hello", 9); // copy at most 9 bytes so the last byte stays '\0' from calloc

Or duplicate a string literal into a heap string directly:

char *s = strdup("Hello");
if(!s){
  // memory allocation error
}

Remember to call free() on heap-allocated strings when you are done using them, as they are not automatically disposed of at the end of the function like stack arrays. Not freeing them keeps their memory allocated until the program ends, resulting in a memory leak.
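
As a sketch of the ownership rule, the hypothetical function make_greeting below returns a heap-allocated string, and the caller is responsible for freeing it. The name and the greeting format are assumptions made for this example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: builds "Hello, <name>!" on the heap.
   Returns NULL if the allocation fails. The caller must free() the result. */
char *make_greeting(const char *name) {
    const char *prefix = "Hello, ";
    size_t size = strlen(prefix) + strlen(name) + 2; /* '!' and '\0' */
    char *out = malloc(size);
    if (!out) {
        return NULL;
    }
    snprintf(out, size, "%s%s!", prefix, name);
    return out;
}
```

A caller would use it like char *g = make_greeting("C"); and, after checking g for NULL and using it, call free(g).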


You may have noticed that the string literals above do not contain an explicit trailing null byte \0. The C compiler implicitly adds one to every string literal, so the string literal "abc" will produce the array {'a', 'b', 'c', '\0'}. This mechanism only applies to string literals. In all other cases, you have to add the trailing null byte manually.
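
When you build a string yourself instead of using a literal, the null byte is your job. A minimal sketch, using a hypothetical helper repeat_char that fills a buffer character by character:

```c
#include <stddef.h>

/* Hypothetical helper: writes `count` copies of `c` into dst, truncating
   so the result fits, and always terminates it with '\0'.
   Returns the number of characters written (not counting the '\0'). */
size_t repeat_char(char *dst, size_t dst_size, char c, size_t count) {
    if (dst_size == 0) {
        return 0; /* no room even for the terminator */
    }
    size_t n = count < dst_size - 1 ? count : dst_size - 1;
    for (size_t i = 0; i < n; i++) {
        dst[i] = c;
    }
    dst[n] = '\0'; /* nothing adds this for us */
    return n;
}
```

Reserving the last byte of the buffer for the terminator is the key step; without it, the result would not be a valid C string.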

Pointer decay and length issues

Before showing how to work with strings, it is important to understand some known caveats and library conventions.


All string functions in the standard library take strings as char pointers (char*). C cannot pass arrays by value: an array passed to a function is converted to a pointer to its first element. This also avoids copying the whole string on every call, which would add up quickly since strings can be arbitrarily long. And if the function needs to modify the string, it needs a pointer anyway.


Pointer arithmetic allows traversing an array with pointers alone: adding 1 to a pointer moves it forward by the size of one element, and subtracting 1 moves it backward. To read the character a pointer currently points to, dereference it with *.

char *s = "ab";
printf("%c", *s); // prints "a"
s++; // move pointer one element forward
printf("%c", *s); // prints "b"
s--; // move pointer one element backward
printf("%c", *s); // prints "a"

Unfortunately, passing a pointer to the first element instead of the entire array means losing context, specifically the array length. This is known as array-to-pointer decay, and it forces programmers to find the string/array length in some other way.
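
The loss of length information can be observed with sizeof. In this sketch (function names are hypothetical), the function that owns the array sees its full size, while a function receiving it as a pointer only sees the size of the pointer:

```c
#include <stddef.h>

/* Hypothetical demo: inside the callee, only a pointer is left,
   so sizeof reports the pointer size, not the buffer size. */
size_t size_seen_by_callee(const char *s) {
    return sizeof s;
}

/* Hypothetical demo: the function that declares the array still
   knows its full size. */
size_t size_seen_by_owner(void) {
    char buf[32] = "Hello";
    return sizeof buf;
}
```

This is why functions working on buffers usually take an extra size parameter alongside the pointer.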


The null byte is traditionally used for exactly this purpose: if any of the standard string functions encounters a null byte \0, it assumes the string ends there. If a string does not end with a null byte, the result is undefined behavior: the function may keep reading memory until it finds a null byte by coincidence, or the program may crash with a memory error ("segmentation fault" or "segfault") if it attempts to read memory it is not allowed to access.

Common string functions

The string.h library provides many string functions, but most program needs boil down to a common subset of them.


Get string length:

size_t len = strlen("Hello");

Create heap copy of string:

char *s = strdup("Hello");
if(!s){
  // memory allocation failed
}

Compare strings:

int result = strcmp("one", "two");
if(result == 0){
  // strings are equal
}else if(result < 0){
  // first string is smaller
}else{
  // second string is smaller
}

Find substring in string:

char *match = strstr("mystring", "string");
if(!match){
  // "string" not found in "mystring"
}
// match points to the start of "string" in "mystring"

Split a string by delimiter:

// strtok modifies the string, so it needs a writable copy (never a string literal)
char *s = strdup("my/custom/string");
if(!s){
  // memory allocation failed
}
char *token = strtok(s, "/"); //pass string on first pass
while(token){
  printf("%s ", token);
  token = strtok(NULL, "/"); // pass NULL instead of string after first pass
}
// prints "my custom string"

Iterate over characters:

char *s = "Hello";
for(char *p = s; *p; p++){
  printf("%c ", *p);
}
// prints "H e l l o"

Bounded string functions

Since forgetting the null byte, and attacks that exploit a missing one, are such common problems, the string.h library provides bounded variants of functions that need to know the string length, starting with strn instead of str.


For example, the strlen function has a bounded variant strnlen, which takes a boundary as a second parameter. The boundary is not the string length, but the maximum number of bytes it is willing to read before returning. If it finds a null byte \0 before reaching the boundary limit, it treats that as the end of the string instead.
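
The behavior described above can be sketched with memchr from standard C. The helper bounded_len is a hypothetical stand-in that mirrors what strnlen does. It is useful mainly as an illustration, or on platforms where strnlen (which is specified by POSIX) is not available.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring strnlen: returns the number of bytes
   before the first '\0', but never looks at more than `max` bytes. */
size_t bounded_len(const char *s, size_t max) {
    const char *end = memchr(s, '\0', max);
    return end ? (size_t)(end - s) : max;
}
```

If the buffer has no null byte within the boundary, the function returns the boundary itself, which a caller can treat as "unterminated or too long".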


For considerably safer string handling, you can combine a custom string type with the strn functions:

typedef struct{
  char *content;
  size_t len;
} string;

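As long as you remember to update the len field every time you change content, the length is now passed around with the string, and the strn functions can use the real string length as their boundary limit. This removes the dependence on a terminating null byte when reading, although it is not a blanket guarantee: some bounded functions have quirks of their own, for example strncpy does not add a null byte if the source fills the entire boundary.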

string s = {"Hello", 5};
char *newStr = strndup(s.content, s.len);

Available bounded string functions are:

strlen => strnlen

strdup => strndup

strcmp => strncmp

strcpy => strncpy

strcat => strncat
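
One caveat when using these: strncpy does not write a null byte if the source is at least as long as the boundary. The sketch below demonstrates this and shows one common workaround. Both function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical demo: returns 1 if a 5-byte buffer contains a '\0'
   after strncpy with the full buffer size as boundary, 0 otherwise. */
int strncpy_terminates(const char *src) {
    char buf[5];
    memset(buf, 'x', sizeof buf);
    strncpy(buf, src, sizeof buf); /* no '\0' is written if src has 5+ characters */
    return memchr(buf, '\0', sizeof buf) != NULL;
}

/* Hypothetical helper: copies as much of src as fits and always
   terminates dst, truncating the text if necessary. */
void copy_truncating(char *dst, size_t dst_size, const char *src) {
    if (dst_size == 0) {
        return;
    }
    strncpy(dst, src, dst_size - 1);
    dst[dst_size - 1] = '\0';
}
```

With a 5-byte destination, copy_truncating stores "Hell" for the source "Hello": the text is cut short, but the result is still a valid string.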

Binary strings

If you need to store binary data that may contain null bytes, you also reach for char arrays, but with a few important differences.


Firstly, you cannot use the null byte as a length indicator, so you have to store length alongside the string. Most use cases will also want to distinguish between length len (number of bytes in string) and capacity cap (amount of memory allocated for string storage):

typedef struct{
  char *content;
  size_t len;
  size_t cap;
} string;

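string *string_init(const char *content, size_t len, size_t cap){
	if(len > cap) return NULL; // content would not fit into the capacity
	string *s = malloc(sizeof(string));
	if(!s) return NULL; // memory allocation failed
	s->content = calloc(cap, sizeof(char));
	if(!s->content){
		free(s); // do not leak the struct if the buffer allocation fails
		return NULL;
	}
	memcpy(s->content, content, len);
	s->len = len;
	s->cap = cap;
	return s;
}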
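
To show why keeping len and cap separate is useful, here is a sketch of an append operation that only reallocates when the capacity is exhausted. The function string_append and its doubling strategy are assumptions made for this example, not part of string.h.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *content;
    size_t len;
    size_t cap;
} string;

/* Hypothetical helper: appends n bytes (which may include '\0') to s.
   Grows the buffer by doubling when needed.
   Returns 0 on success, -1 on overflow or allocation failure. */
int string_append(string *s, const char *data, size_t n) {
    if (n > SIZE_MAX - s->len) {
        return -1; /* len + n would overflow */
    }
    size_t needed = s->len + n;
    if (needed > s->cap) {
        size_t new_cap = s->cap ? s->cap : 8;
        while (new_cap < needed) {
            if (new_cap > SIZE_MAX / 2) {
                new_cap = needed; /* doubling would overflow */
                break;
            }
            new_cap *= 2;
        }
        char *grown = realloc(s->content, new_cap);
        if (!grown) {
            return -1; /* the old buffer is still valid */
        }
        s->content = grown;
        s->cap = new_cap;
    }
    if (n > 0) {
        memcpy(s->content + s->len, data, n);
    }
    s->len = needed;
    return 0;
}
```

Starting from an empty string {NULL, 0, 0}, several small appends trigger only a few reallocations, because most of them fit into the spare capacity.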

The str and strn functions all interpret the first null byte as the end of a string, so none of them can be used for binary strings. Instead, you need to use the mem functions also contained in string.h.


Copy binary string:

string *src = string_init("Hello", 5, 5);
string *dst = string_init("", 0, 5);
memcpy(dst->content, src->content, src->len); // copy src->len bytes; dst->cap must be at least that large
dst->len = src->len; // don't forget to update len!

Compare binary strings:

Since memcmp does not know the length of either string, you have to make sure both strings are equally long beforehand, otherwise it will read out of bounds of the shorter string during comparison.

string *s1 = string_init("Hello", 5, 5);
string *s2 = string_init("World", 5, 5);

// memcmp does not know string length
// ALWAYS check length manually before memcmp
if(s1->len != s2->len){
  // strings are not equal, stop comparing here
}else{
  // only use memcmp if both strings are equally long
  int result = memcmp(s1->content, s2->content, s1->len);
  if(result == 0){
    // strings are equal
  }else if(result < 0){
    // first string is smaller
  }else{
    // second string is smaller
  }
}

Iterate over binary string bytes:

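Since binary strings may contain unprintable characters, you cannot rely on puts() or the %s/%c formats of printf to produce readable output. A safe way to print binary string contents to the console is padded hexadecimal %02x. Cast each byte to unsigned char first: char may be signed, and a negative value would otherwise be printed with leading ff digits.

string *s = string_init("Hello", 5, 5);
for(size_t i = 0; i < s->len; i++){
  printf("%02x", (unsigned char)s->content[i]);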
}

If you want to output raw binary strings to a terminal/console with any format function (printf, fprintf, sprintf, ...), you cannot use %s, as it will treat the first null byte as the end of the string. Iterate over all bytes and output each using %c instead, but beware that writing unprintable characters to terminals often results in unreadable output. If you want to output the raw string contents to a file, use fwrite().
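
A minimal sketch of the fwrite() approach, using a hypothetical wrapper write_bytes that reports whether every byte was written:

```c
#include <stdio.h>

/* Hypothetical helper: writes exactly len bytes to out, including any
   '\0' bytes. Returns 0 on success, -1 if fewer bytes were written. */
int write_bytes(FILE *out, const char *data, size_t len) {
    return fwrite(data, 1, len, out) == len ? 0 : -1;
}
```

For a binary string s, the call would look like write_bytes(file, s->content, s->len). When writing to a real file, opening it in binary mode ("wb") avoids newline translation on platforms that perform it.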


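Standard C offers no binary-safe alternatives for other string functions like strstr (find substring) or strtok (split string by delimiter), and strlen (string length) is replaced by the stored len field. memchr can locate a single byte, and some platforms provide memmem as a non-standard extension for substring search.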
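
If you need a substring search on binary data, you can build one from memchr and memcmp. The helper mem_find below is a hypothetical sketch for this article: a simple, unoptimized search, not a standard library function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the first occurrence of needle in hay,
   treating both as raw bytes (embedded '\0' is allowed).
   Returns a pointer into hay, or NULL if there is no match.
   An empty needle matches at the start. */
const char *mem_find(const char *hay, size_t hay_len,
                     const char *needle, size_t needle_len) {
    if (needle_len == 0) {
        return hay;
    }
    const char *p = hay;
    size_t remaining = hay_len;
    while (remaining >= needle_len) {
        /* Only look for the first byte where the whole needle still fits. */
        const char *hit = memchr(p, needle[0], remaining - needle_len + 1);
        if (!hit) {
            return NULL;
        }
        if (memcmp(hit, needle, needle_len) == 0) {
            return hit;
        }
        remaining -= (size_t)(hit - p) + 1;
        p = hit + 1;
    }
    return NULL;
}
```

With the binary string type from above, a call would pass s->content and s->len for the haystack and the needle.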

Common pitfalls

Now that the string basics are covered, let's go over a few pitfalls when using strings in C.


Unicode

Plain C strings were designed around the 128 characters defined in the ASCII table, but modern text needs far more than 128 characters for special symbols, other scripts and even additions like emojis. Modern text therefore uses Unicode encodings, in which a single character may occupy multiple bytes. Typically, you will see UTF-8, which is backward-compatible with ASCII.

You cannot rely on functions like strlen/strnlen to compute the character count of a string, only the byte count.

Here is a quick example:

char *s = "Ä";
size_t len = strlen(s);
printf("%zu", len); // prints 2 if the source file is saved as UTF-8, where "Ä" requires 2 bytes of storage

Since most C programs will not need to interpret text encodings (for example, to know the number of characters in a string), normal C string handling is fine even for Unicode strings. But as soon as you do need to know the character count, you will have to parse Unicode sequences properly.
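
For UTF-8 specifically, counting code points is simple as long as the input is valid: every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new code point. The helper utf8_count is a hypothetical sketch that assumes well-formed UTF-8 and does no validation. It also counts code points, which is not always the same as the characters a user sees (combining marks, for example, are separate code points).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts Unicode code points in a null-terminated,
   valid UTF-8 string by skipping continuation bytes (10xxxxxx). */
size_t utf8_count(const char *s) {
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For "Ä", written as the bytes "\xC3\x84" so the example does not depend on the source file encoding, strlen reports 2 while utf8_count reports 1.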


Modifying vs allocating string functions

The only functions in string.h that allocate new heap memory are strdup/strndup. Other functions returning a char*, such as strstr or strtok, instead return a pointer into the char array passed as an argument.

Functions that do not modify the original string accept it as const char*. Any string function that takes a string argument without const should be expected to modify it in place. This is true even for functions like strtok that return a char*, as strtok also replaces the delimiters in the original string with null bytes.
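
The in-place modification done by strtok is easy to observe. In this sketch, the hypothetical helper count_tokens splits a writable buffer, and the buffer itself ends up cut at the first delimiter:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the tokens in s separated by any character
   in delims. s must be writable, because strtok overwrites the
   delimiters it finds with '\0'. */
size_t count_tokens(char *s, const char *delims) {
    size_t n = 0;
    for (char *tok = strtok(s, delims); tok; tok = strtok(NULL, delims)) {
        n++;
    }
    return n;
}
```

After calling count_tokens on a buffer holding "my/custom/string" with "/" as delimiter, the buffer reads as just "my" when treated as a string, because the first '/' has been replaced by a null byte. Copy the string first if you still need the original.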


Const is not a guarantee outside string.h

Passing strings around as const char* by convention means the function won't modify the string, but beware that this is not guaranteed for third-party code. It is perfectly possible to cast a const char* back into a char* and modify it anyway if the programmer chooses. The standard library functions will not do this, but third-party libraries providing opaque structs often do.
