String manipulation

[scissors cut tape]

This chapter deals with the conversion of strings to numbers and with the basic operations on strings implemented in the string library.

Table of contents:

Strings that represent numbers

A decimal character chain is any sequence of decimal digits (i.e., characters 0 to 9) possibly preceded by the character + or − .  A decimal string is any ASCII string that represents a decimal character chain.  For exemple, the decimal string  45 57 56 55 54 0  represents (see the ASCII table) the character chain  −9876 .

How to convert a decimal string into the corresponding number of type int?  The function atoi (the name is a shorthand for alphanumeric-to-int) of the stdlib library receives a decimal string and returns the corresponding int. Unfortunately, the function does not check whether the string represents a valid int (in particular, it does not check whether the string is too long) and so creates opportunities for errors. This can be exploited by malicious third parties, especially when used to treat command line arguments.  Therefore, it is better to use the more complex strtol function.

The  strtol function  (the name is a shorthand for string-to-long) of the stdlib library converts a decimal string into the corresponding long int.  This function accepts decimal strings as well as strings in other bases: the third argument specifies the base.  For example,

strtol ("-9876", NULL, 10)

returns the number −9876 and  strtol ("-9876", NULL, 16)  returns −39030.  We shall not discuss the meaning of the second argument for now.

Exercises 1

  1. A-to-i.  Write a function that emulates the behavior of atoi but checks whether the given string is valid. If the string is too long, discard the last digits.
  2. I-to-a.  Write a function itoa (the name is a shorthand for int-to-alphanumeric) that receives um integer n and return a decimal string that represents n. For example, the integer −123 must be converted into the string "-123".
  3. From binary string to int.  Write a function to receive a string of '0's and '1's (that is, bytes 48 and 49), interprets this string as a natural number in binary notation, and return the corresponding int. For example, the string "1111011" must be converted into the integer 123. (If the string is too long, discard the last digits.)
  4. From int to binary string.  Write a function to receive an int n and return the string of '0's and '1's (that is, bytes 48 and 49) that represents n in binary notation. For example, the int 123 must be converted into the string "1111011".
  5. Write a function to convert a decimal string into the corresponding hexadecimal string.  To test your function, write a program that receives a decimal character chain on the command line and prints the corresponding hexadecimal sequence.  Repeat the exercise to convert hexadecimal into decimal, then decimal into octal, then octal into decimal, etc.

Lexicographic order

We say that a byte b is smaller than a byte c if the number represented by b in binary notation is smaller than the number represented by c.  For example, 01000001 is smaller than 01000010.  In the set of bytes whose first bit is 0, this order relation between bytes is consistent with the usual order ('A' < 'B' < ... < 'Z') of the letters in the ASCII alphabet according to the ASCII table.

Now that we defined the order relation between bytes, we can deal with the order relation between strings.  We say that a string  s  is lexicographically smaller than a string  t  if the first byte of s that differs from the corresponding byte of t is smaller than that byte of t.  More precisely, s is lexicographically smaller than t if

s[k] < t[k]

for the smallest k such that s[k] ≠ t[k]. Moreover, if such k does not exist then s and t are equal or one is a proper prefix of the other; in the second case, the shorter string is considered lexicographically smaller that the longer one.

A sequence of strings is in lexicographic order if each string in the sequence is lexicographically smaller than (or equal to) each of the subsequent strings. (Some people say alphabetic order in place de lexicographic order, but this is incorrect.)  In the case of ASCII strings, the lexicographic order is analogous to the order in which words appear in a dictionary (except that it puts all the uppercase letters before all the lowercase).  For example, the following list of strings is in lexicographic order:

a
ab
abc
abcd
abd
ac
acd
ad
b
bc
bcd
bd
c
cd
d

(See also the Lexicographic order entry in the appendix.)

Exercises 2

  1. Write a function to receive two strings, say s and t, and decide whether s is lexicographically smaller than t.
  2. Someone wrote the following code to decide whether "apple" is lexicographically smaller than "banana". What is wrong?
    char *a, *b;
    a = "apple"; b = "banana";
    if (a < b) 
       printf ("%s is smaller than %s\n", a, b);
    else       
       printf ("%s is smaller than %s\n", b, a);
    
  3. Someone wrote the following code to decide whether "apple" is lexicographically smaller than "banana". What is wrong?
    char *a, *b;
    a = "apple"; b = "banana";
    if (*a < *b) 
       printf ("%s is smaller than %s\n", a, b);
    else         
       printf ("%s is greater than %s\n", a, b);
    

The string library

The string library contains various string manipulatation functions.  We describe next the most important ones.

String length.  The function strlen receives a string and returns its length.  This function could be written thus:

unsigned int strlen (char *s) {
   int k;
   for (k = 0; s[k] != '\0'; ++k) ;
   return k;
}

String copy.  The function strcpy receives two strings and copies the second (including the final null byte) to the space occupied by the first. The original content of the first string is lost. Do not call this function if the length of the first string is smaller than the length of the second.  (Buffer overflow is a frequent cause of security bugs!)  This function could be written like this:

void strcpy (char *s, char *t) {
   int i;
   for (i = 0; t[i] != '\0'; ++i) 
      s[i] = t[i];
   s[i] = '\0';
}

(Actually, the function strcpy is not of type void: it returns its first argument. This is useful in some situations, but in general the user is only interested in the side effect of the function and ignores its return value.)

The function strcpy can be used as follows to obtain an effect similar to that of the example in the beginning of the chapter Strings and character chains:

char s[10];
strcpy (s, "ABC");

String compare. The function strcmp compares two strings lexicographically, byte-for-byte. It returns a strictly negative number if the first string is lexicographically smaller than the second, returns 0 if the two strings are equal, and returns a strictly positive number if the first string is lexicographically greater than the second.  This function could have been coded like this:

int strcmp (char *s, char *t) {
   int i;
   for (i = 0; s[i] == t[i]; ++i)
      if (s[i] == '\0') return 0;
   unsigned char si = s[i], ti = t[i];
   return si - ti;
}

Though the parameters of the function are of type char *, the function behaves as if they were of type unsigned char *. (One must remember that the value of the expression  si - ti  is computed in int arithmetic rather than modulo 256.) 

If s and t are ASCII strings, the lexicographic order in based on the ordering of characters  (... 0 .. 9 .. A B .. Z .. a b .. z ...)  in the ASCII table. For example, the following sequence of words is in lexicographic order:

Acapulco
Seattle
acarus
amoeba
among
naive
native
pinata
pitch
role
rubber
seat

String collate. To compare non-ASCII strings you should use the function strcoll from the string library. The behavior of this function depends on the value of environment variable LC_COLLATE on your system. We shall assume that our system uses UTF-8 encoding and that the value of LC_COLLATE is en_US.UTF-8.  (Type locale on the terminal to learn the value of this variable.)

Function strcoll receives two strings that represent character chains and decides the relative positions of these chains in a dictionary order that respects the norms of the local language, as determined by LC_COLLATE.  The function returns a strictly negative number if the first chain comes before the second, returns zero if the two chains are equal, and returns a strictly positive number if the first chain comes after the second.  The following list, for example, is in dictionary order:

Acapulco
acarus
amoeba
amœba
among
naive
naïve
native
pinata
piñata
pitch
role
rôle
rubber
seat
Seattle

Exercises 3

  1. What is the difference between the expressions  strcpy (s, t)  and  s = t ?
  2. What is the difference between the expressions  if (strcmp (s, t) < 0)  and  if (s < t) ?
  3. Discuss the differences between the following three code fragments:
    char a[7], b[7];
    strcpy (a, "orange"); strcpy (b, "apple");
    
    char *a = malloc (7), *b = malloc (7); 
    strcpy (a, "orange"); strcpy (b, "apple");
    
    char *a, *b; 
    a = "orange"; b = "apple";
    
  4. Study the documentation of the function strncpy in the string library.
  5. Write a function that receives a string s and returns a copy of s.  (Compare your solution with the function strdup of the string library. What is the difference between this function and strcpy?)
  6. Write a function that receives an ASCII string s and an ASCII character c and returns the index of the first position in s that is equal to c.  (Compare your function with the function strchr of the string library.)  Now write a more general version of the function that will look for c starting at a given position i.
  7. Write a function to decide whether an array vs[0..n-1] of ASCII strings is in lexicographic order.
  8. ★ Suppose given an array vs[0..n-1] of ASCII strings, in lexicographic order, and an ASCII string s.  If vs does not contain a string equal to s, insert s in the array so as to maintain the lexicographic order. Write a function to solve the problem; return 1 if the insertion was done and 0 otherwise.