Reading lines of text

Many programs take their input from some large file. Such programs can be conveniently organized as a line-oriented iterative process. Each iteration in this process has three steps:

  1. read a line of the file and convert it into a string;
  2. extract the desired information from the string;
  3. process the information.

Step 1 can be carried out by the function fgets and step 2 by the function sscanf, both from the stdio library.  This chapter offers a home-made alternative to fgets that is more comfortable and more powerful.  (This alternative is just a variant of the getline function from the GNU C Library.)
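
As a rough illustration of the three steps using only fgets and sscanf, here is a minimal sketch. The function name sumOfLines and the 256-byte buffer are assumptions made for this example, not part of the chapter. The sketch adds up the integers stored one per line in a file and skips lines that do not begin with an integer.

```c
#include <stdio.h>

// Hypothetical example: returns the sum of the integers
// stored one per line in infile. Lines that do not begin
// with an integer are skipped. Assumption: no line has
// more than 254 bytes before its \n; fgets would deliver
// a longer line in several pieces.
long sumOfLines (FILE *infile)
{
   char line[256];
   long sum = 0;
   while (fgets (line, sizeof line, infile) != NULL) { // step 1
      int value;
      if (sscanf (line, "%d", &value) == 1)            // step 2
         sum += value;                                 // step 3
   }
   return sum;
}
```

The fixed buffer size is the weak point of this organization: the programmer must guess an upper bound for the length of the lines.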

Text files

A large part of our data is stored in text files (also known as plain text).  The bytes of such a file represent characters.  If the first bit of every byte is 0, we have an ASCII file. In this case, each byte of the file represents one character of the ASCII alphabet.  In general, however, text files contain non-ASCII characters. We are assuming that all our files use UTF-8 encoding; hence, each character is represented by 1, 2, 3, or 4 consecutive bytes.  (But certain byte values are not part of the code of any character.)
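
The UTF-8 rules mentioned above can be made concrete with a small sketch. The function names utf8Length and utf8Count are hypothetical, chosen for this example only. The first function inspects the first byte of a character to decide how many bytes its code has; the second counts the characters of a string by skipping continuation bytes (those of the form 10xxxxxx). Neither function checks that the string is valid UTF-8.

```c
#include <stddef.h>

typedef unsigned char byte;

// Hypothetical helper: returns the number of bytes (1 to 4)
// in the UTF-8 code of a character whose first byte is b.
// Returns 0 if b cannot be the first byte of a character.
int utf8Length (byte b)
{
   if (b < 0x80) return 1;                // 0xxxxxxx (ASCII)
   if (b >= 0xC2 && b <= 0xDF) return 2;  // 110xxxxx
   if (b >= 0xE0 && b <= 0xEF) return 3;  // 1110xxxx
   if (b >= 0xF0 && b <= 0xF4) return 4;  // 11110xxx
   return 0; // continuation byte, or a value never used by UTF-8
}

// Hypothetical helper: returns the number of characters in
// the UTF-8 string s by counting the bytes that are not
// continuation bytes.
size_t utf8Count (const byte *s)
{
   size_t count = 0;
   for (; *s != '\0'; s++)
      if ((*s & 0xC0) != 0x80)
         count++;
   return count;
}
```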

Usually, a text file has several lines, the end of each line being indicated by a byte \n.  We say that each line of the file is a line of text.  (By the way, the last byte of every polite text file should be a \n; moreover, no line of text should have a space before the \n.)
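
The two politeness rules can be checked mechanically. The following is a minimal sketch; the function name isPolite is hypothetical, and the decision to accept an empty file is an assumption of this example.

```c
#include <stdio.h>

// Hypothetical helper: reads f from the current position to
// the end and returns 1 if the last byte read is \n and no
// \n is immediately preceded by a space. Returns 0 otherwise.
// Assumption: a file with no bytes at all is accepted.
int isPolite (FILE *f)
{
   int ch, prev = '\n'; // prev is the byte read before ch
   while ((ch = getc (f)) != EOF) {
      if (ch == '\n' && prev == ' ')
         return 0;      // space right before the end of a line
      prev = ch;
   }
   return prev == '\n'; // the last byte must be a \n
}
```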

Reading a line of text

Eric Roberts wrote a simpio library (see the simpio.h interface and the simpio.c implementation) whose functions provide a comfortable way of reading a line of text.  The main function of the library, ReadLine, reads a line of arbitrary length from a text file.  The function is more convenient than fgets, because it imposes no limit on the length of the line, but it is possibly a little slower.  Here is a slightly modified version of ReadLine:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef unsigned char byte;

// This function reads a line from the text file
// infile (starting from the current position in
// the file) and returns a string with the same
// content as the line. The \n byte that signals
// the end of the line is not stored in the
// string. The function returns NULL if the
// current position in the file is at the end of
// the file. Typical use:
//         s = readline (infile);
// (This function is a slight variant of the
// ReadLine function in the simpio library by
// Eric Roberts.)

byte *readline (FILE *infile)
{
   int n = 0, size = 128, ch;
   byte *line;
   line = malloc (size + 1);
   while ((ch = getc (infile)) != '\n' && ch != EOF) {
      if (n == size) {
         size *= 2;
         line = realloc (line, size + 1);
      }
      line[n++] = ch;
   }
   if (n == 0 && ch == EOF) {
      free (line);
      return NULL;
   }
   line[n] = '\0';
   line = realloc (line, n + 1);
   return line;
}

The function transfers bytes from infile to the array line. Every time the array becomes full, its size is doubled. Before returning, the function shrinks the array to the exact size of the string.

The function makes the (quite reasonable) assumption that the line read from the file contains no null bytes.
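
Here is a sketch of a typical loop built on readline. The helper longestLine is hypothetical and serves only as an illustration; the readline function is copied from the text so that the sketch compiles on its own. Because the line contains no null bytes, strlen gives its length in bytes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef unsigned char byte;

// Copy of the readline function given in this section,
// repeated only to make the sketch self-contained.
byte *readline (FILE *infile)
{
   int n = 0, size = 128, ch;
   byte *line = malloc (size + 1);
   while ((ch = getc (infile)) != '\n' && ch != EOF) {
      if (n == size) {
         size *= 2;
         line = realloc (line, size + 1);
      }
      line[n++] = ch;
   }
   if (n == 0 && ch == EOF) {
      free (line);
      return NULL;
   }
   line[n] = '\0';
   line = realloc (line, n + 1);
   return line;
}

// Hypothetical helper: returns the length, in bytes, of the
// longest line of infile (not counting the \n).
size_t longestLine (FILE *infile)
{
   size_t longest = 0;
   byte *line;
   while ((line = readline (infile)) != NULL) {
      size_t len = strlen ((char *) line);
      if (len > longest)
         longest = len;
      free (line); // the caller must free each string
   }
   return longest;
}
```

Each call to readline allocates a new string, so the loop frees the string at the end of every iteration.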

Exercises 1

  1. Compare the documentation of readline with that of the function fgets in the stdio library.
  2. What is the behavior of readline if it runs into a \0 before seeing a \n?
  3. The documentation of readline does not explain what happens if infile has no \n between the current position and the end of the file. Fill in this gap in the documentation.  (By the way, it is a good idea to end every text file with a \n.)
  4. Analyse the following variant of readline (it is closer to the ReadLine by Roberts).  The function strncpy used in the code is a variant of the function strcpy: it copies the first n bytes of line to nline.
       int n = 0, size = 128, ch;
       byte *line, *nline;
       line = malloc (size + 1);
       while ((ch = getc (infile)) != '\n' && ch != EOF) {
          if (n == size) {
             size *= 2;
             nline = malloc (size + 1);
             strncpy (nline, line, n);
             free (line);
             line = nline; }
          line[n++] = ch;
       }
       if (n == 0 && ch == EOF) {
          free (line);
          return NULL; }
       line[n] = '\0';
       nline = malloc (n + 1);
       strcpy (nline, line);
       free (line);
       return nline;
    
  5. In the previous exercise, what happens if we replace strncpy (nline, line, n) by strcpy (nline, line)?

Reading an integer number

The function GetInteger in Eric Roberts' simpio library uses the function ReadLine to extract an integer from a line of text typed on the keyboard. Here is a slightly modified version of that function:

// Extracts an integer from the line of text typed
// on the keyboard and returns this integer. If the
// line has some non-white-space byte before or
// after the integer, the function gives the user
// a chance to try again. If the input ends before
// an integer is read, the function stops the
// program with an error message. Typical use:
//     i = getInteger ();
// (This function is a variant of the function
// GetInteger in the simpio library by Eric
// Roberts.)

int getInteger (void) {
   while (1) {
      byte *line = readline (stdin);
      int value;
      char ch;
      if (line == NULL) {
         // end of input: there is nothing to retry
         fprintf (stderr, "Unexpected end of input\n");
         exit (EXIT_FAILURE);
      }
      switch (sscanf ((char *) line, " %d %c", &value, &ch)) {
         case 1:
            free (line);
            return value;
         case 2:
            printf ("Unexpected byte: '%c'\n", ch);
            // falls through to print "Try again"
         default:
            printf ("Try again\n");
      }
      free (line);
   }
}

The function reads a line of text and then uses the sscanf function (from the stdio library) to extract an integer from it. Reading an entire line is essential to allow the function to recover in case the user makes a typing mistake; otherwise, the bytes after the occurrence of the mistake would remain in the input buffer and mess up the subsequent input operations.

The function sscanf is like fscanf, except that it operates on a string rather than a file. The function receives a string and attempts to extract from it the objects specified by the format given in the second argument. The function returns the number of objects that it managed to successfully extract from the string.
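
To see how the return value of sscanf separates the cases, consider the format " %d %c" used in getInteger. For the string "42" the function returns 1 (only the integer was extracted); for "42 x" it returns 2 (an integer and an extra byte); for "abc" it returns 0; for an empty string it returns EOF. The following sketch packages this test in a non-interactive function. The name parseInteger is hypothetical and does not belong to the simpio library.

```c
#include <stdio.h>

// Hypothetical helper: returns 1 if the string s contains
// an integer and nothing else besides white-space, and
// returns 0 otherwise. In the first case, the integer is
// stored in *value. In the second case, the content of
// *value is not meaningful.
int parseInteger (const char *s, int *value)
{
   char extra;
   // sscanf returns 1 only if %d succeeded and %c found
   // no non-white-space byte after the integer
   return sscanf (s, " %d %c", value, &extra) == 1;
}
```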

Exercises 2

  1. Study the functions in the simpio library by Eric Roberts.
  2. Study the documentation of the function sscanf in the stdio standard library.