Unicode and UTF-8

  ! ? @ + - * / = < >
  0 1 2 3 4 5 6 7 8 9
  A E I O U C …
  Á Â Ã É Í Ó Ô Õ Ú Ç …
  ≡ ≠ ≤ ≥
  Γ Δ Π Σ Ω
  ⋮
  

This chapter gives a quick introduction to Unicode, encoding schemes, and UTF-8.  For more on the subject, see the references at the end of this page.

Characters

A character is a typographic symbol used to write text in some language. (This definition is not perfect, but it will suffice.)  Here are some examples of characters:

! " - 9 A B a b ~ £ Á ñ ó Σ − ∞ ≤

The number of characters used by the different languages in the world is huge. Ordinary English uses just 94 characters, but we are exposed to many other languages, sometimes several languages in the same sentence. To this we must add the special characters used by different areas of science.

To begin organizing this Tower of Babel, we must give names to all the characters. The Unicode Consortium, a group of IT companies, assigns numerical names (known as code points) to characters.  There is room for more than 1 million code points, of which about 150 thousand have been assigned to characters so far.  Here is a tiny sample of the list of characters and their numerical names:

Unicode number    character
33 !
34 "
45 -
57 9
65 A
66 B
97 a
98 b
126 ~
163 £
193 Á
241 ñ
243 ó
931 Σ
8722 −
8734 ∞
8804 ≤

In this sample, the numerical names of the characters are written in decimal notation. In general, however, these names are written in hexadecimal notation and preceded by U+:

Unicode character
U+0021 !
U+0022 "
U+002D -
U+0039 9
U+0041 A
U+0042 B
U+0061 a
U+0062 b
U+007E ~
U+00A3 £
U+00C1 Á
U+00F1 ñ
U+00F3 ó
U+03A3 Σ
U+2212 −
U+221E ∞
U+2264 ≤
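
The two tables above show the same numbers in two notations. As a minimal sketch of the conversion (the helper name format_code_point is hypothetical, not part of any library), the following C function writes a Unicode number in the U+ notation:

```c
#include <stdio.h>

// Writes the Unicode number cp in U+ notation (at least 4 hexadecimal
// digits) into buf and returns buf.  The buffer must hold 9 bytes:
// "U+", up to 6 hexadecimal digits, and the terminating '\0'.
// Hypothetical helper, for illustration only.
char *format_code_point(unsigned long cp, char buf[9]) {
    snprintf(buf, 9, "U+%04lX", cp);
    return buf;
}
```

For example, calling the function with the decimal number 8804 produces the string U+2264, and calling it with 33 produces U+0021.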

The complete list of characters and their Unicode numbers can be seen on the Wikipedia page List of Unicode characters or the Wikibooks page Unicode / Character reference.

The set of all the characters on the Unicode list could be called the Unicode alphabet, and we could say that each character of this alphabet is a Unicode character.  (If the aspirations of the Unicode project are justified, then all the characters in the world are Unicode characters.)

ASCII characters

The first 128 characters of the Unicode alphabet are the most important. This set of characters goes from U+0000 to U+007F and is known as the ASCII alphabet. The elements of this alphabet will be called ASCII characters. The ASCII alphabet contains letters, decimal digits, punctuation, and some special characters. The list of the 128 ASCII characters and their Unicode numbers is recorded in the ASCII table.

Unfortunately, the ASCII alphabet is not sufficient to write text in languages like Spanish or French, since it lacks letters with diacritics.

Encoding schemes

How can we store Unicode characters in digital files and in memory? We could represent each character by its Unicode number written in binary notation. But this would require 3 bytes per character, which is very inefficient given that 1 byte is enough for the most common characters. We must, therefore, resort to more complex representations.
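
The figure of 3 bytes can be checked with a little arithmetic. The sketch below assumes a fact not stated above: the largest Unicode number is U+10FFFF. The function names bits_needed and bytes_needed are hypothetical.

```c
// Number of bits needed to write n in binary notation
// (by convention, 1 bit for n == 0).
int bits_needed(unsigned long n) {
    int bits = 1;
    while (n >>= 1)
        bits++;
    return bits;
}

// Number of whole bytes needed to hold n, that is,
// bits_needed(n) divided by 8 and rounded up.
int bytes_needed(unsigned long n) {
    return (bits_needed(n) + 7) / 8;
}
```

With n equal to 0x10FFFF, the first function returns 21 and the second returns 3. With n equal to 0x7F (the last ASCII character), the second function returns 1.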

An encoding scheme (or character encoding) is a table that associates a sequence of bytes with each Unicode number, and therefore with each Unicode character.  The sequence of bytes associated with a character is the code of the character.  The next sections examine two encodings: ASCII and UTF-8.

ASCII encoding

The ASCII code is very simple: the Unicode number of each character is written in binary notation.  This code is used only for the ASCII alphabet.  Since this alphabet has only 128 characters, the ASCII code uses only 1 byte per character and the first bit of this byte is 0.  Here is a sample of the code table:

Unicode       ASCII hexa
U+0021 ! 00100001 x21
U+0022 " 00100010 x22
U+002D - 00101101 x2D
U+0039 9 00111001 x39
U+0041 A 01000001 x41
U+0042 B 01000010 x42
U+0061 a 01100001 x61
U+0062 b 01100010 x62
U+007E ~ 01111110 x7E

The last column shows the ASCII code written in hexadecimal notation.
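
The binary column of such a table can be produced mechanically. Here is a minimal sketch in C (the function name byte_to_binary is hypothetical) that writes the 8 bits of a byte, most significant bit first:

```c
// Writes the 8 bits of byte b, most significant bit first, as a string
// of '0' and '1' characters in out and returns out.
// The array out must have room for 9 chars (8 digits and '\0').
char *byte_to_binary(unsigned char b, char out[9]) {
    for (int i = 0; i < 8; i++)
        out[i] = (b & (0x80 >> i)) ? '1' : '0';
    out[8] = '\0';
    return out;
}
```

For example, the byte 'A' (value x41) yields the string 01000001 and the byte '~' (value x7E) yields 01111110. For every ASCII character the first digit of the result is 0.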

(Why not use all 8 bits of a byte?  We could then encode an additional 128 characters.  The ISO-LATIN-1 code does exactly this, but it is rarely used nowadays. The ISO-LATIN-1 set includes the characters  ª ± º ¼ ½ ¾ À Á Â Ã Ç È É Ê Ì Í Î Ò Ó Ô × Ù Ú Û à á â ã ç è é ê ì í î ò ó ô õ ÷ ù ú û  among others. The numerical names of these characters are the same in the ISO-LATIN-1 table and the Unicode table.)

UTF-8 encoding

If we were to use a fixed number of bytes per character, we would need 3 bytes. The solution is to resort to a multibyte code that employs a variable number of bytes per character: some characters use 1 byte, others use 2 bytes, and so on.

The most widely used multibyte code is known as UTF-8.  It associates a sequence of 1 to 4 bytes (8 to 32 bits) with each Unicode character.  The first 128 characters use the good old ASCII code of 1 byte per character.  The remaining characters have a longer code.  Here is a tiny sample:

Unicode       UTF-8 code hexa
U+0021 ! 00100001 x21
U+0022 " 00100010 x22
U+002D - 00101101 x2D
U+0039 9 00111001 x39
U+0041 A 01000001 x41
U+0042 B 01000010 x42
U+0061 a 01100001 x61
U+0062 b 01100010 x62
U+007E ~ 01111110 x7E
U+00A3 £ 11000010 10100011 xC2A3
U+00C1 Á 11000011 10000001 xC381
U+00F1 ñ 11000011 10110001 xC3B1
U+00F3 ó 11000011 10110011 xC3B3
U+03A3 Σ 11001110 10100011 xCEA3
U+2212 − 11100010 10001000 10010010 xE28892
U+221E ∞ 11100010 10001000 10011110 xE2889E
U+2264 ≤ 11100010 10001001 10100100 xE289A4

(The last column shows the UTF-8 code in hexadecimal notation.)  The list of UTF-8 codes of all the Unicode characters can be seen on the page UTF-8 encoding table and Unicode characters or on the Wikibooks page Unicode / Character reference.  For example, the character chain  i ≤ 99  is represented in UTF-8 by the sequence of bytes

x69 x20 xE2 x89 xA4 x20 x39 x39
i   ␣   ≤           ␣   9   9

where  ␣  indicates the space character.
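
The UTF-8 code of a character can be computed from its Unicode number by distributing the bits of the number over 1 to 4 bytes. The following C function is a minimal sketch of that computation, not a definitive implementation; the name utf8_encode is hypothetical. It assumes two facts not stated above: the largest Unicode number is U+10FFFF, and the numbers U+D800 to U+DFFF (surrogates) have no UTF-8 code.

```c
#include <stddef.h>

// Stores the UTF-8 code of the Unicode number cp in out[0..3] and
// returns the number of bytes used (1 to 4).  Returns 0 if cp has no
// UTF-8 code (cp above U+10FFFF or in the range U+D800..U+DFFF).
size_t utf8_encode(unsigned long cp, unsigned char out[4]) {
    if (cp < 0x80) {                      // 0xxxxxxx
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                     // 110xxxxx 10xxxxxx
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, with cp equal to 0x2264 (the character ≤) the function stores the bytes xE2 x89 xA4 and returns 3, and with cp equal to 0x69 (the character i) it stores the single byte x69 and returns 1.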

Decoding.  Since the number of bytes per character is not fixed, decoding a sequence of bytes is not trivial. How do we know where the code of one character ends and the code of the next character begins?  The UTF-8 encoding scheme was designed so that the first bits of the code of a character indicate how many bytes the code occupies.  If the first bit is 0, and therefore the value of the first byte is smaller than 128, then this is the only byte of the character.  If the value of the first byte belongs to the interval 192 .. 223 (first bits 110) then the code of the character has two bytes.  If it belongs to the interval 224 .. 239 (first bits 1110) the code has three bytes, and if it belongs to the interval 240 .. 247 (first bits 11110) the code has four bytes.  Every byte after the first one begins with the bits 10 and therefore has a value in the interval 128 .. 191.
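
The decoding rule can be sketched in C as follows. The function names utf8_length and utf8_decode are hypothetical. This is a simplified sketch: it checks the shape of the bytes but does not reject every sequence that the UTF-8 standard forbids (for instance, first bytes 192 and 193, which never occur in well-formed UTF-8, are accepted here because they fall in the interval 192 .. 223).

```c
#include <stddef.h>

// Returns the number of bytes (1 to 4) announced by the first byte of
// a UTF-8 code, or 0 if the byte cannot start a character.
size_t utf8_length(unsigned char first) {
    if (first < 0x80) return 1;                    // 0xxxxxxx,   0..127
    if (first >= 0xC0 && first <= 0xDF) return 2;  // 110xxxxx, 192..223
    if (first >= 0xE0 && first <= 0xEF) return 3;  // 1110xxxx, 224..239
    if (first >= 0xF0 && first <= 0xF7) return 4;  // 11110xxx, 240..247
    return 0;              // 10xxxxxx (continuation byte) or invalid
}

// Decodes the character whose code starts at s[0], where s is a
// '\0'-terminated sequence of bytes.  Stores the Unicode number in *cp
// and returns the number of bytes consumed, or returns 0 (leaving *cp
// unchanged) if the bytes are malformed.
size_t utf8_decode(const unsigned char *s, unsigned long *cp) {
    size_t len = utf8_length(s[0]);
    if (len == 0) return 0;
    if (len == 1) {
        *cp = s[0];
        return 1;
    }
    // The first byte contributes its last 7 - len bits.
    unsigned long value = s[0] & (0xFF >> (len + 1));
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)   // not of the form 10xxxxxx
            return 0;                // (this also catches an early '\0')
        value = (value << 6) | (s[i] & 0x3F);
    }
    *cp = value;
    return len;
}
```

For example, applied to the bytes xE2 x89 xA4, the function utf8_decode returns 3 and stores the number 0x2264 (the character ≤) in *cp. Calling it repeatedly, each time advancing s by the returned value, walks through a string one character at a time.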

Assume UTF-8.  The C programming language does not prescribe any specific encoding scheme, but the most widely used encoding is UTF-8.  The present site assumes that all the text files, be they programs or data, use the UTF-8 code.  (But in many examples, only the ASCII subset of UTF-8 is used.)

Exercises 1

  1. Consider the following sequence of bytes, written in decimal notation. What character chain does the sequence of bytes represent in UTF-8 encoding?
    118 91 110 93 32 61 32 226 136 158
    
  2. The following sequence of bytes is written in hexadecimal notation. What character chain do these bytes represent in UTF-8 encoding?
    x41 x74 x65 x6E x63 x69 xC3 xB3 x6E x21
    
  3. Write a function to receive a file containing text in UTF-8 encoding and decide whether each byte of the file represents one character (that is, whether the alphabet of the file is ASCII).
  4. Write the sequences of bytes that represent each of the following character chains in UTF-8 encoding:
    • ASCII string
    • Atención!
    • piña colada
    • $50 ≈ £42
    • π = 3.14±0.01
    • ⌊9.9⌋ = 9
    • v[n] = ∞

    (Consult the page UTF-8 encoding table and Unicode characters. You may have to use the go to other block button.)

How is my file encoded?

There is no way of knowing, with certainty, which encoding a given text file uses. The author of the file must announce, outside the file, the encoding scheme they used.

There are utilities (such as file) that scan a file and try to guess, with some degree of confidence, its encoding scheme.
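
To give an idea of how such a guess might be made, here is a sketch of one plausible heuristic: check whether the bytes form a plausible UTF-8 sequence. This is not a description of how the file utility actually works; the names guess_encoding, NOT_UTF8, PURE_ASCII and PROBABLY_UTF8 are hypothetical. The sketch relies on the fact that the byte values 192, 193 and 245 .. 255 never occur in well-formed UTF-8.

```c
#include <stddef.h>

enum guess { NOT_UTF8, PURE_ASCII, PROBABLY_UTF8 };

// Examines the n bytes buf[0..n-1] and guesses their encoding.
// A text in ISO-LATIN-1 with accented letters will almost always
// fail the UTF-8 shape test and produce NOT_UTF8.
enum guess guess_encoding(const unsigned char *buf, size_t n) {
    int multibyte_seen = 0;
    size_t i = 0;
    while (i < n) {
        unsigned char b = buf[i];
        size_t extra;              // continuation bytes expected after b
        if (b < 0x80) extra = 0;
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;
        else return NOT_UTF8;      // byte cannot start a UTF-8 code
        if (extra > n - 1 - i)
            return NOT_UTF8;       // code truncated at the end of buf
        for (size_t k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return NOT_UTF8;   // not of the form 10xxxxxx
        if (extra > 0) multibyte_seen = 1;
        i += extra + 1;
    }
    return multibyte_seen ? PROBABLY_UTF8 : PURE_ASCII;
}
```

The answer PROBABLY_UTF8 is only a guess: a file in some other encoding could, by coincidence, pass the same test.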

If you know the encoding scheme used by your file, you can use the iconv filter to change the encoding. You can, for example, convert an ISO-LATIN-1 file into an equivalent UTF-8 file.
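
For the particular case of ISO-LATIN-1, the conversion is simple enough to sketch by hand, because the numerical names of the ISO-LATIN-1 characters coincide with their Unicode numbers. The C function below is only an illustration of the idea, not a replacement for iconv; the name latin1_to_utf8 is hypothetical.

```c
#include <stddef.h>

// Converts the n ISO-LATIN-1 bytes src[0..n-1] into UTF-8, storing the
// result in dst, and returns the number of bytes stored.  The array
// dst must have room for 2*n bytes, since each ISO-LATIN-1 character
// becomes at most 2 bytes in UTF-8.
size_t latin1_to_utf8(const unsigned char *src, size_t n, unsigned char *dst) {
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = src[i];
        if (b < 0x80) {
            dst[j++] = b;                                   // ASCII: unchanged
        } else {
            dst[j++] = (unsigned char) (0xC0 | (b >> 6));   // 110000xx
            dst[j++] = (unsigned char) (0x80 | (b & 0x3F)); // 10xxxxxx
        }
    }
    return j;
}
```

For example, the ISO-LATIN-1 byte xF1 (the character ñ) is converted into the two bytes xC3 xB1.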

Exercises 2

  1. The utilities od and hexdump print the sequence of (numerical values of) the bytes of a file.  Use one of these utilities to study the contents of a file and guess the encoding it uses.
  2. The function isalpha in the ctype library decides whether a given ASCII character is a letter. Write an extension of isalpha that will recognize letters with diacritics.  Your function must receive a string that contains the UTF-8 code of a character and decide whether the code represents a valid letter (with or without a diacritic mark on it).