Using the Baby X UTF-8 utility functions

The Baby X UTF-8 utility functions are a simple set of functions which make it easy to manipulate UTF-8. All Baby X API functions which accept a char * accept UTF-8. The functions are easy to use and efficient.

What is UTF-8

Unicode is a character set which aims to represent all the characters in common use in human languages: not just English letters, like ASCII, but also Greek, Arabic, Japanese and so on. It was originally thought that 65,536 characters would be enough, and so the original version of Unicode was a 16-bit character set. This proved to be a big mistake: more than 65,536 characters were needed. There was also a massive amount of legacy code which expected strings in ASCII and couldn't work with 16-bit wide characters.

So to address these problems, UTF-8 was invented. It is backwards compatible with ASCII, so every ASCII string is also a valid UTF-8 string. However ASCII is only a seven-bit encoding, whilst bytes have eight bits, so UTF-8 exploits the extra high bit. If the high bit is set, the byte is part of a multi-byte encoding of a Unicode value greater than 127. The higher the value, the more bytes used. There might be one (for ASCII), two, three, or even four UTF-8 bytes used to encode a single Unicode value, or "code point".

You don't have to know the details of how UTF-8 works to use the Baby X utility functions successfully, but for the curious, here are the details:

0xxxxxxx - ASCII character
1xxxxxxx - Part of a multibyte character

Bytes	Bits	Hex Min	Hex Max	Byte Sequence in Binary
1	7	00000000	0000007f	0zzzzzzz
2	11	00000080	000007ff	110zzzzz 10yyyyyy
3	16	00000800	0000ffff	1110zzzz 10yyyyyy 10xxxxxx
4	21	00010000	0010ffff	11110zzz 10yyyyyy 10xxxxxx 10wwwwww

The z's, y's, x's and w's are the bits of the code point itself. In the byte sequences, the lowest-order encoded bits are in the last byte; the high-order bits (the z's) are in the first byte. A code point must always be encoded in the shortest sequence that can hold it, so each row starts where the previous one ends.

An easy way to remember this transformation format is to note that the number of high-order 1's in the first byte is the same as the total number of bytes in the multibyte character, and that every subsequent byte begins with 10.
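To make the bit layout concrete, here is a minimal sketch of an encoder for standard UTF-8 as defined in RFC 3629. The function name `sketch_utf8_encode` is hypothetical and is not part of Baby X; it only illustrates how the bits of a code point are spread across one to four bytes.

```c
/* Hypothetical helper, not part of Baby X: encode one code point as UTF-8.
   Writes up to 4 bytes to out and returns the number of bytes written,
   or 0 if cp is a surrogate or lies beyond U+10FFFF. */
int sketch_utf8_encode(unsigned char *out, unsigned long cp)
{
    if (cp < 0x80) {
        /* 0zzzzzzz */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110zzzzz 10yyyyyy */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;   /* surrogates are not valid code points in UTF-8 */
        /* 1110zzzz 10yyyyyy 10xxxxxx */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        /* 11110zzz 10yyyyyy 10xxxxxx 10wwwwww */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For instance, U+00E9 (e with acute accent) falls in the two-byte range and comes out as the bytes C3 A9, while U+20AC (the euro sign) needs three bytes: E2 82 AC.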

The Baby X UTF-8 utility functions


        int bbx_isutf8z(const char *str);
        int bbx_utf8_skip(const char *utf8);
        int bbx_utf8_getch(const char *utf8);
        int bbx_utf8_putch(char *out, int ch);
        int bbx_utf8_charwidth(int ch);
        int bbx_utf8_Nchars(const char *utf8);

Examples


   #include <stdlib.h>   /* malloc */

   int *utf8toUnicode(const char *utf8)
   {
      int *result;
      const char *ptr;
      int N;
      int i;

      /* are we passed a valid UTF-8 string? */
      if (!bbx_isutf8z(utf8))
        return 0;
      /* count the characters so we know how many to allocate */
      N = bbx_utf8_Nchars(utf8);
      result = malloc((N + 1) * sizeof(int));
      if (!result)
        return 0;

      /* pull out the characters and return as an array of code points */
      ptr = utf8;
      for (i = 0; i < N; i++)
      {
        result[i] = bbx_utf8_getch(ptr);
        ptr += bbx_utf8_skip(ptr);
      }
      /* terminate the array with a zero code point */
      result[N] = 0;

      return result;
   }

This is code to convert from UTF-8 to Unicode code points.
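For readers who want to see what a "get character" and "skip" pair might do internally, the following is a minimal sketch of standard UTF-8 decoding. The names `sketch_utf8_skip` and `sketch_utf8_getch` are hypothetical stand-ins, not Baby X's implementation, and the sketch assumes the input has already been validated.

```c
/* Hypothetical stand-in, not Baby X code: number of bytes in the
   UTF-8 sequence that starts at s, judged from the lead byte alone. */
int sketch_utf8_skip(const char *s)
{
    unsigned char c = (unsigned char) s[0];

    if (c < 0x80)
        return 1;               /* 0xxxxxxx: ASCII */
    if ((c & 0xE0) == 0xC0)
        return 2;               /* 110xxxxx */
    if ((c & 0xF0) == 0xE0)
        return 3;               /* 1110xxxx */
    if ((c & 0xF8) == 0xF0)
        return 4;               /* 11110xxx */
    return 1;                   /* invalid lead byte: step over it */
}

/* Hypothetical stand-in, not Baby X code: decode the code point at s.
   Assumes s points at the start of a valid UTF-8 sequence. */
int sketch_utf8_getch(const char *s)
{
    const unsigned char *u = (const unsigned char *) s;
    int n = sketch_utf8_skip(s);
    int cp;
    int i;

    if (n == 1)
        return u[0];
    /* keep only the payload bits of the lead byte:
       0x1F for 2 bytes, 0x0F for 3 bytes, 0x07 for 4 bytes */
    cp = u[0] & (0x7F >> n);
    /* each continuation byte contributes its low six bits */
    for (i = 1; i < n; i++)
        cp = (cp << 6) | (u[i] & 0x3F);
    return cp;
}
```

Walking a string is then a matter of calling the decoder and advancing the pointer by the skip count until the nul terminator is reached, which is the same shape as the loop in the example above.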


        #include <stdlib.h>   /* malloc */
        #include <stddef.h>   /* wchar_t */

        char *utf16toutf8(const wchar_t *utf16)
        {
           int size = 0;
           int i;
           char *result;
           char *ptr;

           /* get the size of the output buffer;
              each element is treated as one code point */
           for (i = 0; utf16[i]; i++)
              size += bbx_utf8_charwidth(utf16[i]);
           size++;

           result = malloc(size);
           if (!result)
              return 0;
           ptr = result;

           /* now put the characters into the buffer in UTF-8 */
           for (i = 0; utf16[i]; i++)
           {
              ptr += bbx_utf8_putch(ptr, utf16[i]);
           }
           /* add the nul */
           *ptr++ = 0;

           return result;
        }

This is code to convert from a wide character encoding to UTF-8.
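The example above passes each wchar_t element straight to bbx_utf8_putch, which works as long as every element is a complete code point. On platforms where wchar_t is 16 bits wide, code points above U+FFFF are stored as a surrogate pair and would need combining first. A minimal sketch of that step follows; `sketch_utf16_next` is a hypothetical helper, not part of Baby X, and it assumes a nul-terminated array of 16-bit units.

```c
/* Hypothetical helper, not part of Baby X: read one code point from a
   nul-terminated array of UTF-16 units. Stores the number of units
   consumed (1 or 2) in *used and returns the code point. An unpaired
   surrogate is returned unchanged. */
unsigned long sketch_utf16_next(const unsigned short *u, int *used)
{
    unsigned long hi = u[0];

    if (hi >= 0xD800 && hi <= 0xDBFF) {
        /* high surrogate: u[1] is safe to read because the array is
           nul-terminated and u[0] is not the terminator */
        unsigned long lo = u[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *used = 2;
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    *used = 1;
    return hi;
}
```

A converter could call a helper like this in both of its loops, passing the returned code point to bbx_utf8_charwidth and bbx_utf8_putch and advancing the index by the number of units consumed.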