12.9 Unicode Character Codes

You can specify Unicode characters using escape sequences called universal character names that start with ‘\u’ or ‘\U’. They are valid in C in individual character constants, inside string constants (see String Constants), and even in identifiers. Each of these escape sequences includes a hexadecimal Unicode character code, also called a code point in Unicode terminology.

Use the ‘\u’ escape sequence with a 16-bit hexadecimal Unicode character code, written as exactly four hexadecimal digits. If the character’s numeric code is too big for 16 bits, use the ‘\U’ escape sequence with a 32-bit hexadecimal Unicode character code, written as exactly eight hexadecimal digits. Here are some examples:

\u6C34      /* 16-bit code (Chinese for “water”), UTF-16 */
\U0010ABCD  /* 32-bit code, UTF-32 */

One way to use these is in UTF-8 string constants (see UTF-8 String Constants). For instance, here we use two of them, each preceded by a space.

u8"fóó \u6C34 \U0010ABCD"
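To see what the compiler stores for such a constant, the following sketch inspects the bytes of a few UTF-8 string constants. The helper names utf8_byte_count, utf8_byte_at and check_utf8_examples are hypothetical, made up for this illustration. The byte values in the comments follow from the UTF-8 encoding rules.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the number of bytes in the UTF-8
   string S, not counting the terminating null.  */
size_t
utf8_byte_count (const void *s)
{
  return strlen ((const char *) s);
}

/* Hypothetical helper: return byte number I of S as an
   unsigned value.  */
unsigned int
utf8_byte_at (const void *s, size_t i)
{
  return ((const unsigned char *) s)[i];
}

/* Hypothetical self-check; call it from main to run it.  */
void
check_utf8_examples (void)
{
  /* U+6C34 takes three bytes in UTF-8: E6 B0 B4.  */
  assert (utf8_byte_count (u8"\u6C34") == 3);
  assert (utf8_byte_at (u8"\u6C34", 0) == 0xE6);
  assert (utf8_byte_at (u8"\u6C34", 1) == 0xB0);
  assert (utf8_byte_at (u8"\u6C34", 2) == 0xB4);

  /* U+10ABCD takes four bytes in UTF-8: F4 8A AF 8D.  */
  assert (utf8_byte_count (u8"\U0010ABCD") == 4);
  assert (utf8_byte_at (u8"\U0010ABCD", 0) == 0xF4);

  /* Space, three bytes, space, four bytes.  */
  assert (utf8_byte_count (u8" \u6C34 \U0010ABCD") == 9);
}
```

The helpers take a const void * so that the same code works whether the compiler gives a UTF-8 string constant elements of type char or of an unsigned character type.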

You can also use them in wide character constants (see Wide Character Constants), like this:

u'\u6C34'      /* 16-bit code (water) */
U'\U0010ABCD'  /* 32-bit code */

and in wide string constants (see Wide String Constants), like this:

u"\u6C34\u706B"  /* 16-bit codes (water, fire) */
U"\U0010ABCD"    /* 32-bit code */
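As a rough check of how these constants are stored, the sketch below counts the elements of the two wide strings shown above. It assumes the types char16_t and char32_t from the standard header uchar.h. The helper names length16, length32 and check_wide_examples are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: count the elements of the char16_t
   string S, not counting the terminating null.  */
size_t
length16 (const char16_t *s)
{
  size_t n = 0;
  while (s[n] != 0)
    n++;
  return n;
}

/* Hypothetical helper: likewise for a char32_t string.  */
size_t
length32 (const char32_t *s)
{
  size_t n = 0;
  while (s[n] != 0)
    n++;
  return n;
}

/* Hypothetical self-check; call it from main to run it.  */
void
check_wide_examples (void)
{
  /* Each 16-bit code occupies one char16_t element.  */
  assert (length16 (u"\u6C34\u706B") == 2);
  assert (u"\u6C34\u706B"[0] == 0x6C34);
  assert (u"\u6C34\u706B"[1] == 0x706B);

  /* The 32-bit code occupies one char32_t element.  */
  assert (length32 (U"\U0010ABCD") == 1);
  assert (U"\U0010ABCD"[0] == 0x10ABCD);

  /* The character constants have the same numeric values.  */
  assert (u'\u6C34' == 0x6C34);
  assert (U'\U0010ABCD' == 0x10ABCD);
}
```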

And in an identifier:

int foo\u6C34bar = 0;
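An identifier written this way behaves like any other name. The following sketch uses the variable from the example above as a counter. The function name bump_counter is hypothetical, and the code assumes a compiler that accepts universal character names in identifiers.

```c
/* The identifier from the example above, used as a counter.  */
int foo\u6C34bar = 0;

/* Hypothetical helper: increment the counter and return its
   new value.  */
int
bump_counter (void)
{
  foo\u6C34bar++;
  return foo\u6C34bar;
}
```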

Codes in the range of D800 through DFFF are invalid in universal character names. Trying to write them using ‘\u’ causes an error. Unicode calls them “surrogate code points” and uses them in UTF-16 for purposes too specialized to explain here.

Codes less than 00A0 are likewise invalid in universal character names, and likewise cause errors, except for 0024 (‘$’), 0040 (‘@’), and 0060 (‘`’). Character codes that can’t be represented with universal character names can be specified with octal or hexadecimal escape sequences instead (see Character Constants).
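Here is a minimal sketch of that alternative. It assumes an execution character set based on ASCII, where ‘A’ has code 41 hexadecimal (101 octal). The function name check_low_codes is hypothetical.

```c
#include <assert.h>

/* Hypothetical self-check; call it from main to run it.  */
void
check_low_codes (void)
{
  /* The code for 'A' is less than 00A0, so it cannot be written
     as a universal character name.  A hexadecimal or octal
     escape sequence works instead.  */
  assert ('\x41' == 'A');
  assert ('\101' == 'A');

  /* 0024 is one of the three exceptions, so the universal
     character name is accepted here.  */
  assert ('\u0024' == '$');
}
```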