Unicode defines 1,114,112 (0x110000) code points (think of them as characters). UTF-8 is one way to transform a code point (i.e. a number) into a byte sequence. It is usually the most compact UTF encoding (in particular for text that is mostly ASCII), but it is also the most complex one. The other commonly used UTF formats are UTF-16 and UTF-32.
Note that each UTF format can encode/transform all code points. They just provide different representations.
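To make "same code points, different representations" concrete, the following sketch encodes a single code point with two charsets the JDK guarantees (`StandardCharsets.UTF_8` and `StandardCharsets.UTF_16BE`) and reports how many bytes each one needs. The class and method names (`EncodingSizes`, `byteCount`) are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingSizes {
    // Number of bytes the given charset needs for a single code point.
    static int byteCount(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        int[] samples = { 0x24, 0xA2, 0x20AC, 0x24B62 };
        for (int cp : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    cp,
                    byteCount(cp, StandardCharsets.UTF_8),
                    byteCount(cp, StandardCharsets.UTF_16BE));
        }
    }
}
```

`UTF_16BE` is used rather than `UTF_16` because the latter prepends a byte order mark when encoding, which would distort the byte counts.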
Besides standard-conforming UTF-8 there are two (unofficial) variants. One of them, Modified UTF-8, encodes the null character (U+0000) as `C0 80` instead of `00`; it is used by Java and Tcl. Note that these variants shouldn't be used to exchange data.
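The `C0 80` encoding of the null character can be observed from Java itself. As a small sketch: `DataOutputStream.writeUTF` is documented to write the modified form (preceded by a two-byte length), while `String.getBytes(StandardCharsets.UTF_8)` produces standard UTF-8. The class and method names below are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NullEncodingDemo {
    // Standard UTF-8 bytes of the string.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes produced by DataOutputStream.writeUTF, without its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        System.out.println(Arrays.toString(standardUtf8(nul)));
        System.out.println(Arrays.toString(modifiedUtf8(nul)));
    }
}
```

Java bytes are signed, so `C0 80` shows up as `[-64, -128]` in the second line of output, while the first line is `[0]`.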
UTF-8 has a Byte Order Mark (BOM). If used, it needs to be placed at the beginning of the string.
The BOM is `EF BB BF`.
Note that UTF-8 is independent of endianness (i.e. little endian or big endian).
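Since the BOM is optional, code that reads UTF-8 often has to check for it and skip it. A minimal sketch of such a check, assuming the data is already available as a byte array (the names `BomSketch`, `startsWithBom` and `stripBom` are hypothetical):

```java
import java.util.Arrays;

final class BomSketch {
    // The UTF-8 byte order mark: EF BB BF
    private static final byte[] UTF8_BOM = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };

    // True if the data begins with the three BOM bytes.
    static boolean startsWithBom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns the data without a leading BOM; data without a BOM is returned as-is.
    static byte[] stripBom(byte[] data) {
        return startsWithBom(data)
                ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
                : data;
    }
}
```

Only the very beginning of the data is inspected, matching the rule that a BOM, if used, must come first.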
The design of UTF‑8 is most easily seen in the following table. The `x`s are replaced by the bits of the code point:
Bits | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|
7 | U+007F (127) | `0xxxxxxx` | | | |
11 | U+07FF (2,047) | `110xxxxx` | `10xxxxxx` | | |
16 | U+FFFF (65,535) | `1110xxxx` | `10xxxxxx` | `10xxxxxx` | |
21 | U+1FFFFF (2,097,151) | `11110xxx` | `10xxxxxx` | `10xxxxxx` | `10xxxxxx` |
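Explanation:
- The leading byte of a multi-byte sequence has two or more high-order `1`s, while continuation bytes all have `10` in the high-order position.
- The number of high-order `1`s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence (including the leading byte), so that the length of the sequence can be determined without examining the continuation bytes.
- The `x`s in the table above are filled with the bits of the code point (beginning at the rightmost byte).
- The four-byte pattern has room for 21 bits (up to U+1FFFFF), but RFC 3629 restricts UTF-8 to code points up to U+10FFFF.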
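The rule about the high-order bits of the leading byte can be sketched in a few lines of Java. This is only an illustration of the bit patterns from the table: it does not reject leading bytes that standard UTF-8 never produces (such as `C0` or `F5`), and the names `LeadByteSketch`, `sequenceLength` and `isContinuation` are made up.

```java
final class LeadByteSketch {
    // Returns the total sequence length (1 to 4) announced by a leading byte,
    // or -1 if the byte is a continuation byte or matches no leading-byte pattern.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;               // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;                         // 10xxxxxx or 11111xxx
    }

    // True for bytes of the form 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }
}
```

Because continuation bytes are recognizable on their own, a reader that starts in the middle of a stream can skip bytes until `isContinuation` returns false and resume at the next leading byte.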
The following table shows some examples:
Character | Code point | Binary code point | Binary UTF-8 | Hexadecimal UTF-8 |
---|---|---|---|---|
$ | U+0024 | `00100100` | `00100100` | `24` |
¢ | U+00A2 | `00000000 10100010` | `11000010 10100010` | `C2 A2` |
€ | U+20AC | `00100000 10101100` | `11100010 10000010 10101100` | `E2 82 AC` |
𤭢 | U+24B62 | `00000010 01001011 01100010` | `11110000 10100100 10101101 10100010` | `F0 A4 AD A2` |
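The hexadecimal column can be cross-checked against the JDK's built-in encoder. The following sketch formats the UTF-8 bytes of a code point as hex; the helper name `utf8Hex` and the class name are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

final class Utf8HexSketch {
    // Formats the UTF-8 encoding of a code point as space-separated hex bytes.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF)); // mask to get an unsigned value
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x24, 0xA2, 0x20AC, 0x24B62 }) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase() + " -> " + utf8Hex(cp));
        }
    }
}
```

For the four characters in the table this yields `24`, `C2 A2`, `E2 82 AC` and `F0 A4 AD A2`.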
Remarks:
Java code for converting a code point into UTF-8:
```java
private static final int CONTINUATION_BYTE_MARKER = 0x80; // 10xxxxxx
private static final int SIX_BIT_MASK = 0x3F;             // 00111111

// NOTE: Since most programming languages provide their own UTF-8 encoding facilities, this
//   method isn't optimized for speed. Instead its implementation focuses on being easy
//   to understand.
public static byte[] encodeUTF8(int codePoint) {
  // NOTE: In November 2003 UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to
  //   match the constraints of the UTF-16 character encoding. This removed all 5- and
  //   6-byte sequences. Negative values are not code points either.
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new IllegalArgumentException("Invalid code point: " + codePoint);
  }

  if (codePoint <= 127) {
    // MSB is 0 - single byte
    return new byte[] { (byte)codePoint };
  }

  // multi byte sequence
  byte[] bytes = new byte[4];
  int byteCount = 0;
  // Largest value that still fits into the payload bits of the leading byte.
  // Starts with the 5 payload bits of a 2-byte sequence.
  int leadingByteMask = 0x1F; // 00011111

  while (true) {
    // Extract the first (= low order, right most) 6 bits from the code point and create a
    // continuation byte with them.
    byte curByte = (byte)((codePoint & SIX_BIT_MASK) | CONTINUATION_BYTE_MARKER);
    bytes[byteCount] = curByte;

    // Remove the 6 bits we just encoded.
    // NOTE: Use ">>>" (shift zeros into the left most position)
    codePoint = codePoint >>> 6;
    byteCount++;

    if (codePoint <= leadingByteMask) {
      // Remaining bits fit into the leading byte
      // Calculate most significant bits:
      //  1. A "1" for each byte used (including the leading byte)
      //  2. Followed by a "0"
      int msbs;
      switch (byteCount) { // number of continuation bytes
        case 1:
          msbs = 0xC0; // 110xxxxx
          break;
        case 2:
          msbs = 0xE0; // 1110xxxx
          break;
        case 3:
          msbs = 0xF0; // 11110xxx
          break;
        default:
          // Continuation bytes are limited to 3 (see "invalid code point" exception above).
          throw new IllegalStateException();
      }

      curByte = (byte)(msbs | codePoint);
      bytes[byteCount] = curByte;
      byteCount++;
      break;
    }
    else {
      // We need another continuation byte, which leaves one payload bit less
      // in the leading byte.
      leadingByteMask = leadingByteMask >>> 1;
    }
  }

  // NOTE: Bytes are in reversed order. Make it correct.
  switch (byteCount) {
    case 2:
      return new byte[] { bytes[1], bytes[0] };
    case 3:
      return new byte[] { bytes[2], bytes[1], bytes[0] };
    case 4:
      return new byte[] { bytes[3], bytes[2], bytes[1], bytes[0] };
    default:
      // Byte count is limited to 4 (see "invalid code point" exception above).
      throw new IllegalStateException();
  }
}
```
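Java code for decoding a UTF-8 byte sequence with the JDK's own decoder:

```java
import java.nio.*;
import java.nio.charset.*;

public class UnicodeTest {
  public static void main(String[] args) {
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

    // Code point: 120120 (mathematical double-struck capital A)
    ByteBuffer bytes = ByteBuffer.wrap(new byte[] { (byte)0xF0, (byte)0x9D, (byte)0x94, (byte)0xB8 });

    String decoded;
    try {
      // Unlike new String(bytes, charset), decode() reports malformed input as an exception.
      decoded = decoder.decode(bytes).toString();
    }
    catch (CharacterCodingException e) {
      throw new RuntimeException(e);
    }

    // Print the decoded code point as a number.
    System.out.println(decoded.codePointAt(0));
  }
}
```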