Introduction

Unicode defines 1,114,112 (0x110000) code points (think of them as characters). UTF-8 is one way to transform a code point (i.e. a number) into a byte sequence. For text that consists mostly of ASCII characters it is the most compact UTF encoding, but it is also the most complex one. The other commonly used UTF formats are:

Note that each UTF format can encode/transform all code points. They just provide different representations.
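
To make the "different representations" point concrete, here is a minimal sketch that encodes the same code point with two of the JDK's built-in charsets. The class name EncodingComparison and the helper toHex are made up for this illustration; UTF-16BE is used as the second format only because it is guaranteed to be available on every Java runtime.

```java
import java.nio.charset.StandardCharsets;

class EncodingComparison {
  // Formats bytes as space-separated upper-case hex, e.g. "E2 82 AC".
  // (Hypothetical helper, not part of any library.)
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // U+20AC (euro sign), built from its code point so the source stays ASCII.
    String euro = new String(Character.toChars(0x20AC));

    // Same code point, two different byte sequences.
    System.out.println("UTF-8:    " + toHex(euro.getBytes(StandardCharsets.UTF_8)));    // E2 82 AC
    System.out.println("UTF-16BE: " + toHex(euro.getBytes(StandardCharsets.UTF_16BE))); // 20 AC
  }
}
```

Both byte sequences decode back to the same code point. For this particular character UTF-8 needs one byte more than UTF-16, which shows that which format is smaller depends on the text being encoded.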

Notes

Encoding

The design of UTF‑8 is most easily seen in the following table. The xs are replaced by the bits of the code point:

Bits  Last code point        Byte 1    Byte 2    Byte 3    Byte 4
   7  U+007F (127)           0xxxxxxx
  11  U+07FF (2,047)         110xxxxx  10xxxxxx
  16  U+FFFF (65,535)        1110xxxx  10xxxxxx  10xxxxxx
  21  U+1FFFFF (2,097,151)   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx

The 4-byte row has room for 21 bits, but Unicode ends at U+10FFFF, so values above that are never valid.
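
Read row by row, the table is a lookup from the size of the code point to the number of bytes. A minimal sketch of that lookup follows; the class and method names (Utf8Length, utf8Length) are invented for this example, and the upper bound is the Unicode limit U+10FFFF rather than the 21-bit capacity of the last row.

```java
class Utf8Length {
  // Returns how many bytes UTF-8 uses for the given code point,
  // following the "Last code point" column of the table.
  static int utf8Length(int codePoint) {
    if (codePoint < 0 || codePoint > 0x10FFFF) {
      throw new IllegalArgumentException("Invalid code point: " + codePoint);
    }
    if (codePoint <= 0x7F) {
      return 1; // 7 bits fit into 0xxxxxxx
    }
    if (codePoint <= 0x7FF) {
      return 2; // 11 bits: 110xxxxx 10xxxxxx
    }
    if (codePoint <= 0xFFFF) {
      return 3; // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    }
    return 4;   // up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  }

  public static void main(String[] args) {
    // Values on both sides of each row boundary.
    int[] samples = { 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF };
    for (int cp : samples) {
      System.out.println(String.format("U+%04X needs %d byte(s)", cp, utf8Length(cp)));
    }
  }
}
```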

Explanation:

The xs in the table above are filled with the bits of the code point, starting with the lowest-order bits in the rightmost byte. The following table shows some examples:

Character  Code point  Binary code point           Binary UTF-8                         Hexadecimal UTF-8
$          U+0024      00100100                    00100100                             24
¢          U+00A2      00000000 10100010           11000010 10100010                    C2 A2
€          U+20AC      00100000 10101100           11100010 10000010 10101100           E2 82 AC
𤭢         U+24B62     00000010 01001011 01100010  11110000 10100100 10101101 10100010  F0 A4 AD A2
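
The hexadecimal column can be reproduced with the JDK's own UTF-8 encoder. The following is a small sketch; the class name Utf8Examples and the helper utf8Hex are made up for the illustration, while String.getBytes and Character.toChars are standard library calls.

```java
import java.nio.charset.StandardCharsets;

class Utf8Examples {
  // Encodes one code point with the JDK's UTF-8 encoder and returns the bytes
  // as space-separated upper-case hex. (Hypothetical helper.)
  static String utf8Hex(int codePoint) {
    byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The four code points from the examples table.
    int[] codePoints = { 0x24, 0xA2, 0x20AC, 0x24B62 };
    for (int cp : codePoints) {
      System.out.println(String.format("U+%04X -> %s", cp, utf8Hex(cp)));
    }
  }
}
```

Each printed line should agree with the last column of the corresponding table row.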

Remarks:

Java code for converting a code point into UTF-8:

private static final int CONTINUATION_BYTE_MARKER =  0x80; // 10xxxxxx
private static final int SIX_BIT_MASK = 0x3F; // 00111111

// NOTE: Since most programming languages provide their own UTF-8 encoding facilities, this
//   method isn't optimized for speed. Instead, its implementation focuses on being easy
//   to understand.
public static byte[] encodeUTF8(int codePoint) {
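  if (codePoint < 0) {
    // Negative values are not code points; without this check they would be
    // truncated into a single byte below.
    throw new IllegalArgumentException("Invalid code point: " + codePoint);
  }
  if (codePoint <= 127) {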
    // MSB is 0 - single byte
    return new byte[] { (byte)codePoint };
  }

  // multi byte sequence

  // NOTE: In November 2003 UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to
  //   match the constraints of the UTF-16 character encoding. This removed all 5- and
  //   6-byte sequences.
  if (codePoint > 0x10FFFF) {
    throw new IllegalArgumentException("Invalid code point: " + codePoint);
  }

  byte[] bytes = new byte[4];
  int byteCount = 0;
  int leadingByteMask = 0x1F; // 00011111 - largest value that fits into a 110xxxxx leading byte

  while (true) {
    // Extract the first (= low order, right most) 6 bits from the code point and create a
    // continuation byte with them.
    byte curByte = (byte)((codePoint & SIX_BIT_MASK) | CONTINUATION_BYTE_MARKER);
    bytes[byteCount] = curByte;

    // Remove the 6 bits we just encoded.
    // NOTE: Use ">>>" (shift zeros into the left most position)
    codePoint = codePoint >>> 6;
    byteCount++;
    if (codePoint <= leadingByteMask) {
      // Remaining bits fit into the leading byte

      // Calculate most significant bits:
      //  1. A "1" for each byte used (including the leading byte)
      //  2. Followed by a "0"
      int msbs;
      switch (byteCount) { // number of continuation bytes
      case 1:
        msbs = 0xC0; // 110xxxxx
        break;
      case 2:
        msbs = 0xE0; // 1110xxxx
        break;
      case 3:
        msbs = 0xF0; // 11110xxx
        break;

      default:
        // Continuation bytes are limited to 3 (see "invalid code point" exception above).
        throw new IllegalStateException();
      }

      curByte = (byte)(msbs | codePoint);
      bytes[byteCount] = curByte;

      byteCount++;
      break;
    }
    else {
      // We need another continuation byte
      leadingByteMask = leadingByteMask >>> 1;
    }
  }

  // NOTE: Bytes are in reversed order. Make it correct.
  switch (byteCount) {
  case 2:
    return new byte[] { bytes[1], bytes[0] };
  case 3:
    return new byte[] { bytes[2], bytes[1], bytes[0] };
  case 4:
    return new byte[] { bytes[3], bytes[2], bytes[1], bytes[0] };
  default:
    // Byte count is limited to 4 (see "invalid code point" exception above).
    throw new IllegalStateException();
  }
}
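
As a cross-check for the loop above, the same encoding can be written with one branch per table row and compared against the JDK's encoder. This is only a sketch of an alternative formulation, not a replacement for the method above; the class name DirectUtf8Encoder and its methods are invented for this example. Surrogate code points (U+D800 to U+DFFF) are deliberately not sampled, because the JDK encoder substitutes them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DirectUtf8Encoder {
  // Encodes a code point by picking the table row first, then filling in the bits.
  static byte[] encode(int cp) {
    if (cp < 0 || cp > 0x10FFFF) {
      throw new IllegalArgumentException("Invalid code point: " + cp);
    }
    if (cp <= 0x7F) {
      return new byte[] { (byte) cp };
    }
    if (cp <= 0x7FF) {
      return new byte[] {
          (byte) (0xC0 | (cp >>> 6)),
          (byte) (0x80 | (cp & 0x3F)) };
    }
    if (cp <= 0xFFFF) {
      return new byte[] {
          (byte) (0xE0 | (cp >>> 12)),
          (byte) (0x80 | ((cp >>> 6) & 0x3F)),
          (byte) (0x80 | (cp & 0x3F)) };
    }
    return new byte[] {
        (byte) (0xF0 | (cp >>> 18)),
        (byte) (0x80 | ((cp >>> 12) & 0x3F)),
        (byte) (0x80 | ((cp >>> 6) & 0x3F)),
        (byte) (0x80 | (cp & 0x3F)) };
  }

  // Returns true if encode() agrees with the JDK's UTF-8 encoder for this code point.
  static boolean matchesJdk(int cp) {
    byte[] expected = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(expected, encode(cp));
  }

  public static void main(String[] args) {
    // Values on both sides of each length boundary; these are the cases where
    // an off-by-one in the leading-byte check would show up.
    int[] samples = { 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF };
    for (int cp : samples) {
      System.out.println(String.format("U+%04X matches JDK: %b", cp, matchesJdk(cp)));
    }
  }
}
```

The boundary values U+0800 and U+10000 are worth keeping in any such test, since they are the first code points that no longer fit into the shorter form.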

Example Code

import java.nio.*;
import java.nio.charset.*;

public class UnicodeTest {
  public static void main(String[] args) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();

    // Code point: 120120 (mathematical double-struck capital A)
    ByteBuffer bytes = ByteBuffer.wrap(new byte[] {
        (byte)0xF0, (byte)0x9D, (byte)0x94, (byte)0xB8
      });
    String decoded;

    try {
      decoded = decoder.decode(bytes).toString();
    }
    catch (CharacterCodingException e) {
      throw new RuntimeException(e);
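    }

    // Print the decoded code point in decimal; it should match the value noted above.
    System.out.println(decoded.codePointAt(0));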
  }
}
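
To show what the decoder does with those four bytes, here is a minimal hand-written sketch that reverses the encoding table: it reads the length from the leading byte and then collects six bits from each continuation byte. The class name ManualUtf8Decode and the method decodeFirst are hypothetical, and the sketch assumes well-formed input beyond the basic marker checks; it does not reject overlong forms or surrogates the way a real decoder has to.

```java
class ManualUtf8Decode {
  // Decodes the single UTF-8 sequence that starts at index 0 of the array.
  static int decodeFirst(byte[] bytes) {
    int first = bytes[0] & 0xFF;
    int length;
    int codePoint;

    if ((first & 0x80) == 0x00) {        // 0xxxxxxx
      return first;
    } else if ((first & 0xE0) == 0xC0) { // 110xxxxx
      length = 2;
      codePoint = first & 0x1F;
    } else if ((first & 0xF0) == 0xE0) { // 1110xxxx
      length = 3;
      codePoint = first & 0x0F;
    } else if ((first & 0xF8) == 0xF0) { // 11110xxx
      length = 4;
      codePoint = first & 0x07;
    } else {
      throw new IllegalArgumentException("Invalid leading byte");
    }

    if (bytes.length < length) {
      throw new IllegalArgumentException("Truncated sequence");
    }

    for (int i = 1; i < length; i++) {
      int next = bytes[i] & 0xFF;
      if ((next & 0xC0) != 0x80) {       // must look like 10xxxxxx
        throw new IllegalArgumentException("Expected continuation byte");
      }
      // Make room for six more bits and append the payload of this byte.
      codePoint = (codePoint << 6) | (next & 0x3F);
    }
    return codePoint;
  }

  public static void main(String[] args) {
    byte[] bytes = { (byte) 0xF0, (byte) 0x9D, (byte) 0x94, (byte) 0xB8 };
    System.out.println(decodeFirst(bytes)); // 120120 (U+1D538)
  }
}
```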