Introduction

Unicode defines 1,114,112 (0x110000) code points (think of them as characters). UTF-8 is one way to transform a code point (i.e. a number) into a byte sequence. For ASCII-heavy text it is the most compact UTF encoding, but it is also the most complex one. The other commonly used UTF formats are UTF-16 and UTF-32.

Note that each UTF format can encode/transform all code points. They just provide different representations.
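
To illustrate that the formats differ only in representation, here is a minimal sketch that encodes the same code point (U+20AC, the euro sign) three ways. The class name `UtfFormatsDemo` and the helper `toHex` are made up for this example. It also assumes the JVM ships a "UTF-32BE" charset, which is common but not guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UtfFormatsDemo {
  // Formats a byte array as space-separated upper-case hex, e.g. "E2 82 AC".
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String euro = "\u20AC"; // U+20AC (euro sign)

    System.out.println("UTF-8:    " + toHex(euro.getBytes(StandardCharsets.UTF_8)));
    System.out.println("UTF-16BE: " + toHex(euro.getBytes(StandardCharsets.UTF_16BE)));
    // ASSUMPTION: "UTF-32BE" is available; it is not one of the charsets every JVM must support.
    System.out.println("UTF-32BE: " + toHex(euro.getBytes(Charset.forName("UTF-32BE"))));
  }
}
```

For this code point UTF-8 needs three bytes and UTF-16 only two, which is why "most compact" depends on the text being encoded.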

Notes

UTF-8 is a variable-length encoding. Each code point is represented by one to four 8-bit values (bytes).
Besides standard-conforming UTF-8, there are two (unofficial) variants:
- CESU-8: Represents code points above U+FFFF as UTF-16 surrogate pairs and encodes each surrogate of the pair separately with UTF-8 (instead of encoding the whole code point directly).
- Modified UTF-8: Like CESU-8, but encodes the NUL character (code point: 0) with C0 80 instead of 00. Used by Java and Tcl.
Note that these variants shouldn't be used to exchange data.
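
The difference between standard and Modified UTF-8 can be observed in Java itself. The sketch below relies on `DataOutputStream.writeUTF`, which is documented to write Modified UTF-8 preceded by a 2-byte length. The class name `ModifiedUtf8Demo` and its helpers are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
  // Encodes a string with writeUTF and strips the 2-byte length prefix it emits first.
  static byte[] modifiedUtf8(String s) {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeUTF(s);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    byte[] all = buffer.toByteArray();
    return Arrays.copyOfRange(all, 2, all.length);
  }

  // Formats a byte array as space-separated upper-case hex.
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String nul = "\u0000";
    String doubleStruckA = new String(Character.toChars(0x1D538));

    System.out.println(toHex(nul.getBytes(StandardCharsets.UTF_8)));           // 00
    System.out.println(toHex(modifiedUtf8(nul)));                              // C0 80
    System.out.println(toHex(doubleStruckA.getBytes(StandardCharsets.UTF_8))); // F0 9D 94 B8
    System.out.println(toHex(modifiedUtf8(doubleStruckA)));                    // ED A0 B5 ED B4 B8
  }
}
```

The six-byte result for U+1D538 is the surrogate pair D835 DD38, with each surrogate encoded as its own three-byte sequence.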
UTF-8 has a Byte Order Mark (BOM). If used, it needs to be placed at the beginning of the string.

The BOM is EF BB BF.

Note that UTF-8 is independent of endianness (i.e. little endian or big endian), so the BOM only serves as a signature that marks the data as UTF-8.
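
Checking for the BOM described above is a three-byte comparison. The following is a minimal sketch; the class name `Utf8Bom` and both method names are made up for this example. The second helper skips the BOM manually because Java's standard UTF-8 decoder keeps it, as the character U+FEFF, rather than stripping it.

```java
import java.nio.charset.StandardCharsets;

class Utf8Bom {
  // Returns true if the data starts with the UTF-8 BOM (EF BB BF).
  static boolean hasBom(byte[] data) {
    return data.length >= 3
        && (data[0] & 0xFF) == 0xEF
        && (data[1] & 0xFF) == 0xBB
        && (data[2] & 0xFF) == 0xBF;
  }

  // Decodes UTF-8 data to a String, skipping a leading BOM if present.
  static String decodeSkippingBom(byte[] data) {
    int offset = hasBom(data) ? 3 : 0;
    return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] withBom = { (byte)0xEF, (byte)0xBB, (byte)0xBF, 0x24 };
    System.out.println(hasBom(withBom));            // true
    System.out.println(decodeSkippingBom(withBom)); // $
  }
}
```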

Encoding

The design of UTF‑8 is most easily seen in the following table. The xs are replaced by the bits of the code point:

Bits	Last code point	Byte 1	Byte 2	Byte 3	Byte 4
7	U+007F (127)	`0xxxxxxx`
11	U+07FF (2,047)	`110xxxxx`	`10xxxxxx`
16	U+FFFF (65,535)	`1110xxxx`	`10xxxxxx`	`10xxxxxx`
21	U+1FFFFF (2,097,151)	`11110xxx`	`10xxxxxx`	`10xxxxxx`	`10xxxxxx`

Explanation:

One-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0.
Code points larger than 127 are represented by multi-byte sequences, composed of a leading byte and one or more continuation bytes. The leading byte has two or more high-order 1s, while continuation bytes all have 10 in the high-order position.
The number of high-order 1s in the leading byte of a multi-byte sequence indicates the number of bytes in the sequence (including the leading byte), so that the length of the sequence can be determined without examining the continuation bytes.
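
The rules above can be sketched as a small classifier that looks only at the first byte. The class name `Utf8Length` and its methods are made up for this example. It checks the bit pattern only, so it does not reject leading bytes that are invalid for other reasons (e.g. those that could only start an overlong or out-of-range sequence).

```java
class Utf8Length {
  // Returns the total sequence length (1 to 4) announced by the first byte,
  // or -1 if the byte cannot start a sequence.
  static int sequenceLength(byte firstByte) {
    int b = firstByte & 0xFF;
    if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
    return -1;                        // 10xxxxxx (continuation byte) or 11111xxx (unused)
  }

  // Returns true if the byte has the form 10xxxxxx.
  static boolean isContinuation(byte b) {
    return (b & 0xC0) == 0x80;
  }

  public static void main(String[] args) {
    System.out.println(sequenceLength((byte)0x24)); // 1
    System.out.println(sequenceLength((byte)0xE2)); // 3
    System.out.println(isContinuation((byte)0x82)); // true
  }
}
```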

The xs in the table above are filled with the bits of the code point, starting with the lowest-order bits in the rightmost byte. The following table shows some examples:

Character	Code point	Binary code point	Binary UTF-8	Hexadecimal UTF-8
$	`U+0024`	`00100100`	`00100100`	`24`
¢	`U+00A2`	`00000000 10100010`	`11000010 10100010`	`C2 A2`
€	`U+20AC`	`00100000 10101100`	`11100010 10000010 10101100`	`E2 82 AC`
𤭢	`U+24B62`	`00000010 01001011 01100010`	`11110000 10100100 10101101 10100010`	`F0 A4 AD A2`
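
As a worked example of the euro sign row, the sketch below splits the 16 bits of U+20AC into groups of 4, 6 and 6 bits and places them into the pattern 1110xxxx 10xxxxxx 10xxxxxx. The class name `ThreeByteDemo` and the method `encodeThreeBytes` are made up for this example. It handles only the three-byte range and does not exclude surrogates.

```java
class ThreeByteDemo {
  // Encodes a code point in the range U+0800..U+FFFF as three bytes.
  static byte[] encodeThreeBytes(int codePoint) {
    if (codePoint < 0x0800 || codePoint > 0xFFFF) {
      throw new IllegalArgumentException("Not a 3-byte code point: " + codePoint);
    }
    int high4 = codePoint >>> 12;         // top 4 bits    -> 1110xxxx
    int mid6 = (codePoint >>> 6) & 0x3F;  // middle 6 bits -> 10xxxxxx
    int low6 = codePoint & 0x3F;          // low 6 bits    -> 10xxxxxx
    return new byte[] {
      (byte)(0xE0 | high4), (byte)(0x80 | mid6), (byte)(0x80 | low6)
    };
  }

  public static void main(String[] args) {
    // U+20AC = 0010 000010 101100 -> E2 82 AC
    for (byte b : encodeThreeBytes(0x20AC)) {
      System.out.println(String.format("%02X", b & 0xFF));
    }
  }
}
```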

Remarks:

Single bytes, leading bytes, and continuation bytes do not share values. This makes the scheme "self synchronizing", allowing the start of a character to be found by backing up at most three bytes.
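
Self-synchronization can be sketched as follows: starting from an arbitrary index, step back while the current byte is a continuation byte. The class name `Utf8Sync` and the method `findSequenceStart` are made up for this example, and the sketch assumes the input is well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Utf8Sync {
  // Returns the index of the first byte of the sequence that contains data[index].
  static int findSequenceStart(byte[] data, int index) {
    int start = index;
    // Continuation bytes look like 10xxxxxx; step back over at most three of them.
    while (start > 0 && index - start < 3 && (data[start] & 0xC0) == 0x80) {
      start--;
    }
    return start;
  }

  public static void main(String[] args) {
    byte[] data = "$\u20AC".getBytes(StandardCharsets.UTF_8); // 24 E2 82 AC
    System.out.println(findSequenceStart(data, 3)); // 1 (the E2 leading byte)
    System.out.println(findSequenceStart(data, 0)); // 0 (the single byte 24)
  }
}
```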
Although the bit pattern could be extended to longer sequences (the original specification allowed up to 6 bytes), RFC 3629 limits UTF-8 to 4 bytes. More precisely, it restricts UTF-8 to end at U+10FFFF. (A 4-byte UTF-8 sequence could otherwise go as high as U+1FFFFF.)
A single code point could be encoded in multiple ways by padding it with leading zero bits (e.g. representing an ASCII character with a multi-byte sequence). However, only the shortest possible form is considered valid UTF-8; the longer forms are called overlong encodings and must be rejected.
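
The shortest-form rule can be checked against an existing decoder. The sketch below uses Java's `CharsetDecoder` configured to report malformed input. The class name `OverlongDemo` and the method `isValidUtf8` are made up for this example, and the rejection of the two overlong inputs is the behavior expected from the JDK's UTF-8 decoder rather than something this page specifies.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class OverlongDemo {
  // Returns true if the decoder accepts the bytes as UTF-8.
  static boolean isValidUtf8(byte[] data) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(data));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // "$" (U+0024) in its shortest form: 24
    System.out.println(isValidUtf8(new byte[] { 0x24 }));
    // "$" padded into a two-byte sequence (overlong): C0 A4
    System.out.println(isValidUtf8(new byte[] { (byte)0xC0, (byte)0xA4 }));
    // U+00A2 padded into a three-byte sequence (overlong): E0 82 A2
    System.out.println(isValidUtf8(new byte[] { (byte)0xE0, (byte)0x82, (byte)0xA2 }));
  }
}
```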

Java code for converting a code point into UTF-8:

private static final int CONTINUATION_BYTE_MARKER =  0x80; // 10xxxxxx
private static final int SIX_BIT_MASK = 0x3F; // 00111111

// NOTE: Since most programming languages provide their own UTF-8 encoding facilities, this
//   method isn't optimized for speed. Instead, its implementation focuses on being easy
//   to understand.
public static byte[] encodeUTF8(int codePoint) {
  // NOTE: In November 2003 UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to
  //   match the constraints of the UTF-16 character encoding. This removed all 5- and
  //   6-byte sequences. Negative values are not code points either.
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new IllegalArgumentException("Invalid code point: " + codePoint);
  }

  if (codePoint <= 127) {
    // MSB is 0 - single byte
    return new byte[] { (byte)codePoint };
  }

  // multi byte sequence

  byte[] bytes = new byte[4];
  int byteCount = 0;
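  // Largest value that fits into the payload bits of the leading byte. Starts with the
  // 5 payload bits of a 2-byte sequence and shrinks by one bit per extra continuation byte.
  int leadingByteMask = 0x1F; // 00011111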

  while (true) {
    // Extract the first (= low order, right most) 6 bits from the code point and create a
    // continuation byte with them.
    byte curByte = (byte)((codePoint & SIX_BIT_MASK) | CONTINUATION_BYTE_MARKER);
    bytes[byteCount] = curByte;

    // Remove the 6 bits we just encoded.
    // NOTE: Use ">>>" (shift zeros into the left most position)
    codePoint = codePoint >>> 6;
    byteCount++;
    if (codePoint <= leadingByteMask) {
      // Remaining bits fit into the leading byte

      // Calculate most significant bits:
      //  1. A "1" for each byte used (including the leading byte)
      //  2. Followed by a "0"
      int msbs;
      switch (byteCount) { // number of continuation bytes
      case 1:
        msbs = 0xC0; // 110xxxxx
        break;
      case 2:
        msbs = 0xE0; // 1110xxxx
        break;
      case 3:
        msbs = 0xF0; // 11110xxx
        break;

      default:
        // Continuation bytes are limited to 3 (see "invalid code point" exception above).
        throw new IllegalStateException();
      }

      curByte = (byte)(msbs | codePoint);
      bytes[byteCount] = curByte;

      byteCount++;
      break;
    }
    else {
      // We need another continuation byte
      leadingByteMask = leadingByteMask >>> 1;
    }
  }

  // NOTE: Bytes are in reversed order. Make it correct.
  switch (byteCount) {
  case 2:
    return new byte[] { bytes[1], bytes[0] };
  case 3:
    return new byte[] { bytes[2], bytes[1], bytes[0] };
  case 4:
    return new byte[] { bytes[3], bytes[2], bytes[1], bytes[0] };
  default:
    // Byte count is limited to 4 (see "invalid code point" exception above).
    throw new IllegalStateException();
  }
}

Example Code

import java.nio.*;
import java.nio.charset.*;

public class UnicodeTest {
  public static void main(String[] args) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();

    // Code point: 120120 (mathematical double-struck capital A)
    ByteBuffer bytes = ByteBuffer.wrap(new byte[] {
        (byte)0xF0, (byte)0x9D, (byte)0x94, (byte)0xB8
      });
    String decoded;

    try {
      decoded = decoder.decode(bytes).toString();
    }
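    catch (CharacterCodingException e) {
      throw new RuntimeException(e);
    }

    // Prints 120120, the code point that was encoded in the four bytes above.
    System.out.println(decoded.codePointAt(0));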
  }
}