Introduction

Unicode defines 1,114,112 (0x110000) code points, which can loosely be thought of as characters. UTF-16 is one way to transform a code point (i.e. a number) into a byte sequence. It evolved from UCS-2, the original 16-bit encoding of Unicode. The other commonly used UTF formats are UTF-8 and UTF-32.

Note that each UTF format can encode all code points. The formats differ only in the byte sequences they produce.
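
As a rough illustration of this point, the sketch below encodes a single code point with two of the charsets that every Java platform is required to provide. The class name EncodingComparison and the helper toHex are made up for this example. The code point U+1D538 (120120) is the one used in the example code at the end of this page.

```java
import java.nio.charset.StandardCharsets;

final class EncodingComparison {
  // Hypothetical helper: formats bytes as space-separated upper-case hex.
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("%02X", b & 0xFF));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Build a String that holds the single code point U+1D538.
    String s = new String(Character.toChars(0x1D538));

    // Same code point, two different byte representations.
    System.out.println("UTF-8:    " + toHex(s.getBytes(StandardCharsets.UTF_8)));    // F0 9D 94 B8
    System.out.println("UTF-16BE: " + toHex(s.getBytes(StandardCharsets.UTF_16BE))); // D8 35 DD 38
  }
}
```

Both byte sequences decode back to the same code point. Only the representation differs.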

Notes

Surrogate Pairs

Surrogate Pairs are used to encode characters from the Supplementary Planes in UTF-16.

The pair always consists of two 16-bit values (hence the name "pair"). The first element is called the lead surrogate (also known as the "high surrogate"). The second element is called the trail surrogate (also known as the "low surrogate").

Note: The surrogate pair is not created by simply splitting the bits of the code point into two 16-bit halves. Instead, both elements are taken from the reserved range 0xD800..0xDFFF, so that any 16-bit value can be identified as a character from the BMP, a lead surrogate, or a trail surrogate.
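
To make the benefit of the reserved ranges concrete, here is a minimal sketch that classifies 16-bit values by range alone and uses that to count code points. The names CodeUnits, Kind, classify and countCodePoints are invented for this illustration, and the counting assumes well-formed input (every lead surrogate is followed by a trail surrogate).

```java
final class CodeUnits {
  enum Kind { BMP, LEAD, TRAIL }

  // Classifies a single 16-bit value purely by its numeric range.
  static Kind classify(char unit) {
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      return Kind.LEAD;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return Kind.TRAIL;
    }
    return Kind.BMP;
  }

  // Counts code points: a trail surrogate only completes the code point
  // started by the preceding lead surrogate, so it is not counted again.
  static int countCodePoints(char[] units) {
    int count = 0;
    for (char unit : units) {
      if (classify(unit) != Kind.TRAIL) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // 'A' from the BMP, followed by the surrogate pair for U+1D538.
    char[] units = { 'A', '\uD835', '\uDD38' };
    for (char unit : units) {
      System.out.println(String.format("0x%04X -> %s", (int) unit, classify(unit)));
    }
    System.out.println(countCodePoints(units)); // 2
  }
}
```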

Java code for converting a code point into a surrogate pair:

// Only for code points > 0xFFFF
public static char[] toSurrogatePair(int codePoint) {
  // See: http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

  // 1. 0x10000 is subtracted from the code point, leaving a 20 bit number in the range 0..0xFFFFF.
  codePoint -= 0x10000;
  // 2. The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give the first
  //    code unit or lead surrogate, which will be in the range 0xD800..0xDBFF
  int leadSurrogate = (codePoint >> 10) + 0xD800;
  // 3. The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give the second code
  //    unit or trail surrogate, which will be in the range 0xDC00..0xDFFF
  int trailSurrogate = (codePoint & 0x3FF) + 0xDC00;

  return new char[] { (char)leadSurrogate, (char)trailSurrogate };
}

public static boolean isLeadSurrogate(char ch) {
  return '\uD800' <= ch && ch <= '\uDBFF';
}

public static boolean isTrailSurrogate(char ch) {
  return '\uDC00' <= ch && ch <= '\uDFFF';
}
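
The reverse direction follows from undoing the three steps in toSurrogatePair. The sketch below is one possible way to write it. The class name SurrogateDecode is made up for this example, and the method assumes that its arguments really are a lead and a trail surrogate. It also cross-checks the result against the JDK's Character.toCodePoint.

```java
final class SurrogateDecode {
  // Assumes lead is in 0xD800..0xDBFF and trail is in 0xDC00..0xDFFF.
  static int toCodePoint(char lead, char trail) {
    int topTenBits = lead - 0xD800;
    int lowTenBits = trail - 0xDC00;
    // Reassemble the 20 bit number and add back the 0x10000 offset.
    return ((topTenBits << 10) | lowTenBits) + 0x10000;
  }

  public static void main(String[] args) {
    int codePoint = toCodePoint('\uD835', '\uDD38');
    System.out.println(Integer.toHexString(codePoint)); // 1d538

    // The JDK ships an equivalent method.
    System.out.println(codePoint == Character.toCodePoint('\uD835', '\uDD38')); // true
  }
}
```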

Example Code

import java.nio.*;
import java.nio.charset.*;

public class UnicodeTest {
  public static void main(String[] args) {
    // little endian encoding
    CharsetDecoder decoder = Charset.forName("UTF-16LE").newDecoder();

    // Code point: 120120 (U+1D538, mathematical double-struck capital A),
    // i.e. the surrogate pair 0xD835 0xDD38 with each 16-bit value stored low byte first
    ByteBuffer bytes = ByteBuffer.wrap(new byte[] {
        (byte)0x35, (byte)0xD8, (byte)0x38, (byte)0xDD
      });
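    String decoded;

    try {
      decoded = decoder.decode(bytes).toString();
    }
    catch (CharacterCodingException e) {
      throw new RuntimeException(e);
    }

    // Print the decoded code point in hex (1d538)
    System.out.println(Integer.toHexString(decoded.codePointAt(0)));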
  }
}
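
The catch block in the example matters because a decoder created with newDecoder() reports malformed input by default instead of silently replacing it. The following sketch shows this for a trail surrogate that has no preceding lead surrogate. The class name MalformedInputDemo and the helper isWellFormedUtf16Le are invented for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

final class MalformedInputDemo {
  // Returns true if the bytes decode cleanly as UTF-16LE,
  // false if the decoder reports an error.
  static boolean isWellFormedUtf16Le(byte[] bytes) {
    CharsetDecoder decoder = StandardCharsets.UTF_16LE.newDecoder();
    try {
      decoder.decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Complete surrogate pair for U+1D538 in little endian byte order.
    byte[] completePair = { (byte)0x35, (byte)0xD8, (byte)0x38, (byte)0xDD };
    // Only the trail surrogate 0xDD38, without its lead surrogate.
    byte[] loneTrail = { (byte)0x38, (byte)0xDD };

    System.out.println(isWellFormedUtf16Le(completePair)); // true
    System.out.println(isWellFormedUtf16Le(loneTrail));    // false
  }
}
```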