Unicode defines 1,114,112 (0x110000) code points (think of them as characters). UTF-16 is one way to transform a code point (i.e. a number) into a byte sequence. It's the oldest UTF encoding. The other, commonly used UTF formats are:
Note that each UTF format can encode/transform all code points. They just provide different representations.
UTF-16 can be encoded with little endian (UTF-16LE) or big endian (UTF-16BE). Little endian is more common.
To be able to distinguish these two format, either specify them explicitly or use a Byte Order Mark (BOM) at the beginning of the string.
The BOM for little endian is FF FE
. For big endian it is FE FF
.
Surrogate Pairs are used to encode characters from the Supplementary Planes in UTF-16.
The pair always consists of two 16 bit values (hence the name "pair"). The first element is called the lead surrogate (former "high surrogate"). The second element is called the trail surrogate (former "low surrogate").
Note: The surrogate pair is not created by simply separating the two bytes that make up the code point. This was done to be able to distinguish whether a 16 bit value is a character from the BMP or a lead surrogate.
Java code for converting a code point into a surrogate pair:
// Only for code points > 0xFFFF public static char[] toSurrogatePair(int codePoint) { // See: http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF // 1. 0x10000 is subtracted from the code point, leaving a 20 bit number in the range 0..0xFFFFF. codePoint -= 0x10000; // 2. The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give the first // code unit or lead surrogate, which will be in the range 0xD800..0xDBFF int leadSurrogate = (codePoint >> 10) + 0xD800; // 3. The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give the second code // unit or trail surrogate, which will be in the range 0xDC00..0xDFFF int trailSurrogate = (codePoint & 0x3FF) + 0xDC00; return new char[] { (char)leadSurrogate, (char)trailSurrogate }; } public static boolean isLeadSurrogate(char ch) { return '\uD800' <= ch && ch <= '\uDBFF'; } public static boolean isTrailSurrogate(char ch) { return '\uDC00' <= ch && ch <= '\uDFFF'; }
import java.nio.*; import javax.nio.charset.*; public class UnicodeTest { public static void main(String[] args) { // little endian encoding CharsetDecoder decoder = Charset.forName("UTF-16LE").newDecoder(); // Code point: 120120 (mathematical double-struck capital A) ByteBuffer bytes = ByteBuffer.wrap(new byte[] { (byte)0x35, (byte)0xD8, (byte)0x38, (byte)0xDD }); String decoded; try { decoded = decoder.decode(bytes).toString(); } catch (CharacterCodingException e) { throw new RuntimeException(e); } } }