Unicode Basics

Unicode (or the "Unicode standard") is basically "just" a big list of characters. The standard assigns a unique number to each character, the so-called code point.

The number of a code point is usually written in hexadecimal notation with at least four digits and prefixed with U+. For example, U+000A (fewer than four significant hex digits, so padded with zeros) for code point 10, or U+1D538 (more than four digits) for 120,120.
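
The notation can be reproduced with a few lines of Python. This is only a minimal sketch; the helper name format_code_point is made up for illustration and is not part of any library.

```python
def format_code_point(cp: int) -> str:
    """Return the U+ notation for a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Upper-case hex, zero-padded to at least four digits.
    return f"U+{cp:04X}"


print(format_code_point(10))      # U+000A
print(format_code_point(120120))  # U+1D538
```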

The characters are grouped by blocks (like "Basic Latin", "Greek and Coptic"). Each block belongs to a plane.

Historically there was only one plane: the Basic Multilingual Plane (BMP, plane 0). The code points 0x0000 to 0xFFFF belong to this plane. Later, additional planes (1 to 16) were added. These planes are called the Supplementary Planes (or sometimes also "Astral Planes"). Each plane again contains 0x10000 (65,536) code points.

The plane of each code point is identified by the hex digits that precede its last four digits. So, for example, all code points in plane 3 have the form U+3xxxx, all code points in plane 5 have the form U+5xxxx, and all code points in plane 16 have the form U+10xxxx. For plane 0 (the BMP) the plane number is omitted (i.e. U+xxxx).
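
Put differently, the plane number is what remains of a code point once its lower 16 bits are dropped. A minimal Python sketch of that arithmetic follows; the function names plane_of and offset_in_plane are assumptions chosen for this example.

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) of a code point (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Dropping the lower 16 bits leaves the plane number.
    return cp >> 16


def offset_in_plane(cp: int) -> int:
    """Return the position of a code point inside its plane (hypothetical helper)."""
    # The lower 16 bits are the last four hex digits.
    return cp & 0xFFFF


print(plane_of(0x000A))    # 0  (BMP)
print(plane_of(0x1D538))   # 1
print(plane_of(0x10FFFF))  # 16
```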

The highest possible code point (as defined by the Unicode standard) is 0x10FFFF (1,114,111). However, not all numbers are in use. Some facts:

The terms UTF-8 and UTF-16 (among others) are often used together with Unicode. How do they relate? They are encodings for code points, i.e. they define an algorithm for transforming a code point into a series of bytes. Note that all of the UTF encodings can encode all Unicode code points. They differ only in the way they do this.
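
To illustrate, the following Python sketch encodes the same code point, U+1D538, with three encodings and prints the resulting bytes in hex. The big-endian variants without a byte order mark are an assumption made here so that the output is the same on every machine.

```python
# One character, looked up by its code point.
ch = chr(0x1D538)

# Big-endian codecs without a byte order mark (chosen for deterministic output).
encodings = ("utf-8", "utf-16-be", "utf-32-be")
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes: {data.hex()}")
# utf-8      4 bytes: f09d94b8
# utf-16-be  4 bytes: d835dd38
# utf-32-be  4 bytes: 0001d538

# Decoding any of the byte sequences yields the same code point again.
decoded = {name: ord(data.decode(name)) for name, data in encoded.items()}
```

The byte sequences differ, but each one decodes back to the same code point.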