Unicode (or the "Unicode standard") is essentially a big list of characters. The standard assigns a unique number to each character, the so-called code point.
The number of a code point is usually written in hexadecimal notation with at least four digits and prefixed with U+. For example, U+000A (fewer than four significant hex digits, so zero-padded) for code point 10, or U+1D538 (more than four digits) for code point 120,120.
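The notation described above can be sketched in a few lines of Python. The helper name `format_code_point` is made up for this illustration and is not part of any library; the range check assumes the maximum code point 0x10FFFF defined by the Unicode standard.

```python
def format_code_point(cp: int) -> str:
    """Return the U+ notation: uppercase hex, zero-padded to at least four digits."""
    # Assumption: valid code points lie in the range 0..0x10FFFF.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp}")
    # "04X" pads to four digits but never truncates longer values.
    return f"U+{cp:04X}"


print(format_code_point(10))        # U+000A
print(format_code_point(120120))    # U+1D538
print(format_code_point(ord("A")))  # U+0041
```

Going the other way, `int("1D538", 16)` yields 120120 again.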
The characters are grouped by blocks (like "Basic Latin", "Greek and Coptic"). Each block belongs to a plane.
Historically there was only one plane: the Basic Multilingual Plane (BMP, plane 0). The code points 0x0000 to 0xFFFF belong to this plane. Later, additional planes (1 to 16) were added. These planes are called the Supplementary Planes (or sometimes "Astral Planes"). Each plane again contains 0x10000 (65,536) code points, numbered 0x0000 to 0xFFFF within the plane.
The plane of a code point is identified by the hex digits above the lowest four, i.e. by the code point's third byte. So, for example, all code points in plane 3 have the form U+3xxxx (third byte 0x03), while all code points in plane 5 have the form U+5xxxx (third byte 0x05).
For plane 0 (the BMP) the plane number is omitted (i.e. U+xxxx).
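Since each plane spans 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with `plane_of` and `offset_in_plane` as hypothetical helper names chosen for this example:

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0..16) of a code point."""
    # Everything above the lowest 16 bits is the plane number.
    return cp >> 16


def offset_in_plane(cp: int) -> int:
    """Return the position of a code point within its plane (0x0000..0xFFFF)."""
    return cp & 0xFFFF


print(plane_of(0x0041))               # 0  (BMP)
print(plane_of(0x1D538))              # 1
print(plane_of(0x10FFFF))             # 16 (last plane)
print(hex(offset_in_plane(0x1D538)))  # 0xd538
```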
The highest possible code point (as defined by the Unicode standard) is 0x10FFFF (1,114,111). However, not all numbers are in use. Some facts:
The highest usable code point in the BMP is U+FFFD, which leaves 65,534 code points (including U+0000). The last two code points of a plane, U+FFFE and U+FFFF in the BMP, are never used in any plane. So with 17 planes, up to 1,114,078 code points can be defined.
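The arithmetic behind these numbers can be checked quickly. This sketch only subtracts the two unused code points per plane that the text mentions; any other reserved ranges are deliberately ignored here, and the constant names are made up for the example.

```python
PLANES = 17                   # plane 0 (BMP) plus planes 1..16
CODE_POINTS_PER_PLANE = 0x10000
UNUSED_PER_PLANE = 2          # the code points ending in FFFE and FFFF

total = PLANES * CODE_POINTS_PER_PLANE
usable = PLANES * (CODE_POINTS_PER_PLANE - UNUSED_PER_PLANE)

print(total)       # 1114112
print(total - 1)   # 1114111, i.e. 0x10FFFF, the highest code point
print(usable)      # 1114078
```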
The terms UTF-8 and UTF-16 (among others) are often used together with Unicode. How do they relate? They are encodings for code points, i.e. each defines an algorithm for transforming a code point into a series of bytes. Note that each of these encodings can encode every Unicode code point; they differ only in how they do it.
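To make the difference concrete, here is a small sketch that encodes the same code point with three encodings using Python's built-in codecs. The helper name `encode_all` is hypothetical; the big-endian codec variants are chosen only so that no byte order mark is prepended.

```python
def encode_all(cp: int) -> dict[str, str]:
    """Encode one code point with several encodings; return the bytes as hex."""
    ch = chr(cp)
    # The "-be" variants fix the byte order and emit no byte order mark.
    return {
        name: ch.encode(name).hex()
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }


result = encode_all(0x1D538)
print(result["utf-8"])      # f09d94b8  (four bytes)
print(result["utf-16-be"])  # d835dd38  (two 16-bit units, a surrogate pair)
print(result["utf-32-be"])  # 0001d538  (one 32-bit unit)

# Decoding any of the byte sequences gives back the same code point.
assert ord(bytes.fromhex(result["utf-8"]).decode("utf-8")) == 0x1D538
```

The same code point yields three different byte sequences, but each one decodes back to U+1D538.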