Unicode defines 1,114,112 (0x110000) code points (think of them as characters). UTF-32 is one way to transform a code point (i.e. a number) into a byte sequence: every code point is stored as a single 32-bit value. It's the simplest UTF encoding. The other commonly used UTF formats are UTF-8 and UTF-16.
Note that each UTF format can encode/transform all code points. They just provide different representations.
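To make the "different representations" point concrete, here is a minimal sketch that encodes the same code point, U+1D538, in three UTF formats and prints the resulting bytes. The class and helper names (UtfComparison, encode, hex) are made up for illustration. It also assumes the runtime provides a "UTF-32BE" charset, which common JDKs do, although the Java specification does not guarantee it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UtfComparison {
    // Encodes a single code point with the given charset and returns the bytes.
    static byte[] encode(int codePoint, Charset charset) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(charset);
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x1D538; // mathematical double-struck capital A

        // Assumption: "UTF-32BE" is available on this JDK (not a guaranteed charset).
        Charset utf32be = Charset.forName("UTF-32BE");

        System.out.println("UTF-8:    " + hex(encode(codePoint, StandardCharsets.UTF_8)));    // F0 9D 94 B8
        System.out.println("UTF-16BE: " + hex(encode(codePoint, StandardCharsets.UTF_16BE))); // D8 35 DD 38
        System.out.println("UTF-32BE: " + hex(encode(codePoint, utf32be)));                   // 00 01 D5 38
    }
}
```

All three byte sequences decode back to the same code point. Only the UTF-32 form contains the code point's numeric value directly.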
UTF-32 can be serialized in little-endian (UTF-32LE) or big-endian (UTF-32BE) byte order. Little endian is more common.
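Since a UTF-32 code unit is just the code point written as a 32-bit integer, the two byte orders can be sketched with a plain ByteBuffer and no charset support. The class and method names here (Utf32Endian, encode) are hypothetical; the validity check is an addition for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Utf32Endian {
    // Writes one code point as a 4-byte UTF-32 code unit in the given byte order.
    static byte[] encode(int codePoint, ByteOrder order) {
        // Reject values outside 0..0x10FFFF and the surrogate range D800..DFFF,
        // which are not encodable in UTF-32.
        if (!Character.isValidCodePoint(codePoint)
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("Not encodable: " + codePoint);
        }
        return ByteBuffer.allocate(4).order(order).putInt(codePoint).array();
    }

    public static void main(String[] args) {
        int codePoint = 0x1D538;
        byte[] le = encode(codePoint, ByteOrder.LITTLE_ENDIAN); // 38 D5 01 00
        byte[] be = encode(codePoint, ByteOrder.BIG_ENDIAN);    // 00 01 D5 38
        for (byte[] bytes : new byte[][] { le, be }) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02X ", b));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```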
To distinguish these two formats, either specify the byte order explicitly or put a Byte Order Mark (BOM) at the beginning of the string.
The BOM for little endian is FF FE 00 00. For big endian it is 00 00 FE FF.
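A BOM check based on those two byte patterns might look like the sketch below. The names Utf32Bom, detect and decode are invented for this example, and the fallback parameter is an assumption about how a caller would want to handle input without a BOM. Like the decoding example in this answer, it relies on the JDK providing the "UTF-32LE" and "UTF-32BE" charsets.

```java
import java.nio.charset.Charset;

class Utf32Bom {
    // Returns "UTF-32LE" or "UTF-32BE" if data starts with the matching BOM, else null.
    static String detect(byte[] data) {
        if (data.length < 4) {
            return null;
        }
        int b0 = data[0] & 0xFF, b1 = data[1] & 0xFF, b2 = data[2] & 0xFF, b3 = data[3] & 0xFF;
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) {
            return "UTF-32LE";
        }
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) {
            return "UTF-32BE";
        }
        return null;
    }

    // Decodes UTF-32 data, skipping the BOM if present.
    // Without a BOM, the caller-supplied fallback charset name is used.
    static String decode(byte[] data, String fallback) {
        String detected = detect(data);
        int offset = (detected != null) ? 4 : 0;
        Charset charset = Charset.forName(detected != null ? detected : fallback);
        return new String(data, offset, data.length - offset, charset);
    }

    public static void main(String[] args) {
        // Little-endian BOM followed by U+1D538.
        byte[] data = {
            (byte) 0xFF, (byte) 0xFE, 0x00, 0x00,
            0x38, (byte) 0xD5, 0x01, 0x00
        };
        String s = decode(data, "UTF-32BE");
        System.out.printf("%s U+%X%n", detect(data), s.codePointAt(0)); // UTF-32LE U+1D538
    }
}
```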
    import java.nio.*;
    import java.nio.charset.*;

    public class UnicodeTest {
        public static void main(String[] args) {
            // Little-endian encoding
            CharsetDecoder decoder = Charset.forName("UTF-32LE").newDecoder();

            // Code point: 120120/0x1D538 (mathematical double-struck capital A)
            ByteBuffer bytes = ByteBuffer.wrap(new byte[] {
                (byte) 0x38, (byte) 0xD5, (byte) 0x01, (byte) 0x00
            });

            String decoded;
            try {
                decoded = decoder.decode(bytes).toString();
            } catch (CharacterCodingException e) {
                throw new RuntimeException(e);
            }

            // Print the decoded code point in hex: U+1D538
            System.out.printf("U+%X%n", decoded.codePointAt(0));
        }
    }