Unicode defines 1,114,112 (0x110000) code points (think of them as characters). UTF-32 is one way to transform a code point (i.e. a number) into a byte sequence: each code point is stored as a single 4-byte integer, which makes it the simplest UTF encoding. The other commonly used UTF formats are UTF-8 and UTF-16.
Note that each UTF format can encode/transform all code points. They just provide different representations.
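To make the "different representations" point concrete, here is a minimal sketch that encodes one code point with three encodings and prints the resulting bytes. The class name `UtfComparison` and the helper `hex` are made up for this illustration, and the sketch assumes the JDK in use ships the `UTF-32BE` charset (OpenJDK does, but it is not among the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class UtfComparison {
    // Formats a byte array as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Code point 0x1D538 (mathematical double-struck capital A)
        String s = new String(Character.toChars(0x1D538));

        // Same code point, three different byte sequences.
        // Assumption: "UTF-32BE" is available in this JDK.
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));       // F0 9D 94 B8
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE)));    // D8 35 DD 38
        System.out.println("UTF-32BE: " + hex(s.getBytes(Charset.forName("UTF-32BE")))); // 00 01 D5 38
    }
}
```

Only the UTF-32 output contains the code point value itself (0x0001D538); UTF-8 and UTF-16 spread its bits over several code units.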
UTF-32 can be serialized in little-endian (UTF-32LE) or big-endian (UTF-32BE) byte order. Little endian is more common.
To distinguish these two formats, either specify the byte order explicitly or put a Byte Order Mark (BOM) at the beginning of the string.
The BOM for little endian is FF FE 00 00. For big endian it is 00 00 FE FF.
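As a rough illustration of how a reader could use those four bytes, the sketch below checks for a BOM by hand and then reads the remaining bytes as 4-byte integers in the detected order. The class `Utf32BomSketch` and method `decodeUtf32` are hypothetical names for this example, not a library API. Falling back to big endian when no BOM is present is an assumption of this sketch; a real application may know the byte order from elsewhere.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, written only to illustrate BOM handling.
class Utf32BomSketch {
    static String decodeUtf32(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data); // big endian by default
        ByteOrder order = ByteOrder.BIG_ENDIAN; // assumption: default when no BOM is found

        if (data.length >= 4) {
            int first = buf.getInt(0);          // first 4 bytes, read as big endian
            if (first == 0x0000FEFF) {          // bytes 00 00 FE FF
                order = ByteOrder.BIG_ENDIAN;
                buf.position(4);                // skip the BOM
            } else if (first == 0xFFFE0000) {   // bytes FF FE 00 00
                order = ByteOrder.LITTLE_ENDIAN;
                buf.position(4);                // skip the BOM
            }
        }

        buf.order(order);
        StringBuilder sb = new StringBuilder();
        while (buf.remaining() >= 4) {
            // Throws IllegalArgumentException if the value is not a valid code point,
            // which is what typically happens when the byte order is guessed wrong.
            sb.appendCodePoint(buf.getInt());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] littleEndian = {
            (byte)0xFF, (byte)0xFE, 0x00, 0x00,   // BOM
            0x38, (byte)0xD5, 0x01, 0x00          // 0x1D538
        };
        byte[] bigEndian = {
            0x00, 0x00, (byte)0xFE, (byte)0xFF,   // BOM
            0x00, 0x01, (byte)0xD5, 0x38          // 0x1D538
        };
        System.out.println(Integer.toHexString(decodeUtf32(littleEndian).codePointAt(0))); // prints 1d538
        System.out.println(Integer.toHexString(decodeUtf32(bigEndian).codePointAt(0)));    // prints 1d538
    }
}
```

Both inputs decode to the same code point; the BOM is the only thing that tells the reader which way to interpret the following bytes.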
import java.nio.*;
import java.nio.charset.*;

public class UnicodeTest {
    public static void main(String[] args) {
        // little endian encoding
        CharsetDecoder decoder = Charset.forName("UTF-32LE").newDecoder();
        // Code point: 120120/0x1D538 (mathematical double-struck capital A)
        ByteBuffer bytes = ByteBuffer.wrap(new byte[] {
            (byte)0x38, (byte)0xD5, (byte)0x01, (byte)0x00
        });
        String decoded;
        try {
            decoded = decoder.decode(bytes).toString();
        }
        catch (CharacterCodingException e) {
            throw new RuntimeException(e);
        }
        // Print the decoded code point in hex to confirm the result
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // prints 1d538
    }
}