UTF-8 was designed to be backward compatible with ASCII. UTF-8 is a variable-width encoding: a single character is represented using one to four bytes.

  • The most significant bit of a single-byte character is always 0. For multi-byte characters, the MSB of every byte is always 1, so these bytes can be recognized and skipped by ASCII-only readers. This allows backward compatibility with ASCII.
  • The leading bits of the first byte are 110 for two-byte sequences, 1110 for three-byte sequences, and so on.
  • The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.
  • A UTF-8 stream contains neither the byte 0xFE nor 0xFF. This ensures that a UTF-8 stream is never mistaken for a UTF-16 stream starting with U+FEFF (the byte order mark).

The first 128 characters (US-ASCII, U+0000 - U+007F) need one byte. The next 1,920 characters (U+0080 - U+07FF) need two bytes. Three bytes are needed for the rest of the BMP (U+0800 - U+FFFF), which holds virtually all characters in common use, including most CJK (Chinese, Japanese, Korean) characters. Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.
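A quick way to see these widths is to ask a language runtime for the encoded bytes. A minimal Python sketch (the sample characters are arbitrary; bytes.hex(" ") needs Python 3.8+):

    # Sketch: UTF-8 width grows with the code point
    for ch in ["A", "é", "€", "\U0001D11E"]:   # U+0041, U+00E9, U+20AC, U+1D11E
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    # U+0041 -> 1 byte(s): 41
    # U+00E9 -> 2 byte(s): c3 a9
    # U+20AC -> 3 byte(s): e2 82 ac
    # U+1D11E -> 4 byte(s): f0 9d 84 9e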

Code point - written as U+0065, or \u0065 as a string escape in many languages (this one is the letter “e”).

Byte order mark (BOM) - needed because of endianness when storing multi-byte data. XML avoids this by defaulting to UTF-8.

Unicode defines six different BOMs:

BOM                            Encoding   Endianness
0x2B 0x2F 0x76 0x38 0x2D  (5)  UTF-7      endianless
0xEF 0xBB 0xBF            (3)  UTF-8      endianless
0xFF 0xFE                 (2)  UTF-16-LE  little endian
0xFE 0xFF                 (2)  UTF-16-BE  big endian
0xFF 0xFE 0x00 0x00       (4)  UTF-32-LE  little endian
0x00 0x00 0xFE 0xFF       (4)  UTF-32-BE  big endian
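A decoder can sniff these signatures before reading the file. A minimal Python sketch (UTF-7 omitted, since the codecs module has no constant for it; note the ordering, because the UTF-32-LE BOM begins with the same two bytes as UTF-16-LE):

    import codecs

    # Sketch: detect an encoding from its BOM; longer BOMs must be tested first,
    # or UTF-32-LE (FF FE 00 00) would be misread as UTF-16-LE (FF FE)
    BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8,     "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def sniff_bom(raw):
        for bom, name in BOMS:
            if raw.startswith(bom):
                return name
        return None  # no BOM: could be UTF-8, ASCII, or a legacy codepage

    print(sniff_bom(b"\xff\xfeH\x00i\x00"))  # utf-16-le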
 

Non-BMP - Emoticons, rare CJK (Chinese/Japanese/Korean) characters, and some mathematical characters are not part of the Basic Multilingual Plane. Unicode (not UTF-16 itself) divides the character space into multiple planes.


UTF-16 encodes each character in either 2 or 4 bytes. Characters in the Basic Multilingual Plane (Plane 0, BMP) take two bytes. For non-BMP characters, UTF-16 uses surrogate pairs (a high surrogate followed by a low one), each two bytes, as the sketch below shows.
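A short Python sketch of the surrogate mechanism (U+1D11E is an arbitrary non-BMP example):

    # Sketch: a non-BMP character becomes a surrogate pair in UTF-16
    ch = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF, outside the BMP
    print(ch.encode("utf-16-be").hex(" "))   # d8 34 dd 1e -> high + low surrogate
    print("A".encode("utf-16-be").hex(" "))  # 00 41       -> BMP characters stay 2 bytes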

- UTF-16 uses two or four bytes per character, so encoded files are larger than UTF-8 for ASCII-heavy text.
- UTF-16 is not compatible with ASCII, since it uses a minimum of 2 bytes per character.
- UTF-16 suffers from endianness because its code units are two bytes wide; to compensate, a byte order mark is added.
- UTF-8 is byte-oriented while UTF-16 is not, so UTF-8 recovers from errors better than UTF-16.

+ U+0800 to U+FFFF (most CJK and Indic scripts, e.g. Chinese and Hindi) takes 3 bytes in UTF-8 but just two in UTF-16.
+ It is easier to stick with one encoding internally, so libraries do not have to convert between them. Also, UTF-16 code units carry no marker bits except for surrogates.


The original UTF-8 design (Prosser and Thompson’s scheme):

Bits  Last code point  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
  7   U+007F           0xxxxxxx
 11   U+07FF           110xxxxx  10xxxxxx
 16   U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
 21   U+1FFFFF         11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
 26   U+3FFFFFF        111110xx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx
 31   U+7FFFFFFF       1111110x  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx  10xxxxxx

The salient features of the above scheme are as follows:

  1. Every valid ASCII character is also a valid UTF‑8 encoded Unicode character with the same binary value.  (Thus, valid ASCII text is also valid UTF‑8-encoded Unicode text.)
  2. For every UTF‑8 byte sequence corresponding to a single Unicode character, the first byte unambiguously indicates the length of the sequence in bytes.
  3. All continuation bytes (byte nos. 2 – 6 in the table above) have 10 as their two most-significant bits (bits 7 – 6); in contrast, the first byte never has 10 as its two most-significant bits.  As a result, it is immediately obvious whether any given byte anywhere in a (valid) UTF‑8 stream represents the first byte of a byte sequence corresponding to a single character, or a continuation byte of such a byte sequence.
  4. As a consequence of no. 3 above, starting with any arbitrary byte anywhere in a (valid) UTF‑8 stream, it is necessary to back up by only at most five bytes in order to get to the beginning of the byte sequence corresponding to a single character (three bytes in actual UTF‑8 as explained in the next section). If it is not possible to back up, or a byte is missing because of e.g. a communication failure, one single character can be discarded, and the next character be correctly read.
  5. Starting with the second row in the table above (two bytes), every additional byte extends the maximum number of bits by five (six additional bits from the additional continuation byte, minus one bit lost in the first byte).
  6. Prosser’s and Thompson’s scheme was sufficiently general to be extended beyond 6-byte sequences (however, this would have allowed FE or FF bytes to occur in valid UTF-8 text — see under Advantages in section "Compared to single byte encodings" below — and indefinite extension would lose the desirable feature that the length of a sequence can be determined from the start byte only).
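As a sanity check on the table, here is a Python sketch of the modern one-to-four-byte encoder; it assumes a valid Unicode scalar value and does no error handling:

    # Sketch: UTF-8 encoding by hand, following the bit patterns in the table
    def utf8_encode(cp):
        if cp < 0x80:                        # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                       # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:                     # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        return bytes([0xF0 | (cp >> 18),     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

    assert utf8_encode(0x20AC) == "€".encode("utf-8")  # e2 82 ac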

UCS-2 / UTF-16
Little-endian UCS-2, or UTF-16-LE, is the native format on Windows.

Hello-little-endian:

FF FE  48 00 65 00 6C 00 6C 00 6F 00
header H     e     l     l     o


Save it again as Unicode Big Endian, and you get:

Hello-big-endian:

FE FF  00 48 00 65 00 6C 00 6C 00 6F
header    H     e     l     l     o


Unfortunately, things are not that simple. The BOM is actually a valid Unicode character (ZERO WIDTH NO-BREAK SPACE) – what if someone sent a file without a header, and that character was actually part of the file? This is an open issue in Unicode. The suggestion is to avoid U+FEFF except as a header and to use alternative characters instead (U+2060 WORD JOINER is the designated equivalent).
UCS-2 stores data as flat 16-bit units. UTF-16 allows 20 extra bits split between two 16-bit units, known as a surrogate pair. Neither half of a surrogate pair is a valid character by itself, but together they encode one valid code point.
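The arithmetic behind a surrogate pair is simple. A sketch, using the pair for U+1D11E:

    # Sketch: recovering a code point from a UTF-16 surrogate pair
    high, low = 0xD834, 0xDD1E  # high surrogates: D800-DBFF, low: DC00-DFFF
    cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    print(hex(cp))              # 0x1d11e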

UTF-8
  • Code points 0 – 007F are stored as regular, single-byte ASCII.
  • Code points 0080 and above are converted to binary and stored (encoded) in a series of bytes.
  • The first “count” byte indicates the number of bytes in the sequence, including the count byte itself. It starts with a run of 1s (one per byte in the sequence) followed by a 0:

    110xxxxx (110 -> 2 bytes in sequence, including the “count” byte)

    1110xxxx (1110 -> 3 bytes in sequence)

    11110xxx (11110 -> 4 bytes in sequence)

  • Bytes starting with 10… are “data” (continuation) bytes and contain information for the codepoint. A 2-byte example looks like this:

    110xxxxx 10xxxxxx

This means there are 2 bytes in the sequence. The x’s hold the binary value of the codepoint, which is squeezed into the remaining bits, padded with leading zeros as needed.
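For instance, U+00A3 (£) fits in the 11 payload bits of the two-byte form. A quick sketch:

    # Sketch: packing U+00A3 (binary 1010 0011) into 110xxxxx 10xxxxxx
    cp = 0x00A3
    byte1 = 0xC0 | (cp >> 6)    # 110 00010 -> 0xC2
    byte2 = 0x80 | (cp & 0x3F)  # 10 100011 -> 0xA3
    assert bytes([byte1, byte2]) == "£".encode("utf-8")  # b'\xc2\xa3'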

Observations about UTF-8

  • No null bytes. All ASCII characters (0-127) are the same. Non-ASCII characters all start with “1” as the highest bit.
  • ASCII text is stored identically and efficiently.
  • Unicode characters start with “1” as the high bit, and can be ignored by ASCII-only programs (however, they may be discarded in some cases! See UTF-7 for more details).
  • There is a time-space tradeoff. There is processing to be done on every Unicode character, but this is a reasonable tradeoff.

Design principle #4

  • UTF-8 addresses the 80% case well (ASCII) while making the other cases possible (Unicode). UCS-2 addresses all cases uniformly, but is inefficient in the 80% case. On the other hand, UCS-2 is less processing-intensive than UTF-8, which requires bit manipulation on every non-ASCII character.
  • Why does XML store data in UTF-8 instead of UCS-2? Is space or processing power more important when reading XML documents?
  • Why does Windows XP store strings as UCS-2 natively? Is space or processing power more important for the OS internals?

In any case, UTF-8 still needs a header to indicate how the text was encoded. Otherwise, it could be interpreted as straight ASCII with some codepage to handle values above 127. It still uses the U+FEFF codepoint as a BOM, but the BOM itself is encoded in UTF-8 (clever, eh?).
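Python exposes exactly this behavior as the "utf-8-sig" codec. A small sketch:

    # Sketch: "utf-8-sig" writes the UTF-8 BOM on encode and strips it on decode
    data = "Hello".encode("utf-8-sig")
    print(data.hex(" "))             # ef bb bf 48 65 6c 6c 6f
    print(data.decode("utf-8-sig"))  # Hello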

UTF-8 Example

Hello-UTF-8:

EF BB BF 48 65 6C 6C 6F
header   H  e  l  l  o


Again, the ASCII text is not changed in UTF-8. Feel free to use charmap to copy in some Unicode characters and see how they are stored in UTF-8. Or, you can experiment online.

UTF-7

While UTF-8 is great for ASCII, it still stores Unicode data as non-ASCII bytes with the high bit set. Some email protocols do not allow non-ASCII values, so UTF-8 data would not be sent properly. Systems that can handle data with anything in the high bit are “8-bit clean”; systems that require data to have values 0-127 (like classic SMTP) are not. So how do we send Unicode data through them?

Enter UTF-7. The goal is to encode Unicode data in 7 bits (values 0-127), which is compatible with ASCII. UTF-7 works like this:

  • Codepoints in the ASCII range are stored as ASCII, except for certain symbols (+, -) that have special meaning
  • Codepoints above ASCII are converted to binary, and stored in base64 encoding (stores binary information in ASCII)

How do you know which ASCII letters are real ASCII and which are base64 encoded? Easy: ASCII characters between the special symbols “+” and “-” are treated as base64 encoded.

“-” acts as an escape/terminator character: it ends a base64 run, and “+-” is interpreted as a plain “+” without any special encoding. This is how you store an actual “+” symbol in UTF-7.
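Python still ships a utf-7 codec, which makes both rules easy to verify. A sketch:

    # Sketch: UTF-7 escaping via Python's built-in codec
    print("1+1".encode("utf-7"))      # b'1+-1'   ('+' escaped as '+-')
    print("£1".encode("utf-7"))       # b'+AKM-1' (non-ASCII goes into a +...- block)
    print(b"+AKM-1".decode("utf-7"))  # £1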

UTF-7 Example

Wikipedia has some UTF-7 examples, as Notepad can’t save as UTF-7.

“Hello” is the same as ASCII — we are using all ASCII characters and no special symbols:

Byte:     48 65 6C 6C 6F
Letter:   H  e  l  l  o


“£1” (1 British pound) becomes:

+AKM-1


The sequence “+AKM-” means “AKM” should be decoded from base64 and converted to a codepoint, which maps to 0x00A3, the British pound symbol. The “1” is kept the same, since it is an ASCII character.

UTF-7 is pretty clever, eh? It’s essentially a Unicode-to-ASCII conversion that avoids any bytes with the highest bit set. Most ASCII characters look the same, except for the special characters (- and +) that need to be escaped.


Wrapping it up – what I’ve learned
  • Unicode does not mean 2 bytes. Unicode defines code points that can be stored in many different ways (UCS-2, UTF-8, UTF-7, etc.). Encodings vary in simplicity and efficiency.
  • Unicode has more than 65,536 (16 bits’ worth of) characters. Encodings can represent more characters than that, but the first 65,536 cover most of the common languages.
  • You need to know the encoding to correctly read a file. You can often guess that a file is Unicode based on the Byte Order Mark (BOM), but confusion can still arise unless you know the exact encoding. Even text that looks like ASCII could actually be encoded with UTF-7; you just don’t know.
http://betterexplained.com/articles/unicode/
http://unicodebook.readthedocs.org/en/latest/unicode_encodings.html
http://stackoverflow.com/questions/172133/utf8-vs-utf16-vs-char-vs-what-someone-explain-this-mess-to-me
http://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16