How To Teach Unicode To Your Grandma?

Topics

Brief introduction to UTF-8.

Implementation of UTF-8 (RFC 3629).

Brief Introduction To UTF-8

Quiz

30 seconds for each question.

Winner gets a prize.

1. What is the most widely used character encoding?

2. ກ້ວຍມີຫົກກີບດອກເອີ້ນວ່າ?

DecodeError

Your brains are facing EnglishDecodeError.

You have assumed that this talk will be only in English even though I haven't explicitly mentioned it.

Programs in the same situation raise UnicodeError.
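A minimal sketch of the program-side failure, assuming Python 3 and assuming the quiz text is pushed through the ASCII codec (the codec choice and the try_ascii helper are illustrative, not from the talk):

```python
question = "ກ້ວຍມີຫົກກີບດອກເອີ້ນວ່າ?"

def try_ascii(text):
    """Return the ASCII bytes for text, or the error raised while encoding."""
    try:
        return text.encode("ascii")
    except UnicodeError as exc:  # UnicodeEncodeError is a subclass of UnicodeError
        return exc

result = try_ascii(question)
print(type(result).__name__)
```

The English-only "decoder" fails on Lao text in the same way the audience did.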

Human Encoding/Decoding

What is this?

book.jpg

Object - Names

Object Language Name
\_/ English Book
\_/ Hindi किताब
\_/ Lao ຫນັງສື

The same object might have different names in different languages.

Two persons can communicate if they can speak a common language.

Otherwise a translator is needed.

Computer Encoding/Decoding

Computer hardware knows only 1s & 0s (bytes).

Computers can communicate only in 1s & 0s.

Humans use text (str).

So, we need character encodings for text/byte conversion.

ASCII

ASCII encodes 128 characters into seven-bit integers.

binary code hex dec character
0000 1010 0A 10 \n
0100 0001 41 65 A
0110 0001 61 97 a
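The rows above can be reproduced with built-ins. A small sketch, where ascii_row is a hypothetical helper name:

```python
def ascii_row(ch):
    """Return (binary, hex, decimal) for a single ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not an ASCII character")
    return format(code, "08b"), format(code, "02X"), code

for ch in ["\n", "A", "a"]:
    binary, hexa, dec = ascii_row(ch)
    print(binary, hexa, dec, repr(ch))
```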

More encodings

128 characters are not sufficient.

ISO Latin 1 is ASCII extended with 96 more symbols.

Windows added 27 more symbols to produce CP1252.

1 byte is not enough for the entire world.
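A short sketch of how the same bytes read differently under two single-byte encodings, and how both run out of room. The sample bytes are chosen for illustration only:

```python
raw = bytes([0x41, 0xE9, 0x80])

as_latin1 = raw.decode("latin-1")  # 0x80 is a C1 control character here
as_cp1252 = raw.decode("cp1252")   # 0x80 is the euro sign here

def fits(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(repr(as_latin1), repr(as_cp1252))
print(fits("ຫນັງສື", "cp1252"))
```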

The Need For Unicode

ISO/IEC 10646 is a standard for UCS.

UCS contains 128,000+ characters & is regularly updated.

Unicode is a superset of UCS.

Unicode defines additional properties of characters.

UCS & Unicode remain in sync and preserve backward compatibility (except for Unicode 1.1).

Unicode

Each character has a code point

In [9]: codepoint('a')
Out[9]: 'U+0061'

Each character has an unambiguous name

In [10]: name('a')
Out[10]: 'LATIN SMALL LETTER A'
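The codepoint and name helpers in the sessions above are not defined on the slides. One plausible definition, assuming they wrap ord and unicodedata.name from the standard library:

```python
import unicodedata

def codepoint(char):
    """Format a character's code point in the usual U+XXXX notation."""
    return "U+{:04X}".format(ord(char))

def name(char):
    """Look up the character's official Unicode name."""
    return unicodedata.name(char)

print(codepoint("a"), name("a"))
```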

Unicode Transformation Formats

Encoding formats are required to convert Unicode from/to bytes.

Several encoding formats exist for Unicode: UTF-8, UCS-2, UTF-16, UCS-4 and UTF-32.
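A quick sketch of one code point under several formats, using Python's built-in codecs. Python ships no codecs named UCS-2 or UCS-4, so only the UTF variants are shown, with explicit byte order:

```python
char = "\u03b2"  # GREEK SMALL LETTER BETA, U+03B2

encodings = ["utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]
encoded = {enc: char.encode(enc) for enc in encodings}

for enc, data in encoded.items():
    print(enc, data.hex())
```

Same character, different bytes in each format.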

What encoding system?

bytes encoding system character
1100 1110 1011 0010 utf-8 β
1100 1110 1011 0010 utf-16
1100 1110 1011 0010 utf-16be

There is no way to know. You have to be told.

You don't know which language a person speaks until you are told.
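The reverse direction, as a sketch: the two octets from the table decoded under three assumed encodings. Explicit -le/-be codec names are used because Python's plain "utf-16" codec without a BOM falls back to the platform's byte order:

```python
data = bytes([0b11001110, 0b10110010])  # CE B2

decoded = {enc: data.decode(enc) for enc in ("utf-8", "utf-16-le", "utf-16-be")}

for enc, text in decoded.items():
    print(enc, "U+{:04X}".format(ord(text)))
```

All three decodes succeed and give three different characters, which is why the encoding has to be stated out of band.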

Implementation of UTF-8 (RFC 3629).

Encoding

Variable-width encoding: 1 - 4 bytes

Range U+0000 - U+10FFFF

3 types of bytes

  • Single byte
  • Leading byte
  • Continuation byte

    Multi-byte character = 1 leading byte + 1 or more continuation bytes

Single Byte Characters

Char. range      UTF-8 octet sequence   Bits
(hexadecimal)    (binary)               (dec)
0000 - 007F      0xxxxxxx               7

  • Highest bit is set to 0.
  • US-ASCII characters are valid UTF-8 characters.
  • These byte values never appear inside multi-byte characters.
  • The octet values C0, C1, F5 - FF never appear in valid UTF-8.

Multi Byte Characters

Char. number range     UTF-8 octet sequence                  Bits
(hexadecimal)          (binary)                              (dec)
0000 0080-0000 07FF    110xxxxx 10xxxxxx                     11
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx            16
0001 0000-0010 FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx   21

The leading byte of an n-byte sequence has its n higher-order bits set to 1, followed by a bit set to 0.
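The bit patterns above are enough to classify any octet on sight. A minimal sketch, where classify and sequence_length are hypothetical helper names:

```python
def classify(byte):
    """Classify one octet by its high-order bits."""
    if byte in (0xC0, 0xC1) or byte >= 0xF5:
        return "invalid"       # never appears in valid UTF-8
    if byte < 0x80:
        return "single"        # 0xxxxxxx
    if byte < 0xC0:
        return "continuation"  # 10xxxxxx
    return "leading"           # 110xxxxx, 1110xxxx or 11110xxx

def sequence_length(lead):
    """Number of octets in the sequence that starts with this octet."""
    kind = classify(lead)
    if kind == "single":
        return 1
    if kind != "leading":
        raise ValueError("octet cannot start a sequence")
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

print([classify(b) for b in "aβ".encode("utf-8")])
```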

How to encode

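Calculate the number of bits/octets required from the binary value of the code point.

Prepare the higher-order marker bits of each octet.

Fill the x slots from the right, with the lowest-order bit of the code point in the last slot.

Fill any remaining empty slots with zeros.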
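The steps above can be sketched as a hand-rolled encoder. This is an illustration only (utf8_encode is a hypothetical name); real code should call str.encode("utf-8"). Surrogates D800 - DFFF are rejected, since RFC 3629 prohibits encoding them:

```python
def utf8_encode(cp):
    """Encode one code point into UTF-8 octets, following the steps above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    # Step 1: how many octets are needed?
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        count, lead = 2, 0b11000000
    elif cp <= 0xFFFF:
        count, lead = 3, 0b11100000
    else:
        count, lead = 4, 0b11110000
    # Steps 2-4: fill continuation octets from the right, 6 bits at a time.
    octets = []
    for _ in range(count - 1):
        octets.append(0b10000000 | (cp & 0b00111111))
        cp >>= 6
    # Whatever is left goes into the leading octet; unused slots stay zero.
    octets.append(lead | cp)
    return bytes(reversed(octets))

print(utf8_encode(0x03B2).hex())
```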

Demo

Byte Order Mark (U+FEFF)

  1. At the beginning of stream

    Treated as a signature.

    Identifies the serialization order of the octets (endianness).

  2. Anywhere else

    Normal "ZERO WIDTH NO-BREAK SPACE" character.

UTF-8 BOM

For UTF-8, the BOM will always appear as the octet sequence EF BB BF.

Usage of a BOM with UTF-8 is discouraged by the Unicode Consortium.

Programs (like Excel) which don't use UTF-8 as the default encoding might cause problems.
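A brief sketch of how the UTF-8 BOM shows up in Python, using the standard "utf-8-sig" codec (which writes the signature on encode and strips it on decode) and the codecs.BOM_UTF8 constant:

```python
import codecs

text = "किताब"

with_bom = text.encode("utf-8-sig")  # EF BB BF + UTF-8 octets
without_bom = text.encode("utf-8")

print(with_bom[:3].hex())
# The plain "utf-8" codec keeps the BOM as a U+FEFF character.
print(repr(with_bom.decode("utf-8")[0]))
```

Decoding with "utf-8-sig" is a tolerant choice on input: it accepts data with or without the signature.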

Security Concerns

IDN Homograph Attack

  • With homographs (different characters that look alike), hackers may deceive users about which remote system they are connecting to.
  • http://ebаy.com/ and http://ebay.com/ look alike but they connect to different systems.

In [5]: x = 'http://ebаy.com/'

In [6]: y = 'http://ebay.com/'

In [7]: x == y
Out[7]: False

In [22]: name('a')
Out[22]: 'LATIN SMALL LETTER A'

In [23]: name('а')
Out[23]: 'CYRILLIC SMALL LETTER A'

Punycode - RFC 3492
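A sketch of how Punycode exposes the difference, assuming Python's built-in "idna" codec (which implements the older IDNA 2003 rules). The Cyrillic letter is written as an escape so it can't be mistaken for the Latin one:

```python
fake_host = "eb\u0430y.com"  # second letter is CYRILLIC SMALL LETTER A
real_host = "ebay.com"

fake_ascii = fake_host.encode("idna")  # non-ASCII label gets an "xn--" form
real_ascii = real_host.encode("idna")  # pure ASCII passes through unchanged

print(fake_ascii, real_ascii)
```

The two names look alike on screen, but their ASCII forms on the wire are clearly different.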

Directory Traversal Attack

  • Hackers can exploit an incautious UTF-8 parser by feeding it illegal UTF-8 sequences.
  • 0010 1111 | 0010 1110 | 0010 1110 | 0010 1111

    URI containing octet sequence 2F 2E 2E 2F shouldn't be permitted.

  • 0010 1111 | 1100 0000 1010 1110 | 0010 1110 | 0010 1111

    The illegal octet sequence 2F C0 AE 2E 2F might be permitted, because C0 AE is an overlong encoding of 2E that a careless parser decodes to the same ".".
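A sketch of the difference between a careless and a strict decoder. naive_decode_2byte is a deliberately unsafe, hypothetical helper; strict_decode relies on Python's built-in UTF-8 codec, which rejects C0 as a start byte:

```python
def naive_decode_2byte(lead, cont):
    """Decode a 2-octet sequence WITHOUT rejecting overlong forms (unsafe)."""
    return chr(((lead & 0b00011111) << 6) | (cont & 0b00111111))

def strict_decode(data):
    """Return the decoded text, or None if the octets are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

suspicious = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])

print(repr(naive_decode_2byte(0xC0, 0xAE)))
print(strict_decode(suspicious))
```

A filter that checks the raw octets for 2F 2E 2E 2F and then hands them to the careless decoder would miss the traversal.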

Salient Features Of UTF-8

  • Backward compatibility with ASCII.
  • Clear distinction between single-byte and multi-byte characters.
  • Clear indication of byte sequence length.
  • Prefix property & Self-synchronization.
  • Endian independent.
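Self-synchronization in a few lines: starting from an arbitrary offset, skipping continuation octets (10xxxxxx) always lands on a character boundary. next_boundary is a hypothetical helper name:

```python
def next_boundary(data, index):
    """Return the first position >= index that starts a character."""
    while index < len(data) and (data[index] & 0b11000000) == 0b10000000:
        index += 1
    return index

data = "aβກ".encode("utf-8")  # 61 | CE B2 | E0 BA 81

start = next_boundary(data, 4)  # offset 4 is in the middle of the last character
print(start, data[next_boundary(data, 2):].decode("utf-8"))
```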

Resources

Questions?

Feedback: http://tiny.cc/unic

Thanks to @jaseemabid @captn3m0 @kracetheking