Unicode standard#

Unicode is a standard that describes the representation and encoding of almost all languages and other characters.

A few facts about Unicode:

  • version 13.0 (March 2020) describes 143 859 codes

  • each code is a number that corresponds to a certain character

  • standard also defines the encoding - the way of representing the symbol code in bytes

Each character in Unicode has a specific code. This is a number that is usually written as follows: U+0073, where 0073 - hexadecimal digits. Apart from the code, each symbol has its own unique name. For example, letter “s” corresponds to code U+0073 and the name “LATIN SMALL LETTER S”.

Examples of codes, names and corresponding symbols:

  • U+0073, “LATIN SMALL LETTER S” - s

  • U+00F6, “LATIN SMALL LETTER O WITH DIAERESIS” - ö

  • U+1F383, “JACK-O-LANTERN” - 🎃

  • U+2615, “HOT BEVERAGE” - ☕

  • U+1f600, “GRINNING FACE” - 😀

Encodings#

Encodings allow to write character code in bytes.

Unicode supports several encodings:

  • UTF-8

  • UTF-16

  • UTF-32

One of the most popular encoding to date is UTF-8. This encoding uses a variable number of bytes to write Unicode characters.

Examples of Unicode characters and their representation in bytes in UTF-8 encoding:

  • H - 48

  • i - 69

  • 🛀 - 01 f6 c0

  • 🚀 - 01 f6 80

  • ☃ - 26 03