Unicode Transformation Formats (UTF)

Few people know that the very common term "UTF" is an acronym for Unicode Transformation Format. These formats are algorithmic mappings, part of the Unicode standard, that map each code point (the absolute numeric representation of a character) to a unique sequence of bytes representing the given character. Notice that the mappings can be used in both directions, converting back and forth between different representations.
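As a minimal sketch of this two-way mapping, the following Python snippet uses the language's built-in codecs to turn a code point into bytes and back. The euro sign and the variable names are chosen only for illustration; they are not part of the original text.

```python
# Illustrative only: U+20AC (EURO SIGN) is an arbitrary example code point.
code_point = 0x20AC

# Code point -> character -> UTF-8 byte sequence.
char = chr(code_point)
utf8_bytes = char.encode("utf-8")

# UTF-8 byte sequence -> character -> code point (the reverse direction).
round_trip = utf8_bytes.decode("utf-8")

print(utf8_bytes.hex(" "))        # the bytes produced by the mapping
print(hex(ord(round_trip)))       # the code point recovered from those bytes
```

The same round trip works with the "utf-16" and "utf-32" codecs, since each format defines both the encoding and the decoding direction.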

The standard defines three of these formats, named after the size in bits of their basic code unit, which is also the number of bits used to represent the initial part of the set (the initial 128 characters): 8, 16, or 32. It is interesting to notice that all three encoding forms need at most 4 bytes of data for each code point.

  • UTF-8 transforms characters into a variable-length encoding of 1 to 4 bytes. UTF-8 is popular for HTML and similar formats and protocols, because it is quite compact when most characters (like tags in HTML) fall within the ASCII subset.
  • UTF-16 is popular in many operating systems (including Windows) and development environments. It is quite convenient because most characters in common use fit in two bytes (the remaining ones take four bytes, as a surrogate pair), which makes it reasonably compact and fast to process.
  • UTF-32 makes a lot of sense for processing (every code point takes exactly four bytes), but is memory-consuming and has limited use in practice.
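The size differences among the three formats can be checked with a short Python sketch. The helper name encoded_lengths and the sample characters are assumptions made for this example; the explicit little-endian codec names are used so that no byte order mark is added to the output and the counts reflect only the character itself.

```python
def encoded_lengths(char):
    """Return the number of bytes one character needs in each format.

    Hypothetical helper for illustration. The "-le" codec variants are
    used so Python does not prepend a byte order mark to the result.
    """
    return {
        name: len(char.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Sample characters: ASCII letter, euro sign, and an emoji outside
# the range that fits in a single 16-bit unit.
for sample in ("A", "\u20ac", "\U0001F600"):
    print(f"U+{ord(sample):04X}", encoded_lengths(sample))
```

For these samples, the ASCII letter takes 1, 2, and 4 bytes respectively, the euro sign takes 3, 2, and 4, and the emoji takes 4 bytes in every format, which matches the 4-byte upper bound mentioned above.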

A problem relating to the representations built on multi-byte units (UTF-16 and UTF-32) is the byte order: which of the bytes comes first? According to the standard, both orders are allowed, so you can have UTF-16 BE (big-endian) or LE (little-endian), and the same for UTF-32.
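The effect of byte order can be seen in a small Python sketch. The variable names are illustrative, and the letter "A" (U+0041) is an arbitrary sample; the codec names "utf-16-be", "utf-16-le", "utf-32-be", and "utf-32-le" are the ones provided by Python's standard library.

```python
text = "A"  # U+0041, an arbitrary sample character

# The same code point, serialized with each byte order.
be16 = text.encode("utf-16-be")   # most significant byte first
le16 = text.encode("utf-16-le")   # least significant byte first
be32 = text.encode("utf-32-be")
le32 = text.encode("utf-32-le")

for label, data in (("UTF-16 BE", be16), ("UTF-16 LE", le16),
                    ("UTF-32 BE", be32), ("UTF-32 LE", le32)):
    print(label, data.hex(" "))

# Reading the bytes with the wrong order yields a different code point.
misread = be16.decode("utf-16-le")
print(f"U+{ord(misread):04X}")
```

Decoding big-endian bytes as little-endian does not fail here; it silently produces another character (U+4100 instead of U+0041), which is why a reader of the data needs to know which order was used.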
