From Code Points to Bytes

What makes Unicode confusing and complex is that there are multiple ways to represent the same code point (the numerical value of a Unicode character) in terms of actual storage, that is, physical bytes. If the only way to represent every Unicode code point in a simple and uniform way were to use four bytes per code point, most developers would consider it too expensive in terms of memory and processing.
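To make the cost concrete, here is a minimal Python sketch (Python is used only for illustration, and the sample string is an arbitrary assumption) comparing a fixed four-bytes-per-code-point encoding, UTF-32, with UTF-8 for plain ASCII text:

```python
# Sample text chosen for illustration: 14 code points, all in the ASCII range.
text = "Hello, Unicode"

# "utf-32-le" is used instead of "utf-32" so that no byte-order mark
# is prepended to the encoded output.
fixed = text.encode("utf-32-le")
compact = text.encode("utf-8")

print(len(text), "code points")
print(len(fixed), "bytes as UTF-32")   # four bytes for every code point
print(len(compact), "bytes as UTF-8")  # one byte per ASCII code point
```

For text like this, the fixed-length form takes four times the storage of the compact one.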

One of the options is to use smaller representations with a differing number of bytes (at least 1 or 2, but possibly up to 4) for the various code points of the entire Unicode set. This is called a variable-length representation. These encodings have names you've probably heard of, like UTF-8 and UTF-16, and I'll examine them in technical detail in the following section.
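As a rough illustration of variable-length encoding, the following Python sketch prints how many bytes a few sample characters take in UTF-8 and UTF-16. The helper name `encoded_sizes` and the choice of sample characters are assumptions made for this example only.

```python
# Sample characters written as escapes: U+0041 (A), U+00E9 (e with acute),
# U+20AC (euro sign), U+1F600 (an emoji outside the first 65,536 code points).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_sizes(ch):
    """Return the (UTF-8, UTF-16) byte counts for a single character."""
    # "utf-16-le" avoids the 2-byte byte-order mark that "utf-16" would add.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

The same code point takes between 1 and 4 bytes in UTF-8 and either 2 or 4 bytes in UTF-16, depending on its value.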

There is a common misconception that UTF-16 can map all code points directly with two bytes, but two bytes allow only 65,536 distinct values, and since Unicode defines over 100,000 code points you can easily figure out they won't fit. It is true, however, that at times developers use only a subset of Unicode, to make it fit in a fixed-length representation of 2 bytes per character. In the early days, the encoding limited to this subset was called UCS-2; the subset itself is now usually referred to as the Basic Multilingual Plane (BMP). However, this is only a subset of Unicode (one of its 17 planes).
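The limit of a two-byte representation can be checked with a short Python sketch. The helper names `in_bmp` and `utf16_units` are hypothetical, introduced only for this example; they test whether a character belongs to the BMP and list the 16-bit units UTF-16 uses to encode it.

```python
import struct

def in_bmp(ch):
    """True if the character's code point fits in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF

def utf16_units(ch):
    """Return the 16-bit code units of a character's UTF-16 encoding as hex strings."""
    data = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    count = len(data) // 2
    return [f"{unit:04X}" for unit in struct.unpack(f">{count}H", data)]

for ch in ["\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} in BMP: {in_bmp(ch)}, UTF-16 units: {utf16_units(ch)}")
```

A BMP character such as U+20AC fits in a single 16-bit unit, while U+1F600 lies outside the BMP and needs two units (a surrogate pair), which is why a strict 2-bytes-per-character scheme cannot cover all of Unicode.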
