Unicode Code Points and Graphemes

If I really want to be precise, I should include one more concept beyond that of code points. At times, in fact, multiple code points could be used to represent a single grapheme (a visual character). This is generally not a letter, but a combination of letters or letters and symbols. For example, you if have a sequence of the code point representing the Latin letter a (#$0061) followed by the code point representing the grave accent (#$0300), this should be displayed as a single accented character.

10 In Delphi the Unicode Code Points are represented using the UCS4Char data type, which is covered in the section "32-bit Characters" of Chapter 2.

11 The 2-byte Universal Character Set (UCS-2) is now considered an obsolete character encoding. Still, both UTF-16 and UCS-2, map the code points contained within the BMP in the same way, excluding the 2,048 special surrogate code points.

In Delphi coding terms, if you write the following (available again in the FromAsciiToUnicode example):

var str: String; begi n str := #$0061 + #$0300; ShowMessage (str);

the message will have one single accented character:



1 OK |

In this case we have two Chars, representing two code points, but only one grapheme. The fact is that while in the Latin alphabet you can use a specific Unicode code point to represent the given grapheme (letter a with grave accent), in other alphabets combining Unicode code points is the only way to obtain a given grapheme (and the correct output).

Was this article helpful?

0 0
Project Management Made Easy

Project Management Made Easy

What you need to know about… Project Management Made Easy! Project management consists of more than just a large building project and can encompass small projects as well. No matter what the size of your project, you need to have some sort of project management. How you manage your project has everything to do with its outcome.

Get My Free Ebook

Post a comment