Unicode Code Point Descriptions

On the Unicode Consortium web site, you can find a lot of information, including a text file with a written description of a large number of code points (most of them, excluding the unified ideographs for Chinese, Japanese, and Korean). I've used this file to create an extended version of the UnicodeMap program, called UnicodeData. The user interface is based on the same structure, but the program reads and parses the UnicodeData.txt file [14] and adds any available character description to the status bar as the mouse moves over the grid.

[14] The URL for this file is http://unicode.org/Public/UNIDATA/UnicodeData.txt. There is a second, much larger file for the unified ideographs (which I've not used in the demo), available at http://www.unicode.org/Public/UNIDATA/Unihan.zip.

Parsing the file is not terribly simple, as not all of the Unicode symbols are present. I resorted to creating a StringList with information in the format charnumber=description, extracted from the file. The original file uses semicolons for separating fields and a line feed character (alone, not combined with a carriage return) to end each record. After loading the entire file into a string, I use the following code to parse it and move the two descriptions to the information section (as at times only one or the other description is relevant):

  nPos := 1;
  // now parse the Unicode data
  while nPos < Length (strData) - 2 do
  begin
    strSingleLine := ReadToNewLine (strData, nPos);
    nLinePos := 1;
    strNumber := ReadToSemicolon (strSingleLine, nLinePos);
    strDescr1 := ReadToSemicolon (strSingleLine, nLinePos);
    Skip8Semi (strSingleLine, nLinePos);
    strDescr2 := ReadToSemicolon (strSingleLine, nLinePos);
    sUnicodeDescr.Add(strNumber + '=' + strDescr1 + ' ' + strDescr2);
  end;
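The helper routines called in the parsing loop (ReadToNewLine, ReadToSemicolon, and Skip8Semi) are not listed in this section. The console program below is a minimal sketch of how they might be written, assuming each one returns the text up to the next delimiter and advances the position just past it. The routine bodies, the shared ReadToChar helper, and the sample record are assumptions made for illustration, not the demo's actual code.

```pascal
program ParseUnicodeLine;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical shared helper: returns the text from nPos up to (but not
// including) the next cDelim, and moves nPos just past that delimiter.
function ReadToChar (const strData: string; var nPos: Integer;
  cDelim: Char): string;
var
  nStart: Integer;
begin
  nStart := nPos;
  while (nPos <= Length (strData)) and (strData [nPos] <> cDelim) do
    Inc (nPos);
  Result := Copy (strData, nStart, nPos - nStart);
  Inc (nPos); // step over the delimiter (or past the end of the string)
end;

// Assumed behavior: one record per line, terminated by a line feed
function ReadToNewLine (const strData: string; var nPos: Integer): string;
begin
  Result := ReadToChar (strData, nPos, #10);
end;

// Assumed behavior: one field per call, terminated by a semicolon
function ReadToSemicolon (const strData: string; var nPos: Integer): string;
begin
  Result := ReadToChar (strData, nPos, ';');
end;

// Assumed behavior: discards the eight fields between the two descriptions
procedure Skip8Semi (const strData: string; var nPos: Integer);
var
  I: Integer;
begin
  for I := 1 to 8 do
    ReadToSemicolon (strData, nPos);
end;

const
  // Sample record in the UnicodeData.txt layout (semicolon-separated fields)
  strSample =
    '0028;LEFT PARENTHESIS;Ps;0;ON;;;;;Y;OPENING PARENTHESIS;;;;' + #10;

var
  nPos, nLinePos: Integer;
  strSingleLine, strNumber, strDescr1, strDescr2: string;
begin
  nPos := 1;
  strSingleLine := ReadToNewLine (strSample, nPos);
  nLinePos := 1;
  strNumber := ReadToSemicolon (strSingleLine, nLinePos);
  strDescr1 := ReadToSemicolon (strSingleLine, nLinePos);
  Skip8Semi (strSingleLine, nLinePos);
  strDescr2 := ReadToSemicolon (strSingleLine, nLinePos);
  // Same charnumber=description format used for the string list
  Writeln (strNumber + '=' + strDescr1 + ' ' + strDescr2);
end.
```

In this sketch the code point and the first description are fields 0 and 1 of the record, and skipping eight fields lands on field 10, which holds the second (older) description when one exists.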

This code could be executed in the message handler of a wm_user message posted to the main form in its OnCreate event handler, to let the system start up the main form before doing this lengthy operation. In the complete program, the loop also updates the status bar to inform users of the current progress, and has some further termination code to skip parsing characters above $ffff.
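The termination check for characters above $ffff is not listed here. One possible way to write it, sketched below as a small console program, is to convert the hexadecimal code point field to an integer and leave the loop at the first value above $FFFF. The IsAboveBmp function and the sample values are hypothetical, and leaving the loop early (rather than skipping single records) assumes that the records appear in ascending code point order.

```pascal
program StopAtBmp;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: True when the hexadecimal code point string
// (the first field of a record) is above $FFFF.
function IsAboveBmp (const strNumber: string): Boolean;
begin
  // StrToInt accepts hexadecimal input when the string starts with '$'
  Result := StrToInt ('$' + strNumber) > $FFFF;
end;

var
  sNumbers: TStringList;
  I: Integer;
begin
  sNumbers := TStringList.Create;
  try
    // Stand-ins for the first field of consecutive records
    sNumbers.Add ('FFFC');
    sNumbers.Add ('FFFD');
    sNumbers.Add ('10000');
    sNumbers.Add ('10001');
    for I := 0 to sNumbers.Count - 1 do
    begin
      if IsAboveBmp (sNumbers [I]) then
        Break; // assumes later records only have higher code points
      Writeln ('parsed ', sNumbers [I]);
    end;
  finally
    sNumbers.Free;
  end;
end.
```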

The information stored in the string list is extracted when you have to display the description of a character, with this additional code of the StringGrid1MouseMove method:

  if Assigned (sUnicodeDescr) then
  begin
    strChar := IntToHex (nChar, 4);
    nIndex := sUnicodeDescr.IndexOfName(strChar);
    StatusBar1.SimpleText := StatusBar1.SimpleText + ' — ' +
      sUnicodeDescr.ValueFromIndex [nIndex];
  end;
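Because not every code point has an entry in the list, IndexOfName can return -1. The sketch below wraps the lookup in a hypothetical DescribeChar function that checks for that case explicitly, so the caller can decide whether to append anything to the status bar. The function name and the sample entry are assumptions for illustration. The program runs as a console application and needs no form.

```pascal
program LookupDescr;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: returns the stored description for a code point,
// or an empty string when the list has no entry for it.
function DescribeChar (sDescr: TStrings; nChar: Integer): string;
var
  nIndex: Integer;
begin
  Result := '';
  if not Assigned (sDescr) then
    Exit;
  // Keys are four-digit uppercase hex strings, as produced by IntToHex
  nIndex := sDescr.IndexOfName (IntToHex (nChar, 4));
  if nIndex >= 0 then
    Result := sDescr.ValueFromIndex [nIndex];
end;

var
  sUnicodeDescr: TStringList;
begin
  sUnicodeDescr := TStringList.Create;
  try
    // One entry in the charnumber=description format
    sUnicodeDescr.Add ('0028=LEFT PARENTHESIS OPENING PARENTHESIS');
    // Code point with an entry
    Writeln (DescribeChar (sUnicodeDescr, $28));
    // Code point without an entry: prints empty brackets
    Writeln ('[' + DescribeChar (sUnicodeDescr, $FFFF) + ']');
  finally
    sUnicodeDescr.Free;
  end;
end.
```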

Having information about the code points, the program could also create a more logical element tree. This is not too difficult for the various alphabets, but most symbols have a generic name with no indication that they are part of a given group. Coming up with a proper grouping of all Unicode code points is possible by reading the various documents [15], but not by parsing the UnicodeData.txt file.
