Wednesday, August 8, 2012

Unicode Endgame

First step

Sort out the literal strings in the code.

Debugging Text intended to be dumped to debug log files


This is used for trace messages, error-code dumps and stack traces, none of which the user will (or should) see.

Debugging messages that will be compiled out could stay as simple char *.  Or equally they could all get pulled up to a common standard as a std::wstring.

My feeling is that pulling it all up makes life simpler. But there should not be any crossover between this kind of string and user messages. Perhaps separation of types would enforce that assumption.
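One way the separation of types could be enforced is sketched below. The names DebugText, UserText, logDebug and showUser are made up for illustration (they are not from my codebase), and the two vectors are just stand-ins for the real debug log and the real user message path.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical wrapper types. Both hold UTF-8 bytes, but they are distinct
// types, so one cannot be passed where the other is expected.
struct DebugText {
    explicit DebugText(std::string text) : utf8(std::move(text)) {}
    std::string utf8;
};

struct UserText {
    explicit UserText(std::string text) : utf8(std::move(text)) {}
    std::string utf8;
};

// Stand-ins for the debug log and the user-facing message sink.
std::vector<std::string> g_debugLog;
std::vector<std::string> g_userMessages;

void logDebug(const DebugText& text) { g_debugLog.push_back(text.utf8); }
void showUser(const UserText& text) { g_userMessages.push_back(text.utf8); }

// The compiler now enforces the assumption: no implicit conversions exist
// between the two kinds of text, or from a bare std::string.
static_assert(!std::is_convertible<DebugText, UserText>::value, "no crossover");
static_assert(!std::is_convertible<UserText, DebugText>::value, "no crossover");
static_assert(!std::is_convertible<std::string, UserText>::value, "must be explicit");
```

With this in place, showUser(DebugText("hr=0x80004005")) simply does not compile, which is the point.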

Exception Messages 

There is a slew of exceptions that pick up explanatory strings.  There are even bare char strings being thrown.  This has to stop. There is actually an exception hierarchy somewhere.  Must dust it off and implement it completely.

But what sort of text is appropriate?

Mostly the text falls into two distinct classes: user errors to talk to the user about (which go out via the ExceptionManager interface) and debugging text, much of which is consumed or dumped to the debug log.
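A minimal sketch of how an exception hierarchy could keep those two classes of text apart. AppException, UserError and textForUser are hypothetical names, not the actual hierarchy in my code, and the ExceptionManager interface is deliberately not modelled here.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical base class: what() carries UTF-8 debugging text for the log.
class AppException : public std::runtime_error {
public:
    explicit AppException(const std::string& debugUtf8)
        : std::runtime_error(debugUtf8) {}
};

// Hypothetical user-facing error: carries a separate UTF-8 message that is
// safe to show to the user, alongside the debugging text.
class UserError : public AppException {
public:
    UserError(const std::string& userUtf8, const std::string& debugUtf8)
        : AppException(debugUtf8), userUtf8_(userUtf8) {}

    const std::string& userMessage() const { return userUtf8_; }

private:
    std::string userUtf8_;
};

// Returns the text to show the user, or an empty string when the exception
// only carries debugging text.
std::string textForUser(const std::exception& e) {
    if (const UserError* userError = dynamic_cast<const UserError*>(&e)) {
        return userError->userMessage();
    }
    return std::string();
}
```

A catch block can then always log e.what() and only talk to the user when textForUser(e) is non-empty.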


More Definitions for Bits of Unicode


Code Point  http://en.wikipedia.org/wiki/Code_point

Code Unit http://en.wikipedia.org/wiki/Code_unit  (Titled "Character Encoding")

Code Page http://en.wikipedia.org/wiki/Code_page

Character - Vague Idea. See Glyph.

Character Encoding - http://en.wikipedia.org/wiki/Code_unit

Glyph http://en.wikipedia.org/wiki/Glyph

Grapheme & Grapheme Cluster http://useless-factor.blogspot.com.au/2007/08/unicode-implementers-guide-part-4.html
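To make the difference between code units and code points concrete, here is a small sketch for UTF-8. countCodePoints is a hypothetical helper and it assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by counting every byte that is NOT a
// continuation byte (continuation bytes look like 10xxxxxx).
// Assumption: the input is well-formed UTF-8.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For "caf\xC3\xA9" (café with a precomposed é), size() reports 5 code units but there are 4 code points. For "e\xCC\x81" (e followed by a combining acute accent) there are 3 code units and 2 code points, yet a reader sees a single grapheme cluster, which is why "character" is such a vague idea.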


So my current question is about the preferred character encoding for use internally in my app.

http://en.wikipedia.org/wiki/UTF-8 
http://unicode.org/notes/tn12/ Advocates for UTF-16 - Essentially says ... everyone else is doing it so you should too.
http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
Advocates for UTF-8 - Essentially says... crap decisions were made... don't make them too.

http://utf8everywhere.org/
This answers the question again and extends the above couple of items with more clarity and specifics. Seems to be the same or similar author.  Best collection of specifics I have found anywhere so far.   I like!


The End Result of my days of Unicode research???

Store the text internally as UTF-8.  Handle most of the stuff as ASCII anyway.  Turn on _UNICODE flag to deal with the transparent swapping in the Win32 API now that my hand has been forced.  Only transcode to wide characters where needed for compatibility with the APIs in use.  Try like hell not to need to process text.
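For a feel of what "transcode to wide characters" involves, here is a rough sketch of a UTF-8 to UTF-16 conversion. utf8ToUtf16 is a hypothetical helper written for illustration only: it assumes well-formed UTF-8 and does no validation, and it uses char16_t so it behaves the same on platforms where wchar_t is not 16 bits. In real code a library routine should do this job.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes well-formed UTF-8 into code points and re-encodes them as UTF-16.
// Assumption: input is valid UTF-8; malformed input gives unspecified output.
std::u16string utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t codePoint;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { codePoint = lead;        extra = 0; }
        else if (lead < 0xE0) { codePoint = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { codePoint = lead & 0x0F; extra = 2; }
        else                  { codePoint = lead & 0x07; extra = 3; }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < utf8.size(); --extra, ++i) {
            codePoint = (codePoint << 6) |
                        (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        if (codePoint < 0x10000) {
            // Fits in a single UTF-16 code unit.
            out.push_back(static_cast<char16_t>(codePoint));
        } else {
            // Needs a surrogate pair (two UTF-16 code units).
            codePoint -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (codePoint >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (codePoint & 0x3FF)));
        }
    }
    return out;
}
```

Note that the euro sign (3 bytes in UTF-8) becomes one UTF-16 code unit, while U+1F600 (4 bytes in UTF-8) becomes two. Neither encoding gives one unit per character, which is part of why I try not to process text at all.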

At least this has forced me to deal with this issue explicitly and understand some of the subtle bugs that were lurking.  As usual... ignorance is no protection.

Boost Locale for formatting is pretty heavy duty
http://www.boost.org/doc/libs/1_50_0/libs/locale/doc/html/main.html

Boost (Maybe) Nowide is a much simpler solution (you will still need Boost.Locale installed)
http://cppcms.com/files/nowide/html/

It's trivial to use and "Just Works"(tm).



So my Unicode awareness strategy has been:

1) Turn on UNICODE and _UNICODE build flags.
2) Wrap all literals in one of two Macros

#define _UTF8(x)  x
#define _UTF16(x) L##x

These replace the _T(), TEXT() etc variants that litter various bits of code in the codebase.  There are a couple of places that I have explicitly left these alone. This is where I am including the source from someone else's work and it's included "as-is".

The only other variation is at the interface to XERCES.  There is a lot of text handling already wrapped around this and I need to get my head clear before I simplify all that.  There is a messy transcoding class with an X() macro that seems to transcode between literals and the XERCES XMLCh * type, which I am guessing is probably a UTF16 wide type used internally in XERCES.  I have just not gotten to this bit yet.

3) Explicitly pick the Win32 API functions that are being used.  So rather than "DrawText", which plays swapsies when UNICODE is turned on, I have explicitly used "DrawTextW" in the code and used boost::nowide::widen() to pull my internal UTF8 strings up to UTF16 at the API call sites.

This forces the compiler to find all the locations where I am passing std::string to an old API call which I can then address (see below).

4) Naming variables with UTF8 or UTF16 as part of the name to describe what it logically holds.  I know that Hungarian notation is dead.. blah blah.  But this is about what is logically stored in the variable, not about its type.  This is simply a transitional technique to force me to consider and explicitly recognise what's going on with the logical content of the variables.  There is too much code for me to physically eyeball and think about everything, so I need to force the compiler to play on my team.

So the code starts to look like this:

//Type defs in a header...
#include <string>
#include <boost/nowide/convert.hpp>
#include <boost/nowide/iostream.hpp>
typedef std::string t_utf8_str;
typedef std::wstring t_utf16_str;

//some literal in the code flow...
t_utf8_str myLiteralUTF8("Tada!");

//Use the literal for various stuff  (note the UTF8 aware cout from the nowide lib)
boost::nowide::cout << myLiteralUTF8;

//Transcode it only at the interface with the Win32API
t_utf16_str myLiteralUTF16 = boost::nowide::widen(myLiteralUTF8);

//Use and discard it.
HRESULT hr = SomeAPICallW(myLiteralUTF16.c_str(), etc, etc);

This way, there are no accidental uses of wide strings without explicitly knowing that they contain what I think they contain.
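A small sketch of why wrapping literals and calling the wide API explicitly gets the compiler on my team. The macro names APP_UTF8 and APP_UTF16 are hypothetical stand-ins for my two macros (strictly speaking, names that start with an underscore followed by a capital letter are reserved for the implementation, so the sketch avoids them), and wideOnlyLength is a made-up stand-in for a ...W function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// Hypothetical equivalents of the two literal-wrapping macros.
#define APP_UTF8(x)  x
#define APP_UTF16(x) L##x

// The narrow macro leaves a char literal alone; the wide macro pastes an L
// prefix on, producing a wchar_t literal. The types really are different.
static_assert(std::is_same<decltype(APP_UTF8("Tada!")),
                           const char (&)[6]>::value, "narrow literal");
static_assert(std::is_same<decltype(APP_UTF16("Tada!")),
                           const wchar_t (&)[6]>::value, "wide literal");

// Hypothetical stand-in for a wide-only API call.
std::size_t wideOnlyLength(const wchar_t* text) {
    return std::wstring(text).size();
}
```

wideOnlyLength(APP_UTF16("Tada!")) compiles, while wideOnlyLength(APP_UTF8("Tada!")) is a compile error. That error is exactly the flag I want at every call site that still passes a narrow string to a wide API.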


I built a couple of regex searches to trawl the source and find all the quoted strings and wrap them in the _UTF8() macro.  This forced a bunch more implicit conversions to get picked up and I could then explicitly handle them.

The search regex (in Visual Studio find replace dialog) is

~((\#include)|(_UTF8)|(_UTF16)){[\(\,:b]}{:q}

The Replace regex is

\1_UTF8(\2)

This simply ignores the #include "someheader.h", any _UTF8("someLiteral") or _UTF16("Some other literal") that may be in the source.  This allowed me to step through the source and wrap all the literals easily.
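For anyone without the Visual Studio dialog, the same idea can be roughly approximated with std::regex. This is not the regex above translated one for one: std::regex has no lookbehind, so the sketch optionally captures the prefix and skips the match when a prefix was found. wrapLiterals is a hypothetical helper that works on one line at a time, and it knows nothing about comments, wide literals such as L"x", or character literals containing a quote.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Wraps bare "quoted strings" in _UTF8(...), leaving alone any literal that
// directly follows #include, _UTF8( or _UTF16(.
std::string wrapLiterals(const std::string& line) {
    // Group 1: an optional prefix that means "leave this literal alone".
    // Group 2: a quoted string, allowing backslash escapes inside it.
    static const std::regex pattern(
        R"re((#include\s*|_UTF8\(|_UTF16\()?("(?:[^"\\]|\\.)*"))re");

    std::string out;
    std::size_t last = 0;
    for (std::sregex_iterator it(line.begin(), line.end(), pattern), end;
         it != end; ++it) {
        const std::smatch& match = *it;
        const std::size_t start = static_cast<std::size_t>(match.position(0));
        out.append(line, last, start - last);  // text before this match
        if (match[1].matched) {
            out += match.str(0);               // already handled: keep as-is
        } else {
            out += "_UTF8(" + match.str(2) + ")";
        }
        last = start + static_cast<std::size_t>(match.length(0));
    }
    out.append(line, last, std::string::npos); // remainder of the line
    return out;
}
```

As with the find and replace dialog, this is something to step through and eyeball rather than run blind over a whole codebase.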

