Tuesday, August 7, 2012

Bloody Unicode

Oh how to describe my pain?  Is it like a summer's day or... fuck me! Lyrical is just not where my head is at right now.  The following is an exercise in getting my head straight.

I am, yet again, investigating upgrading a creaking old code base to handle Unicode.

Why, you might ask, would I do this stupid thing?  Wellll... simply because I have tried to ignore, hide from or otherwise avoid this issue for long enough and have finally needed to use a library that is forcing my hand by requiring a Unicode build.  This has caused a cascade of shit of epic proportions.

Don't get me wrong... I think I could avoid this a while longer by simply avoiding this library and going with some second rate half-assed replacement.... but in reality that is just throwing more effort into a worsening situation...that I know will come back to bite me... just with bigger teeth.

So do I avoid some technical debt by adding more bad code... or do I confront the demon and do the nasty?  Add to this equation a very limited time span to get it all done..


So, firstly, I google for the current state of play in "Unicode Solutions" and am dismally disappointed to find it's about the same as last time I looked (probably a year or two ago): a few half applicable tutorial-article things and a couple of libraries with abysmal documentation. The general wisdom of the herd on stackoverflow or other centers of excellence is pretty thin....

So, let's start with a problem description:

Problem Description


Unicode is simply a different way of handling character data.  Mmmk!  Your code is a mess of single width character handling shit from about the past 6 generations of coding styles.  MmmmK!

A + B = Problem!

Well.... after looking through the docs and a few battle stories... my conclusion is that I am a little bit fucked.  (Mostly because of the short time to fix... not the actual difficulty)
Just for fun I turn on the Unicode build and see what happens...  watch those errors fly...  about 1000 give or take.  (I hope there is lots of repetition in there...)  This is mildly embarrassing... but I will probably get over it.  Yep. Over it.

Problem Analysis

The code base contains a spicy stew of Win32 C routines, C++ with all manner of string and naked char * handling, along with much use of STL bits and pieces.  Toss in some _T macros here and there and some templates and it's about as messy as can be imagined.  It's also been stitched together from all sorts of downloaded code with a mess of styles and pushed out without sufficient resources to clean it all up and get some consistency.  Just your average rotting old code base.

About the only thing that seems to be missing is the use of MBCS or other third party string libraries... I'm probably just not looking hard enough yet.

Error Types

From my first cursory look... it looks like lots of type casts and failures to handle types in function definitions.


The ugly stuff seems to be in the interfaces to libraries where they have hard coded type assumptions in the interfaces.  There is a mix of shit there from single byte char * style c-strings to STL style std::string (pointers and refs) through to Win32 style cstrings and the spew of pointer types to similar constructs that Microsoft eternally tries to baffle us with.

There is a mess of errors generated by the use of STL exception classes that cannot handle wide characters.  (How weak is that?)

Another pile is just where I have hard coded a type rather than used my own type definition.  Should be easy to fix.
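For the record, this is roughly what I mean by "my own type definition". It's a minimal sketch, not what is actually in the code base: the names app_char, app_string, APP_T and make_greeting are made up for illustration, and on Windows the same job is normally done by TCHAR and _T from tchar.h.

```cpp
#include <string>

// Hypothetical names for this sketch: app_char, app_string, APP_T.
// One switch decides the character width for the whole build.
#if defined(UNICODE) || defined(_UNICODE)
typedef wchar_t app_char;
#define APP_T_IMPL(x) L##x
#else
typedef char app_char;
#define APP_T_IMPL(x) x
#endif
#define APP_T(x) APP_T_IMPL(x)

// Every string in the app uses this instead of a hard coded std::string.
typedef std::basic_string<app_char> app_string;

// Code written against app_string compiles unchanged in either build.
inline app_string make_greeting(const app_string& name)
{
    return app_string(APP_T("Hello, ")) + name;
}
```

Anywhere I wrote std::string or char * directly is a place this switch cannot reach, which is exactly the pile of errors described above.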

There is another big pile where I interface with the XERCES XML parser and XSD which should be sorted out once I get the compiler flags untangled.

Time for some thinking and planning...


Logically, the actual app should only really have three types of problem.

1)  Input data (Command Line, Form Fields, read from files or read from port stream)

2) Internal representation (Exceptions text, Code Literals and Constants, data in play)

3) Outputable strings ( Screen messages, written to data, written to ports)


So the question is really:
How to handle these cases elegantly...
How to pull all the crap code up to standard...
and finally how to do it quickly.

Input Data

Most of this stuff is user data.  Like in any app... never trust the user to do something nice.  Assume the worst, parse and clean... then reject everything out of hand that doesn't suit my assumptions.  DO NOT BE KIND TO THE USER.  Don't help them.  Don't quietly fix their mistakes and finally do not fix their assumptions.  Report all errors in mind numbing detail. Make them fix it.

But on to the real problem.  Command line data arrives from the OS as a char * with a null terminator.  Could it contain Unicode characters?  Don't know.  Googled it: yes... but with some issues on Windows (and various other freaky issues on various other platforms... never saw that coming...)
OK, so I could get Unicode characters... presented as a char *?  WTF? OK, use GetCommandLineW rather than picking up the c string from the params field.  (Should I use it explicitly or use GetCommandLine and let the UNICODE compiler macro make the switch...hmmmm.  Note to self: the Windows headers key off UNICODE; _UNICODE is the one the C runtime's tchar.h looks at, so a Unicode build wants both defined.)

See...
http://stackoverflow.com/questions/7660651/passing-command-line-unicode-argument-to-java-code/#9043883
http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how
http://msdn.microsoft.com/en-us/library/windows/desktop/ms683156%28v=vs.85%29.aspx
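A minimal sketch of the explicit route, assuming a Windows build linked against Shell32.lib. GetCommandLineW, CommandLineToArgvW and LocalFree are the real Win32 calls; the wrapper name wide_command_line is my own invention, and the non-Windows branch only exists so the sketch compiles elsewhere.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   // CommandLineToArgvW; link with Shell32.lib
#endif

// Returns the process arguments as wide strings.
// Hypothetical helper: on non-Windows builds it just returns an empty list.
std::vector<std::wstring> wide_command_line()
{
    std::vector<std::wstring> args;
#ifdef _WIN32
    int count = 0;
    // Ask the OS for the wide command line and let it do the splitting.
    LPWSTR* argv = CommandLineToArgvW(GetCommandLineW(), &count);
    if (argv != NULL) {
        for (int i = 0; i < count; ++i)
            args.push_back(argv[i]);
        LocalFree(argv);   // the whole array is one allocation the caller must free
    }
#endif
    return args;
}
```

Calling the W version explicitly means this bit works the same whether or not the rest of the build has been switched over yet, which suits a code base being converted in pieces.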

Form Fields are not a problem at the moment because I'm not dealing with the GUI.. per se.  Although I am dealing with keyboard input... but as keystrokes without caring much about the character... then again...

Reading Unicode from Files


Ok, so a file is stored as ANSI text characters or is it?  Go educate yourself... or confuse yourself even more.
http://www.codeproject.com/Articles/12759/Reading-or-writing-a-line-from-to-an-ANSI-Unicode
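The first thing a file reader can do is look for a byte order mark at the start of the data. Here is a rough sketch of that check; the enum and the function name sniff_bom are made up for illustration, and the byte values are the standard BOM signatures. No BOM tells you nothing: the file could still be ANSI, BOM-less UTF-8 or anything else.

```cpp
#include <cstddef>

// Hypothetical result type for this sketch.
enum TextEncoding {
    ENC_UNKNOWN,     // no BOM found: ANSI, UTF-8 without BOM, or something else
    ENC_UTF8,
    ENC_UTF16_LE,
    ENC_UTF16_BE,
    ENC_UTF32_LE,
    ENC_UTF32_BE
};

// Inspect the first few bytes of a buffer for a byte order mark.
TextEncoding sniff_bom(const unsigned char* data, std::size_t size)
{
    // Check the 4-byte marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return ENC_UTF32_LE;
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return ENC_UTF32_BE;
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return ENC_UTF8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return ENC_UTF16_LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return ENC_UTF16_BE;
    return ENC_UNKNOWN;
}
```

One known wart: a UTF-16 LE file whose first real character is U+0000 will be reported as UTF-32 LE. I am assuming that never happens in a text file I care about.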

Clearly I am not yet on top of this subject... so let's start with some general definitions.

Reading Unicode from a Port Stream

When I say a port stream, what do I mean?  Keyboard data? Parallel Port characters? Serial Port Packets? IP Network packets?

Keyboard Data - This is not Unicode. It's keyboard scan codes.  Does it represent Unicode in the typist's head? Possibly.  Most of this crap is handled by the OS and GUI for forms apps. But for low level game code... it's all you, baby!

Parallel Port Data - This is barely a character, let alone an encoded character. Think of it as flag data.  It still has the possibility to be interpreted as something higher level... but if you are trying to pass encoded characters bi-directionally via a Parallel Port... you probably have bigger problems and can handle sequential parsing and encoding without breaking a sweat.  I only use Parallel Ports for passing event flags, so I doubt this will be a problem any time soon. (How many times have I heard the echo of those words come back to slap me in the head....)

Serial Port Data - Currently not using the Serial Port in this project, but it's been talked about. Since it's simply a binary stream, it's more a problem of coordinating the transport layer.  Once you have the stream moving happily then you can agree on the encoding that the bits represent.  Again, similar to the Parallel Port... if you are doing this sort of thing.. then how the data is encoded is fairly trivial.  Just pack the packets, pass the stream... catch the stream in the buffer then unpack the packet.... At no point should you need to deal with many "unknowns".  There is little chance of a User trying to pass you a poorly formed packet.  If so, reject it all, complain loudly and make them fix it.

IP Network Data - Well hell... it could be anything.  The point being that within the context of my app... it's probably going to be structured event data with very little text content. So ... I will probably treat it as above. Buffer it, unpack the packet. Then put any text data into a wide string and handle it as required. (By passing it to the log files or data files)
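To make "unpack the packet, then put any text data into a wide string" concrete, here is a sketch under loudly stated assumptions: the packet layout (a 2-byte big-endian length followed by that many bytes of UTF-8) is invented for this example, and so are the function names. The decoder is deliberately strict: anything malformed gets thrown back rather than quietly fixed.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Strict UTF-8 to wide string conversion: any malformed sequence throws.
std::wstring utf8_to_wide(const unsigned char* p, std::size_t n)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < n) {
        unsigned long cp = p[i];
        std::size_t extra = 0;   // number of continuation bytes expected
        if (cp < 0x80)                { extra = 0; }
        else if ((cp & 0xE0) == 0xC0) { extra = 1; cp &= 0x1F; }
        else if ((cp & 0xF0) == 0xE0) { extra = 2; cp &= 0x0F; }
        else if ((cp & 0xF8) == 0xF0) { extra = 3; cp &= 0x07; }
        else throw std::runtime_error("bad UTF-8 lead byte");

        if (extra > n - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("bad UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogates and out-of-range values.
        static const unsigned long min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t (Windows): encode as a UTF-16 surrogate pair.
            cp -= 0x10000;
            out += static_cast<wchar_t>(0xD800 + (cp >> 10));
            out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));
        } else {
            out += static_cast<wchar_t>(cp);
        }
        i += extra + 1;
    }
    return out;
}

// Hypothetical packet layout: [length hi][length lo][UTF-8 bytes...]
std::wstring unpack_text_field(const unsigned char* packet, std::size_t size)
{
    if (size < 2)
        throw std::runtime_error("packet too short");
    std::size_t len = (static_cast<std::size_t>(packet[0]) << 8) | packet[1];
    if (len != size - 2)
        throw std::runtime_error("length field does not match packet size");
    return utf8_to_wide(packet + 2, len);
}
```

The point of doing the conversion right at the edge is that everything past this function only ever sees a std::wstring and never has to care what was on the wire.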

WTF is Unicode again?

Even after all the books, websites and tutorials I've read... (I know the real definitions, it's the definitions that are in-use by the rest of the world that's making the mess)

Unicode - is a fucking mess created by people trying to sort out a bigger mess, grafted on top of the many different messes created by ANSI centric language and library designers for the past 50~ish years. Then lots of half literate bastards have written poorly worded articles, tutorials, books, docs, libraries and operating systems which use the term loosely and thus add to the scale of the mess exponentially.

Unicode data - data stored in various incompatible formats called variously ANSI, UTF-8, UTF-16, UTF-32, Unicode (meaning UTF-16 Little Endian), Unicode Big Endian (meaning UTF-16 Big Endian).  But in reality it's just data... it's how it's used that really makes it Unicode or not.

Unicode string - some sort of internal software construct containing (in the programmer's head) data that through some mechanism may turn out to be related in some fashion to a Unicode application, system, files, books, triple box set or drinking game.   In reality, it's just data until it's "interpreted" by something or shown to someone.


Hmmm.... I feel like there is some clarity.

In reality, there are only three problems in my app.  The first is to detect and handle Unicode at all Input sites.  The second is to deal with an internal representation of data without messing up its potential Unicode structure. Finally, displaying Unicode where appropriate in all its glory.

The third part is in reality the ugly issue.  My app is hard coded with the assumption of Left to Right directionality in all sorts of places.  Since I'm manually building most of the screens, there is no help from the OS with anything like a GUI.  If I want to deal with Right to Left directionality in any way gracefully, it will probably mean a mess of work. Which in reality may be irrelevant as my customer base is Academics and is probably mostly English literate... whether they like it or not.  Sooooo.... is this really a case that is worth any of my time to deal with?

Can I just deal with it by wrapping it up in an object and deal with it later when someone complains?

Should I ignore it completely? Hell even my output files are fundamentally left justified.  Just about everything assumes Left to Right.

More thinking required....


Mostly, I think I can deal with the input and the storage of Unicode text.  I can probably deal with output of any Unicode text if it's Left to Right without much change (assuming font choices do not make a mess of anything), but dealing with Right to Left layout is just going to make a complete mess of everything.  Every screen will need to be individually re-thought, laid out and then tested to make sure it works correctly. Then we start to deal with all the issues of line break rules, wrapping etc.

All this without answering the question about if I even care.

My best guess is that the bulk of my client base will be able to deal with English either gracefully or not.  This is based on my email list... which, while not all from English speaking countries... have generally been literate in English.  Also, equivalent systems are primarily available in English.  This is not to say that this is "Right", just that most of the other system developers have taken the easy options and stayed with the European languages of English, French, Spanish etc.

So is it just that my product has defined the clients, or that the clients will define the product?  I have had some interest from China, so I would expect to have to deal with other language sets sometime soon.  They are Left to Right, aren't they?  Educate thyself...

http://en.wikipedia.org/wiki/CJK_characters
http://en.wikipedia.org/wiki/Han_unification

Nope. Top to bottom... but can be written left to right... thankfully.

Ok, so while it's a politically charged situation... from my point of view it will still be some form of string that shows up (due to the OS forcing the client to use Unicode) so the semantic issues are not my problem.


Where does that leave me?

Latin, Greek and Cyrillic Based Writing systems - Not a problem. See http://en.wikipedia.org/wiki/Cyrillic_script and http://en.wikipedia.org/wiki/Greek_alphabet
Chinese, Japanese, Korean, Vietnamese - (Top to bottom but can be left to right) - Not too much of a problem.
Arabic, Hebrew - (Right to Left) - Big problem. (others here http://en.wikipedia.org/wiki/Right-to-left )  (Can I solve this simply by ignoring it and making my users
Bi-directional Text (BiDi) - Who would do this?  What sort of insane person tries to build systems to mush all this crap together and then builds a sub system within it to further mangle everything so different frankenstein bits can live in the same freak'n phrase.  (Yes I understand why they would do this ... in the cold light of day... but then to actually implement a system around this crap.... this is the insano bit)
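If the plan is to park Right to Left until someone complains, the least I can do is notice when it shows up. A crude sketch, assuming text is already in a std::wstring: the function name contains_rtl is made up, and it only checks the Hebrew (U+0590 to U+05FF) and Arabic (U+0600 to U+06FF) blocks, which is nowhere near the full list of Right to Left scripts.

```cpp
#include <cstddef>
#include <string>

// True if the text contains a character from the Hebrew (U+0590..U+05FF)
// or Arabic (U+0600..U+06FF) blocks.
// Assumption: these two adjacent blocks stand in for "Right to Left";
// other RTL scripts and the presentation-form blocks are NOT covered.
bool contains_rtl(const std::wstring& text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned long c = static_cast<unsigned long>(text[i]);
        if (c >= 0x0590 && c <= 0x06FF)
            return true;
    }
    return false;
}
```

That is enough to log a warning or refuse the input on screens that assume Left to Right, without pretending to do any actual BiDi layout.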

So can I support all these different systems or am I way down the rabbit hole already?  In reality, most of these systems are irrelevant to my problem space.  Simply because any text the user feeds in will be used literally, so it doesn't matter.  The only bit that does matter is where that text is used to issue commands to the system. In this case, I can only accept commands in English.  Simply because currently that's the only language I can test.
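The split between "text used literally" and "text used as a command" can be enforced with a very dumb check. This is a sketch, not the real parser: is_ascii_command is a made-up name, and the allowed character set (ASCII letters, digits and underscore) is my assumption about what a command token looks like.

```cpp
#include <cstddef>
#include <string>

// Accept a command token only if it is non-empty and made purely of
// ASCII letters, digits and underscores. Everything else is rejected outright.
// Assumption: this character set is what a command token is allowed to contain.
bool is_ascii_command(const std::wstring& token)
{
    if (token.empty())
        return false;
    for (std::size_t i = 0; i < token.size(); ++i) {
        wchar_t c = token[i];
        bool ok = (c >= L'a' && c <= L'z') ||
                  (c >= L'A' && c <= L'Z') ||
                  (c >= L'0' && c <= L'9') ||
                  c == L'_';
        if (!ok)
            return false;
    }
    return true;
}
```

Free text from the user never goes through this; it gets stored and echoed back as-is. Only tokens that drive the system get the strict treatment.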

The explanation of those commands can be delivered in any language.. but that's a documentation issue.

Hmmm... All the XML in the script files is essentially in English.... There is no way I can translate all that and support 100 different language variants of the file format... that is just stupidity on a grand scale.

So, to deal with that... translate all the GUI text for the editor and force any non-English speakers to only use the GUI for editing. Unless they want to figure out the equivalent artifact in English within the file format... yuck. But that's about the only reasonable option I think I have.

Is it too late to ask for the Red pill?


Fonts, Locales and OS Language Settings

Do I even want to go here?   How much of this pain can I just chuck back in the User's lap?

I need coffee...



So in summary, there is a mess of stuff to do.  Some of it has to do with my code being littered with bad habits. The rest is to do with all the habitual assumptions... which have turned out to be very English centric and are in effect not portable and thus... bad.

My ToDo list is something like this:

1) Extract all the text for errors, exceptions and user messages so it could be internationalised if required.
2) Handle input text in a Unicode aware way.
3) Store text internally as Unicode characters. Use only wide aware functions to manipulate it.
4) Output text as Unicode characters to the screen. (Including all error messages, dialogs and log files)
5) Put translation of the docs on the todo list so it can conceptually be done later.
6) Figure out how to test all this shit.
7) Actually test it.


I think, in general that exceptions should not be carrying much text. (if any!) My feel is that this is one of the bad habits that I need to remove from the code base.
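Something like the following is what I have in mind: the exception carries an error id and nothing else, and the wide text is looked up at the point where it is shown. This also sidesteps the problem that std::exception::what() only returns a narrow const char *. The names ErrorId, AppError and message_for are hypothetical, and the message strings are placeholders.

```cpp
#include <exception>
#include <string>

// Hypothetical error ids for this sketch.
enum ErrorId {
    ERR_FILE_NOT_FOUND = 1,
    ERR_BAD_PACKET     = 2
};

// The exception carries only an id; no user-facing text travels with it.
class AppError : public std::exception {
public:
    explicit AppError(ErrorId id) : id_(id) {}
    ErrorId id() const { return id_; }
    // what() stays narrow and fixed; it is for debugging, not for users.
    // (Pre-C++11 compilers would spell noexcept as throw().)
    virtual const char* what() const noexcept { return "AppError"; }
private:
    ErrorId id_;
};

// Wide text is looked up only where the error is reported.
// In a real build this table would come from the extracted string resources.
std::wstring message_for(ErrorId id)
{
    switch (id) {
    case ERR_FILE_NOT_FOUND: return L"File not found";
    case ERR_BAD_PACKET:     return L"Malformed packet";
    }
    return L"Unknown error";
}
```

A catch site then does something like `catch (const AppError& e) { show(message_for(e.id())); }`, where show is whatever the screen code provides, so item 1 on the ToDo list and the wide-character exception problem get solved in the same place.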

Looking over the resource files that already contain all the strings for the XML files... I am not feeling happy.  Since all this stuff is compiled in to the exe... perhaps I should extract it all and load it dynamically.  (Everything else is data driven... so why not the error text?)  Although this opens up the possibility of the user trying to change the functionality of the system by hacking the data files... ugh. Yet another debugging condition to consider when supporting them over the phone..... yuck. Ok, perhaps compiling the strings in is safer/less complicated.  It does make any translation a bit more complex and the exe becomes language specific.

Hmmm.... games within games.







