[wxJSON 1.1] ANSI / Unicode interoperability

Talk here about issues with one of the components hosted at wxCode, or suggest features for it.
luccat
Knows some wx things
Posts: 31
Joined: Tue Oct 23, 2007 9:10 am
Location: Italy

[wxJSON 1.1] ANSI / Unicode interoperability

Post by luccat »

Hi all,
I am ready to upload the new file release for version 1.1, but there is still an issue in the design of the wxJSON writer and reader.
The problem is ANSI / Unicode interoperability.

When I wrote version 0.3, which introduced Unicode support, I tried to achieve interoperability between ANSI and Unicode builds, so that JSON text transmitted from a Unicode app to an ANSI one (and vice versa) retains its original content.

The first solution (wxJSON <= 1.0) was that in ANSI builds the JSON parser stores Unicode escape sequences for unrepresentable characters, such as a Greek letter in a Latin-1 environment. These sequences have the form '\uXXXX', where 'XXXX' is a four-digit hex number representing the Unicode code point of the character.
I thought this was an elegant solution, but I was wrong. Because it uses only four hex digits, just the first Unicode plane (the so-called BMP) can be represented.
Moreover, writing a JSON string that contains such sequences to a UTF-8 stream does not convert them back to UTF-8.
I also think that such sequences are not valid JSON text.
Finally, not all characters can be exchanged between ANSI and Unicode builds without misinterpretation, so there is no full interoperability.


The second solution (wxJSON 1.1) uses a very different approach: JSON text read from a UTF-8 stream is stored in a temporary memory buffer and then converted to a wxString in one single call to wxString::FromUTF8().
In Unicode builds the conversion always succeeds, but in ANSI builds it may fail due to the presence of unrepresentable characters.
The workaround is to copy the raw UTF-8 buffer into the ANSI wxString object.
The problem then moves to the writer: when writing to a UTF-8 stream, the writer converts the wxString object to UTF-8, but since the ANSI wxString already contains UTF-8 code units, the conversion changes the original content. Here is an example:
If the input stream contains a character encoded as UTF-8:
If the input stream contains a character encoded as UTF-8:

Code: Select all

0xCE 0xB1		// the greek letter alpha, U+03B1
in a Latin-1 ANSI app these two bytes are stored exactly as they appear in UTF-8. They represent a capital I with circumflex (Unicode code point 206) and a plus-minus sign (Unicode code point 177).
Both characters have the same encoding in the Latin-1 charset and in Unicode.

When the writer writes these characters to a stream, it converts them to UTF-8 thus obtaining the following:

Code: Select all

0xC3 0x8E 0xC2 0xB1 
which is not the same as the input. On the other hand, there is no chance for the writer to know that those two bytes do not represent two actual Latin-1 characters but a single UTF-8-encoded code point (the Greek letter alpha). So this solution is not the right one either: no Unicode / ANSI interoperability is achieved.


A third solution is the one suggested to me by Piotr Likus:
"who cares about the internal encoding of wxString? Just copy the UTF-8 stream to wxString, without any conversion, when you read streams and just copy the wxString::c_str() buffer to the stream when you write."

Well, this would surely be the fastest solution, and it would work if wxString objects only contained strings read from UTF-8 streams, but there is still a problem: what happens if the strings were constructed in the application? Example:

Code: Select all

  wxJSONValue root;
  root.Append( "a latin-1 string; àèì" );
  wxMemoryOutputStream os;
  wxJSONWriter writer;
  writer.Write( root, os );
We cannot simply write the wxString::c_str() data to the stream because the string contains a locale-dependent character encoding.
Streams must have UTF-8 encoding, so we have to convert it to UTF-8.
So the writer has to distinguish between two situations:

1. strings that contain the actual locale-dependent character encoding: they need to be converted to UTF-8.
2. strings that contain UTF-8 code units read from a UTF-8 stream: no conversion has to be done.

Another solution could be: "who cares about ANSI / Unicode interoperability? No one." BTW, this would be the fastest and simplest solution.


I am looking for a solution... hints are welcome

Luciano