Unicode Madness (topic is solved)

Natulux
Filthy Rich wx Solver
Posts: 242
Joined: Thu Aug 03, 2017 12:20 pm

Unicode Madness

Post by Natulux »

Hey guys,

at the moment I am stuck with character encodings. I need to be sure that I write a UTF-8 encoded wxString to my HTTPS handler, and in return I get a UTF-8 body. There are many different sources, servers, sockets and a file in play, so I sometimes do not know exactly what I have.

I read in the docs (https://docs.wxwidgets.org/3.0/overview_unicode.html) that under MSW my wchar_t is supposed to be encoded as UCS-2:
under Microsoft Windows, UCS-2 (simplified version of UTF-16 without support for surrogate characters) is used as wchar_t is 2 bytes on this platform
Yet Vadim wrote on another occasion: https://groups.google.com/forum/#!topic ... _NGjVOoow0
Well, we do write in the documentation that the encoding of the strings is
that of the current locale and that it's never UTF-8 under MSW, so I hope
that most people would be aware of it. Most programs also won't have that
many literal strings in the first place...
I guess (but I don't know) that my current locale is ISO-8859-1.
When I use ToUTF8() or the wxConvUTF8 flag on any of my strings, they cannot be displayed without error: "Jürgen" becomes "Jürgen", which to my knowledge means that I used UTF8 encoding on that string twice.

How can I make sure I send and receive UTF8 and what encoding does my wxString have, if I provide none? Can I check that?

Thank you!
Natu
doublemax
Moderator
Posts: 19116
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Unicode Madness

Post by doublemax »

Natulux wrote: "Jürgen" becomes "Jürgen" which to my knowledge means, that I used UTF8 encoding on that string twice.
No. "ü" is the UTF-8 encoded version of "ü".
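To make that concrete, here is a minimal sketch in plain C++ (standard library only, no wxWidgets). The helper names encodeUtf8 and displayAsLatin1 are made up for this illustration. It encodes a code point as UTF-8 and then shows what a viewer that assumes ISO-8859-1 would display, where every byte is taken as one character of its own.

```cpp
#include <string>

// Encode one Unicode code point (up to U+10FFFF) as UTF-8 bytes.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Simulate a viewer that wrongly assumes ISO-8859-1: each byte is taken as
// the code point with the same value. The result is returned as UTF-8 so a
// UTF-8 terminal can print it.
std::string displayAsLatin1(const std::string& bytes)
{
    std::string shown;
    for (unsigned char b : bytes)
        shown += encodeUtf8(b);
    return shown;
}
```

With these helpers, encodeUtf8(0xFC) gives the two bytes C3 BC for "ü", and displayAsLatin1 applied to the UTF-8 bytes of "Jürgen" gives the text "Jürgen". Nothing was encoded twice; the bytes were only read with the wrong encoding.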

I didn't read the whole thread from Google Groups, but I think it's about string literals, which is a different issue.
Natulux wrote: How can I make sure I send and receive UTF8 and what encoding does my wxString have, if I provide none? Can I check that?
On the receiving side, just try to decode the byte array using wxString::FromUTF8. If the resulting string is not empty, there's a very high chance that it was UTF8 encoded.
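If you want to see what such a check involves without wxWidgets, the following is a rough sketch of a UTF-8 well-formedness test in plain C++. The function name isValidUtf8 is hypothetical, and this is not how wxString::FromUTF8 is implemented; it only illustrates why text in another 8-bit encoding rarely passes by accident.

```cpp
#include <cstddef>
#include <string>

// Returns true if the bytes form well-formed UTF-8: correct lead and
// continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.
bool isValidUtf8(const std::string& s)
{
    static const unsigned long minValue[4] = { 0x0, 0x80, 0x800, 0x10000 };
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;   // number of continuation bytes expected
        unsigned long cp = 0;    // code point being assembled
        if (lead < 0x80)                { extra = 0; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;       // stray continuation byte or invalid lead byte

        if (extra > s.size() - i - 1)
            return false;        // sequence is cut off at the end of the data

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (extra > 0 && cp < minValue[extra]) return false;  // overlong form
        if (cp >= 0xD800 && cp <= 0xDFFF)      return false;  // surrogate
        if (cp > 0x10FFFF)                     return false;  // out of range
        i += extra + 1;
    }
    return true;
}
```

The ISO-8859-1 bytes of "Jürgen" (with ü as the single byte FC) fail this test, while the UTF-8 bytes (C3 BC) pass. Pure ASCII passes too, so a positive result only tells you the data is compatible with UTF-8.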

Maybe this post helps:
viewtopic.php?t=19565
Use the source, Luke!

Re: Unicode Madness

Post by Natulux »

doublemax wrote: Tue Jul 09, 2019 11:19 am
"Jürgen" becomes "Jürgen" which to my knowledge means, that I used UTF8 encoding on that string twice.
No. "ü" is the UTF-8 encoded version of "ü".
Yes, it is. But if I encode this UTF8 string again with ToUTF8, I get "Jürgen". And the third time it's "Jürgen". So the "junk" increases every time I encode that multibyte char again. And I wonder whether "Jürgen" is supposed to be UTF8, or whether it is rather a UTF8 string encoded again as UTF8.

Because if it really is UTF8, it is neither readable nor accepted by my web server, which expects UTF8 but can't read "Jürgen".
doublemax wrote: Tue Jul 09, 2019 11:19 am I didn't read the whole thread from Google groups, but i think it's about string literals, which is a different issue.
Ah yes, that would make a whole lot of sense. String literals should use the system locale.
doublemax wrote: Tue Jul 09, 2019 11:19 am
How can I make sure I send and receive UTF8 and what encoding does my wxString have, if I provide none? Can I check that?
On the receiving side, just try to decode the byte array using wxString::FromUTF8. If the resulting string is not empty, there's a very high chance that it was UTF8 encoded.

Maybe this post helps:
viewtopic.php?t=19565:

Code: Select all

wxString str(buf,wxConvUTF8);
client->Write(str.c_str(),str.Length());
this code does not convert your string to utf8. It creates a wxString from a buffer with utf-8 encoded data.

you'd need something like this:

Code: Select all

wxString s(wxT("üöäÜÖÄ"));
wxCharBuffer buf=s.mb_str(wxConvUTF8);
client->Write(buf.data(), strlen(buf.data())+1);
'+1' to include the trailing 0-byte.
So the UTF8 flag in the wxString constructor means READING a UTF8 encoded buffer, while the same flag in .mb_str() means WRITING with that encoding? And wxString::ToUTF8() should fulfill the same purpose, right?
The receiving web server is written in PHP, so I can't pass the data to FromUTF8. It isn't even developed by me :-/
You should look at wxString as a "black box" that stores a string in a unicode-aware way (always assuming using a unicode build of wxwidgets). You should not worry about how wxString stores its data internally.

But when sending strings over a network, you might want to convert it into a "standard" format that any other computer can understand, even if it runs a different operating system on a different cpu. To convert the string to UTF-8 is one way of doing that.

So for sending over a network, you convert your string to utf-8 and on the receiving side, you create a wxString from the utf-8 data.
In my experience, the encoding I use does matter. (See my Jürgen example above, which actually changes the black box buffer.) According to this, when I just pass along a wxString, say as a REST POST body, I would get it formatted with my system locale? And if I send it with mb_str(wxConvUTF8), I can be sure that it is UTF8 as desired? But how come, then, that I can convert a string to UTF8 multiple times and it visually changes?

Best
Natu
ONEEYEMAN
Part Of The Furniture
Posts: 7459
Joined: Sat Apr 16, 2005 7:22 am
Location: USA, Ukraine

Re: Unicode Madness

Post by ONEEYEMAN »

Hi,
So when you send this string "Jürgen", what does your PHP server receive?
Can you print the results (to console/log)?

Thank you.

Re: Unicode Madness

Post by doublemax »

Natulux wrote: So the UTF8 flag in the wxString constructor means READING a UTF8 encoded buffer, while the same flag in .mb_str() means WRITING with that encoding? And wxString::ToUTF8() should fulfill the same purpose, right?
wxConvUTF8 is not a flag. It's a global instance of a wxMBConv which is used for encoding and decoding.
https://docs.wxwidgets.org/trunk/classwx_m_b_conv.html
Natulux wrote: The receiving web server is written in PHP, so I can't pass the data to FromUTF8. It isn't even developed by me :-/
Just encode the wxString to a UTF8 byte buffer, then send those bytes.
Natulux wrote: But how come, then, that I can convert a string to UTF8 multiple times and it visually changes?
That's normal. In UTF8, every character above 127 is encoded as a sequence of 2 or more bytes (each of which is > 127). So if you treat those bytes as characters and UTF8 encode the result again, each of them gets encoded again, and so on...
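A small sketch of that growth, in plain C++ without wxWidgets. The function name encodeAsIfLatin1 is made up for this example. It assumes the common failure mode where encoded bytes are put back into a string as if each byte were one ISO-8859-1 character, and the string is then UTF-8 encoded once more.

```cpp
#include <string>

// Treat every input byte as an ISO-8859-1 character (code point 0..255)
// and UTF-8 encode it. Bytes below 0x80 stay as they are, all others
// become two bytes.
std::string encodeAsIfLatin1(const std::string& in)
{
    std::string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Starting from the 6 ISO-8859-1 bytes of "Jürgen", the first pass yields 7 bytes (the correct UTF-8), the second pass 9 bytes and the third pass 13 bytes. Only the first pass is a real encoding step; the later ones re-encode bytes that were never meant to be characters.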

Re: Unicode Madness

Post by Natulux »

ONEEYEMAN wrote: Tue Jul 09, 2019 2:16 pm Hi,
So when you send this string "Jürgen", what does your PHP server receive?
Can you print the results (to console/log)?

Thank you.
I can ask the developer to test that with me and log it. From previous tests I guess that the server actually and literally gets "Jürgen" if I send "Jürgen". I guess the better way to go about this is to use UTF8 hex instead? When sending this with JavaScript (Quasar) to the same PHP server (which works fine), the axios plugin sends the so-called unicode chars as UTF8 encoded hex: "J%C3%BCrgen" (IIRC)
I don't know how to do that with wxWidgets though.
doublemax wrote: Tue Jul 09, 2019 2:19 pmwxConvUTF8 is not a flag. It's a global instance of a wxMBConv which is used for encoding and decoding.
https://docs.wxwidgets.org/trunk/classwx_m_b_conv.html
OK, I think I understand how to convert the encoding. I am still confused, though, about why I cannot display a UTF8 string. Are wxLogMessage and Google Chrome unable to display UTF8 encoded text? I rather thought that when I see "Jürgen", it is an error.
doublemax wrote: Tue Jul 09, 2019 2:19 pmJust encode the wxString to a UTF8 byte buffer, then send those bytes.
Thanks, I will try that!
doublemax wrote: Tue Jul 09, 2019 2:19 pm
But how come than, that I can convert a string multiple times to UTF8 and it visually changes?
That's normal. In UTF8, every character above 127 is encoded as a sequence of 2 or more bytes (each of which is > 127). So if you treat those bytes as characters and UTF8 encode the result again, each of them gets encoded again, and so on...
Every byte (every 2 or 4 bit) is/are encoded, and UTF8 multibyte chars are therefore partly encoded again, and that stretches the string. Or something like that. ;-)
But I meant in regard to your black box example: If I really could see a wxString as a black box and not care about its encoding, I should be able to UTF8 encode it at one point and (maybe in another function) encode that string again, without creating a malfunctioning string.

So the way to handle this is: Don't care about the internal encoding throughout the app. Use UTF8 encoding when sending data to servers/files and decode when receiving data from servers/files (if UTF8 is expected, of course). Did I get that right?

Thank you for your help! :-)
Best
Natu

[EDIT] I can use (const char *) as byte buffer variable, am I right? Example:

Code: Select all

const char *utf8byteBuffer = SourceString.mb_str(wxConvUTF8);

Re: Unicode Madness

Post by doublemax »

But I meant in regard to your BlackBox example: If I really could see a wxString as a blackbox and dont care about its encoding, I should be able to UTF8 encode it at one point and (maybe in another function) encode that string again, without creating a malfunctioning string.
UTF8 encoding a string does not modify the string itself, it's always written into another buffer.
I can use (const char *) as byte buffer variable, am I right? Example:

Code: Select all

const char *utf8byteBuffer = SourceString.mb_str(wxConvUTF8);
Be careful with that. mb_str and related functions return a temporary buffer which gets automatically destroyed quickly.

https://wiki.wxwidgets.org/Converting_e ... to_char.2A
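The same lifetime trap exists with any function that returns an owning buffer by value, so it can be sketched with std::string instead of wxCharBuffer. In this hedged example, toUtf8Bytes and sendBytes are hypothetical stand-ins (for a conversion like mb_str()/ToUTF8() and for a network write); they are not wxWidgets functions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a conversion that returns an owning buffer by value.
std::string toUtf8Bytes()
{
    return "J\xC3\xBCrgen";  // UTF-8 bytes of "Jürgen"
}

// Hypothetical stand-in for a network write; it only reports the byte count.
std::size_t sendBytes(const char* data, std::size_t len)
{
    return data ? len : 0;
}

std::size_t sendSafely()
{
    // Risky pattern (do not use): the temporary buffer is destroyed at the
    // end of the statement, so the pointer dangles afterwards.
    //   const char* p = toUtf8Bytes().c_str();

    // Safe pattern: keep the owning object alive for as long as its
    // pointer is in use.
    const std::string owned = toUtf8Bytes();
    return sendBytes(owned.data(), owned.size());
}
```

In wxWidgets terms this corresponds to storing the result in a named wxCharBuffer variable first and only then taking data() from it.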

Re: Unicode Madness

Post by Natulux »

doublemax wrote: Wed Jul 10, 2019 10:31 amUTF8 encoding a string does not modify the string itself, it's always written into another buffer.
So basically, if I were to give my data to a wxString constructor every time, this wouldn't happen?

Code: Select all

wxString string1("SomeUTFString", wxConvUTF8);
wxString string2(String1.ToUTF8()); //Still UTF8 String (?)
string2 = wxString(string2, wxConvUTF8); //Still UTF8 String (?)
doublemax wrote: Wed Jul 10, 2019 10:31 amBe careful with that. mb_str and related functions return a temporary buffer which gets automatically destroyed quickly.
This would help?

Code: Select all

wxString utf8ByteBuffer((const char *)SourceString.mb_str(wxConvUTF8));
Something different (src: https://www.utf8-zeichentabelle.de/unic ... 8-table.pl):

From my wxString("Jürgen", wxConvUTF8) == "Jürgen" example:
From the UTF8 table:

' ü ' has the unicode codeposition: U+C3BC
' Ã ' has the unicode codeposition: U+00C3
' ¼ ' has the unicode codeposition: U+00BC

So obviously, if the unicode code positions are abbreviated to ' C3 ' and ' BC ', then a parser could mistake the one char C3BC for the two chars C3 and BC.
That is exactly what is happening to me. The PHP server and Google Chrome read my UTF8 char, encoded by wxWidgets, as two chars. The UTF8 encoding did work all the time (using the unicode code position instead of the UTF8 hex representation); it is just that the way the encoding is written and the way it is read again don't match.
I just need to find a way to make them recognize the format. Maybe URL-encode the UTF8 string?

Best
Natu

Re: Unicode Madness

Post by doublemax »

Code: Select all

wxString string1("SomeUTFString", wxConvUTF8);
This creates a wxString from UTF8 data.

Code: Select all

wxString string2(String1.ToUTF8()); //Still UTF8 String (?)
string2 = wxString(string2, wxConvUTF8); //Still UTF8 String (?)
These don't make sense at all.

Code: Select all

wxString utf8ByteBuffer((const char *)SourceString.mb_str(wxConvUTF8));
No. You should never store UTF8 encoded data in a wxString.

Store it in a wxCharBuffer and then process the data on a byte-level.

Code: Select all

wxString s( wxT("some string äöüÄÖÜ") );
wxCharBuffer buffer = s.ToUTF8();
bar( buffer.data(), strlen(buffer.data()) ); 

Re: Unicode Madness

Post by Natulux »

Natulux wrote: Wed Jul 10, 2019 1:00 pm Something different (src: https://www.utf8-zeichentabelle.de/unic ... 8-table.pl):

From my wxString("Jürgen", wxConvUTF8) == "Jürgen" example:
From the UTF8 table:

' ü ' has the unicode codeposition: U+C3BC
' Ã ' has the unicode codeposition: U+00C3
' ¼ ' has the unicode codeposition: U+00BC

So obviously, if the unicode code positions are abbreviated to ' C3 ' and ' BC ', then a parser could mistake the one char C3BC for the two chars C3 and BC.
That is exactly what is happening to me. The PHP server and Google Chrome read my UTF8 char, encoded by wxWidgets, as two chars. The UTF8 encoding did work all the time (using the unicode code position instead of the UTF8 hex representation); it is just that the way the encoding is written and the way it is read again don't match.
I just need to find a way to make them recognize the format. Maybe URL-encode the UTF8 string?
I was partly wrong.
' ü ' is in UTF8 HEX: C3 BC
And read as unicode characters, they are ' Ã ' and ' ¼ '
The wxWidgets documentation states that a wxString may hold unicode encoded characters. That is why the UTF8 hex string is interpreted as two unicode characters, and this behaviour is actually quite logical.
(Same with e.g. Förster (Förster):
' ö ' = UTF8 hex: ' C3 B6 ', which translates to ' Ã ' and ' ¶ ' in unicode)
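The arithmetic behind these byte pairs can be written down as a small sketch in plain C++ (the function name decodeTwoByteUtf8 is made up for this example). A two-byte UTF-8 sequence has the bit layout 110xxxxx 10xxxxxx, and the x bits together form the code point.

```cpp
// Decode one two-byte UTF-8 sequence (lead byte 110xxxxx, continuation
// byte 10xxxxxx). Returns U+FFFD (the replacement character) if the pair
// is not a well-formed two-byte sequence.
char32_t decodeTwoByteUtf8(unsigned char lead, unsigned char cont)
{
    if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80)
        return 0xFFFD;
    // Five payload bits from the lead byte, six from the continuation byte.
    const char32_t cp = static_cast<char32_t>(((lead & 0x1F) << 6) | (cont & 0x3F));
    return cp < 0x80 ? 0xFFFD : cp;  // reject overlong encodings of ASCII
}
```

For C3 BC this gives (0x03 << 6) | 0x3C = 0xFC, which is U+00FC (ü), and for C3 B6 it gives 0xF6, which is U+00F6 (ö). Reading the same two bytes as two separate ISO-8859-1 characters gives U+00C3 and U+00BC (or U+00B6) instead.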

This means: ToUTF8() as well as mb_str(wxConvUTF8) work as expected. But the result is formatted in a way that the PHP server parses as unicode, even though it is supposed to expect UTF8.
I found that I can use URL encoding instead to get the values uninterpreted by (e.g.) wxLogMessage.

Code: Select all

wxURI urlEnc (sUsername);
wxString encoded_url = urlEnc.BuildURI();
This wouldn't take multi-line strings though, and the body must not contain special characters (like in this thread: viewtopic.php?t=9233)

I wonder what the literal difference between a URL-encoded unicode char and a UTF8 encoded unicode char is?
Is there a way to make the result of mb_str(wxConvUTF8) visible without interpreting it?
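One way to look at encoded data without any viewer interpreting it is to print the bytes as hex. A minimal sketch in plain C++ follows; hexDump is a hypothetical helper, not a wxWidgets function. In a wxWidgets program the input would presumably be the data() of a wxCharBuffer, copied into a std::string or passed along with its length.

```cpp
#include <cstdio>
#include <string>

// Render every byte as two uppercase hex digits, separated by spaces.
std::string hexDump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (unsigned char b : bytes) {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(b));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

For the UTF-8 bytes of "Jürgen" this produces "4A C3 BC 72 67 65 6E". The result is plain ASCII, so no logger or browser can misread it, and you can see directly whether ü went out as C3 BC (UTF-8) or as FC (ISO-8859-1).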
doublemax wrote: Wed Jul 10, 2019 1:21 pm

Code: Select all

wxString utf8ByteBuffer((const char *)SourceString.mb_str(wxConvUTF8));
No. You should never store UTF8 encoded data in a wxString.

Store it in a wxCharBuffer and then process the data on a byte-level.

Code: Select all

wxString s( wxT("some string äöüÄÖÜ") );
wxCharBuffer buffer = s.ToUTF8();
bar( buffer.data(), strlen(buffer.data()) ); 
I keep that in mind, thanks!

Re: Unicode Madness

Post by doublemax »

What exactly are you doing with the UTF8 encoded data, and how do you process it on the receiving side?

Re: Unicode Madness

Post by Natulux »

doublemax wrote: Thu Jul 11, 2019 9:24 am What exactly are you doing with the UTF8 encoded data, and how do you process it on the receiving side?
I use the Chilkat CkHttp class to send an HTTPS POST to a PHP REST server. The body of my POST holds the data, which is a JSON string. Example:

Code: Select all

{
	"username":"Jürgen"
}
The text data encoding is supposed to be UTF8.
Now, I have two projects querying the same server with the same request url.

The first is Quasar (js framework) with Axios plugin, sending this request with a body like this:
username=J%C3%BCrgen&password=1234
This works.

The second is my wxWidgets application, with which I try to send something similar.
This time I struggle with UTF8, though (btw: the JSON format is fine. If I only send ASCII letters, it works too.)
At the moment it looks like this:
{
"username":"Jürgen",
"password":"1234"
}
As I posted before, ' Jürgen ' is actually UTF8, perceived as unicode encoding, and I don't know why. Especially because I don't know which format this literally is before wxLogMessage tries to display it for me...

I need to get my (raw) format like in axios: ' J%C3%BCrgen '
evstevemd
Part Of The Furniture
Posts: 2409
Joined: Wed Jan 28, 2009 11:57 am
Location: United Republic of Tanzania

Re: Unicode Madness

Post by evstevemd »

Natulux wrote: Thu Jul 11, 2019 12:10 pm I use the Chilkat CkHttp class to send an HTTPS POST to a PHP REST server. The body of my POST holds the data, which is a JSON string. Example:

Code: Select all

{
	"username":"Jürgen"
}
What happens if you send using wxHttp and libcurl?

Re: Unicode Madness

Post by doublemax »

So the only thing missing is the URL encoding?

Code: Select all

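// Percent-encodes a zero-terminated byte string (e.g. UTF-8 data)
// for use in a URL or a form-encoded body.
wxString urlHexEncode( const char *in )
{
  wxString out;
  out.Alloc( wxStrlen(in) * 2 );  // rough guess to reduce reallocations

  char c;
  while( (c = *in++) != 0 )
  {
    // Unreserved characters are copied unchanged.
    if( (c >= '0' && c <= '9') || ( c >= 'A' && c <= 'Z' ) || ( c >= 'a' && c <= 'z' )
      || c == '-' || c == '.' || c == '_' || c == '~' )
    {
      out += c;
    }
    else if( c == ' ' )
      out += '+';  // form encoding writes a space as '+'
    else
      // "%%" prints a literal '%', "%02X" prints the byte as two uppercase hex digits.
      out += wxString::Format( wxT("%%%02X"), (unsigned char)c );
  }
  return out;
}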

Code: Select all

wxString test( wxT("Jürgen") );
wxString encoded = urlHexEncode( test.ToUTF8() );
wxLogMessage( wxT("%s"), encoded );
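For comparison, here is a hedged standard-library variant of the same idea that can be compiled and checked without wxWidgets. The name percentEncode is made up for this example. It takes the byte count from the std::string instead of stopping at a zero byte, and it assumes the input already holds UTF-8 bytes.

```cpp
#include <cstdio>
#include <string>

// Percent-encode UTF-8 bytes: unreserved characters are kept, a space
// becomes '+' (form encoding), and every other byte becomes %XX.
std::string percentEncode(const std::string& utf8)
{
    std::string out;
    char buf[4];  // room for '%', two hex digits and the terminating zero
    for (unsigned char c : utf8) {
        const bool unreserved =
            (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
            (c >= 'a' && c <= 'z') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

The UTF-8 bytes of "Jürgen" come out as "J%C3%BCrgen", the same form that axios sends. Line breaks and JSON punctuation are escaped as well, for example a newline becomes %0A.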

Re: Unicode Madness

Post by Natulux »

evstevemd wrote: Thu Jul 11, 2019 12:27 pmWhat happens if you send using wxHttp and libcurl?
IIRC, wxHTTP is not able to send https, so unfortunately this is no option for me.
doublemax wrote: Thu Jul 11, 2019 12:44 pm So the only thing missing is the URL encoding?

Thanks for that. Even though I don't understand why the formatting in your function works, it does produce the right string format. :-)
Unfortunately, if I use it on a wxJSONValue written to a (multi-line) wxString, it crashes the app. Maybe this doesn't work with a string inside a string, as the JSON format needs it?

But in either case: As it turns out, I can actually send my UTF8 data as is. The PHP server needs to actively utf8_decode my body, which it does not need to do for the data from other sources, which I guess use unicode code points (not sure if escaped or unescaped though).
But it is working with my unescaped UTF8 for now, and I am satisfied with that.

Thank you very much, I learned a lot about character encoding in this process. (Even though I would still like to know which raw format the unescaped UTF8 string that wxWidgets produces has ;-) )

All the best
Natu