Unicode Madness Topic is solved

If you are using the main C++ distribution of wxWidgets, feel free to ask any question related to wxWidgets development here. This means questions regarding C++ and wxWidgets, not compile problems.
doublemax
Moderator
Posts: 13988
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Unicode Madness

Post by doublemax » Fri Jul 12, 2019 9:48 am

Natulux wrote:
If I use it on a wxJSONValue, written to wxString (multiline) it crashes the app unfortunately. Maybe this doesn't work with String in String, as the json format needs it?
I'd need more information to be sure, but the resulting string contains '%' characters. If you pass it to printf-like functions, these will be interpreted as format specifiers. In a debug build, you should get an assert in that case, not a plain crash.
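To illustrate the point about '%' characters, here is a minimal sketch in plain C++ rather than wxWidgets. The helper name `format_as_data` and the sample input are made up for illustration. The idea is that text which may contain '%' should be passed as an argument to a fixed "%s" format, never as the format string itself.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: sends arbitrary text through a printf-style call.
// The text is passed as an ARGUMENT to a fixed "%s" format, so any '%'
// inside it is copied verbatim instead of being parsed as a specifier.
std::string format_as_data(const std::string& text) {
    // The first call only measures the required length.
    const int len = std::snprintf(nullptr, 0, "%s", text.c_str());
    if (len < 0) return std::string();
    std::string out(static_cast<std::size_t>(len) + 1, '\0');
    // The second call writes the text plus the terminating NUL.
    std::snprintf(&out[0], out.size(), "%s", text.c_str());
    out.resize(static_cast<std::size_t>(len));  // drop the trailing NUL
    return out;
}
```

The risky variant would be something like `std::printf(text.c_str())`, where a percent-encoded string such as "%C4%9F" is parsed as a series of format specifiers. The same reasoning should apply to wxString::Format() and the wxLog functions, which also take a format string as their first argument.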
Natulux wrote:
Even though I would still like to know which raw format the unescaped UTF-8 string that wxWidgets produces has.
I'm not sure what you mean; there is nothing special about wxWidgets' UTF-8 encoding.
Use the source, Luke!

ONEEYEMAN
Part Of The Furniture
Posts: 3404
Joined: Sat Apr 16, 2005 7:22 am
Location: USA, Ukraine

Re: Unicode Madness

Post by ONEEYEMAN » Fri Jul 12, 2019 2:05 pm

Hi,
Natulux wrote:
Fri Jul 12, 2019 9:11 am
evstevemd wrote:
Thu Jul 11, 2019 12:27 pm
What happens if you send using wxHttp and libcurl?
IIRC, wxHTTP is not able to send https, so unfortunately this is no option for me.
What about {wx}cURL?

evstevemd
Part Of The Furniture
Posts: 2252
Joined: Wed Jan 28, 2009 11:57 am
Location: United Republic of Tanzania

Re: Unicode Madness

Post by evstevemd » Sat Jul 13, 2019 5:26 am

Natulux wrote:
Fri Jul 12, 2019 9:11 am
IIRC, wxHTTP is not able to send https, so unfortunately this is no option for me.
You are right. There was a PR to add https support, but it was never merged and I don't know why.
Natulux wrote:
Fri Jul 12, 2019 9:11 am
Thanks for that. Even though I don't understand why the formatting of your function works, this does produce the right string format. :-)
If I use it on a wxJSONValue, written to wxString (multiline) it crashes the app unfortunately. Maybe this doesn't work with String in String, as the json format needs it?
I don't know about wxJSONValue, but if you think it is a library issue, try wxSimpleJSON.
Chief Justice: We have trouble dear citizens!
Citizens: What it is his honor?
Chief Justice:Our president is an atheist, who will he swear to?
[Ubuntu 15.04/Windows 10 Pro - GCC/MinGW, CodeLite IDE et al]

Natulux
I live to help wx-kind
Posts: 188
Joined: Thu Aug 03, 2017 12:20 pm

Re: Unicode Madness

Post by Natulux » Tue Aug 06, 2019 1:31 pm

Ahh, sorry guys. When a thread runs onto a second page, I often miss that there were new answers. I didn't want to ignore you. Please let me revive this thread.

My last message said that I can send UTF-8 as is, which was only right for Unicode letters from U+0000 to U+00FF. I tested the Turkish letter "ğ", which is U+011F, and the PHP server wasn't able to read it.

And at this point I still don't understand. I can send "ö" as UTF8 "ö" to the server, the server can use for example utf8_decode() to decode it to ISO-8859-1 and then display it as "ö".
However, if I send "ğ" as UTF8 "ÄŸ" to the server, it can't translate it to ISO-8859-1, because U+011F is outside of U+00FF (not in the table). How is the server supposed to get a readable "ğ" (as wxWidgets and this forum can)? Which format is it supposed to be? A plain Unicode code point?
Or is the readable "ğ" just "unescaped utf8" and "ÄŸ" is "escaped utf8", and we need to find something to convert the escaping, but not the encoding?
The point is, it's not just for displaying. A character like this in a password causes a mismatch between its two representations...
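As a rough illustration of why "ö" survives the trip to ISO-8859-1 and "ğ" does not, here is a minimal sketch in plain C++ of a UTF-8 to ISO-8859-1 conversion. The function name `utf8_to_latin1` is made up, and the '?' substitution is an assumption modelled on how PHP's utf8_decode() is commonly described; this is not the server's actual code.

```cpp
#include <string>

// Converts UTF-8 to ISO-8859-1. Code points above U+00FF have no Latin-1
// byte and are replaced by '?'. Truncated or stray bytes are also replaced
// by '?'; overlong forms are not checked (sketch only).
std::string utf8_to_latin1(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char b = static_cast<unsigned char>(utf8[i]);
        if (b < 0x80) { out += static_cast<char>(b); ++i; continue; }

        // The lead byte tells how many bytes the sequence has.
        std::size_t len = 0;
        unsigned long cp = 0;
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else { out += '?'; ++i; continue; }  // stray continuation byte

        if (i + len > utf8.size()) { out += '?'; break; }  // truncated input

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        if (!ok) { out += '?'; ++i; continue; }

        out += (cp <= 0xFF) ? static_cast<char>(cp) : '?';
        i += len;
    }
    return out;
}
```

With this sketch, the bytes 0xC3 0xB6 ("ö") become the single Latin-1 byte 0xF6, while 0xC4 0x9F ("ğ") can only become '?'. The information is lost at the conversion to ISO-8859-1, not in the UTF-8 that was sent.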


A second point:
The Chilkat method I use to send my https request takes a "const char *" byte array as body input. But my input is a wide-character string. Examples:

Code:

wxString s("\u00F6");
wxMessageBox("s: " + s); //out: s: ö	- is const char *
wxString s1("\u011F");
wxMessageBox("s1: " + s1); //out: s1: ?	- is no const char *
wxString s2(_T("\u011F"));
wxMessageBox("s2: " + s2); //out: s2: ğ	- is const wchar_t *
I read my data from a UTF-8 formatted file. How can I convert that UTF-8 data to a "const char *" byte array without losing the wide chars?
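A likely explanation for the three results above, sketched in plain C++ without wxWidgets: a narrow literal like "\u011F" is converted by the compiler to its narrow execution character set, and if that set has no "ğ" (for example a Windows-1252 build, which is an assumption about the setup here) the character is typically replaced. A wide literal keeps the code point. The function names below are made up for illustration.

```cpp
#include <string>

// Wide literal: the compiler stores the code point U+011F as one wchar_t,
// regardless of the narrow execution character set.
std::wstring wide_g() {
    return L"\u011F";
}

// Narrow string holding the UTF-8 encoding of U+011F. The two bytes are
// spelled out with hex escapes so the result does not depend on how the
// compiler would translate a "\u011F" narrow literal.
std::string utf8_g() {
    return "\xC4\x9F";
}
```

So a "const char *" can carry "ğ" without loss as long as both sides agree that the bytes are UTF-8; the wide form is only one of several possible representations of the same character.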


OK, now to your points:
doublemax wrote:I'd need more information to be sure, but the resulting string contains '%' characters. If you pass it to printf-like functions, these will be interpreted as format specifiers. In a debug build, you should get an assert in that case, not a plain crash.
Tbh, for this project I develop in release mode. I do not have a wxWidgets292 debug build and thought I wouldn't need it. A printf-like function causes a crash in this case. But that's my fault.
I guess if we can clarify my problems above, I would use and modify your function to generate the format the server needs.
doublemax wrote:I'm not sure what you mean; there is nothing special about wxWidgets' UTF-8 encoding.
I was referring to the escaped or unescaped UTF-8 formats, which can be found on the net, but I couldn't find a good explanation for them.
At the moment I think I understand: if I can read UTF8, it is "unescaped utf8", and if it has a representation like "ÄŸ" it is "escaped utf8".
So internally, without the effort to try to make it readable, unescaped utf8 is the real character, while escaped utf8 is the Unicode code point (like in C++: \u011F).

Is that about right?
ONEEYEMAN wrote:
Fri Jul 12, 2019 2:05 pm
What about {wx}cURL?
I sometimes used curl.exe as a console test environment, but again, if I use curl.exe with https it tells me:

Code:

curl: (1) Protocol "https" not supported or disabled in libcurl
Does wxCurl support https? I couldn't find any information at a quick glance.

evstevemd wrote:I don't know about wxJSONValue, but if you think it is a library issue, try wxSimpleJSON.
I might have a look at it. But I think it is the JSON format per se.
A simple JSON document would be

Code:

{ "key" : "value" }
If I make a string out of that, it automatically generates escape characters:

Code:

 "{ \"key\" : \"value\" }" 
I just thought that might cause a problem here.
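If the backslashes are only what the C++ source code (or a debugger view) shows, they are not part of the string itself. A small sketch in plain C++, with made-up function names, compares the escaped literal with a raw string literal holding the same JSON text:

```cpp
#include <string>

// Ordinary literal: \" is source-level escaping. The compiler stores a
// plain double quote, and no backslash ends up in the string.
std::string json_escaped_literal() {
    return "{ \"key\" : \"value\" }";
}

// Raw string literal (C++11): the same text written without escaping.
std::string json_raw_literal() {
    return R"({ "key" : "value" })";
}
```

Both functions return the same 19 characters, so under that assumption the escaping alone would not explain a crash. It would be different if a library had really inserted backslash characters into the data.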

Enough text. Thanks!
Natu

doublemax

Re: Unicode Madness

Post by doublemax » Tue Aug 06, 2019 3:49 pm

There is no such thing as "escaped" or "unescaped" UTF8.

"ğ" is Unicode 0x011F
0x011F converted into UTF8 encoding is 0xC4 0x9F.

There is no ambiguity anywhere.
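The mapping from U+011F to 0xC4 0x9F can be reproduced with a few bit operations. The following is a minimal sketch in plain C++ of the standard UTF-8 encoding rules; the function name `encode_utf8` is made up, and this is not how wxWidgets implements it internally.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode code point as UTF-8 bytes.
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are invalid
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+011F the upper bits give 0xC0 | 0x04 = 0xC4 and the lower six bits give 0x80 | 0x1F = 0x9F. "ÄŸ" is simply what those two bytes look like when a viewer treats them as a single-byte Windows code page instead of UTF-8.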
Natulux wrote:
However, if I send "ğ" as UTF8 "ÄŸ" to the server, it can't translate it to ISO-8859-1,
That's normal. Why would you try to interpret it as ISO-8859-1, if you know it's UTF-8?
Natulux wrote:
I read my data from a UTF8 formatted file. How can I convert that utf8 data to a "const char *" byte array without losing the wide chars?
If you have already decoded the data into a wxString, then just pass string.ToUTF8() to the method expecting a const char*.

If you still have the "raw" undecoded UTF8 data, pass it directly.

Natulux

Re: Unicode Madness

Post by Natulux » Wed Aug 07, 2019 6:37 am

doublemax wrote:
Tue Aug 06, 2019 3:49 pm
There is no such thing as "escaped" or "unescaped" UTF8.

"ğ" is Unicode 0x011F
0x011F converted into UTF8 encoding is 0xC4 0x9F.

There is no ambiguity anywhere.
I have those terms from PHP. I guess that flag is just badly named and means: utf8 or unicode code points.
doublemax wrote:
Tue Aug 06, 2019 3:49 pm
However, if I send "ğ" as UTF8 "ÄŸ" to the server, it can't translate it to ISO-8859-1,
That's normal. Why would you try to interpret it as ISO-8859-1, if you know it's UTF-8?
I just meant: ISO-8859-1 is my local encoding and can be displayed on my machine. To display any UTF-8, I would normally use wxString::FromUTF8() and I would get my local encoding, right? I can't display a UTF-8 string directly ("ğ" would be "ÄŸ").
However, I can display "ğ", even though it can't be decoded to my local encoding. I just need to use Unicode. So a string would be decoded into my local encoding, except for Unicode characters outside it, which must remain Unicode (or be converted to Unicode).

Btw: I can display unicode only with the macro _T().

Code:

wxMessageBox("\u011f"); //doesn't work
wxMessageBox(_T("\u011f")); //works
If I read unicode into a wxString, how can I display that? Because _T(wxStringVar) doesn't work...
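The reason `_T(wxStringVar)` cannot work is that `_T()` is a preprocessor macro that acts on string literals at compile time, not a conversion function. Below is a simplified stand-in in plain C++; `MY_T` is a made-up name, and the real wxWidgets definition depends on the build configuration.

```cpp
#include <string>

// Simplified stand-in for _T()/wxT() in a wide-character build (assumption):
// the macro pastes an L prefix onto the literal, nothing more.
#define MY_T(x) L##x

// MY_T("\u011f") expands to L"\u011f", i.e. a wide string literal.
std::wstring wide_literal() {
    return MY_T("\u011f");
}

// MY_T(someVariable) would expand to LsomeVariable, an unrelated and
// undeclared identifier, so it does not compile.
```

Text that arrives at run time is decoded once from its bytes instead, for example with wxString::FromUTF8() as mentioned earlier in the thread, and the resulting wxString can then be displayed without any macro.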

doublemax wrote:
Tue Aug 06, 2019 3:49 pm
I read my data from a UTF8 formatted file. How can I convert that utf8 data to a "const char *" byte array without losing the wide chars?
If you have already decoded the data into a wxString, then just pass string.ToUTF8() to the method expecting a const char*.

If you still have the "raw" undecoded UTF8 data, pass it directly.
I see. Then I can't use the https class the way I do, because it doesn't let me send the data without converting it. It takes the byte array, which is expected to represent a string, and converts it to the given charset (I gave it "utf-8"). The result is a body encoded to UTF-8 twice.
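What "encoded to UTF-8 twice" looks like at the byte level can be shown with a small sketch in plain C++. It assumes the second conversion treats each incoming byte as one ISO-8859-1 character, which is a guess about what happens inside the library, and the function name `latin1_to_utf8` is made up.

```cpp
#include <string>

// Treats every input byte as one ISO-8859-1 character and encodes it as
// UTF-8. Applied to text that is ALREADY UTF-8, this double-encodes it.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (const char ch : latin1) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            out += static_cast<char>(b);                 // ASCII stays as is
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));   // lead byte: 0xC2 or 0xC3
            out += static_cast<char>(0x80 | (b & 0x3F)); // continuation byte
        }
    }
    return out;
}
```

Feeding the correct UTF-8 bytes for "ğ" (0xC4 0x9F) through this function yields 0xC3 0x84 0xC2 0x9F. A server that decodes UTF-8 once then sees two separate characters instead of "ğ", which matches the mismatch described above.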

Cheers
Natu

doublemax

Re: Unicode Madness

Post by doublemax » Wed Aug 07, 2019 7:20 am

Natulux wrote:
I just meant: ISO-8859-1 is my local encoding and can be displayed on my machine. To display any UTF-8, I would normally use wxString::FromUTF8() and I would get my local encoding, right?
Not exactly. wxString stores Unicode characters internally and they can be displayed inside wxWidgets directly. No further conversion needed.
Natulux wrote:
I see. Then I can't use the https class the way I do because it doesn't allow me to not convert the data sent.
You mentioned you used Chilkat? Then this might help:
http://www.chilkatsoft.com/refdoc/vcCkH ... ml#prop117

Natulux

Re: Unicode Madness

Post by Natulux » Wed Aug 07, 2019 9:08 am

doublemax wrote:
Wed Aug 07, 2019 7:20 am
I just meant: ISO-8859-1 is my local encoding and can be displayed on my machine. To display any UTF-8, I would normally use wxString::FromUTF8() and I would get my local encoding, right?
Not exactly. wxString stores Unicode characters internally and they can be displayed inside wxWidgets directly. No further conversion needed.
I already noticed that wxString is quite strong in comparison to several other string classes. But every class that is smarter than you are is just a good thing, until something doesn't work as expected! :D
(Because then you need to do the tedious knowledge hunting about how things work and that kind of technical knowledge is quite rare)
doublemax wrote:
Wed Aug 07, 2019 7:20 am
I see. Then I can't use the https class the way I do because it doesn't allow me to not convert the data sent.
You mentioned you used Chilkat? Then this might help:
http://www.chilkatsoft.com/refdoc/vcCkH ... ml#prop117
Oh, you know Chilkat? You never cease to surprise me. ;-)
I didn't know about that parameter. That might actually solve my problem; I was solely using the conversion functions, which did the same for me.

This finally works:

Send request:

Code:

CkHttpRequest req;
CkByteData byteData;
byteData.appendStr(sSubmit.mb_str(wxConvUTF8));
req.LoadBodyFromBytes(byteData);
//I used LoadBodyFromString(const char *bodyStr, const char *charset) before and that wouldn't work with UTF8
Read response:

Code:

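CkString sckBody;
m_ckHttpResponse->get_BodyStr(sckBody);
// getStringUtf8() returns the body as UTF-8 bytes, so decode them explicitly
// instead of going through the locale-dependent wxString(const char*) first.
const char *utf8Body = sckBody.getStringUtf8();
wxString sBody = wxString::FromUTF8(utf8Body);
if(sBody.empty() && utf8Body && *utf8Body)
{
	// FromUTF8() returns an empty string for invalid UTF-8: fall back to the default conversion.
	sBody = wxString(utf8Body);
}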
Well, thank you for your time, doublemax!
Cheers
Natu

doublemax

Re: Unicode Madness

Post by doublemax » Wed Aug 07, 2019 10:41 am

Natulux wrote:
Oh, you know Chilkat? You never cease to surprise me.
I bought and use their Email component. Saved me so much time. Worth every cent.

Natulux

Re: Unicode Madness

Post by Natulux » Wed Aug 07, 2019 12:13 pm

doublemax wrote:
Wed Aug 07, 2019 10:41 am
Oh, you know Chilkat? You never cease to surprise me.
I bought and use their Email component. Saved me so much time. Worth every cent.
I have the feeling that Chilkat is able to solve everything, if you manage to piece together the right examples and classes for your use case. I had some serious head scratching with Chilkat, but when it works, it works really well.

I'll keep your recommendation about Chilkat's email component in the back of my head ;-)
