wxWidgets unsupported charset - what to do?
Some time ago (viewtopic.php?f=1&t=44782) I ran into a character set not supported by wxWidgets' built-in facilities.
At the time, I never resolved the issue because it seemed to arise infrequently enough that I could simply ignore it.
Meanwhile, the same problem has come up again several times, albeit with different character sets: GBK & ISO-2022-JP.
Since wxWidgets does not seem to support these char sets directly, I have looked at std::string support for char set conversion, and at libiconv & ICU, but I am at a loss as to how best to proceed. There seem to be issues with either option.
This time around, I would very much like to resolve the issue properly, but given how little I know about fonts, character set encodings and character set support, I would dearly love some help and pointers on how best to proceed.
Environment: Win 10/11 64-bit & Mint 21.1
MSVC Express 2019/2022
wxWidgets 3.2.2
Re: wxWidgets unsupported charset - what to do?
"Since wxWidgets does not seem to support these char sets directly,"
Looking into the wx sources, ISO-2022-JP seems to be supported, and Wikipedia lists CP936 (which is also supported) as an alias for GBK.
I don't know how to integrate ICU; I looked into it a few times and always decided it was not worth the effort.
Can you explain in more detail where the texts come from, how do you know their encoding, in which byte format you receive them and what you need to do with them?
Use the source, Luke!
Re: wxWidgets unsupported charset - what to do?
doublemax wrote: Can you explain in more detail where the texts come from, how do you know their encoding, in which byte format you receive them and what you need to do with them?
The data comes directly from the MIME-encoded e-mail source, where I get strings such as L"=?GBK?B?vfDN/g==?=" or L"=?ISO-2022-JP?Q?=1B$B6b0R=1B=28J_via_curl-library?=", which I want/need to translate to displayable strings.
For either of these 2 examples,
Code: Select all
wxFontEncoding text_encoding = wxFontMapper::Get()->CharsetToEncoding( ar_wsCharSet );
wxFontEncoding system_encoding = wxLocale::GetSystemEncoding();
text_encoding = wxFONTENCODING_CP936 or wxFONTENCODING_ISO2022_JP (0x00000059), respectively
system_encoding = wxFONTENCODING_CP1252 (0x00000021) for either one
The code at this point:
Code: Select all
if ( text_encoding == wxFONTENCODING_UTF8 )
{
    wsContent = wxString::FromUTF8(AfterBOM(ar_wsInStr.mb_str(wxConvLocal)));
}
else if( text_encoding == wxFONTENCODING_DEFAULT )
{
    wsContent = ar_wsInStr;
}
else if (system_encoding != text_encoding)
{
    wxEncodingConverter converter;
    bool can_convert = converter.Init(text_encoding, system_encoding);
    if (can_convert)
    {
        wsContent = converter.Convert(ar_wsInStr);
    }
    else
    {
        /* What can we do ?? */
        // at least log the error return the raw bytes
        wxLogError( _("Can't decode charset '%s' ."), ar_wsCharSet );
        ...
    }
The decoding is done by
Code: Select all
/* Perform Q decoding */
{
    /* Check if we have pattern */
    wxRegEx pattern(
        _T("=\\077([[:alnum:].\\055_]+)\\077[qQ]\\077([^\\077]*)\\077="),
        wxRE_ADVANCED);
    while ((pattern.Matches(decoded_str)) &&
           (pattern.GetMatchCount() == 3))
    {
        /* Extract the encoded string */
        wxString str_content = pattern.GetMatch(decoded_str, 2);
        /* Replace all underscores with spaces */
        str_content.Replace(_T("_"), _T(" "));
        /* Handle all =xx patterns */
        int index;
        while (((index = str_content.Find(_T("="))) != wxNOT_FOUND) &&
               (index < int(str_content.length()-2)))
        {
            unsigned long val_long;
            str_content.Mid(index+1, 2).ToULong(&val_long, 16);
            char val[2];
            *((unsigned char*)val) = (unsigned char)val_long;
            val[1] = 0;
            str_content = str_content.Mid(0,index) << wxString(val, wxConvLocal) <<
                          str_content.Mid(index+3);
        }
        /* Convert to local charset, if necessary */
        str_content = myCharsetConverter::ConvertCharset(str_content,
                                                         pattern.GetMatch(decoded_str, 1));
        /* Recode string before replacement */
        str_content.Replace(_T("\\"), _T("\\\\"));
        str_content.Replace(_T("&"), _T("\\&"));
        /* Replace in result */
        pattern.ReplaceFirst(&decoded_str, str_content);
    }
}
/* Perform B decoding */
{
    /* Check if we have pattern */
    wxRegEx pattern(_T("=\\077([[:alnum:].\\055_]+)\\077[bB]\\077([^\\077]*)\\077="),
                    wxRE_ADVANCED);
    // it seems we can have multiple 'B' strings - need to decode & concatenate them all
    while ((pattern.Matches(decoded_str)) &&
           (pattern.GetMatchCount() == 3))
    {
        // for multiple B-encoded string segments we MUST replace any spaces,
        // otherwise these make their way into the output string
        int i = decoded_str.Replace( _T("?= =?"), _T("?==?") );
        /* Extract the encoded string */
        wxString str_content = pattern.GetMatch(decoded_str, 2);
        /* Perform a base64 decoding of the string */
        std::vector<unsigned char> buffer;
        std::string std_string = (const char*)str_content.mb_str(wxConvLocal);
        mimetic::Base64::Decoder b64;
        mimetic::decode(std_string.begin(),
                        std_string.end(),
                        b64,
                        std::back_inserter(buffer));
        /* Flush content in a string */
        str_content = _T("");
        for (unsigned char* p = &buffer[0]; p <= &buffer[buffer.size()-1]; p++)
        {
            // complains because of hi bit set in some UTF-8 chars
            // stops the rest from working - works OK on Windows/MSVC 2015
            str_content.Append(*p, 1);
        }
        wxString wsConvert = pattern.GetMatch(decoded_str, 1);
        /* Convert to local charset, if necessary */
        str_content = myCharsetConverter::ConvertCharset(str_content,
                                                         pattern.GetMatch(decoded_str, 1));
        /* Recode string before replacement */
        str_content.Replace(_T("\\"), _T("\\\\"));
        str_content.Replace(_T("&"), _T("\\&"));
        /* Replace in result */
        pattern.ReplaceFirst(&decoded_str, str_content);
    }
}
/* Return the decoded string */
return decoded_str;
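For comparison, the byte-level part of this decoding can be done with no charset conversion at all. The following is only a minimal sketch using the standard library, not the code above: it splits one encoded word into charset, encoding and payload, then decodes the payload into raw bytes held in a std::string, so that no byte passes through wxConvLocal before the real conversion. The names EncodedWord, SplitEncodedWord, DecodeQ, DecodeB and DecodePayload are made up for illustration; they are not part of wxWidgets or mimetic.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper type: the three parts of one MIME encoded word.
struct EncodedWord
{
    std::string charset;   // e.g. "GBK" or "ISO-2022-JP"
    char        encoding;  // 'B' or 'Q', upper-cased
    std::string payload;   // still encoded, plain ASCII
};

// Split "=?charset?encoding?payload?=" into its parts.
// Returns false if the input does not have that shape.
bool SplitEncodedWord(const std::string& word, EncodedWord& out)
{
    const std::size_t n = word.size();
    if (n < 8 || word.compare(0, 2, "=?") != 0 || word.compare(n - 2, 2, "?=") != 0)
        return false;
    const std::size_t q1 = word.find('?', 2);   // '?' that ends the charset
    if (q1 == std::string::npos || q1 + 3 > n - 2 || word[q1 + 2] != '?')
        return false;
    out.charset  = word.substr(2, q1 - 2);
    out.encoding = static_cast<char>(
        std::toupper(static_cast<unsigned char>(word[q1 + 1])));
    out.payload  = word.substr(q1 + 3, n - 2 - (q1 + 3));
    return out.encoding == 'B' || out.encoding == 'Q';
}

// Value of one hex digit, or -1 if the character is not a hex digit.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Q decoding: '_' becomes a space, "=XX" becomes the byte 0xXX,
// everything else is copied through unchanged.
std::string DecodeQ(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '_')
        {
            out += ' ';
        }
        else if (in[i] == '=' && i + 2 < in.size() &&
                 HexValue(in[i + 1]) >= 0 && HexValue(in[i + 2]) >= 0)
        {
            out += static_cast<char>((HexValue(in[i + 1]) << 4) | HexValue(in[i + 2]));
            i += 2;
        }
        else
        {
            out += in[i];
        }
    }
    return out;
}

// B decoding: plain base64; characters outside the alphabet
// (including '=' padding) are skipped.
std::string DecodeB(const std::string& in)
{
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    unsigned int bits = 0;
    int nbits = 0;
    for (char c : in)
    {
        const std::size_t v = alphabet.find(c);
        if (v == std::string::npos)
            continue;
        bits = (bits << 6) | static_cast<unsigned int>(v);
        nbits += 6;
        if (nbits >= 8)
        {
            nbits -= 8;
            out += static_cast<char>((bits >> nbits) & 0xFF);
        }
    }
    return out;
}

// Raw bytes of the payload, still in the charset named by w.charset.
std::string DecodePayload(const EncodedWord& w)
{
    return w.encoding == 'B' ? DecodeB(w.payload) : DecodeQ(w.payload);
}
```

With the two sample strings from this thread, this should give the bytes BD F0 CD FE for the GBK word, and for the ISO-2022-JP word a sequence starting 1b 24 42 36 62 30 52 1b 28 4a. Those raw bytes, not a wxString built from them, are what a converter for the named charset would take as input.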
Re: wxWidgets unsupported charset - what to do?
I'm not sure if wxEncodingConverter is the right tool for the job.
Try something like this:
Code: Select all
const char ISO2022JP_DATA[] = "fill with real data";
wxString s(ISO2022JP_DATA, wxCSConv("ISO-2022-JP"));
wxLogMessage( s );
Re: wxWidgets unsupported charset - what to do?
Working on the ISO-2022-JP charset for now, I have the code below.
s1, s2 & s4 all end up as empty strings; s3 contains some garbled translation.
Meanwhile, I have now received some strings encoded in GB2312, so finding a way to handle these strings is getting more important.
Granted, from what I have found out over the past couple of days, ISO-2022-JP and its cousins are a real mare's nest of ASCII shifting back & forth to word-encoded Japanese characters.
Code: Select all
/* What can we do ?? */
// at least log the error return the raw bytes
wxLogError( _("Can't decode charset '%s' ."), ar_wsCharSet );
// the ISO-2022-JP byte sequence --V
// 1b 24 42 36 62 30 52 1b 28 4a 20 76 69 61 20 63 75 72 6c 2d 6c 69 62 72 61 72 79 00
if( ar_wsCharSet.IsSameAs( _T("ISO-2022-JP" )) )
{
    int l = ar_wsInStr.Len();
    char ISO2022JP_DATA[128] = {0};
    int i = 0;
    for ( i = 0; i < l; i++ )
    {
        ISO2022JP_DATA[i] = ar_wsInStr.GetChar(i);
    }
    wxCSConv conv("ISO-2022-JP");
    bool bOk = conv.IsOk();
    wxASSERT( bOk );
    wxString s1( ISO2022JP_DATA, conv);
    wxString s2( ISO2022JP_DATA, wxCSConv("ISO-2022-JP"));
    wxString s3( (const wchar_t *)ar_wsInStr, wxCSConv("ISO-2022-JP"));
    wxString s4( (const char *)ar_wsInStr, wxCSConv("ISO-2022-JP"));
    wxLogMessage( _T("ISO-2022-JP: %s, %s, %s, %s"), s1, s2, s3, s4 );
}
Re: wxWidgets unsupported charset - what to do?
I did a little detective work by tracing through the conversion code.
As far as I can tell, the conversion actually works internally, but then wx does a back-conversion and checks whether the result is the same as the input. Because these strings were not identical, it fails the whole conversion and returns an empty string.
After the back-conversion of the byte sequence you posted, the 10th byte '0x4a' becomes '0x42'; the other bytes are identical. I don't know how to interpret that. Maybe the conversion is not symmetrical? In that case this would be a bug in wxWidgets.
If you only need a Windows solution, you could call the Windows function that performs the actual conversion directly and check if the result is correct.
My quick and dirty test code, below, returns: "金威 via curl-library"
Code: Select all
const char ISO2022JP_DATA[] = { 0x1b, 0x24, 0x42, 0x36, 0x62, 0x30, 0x52, 0x1b, 0x28, 0x4a, 0x20, 0x76, 0x69, 0x61, 0x20, 0x63, 0x75, 0x72, 0x6c, 0x2d, 0x6c, 0x69, 0x62, 0x72, 0x61, 0x72, 0x79, 0x00 };
//wxString s(ISO2022JP_DATA, wxCSConv("ISO-2022-JP"), strlen(ISO2022JP_DATA) );
//wxLogMessage(s);
wchar_t destbuffer[256]; // MultiByteToWideChar writes wide characters
memset(destbuffer, 0, sizeof(destbuffer) );
::MultiByteToWideChar (
    50222, // code page for ISO-2022-JP
    0, // flags
    ISO2022JP_DATA, // input string
    -1, // its length (NUL-terminated)
    destbuffer, // output string
    256 // size of output buffer in characters, not bytes
);
wxLogMessage( wxString(destbuffer) );
Re: wxWidgets unsupported charset - what to do?
Thank you; I had tried to trace through the conversion, but got lost.
In the process I have learned a lot, but more importantly, I have realized how much I still don't understand about the internal workings of the various string and char types and their transformations back and forth.
I had had a brief look at Windows' MultiByteToWideChar function, but never really got clear on how to get from the output of the B or Q decoding to the input of any of the available conversion options.
In this particular case, I had looked at the standard for ISO-2022-JP, and the wording had me wondering whether to create char arrays or wchar_t arrays as input. From your trial, evidently 8-bit arrays are the thing.
In any case, the output string you got from MultiByteToWideChar agrees with the glyphs shown in Thunderbird.
For my own use, a Windows-only version would be quite adequate. In the code I had uploaded to GitHub, I had made an effort to make it Linux compatible, but between the learning curve even for a Windows version and my very limited Linux knowledge (read: an even steeper learning curve for Linux), I very likely have to stick to the Windows version unless I can find something that covers both, hence my efforts to keep using the wxWidgets code.
One of the pages I was consulting may throw some light on the problem you found in the original string. Bytes 8 - 10 seem to represent the escape sequence 0x1b, 0x28, 0x4a => ESC ( J, while the string you got back has the escape sequence 0x1b, 0x28, 0x42 => ESC ( B.
The page at https://www.sljfaq.org/afaq/encodings.h ... SO-2022-JP
says, among much else:
"The text begins in ASCII by default, and it must be switched back to ASCII at the end. It is also recommended for newlines to always be encoded in ASCII. If this is e-mail, the escape codes must be used in the Subject: or From: lines if they contain Japanese, again switching back to ASCII when done.
ESC ( B should be preferred over ESC ( J. The latter is a legacy code whose use is discouraged today. Also, avoid ESC ( I unless half-width katakana are wanted."
So perhaps the wxWidgets test should allow for either.
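To make the ESC ( J versus ESC ( B difference easier to see, here is a small sketch, assuming the raw bytes are already in a std::string. FindEscapes lists the three-byte ISO-2022-JP escape sequences and their offsets, and PreferAsciiShift rewrites ESC ( J to ESC ( B so that two byte strings can be compared while ignoring that one difference. Both names are hypothetical, and this is not how wxWidgets performs its check; it only illustrates what "allow for either" could mean.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One escape sequence found in the input (hypothetical helper type).
struct EscapeSeq
{
    std::size_t offset;  // index of the ESC byte
    std::string name;    // human-readable form, e.g. "ESC ( J"
};

// List the three-byte escape sequences used by ISO-2022-JP.
std::vector<EscapeSeq> FindEscapes(const std::string& bytes)
{
    static const struct { const char* seq; const char* name; } known[] = {
        { "\x1b(B", "ESC ( B" },  // ASCII
        { "\x1b(J", "ESC ( J" },  // JIS X 0201 Roman
        { "\x1b$@", "ESC $ @" },  // JIS X 0208-1978
        { "\x1b$B", "ESC $ B" },  // JIS X 0208-1983
    };
    std::vector<EscapeSeq> found;
    for (std::size_t i = 0; i + 3 <= bytes.size(); ++i)
    {
        if (bytes[i] != '\x1b')
            continue;
        for (const auto& k : known)
        {
            if (bytes.compare(i, 3, k.seq) == 0)
                found.push_back({ i, k.name });
        }
    }
    return found;
}

// Rewrite every ESC ( J as ESC ( B. Note that JIS X 0201 Roman differs
// from ASCII at 0x5C and 0x7E, so this is not strictly lossless.
std::string PreferAsciiShift(std::string bytes)
{
    for (std::size_t i = 0; i + 3 <= bytes.size(); ++i)
    {
        if (bytes.compare(i, 3, "\x1b(J") == 0)
            bytes[i + 2] = 'B';
    }
    return bytes;
}
```

Run on the byte sequence from this thread, FindEscapes reports ESC $ B at offset 0 and ESC ( J at offset 7, and PreferAsciiShift changes only the 10th byte, which is the one byte that differed after the back-conversion.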
For now, I will test the Windows-only version and see how far it gets me.
If I can test a patch or fix for wxWidgets, please let me know. My current environment is MSVC 2015 compiling with the MSVC 2010 tool chain using wxWidgets 3.1.
Re: wxWidgets unsupported charset - what to do?
FWIW, using your sample code and adjusting the code page as necessary, I have been able to decode and convert ISO-2022-JP, GB2312 & GBK; the latter two use the same MultiByteToWideChar code page, 936.
Out of curiosity, I also tried the wxWidgets code for GBK & GB2312 (the second code block below), and it works now that I have the correct input. In this case, s1, s2 and s4 give the correct result; s3 just echoes the input.
What I have now:
Code: Select all
int l = ar_wsInStr.Len();
char ISO2022JP_DATA[128] = {0};
int i = 0;
// copy at most 127 bytes so the buffer stays NUL-terminated
for ( i = 0; i < l && i < 127; i++ )
{
    ISO2022JP_DATA[i] = ar_wsInStr.GetChar(i);
}
wchar_t destbuffer[256]; // MultiByteToWideChar writes wide characters
memset(destbuffer, 0, sizeof(destbuffer) );
::MultiByteToWideChar (
    936, // code page for GBK / GB2312
    0, // flags
    ISO2022JP_DATA, // input string
    -1, // its length (NUL-terminated)
    destbuffer, // output string
    256 // size of output buffer in characters, not bytes
);
wsContent = wxString(destbuffer) ;
Code: Select all
wxString s1( ISO2022JP_DATA, conv);
wxString s2( ISO2022JP_DATA, wxCSConv("GB2312"));
wxString s3( (const wchar_t *)ar_wsInStr, wxCSConv("GB2312"));
wxString s4( (const char *)ar_wsInStr, wxCSConv("GB2312"));
wxLogMessage( _T("GB2312: %s, %s, %s, %s"), s1, s2, s3, s4 );
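Since the only thing that changes between charsets in the MultiByteToWideChar version is the code page, the choice could be pulled into a small lookup. This is a sketch under the assumption that only the charsets mentioned in this thread need handling; CodePageForCharset is a made-up name, and the two code page numbers are the ones used above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Map a MIME charset name to the Windows code page used in this thread.
// Returns false for any charset that is not listed here.
bool CodePageForCharset(std::string charset, unsigned int& codePage)
{
    // charset names are compared case-insensitively
    std::transform(charset.begin(), charset.end(), charset.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    if (charset == "GBK" || charset == "GB2312")
    {
        codePage = 936;
        return true;
    }
    if (charset == "ISO-2022-JP")
    {
        codePage = 50222;
        return true;
    }
    return false;
}
```

The returned value would then be passed as the first argument of MultiByteToWideChar, and a false return can fall through to the existing wxLogError branch.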
Thank you