wxWidgets unsupported charset - what to do?

Post by Widgets »

Some time ago - viewtopic.php?f=1&t=44782 - I had run into a character set not supported by wxWidgets' built-in facilities.

At the time, I never did resolve the issue because it seemed to arise infrequently enough for me to be able to simply ignore it.
Meanwhile, the same problem has come up again several times, albeit with different character sets: GBK & ISO-2022-JP.

Since wxWidgets does not seem to support these char sets directly, I have looked at std::string support for char set conversion, as well as libiconv and ICU, but I cannot figure out how best to proceed. There seem to be issues with each option.

This time around, I would very much like to resolve the issue properly. Given how little I know about fonts, character set encodings and character set support, I would dearly love to get some help and pointers on how best to proceed.
Environment: Win 10/11 64-bit & Mint 21.1
MSVC Express 2019/2022
wxWidgets 3.2.2

Post by doublemax »

Widgets wrote: Since wxWidgets does not seem to support these char sets directly,
Looking into the wx sources, ISO-2022-JP seems to be supported and Wikipedia lists CP936 (which is also supported) as an alias for GBK.

I don't know how to integrate ICU. I looked into it a few times and always decided it was not worth the effort :)

Can you explain in more detail where the texts come from, how do you know their encoding, in which byte format you receive them and what you need to do with them?
Use the source, Luke!

Post by Widgets »

doublemax wrote: Can you explain in more detail where the texts come from, how do you know their encoding, in which byte format you receive them and what you need to do with them?
The data comes directly from the MIME encoded e-mail source where I get strings such as L"=?GBK?B?vfDN/g==?=" or L"=?ISO-2022-JP?Q?=1B$B6b0R=1B=28J_via_curl-library?=" which I want/need to translate to displayable strings.
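
A minimal sketch of splitting such an encoded-word into its charset, encoding letter and payload before any decoding happens. The `EncodedWord` struct and the `ParseEncodedWord` helper are hypothetical names (not part of wxWidgets or mimetic), and the sketch assumes a single well-formed word of the form `=?charset?encoding?payload?=` with no surrounding text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// One MIME encoded-word, split into its three fields (hypothetical type).
struct EncodedWord {
    std::string charset;   // e.g. "GBK" or "ISO-2022-JP"
    char        encoding;  // 'B' or 'Q', always upper case
    std::string payload;   // still encoded, plain ASCII
};

// Hypothetical helper: parses "=?charset?enc?payload?=" into `out`.
// Returns false if the input does not have exactly that shape.
bool ParseEncodedWord(const std::string& word, EncodedWord& out)
{
    const std::size_t n = word.size();
    if (n < 8 || word.compare(0, 2, "=?") != 0 || word.compare(n - 2, 2, "?=") != 0)
        return false;

    // First '?' after the leading "=?" ends the charset name.
    const std::size_t q1 = word.find('?', 2);
    if (q1 == std::string::npos || q1 == 2 || q1 + 2 >= n - 2 || word[q1 + 2] != '?')
        return false;

    // The encoding is a single letter, case-insensitive.
    char enc = word[q1 + 1];
    if (enc == 'b') enc = 'B';
    if (enc == 'q') enc = 'Q';
    if (enc != 'B' && enc != 'Q')
        return false;

    const std::size_t payloadBegin = q1 + 3;
    const std::size_t payloadEnd = n - 2;
    assert(payloadBegin <= payloadEnd);   // guaranteed by the checks above

    const std::string payload = word.substr(payloadBegin, payloadEnd - payloadBegin);
    if (payload.find('?') != std::string::npos)   // '?' may not appear in the payload
        return false;

    out.charset = word.substr(2, q1 - 2);
    out.encoding = enc;
    out.payload = payload;
    return true;
}
```

For the two header strings quoted in this post, the charset fields come out as "GBK" and "ISO-2022-JP", with encodings 'B' and 'Q'.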

For either of these 2 examples,

Code: Select all

wxFontEncoding text_encoding = wxFontMapper::Get()->CharsetToEncoding( ar_wsCharSet );
wxFontEncoding system_encoding = wxLocale::GetSystemEncoding();
returns
text_encoding = wxFONTENCODING_CP936 or wxFONTENCODING_ISO2022_JP (0x00000059), respectively
system_encoding = wxFONTENCODING_CP1252 (0x00000021) for either one
The code at this point:

Code: Select all

  if ( text_encoding == wxFONTENCODING_UTF8 )
  {
    wsContent = wxString::FromUTF8(AfterBOM(ar_wsInStr.mb_str(wxConvLocal)));
  }
  else if( text_encoding == wxFONTENCODING_DEFAULT )
  {
    wsContent = ar_wsInStr;
  }
  else if (system_encoding != text_encoding)
  {
    wxEncodingConverter converter;
    bool can_convert = converter.Init(text_encoding, system_encoding);
    if (can_convert) 
    {
      wsContent = converter.Convert(ar_wsInStr);
    }
    else
    {
      /* What can we do ?? */
      // at least log the error and return the raw bytes
      wxLogError( _("Can't decode charset '%s'  ."), ar_wsCharSet );
      ...
    }
which fails for these two cases.

The decoding is done by

Code: Select all

/* Perform Q decoding */
  {
    /* Check if we have pattern */
    wxRegEx pattern(
      _T("=\\077([[:alnum:].\\055_]+)\\077[qQ]\\077([^\\077]*)\\077="),
      wxRE_ADVANCED);
    while ((pattern.Matches(decoded_str)) &&
            (pattern.GetMatchCount() == 3))
    {
      /* Extract the encoded string */
      wxString str_content = pattern.GetMatch(decoded_str, 2);

      /* Replace all spaces */
      str_content.Replace(_T("_"), _T(" "));

      /* Handle all =xx patterns */
      int index;
      while (((index = str_content.Find(_T("="))) != wxNOT_FOUND) &&
            (index < int(str_content.length()-2)))
      {
        unsigned long val_long;
        str_content.Mid(index+1, 2).ToULong(&val_long, 16);
        char val[2];
        *((unsigned char*)val) = (unsigned char)val_long;
        val[1] = 0;

        str_content = str_content.Mid(0,index) << wxString(val, wxConvLocal) <<
          str_content.Mid(index+3);
      }
      /* Convert to local charset, if necessary */
      str_content = myCharsetConverter::ConvertCharset(str_content,
        pattern.GetMatch(decoded_str, 1));

      /* Recode string before replacement */
      str_content.Replace(_T("\\"), _T("\\\\"));
      str_content.Replace(_T("&"), _T("\\&"));
      /* Replace in result */
      pattern.ReplaceFirst(&decoded_str, str_content);
    }
  }

  /* Perform B decoding */
  {
    /* Check if we have pattern */
    wxRegEx pattern(_T("=\\077([[:alnum:].\\055_]+)\\077[bB]\\077([^\\077]*)\\077="),
      wxRE_ADVANCED);
    // it seems we can have multiple 'B' strings - need to decode & concatenate them all
    while ((pattern.Matches(decoded_str)) &&
            (pattern.GetMatchCount() == 3))
    {
      // for multiple B-encoded string segments we MUST remove any spaces
      // between them; otherwise these make their way into the output string
      int i = decoded_str.Replace( _T("?= =?"), _T("?==?") );
      /* Extract the encoded string */
      wxString str_content = pattern.GetMatch(decoded_str, 2);

      /* Perform a base64 decoding of the string */
      std::vector<unsigned char> buffer;
      std::string std_string = (const char*)str_content.mb_str(wxConvLocal);
      mimetic::Base64::Decoder b64;
      mimetic::decode(std_string.begin(),
                      std_string.end(),
                      b64,
                      std::back_inserter(buffer));
      /* Flush content in a string */
      str_content = _T("");
      // index loop: safe even when the decoded buffer is empty
      for (size_t i = 0; i < buffer.size(); ++i)
      {
        // complains because of hi bit set in some UTF-8 chars
        // stops the rest from working - works OK on Windows/MSVC 2015
        str_content.Append(buffer[i], 1);
      }
      wxString wsConvert = pattern.GetMatch(decoded_str, 1);
      /* Convert to local charset, if necessary */
      str_content = myCharsetConverter::ConvertCharset(str_content,
        pattern.GetMatch(decoded_str, 1));

      /* Recode string before replacement */
      str_content.Replace(_T("\\"), _T("\\\\"));
      str_content.Replace(_T("&"), _T("\\&"));
      /* Replace in result */
      pattern.ReplaceFirst(&decoded_str, str_content);
    }
  }
  /* Return the decoded string */
  return decoded_str;
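
A minimal sketch of the Q-decoding step that keeps the result as plain 8-bit bytes in a std::string, assuming the goal is to hand the charset converter the untouched byte sequence (as the byte dump and test code further down this thread do). `DecodeQ` and `HexValue` are hypothetical helper names; the sketch only handles `_` and `=XX` and copies everything else through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns 0..15 for a hexadecimal digit, or -1 for any other character.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decodes the payload of a Q-encoded word into raw bytes.
// No charset conversion is applied here; that is left to the caller.
std::string DecodeQ(const std::string& payload)
{
    std::string bytes;
    bytes.reserve(payload.size());

    for (std::size_t i = 0; i < payload.size(); ++i) {
        const char c = payload[i];
        if (c == '_') {
            bytes.push_back(' ');                 // '_' stands for a space
        } else if (c == '=' && i + 2 < payload.size()
                   && HexValue(payload[i + 1]) >= 0
                   && HexValue(payload[i + 2]) >= 0) {
            const int value = HexValue(payload[i + 1]) * 16 + HexValue(payload[i + 2]);
            bytes.push_back(static_cast<char>(value));
            i += 2;                               // skip the two hex digits
        } else {
            bytes.push_back(c);                   // literal character
        }
    }

    assert(bytes.size() <= payload.size());       // decoding never grows the data
    return bytes;
}
```

Applied to the payload `=1B$B6b0R=1B=28J_via_curl-library`, this yields the bytes 1b 24 42 36 62 30 52 1b 28 4a followed by " via curl-library".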

Post by doublemax »

I'm not sure if wxEncodingConverter is the right tool for the job.

Try something like this:

Code: Select all

const char ISO2022JP_DATA[] = "fill with real data";
wxString s(ISO2022JP_DATA, wxCSConv("ISO-2022-JP"));
wxLogMessage( s );

Post by Widgets »

Working on the ISO-2022-JP charset for now, I have

Code: Select all

     /* What can we do ?? */
      // at least log the error and return the raw bytes
      wxLogError( _("Can't decode charset '%s'  ."), ar_wsCharSet );
      // the ISO-2022-JP byte sequence --V
      // 1b 24 42 36 62 30 52 1b 28 4a 20 76 69 61 20 63 75 72 6c 2d 6c 69 62 72 61 72 79 00
      if( ar_wsCharSet.IsSameAs( _T("ISO-2022-JP" )) )
      {
        int l = ar_wsInStr.Len();
        char ISO2022JP_DATA[128] = {0};
        int i = 0;
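        // stop one short of the buffer size so the data stays NUL-terminated
        for ( i = 0; i < l && i < int(sizeof(ISO2022JP_DATA)) - 1; i++ )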
        {
          ISO2022JP_DATA[i] = ar_wsInStr.GetChar(i);
        }
        wxCSConv conv("ISO-2022-JP");
        bool bOk = conv.IsOk();
        wxASSERT( bOk );
        wxString s1( ISO2022JP_DATA, conv);
        wxString s2( ISO2022JP_DATA, wxCSConv("ISO-2022-JP"));
        wxString s3( (const wchar_t *)ar_wsInStr, wxCSConv("ISO-2022-JP"));
        wxString s4( (const char *)ar_wsInStr, wxCSConv("ISO-2022-JP"));
        wxLogMessage( _T("ISO-2022-JP: %s, %s, %s, %s"), s1, s2, s3, s4 );
      }
s1, s2 & s4 all end up as empty strings; s3 contains some garbled translation :-(
Meanwhile, I have now gotten some strings encoded in GB2312, so finding a way to handle these strings is getting more important :-)
Granted, from what I have found out over the past couple of days, ISO-2022-JP and its cousins are a real mare's nest of ASCII shifting back & forth to word-encoded Japanese characters.
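
To make the shifting visible, here is a minimal sketch that splits an ISO-2022-JP byte string into runs at its escape sequences. It assumes only the four escape sequences of basic ISO-2022-JP (ESC ( B, ESC ( J, ESC $ @ and ESC $ B); `JisMode`, `JisRun` and `SplitIso2022Jp` are hypothetical names, and the sketch does not convert anything to Unicode.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Character set currently selected by the last escape sequence.
enum class JisMode { Ascii, JisRoman, Jis0208 };

// A run of bytes that all belong to the same character set.
struct JisRun {
    JisMode     mode;
    std::string bytes;   // escape sequences are not included
};

// Hypothetical helper: splits ISO-2022-JP data into runs at each escape.
std::vector<JisRun> SplitIso2022Jp(const std::string& in)
{
    std::vector<JisRun> runs;
    JisMode mode = JisMode::Ascii;     // the text starts in ASCII by default
    std::string current;

    // Stores the bytes collected so far under the mode they were read in.
    auto flush = [&]() {
        if (!current.empty()) {
            runs.push_back(JisRun{mode, current});
            current.clear();
        }
    };

    for (std::size_t i = 0; i < in.size(); ) {
        if (in[i] == '\x1b' && i + 2 < in.size()) {
            const char a = in[i + 1];
            const char b = in[i + 2];
            bool known = true;
            JisMode next = mode;
            if (a == '(' && b == 'B')                     next = JisMode::Ascii;
            else if (a == '(' && b == 'J')                next = JisMode::JisRoman;
            else if (a == '$' && (b == '@' || b == 'B'))  next = JisMode::Jis0208;
            else                                          known = false;

            if (known) {
                flush();               // close the run under the old mode
                mode = next;
                i += 3;                // skip ESC and its two bytes
                continue;
            }
        }
        current.push_back(in[i]);
        ++i;
    }
    flush();
    assert(current.empty());           // flush() always drains the accumulator
    return runs;
}
```

For the byte sequence shown in the code comment of this post, the sketch produces two runs: a double-byte JIS X 0208 run holding `6b0R` (two characters) and a JIS Roman run holding " via curl-library".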

Post by doublemax »

I did a little detective work by tracing through the conversion code.

As far as I can tell, the conversion actually works internally, but then wx does a back-conversion and checks whether the result is the same as the input. Because these strings are not identical, it fails the whole conversion and returns an empty string.

After the back-conversion of the byte sequence you posted, the 10th byte '0x4a' will be '0x42', the other bytes are identical. I don't know how to interpret that. Maybe the conversion is not symmetrical? In that case this would be a bug in wxWidgets.

If you only need a Windows solution, you could call the Windows function that performs the actual conversion directly and check if the result is correct.

My quick and dirty test code:

Code: Select all

const char ISO2022JP_DATA[] = { 0x1b, 0x24, 0x42, 0x36, 0x62, 0x30, 0x52, 0x1b, 0x28, 0x4a, 0x20, 0x76, 0x69, 0x61, 0x20, 0x63, 0x75, 0x72, 0x6c, 0x2d, 0x6c, 0x69, 0x62, 0x72, 0x61, 0x72, 0x79, 0x00 };
  //wxString s(ISO2022JP_DATA, wxCSConv("ISO-2022-JP"), strlen(ISO2022JP_DATA) );
  //wxLogMessage(s);

  TCHAR destbuffer[256];
  memset(destbuffer, 0, 256*sizeof(TCHAR) );

  ::MultiByteToWideChar (
                          50222,              // code page for ISO-2022-JP
                          0,                  // flags
                          ISO2022JP_DATA,     // input string
                          -1,                 // its length (NUL-terminated)
                          destbuffer,         // output string
                          256                 // size of output buffer, in characters
                        );

  wxLogMessage( wxString(destbuffer) );
This returns: "金威 via curl-library"

Post by Widgets »

Thank you; I had tried to trace through the conversion, but got lost.
In the process I have learned a lot, but more importantly, I have realized how much more I don't understand about the internal workings of the various string and char types and their transformations back and forth.

I had had a brief look at Win's MultiByteToWideChar function, but never really got my head clear as to just how I need to go from the B or Q encoded output to the input of any of the available conversion options.

In this particular case, I had looked at the standard for ISO-2022-JP and the wording had me wondering whether to create char arrays or wchar_t arrays as input. From your trial, evidently 8-bit arrays are the thing.
In any case, the output string you got from MultiByteToWideChar agrees with the glyphs shown in Thunderbird.
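
For the B-encoded case, a minimal sketch of turning the base64 payload into the same kind of 8-bit byte array. `DecodeBase64` and `Base64Value` are hypothetical helper names (the code in this thread uses mimetic for this step); the sketch skips characters outside the base64 alphabet and stops at the first `=` padding character.

```cpp
#include <cassert>
#include <string>

// Returns 0..63 for a base64 alphabet character, or -1 for anything else.
static int Base64Value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Hypothetical helper: decodes a base64 payload into raw 8-bit bytes.
std::string DecodeBase64(const std::string& payload)
{
    std::string bytes;
    unsigned int accumulator = 0;   // holds the most recent input bits
    int bitCount = 0;               // number of bits not yet emitted

    for (char c : payload) {
        if (c == '=')
            break;                  // padding marks the end of the data
        const int value = Base64Value(c);
        if (value < 0)
            continue;               // ignore whitespace and stray characters

        accumulator = (accumulator << 6) | static_cast<unsigned int>(value);
        bitCount += 6;
        if (bitCount >= 8) {
            bitCount -= 8;
            bytes.push_back(static_cast<char>((accumulator >> bitCount) & 0xFFu));
        }
        assert(bitCount >= 0 && bitCount < 8);
    }
    return bytes;
}
```

For the GBK sample payload `vfDN/g==` this gives the four bytes bd f0 cd fe, which can then be passed as a NUL-terminated char array to the conversion call.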

For my own use, a Windows-only version would be quite adequate. In the code I had uploaded to Github, I had made an effort to make it Linux compatible, but between the learning curve even for a Windows version and my very limited Linux knowledge (read: an even steeper learning curve for Linux), I will very likely have to stick to the Windows version unless I can find something that covers both; hence my efforts to keep using the wxWidgets code.
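
On the Linux side, one option that avoids wxWidgets for this step is the POSIX iconv interface. The following is only a sketch under stated assumptions: `ToUtf8` is a hypothetical helper name, the charset names accepted depend on the iconv implementation installed, and the `char**` input parameter matches glibc (some other implementations declare it as `const char**`).

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Hypothetical helper: converts `bytes` from `charset` to UTF-8 using iconv.
// Returns false if the charset is unknown to iconv or the input is invalid.
bool ToUtf8(const std::string& bytes, const char* charset, std::string& utf8)
{
    utf8.clear();
    iconv_t cd = iconv_open("UTF-8", charset);    // arguments are (to, from)
    if (cd == (iconv_t)(-1))
        return false;

    std::string input = bytes;                    // iconv needs a non-const pointer
    char* inPtr = &input[0];
    std::size_t inLeft = input.size();
    bool ok = true;

    while (inLeft > 0) {
        char buffer[256];
        char* outPtr = buffer;
        std::size_t outLeft = sizeof(buffer);
        const std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        const int err = errno;                    // save before any other call
        utf8.append(buffer, sizeof(buffer) - outLeft);
        if (rc == (std::size_t)(-1) && err != E2BIG) {
            ok = false;                           // invalid or truncated input
            break;
        }
        // E2BIG only means the output buffer filled up; loop for the rest.
    }

    iconv_close(cd);
    assert(!ok || inLeft == 0);                   // success means all input was used
    return ok;
}
```

The result is UTF-8, which can be turned into a wxString with wxString::FromUTF8, as the existing code already does for UTF-8 input.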

One of the pages I was consulting may throw some light on the problem you found in the original string. Bytes 8 - 10 seem to represent an escape sequence, 0x1b, 0x28, 0x4a => ESC ( J, while the string you got back has the escape sequence 0x1b, 0x28, 0x42 => ESC ( B.

The page @: https://www.sljfaq.org/afaq/encodings.h ... SO-2022-JP
says, among much else
The text begins in ASCII by default, and it must be switched back to ASCII at the end. It is also recommended for newlines to always be encoded in ASCII. If this is e-mail, the escape codes must be used in the Subject: or From: lines if they contain Japanese, again switching back to ASCII when done.

ESC ( B should be preferred over ESC ( J. The latter is a legacy code whose use is discouraged today. Also, avoid ESC ( I unless half-width katakana are wanted.
So perhaps the wxWidgets test should allow for either :-)
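
A sketch of what "allowing for either" could look like when comparing the input with its back-conversion. This is not the wxWidgets check itself: `NormalizeRomanEscapes` and `SameAfterRoundTrip` are hypothetical names, and treating ESC ( J as equivalent to ESC ( B is an approximation, because JIS X 0201 Roman differs from ASCII at the bytes 0x5C and 0x7E.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: rewrites every ESC ( J (JIS Roman) to ESC ( B (ASCII).
// Bytes inside a double-byte run are in 0x21..0x7E, so ESC cannot occur there.
std::string NormalizeRomanEscapes(std::string bytes)
{
    const std::string legacy = "\x1b(J";
    const std::string ascii  = "\x1b(B";
    assert(legacy.size() == ascii.size());   // same length: offsets stay valid

    for (std::size_t pos = bytes.find(legacy);
         pos != std::string::npos;
         pos = bytes.find(legacy, pos + ascii.size())) {
        bytes.replace(pos, legacy.size(), ascii);
    }
    return bytes;
}

// Compares the original bytes with the back-converted bytes, ignoring the
// difference between the two "single-byte" escape sequences.
bool SameAfterRoundTrip(const std::string& original, const std::string& roundTripped)
{
    return NormalizeRomanEscapes(original) == NormalizeRomanEscapes(roundTripped);
}
```

With the sample string, the original and the back-converted bytes differ only in the 10th byte (0x4a versus 0x42), so a comparison of this kind would accept them as equal.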
For now, I will test the Windows only version and see how far it gets me.
If I can test a patch or fix for wxWidgets, please let me know. My current environment is MSVC 2015 compiling with the MSVC 2010 tool chain using wxWidgets 3.1.

Post by Widgets »

FWIW, using your sample code and adjusting the code page as necessary, I have been able to decode and convert ISO-2022-JP, GB2312 & GBK; the latter two use the same MultiByteToWideChar code page, 936.
What I have now:

Code: Select all

  // Copy the decoded payload into an 8-bit buffer, leaving room for the NUL.
  int l = ar_wsInStr.Len();
  char ISO2022JP_DATA[128] = {0};
  int i = 0;
  for ( i = 0; i < l && i < int(sizeof(ISO2022JP_DATA)) - 1; i++ )
  {
    ISO2022JP_DATA[i] = ar_wsInStr.GetChar(i);
  }
  TCHAR destbuffer[256];
  memset(destbuffer, 0, 256*sizeof(TCHAR) );

  ::MultiByteToWideChar (
                          936,                // code page for GBK/GB2312 (50222 for ISO-2022-JP)
                          0,                  // flags
                          ISO2022JP_DATA,     // input string
                          -1,                 // its length (NUL-terminated)
                          destbuffer,         // output string
                          256                 // size of output buffer, in characters
                        );
  wsContent = wxString(destbuffer);
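
Since only the code page changes between these charsets, the choice can be sketched as a small lookup table. `LookupCodePage` is a hypothetical helper name; the table holds only the values mentioned in this thread (50222 for ISO-2022-JP, 936 for GBK, GB2312 and CP936) plus 65001 for UTF-8, and any other charset would have to be added by hand.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: maps a MIME charset name to a Windows code page.
// Returns false for unknown names, because 0 is itself a valid code page
// argument (CP_ACP) and so cannot be used to signal "not found".
bool LookupCodePage(const std::string& charset, unsigned int& codePage)
{
    // Charset names are case-insensitive, so compare in upper case.
    std::string name;
    for (char c : charset)
        name.push_back(static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
    assert(name.size() == charset.size());

    struct Entry { const char* name; unsigned int codePage; };
    static const Entry table[] = {
        { "ISO-2022-JP", 50222 },   // value used in the test code in this thread
        { "GBK",         936   },
        { "GB2312",      936   },
        { "CP936",       936   },
        { "UTF-8",       65001 },   // CP_UTF8
    };

    for (const Entry& entry : table) {
        if (name == entry.name) {
            codePage = entry.codePage;
            return true;
        }
    }
    return false;
}
```

The returned value would replace the hard-coded first argument of MultiByteToWideChar; unknown charsets can then fall through to the existing error logging.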
Out of curiosity, I also tried the wxWidgets code for GBK & GB2312 and it works, now that I have the correct input:

Code: Select all

wxString s1( ISO2022JP_DATA, conv);
        wxString s2( ISO2022JP_DATA, wxCSConv("GB2312"));
        wxString s3( (const wchar_t *)ar_wsInStr, wxCSConv("GB2312"));
        wxString s4( (const char *)ar_wsInStr, wxCSConv("GB2312"));
        wxLogMessage( _T("GB2312: %s, %s, %s, %s"), s1, s2, s3, s4 );
In this case, s1, s2 & s4 give the correct result; s3 just echoes the input.
Thank you =D>