Regex problem with UTF-8 string in mime decoding Topic is solved

Talk here about issues with one of the components hosted at wxCode, or suggest features for it.
Widgets
Ultimate wxWidgets Guru
Posts: 534
Joined: Thu Jun 01, 2006 4:36 pm
Location: Right here!

Regex problem with UTF-8 string in mime decoding

Post by Widgets »

I am using the wxCode version of the wxEmail package (wxEMail-1.0-Beta-2) in a project of mine, compiling under MSVC 2015 Community Edition using the v140_xp tool set and statically linked wxWidgets 3.1.0 libraries. (Edit: v100 tool set, the same as compiling under MSVC 2010.)
The code that seems to give me trouble is shown below.
Calling test routine:

  // UTF-8
  wxString wsT2 = _T("Dom.eu - Raport z eksportu ogłoszeń nieruchomości");
  wsT = _T("=?utf-8?B?RG9tLmV1IC0gcmFwb3J0IHogZWtzcG9ydHUgb2fFgm9zemXFhCBuaWVydWNob21vxZtjaQ==?=");
  wsDecoded = wxRfc2047::Decode( wsT );
  CPPUNIT_ASSERT( wsDecoded.IsSameAs( _T("Dom.eu - raport z eksportu ogłoszeń nieruchomości") ) );

  wsT = _T("=?UTF-8?B?VGhpcyB3ZWVrIE9uIERlbWFuZCDigJMgVGhlIExpZ2h0IEJldHdlZW4gT2NlYW5zICYgSW5mZXJubw==?=");
  wsDecoded = wxRfc2047::Decode( wsT );
  CPPUNIT_ASSERT( wsDecoded.IsSameAs( _T("This week On Demand – The Light Between Oceans & Inferno") ) );

Called code, taken from the wxEmail package, with the Q decoding part removed for clarity since it won't be used:

wxString wxRfc2047::Decode(const wxString& encoded_str)
{
   /* Initialise decoded string */
   wxString decoded_str = encoded_str;

   /* Remove all white spaces between encoded parts, if any */
   {
      wxRegEx pattern(_T("(=\\077[[:alnum:].\\055_]+\\077[qQbB]\\077[^\\077]*\\077=)[[:blank:]]+(=\\077[[:alnum:].\\055_]+\\077[qQbB]\\077[^\\077]*\\077=)"), wxRE_ADVANCED);
      pattern.ReplaceAll(&decoded_str, _T("\\1\\2"));
   }
   /* Perform Q decoding: removed here for clarity (not used in these tests) */
   /* Perform B decoding */
   {
      /* Check if we have pattern */
      wxRegEx pattern(_T("=\\077([[:alnum:].\\055_]+)\\077[bB]\\077([^\\077]*)\\077="), wxRE_ADVANCED);
      while ((pattern.Matches(decoded_str)) &&
             (pattern.GetMatchCount() == 3))
      {
         /* Extract the encoded string */
         wxString str_content = pattern.GetMatch(decoded_str, 2);

         /* Perform a base64 decoding of the string */
         std::vector<unsigned char> buffer;
         std::string std_string = (const char*)str_content.mb_str(wxConvLocal);
         mimetic::Base64::Decoder b64;
         mimetic::decode(std_string.begin(),
                         std_string.end(),
                         b64,
                         std::back_inserter(buffer));
         /* Flush content in a string */
         str_content = _T("");
         for (unsigned char* p = &buffer[0]; p <= &buffer[buffer.size()-1]; p++)
         {
            str_content.Append(*p, 1);
         }

         /* Convert to local charset, if necessary */
         str_content = wxCharsetConverter::ConvertCharset(str_content, pattern.GetMatch(decoded_str, 1));	// << point A

         /* Replace in result */
         wxString wsContent = str_content;	// try to convert to  wxWidgets before call to check result
//         pattern.ReplaceFirst(&decoded_str, str_content);
         pattern.ReplaceFirst(&decoded_str, wsContent);            //   <<<<<<< problem seems to start here
      }
   }
   /* Return the decoded string */
   return decoded_str;
}

For the first input string I get the expected output, but for the second string the code goes into an infinite loop and hangs.
With the second string, I get the properly decoded result at point A.
After calling pattern.ReplaceFirst(&decoded_str, wsContent); the result is:
decoded_str = "This week On Demand – The Light Between Oceans =?UTF-8?B?VGhpcyB3ZWVrIE9uIERlbWFuZCDigJMgVGhlIExpZ2h0IEJldHdlZW4gT2NlYW5zICYgSW5mZXJubw==?= Inferno"

It very much looks like the problem is with the '&' in wsContent, but, not being a wxRegEx or regex guru, I am completely lost as to how to fix it for the general case.
Because the original un-decoded string is now still part of the output, the test at the top of the loop succeeds again and around the loop we go ......
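
A minimal sketch of how an unescaped '&' could produce the output above, assuming the replacement text follows the sed-style convention (described in the wxRegEx documentation) in which '&' stands for the whole match. ExpandReplacement is a hypothetical, simplified model written only for illustration; it is not the wxWidgets implementation, and it ignores numbered back references.

```cpp
#include <cstddef>
#include <string>

// Hypothetical, simplified model of sed-style replacement expansion:
//   '&'  -> the whole matched text
//   '\x' -> the character x, copied literally
// Numbered back references (\1, \2, ...) are deliberately not modelled.
std::string ExpandReplacement(const std::string& replacement,
                              const std::string& wholeMatch)
{
    std::string out;
    for (std::size_t i = 0; i < replacement.size(); ++i)
    {
        const char c = replacement[i];
        if (c == '\\' && i + 1 < replacement.size())
            out += replacement[++i];   // escaped character, taken literally
        else if (c == '&')
            out += wholeMatch;         // whole match is pasted back in
        else
            out += c;
    }
    return out;
}
```

Under this model, a replacement such as "Oceans & Inferno" expands with the matched encoded word pasted back in place of the '&', so the pattern keeps matching on the next pass of the loop; "Oceans \& Inferno" would expand to a literal '&' instead.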

Any help will be most appreciated.
TIA
Arnold
Environment: Win 10/11 64-bit & Mint 21.1
MSVC Express 2019/2022
wxWidgets 3.2.2
doublemax
Moderator
Posts: 19102
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Regex problem with UTF-8 string in mime decoding

Post by doublemax »

I didn't test the code, but the fact that it totally ignores the encoding given in the string and uses wxConvLocal internally is quite suspicious. Maybe you can find some other code for RFC2047 decoding "in the wild".

If you intend anything serious with regards to email sending and receiving, i don't think you'll get very far with wxEMail. Unfortunately i can't recommend any good open source library for that purpose. Personally i ended up buying CkMail.
https://www.chilkatsoft.com/email-library.asp
Use the source, Luke!

Re: Regex problem with UTF-8 string in mime decoding

Post by Widgets »

doublemax wrote: I didn't test the code, but the fact that it totally ignores the encoding given in the string and uses wxConvLocal internally is quite suspicious. Maybe you can find some other code for RFC2047 decoding "in the wild".
Thank you very much for your reply, doublemax.
Initially I was very disappointed by the move to wxCode, because I really don't think it is related to that part of wxWidgets at all.
I understand your concern about the conversion, which is why I tried different tests (including finding another UTF-8 encoded string), but by now I am convinced that the issue lies either in my not understanding the inputs required by wxRegEx::ReplaceFirst() or in a problem inside that function.

I had drafted a modified post to explain, but finally decided to build a small, minimal example to test my hypothesis. I have attached it to this post and it does show the same problem. In it, I have reverted to MSVC 2010. The MSVC 2015 effort was a later attempt to get more information from a more recent compiler, but, of course, it did not help much.

In this test project I had to add some of the MIME decoding code just to keep it all as close to the 'real' thing as possible.

Just to recapitulate: in my opinion, the output of the replacement should not include the original content of decoded_str, i.e. the content passed in to Decode(). All of it ought to have been replaced by the string str_content, as happens with the first test string :-)
doublemax wrote: If you intend anything serious with regards to email sending and receiving, i don't think you'll get very far with wxEMail. Unfortunately i can't recommend any good open source library for that purpose. Personally i ended up buying CkMail.
https://www.chilkatsoft.com/email-library.asp
Yes, we have, I believe, touched on this before, but my app was too far advanced to consider a full rebuild :-)
Right now, I am only interested in checking the contents of my POP3 server, though if I ever move to IMAP, I might actually have to 'reboot' my thinking on this.
In all my search for a solution to this issue, I have found another 'version' of wxEmail from Eran Ifrah of Codelite fame at: https://github.com/eranif/wxEmail.
It uses libcurl to send emails, and I have compiled that project successfully, though I have not really used it.
Attachment: regexTest.zip (122.68 KiB)

Re: Regex problem with UTF-8 string in mime decoding

Post by doublemax »

So is this the original code from wxEmail or did you already add parts? If so, which ones?

Re: Regex problem with UTF-8 string in mime decoding

Post by Widgets »

doublemax wrote: So is this the original code from wxEmail or did you already add parts? If yes, which ones.
The sample test program consists merely of a somewhat shortened extract from the function wxString wxRfc2047::Decode(const wxString& encoded_str) in rfc2047.cpp - as it appears in wxEMail\src\codec\rfc2047.cpp - called by the main frame with the appropriate test strings.

My code is of course more extensive, but this sample reproduces the error I get both from my app as well as the original wxEMailPop3 client attempting to download existing messages from the same POP3 server.
Essentially, one can say it happens in the original code from wxEMail.

Re: Regex problem with UTF-8 string in mime decoding

Post by Widgets »

In looking through the code some more, I realized that in the function wxRfc2047::Decode(), the code for Q decoding looked very similar to that for B decoding, with one difference.
In the section for Q decoding, there was a bit of extra code which was not present in the B decode section and, what is more to the point, it concerned the replacement of any '&' with an escape sequence prior to the call to ReplaceFirst().

Once I added equivalent code to the B decoding section, the problem was fixed and the input string is now decoded properly.
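
The exact lines from the Q decoding section are not quoted in this thread, so the following is only a sketch of the idea in plain C++, assuming that a backslash acts as the escape character in the replacement text and that '&' and '\' are the characters that need escaping. EscapeForReplacement is a hypothetical helper name, not part of wxEmail or wxWidgets.

```cpp
#include <string>

// Hypothetical helper: make decoded text safe to use as a replacement string
// by escaping the two characters assumed to be special there, '\' and '&'.
std::string EscapeForReplacement(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (const char c : text)
    {
        if (c == '\\' || c == '&')
            out += '\\';   // prefix special characters with a backslash
        out += c;
    }
    return out;
}
```

In Decode(), an equivalent step on the wxString would go just before pattern.ReplaceFirst(&decoded_str, ...), for instance with two wxString::Replace() calls on str_content, handling the backslash first so that the backslashes added for '&' are not escaped a second time.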