
Regex problem with UTF-8 string in mime decoding

Posted: Tue Feb 07, 2017 3:31 am
by Widgets
I am using the wxCode version of the wxEmail package, wxEMail-1.0-Beta-2, in a project of mine, compiling under MSVC 2015 Community Edition using the v140_xp tool set and statically linked wxWidgets 3.1.0 libraries. (Edit: now using the v100 tool set, the same as compiling under MSVC 2010.)
The code which seems to give me issues looks like:
calling test routine:

Code:

  // UTF-8
  wxString wsT2 = _T("Dom.eu - Raport z eksportu ogłoszeń nieruchomości");
  wsT = _T("=?utf-8?B?RG9tLmV1IC0gcmFwb3J0IHogZWtzcG9ydHUgb2fFgm9zemXFhCBuaWVydWNob21vxZtjaQ==?=");
  wsDecoded = wxRfc2047::Decode( wsT );
  CPPUNIT_ASSERT( wsDecoded.IsSameAs( _T("Dom.eu - raport z eksportu ogłoszeń nieruchomości") ) );

  wsT = _T("=?UTF-8?B?VGhpcyB3ZWVrIE9uIERlbWFuZCDigJMgVGhlIExpZ2h0IEJldHdlZW4gT2NlYW5zICYgSW5mZXJubw==?=");
  wsDecoded = wxRfc2047::Decode( wsT );
  CPPUNIT_ASSERT( wsDecoded.IsSameAs( _T("This week On Demand – The Light Between Oceans & Inferno") ) );
called code - taken from the wxEmail package - with the Q decoding part removed for clarity since it won't be used:

Code:

wxString wxRfc2047::Decode(const wxString& encoded_str)
{
   /* Initialise decoded string */
   wxString decoded_str = encoded_str;

   /* Remove all white spaces between encoded parts, if any */
   {
      wxRegEx pattern(_T("(=\\077[[:alnum:].\\055_]+\\077[qQbB]\\077[^\\077]*\\077=)[[:blank:]]+(=\\077[[:alnum:].\\055_]+\\077[qQbB]\\077[^\\077]*\\077=)"), wxRE_ADVANCED);
      pattern.ReplaceAll(&decoded_str, _T("\\1\\2"));
   }
   /* Perform Q decoding */
   /* ... Q decoding block omitted here, not used ... */
   /* Perform B decoding */
   {
      /* Check if we have pattern */
      wxRegEx pattern(_T("=\\077([[:alnum:].\\055_]+)\\077[bB]\\077([^\\077]*)\\077="), wxRE_ADVANCED);
      while ((pattern.Matches(decoded_str)) &&
             (pattern.GetMatchCount() == 3))
      {
         /* Extract the encoded string */
         wxString str_content = pattern.GetMatch(decoded_str, 2);

         /* Perform a base64 decoding of the string */
         std::vector<unsigned char> buffer;
         std::string std_string = (const char*)str_content.mb_str(wxConvLocal);
         mimetic::Base64::Decoder b64;
         mimetic::decode(std_string.begin(),
                         std_string.end(),
                         b64,
                         std::back_inserter(buffer));
         /* Flush content in a string */
         str_content = _T("");
         for (unsigned char* p = &buffer[0]; p <= &buffer[buffer.size()-1]; p++)
         {
            str_content.Append(*p, 1);
         }

         /* Convert to local charset, if necessary */
         str_content = wxCharsetConverter::ConvertCharset(str_content, pattern.GetMatch(decoded_str, 1));	// << point A

         /* Replace in result */
         wxString wsContent = str_content;	// try to convert to  wxWidgets before call to check result
//         pattern.ReplaceFirst(&decoded_str, str_content);
         pattern.ReplaceFirst(&decoded_str, wsContent);            //   <<<<<<< problem seems to start here
      }
   }
   /* Return the decoded string */
   return decoded_str;
}
For the first input string I get the expected output but for the second string the code goes into an infinite loop and hangs.
With the second string, I get the properly decoded result at point A.
After calling pattern.ReplaceFirst(&decoded_str, wsContent);
decoded_str = "This week On Demand – The Light Between Oceans =?UTF-8?B?VGhpcyB3ZWVrIE9uIERlbWFuZCDigJMgVGhlIExpZ2h0IEJldHdlZW4gT2NlYW5zICYgSW5mZXJubw==?= Inferno"

It very much looks like the problem is with the '&' in wsContent, but, not being a wxRegex or regex guru, I am completely lost as to how to fix it for the general case.
Because the original un-decoded string is now still part of the output, the test at the top of the loop succeeds again and around the loop we go ......

Any help will be most appreciated.
TIA
Arnold

Re: Regex problem with UTF-8 string in mime decoding

Posted: Tue Feb 07, 2017 6:13 pm
by doublemax
I didn't test the code, but the fact that it totally ignores the encoding given in the string and uses wxConvLocal internally is quite suspicious. Maybe you can find some other code for RFC2047 decoding "in the wild".

If you intend anything serious with regards to email sending and receiving, i don't think you'll get very far with wxEMail. Unfortunately i can't recommend any good open source library for that purpose. Personally i ended up buying CkMail.
https://www.chilkatsoft.com/email-library.asp

Re: Regex problem with UTF-8 string in mime decoding

Posted: Tue Feb 07, 2017 9:10 pm
by Widgets
doublemax wrote:I didn't test the code, but the fact that it totally ignores the encoding given in the string and uses wxConvLocal internally is quite suspicious. Maybe you can find some other code for RFC2047 decoding "in the wild".
Thank you very much for your reply, doublemax.
Initially I was very disappointed by the move to wxCode, because I really don't think it is related to that part of wxWidgets at all.
I understand your concern about the conversion, which is why I tried different tests (including finding another UTF-8 encoded string), but by now I am convinced that the issue is really either my not understanding the inputs required by wxRegEx::ReplaceFirst() or else a problem inside that function.

I had made up a modified post to explain, but finally decided to build a small minimalist example to test my hypothesis. I have attached it to this post and it does show the same problem. In it, I have reverted to MSVC 2010. The MSVC 2015 build was a later attempt to get more information from a more recent compiler system, but, of course, it did not help much.

In this test project I had to add some of the MIME decoding code just to keep it all as close to the 'real' thing as possible.

Just to recapitulate: IMO, the output of the replacement should simply not include the original encoded string that was passed in to Decode(). All of it ought to have been replaced by the string str_content, as happens with the first test string :-)
doublemax wrote: If you intend anything serious with regards to email sending and receiving, i don't think you'll get very far with wxEMail. Unfortunately i can't recommend any good open source library for that purpose. Personally i ended up buying CkMail.
https://www.chilkatsoft.com/email-library.asp
Yes, we have, I believe, touched on this before, but my app was too far advanced to consider a full rebuild :-)
Right now, I am only interested in checking the contents of my POP3 server, though if I ever move to IMAP, I might actually have to 'reboot' my thinking on this.
In all my search for a solution to this issue, I have found another 'version' of wxEmail from Eran Ifrah of Codelite fame at: https://github.com/eranif/wxEmail.
It uses libCurl to send emails, and I have compiled that project successfully as well, though I have not really used it.
regexTest.zip
(122.68 KiB) Downloaded 189 times

Re: Regex problem with UTF-8 string in mime decoding

Posted: Tue Feb 07, 2017 11:14 pm
by doublemax
So is this the original code from wxEmail or did you already add parts? If yes, which ones.

Re: Regex problem with UTF-8 string in mime decoding

Posted: Wed Feb 08, 2017 3:12 am
by Widgets
doublemax wrote:So is this the original code from wxEmail or did you already add parts? If yes, which ones.
The sample test program consists merely of a somewhat shortened extract from the function wxString wxRfc2047::Decode(const wxString& encoded_str) in rfc2047.cpp - as it appears in wxEMail\src\codec\rfc2047.cpp - called by the main frame with the appropriate test strings.

My code is of course more extensive, but this sample reproduces the error I get both from my app as well as the original wxEMailPop3 client attempting to download existing messages from the same POP3 server.
Essentially, one can say it happens in the original code from wxEMail.

Re: Regex problem with UTF-8 string in mime decoding

Posted: Wed Feb 08, 2017 5:29 pm
by Widgets
In looking through the code some more, I realized that in the function wxRfc2047::Decode(), the code for Q decoding looked very similar to that for B decoding, with one difference.
In the section for Q decoding, there was a bit of extra code which was not present in the B decode section and, what is more to the point, it concerned the replacement of any '&' with an escape sequence prior to the call to ReplaceFirst().

Once I added equivalent code to the B decoding section, the problem was fixed and the input string is now decoded properly.
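
For anyone who finds this thread later: in a wxRegEx replacement string, '&' is a synonym for \0 (the whole match), and a backslash can be used to quote itself or '&'. That is why the decoded text "... Oceans & Inferno" pulled the original encoded word back into the result. Below is a minimal, library-free sketch of the escaping step. The names EscapeReplacement and ExpandReplacement are made up for this illustration and are not part of wxEMail or wxWidgets; ExpandReplacement is only a simplified model of the documented behaviour, not the real implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of wxEMail): quote the characters that
// have a special meaning in a wxRegEx replacement string.
std::string EscapeReplacement(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (char c : text)
    {
        if (c == '\\' || c == '&')
            out += '\\';   // prefix with a backslash so it is taken literally
        out += c;
    }
    return out;
}

// Simplified model of how a replacement string is expanded:
// '&' and "\0" stand for the whole match, and a backslash quotes
// '\' or '&'. Back references \1..\9 are not modelled in this sketch.
std::string ExpandReplacement(const std::string& repl,
                              const std::string& wholeMatch)
{
    std::string out;
    for (std::size_t i = 0; i < repl.size(); ++i)
    {
        const char c = repl[i];
        if (c == '&')
        {
            out += wholeMatch;          // unescaped '&' -> entire match
        }
        else if (c == '\\' && i + 1 < repl.size())
        {
            const char next = repl[++i];
            if (next == '0')
                out += wholeMatch;      // "\0" -> entire match
            else
                out += next;            // "\\" -> '\', "\&" -> '&'
        }
        else
        {
            out += c;
        }
    }
    return out;
}
```

In Decode() the same idea would presumably be applied to str_content just before the call to pattern.ReplaceFirst(), for example with two wxString::Replace() calls: first every backslash to a double backslash, then every '&' to a backslash followed by '&'. The order matters, since escaping '&' first would cause the newly added backslashes to be doubled in the second pass.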