Regex problem with UTF-8 string in mime decoding
Posted: Tue Feb 07, 2017 3:31 am
I am using the wxCode version of the wxEmail package (wxEMail-1.0-Beta-2) in a project of mine, compiling under MSVC 2015 Community Edition using the v140_xp tool set and statically linked wxWidgets 3.1.0 libraries. (Edit: v100 tool set, the same as compiling under MSVC 2010.)
The called code below is taken from the wxEmail package, with the Q decoding part removed for clarity since it won't be used. For the first input string I get the expected output, but for the second string the code goes into an infinite loop and hangs.
The code which seems to give me issues looks like:
calling test routine:
// UTF-8
wxString wsT2 = _T("Dom.eu - Raport z eksportu ogłoszeń nieruchomości");
wsT = _T("=?utf-8?B?RG9tLmV1IC0gcmFwb3J0IHogZWtzcG9ydHUgb2fFgm9zemXFhCBuaWVydWNob21vxZtjaQ==?=");
wsDecoded = wxRfc2047::Decode( wsT );
CPPUNIT_ASSERT( wsDecoded.IsSameAs( _T("Dom.eu - raport z eksportu ogłoszeń nieruchomości") ) );
wsT = _T("=?UTF-8?B?VGhpcyB3ZWVrIE9uIERlbWFuZCDigJMgVGhlIExpZ2h0IEJldHdlZW4gT2NlYW5zICYgSW5mZXJubw==?=");
wsDecoded = wxRfc2047::Decode( wsT );
CPPUNIT_ASSERT( wsDecoded.IsSameAs( _T("This week On Demand – The Light Between Oceans & Inferno") ) );
called code:
wxString wxRfc2047::Decode(const wxString& encoded_str)
{
    /* Initialise decoded string */
    wxString decoded_str = encoded_str;
    /* Remove all white spaces between encoded parts, if any */
    {
        wxRegEx pattern(_T("(=\\077[[:alnum:].\\055_]+\\077[qQbB]\\077[^\\077]*\\077=)[[:blank:]]+(=\\077[[:alnum:].\\055_]+\\077[qQbB]\\077[^\\077]*\\077=)"), wxRE_ADVANCED);
        pattern.ReplaceAll(&decoded_str, _T("\\1\\2"));
    }
    /* Perform Q decoding: block removed here for clarity, not used */
    /* Perform B decoding */
    {
        /* Check if we have pattern */
        wxRegEx pattern(_T("=\\077([[:alnum:].\\055_]+)\\077[bB]\\077([^\\077]*)\\077="), wxRE_ADVANCED);
        while ((pattern.Matches(decoded_str)) &&
               (pattern.GetMatchCount() == 3))
        {
            /* Extract the encoded string */
            wxString str_content = pattern.GetMatch(decoded_str, 2);
            /* Perform a base64 decoding of the string */
            std::vector<unsigned char> buffer;
            std::string std_string = (const char*)str_content.mb_str(wxConvLocal);
            mimetic::Base64::Decoder b64;
            mimetic::decode(std_string.begin(),
                            std_string.end(),
                            b64,
                            std::back_inserter(buffer));
            /* Flush content in a string */
            str_content = _T("");
            for (unsigned char* p = &buffer[0]; p <= &buffer[buffer.size()-1]; p++)
            {
                str_content.Append(*p, 1);
            }
            /* Convert to local charset, if necessary */
            str_content = wxCharsetConverter::ConvertCharset(str_content, pattern.GetMatch(decoded_str, 1)); // << point A
            /* Replace in result */
            wxString wsContent = str_content; // try to convert to wxWidgets before call to check result
            // pattern.ReplaceFirst(&decoded_str, str_content);
            pattern.ReplaceFirst(&decoded_str, wsContent); // <<<<<<< problem seems to start here
        }
    }
    /* Return the decoded string */
    return decoded_str;
}
With the second string, I get the properly decoded result at point A.
After calling pattern.ReplaceFirst(&decoded_str, wsContent); I get:
decoded_str = "This week On Demand – The Light Between Oceans =?UTF-8?B?VGhpcyB3ZWVrIE9uIERlbWFuZCDigJMgVGhlIExpZ2h0IEJldHdlZW4gT2NlYW5zICYgSW5mZXJubw==?= Inferno"
It very much looks like the problem is with the '&' in wsContent but, not being a wxRegEx or regex guru, I am completely lost as to how to fix it for the general case.
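For reference, the wxRegEx documentation describes '&' in a replacement string as a synonym for \0, the entire match, and says a backslash may be used to quote it. The block below is only a toy model of that rule in plain C++, not wxWidgets code: the name expand_template is made up for illustration and numbered references (\1, \2, ...) are left out. It shows how an unquoted '&' in the decoded text would pull the whole encoded word back into the result, which is the shape of the decoded_str shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model (hypothetical, NOT the wxWidgets implementation) of expanding a
// replacement template: an unquoted '&' stands for the whole match, and a
// backslash quotes the character that follows it.
std::string expand_template(const std::string& tmpl, const std::string& whole_match)
{
    std::string out;
    for (std::size_t i = 0; i < tmpl.size(); ++i)
    {
        if (tmpl[i] == '\\' && i + 1 < tmpl.size())
        {
            out += tmpl[++i];      // quoted character is copied literally
        }
        else if (tmpl[i] == '&')
        {
            out += whole_match;    // unquoted '&' re-inserts the matched text
        }
        else
        {
            out += tmpl[i];
        }
    }
    return out;
}

// Small self-check using a shortened stand-in for the encoded word.
void check_expand_template()
{
    // Unquoted '&': the encoded word reappears inside the "decoded" result.
    assert(expand_template("Oceans & Inferno", "=?X?=") == "Oceans =?X?= Inferno");
    // Quoted '&': the ampersand survives as plain text.
    assert(expand_template("Oceans \\& Inferno", "=?X?=") == "Oceans & Inferno");
}
```

If this model matches what wxRegEx does here, the encoded word is written back on every pass, so the pattern matches again each time.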
Because the original un-decoded string is still part of the output, the test at the top of the loop succeeds again and around the loop we go, over and over.
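One possible way out, assuming the '&' really is being treated as a reference to the whole match, would be to quote the special characters in the decoded text before handing it to ReplaceFirst. The sketch below shows the idea on std::string so it can be tried without wxWidgets; escape_replacement is a hypothetical helper name, not part of wxEmail or wxWidgets.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: put a backslash in front of every backslash and every
// '&' so that a replacement template inserts the text literally.
std::string escape_replacement(const std::string& text)
{
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        if (c == '\\' || c == '&')
        {
            out += '\\';           // quote the special character
        }
        out += c;
    }
    return out;
}

// Small self-check of the escaping rule.
void check_escape_replacement()
{
    assert(escape_replacement("Oceans & Inferno") == "Oceans \\& Inferno");
    assert(escape_replacement("a\\1b") == "a\\\\1b");   // "\1" can no longer act as a back-reference
    assert(escape_replacement("no specials") == "no specials");
}
```

In the Decode routine the same effect could probably be had with two wxString::Replace calls on wsContent just before ReplaceFirst, backslashes first: wsContent.Replace(_T("\\"), _T("\\\\")); and then wsContent.Replace(_T("&"), _T("\\&"));. The order matters, otherwise the backslash added in front of '&' would itself get doubled. I have not tested this against wxEmail.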
Any help will be most appreciated.
TIA
Arnold