Experience with running ExifTool and Exiv2 as external command-line utilities to extract, display, or edit metadata, mainly from JPEGs, has persuaded me that what comes back from them is not always a UTF-8 string. This is especially true for images whose metadata includes, for my purposes, mainly names, locations, and descriptions in some of the more common European languages. Of particular interest is Germany, though I must at least consider some of the surrounding countries as well.
Some of these tests were done in the plain DOS shell, in a modified DOS shell supposedly able to handle UTF-8, and in a number of PowerShell variants. None, IIRC, worked even close to well with the images I tested.
In images where I have added or edited the data myself, that is less of a problem, since I have control over the encoding, though even there I am still learning.
Interfacing with ExifTool presents a specific problem: to avoid the delay of loading the executable for every call, it has a feature (-stay_open) that allows it to remain resident, but that makes it necessary to use code derived from the piped exec sample app, rather than a simple call to wxExecute, which gathers the data returned from the utility by itself. (Though, your comments here may explain some of the issues I had when I was using wxExecute early in development.)
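For reference, in -stay_open mode ExifTool reads its arguments from a file or pipe (via -@), one argument per line, with -execute terminating each batch, and it marks the end of each response with a "{ready}" line. The pipe plumbing itself is the wxWidgets-specific part, but the framing logic can be sketched in plain C++ (the function names here are my own, not ExifTool's):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build one command block for "exiftool -stay_open True -@ -":
// one argument per line, terminated by -execute.
std::string BuildCommand(const std::vector<std::string>& args)
{
    std::string block;
    for (const auto& a : args)
        block += a + "\n";
    block += "-execute\n";
    return block;
}

// ExifTool signals the end of each response with a line
// containing "{ready}"; keep reading the pipe until it appears.
bool ResponseComplete(const std::string& buf)
{
    return buf.find("{ready}") != std::string::npos;
}
```

With this framing, the resident process is driven by writing one BuildCommand() block to its stdin and accumulating stdout until ResponseComplete() is true.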
When I looked over the code from the piped frame example, I was concerned that the 4 KB stack-resident character buffer would not be sufficient for some of the lengthy output I was expecting, so I felt I had to manage that part differently from the example, which pumps the data directly (and without any conversion) to a text control. Even experimenting with code derived from that sample, using ExifTool as the invoked executable with a variety of options, showed some issues, though I cannot recall all the details without revisiting those tests for confirmation.
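One way around a fixed stack buffer is to keep the 4 KB chunk but accumulate into a std::string, which grows as needed. A minimal sketch, with a generic istream standing in for the process's output stream:

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Read everything from a stream in fixed-size chunks,
// appending to a growing std::string instead of relying
// on a single stack buffer being large enough.
std::string ReadAll(std::istream& in)
{
    std::string out;
    char chunk[4096];
    while (in.read(chunk, sizeof(chunk)) || in.gcount() > 0)
        out.append(chunk, static_cast<size_t>(in.gcount()));
    return out;
}
```

The same pattern applies when polling a wxInputStream: read into a small buffer, append to the string, repeat until the stream is exhausted.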
Converting everything incoming to UTF-8 at the lowest level had me concerned because, unless I can assume that the application delivers all data in UTF-8 encoding, or can force the application to make that assumption correct, I will inevitably end up treating data as UTF-8 that was originally in some other encoding, most likely an ISO 8859 variant.
In the worst case, I will likely have to resort to a hex editor to confirm any suspicions.
At this stage, I have to accept your comment that "a wxString should only contain Unicode characters".
What had me confused was that the wxString documentation lists a number of functions under "wxString can be converted to:" and "Can be created from:".
Adapting to this new outlook will take some work, but it still leaves me with the question of how to handle potential input that is not in UTF-8 encoding after the conversion in the low-level routine. Do I need to verify that the incoming text is valid UTF-8 (and how), and raise an exception if it isn't? ...?
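One option, instead of raising an exception, is to check validity and fall back: wxString::FromUTF8 already returns an empty string when the bytes are not valid UTF-8, which can itself serve as the check. The idea can be sketched in plain C++ with a hand-rolled validator and an assumed ISO-8859-1 fallback (reasonable for German text, since Latin-1 bytes such as 0xE4 for 'ä' are never valid UTF-8 on their own):

```cpp
#include <cassert>
#include <string>

// Minimal UTF-8 structural check: verifies lead/continuation
// byte patterns (does not reject overlong forms or surrogates).
bool IsValidUtf8(const std::string& s)
{
    size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        int extra;
        if      (c < 0x80)           extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;  // stray continuation or invalid lead byte
        if (i + extra >= s.size()) return false; // truncated sequence
        for (int k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;
        i += extra + 1;
    }
    return true;
}

// Fallback: reinterpret the bytes as ISO-8859-1 and re-encode as
// UTF-8 (every Latin-1 byte value maps to the same code point).
std::string Latin1ToUtf8(const std::string& s)
{
    std::string out;
    for (char ch : s) {
        unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

In wxWidgets terms, the equivalent would be: try wxString::FromUTF8 first, and if it comes back empty for non-empty input, construct the string with a Latin-1 converter instead. Whether ISO-8859-1 is the right fallback for your images is of course the assumption that a hex editor would have to confirm.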
In any case, I very much appreciate your time and explanation.