Problem (sometimes) reading utf-8 file

squinn
Knows some wx things
Posts: 28
Joined: Tue May 16, 2023 8:46 pm

Problem (sometimes) reading utf-8 file

Post by squinn »

Hi,
In my code, I have a place where I read in an arbitrary, user-specified file. This works fine on ASCII files, but quietly does nothing on UTF-8 files that contain "funky" non-ASCII characters (e.g., an upside-down question mark in place of an apostrophe). If I strip out all of the funky characters, the file reads in fine. So I wrote a very simple program (posted below) to isolate the problem, and... everything works: it reads the "funky" file with no problems. I included all the same headers in the simple program as in my real program, but otherwise made no wxWidgets calls. The calls for opening and reading the files are identical to those in my real program.

So, I'm confused. I'm thinking that perhaps somewhere along the way in my real program, wxWidgets is doing something with the locale setting, but that's just a total guess. Any help is much appreciated.
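One way to test the locale guess is to log the C locale in effect at the point where the file is read, in both programs, and compare the two values. The sketch below uses only the standard library; the helper name `current_c_locale` is hypothetical and not part of wxWidgets.

```cpp
#include <clocale>
#include <string>

// Returns the name of the C locale currently in effect for all categories.
// Passing nullptr to std::setlocale queries the locale without changing it.
// (Hypothetical helper for illustration; not a wxWidgets function.)
std::string current_c_locale() {
    const char* name = std::setlocale(LC_ALL, nullptr);
    return name ? std::string(name) : std::string("(unknown)");
}
```

Printing `current_c_locale()` just before the read in the simple program and inside the button handler of the real app would show whether the two programs actually run under different locales.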

I've attached the simple program (that works, funky chars or no).
crap.cpp
(698 Bytes) Downloaded 39 times
ONEEYEMAN
Part Of The Furniture
Posts: 7449
Joined: Sat Apr 16, 2005 7:22 am
Location: USA, Ukraine

Re: Problem (sometimes) reading utf-8 file

Post by ONEEYEMAN »

Hi,
Can you reproduce the problem in the minimal sample?

If you can, could you post a diff against the minimal sample that reproduces the problem?

Thank you.

Re: Problem (sometimes) reading utf-8 file

Post by squinn »

Thank you for the quick response (on a weekend!).

No, I was unable to reproduce the "failed" load of a UTF-8 file in the minimal sample (actually, it doesn't fail; it quietly does nothing).

I should have included the actual code from my real app. It is completely without context, so I wasn't sure whether it would confuse things, but I'm including it below. This is the callback on a button that "extracts" content from a file. Again, it works fine with pure ASCII files, but quietly doesn't load a UTF-8 file that contains non-ASCII characters (no problem if there are none). Yet the simple program I uploaded has no problem with the non-ASCII file...

--

void MyAddRemoveList::OnExtract(wxCommandEvent& event) {

    wxArrayInt id_to_extract;
    ostringstream ss;

    // Assumes the File list is single select.
    if (list->GetSelections(id_to_extract)) {
        if (list_t == list_types::files) {
            ifstream name(list->GetString(id_to_extract[0]));

            if (!name) {
                wxMessageBox("No file specified, or file empty.", "Error",
                             wxOK | wxICON_INFORMATION);
            }
            else {
                ss << name.rdbuf();
                MyAcceptIgnoreDlg(ss.str());
            }
        }
    }
}

Re: Problem (sometimes) reading utf-8 file

Post by ONEEYEMAN »

Hi,
Did you try to debug it?
Where did the failure occur?

Thank you.

Re: Problem (sometimes) reading utf-8 file

Post by squinn »

So, there was no actual failure. The following line simply returned with an empty result: it effectively skipped reading the file, without reporting any error. Yet if I edit out the non-ASCII characters, it slurps the file in just fine. And, again, the simple program that I included has no problem reading the file with the non-ASCII characters. It's only when I attempt the same kind of read within my real wxWidgets app that it quietly does nothing.

ss << name.rdbuf();
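To tell "the read returned nothing" apart from "the bytes were read but lost later", it can help to read the file as raw bytes and check the stream state and byte count explicitly. This is a minimal standard-library sketch, not the code from the thread: it swaps the `rdbuf()` copy for a chunked `read()` so that I/O errors show up in the stream state, and the name `read_file_bytes` is hypothetical.

```cpp
#include <fstream>
#include <string>

// Reads the whole file as raw bytes, with no character-set conversion.
// Returns false if the file cannot be opened or an I/O error occurs.
// An empty file is a success and leaves `out` empty.
// (Hypothetical helper for illustration.)
bool read_file_bytes(const std::string& path, std::string& out) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return false;  // could not open the file
    }
    out.clear();
    char buf[4096];
    // read() sets failbit on the final short chunk, so gcount() is checked
    // as well to keep the bytes from that last partial read.
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        out.append(buf, static_cast<std::size_t>(in.gcount()));
    }
    return !in.bad();  // badbit signals a real I/O error
}
```

If the function returns true and `out.size()` is non-zero for the file with non-ASCII characters, the read itself is fine and the content is being lost at a later step, for example when the bytes are converted for display.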

Re: Problem (sometimes) reading utf-8 file

Post by ONEEYEMAN »

Hi,
Can you put the code that doesn't work inside the code that works, then build and run it? What happens?

Also, what is your platform and wxWidgets version?

Thank you.

Re: Problem (sometimes) reading utf-8 file

Post by squinn »

I just tried that (putting the code from my real program into the simple program), and it still works, i.e., it still reads the file with the non-ASCII characters with no problem.

I'm running wxWidgets 3.0 on Pop!_OS 22.04. I should have mentioned that I'm pretty new to both C++ and wxWidgets.
Kvaz1r
Super wx Problem Solver
Posts: 357
Joined: Tue Jun 07, 2016 1:07 pm

Re: Problem (sometimes) reading utf-8 file

Post by Kvaz1r »

In that case the problem is somewhere else. Try to create a small piece of code that reproduces the behaviour.
doublemax
Moderator
Posts: 19103
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Problem (sometimes) reading utf-8 file

Post by doublemax »

If you get "funky" characters or "?", it's an encoding problem. And if you're sure that the input data is UTF-8, it means that no UTF-8 decoding took place. However, this usually leads to results that are easy to recognize once you've seen them a couple of times.

Code: Select all

Morgenröte -> MorgenrÃ¶te
The German umlaut "ö" was replaced by the two bytes of its UTF-8 encoding, each displayed as a separate character.

Is that what you're getting?
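The effect can be reproduced with the standard library alone. The sketch below assumes the undecoded bytes are being displayed as Latin-1 (ISO-8859-1), which is one common case but not the only one; the function name `latin1_bytes_as_utf8` is hypothetical.

```cpp
#include <string>

// Treats every byte of `bytes` as one Latin-1 (ISO-8859-1) character and
// re-encodes the result as UTF-8. Feeding it text that is already UTF-8
// produces the typical "double encoded" output, because each byte of a
// multi-byte sequence becomes a character of its own.
// (Hypothetical helper for illustration.)
std::string latin1_bytes_as_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

Passing in the UTF-8 bytes of "Morgenröte" (where "ö" is the two bytes 0xC3 0xB6) yields the UTF-8 encoding of "MorgenrÃ¶te".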
Use the source, Luke!

Re: Problem (sometimes) reading utf-8 file

Post by squinn »

Thanks for the response.

Actually, I'm working with two different examples: one file has an upside-down question mark in place of an apostrophe, and the other has several actual UTF-8 encoded fraction characters (i.e., not just number-slash-number like 1/2).

Both of these files can be read successfully using the pared-down example that I posted here, but neither can be read by essentially the same line of code in my real program. The read doesn't fail; it just quietly moves on, leaving an empty "ss.str()" (from "ss << name.rdbuf();").

I output the "uncorrected" files from my sample program, and the upside-down question mark and the real fractions are there, as expected.

If I edit out or change these characters, the files are then read in with no problem in my real program.

The only difference between the two programs is that my real program uses wxWidgets and makes lots of wx-related calls, whereas the example posted here does not. Well, I am also making some Postgres calls (through libpqxx). Needless to say, I'm pretty lost here...

Re: Problem (sometimes) reading utf-8 file

Post by doublemax »

You need to find the exact line where it fails. If you can't use a debugger to single-step through the code, put log outputs at crucial parts of the code.
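As a rough illustration of what such log output could report for the buffer that was read, the sketch below counts non-ASCII bytes and performs a structural UTF-8 check. Both helper names are hypothetical, and the check is deliberately simplified: it verifies lead and continuation bytes only and does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Counts the bytes that fall outside the 7-bit ASCII range.
// (Hypothetical helper for illustration.)
std::size_t count_non_ascii(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if (c >= 0x80) ++n;
    }
    return n;
}

// Structural UTF-8 check: every lead byte must be followed by the right
// number of continuation bytes (10xxxxxx). Simplified on purpose: overlong
// forms and surrogate code points are not rejected.
bool looks_like_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return false;                       // stray continuation or invalid lead byte
        if (extra > n - i - 1) return false;     // sequence truncated at end of buffer
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

Logging the buffer size together with `count_non_ascii(...)` and `looks_like_utf8(...)` right after the read, and again right before the text is handed to the GUI, narrows down the step at which the content disappears.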

Re: Problem (sometimes) reading utf-8 file

Post by squinn »

Not sure if anyone is still looking at this...

I have written a very simple wxWidgets app that exhibits the same issue as my main program. I'm uploading this along with the simple code I originally uploaded for this question (thankfully renamed). I'm also uploading makefiles for the two programs, and data files.

"badprog" first reads the "gooddata" file and outputs the results, then attempts to read the "baddata" file, but quietly does nothing.

"goodprog" successfully reads both files. The difference between the two is that while both are compiled and linked with wxWidgets, badprog is an actual wxWidgets app, whereas goodprog makes no wxWidgets calls.

If anyone is so inclined, these are both tiny and should compile without problems, and the data file names are embedded, so all one need do is compile and run them. The two data files are identical, except that "baddata" has an upside-down question mark in place of an apostrophe, whereas "gooddata" has the apostrophe. (Note that I had to add .txt to the makefiles and data files in order to upload them.)

I'm sure this will end up not being the case, but at this point, this seems like a wxWidgets issue to me (probably me doing something wrong with wxWidgets?).

Here are the files I will upload:
badprog.cpp
badmake.txt
baddata.txt

(It seems I can only upload 3 files at a time, so I'll upload these in a followup post).
goodprog.cpp
goodmake.txt
gooddata.txt
badprog.cpp
(2.25 KiB) Downloaded 35 times
badmake.txt
(2.45 KiB) Downloaded 25 times
baddata.txt
(264 Bytes) Downloaded 30 times

Re: Problem (sometimes) reading utf-8 file

Post by squinn »

Here are the three "good" files.
goodprog.cpp
(767 Bytes) Downloaded 35 times
goodmake.txt
(2.46 KiB) Downloaded 34 times
gooddata.txt
(263 Bytes) Downloaded 38 times

Re: Problem (sometimes) reading utf-8 file

Post by doublemax »

For me, "badprog" did not silently fail; it displayed the content of both text files, just not UTF-8 decoded.

Code: Select all

  wxString s(buff2.str().c_str(), wxConvUTF8);
  wxMessageBox(s, "utf8-decoded");
Adding this also displays the "baddata" content correctly.

Re: Problem (sometimes) reading utf-8 file

Post by squinn »

OK, thank you! Do you think my problem could be a versioning issue? I'm using wxWidgets 3.0.

I will use your suggestion. Thanks again.