Converting failure of a text file (from utf8 to unicode). Topic is solved

PATRICKMULOT
Earned a small fee
Posts: 15
Joined: Tue Apr 25, 2017 6:07 pm

Converting failure of a text file (from utf8 to unicode).

Post by PATRICKMULOT »

Hello all,
I get an error when I open a text file and try to convert it from UTF-8 to Unicode.

Here is my code:

Code:

command = pmBasesDir + wxFILE_SEP_PATH + selectedBase + wxT( ".ged" );
wxTextFile base;
if ( base.Open( command, wxMBConvUTF8() ) )
{
 ...
}
When I open the file without wxMBConvUTF8(), I can read it without error, but the content (French characters) is not converted.
When I open it with wxMBConvUTF8(), I get this message:
Failed to convert file "/home/patrick/gw-7.00-alpha-linux/bases/givry.ged" to unicode
If someone can help me, thanks in advance.
I currently work with wxWidgets 3.0.3, C++ and wxSqlite3.

Patrick.
ASUS K73SV - Intel Core i7-2630QM CPU @ 2.00GHz × 8 - Ram 4 Gb - HDD 320+500 Gb
Multi-Boot : Ubuntu 16.04 LTS 64 bits - Ubuntu 14.04 LTS 64 bits
VirtualBox : Windows XP SP3 32 bits - Windows 7 premium 64 bits
Patrick MULOT.
Lenovo-V17-G2-ITL - 11th Gen Intel Core i5-1135G7 CPU @ 2.40GHz × 8 - Ram 16 Gb - SSD 512 Go + 2 x 1 To
Ubuntu 22.04 LTS 64 bits - wxWidgets 3.2.2.1 - C++ - wxSqlite3
doublemax
Moderator
Posts: 19117
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Converting failure of a text file (from utf8 to unicode).

Post by doublemax »

Most likely the file is not UTF-8 encoded or it contains an illegal byte sequence. Can you post the file?
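One way to test the "illegal byte sequence" hypothesis is to scan the raw bytes and report where the first malformed UTF-8 sequence starts. The following is only a sketch in standard C++, independent of wxWidgets. The function name first_invalid_utf8 is made up for illustration, and the file contents are assumed to be already loaded into a std::string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the byte offset of the first malformed UTF-8 sequence in `bytes`,
// or std::string::npos if the whole buffer is well-formed UTF-8.
std::size_t first_invalid_utf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 <= 0x7F) { ++i; continue; }             // plain ASCII

        std::size_t len = 0;       // total length of the sequence
        unsigned char lo = 0x80;   // allowed range of the first continuation byte
        unsigned char hi = 0xBF;
        if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // rejects overlong forms
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }   // rejects surrogates
        else if (b0 >= 0xE1 && b0 <= 0xEF) len = 3;
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // rejects overlong forms
        else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // stays within U+10FFFF
        else return i;             // stray continuation byte or invalid lead byte

        if (i + len > n) return i; // sequence is cut off at the end of the buffer
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            const unsigned char low  = (k == 1) ? lo : 0x80;
            const unsigned char high = (k == 1) ? hi : 0xBF;
            if (b < low || b > high) return i;
        }
        i += len;
    }
    return std::string::npos;
}

// Usage sketch with hand-made byte strings (not taken from the posted file).
void first_invalid_utf8_examples()
{
    assert(first_invalid_utf8("F\xC3\xA9licit\xC3\xA9") == std::string::npos); // UTF-8
    assert(first_invalid_utf8("F\xE9licit\xE9") == 1u);                        // ISO-8859-1 bytes
}
```

A result other than npos only says that the data is not clean UTF-8 from that offset on. It does not say which encoding was actually used.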
Use the source, Luke!

Re: Converting failure of a text file (from utf8 to unicode).

Post by PATRICKMULOT »

Hi doublemax,
Here is part of the file.
Thank you for your help.
Attachment: partOfGivry.txt (14.86 KiB)

Re: Converting failure of a text file (from utf8 to unicode).

Post by doublemax »

If you load this into any UTF-8 capable text editor, you'll see immediately that this is not a proper UTF-8 file. It looks a little bit like a UTF-8 text that was read using a local encoding (leaving the utf-8 byte codes unchanged) and then saved as UTF-8 (encoding the utf-8 codes a second time).
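To illustrate the double encoding described above, here is a small sketch in standard C++. It assumes the "local encoding" was ISO-8859-1, which is only a guess, and the function name encode_latin1_as_utf8 is made up for illustration. Each input byte is treated as one Latin-1 character and written out as UTF-8 again.

```cpp
#include <cassert>
#include <string>

// Re-encodes every byte of `bytes` as if it were an ISO-8859-1 character.
// This is what happens when UTF-8 text is read as Latin-1 and then saved as UTF-8.
std::string encode_latin1_as_utf8(const std::string& bytes)
{
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);                 // ASCII stays a single byte
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte: 0xC2 or 0xC3
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation byte
        }
    }
    return out;
}

// Usage sketch: the two bytes of "é" (C3 A9) become four bytes (C3 83 C2 A9).
void encode_latin1_as_utf8_example()
{
    assert(encode_latin1_as_utf8("\xC3\xA9") == "\xC3\x83\xC2\xA9");
}
```

Seeing pairs such as C3 83 C2 A9 in a hex editor where a single accented letter is expected is a typical sign of this kind of damage.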

Re: Converting failure of a text file (from utf8 to unicode).

Post by PATRICKMULOT »

Hi doublemax,
How can I find out how this file is encoded? Is there software that can determine the encoding of a file?
If it is ANSI or ISO 8859-1, what should I use instead of wxMBConvUTF8? I am not very experienced with C++ or wxWidgets.

Re: Converting failure of a text file (from utf8 to unicode).

Post by doublemax »

Actually, I think the file is broken. Where does it come from? Is there any other application that can load and display it correctly? How did you create the partial file that you uploaded?
coderrc
Earned some good credits
Posts: 141
Joined: Tue Nov 01, 2016 2:46 pm

Re: Converting failure of a text file (from utf8 to unicode).

Post by coderrc »

I had a similar issue with Linux being unable to read files from Windows.
Adding new locales did the trick for me.
To add ISO 8859, I edited the locale files:

/etc/locale.alias
add
american en_US.ISO-8859-1
just above the bokmal line

/etc/locale.gen
uncomment the lines (vim: 152gg)
en_US ISO-8859-1
en_US.ISO-8859-15 ISO-8859-15

Run the command
locale-gen

Check that ISO 8859 is now available:
locale -a

You will likely need a different language locale, so check your input file in a hex editor to see how it is encoded, then add that locale to your Linux machine.

Then, in the C++ code, use the standard converter first; if it throws, use a named locale converter.

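Code:

#include <codecvt>	// std::codecvt_utf8 (deprecated since C++17)
#include <cwchar>	// std::mbstate_t
#include <locale>	// std::codecvt, std::codecvt_byname, std::wstring_convert
#include <sstream>
#include <string>

// Facet destructors are protected; a public one lets std::wstring_convert own the facet.
template <class internal_type, class external_type, class state_type>
struct my_codecvt : std::codecvt<internal_type, external_type, state_type>
{
	~my_codecvt()	{ }
};

// Same idea for any facet, inheriting the constructors (needed for codecvt_byname).
template <class Facet>
struct deletable_facet : Facet
{
	using Facet::Facet;
	~deletable_facet()	{ }
};

Code:

// to_wide is a hypothetical wrapper name; the original post showed only the body.
std::wstring to_wide(const char* szAscii)
{
	std::wstring retval;
	try {
		try {
			// 1. Default codecvt facet.
			std::wstring_convert<my_codecvt<wchar_t, char, std::mbstate_t>> converter;
			retval = converter.from_bytes(szAscii);
		}
		catch (...)
		{
			try {
				// 2. UTF-8.
				std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> converter;
				retval = converter.from_bytes(szAscii);
			}
			catch (...) {
				// 3. Named locale; it must be installed (check with locale -a).
				std::wstring_convert<deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>> converter(new deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>("en_US.iso885915"));
				retval = converter.from_bytes(szAscii);
			}
		}
	}
	catch (...) {
		// Last resort: widen the bytes one by one.
		std::wstringstream cls;
		cls << szAscii;
		retval = cls.str();
	}
	return retval;
}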

Re: Converting failure of a text file (from utf8 to unicode).

Post by PATRICKMULOT »

Hi doublemax,
The text file is generated by the GeneWeb software (French genealogy software). Normally the extension of the generated file is ".ged", and it is readable by all word processors. I sent it with the extension ".txt" because the forum software does not accept the extension ".ged". In addition I sent only part of the file (generated with Gedit) because it makes 1.6Mo.
I have just tested with LibreOffice 5.1.6.2: some characters are displayed correctly, others are not. For example, "Félicité GÉNIN" is displayed as "Félicit?? GÉNIN". The two question marks are white on a black background.

Re: Converting failure of a text file (from utf8 to unicode).

Post by doublemax »

"In addition I sent only part of the file (generated with Gedit) because it makes 1.6Mo."
How exactly did you extract that part?

Can the original file be downloaded from somewhere as a whole?

Edit: After reading up on the topic, is it possible that the file is ANSEL encoded? If yes, wxWidgets does not have a built-in decoder for this.

Re: Converting failure of a text file (from utf8 to unicode).

Post by PATRICKMULOT »

To extract it, I opened the original file in the Gedit editor, selected some of the records, and saved the selection to a new text file.
I have compressed the original file into 'Givry.ged.zip', so it is a small file now (195.7 KB instead of 1.6 MB). The zip file contains a file named 'Givry.ged'. It is readable with any word processor.
Attachment: Givry.ged.zip (191.08 KiB)

Re: Converting failure of a text file (from utf8 to unicode).

Post by doublemax »

After some more research, I now have a general idea of what's happening.

First of all, the new, complete file looks much better when loaded into a generic editor; something must have gone wrong when you saved it only partially. But when loaded as a whole, it still contains some invalid UTF-8 sequences.
ged.png (43.92 KiB)
See the C3/A0 and C3/A9 in the screenshot. Normally these would be valid UTF-8 sequences (à and é), but in the file they're divided by the line break and the "3 CONC" header. This means you probably have to load the file as "raw" bytes, parse it, and decode the individual parts only after you have extracted them.
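The "load raw, join, then decode" idea could look roughly like the sketch below, in standard C++ without wxWidgets. It makes several assumptions: the file has already been split into lines with the line terminators removed, a continuation line has the form "<level> CONC <text>" with a single space before the text, and the helper names conc_payload and join_conc_lines are made up for illustration. Nothing is decoded here. The bytes are only glued back together so that a split sequence such as C3 / A9 becomes contiguous again.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// If `line` is a continuation line such as "3 CONC text", stores the raw text
// in `payload` and returns true. Otherwise returns false.
bool conc_payload(const std::string& line, std::string& payload)
{
    std::size_t i = 0;
    while (i < line.size() && std::isdigit(static_cast<unsigned char>(line[i]))) ++i;
    if (i == 0 || line.compare(i, 5, " CONC") != 0) return false;
    i += 5;
    if (i < line.size() && line[i] != ' ') return false;   // some other tag, e.g. "CONCX"
    payload = (i < line.size()) ? line.substr(i + 1) : std::string();
    return true;
}

// Appends each CONC payload to the preceding line, byte for byte.
std::vector<std::string> join_conc_lines(const std::vector<std::string>& raw_lines)
{
    std::vector<std::string> joined;
    std::string payload;
    for (const std::string& line : raw_lines) {
        if (!joined.empty() && conc_payload(line, payload))
            joined.back() += payload;
        else
            joined.push_back(line);
    }
    return joined;
}

// Usage sketch with hand-made records: the "é" split as C3 | A9 is repaired.
void join_conc_lines_example()
{
    const std::vector<std::string> raw = { "2 NOTE Mari\xC3", "3 CONC \xA9 \xC3\xA0 Givry", "1 SEX F" };
    const std::vector<std::string> joined = join_conc_lines(raw);
    assert(joined.size() == 2);
    assert(joined[0] == "2 NOTE Mari\xC3\xA9 \xC3\xA0 Givry");
}
```

After this step, each joined line can be handed to a strict UTF-8 decoder. CONT records, which stand for a real line break, are left untouched by this sketch.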

If you don't care about 100% correct decoding and only want to load the file "somehow", you can tell the UTF-8 decoder to be less strict:

Code:

if ( base.Open( command, wxMBConvUTF8( wxMBConvUTF8::MAP_INVALID_UTF8_TO_PUA ) ) )
You can also try wxMBConvUTF8::MAP_INVALID_UTF8_TO_OCTAL for an alternative behavior.

Re: Converting failure of a text file (from utf8 to unicode).

Post by PATRICKMULOT »

Hi doublemax,
I tried the first solution, and my problem is solved. In fact, for now, I do not need the 'NOTE', 'CONT' and 'CONC' records.
All surnames and first names are now converted correctly.
I will notify the software maintainers of this problem so that it can be fixed.
Thank you so much for your help. I have marked this topic as 'SOLVED'.