Chinese text conversion

vanarieleyen
Knows some wx things
Posts: 47
Joined: Thu Aug 29, 2019 3:55 am
Location: China, Shenzhen

Chinese text conversion

Post by vanarieleyen » Wed Jan 08, 2020 7:33 am

Hello,

I have a file named
ÀðÓãÔ¾ÁúÃÅ3.bmp
in a Windows 8 OS
and a data file that references this file as
鲤鱼跃龙门3.bmp
My program reads this data file and wants to open the image file and fails because it can't find the file.

Apparently I need to convert the filename in the data file to match the name of the file on the hd.

How do I do this?

doublemax
Moderator
Posts: 14786
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Chinese text conversion

Post by doublemax » Wed Jan 08, 2020 3:06 pm

ÀðÓãÔ¾ÁúÃÅ3.bmp
This looks like UTF-8 encoding.

Code:

const char test[] = "ÀðÓãÔ¾ÁúÃÅ3.bmp";
wxString s = wxString::FromUTF8(test);
wxLogMessage(s);
Try this and check whether it produces the same result.
Use the source, Luke!


Re: Chinese text conversion

Post by vanarieleyen » Thu Jan 09, 2020 1:03 am

The suggestion you gave didn't work: I get an empty string, so it probably isn't UTF-8.

However, after more searching I found a very handy website on which I could try out several conversions:
http://string-functions.com/encodedecode.aspx

After trying out some encodings I got the correct result using:
input: iso-8859-1
output: gb2312

To get the correct output I use the following code:

Code:

wxString str("ÀðÓãÔ¾ÁúÃÅ3.bmp", wxCSConv(wxT("gb2312")));
However, I need it the other way around, and that doesn't seem to work.

In the data file I have the file names in Chinese. Each name needs to be encoded to iso-8859-1; to do this I use the following code:

Code:

wxString iso(layer.props.at("FileName").utf8_str(), wxCSConv(wxT("iso-8859-1")));
wxLogDebug(iso);
However, the result of this conversion is (which seems to be UTF-8):
鲤鱼跃龙门3.bmp
and it should be:
ÀðÓãÔ¾ÁúÃÅ3.bmp
When I do this conversion on the website that I mentioned, it works fine. How should I do this in wxWidgets?
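A note on the empty string reported above: wxString::FromUTF8 is documented to return an empty string when its input is not valid UTF-8. Assuming the string literal reached the compiler as the single-byte values 0xC0 0xF0 0xD3 ... (for example because the source file was saved in a Western code page; this is an assumption, the thread does not say), the very first byte already rules UTF-8 out, because 0xC0 can never start a UTF-8 sequence. The sketch below uses only the standard library. looksLikeUtf8 is a hypothetical helper, not a wxWidgets function, and it checks structure only (it does not reject every overlong or surrogate encoding).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `bytes` has valid UTF-8 structure.
// Simplified check: lead byte ranges and continuation bytes only.
bool looksLikeUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80) {
            extra = 0;          // plain ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
        } else {
            // 0x80..0xBF (stray continuation), 0xC0, 0xC1, 0xF5..0xFF
            return false;
        }

        // The sequence must not run past the end of the input.
        if (extra > bytes.size() - i - 1) {
            return false;
        }

        // Every continuation byte must look like 10xxxxxx.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;
            }
        }
        i += extra + 1;
    }
    return true;
}
```

With the byte values assumed above, looksLikeUtf8 rejects the name at its first byte, which is consistent with FromUTF8 giving back an empty string.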


Re: Chinese text conversion

Post by vanarieleyen » Thu Jan 09, 2020 8:43 am

I spent a lot of time investigating this problem and have come to the following conclusion:

The string that I want to convert to iso-8859-1 is encoded in gb2312.
gb2312 takes 2 bytes for each Chinese character.

To convert from gb2312 to iso-8859-1 I need to do something like:

Code:

wxString output( input.mb_str(wxConvGB2312), wxCSConv("iso-8859-1") );
However, there is no wxConvGB2312 implemented.

Is there another way to accomplish this?


Re: Chinese text conversion

Post by doublemax » Thu Jan 09, 2020 8:51 am

Once you have decoded GB2312 (or any other encoding) data into a wxString, it doesn't have a specific encoding any more. It's just Unicode.

Try this:

Code:

wxString input( wxT("鲤鱼跃龙门3.bmp") );
wxCharBuffer buf = input.mb_str(wxConvGB2312);
You should not store the result of the encoding in a wxString.


Re: Chinese text conversion

Post by vanarieleyen » Thu Jan 09, 2020 9:12 am

Hello doublemax,

The problem is that wxConvGB2312 is not defined in strconv.h; I can only use wxConvUTF8.

The issue is that I have the filename in a data file, and this filename contains Chinese characters. It refers to a file whose name (on my PC) is shown in iso-8859-1. When I try to open this file using the filename from the data file, I get a file-not-found error.

So what I am trying to do is first test whether the file is found; if not, I convert the name in the data file from gb2312 to iso-8859-1 and use the resulting name to open the file.

Some background:
My working environment has both Chinese and English PCs. Whenever I receive something from my Chinese colleagues I have this problem. On their PCs the filenames are shown correctly; it is when the files are copied to my (English) Windows system that the filenames are automatically renamed to the iso-8859-1 version.


Re: Chinese text conversion

Post by doublemax » Thu Jan 09, 2020 9:19 am

The problem is that wxConvGB2312 is not defined in strconv.h
I just assumed that it existed. If not, use wxCSConv(wxT("gb2312")) like you already did.

It's a little bit messy to have the encoded data in a wxString, but this should work:

Code:

    wxString input( wxT("鲤鱼跃龙门3.bmp") );
    wxCharBuffer buf = input.mb_str(wxCSConv(wxT("gb2312")));

    wxString filename = wxString::From8BitData( buf );
    wxLogMessage(filename);
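For readers wondering why this works: the GB2312 encoding of 鲤鱼跃龙门 is the ten bytes C0 F0 D3 E3 D4 BE C1 FA C3 C5, and wxString::From8BitData treats each byte of its input as the ISO-8859-1 character with the same code point. That is exactly how the name ended up as ÀðÓãÔ¾ÁúÃÅ on the English system. The sketch below reproduces that second step with only the standard library. latin1BytesToUtf8 and kGb2312Name are hypothetical names for illustration, not wxWidgets API, and the byte array is hard-coded in place of the mb_str(wxCSConv(...)) call. An English Windows system most likely uses code page 1252 and not strict ISO-8859-1, but the two agree for every byte in this particular name.

```cpp
#include <string>

// Hypothetical helper: reinterpret every byte as the Latin-1 character with
// the same code point and return the result encoded as UTF-8.
std::string latin1BytesToUtf8(const std::string& raw)
{
    std::string out;
    out.reserve(raw.size() * 2);
    for (const char ch : raw) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            // ASCII is identical in Latin-1 and UTF-8.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF take two bytes in UTF-8: 110000xx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

// GB2312 bytes of the Chinese part of the file name, followed by ASCII
// "3.bmp". The literal is split so that "3" is not parsed as part of the
// preceding hex escape.
const std::string kGb2312Name =
    "\xC0\xF0\xD3\xE3\xD4\xBE\xC1\xFA\xC3\xC5" "3.bmp";
```

Calling latin1BytesToUtf8(kGb2312Name) yields the UTF-8 text of ÀðÓãÔ¾ÁúÃÅ3.bmp, the name as it appears on the English system.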


Re: Chinese text conversion

Post by vanarieleyen » Fri Jan 10, 2020 3:19 am

Max, I now know why you are called DoubleMax!

This worked perfectly, thank you :D :D :D
