Load from file in unicode Topic is solved

rrcn
Earned a small fee
Posts: 14
Joined: Mon Nov 17, 2008 4:57 pm

Load from file in unicode

Post by rrcn » Wed Feb 11, 2009 6:46 pm

I have a problem when reading text from a file.
In a non-Unicode build I get the right text back, but in a Unicode build the text is wrong. The Unicode build saves the file correctly, but what I read back is not the text that is in the file.

This is what I use to save to and load from a file:

Save:

Code: Select all

class Serializer
{
public:
    Serializer (wxString const & nameFile) : _stream(nameFile)
    {
        if (!_stream.IsOpened ())
            throw "couldn't open file";
    }
    
    ...

    void PutString (wxString const & str)
    {
        int len = str.length ();
        PutLong (len);
        if(!_stream.Write (str, wxConvUTF8))
            throw "file write failed";
    }
    bool Commit()
    {
        return _stream.Commit();
    }
private:
    wxTempFile _stream;
};
Load:

Code: Select all


class DeSerializer
{
public:
    DeSerializer (wxString const & nameFile) : _stream(nameFile)
    {
        if (!wxFile::Exists(nameFile))
            throw "File doesn't exist";
        if (!_stream.IsOpened())
            throw "couldn't open file";
    }
    
    ...
   
    wxString GetString ()
    {
        long len = GetLong ();
        long e;
        wxString str;
        str.resize(len);
        e = _stream.Read (&str [0], len);
        if (e != len)
            throw "file read failed";
        return str;
    }

private:
    wxFile _stream;
};
GetLong and PutLong read a number from the file and write a number to it.
The line: e = _stream.Read (&str [0], len); gets the wrong text.
What am I doing wrong?

Disch
Experienced Solver
Posts: 99
Joined: Wed Oct 17, 2007 2:01 am

Re: Load from file in unicode

Post by Disch » Wed Feb 11, 2009 7:20 pm

rrcn wrote: The line: e = _stream.Read (&str [0], len); gets the wrong text.
What am I doing wrong?
You should refrain from writing to the buffer directly, because the characters in wxString are stored differently depending on the Unicode build settings (i.e. each wxChar might be 1 or 2 bytes wide, whereas the text in your file is always made of 1-byte units).

Instead, read the text into a buffer, then copy the buffer to your wxString so that you can use one of the conversions:

Code: Select all

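    wxString GetString ()
    {
        // len must be the number of UTF-8 bytes stored in the file
        long len = GetLong ();
        char* buf = new char[len];
        long e = _stream.Read(buf,len);
        if(e != len)
        {
          delete[] buf;
          throw "file read failed";
        }

        // buf is not null-terminated, so pass the length explicitly
        wxString str(buf,wxConvUTF8,len);
        delete[] buf;

        return str;
    }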
edit -- fixed memory leak problem ^^

extreme001
I live to help wx-kind
Posts: 192
Joined: Fri Dec 22, 2006 9:17 am
Location: Germany

Post by extreme001 » Wed Feb 11, 2009 8:48 pm

Maybe wxMBConvFile would help in this case...?

Disch
Experienced Solver
Posts: 99
Joined: Wed Oct 17, 2007 2:01 am

Post by Disch » Wed Feb 11, 2009 9:13 pm

extreme001 wrote:Maybe wxMBConvFile would help in this case...?
wxMBConvFile is for converting to whatever format the platform recognizes for file names (i.e. it concerns the name you would pass to wxFile::Open, not the contents of the file). It wouldn't really help in this case because he's writing his string to the file as UTF-8.

For a more detailed explanation of what's going wrong in his program:

Let's say he's trying to read the string "test". In UTF-8 (without a null terminator, since he's not using one) this would be 4 bytes: 74 65 73 74 (hex).

wxString (I assume) uses wxChar internally, which is typically 2 bytes wide in a Unicode build on Windows (wchar_t is usually 4 bytes on Linux and Mac) and 1 byte wide in a non-Unicode build. So in a non-Unicode build his code works, because he's reading the four bytes as-is from the file and each byte lands in a separate wxChar, resulting in the desired string.

However, in a Unicode build where wxChar is two bytes wide, the 4 bytes fill only two wxChars, giving him '6574 7473' ("整瑳") on a little-endian machine or '7465 7374' ("瑥獴") on a big-endian one, instead of the desired '0074 0065 0073 0074', which would be his expected "test" string.
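
To make the byte packing concrete, here is a minimal sketch in plain standard C++ (no wxWidgets). It assumes a 2-byte character unit and uses std::uint16_t as a stand-in for wxChar; the names Packed, readTestAsUnits and isLittleEndian are made up for this illustration and are not part of any library.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for a wxString buffer of four 2-byte units (an assumption of
// this sketch; the real wxString internals are not shown here).
struct Packed
{
    std::uint16_t units[4];
};

// Mimics str.resize(4) followed by Read(&str[0], 4): the four file bytes
// are copied verbatim over the start of the 2-byte units.
inline Packed readTestAsUnits()
{
    const unsigned char fileBytes[4] = { 0x74, 0x65, 0x73, 0x74 }; // "test"
    Packed p = { { 0, 0, 0, 0 } };
    std::memcpy(p.units, fileBytes, sizeof(fileBytes));
    return p;
}

// Returns true when the machine stores the low-order byte first.
inline bool isLittleEndian()
{
    const std::uint16_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}
```

On a little-endian machine readTestAsUnits() yields 0x6574 and 0x7473 in the first two units and leaves the last two at zero, which is the two-character garbage described above rather than four separate letters.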

Plus... bypassing wxString's encapsulation of the string buffer is seldom a good idea. There's no telling how wxString works internally, nor how it may change in the future, so you should refrain from stepping outside the provided interface.

rrcn
Earned a small fee
Posts: 14
Joined: Mon Nov 17, 2008 4:57 pm

Post by rrcn » Thu Feb 12, 2009 3:24 pm

Disch wrote: You should refrain from writing to the buffer directly, because the characters in wxString are stored differently depending on the Unicode build settings (i.e. each wxChar might be 1 or 2 bytes wide, whereas the text in your file is always made of 1-byte units).
But if the file has only 1 byte per character and it writes 4 bytes for the word "test", it will lose information and I can't have a portable file for several languages.
Disch wrote:
    wxString GetString ()
    {
        long len = GetLong ();
        char* buf = new char[len];
        long e = _stream.Read(buf,len);
        if(e != len)
        {
          delete[] buf;
          throw "file read failed";
        }

        wxString str(buf,wxConvUTF8,len);
        delete[] buf;

        return str;
    }
As an example, this doesn't work in a Unicode build for the word "colecção". It creates an empty string on the line: wxString str(buf,wxConvUTF8,len);

Disch
Experienced Solver
Posts: 99
Joined: Wed Oct 17, 2007 2:01 am

Post by Disch » Thu Feb 12, 2009 4:03 pm

rrcn wrote: But if in the file I have only 1 byte for character and it writes 4 bytes for the word "test", it will lose information and I can't have a portable file for several languages.
This is a little funky to explain if you're not already familiar with it.

UTF-8 characters are variable size. This means that one byte does not necessarily equate to one character. In UTF-8, ASCII characters (such as a-z, A-Z, 0-9, etc.) are each represented as 1 byte, but other characters need two, three or even four bytes. wxString usually hides all this from you, so you don't have to worry about it. However, when serializing to or from a file, it's important to know the difference.

Rather than attempt to explain exactly how UTF-8 works, I'll link to the Wikipedia article here, which outlines the basic idea.

Now, it's not really all that important to understand all that. What IS important to understand is that 8 bytes does not necessarily mean 8 characters. Depending on the characters, it could be as few as 2.

UTF-8 is capable of representing any Unicode character, so you don't have to worry about losing information when storing strings as UTF-8... it retains all information just fine.
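
As a rough illustration of the byte count versus the character count, here is a small sketch in plain standard C++ (no wxWidgets). The helper names countCodePoints and colecaoUtf8 are invented for this example, and the counting function assumes its input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
// Assumes the input is valid UTF-8; it does no validation.
inline std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        const unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80)
            ++count; // ASCII byte or lead byte: starts a new character
    }
    return count;
}

// "colecção" spelled out as explicit UTF-8 bytes, so the encoding of the
// source file does not matter.
inline std::string colecaoUtf8()
{
    return std::string("colec\xC3\xA7\xC3\xA3o");
}
```

For "colecção", size() reports 10 bytes while countCodePoints() reports 8 characters, because 'ç' and 'ã' take two bytes each. For a pure ASCII word such as "test" the two numbers are both 4, which is why the mismatch is easy to miss.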
rrcn wrote: As an example, this doesn't work for the word "colecção". It creates an empty string on the line: wxString str(buf,wxConvUTF8,len);
Strange, it works fine for me. Here's a quick and sloppy test program I whipped up. This program shows the expected "colecção" text in the message box in a Unicode build, but shows nonsense in an ANSI build (possibly because wxMessageBox can't print Unicode chars when not in a Unicode build?).

Code: Select all

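#include <wx/wx.h>
#include <wx/file.h>

class App : public wxApp
{
public:
	bool OnInit()
	{
		// test.bin holds the 10 UTF-8 bytes of "colecção" (8 characters)
		char buf[10];
		wxFile file(wxT("test.bin"));
		if (!file.IsOpened() || file.Read(buf,10) != 10)
			return false;
		file.Close();

		// buf is not null-terminated, so pass the byte count explicitly
		wxString str(buf,wxConvUTF8,10);
		wxMessageBox(str);
		return false;	// quit once the message box is closed
	}
};

IMPLEMENT_APP(App);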
"test.bin" is a file which just contains "colecção" in UTF-8 encoding:
63 6F 6C 65 63 C3 A7 C3 A3 6F

Since it's 10 bytes long, I just used a fixed value of 10 in my code rather than trying to determine the length some other way. Note the difference, here -- it's 10 bytes even though it's only 8 characters. When writing your string to the file, make sure you write how many bytes long the string is, and not how many characters. Either that, or just use a null terminator.
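
To show what goes wrong when the character count is stored instead of the byte count, here is a sketch in plain standard C++ (no wxWidgets). The function isCompleteUtf8 is a made-up helper that only checks the lead-byte and continuation-byte structure; it is not a full UTF-8 validator.

```cpp
#include <cstddef>
#include <string>

// Structural check only: every lead byte must be followed by the number of
// continuation bytes it announces. Overlong forms and surrogates are not
// rejected, so treat this as an illustration rather than a validator.
inline bool isCompleteUtf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size())
    {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80)                extra = 0; // ASCII
        else if ((lead & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((lead & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((lead & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else return false;                         // stray continuation byte

        if (i + extra >= s.size())
            return false;                          // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(s[i + k]);
            if ((next & 0xC0) != 0x80)
                return false;                      // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

Taking only the first 8 of the 10 bytes of "colecção" (63 6F 6C 65 63 C3 A7 C3) ends on the lead byte C3 of 'ã' with its continuation byte missing, so isCompleteUtf8 returns false for that prefix and true for the full 10 bytes. A strict converter handed such a truncated buffer can be expected to fail, which may be what produced the empty string reported earlier, although nothing in this thread confirms it.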

rrcn
Earned a small fee
Posts: 14
Joined: Mon Nov 17, 2008 4:57 pm

Thanks

Post by rrcn » Fri Feb 13, 2009 5:17 pm

Thanks for your great explanation!

Disch
Experienced Solver
Posts: 99
Joined: Wed Oct 17, 2007 2:01 am

Post by Disch » Fri Feb 13, 2009 5:41 pm

Ironic you replied when you did -- I was just about to post with an update ^^

I just had to do something like this in a program I'm working on. If you're still running into problems you can try this code out. It's working fine for me in both Unicode and non-Unicode builds:

Code: Select all

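void PutString(const wxString& str)
{
  // Convert first, so the stored length is a byte count, not a character count
  wxCharBuffer buf = str.utf8_str();
  long len = (long)strlen(buf);

  PutLong(len);
  file.Write(buf,len);
}

wxString GetString()
{
  long len = GetLong();
  if(len <= 0)
    return wxEmptyString;

  char* buf = new char[len];
  if(file.Read(buf,len) != len)
  {
    delete[] buf;
    throw "file read failed";
  }

  // buf is not null-terminated, so pass the byte count explicitly
  wxString str = wxString::FromUTF8(buf,len);
  delete[] buf;

  return str;
}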
The strlen() call is unfortunate, but I don't see any way around it: wxString returns the character length and not the byte length, whereas strlen() is oblivious to UTF-8 and simply counts bytes up to the null terminator, which is exactly the byte length needed here. It would be nice if you could somehow get the length from wxCharBuffer, but it doesn't look like you can (why on Earth not!!!)
