writing/reading UTF-8 from special chars to a file

Posted: Tue Jul 14, 2020 8:45 am
by mael15
I am struggling to write a string containing special characters to a file as UTF-8 and read it back correctly. The problems start when the UTF-8 string supposedly has a length of 8, but 12 bytes are written to the file.

Code:

wxString testStr(wxT("abcdêü"));
wxString utf8Str = testStr.ToUTF8();
wxFile uft8File;
wxString filePath = wxGetUserHome() + wxT("\\Desktop\\uft8File.txt");

// write file
bool isOk = uft8File.Create(filePath, true);
isOk &= uft8File.Open(filePath, wxFile::read_write);

wxUint8 strLen = utf8Str.Length();	// 8 ?!?
uft8File.Write(&strLen, 1);
uft8File.Write(utf8Str);		// 12 bytes are written

uft8File.Close();

// read file
isOk &= uft8File.Open(filePath, wxFile::read);
char buf[20];
uft8File.Read(buf, 1);
strLen = buf[0];
uft8File.Read(buf, strLen);
wxString fromFile(buf[0], file.Length());
wxString fromUtf8 = wxString::FromUTF8(fromFile);
uft8File.Close();

The long-term goal is to write strings with special characters into an XML structure. This test is only meant to help me understand how special characters are handled in UTF-8.

Re: writing/reading UTF-8 from special chars to a file

Posted: Tue Jul 14, 2020 10:54 am
by PB
I think your code has multiple issues, like this one:

Code:

wxString testStr(wxT("abcdêü"));
wxString utf8Str = testStr.ToUTF8();
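
wxString always uses the platform's native Unicode encoding internally (e.g., UTF-16 on MSW), so the above does not make any sense: ToUTF8() returns a UTF-8 encoded char buffer, and assigning that buffer to another wxString converts the bytes back into the internal encoding instead of keeping them as UTF-8. The same applies to this: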

Code:

wxUint8 strLen = utf8Str.Length();	// 8 ?!?

and others. You can create a wxString from UTF-8, and you can get a UTF-8 encoded char buffer from a wxString, but you cannot have a UTF-8 encoded wxString.

If you want to write XML, I would recommend using an XML library; wxWidgets has wxXML, which is supposed to work seamlessly with wxString without you having to care about character encoding.

If you want to use a plain text file, use wxTextFile and make sure that you pass wxConvUTF8 to its Open and Write methods.

If you want to use binary files, see here for an example how to store a variable-length UTF-8-encoded wxString:
viewtopic.php?f=1&t=47221&p=199141#p199131

Re: writing/reading UTF-8 from special chars to a file

Posted: Tue Jul 14, 2020 1:44 pm
by PB
I found some code that I may have already posted here; perhaps it will help you understand how to deal with UTF-8:

Code:

#include <wx/wx.h>
#include <wx/ffile.h>

class MyApp : public wxApp
{
public:
    bool OnInit() override
    {
        // nihongo in kanji
        const char* UTF8literal = "\xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e";
        const wxString filePath = "uft8File.txt";

        wxString outStr = wxString::FromUTF8(UTF8literal);
        wxString inStr;
        wxFFile  utf8File;
        size_t   utf8FileSize = 0;

        // create and write a string to file
        if  ( !utf8File.Open(filePath, "w")
               || !utf8File.Write(outStr, wxConvUTF8) )
        {
            return false;
        }
        utf8File.Close();

        // read a string from file
        if  ( !utf8File.Open(filePath, "r")
               || !utf8File.ReadAll(&inStr, wxConvUTF8) )
        {
            return false;
        }
        utf8FileSize = static_cast<size_t>(utf8File.Length());
        utf8File.Close();

        wxLogMessage("outStr: '%s' (length %zu, size in bytes %zu)\n"
                     "inStr: '%s' (length %zu, size in bytes %zu)\n"
                     "UTF-8 string literal size in bytes: %zu\n"
                     "utf8File size in bytes: %zu",
                     outStr, outStr.size(), outStr.size() * sizeof(wxStringCharType),
                     inStr, inStr.size(), inStr.size() * sizeof(wxStringCharType),
                     outStr.ToUTF8().length(),
                     utf8FileSize);

        return true;
    }
}; wxIMPLEMENT_APP(MyApp);

[Attachment: utf8file.png]

Just a reminder: a UTF-16 encoded wxString (used wherever wchar_t is 16 bits wide, e.g., on MSW) does not work properly with characters outside the Basic Multilingual Plane (i.e., those that do not fit into a single 16-bit wchar_t), since it does not support surrogate pairs.

Re: writing/reading uft-8 from special chars to a file

Posted: Thu Jul 16, 2020 5:45 pm
by mael15
That was really helpful, thanks! I ran some simple tests, and it took some time, but it works now. I use libxml2 but had some old, unnecessarily complicated details to clean up. It is surprisingly complicated how strings are converted and how to save them, read them back, and reverse the conversion.