wxXmlDocument.Save() parameter truncation

Mtik
In need of some credit
Posts: 6
Joined: Tue Apr 01, 2008 3:37 am

Post by Mtik » Tue Apr 01, 2008 3:43 pm

I'm loading an XML document, changing some parameters, and saving it using the .Save( filename ) function.

Here's a sample parameter before the save:

Code: Select all

<customParams captionHeader="Visible"
Here's the same parameter after the save:

Code: Select all

<customParams captionHeader="V"
This happens on all the parameters whether I changed them or not.

The document is UTF-16 encoded if that has anything to do with it.

Anyone have any ideas? Like my last post, I'm sure it's something silly.

I don't know why only the parameter values are affected.

Thanks,

-Tim

ArKay
Knows some wx things
Posts: 41
Joined: Wed Mar 26, 2008 1:38 pm
Location: Germany

Post by ArKay » Tue Apr 01, 2008 6:38 pm

I ran into the same problem. It probably has to do with the UTF-16 encoding. :?

Mtik
In need of some credit
Posts: 6
Joined: Tue Apr 01, 2008 3:37 am

Post by Mtik » Tue Apr 01, 2008 7:20 pm

Thanks ArKay ( although it's not very encouraging :wink: ).

I've stepped through wxWidgets source and I *think* it comes down to the function:

Code: Select all

inline static void OutputString(wxOutputStream& stream, const wxString& str,
                                wxMBConv *convMem = NULL,
                                wxMBConv *convFile = NULL)
in C:\wxWidgets-2.8.7\src\xml\xml.cpp.

In this code:

Code: Select all

#if wxUSE_UNICODE
    wxUnusedVar(convMem);

    const wxWX2MBbuf buf(str.mb_str(*(convFile ? convFile : &wxConvUTF8)));
    stream.Write((const char*)buf, strlen((const char*)buf));
"convFile" is NULL for parameter names.

Therefore, the line

Code: Select all

const wxWX2MBbuf buf(str.mb_str(*(convFile ? convFile : &wxConvUTF8)));
uses wxConvUTF8 for conversion.

For parameter values, however, convFile points to a UTF-16 converter. That's got to be it.

When the "wxWX2MBbuf" object is created with the UTF-16 converter, the call

Code: Select all

strlen((const char*)buf)
returns "1". That explains why there's only 1 character written out.
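For anyone wanting to reproduce the strlen() result without wxWidgets, here is a minimal standalone sketch. It assumes the UTF-16 converter produces little-endian output (which would match the observed length of 1) and ASCII-only input. The helper names asciiToUtf16le, bytesSeenByStrlen and payloadBytes are made up for illustration and are not part of wxWidgets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Encode an ASCII-only string as UTF-16LE bytes (character byte first, then
// 0x00), followed by a two-byte terminator. Non-ASCII input is out of scope.
std::vector<char> asciiToUtf16le(const std::string& ascii) {
    std::vector<char> out;
    for (unsigned char c : ascii) {
        assert(c < 0x80);                     // this sketch handles ASCII only
        out.push_back(static_cast<char>(c));  // low byte
        out.push_back('\0');                  // high byte is zero for ASCII
    }
    out.push_back('\0');                      // UTF-16 terminator: two zero bytes
    out.push_back('\0');
    return out;
}

// Length a byte-oriented writer would use if it trusts strlen().
std::size_t bytesSeenByStrlen(const std::vector<char>& buf) {
    return std::strlen(buf.data());
}

// Actual payload size in bytes, excluding the two-byte terminator.
std::size_t payloadBytes(const std::vector<char>& buf) {
    return buf.size() - 2;
}
```

For "Visible", bytesSeenByStrlen() gives 1 while payloadBytes() gives 14, which would match the single "V" that ends up in the saved file.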

I can make everything work by setting the file encoding to UTF-8 right before I save:

Code: Select all

finalParamXML.SetFileEncoding( wxT("UTF-8") );
Of course, I wind up with a UTF-8 XML file and not a UTF-16 file like I originally started out with.

Can anyone explain this behavior?

I'm really not up on my UTF-8 / UTF-16 / wxString conversions.

-Tim

ArKay
Knows some wx things
Posts: 41
Joined: Wed Mar 26, 2008 1:38 pm
Location: Germany

Post by ArKay » Wed Apr 02, 2008 9:59 am

convFile isn't NULL, but it's not used everywhere.

I think that in UTF-16 EVERY character needs to be written as 16bit (I might be wrong), contrary to UTF-8 where only non-ASCII characters are escaped. OutputNode() only converts comments and parameters (OutputString() called with 4 parameters), the rest is written as UTF-8 which isn't correct.

The whole code just doesn't seem to handle UTF-16 too well.

I tried using wxXML2 (a wxWidgets wrapper for libxml2) and it didn't do too well either.

Then I tried xerces which at least writes out correct Unicode files, but I didn't delve too deep into it. :D

Mtik
In need of some credit
Posts: 6
Joined: Tue Apr 01, 2008 3:37 am

Post by Mtik » Wed Apr 02, 2008 7:22 pm

ArKay wrote:The whole code just doesn't seem to handle UTF-16 too well.
Agreed.

As it turns out, the files I'm working with are tagged wrong. The XML says it's a UTF-16 file but it's actually saved as UTF-8. Go figure.

Of course, it still doesn't work when I match the file encoding with the encoding parameter.

I'm working around this by changing the encoding parameter to UTF-8 in the file itself, and then carrying on as normal from there.
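A rough way to spot this kind of mislabeled file up front, sketched under the assumption that the whole file is already in memory as raw bytes. The function names are hypothetical. The idea: if the XML declaration can be found with a plain 8-bit search and it says UTF-16, the label can't be right, since real UTF-16 would have a zero byte next to every ASCII character.

```cpp
#include <cstddef>
#include <string>

// True if the buffer starts with a UTF-16 byte order mark (FF FE or FE FF).
bool hasUtf16Bom(const std::string& bytes) {
    if (bytes.size() < 2) return false;
    const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
    const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
    return (b0 == 0xFF && b1 == 0xFE) || (b0 == 0xFE && b1 == 0xFF);
}

// True if the declaration claims UTF-16 but is readable as plain 8-bit text.
// In a genuine UTF-16 file this single-byte search could not match, because
// every ASCII character would be padded with a zero byte.
bool isMislabeledUtf16(const std::string& bytes) {
    if (hasUtf16Bom(bytes)) return false;
    const std::size_t declEnd = bytes.find("?>");
    if (declEnd == std::string::npos) return false;
    const std::string decl = bytes.substr(0, declEnd);
    return decl.find("encoding=\"UTF-16\"") != std::string::npos;
}
```

This only covers the exact spelling encoding="UTF-16" with double quotes; a more tolerant check would need to handle single quotes and different letter case.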

-Tim

ArKay
Knows some wx things
Posts: 41
Joined: Wed Mar 26, 2008 1:38 pm
Location: Germany

Post by ArKay » Wed Apr 02, 2008 9:32 pm

I haven't come across UTF-16 encoded XML myself yet, it's probably not too common and UTF-8 works just fine.

Do those files really have to be encoded in UTF-16?

Belgabor
I live to help wx-kind
Posts: 173
Joined: Mon Sep 25, 2006 1:12 pm

Post by Belgabor » Wed Apr 02, 2008 11:29 pm

ArKay wrote:I think that in UTF-16 EVERY character needs to be written as 16bit (I might be wrong), contrary to UTF-8 where only non-ASCII characters are escaped.
That's wrong. In UTF16 each character is minimally 16 bits long, but there are some that need more.
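To make that concrete, here is a small sketch of the standard UTF-16 encoding rule: code points below U+10000 take one 16-bit unit, anything above takes two units (a surrogate pair). encodeUtf16 is an illustrative name, not a wxWidgets function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as one or two UTF-16 code units.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);               // beyond the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF);   // surrogate values are not encodable
    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary planes: split the 20-bit offset into a surrogate pair.
    const std::uint32_t v = cp - 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),    // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))   // low surrogate
    };
}
```

'V' (U+0056) comes out as one unit, while U+1D11E (the musical G clef) comes out as the two units D834 DD1E.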

ArKay
Knows some wx things
Posts: 41
Joined: Wed Mar 26, 2008 1:38 pm
Location: Germany

Post by ArKay » Wed Apr 02, 2008 11:33 pm

Belgabor wrote:
ArKay wrote:I think that in UTF-16 EVERY character needs to be written as 16bit (I might be wrong), contrary to UTF-8 where only non-ASCII characters are escaped.
That's wrong. In UTF16 each character is minimally 16 bits long, but there are some that need more.
I know. Just wanted to illustrate that what wx is doing can't be right (writing anything but text & attribute values as UTF-8).

ArKay
Knows some wx things
Posts: 41
Joined: Wed Mar 26, 2008 1:38 pm
Location: Germany

Post by ArKay » Wed Apr 02, 2008 11:49 pm

Actually wxXml2 does it _nearly_ right. The only thing wrong with it is that it doesn't encode EOL characters in UTF-16.

It writes...

0D 0A

instead of...

0D 00 0A 00

The rest of the file is correct (tried it with a hex editor, after insertion of the missing 0s the file can be opened).
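One way to avoid this class of bug, sketched here outside of wxXml2 and libxml2 (nothing is assumed about their internals): do the EOL conversion on 16-bit code units first and only then serialize to bytes. toCrLf and toUtf16leBytes are hypothetical helpers, and little-endian output is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Replace every bare LF with CR LF while the text is still a sequence of
// 16-bit code units, i.e. before it is turned into bytes.
std::u16string toCrLf(const std::u16string& text) {
    std::u16string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == u'\n' && (i == 0 || text[i - 1] != u'\r')) {
            out.push_back(u'\r');
        }
        out.push_back(text[i]);
    }
    return out;
}

// Serialize code units as UTF-16LE: low byte first, then high byte.
std::vector<std::uint8_t> toUtf16leBytes(const std::u16string& text) {
    std::vector<std::uint8_t> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));
    }
    return bytes;
}
```

With this ordering, the line break in toUtf16leBytes(toCrLf(u"a\n")) is written as 0D 00 0A 00 rather than as the bare 0D 0A described above.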

No idea whether that's a wxXml2 or libxml2 bug (which it wraps).

Mtik
In need of some credit
Posts: 6
Joined: Tue Apr 01, 2008 3:37 am

Post by Mtik » Thu Apr 03, 2008 9:51 pm

ArKay wrote:I haven't come across UTF-16 encoded XML myself yet, it's probably not too common and UTF-8 works just fine.

Do those files really have to be encoded in UTF-16?
I can't see why it would need to be.

My solution is to copy the XML file to the temp dir, change the encoding parameter, and then open the temp file as a wxXmlDocument with UTF-8. Works peachy.

I'm dealing with a 3rd party vendor who provides a plug-in to someone else's software to generate these XMLs. Who knows where the real problem lies...

-Tim

ArKay
Knows some wx things
Posts: 41
Joined: Wed Mar 26, 2008 1:38 pm
Location: Germany

Post by ArKay » Fri Apr 04, 2008 8:22 pm

Then they are not generating UTF-16 to begin with. Some people are really careless with document encoding information :D

I'm wondering how they can parse their own documents. They probably don't even use an XML toolkit for generation and parsing.

You don't have to use a temp file. You could do it in memory, too.

Code: Select all

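#include <cstring>
#include <wx/file.h>
#include <wx/mstream.h>
#include <wx/xml/xml.h>

// Loads an XML file whose declaration claims UTF-16 although the bytes are
// really UTF-8, patching the declaration in memory before parsing.
bool ParseDocument(const wxString& strFile, wxXmlDocument& document) {
	bool bSuccess = false;
	wxFile fileIn(strFile);
	if (fileIn.IsOpened()) {
		ssize_t fileSize = fileIn.Length();
		if (fileSize > 0) {
			// One extra byte so that strstr() sees a NUL-terminated buffer.
			char* pBuffer = new char[fileSize + 1];
			if (fileSize == fileIn.Read(pBuffer, fileSize)) {
				pBuffer[fileSize] = '\0';
				// Both strings have the same length, so the patch is done in place.
				const char* pEncodingOld = "encoding=\"UTF-16\"?>";
				const char* pEncodingNew = "encoding=\"UTF-8\"?> ";
				char* pEncoding = strstr(pBuffer, pEncodingOld);
				if (NULL != pEncoding) {
					memcpy(pEncoding, pEncodingNew, strlen(pEncodingNew));
				}
				wxMemoryInputStream streamXML(pBuffer, fileSize);
				bSuccess = document.Load(streamXML);
			}
			delete [] pBuffer;
		}
		fileIn.Close();
	}
	return bSuccess;
}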

ArKay
Knows some wx things
Posts: 41
Joined: Wed Mar 26, 2008 1:38 pm
Location: Germany

Post by ArKay » Sat Apr 05, 2008 3:35 pm

Should you ever need UTF-16 (just in case), build and use the wxXml2 library (syntax is similar to that of wxXml)...

http://wxcode.sourceforge.net/components/wxxml2/

Pass 0 as the flags argument when saving the file (document.Save(file, encoding, 0)). Otherwise the EOLs will be converted to native (flags default to wxXML2DOC_USE_NATIVE_NEWLINES). The EOL conversion routine doesn't seem to take the character encoding into account.
