Escape non printable chars in XML Topic is solved

Moonslate
Earned a small fee
Posts: 10
Joined: Sat Aug 24, 2019 10:09 am

Escape non printable chars in XML

Post by Moonslate » Thu Aug 06, 2020 9:37 pm

Hello, how do I escape non-printable chars when saving a wxXmlDocument? For example, I save a node whose text is the string "Choose another one?[0x0500]", where [0x0500] is the non-printable char. This saves without error, but when I load the file again, I get a "not well-formed character" error.

I also tried saving the XML to a string stream and then replacing every 0x0500 with &#x5;, but then I get a "reference to invalid character number" error.

Code: Select all

<String references="0007073c">Choose another one?&#x5;</String>
or

Code: Select all

<String references="0007073c">Choose another one?[0x0500]</String>


I save the XML with UTF-16 encoding (UCS-2 Little Endian). I'm on Windows, using Visual Studio 2019.

catalin
Moderator
Posts: 1597
Joined: Wed Nov 12, 2008 7:23 am
Location: Romania

Re: Escape non printable chars in XML

Post by catalin » Fri Aug 07, 2020 3:10 am

Did you mean "non-printable" as in "not on the keyboard"? Because it looks like 0x0500 does have a glyph.
And you probably need to escape it inside the wxString that you use to write the xml, in which case you'd be better off by finding its utf8 representation and using that, which in your case should be wxString::FromUTF8("Choose another one?\xD4\x80").

Unicode Support in wxWidgets might help.

Re: Escape non printable chars in XML

Post by Moonslate » Tue Aug 11, 2020 12:14 am

catalin wrote:
Fri Aug 07, 2020 3:10 am
Did you mean "non-printable" as in "not on the keyboard"? Because it looks like 0x0500 does have a glyph.
And you probably need to escape it inside the wxString that you use to write the xml, in which case you'd be better off by finding its utf8 representation and using that, which in your case should be wxString::FromUTF8("Choose another one?\xD4\x80").

Unicode Support in wxWidgets might help.
Thanks for the reply.

By non-printable, I mean chars that cannot be printed. Check isprint.

The UTF-8 representation of 0x0500 (unicode) is the same as UTF-8 (0x05), but with only one char.

I searched further online and found that it is not possible to escape these non-printable chars in XML using the &#x5; syntax (sorry for not finding this information earlier; I thought this was an error in wxWidgets).

So, I'm saving with

Code: Select all

wxString::Replace(L'\x5', L"[END]", true)
and loading with

Code: Select all

wxString::Replace(L"[END]", L'\x5', true)
This is a bit hacky, since I will run into other non-printable chars in the future, but it does the job for now.
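
The single-character replacement above can be generalized. What follows is only a minimal sketch, assuming plain std::wstring instead of wxString so that it compiles without wxWidgets: every C0 control character that XML 1.0 forbids is written as a backslash escape such as \x05, and the backslash itself is doubled so the mapping stays reversible. The names EscapeControls and UnescapeControls and the \xNN notation are made up for this illustration; they are not part of wxWidgets or expat.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// C0 control characters other than tab, LF and CR are not legal in XML 1.0 text.
static bool IsForbiddenControl(wchar_t ch)
{
    const unsigned long cp = static_cast<unsigned long>(ch);
    return cp < 0x20 && cp != 0x09 && cp != 0x0A && cp != 0x0D;
}

// Returns 0..15 for a hexadecimal digit, -1 for anything else.
static int HexValue(wchar_t ch)
{
    if (ch >= L'0' && ch <= L'9') return ch - L'0';
    if (ch >= L'A' && ch <= L'F') return ch - L'A' + 10;
    if (ch >= L'a' && ch <= L'f') return ch - L'a' + 10;
    return -1;
}

// Hypothetical helper: replaces forbidden control characters with "\xNN"
// and doubles every backslash so the result can be decoded unambiguously.
std::wstring EscapeControls(const std::wstring& in)
{
    static const wchar_t kHex[] = L"0123456789ABCDEF";
    std::wstring out;
    for (wchar_t ch : in) {
        if (ch == L'\\') {
            out += L"\\\\";
        } else if (IsForbiddenControl(ch)) {
            assert(ch < 0x20);  // guaranteed above, so two hex digits are enough
            out += L"\\x";
            out += kHex[(ch >> 4) & 0xF];
            out += kHex[ch & 0xF];
        } else {
            out += ch;
        }
    }
    return out;
}

// Hypothetical helper: inverse of EscapeControls. Malformed escapes are kept as-is.
std::wstring UnescapeControls(const std::wstring& in)
{
    std::wstring out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == L'\\' && i + 1 < in.size() && in[i + 1] == L'\\') {
            out += L'\\';
            i += 1;
        } else if (in[i] == L'\\' && i + 3 < in.size() && in[i + 1] == L'x'
                   && HexValue(in[i + 2]) >= 0 && HexValue(in[i + 3]) >= 0) {
            out += static_cast<wchar_t>(HexValue(in[i + 2]) * 16 + HexValue(in[i + 3]));
            i += 3;
        } else {
            out += in[i];
        }
    }
    return out;
}
```

The idea would be to call EscapeControls on each string before it is stored in the document and UnescapeControls on each string after loading. With wxString, the same functions could presumably be applied to the result of ToStdWstring(), but that wiring is not shown here.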

While I was writing this, I found on this site that control characters (from 0x00 to 0x08) are allowed in XML version 1.1. I tried setting my XML doc to version 1.1, but the XML parser doesn't allow this character even with version 1.1.

Code: Select all

	wxXmlDocument doc;
	doc.SetVersion(L"1.1");
I guess that changing this is only cosmetic and only affects the XML output, not the parser at all.

Code: Select all

<?xml version="1.1" encoding="UTF-16"?>
<ROM-String start="000F0E7C" size="00000008">
  <String references="00004EFC">Player[0x0500]</String>  
</ROM-String>
or

Code: Select all

<?xml version="1.1" encoding="UTF-16"?>
<ROM-String start="000F0E7C" size="00000008">
  <String references="00004EFC">Player&#x5;</String>
</ROM-String>
Please note that [0x0500] stands for the raw ENQ character in the document; I can't paste it here. Try opening Notepad++ and pressing Ctrl+E. That character (and others like it) is what I want to escape.

Re: Escape non printable chars in XML

Post by catalin » Tue Aug 11, 2020 12:38 am

Moonslate wrote:
Tue Aug 11, 2020 12:14 am
The UTF-8 representation of 0x0500 (unicode) is the same as UTF-8 (0x05)
Uhm, no, it is not. Unicode U+0500 can be represented in utf-8 as 0xD4 0x80 (and FWIW as 0x0500 in utf-16).
Moonslate wrote: but with only one char.
There is no such thing. You are probably misinterpreting the UTF-16 value 0x0500 as the UTF-8 values 0x05 and 0x00.
Moonslate wrote: Please note that [0x0500] stands for the raw ENQ character in the document.
Again, no.
0x0500 is the representation in utf-16 of unicode U+0500. ENQ is unicode U+0005. And you don't have to believe me, just look them up (i.e. using the link in my first reply).
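
To make the distinction concrete, here is a small self-contained sketch (plain C++, no wxWidgets; the function names are made up for illustration) that encodes a Unicode code point as UTF-8 and as UTF-16 little-endian bytes. It shows that ENQ (U+0005) is the single byte 05 in UTF-8 and the bytes 05 00 in UTF-16LE, which is what a hex editor displays for a UCS-2 LE file, whereas U+0500 is D4 80 in UTF-8 and 00 05 in UTF-16LE.

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point (surrogates excluded) as UTF-8 bytes.
std::string EncodeUtf8(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes one code point as UTF-16 little-endian bytes (low byte first).
std::string EncodeUtf16LE(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    auto put = [&out](char32_t unit) {
        out += static_cast<char>(unit & 0xFF);         // low byte
        out += static_cast<char>((unit >> 8) & 0xFF);  // high byte
    };
    if (cp < 0x10000) {
        put(cp);
    } else {
        cp -= 0x10000;               // split into a surrogate pair
        put(0xD800 + (cp >> 10));
        put(0xDC00 + (cp & 0x3FF));
    }
    return out;
}
```

Reading the two UTF-16LE bytes 05 00 as if they were one big-endian number gives "0x0500", which appears to be where the mix-up between U+0005 and U+0500 comes from.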

Re: Escape non printable chars in XML

Post by Moonslate » Tue Aug 11, 2020 1:31 am

catalin wrote:
Tue Aug 11, 2020 12:38 am
Moonslate wrote:
Tue Aug 11, 2020 12:14 am
The UTF-8 representation of 0x0500 (unicode) is the same as UTF-8 (0x05)
Uhm, no, it is not. Unicode U+0500 can be represented in utf-8 as 0xD4 0x80 (and FWIW as 0x0500 in utf-16).
Moonslate wrote: but with only one char.
There is no such thing. You are probably misinterpreting the UTF-16 value 0x0500 as the UTF-8 values 0x05 and 0x00.
Moonslate wrote: Please note that [0x0500] stands for the raw ENQ character in the document.
Again, no.
0x0500 is the representation in utf-16 of unicode U+0500. ENQ is unicode U+0005. And you don't have to believe me, just look them up (i.e. using the link in my first reply).
I just created a document in Notepad++ with UTF-8 encoding, and I saved it with only an enquiry character.

Notepad++ screenshot:

Image

The file opened in a hex editor:

Image

Here the same file saved with UCS-2 LE encoding:

Image

and on this site you can see that the UTF-8 ENQ is 0x05 (the same as ASCII) and the Unicode is 0x0500.

But it doesn't matter. My problem is not encoding. I know how to encode my strings correctly (I load them from an old ASCII-encoded executable that I'm reverse-engineering). Also, my XML files are NOT encoded in UTF-8.
Moonslate wrote: I save the xml with UTF-16 encoding (UCS-2 Little Endian).
My problem is with the XML parser and how to store these characters in an XML file, because the wxWidgets parser can't read that character, even when escaped as &#x5;
So, I'm saving with

Code: Select all

wxString::Replace(L'\x5', L"[END]", true)
and loading with

Code: Select all

wxString::Replace(L"[END]", L'\x5', true)
This works, but I have other characters that the XML parser doesn't allow, like 0x0c.

Re: Escape non printable chars in XML

Post by catalin » Tue Aug 11, 2020 9:33 am

Moonslate wrote:
Tue Aug 11, 2020 1:31 am
you can see that the UTF-8 ENQ is 0x05 (the same as ASCII) and the Unicode is 0x0500.
..this time I think it's perfectly clear -- you are confusing "Unicode" with "utf-16" / "UCS-2".
"Unicode" is not an encoding, while UTF-8, UTF-16 and UCS-2 are all encodings intended to represent Unicode. Again, the UTF-8 value 0x05 and the UTF-16/UCS-2 value 0x0500 (the little-endian bytes 05 00) are both encodings of the same Unicode character, U+0005. Please read the title of the page in that link too.
Moonslate wrote:My problem is not encoding.
I actually think it is.
Moonslate wrote: I save the xml with UTF-16 encoding (UCS-2 Little Endian).
And are you also reading it using UTF-16 / UCS-2? Because it looks like you are reading it incorrectly, using UTF-8.
Moonslate wrote: I have other characters that the XML parser doesn't allow, like 0x0c.
And this pretty much confirms my assumption -- 0x0c utf-8 is 0x000c utf-16/ucs-2, which means that for the utf-16 value the parser using utf-8 will no longer be fooled in the same way, and it will first read a 0x00 utf-8 value, and probably stop there.
Anyway, you should provide a minimal code sample showing how you write and then read the XML file. It should be possible to make it really small.

Re: Escape non printable chars in XML

Post by Moonslate » Tue Aug 11, 2020 8:57 pm

catalin wrote:
Tue Aug 11, 2020 9:33 am
Moonslate wrote:
Tue Aug 11, 2020 1:31 am
you can see that the UTF-8 ENQ is 0x05 (the same as ASCII) and the Unicode is 0x0500.
..this time I think it's perfectly clear -- you are confusing "Unicode" with "utf-16" / "UCS-2".
"Unicode" is not an encoding, while UTF-8, UTF-16 and UCS-2 are all encodings intended to represent Unicode. Again, the UTF-8 value 0x05 and the UTF-16/UCS-2 value 0x0500 (the little-endian bytes 05 00) are both encodings of the same Unicode character, U+0005. Please read the title of the page in that link too.
Ok, so, let's assume I'm confusing the "Unicode" with "utf-16" / "UCS-2". What does this have to do with the title of my post?
catalin wrote:
Moonslate wrote:I save the xml with UTF-16 encoding (UCS-2 Little Endian).
And are you also reading it using UTF-16 / UCS-2? Because it looks like you are reading it incorrectly, using UTF-8.
Yes, reading works perfectly. This is also not my problem.
catalin wrote:
Moonslate wrote: I have other characters that the XML parser doesn't allow, like 0x0c.
And this pretty much confirms my assumption -- 0x0c utf-8 is 0x000c utf-16/ucs-2, which means that for the utf-16 value the parser using utf-8 will no longer be fooled in the same way, and it will first read a 0x00 utf-8 value, and probably stop there.
You are right. 0x0c is 0x000c in UCS-2. You are right that the parser will read a 0x00. But, as I said several times:
Moonslate wrote: I save the xml with UTF-16 encoding (UCS-2 Little Endian).
So actually ALL characters have a 0x00, as in this screenshot:

Image

But... As I said several times:
Moonslate wrote: My problem is not encoding.
Moonslate wrote: My problem is with the XML parser and how to store these characters in an XML file, because the wxWidgets parser can't read that character, even when escaped as &#x5;

Code: Select all

<ROM-String start="0010538C" size="0000097A">
  <String references="0007D8C4">
    <![CDATA[  ]]> <!--The parser doesn't allow whitespace-only strings, so I put it inside a CDATA block-->
  </String>
  <String references="0007D8C8">Some &#xA; text.</String> <!--This doesn't cause any problem-->
  <String references="0007D8CC">Other &#x5; text.</String> <!--Error line 6: not well formed character-->
</ROM-String>
My question here is:

How do I escape characters like 0x0500 and 0x0c00 (and some others I have here)?
Moonslate wrote: While I was writing this, I found on this site that control characters (from 0x00 to 0x08) are allowed in XML version 1.1. I tried setting my XML doc to version 1.1, but the XML parser doesn't allow this character even with version 1.1.

Code: Select all

wxXmlDocument doc;
doc.SetVersion(L"1.1");
I guess that changing this is only cosmetic and only affects the XML output, not the parser at all.
In the time I wasted here, I could have made my own XML-like file format and a simple parser that doesn't need these hacky solutions... Or stored the strings as in the older versions of my application: non-human-readable binary strings...

I hope you finally understand that my problem is not with encoding (AS I SAID SEVERAL TIMES) and is only an XML markup/parser question...

doublemax
Moderator
Posts: 15647
Joined: Fri Apr 21, 2006 8:03 pm
Location: $FCE2

Re: Escape non printable chars in XML

Post by doublemax » Tue Aug 11, 2020 11:14 pm

Disclaimer: I've never used XML for anything, the following is just the result of a little bit of Googling:

You didn't mention it, but i assume you use wxXmlDocument? wxWidgets uses expat for xml parsing, which only supports xml 1.0. There, both 0x05 and 0x0c are not allowed.

So i believe you have to perform both escaping and un-escaping of these characters yourself (with any method that suits you and that doesn't create invalid xml).
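
For reference, the set of characters XML 1.0 accepts is given by the Char production of the specification: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF. A minimal sketch of a pre-save check built on that rule follows (plain C++; IsXml10Char and FindFirstInvalid are hypothetical names, not wxWidgets or expat functions):

```cpp
#include <cstddef>
#include <string>

// True if the code point matches the Char production of XML 1.0.
bool IsXml10Char(char32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD
        || (cp >= 0x20 && cp <= 0xD7FF)
        || (cp >= 0xE000 && cp <= 0xFFFD)
        || (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Index of the first code point that XML 1.0 rejects,
// or std::u32string::npos if every code point is acceptable.
std::size_t FindFirstInvalid(const std::u32string& text)
{
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (!IsXml10Char(text[i]))
            return i;
    }
    return std::u32string::npos;
}
```

A check like this could flag strings that need custom escaping before they are written, instead of discovering the problem only when the file fails to load.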
Use the source, Luke!

Re: Escape non printable chars in XML

Post by Moonslate » Wed Aug 12, 2020 12:54 am

Hello, doublemax. Thank you for your answer.
doublemax wrote: You didn't mention it, but i assume you use wxXmlDocument?
...
Moonslate wrote: Hello, how do I escape non-printable chars when saving a wxXmlDocument?
I see. I was misled by wxXmlDocument::SetVersion.

If your answer had come first, this thread would have been much shorter.

Re: Escape non printable chars in XML

Post by catalin » Wed Aug 12, 2020 4:33 am

Moonslate wrote:
Tue Aug 11, 2020 8:57 pm
let's assume I'm confusing the "Unicode" with "utf-16" / "UCS-2". What does this have to do with the title of my post?
It was very misleading, including the mix-up with utf-8 representation.
Moonslate wrote: How do I escape characters like 0x0500 and 0x0c00 (and some others I have here)?
You surely meant 0x000c, but it doesn't matter.
Likely you need a different parser. With wxXmlDocument (expat) there are multiple limitations, like the ones you've seen already.
Moonslate wrote: In the time I wasted here, I could have made my own XML-like file format and a simple parser
Next time, consider spending a bit of that time on writing 5-10 lines of code showing your use of wxXmlDocument. It is not only your time that gets wasted. You're welcome, nevertheless.
