The following is not a valid XML 1.0 document:
Try it yourself in your favourite XML parser!
(This blog-post was meant to be a tweet, but after two attempts, I couldn’t make Twitter render it right.)
I don’t really know a lot about XML. I get the general sense that it’s kinda cool, but suffers from a bad reputation in some people’s minds because it initially got over-hyped like cloud computing or the Segway scooter.
What makes this so inexcusable is that XML was explicitly an attempt to learn from the pre-existing SGML standard, making a version of it that was simpler, more elegant, had fewer special cases, etc. And yet somehow it still ended up with crazy restrictions like the no-control-characters-even-if-escaped rule. The good news is that XML 1.1 lifts this restriction. The bad news is that XML 1.1 has been a W3C recommendation since 2004, and eight years on no-one is using it.
For the record (for the benefit of those following the trail of ‘why is my XML containing screenscrapes of COBOL powered mainframes invalid?’):
There are a whole lot of other characters that are not allowed, not even in escaped state:
Everything below 0x09, plus 0x0B, 0x0C, and everything between 0x0E and 0x1F. (Everything between 0x7F and 0x84 and between 0x86 and 0x9F is discouraged by XML 1.0 rather than forbidden.)
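For anyone who wants to check their data before feeding it to a parser, here is a minimal sketch of that rule in Python. It is based on my reading of the Char production in the XML 1.0 spec, and the function names are made up for illustration.

```python
def is_xml10_char(code_point: int) -> bool:
    """Return True if code_point falls inside the XML 1.0 Char production
    (as I read it: tab, LF, CR, and three ranges)."""
    return (
        code_point in (0x09, 0x0A, 0x0D)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def first_forbidden(text: str):
    """Return (index, code point) of the first character XML 1.0 cannot
    carry, or None if the whole string is fine."""
    for index, char in enumerate(text):
        if not is_xml10_char(ord(char)):
            return index, ord(char)
    return None


# A screenscrape containing an ESC (0x1B) is flagged at the ESC.
problem = first_forbidden("ab\x1b[1mc")
clean = first_forbidden("plain text\n")
```

Escaping does not change the answer: a character reference to a code point that fails this test is rejected just like the literal character.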
I used to have the page bookmarked, but apparently lost it. One of the XML standards at the W3C has a whole list of ranges that are not allowed, mostly UTF-8 ***FE ***FF sequences.
Regarding disallowed characters–perhaps this was an attempt to force all XML documents to be human readable?
I’m not saying I agree or disagree with that idea, I’m just randomly speculating.
& a m p ; # 2 7 ;
I am sorry for the previous comments. My point is that I refuse to describe this case as stupid. XML has a precise (non-ambiguous) syntax that guarantees interoperability. That case is forcing the syntax. In terms of representation, you are not limited. You just have to represent it in the escaped mode: “&amp;#27;”. That’s because “&” is actually a special character.
No, Michele, that is precisely my point. We can all understand why XML does not permit literal ESC characters (or other control characters) in documents. That’s not stupid. The stupid part is that even though XML has a perfectly good built-in encoding mechanism, you are arbitrarily forbidden to use it for representing these characters. So using &#27; instead of a literal ESC does not solve the problem. Instead, a W3C-compliant XML parser has to decode the encoded character, notice that it decodes to a control character, and reject the document.
Now that is stupid.
If you doubt it, take a look at http://www.miketaylor.org.uk/tmp/x.xml
Browsers refuse to display this document, complaining “error on line 1 at column 9: xmlParseCharRef: invalid xmlChar value 27” or similar.
Or if you don’t trust your browser, ask the W3C validator to check that file.
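Or check locally. The following is a small sketch using Python's standard-library parser (which sits on top of expat). The one-element documents are my own stand-ins for illustration, not the contents of the file linked above.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A character reference to an ordinary character (65 is "A") is accepted.
ordinary_ref = parses("<x>&#65;</x>")

# A character reference to ESC (27) is rejected.
escaped_esc = parses("<x>&#27;</x>")

# A literal ESC is rejected as well.
literal_esc = parses("<x>\x1b</x>")

print(ordinary_ref, escaped_esc, literal_esc)
```

Here the escaped and the literal ESC fail alike, which is the whole complaint.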
I still don’t see the stupidity in it. In my opinion, it’s just a matter of representation.
No. It’s not a matter of representation. It’s a matter of repertoire.
There is no way to represent the ESC character in an XML document. It simply is not in the repertoire. At all.
I verified it using Oxygen XML Editor and it produced the same output as both your links. My point is that I would have expected behaviour like that if I had written that code.
What about this?
That is precisely what my document says (as does the actual post). View Source if you don’t believe me.
@Michele: Not even that! A CDATA block won’t work either, the parser considers that as encoded and tries to decode that, with the same effect.
The only way to add such characters is to encode the encoded form, replacing the & with &amp; yourself, with no support from any (compliant) parser.
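A rough sketch of what that double encoding means in practice, in Python. The parser hands back the text "&#27;" untouched, and the application has to turn it into a real ESC itself. The element name, the helper `decode_refs`, and the decision to handle only decimal references are all my own assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal character references such as "&#27;" left in the text.
_DECIMAL_REF = re.compile(r"&#(\d+);")


def decode_refs(text: str) -> str:
    """Application-level second decoding pass (hypothetical helper)."""
    return _DECIMAL_REF.sub(lambda match: chr(int(match.group(1))), text)


# The ampersand is escaped by hand, so the parser sees plain text.
element = ET.fromstring("<x>&amp;#27;[1mbold</x>")
as_parsed = element.text            # still the six characters "&#27;[" ...
as_decoded = decode_refs(as_parsed)  # now starts with a real ESC
```

The catch is that text which genuinely contains "&#27;" can no longer be told apart from an encoded ESC, so both ends have to agree on the convention.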
Right. The best you could do would be something like
And have your application make a pass over your DOM object or equivalent, base64-decoding all elements that carry the Content-Transfer-Encoding=“base64” attribute. Very nasty, and, as you note, not something that any XML parser will do for you.
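That pass could be sketched like this in Python with the standard library. The `screen` element, the function name, and the assumption that the decoded bytes are UTF-8 text are all mine; only the attribute name comes from the description above.

```python
import base64
import xml.etree.ElementTree as ET

ATTRIBUTE = "Content-Transfer-Encoding"


def decode_base64_elements(root: ET.Element) -> ET.Element:
    """Replace the text of every element marked as base64 with its
    decoded form, and drop the marker attribute."""
    for element in root.iter():
        if element.get(ATTRIBUTE) == "base64":
            raw = base64.b64decode(element.text or "")
            # Assumption: the payload is UTF-8 text.
            element.text = raw.decode("utf-8")
            del element.attrib[ATTRIBUTE]
    return root


# Build a document whose payload contains an ESC, safely wrapped in base64.
payload = base64.b64encode(b"\x1b[1mhello").decode("ascii")
document = f'<screen {ATTRIBUTE}="base64">{payload}</screen>'

root = decode_base64_elements(ET.fromstring(document))
decoded_text = root.text
```

The decoded tree only lives in memory. Writing it back out as XML 1.0 would put the forbidden character straight back into the document.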
So that explains some of the BASE64 in Apple’s MacOS/IOS property list files. There I was in the heart of the early 21st century and staring at a relic from the TTY31 1950s.
Scroll to the bottom of this dailywtf for a bit of XML-related wtf-ing.