
Adventures in Engineering - A decades late rant on XML.
The wanderings of a modern ronin.

Ben Cantrick
  Date: 2012-06-02 12:55
  Subject:   A decades late rant on XML.
Public
  Music:MC Plus+ - Dear Engineer
The fact that I've been able to avoid XML until now is - I think - generally a positive reflection on my software engineering instincts. When XML was a big new thing, I took one look at it and said to myself: "Uh, yeah... no. I'll just be over here with my C and my micro-controllers and stuff. Y'all have fun with that." As it turns out I called that correctly. It's nice to get one right.

But I'm also lucky that today, in the here-and-now, we actually know that XML isn't the best way to do a lot of the things that its advocates claim(ed) it should be used for. Today we can point at JSON or CSV or whatever and say: "That works (much) better." If you were a programmer in 2003 and thought XML sucked, you didn't have anything well-known to counter it with. And so quite possibly got forced to use it.

So forgive the "old news" flavor of this rant. Everyone who was forced to use XML (which is probably nearly everybody by now) has already learned these lessons. I've just been lucky enough to not have to use XML in any serious capacity... until recently.





XML comments are weapons-grade fail.

First, let's face it: XML got comments wrong from the very start. The mere idea of using one string ("!--") to open a comment, and then a different string ("--") to close it, was flat out stupid to begin with. What, exactly, was wrong with <-- comment --> ? Would that have been too easy to read? Too easy to type?

But of course XML couldn't stop failing there...

You can't put the string "--" inside a comment.

<!-- This is the first line of a comment.
  -- This is the second line of a comment.
  -->

If you do something like that, get ready to see:
ERROR caused by: org.yackity.smackity.WhackIty.JackIty.HackIty.OhGodHelpMe.ICanSeeForever.YouAreInAMazeOfTwistyLittleExeption
*** String '--' not allowed in comment at [row,col {unknown-source}]: [37,6]

Stop and think for a minute about how incredibly stupid this is. The string "--" is not "<!--", nor is it "-->". Thus, "--" should have no significance whatsoever unless it is immediately attached to a "!" or ">" character. There is quite literally no reason at all to disallow "--" inside a comment.
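If you'd rather see this for yourself than take my word for it, here is a minimal sketch using Python's standard-library xml.etree.ElementTree (my choice of parser for illustration, not the one that produced the error above). The helper name is_well_formed and the <root> wrapper element are made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc, False if it raises ParseError."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# A comment with no "--" in its body.
plain_comment = "<root><!-- just a comment --></root>"

# The two-line comment from above: the second line starts with "--".
dashed_comment = "<root><!-- first line\n  -- second line\n  --></root>"

print(is_well_formed(plain_comment))   # True
print(is_well_formed(dashed_comment))  # False
```

The only difference between the two documents is the extra "--" at the start of the second comment line.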

Just to drive the point home, let's show an example of what would happen if we used some other string instead of "--". How about "fish"?
<!fish This is a fish. fish>

By the same logic, that would be invalid XML, because the word that came right after the exclamation point shows up again inside the comment.

The level of brain death required to accept this state of affairs even temporarily, much less advocate this as a global standard to be used for decades... is just staggering.

But did XML cease its parade of comment failure there? Oh no...

You can't comment out an attribute.

Wanting to comment out an individual attribute is a perfectly normal and reasonable thing to do. It's something you'd clearly anticipate someone wanting to do, among other occasions, during the process of debugging. But XML won't let you do it:
<sometag
  firstattrib="a"
<!--  secondattrib="b"  -->
  thirdattrib="c"
  fourthattrib="d">
</sometag>

Or, suppose you're trying to explain why an attribute is set the way it is...
<sometag
  firstattrib="a"
  secondattrib="b"      <!-- Reticulates splines optimally. -->
  thirdattrib="c"
  fourthattrib="d">
</sometag>

In both cases, your XML parser will shit all over itself at secondattrib. In XML you simply cannot have comments anywhere in a list of attributes. Period. Why? Because XML comments are made of pure fail.
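Here is a rough sketch of both failures, again using Python's xml.etree.ElementTree as a stand-in parser (an assumption on my part; the exact error text will differ between parsers, so the sketch only checks whether an error occurred). The parse_error helper is hypothetical, and the third document shows the only workaround I know of: pull the attribute out of the tag entirely and park the comment in the element's content.

```python
import xml.etree.ElementTree as ET


def parse_error(doc: str):
    """Return the parser's error message, or None if doc parses cleanly."""
    try:
        ET.fromstring(doc)
        return None
    except ET.ParseError as err:
        return str(err)


# Attempt 1: comment out one attribute in place.
commented_attr = """<sometag
  firstattrib="a"
<!--  secondattrib="b"  -->
  thirdattrib="c">
</sometag>"""

# Attempt 2: explain an attribute with a trailing comment.
trailing_comment = """<sometag
  firstattrib="a"
  secondattrib="b"      <!-- Reticulates splines optimally. -->
  thirdattrib="c">
</sometag>"""

# Workaround: remove the attribute and keep the comment in the content.
comment_in_content = """<sometag firstattrib="a" thirdattrib="c">
  <!--  secondattrib="b"  -->
</sometag>"""

for name, doc in [("commented_attr", commented_attr),
                  ("trailing_comment", trailing_comment),
                  ("comment_in_content", comment_in_content)]:
    print(name, "->", parse_error(doc) or "parsed OK")
```

The workaround parses, but the comment is no longer anywhere near the attribute list it was supposed to describe.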

!-- SPECIAL BONUS FAIL --

XML's <xs:sequence> is used all over the place in XML schema definition files. This allows the schema designer to enforce the exact order of sub-tags within a tag. Which in turn allows a validating XML parser to reject a perfectly well-formed XML document for no other reason than the order of the sub-elements inside some tag was different than it expected. E.g., this works fine:
<tag>
 <dog>lassie</dog>
 <cat>mittens</cat>
</tag>

But this blows right the hell up:
<tag>
 <cat>mittens</cat>
 <dog>lassie</dog>
</tag>

I honestly don't know why <xs:sequence> was even created. I've been thinking about this for a week and I can neither think of, nor find via googling, any example anywhere that shows a valid need to enforce the order of child tags within a tag. The whole point of having a Data Description Language is that doing so allows the computer to take care of the little bullshit things - like what order some arbitrary list of items is given in. But XML schemas far and wide enforce bullshit ordering with <xs:sequence>. Why? Because screw you, that's why!
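To illustrate what "letting the computer take care of it" looks like, here is a small sketch with no schema involved at all. It is not a schema validator (Python's standard library doesn't do XSD validation); it only shows that code which looks up child elements by name gets the same data out of both documents. The read_pets helper is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_pets(doc: str) -> dict:
    """Map each child tag name to its text, ignoring document order."""
    root = ET.fromstring(doc)
    return {child.tag: child.text for child in root}


dog_first = "<tag><dog>lassie</dog><cat>mittens</cat></tag>"
cat_first = "<tag><cat>mittens</cat><dog>lassie</dog></tag>"

# Both orderings yield the same name-to-value mapping.
print(read_pets(dog_first) == read_pets(cat_first))  # True
```

This only works as-is when each child name appears at most once; repeated children would need a list per name.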


Considering that XML requires a sophisticated parser to parse correctly, the fact that it can't make a single one of the above simple and obvious things work is very impressive. XML has accomplished something that's rare even in the bug-ridden and incompatible realm of software: it has managed to create the worst of all possible worlds.

So, why does this incredible heap of crap survive? Are we, the software engineers, REALLY THAT STUPID?

Yes. Yes we are.

In 2.5 years it will be 2015. And in 2015, hundreds of millions upon hundreds of millions of lines of code will still depend on - or even be written specifically to support - XML. Because we as software engineers are too damn stupid, too damn lazy, and too damn cowardly to put a bullet in this disease-ridden corpse that never should have won out over plain old SGML in the first place.

tl;dr - Screw XML. Screw it forever. XML is the herpes of the software universe.




And to my fellow software "engineers": if you advocate the use of XML in a new project that could just as easily use JSON, CSV, Windows .INI file format, or any of dozens of other far saner options... you are a bad person and you should feel bad. (I understand, however, if you got stuck with a codebase that is already deeply XML-dependent. My condolences. Welcome to the club.)






Trevor Stone
  User: flwyd
  Date: 2012-06-02 21:39 (UTC)
  Subject:   (no subject)
I think the comment rules are inherited from SGML, though I seem to remember reading somewhere that SGML comments were supposed to have balanced pairs of -- and HTML parsers (which tend to live on the lax side) accept a stray -- as valid.

To be fair, JSON and CSV specifications don't allow comments anyway.

I don't remember if ASCII protocol buffers allow comments either, but I think they do. Even if they don't, protobufs have enough awesome that lack of comments is forgivable. Super-efficient wire format, readable ASCII format, typesafe, convenient API style, composable, backwards- and forwards-compatible, simple definition syntax…



Ben Cantrick
  User: mackys
  Date: 2012-06-03 00:55 (UTC)
  Subject:   (no subject)
> To be fair, JSON and CSV specifications don't allow comments anyway.

With CSV that isn't so bad. Most people don't store their config files in .CSV, so comments aren't nearly as necessary. If we start to get applications that use JSON config files, comments are going to be necessary.
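A quick sketch of what a strict JSON parser does with a commented config file, using Python's standard json module as an example parser. The config key, the "//" comment style, and the loads_or_error helper are all made up for illustration.

```python
import json


def loads_or_error(text: str):
    """Return the parsed value, or the JSONDecodeError if parsing fails."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        return err


plain = '{"splines": "reticulated"}'

# The same config with a "//" line explaining the setting.
commented = '{\n  // why this is set\n  "splines": "reticulated"\n}'

print(loads_or_error(plain))
print(type(loads_or_error(commented)).__name__)
```

The first call returns the parsed dictionary; the second returns a JSONDecodeError, because the comment line is not part of the JSON grammar.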

Which brings up another important point. The one that catalyzed the above rant, in fact. One of the major "problems" with XML is that it's so often used as a config file language, when it really isn't suitable for such. (I say "problem" in quotes because obviously it's not XML's fault that people choose to use it for the wrong thing...)



Alex Belits
  User: abelits
  Date: 2012-06-03 02:03 (UTC)
  Subject:   (no subject)
The whole reason for XML is, most "programmers" have no idea how a parser works, how to write one, where to find one, or any way to obtain that kind of knowledge. So they want one parser to rule them all, with not even a temptation for anyone to produce two interoperable implementations. And that requires a hopelessly obfuscated, verbose, human-unreadable format.

They also want an easy way to find broken code at runtime, because broken code is supposed to be an everyday occurrence. So they need a validating parser. The fact that they can't devise any reaction to an invalid format that is not "explode into tiny bits" does not bother them at all, because the important part is to be able to point fingers at the fellow idiot who sent them "invalid" data.



Ben Cantrick
  User: mackys
  Date: 2012-06-03 07:32 (UTC)
  Subject:   (no subject)
I agree with all of that, and have only one thing worth adding...

Compiler Construction should be mandatory for a CS degree. That way people will know it's possible to write their own parsers. And maybe XML won't be their first resort.



Alex Belits
  User: abelits
  Date: 2012-06-03 19:21 (UTC)
  Subject:   (no subject)
I see a problem here -- most programmers don't even have a CS degree.



  User: (Anonymous)
  Date: 2012-06-05 16:54 (UTC)
  Subject:   (no subject)
Alas...


