Soapbox: Humans should not have to grok XML
Answers to the question "When shouldn't you use XML?"

Terence Parr
Chief scientist, jGuru.com
August 2001

Today the computing world tends toward using XML for any and all formal specifications and data descriptions. The author, a big fan of XML, asks a blasphemous question: "Is XML totalitarianism a good idea?" In this opinion piece, Terence Parr, co-founder of jGuru, demonstrates that XML makes a lousy human interface. He also provides questions to ask yourself to determine if XML is appropriate even for your project's program-to-program interface needs.

Remember what life was like before cut-and-paste? (As my friend Gary Funck says, "If you're not old enough to remember ... then good for you.") Every program stored data differently and rarely expected to shovel data to another application, and certainly not to another running program. In modern operating systems, the paste buffer holds data in a standard way and each program is free to interpret the buffer data as it sees fit. For example, you can cut from a database program and meaningfully paste into a graphing program.

Similarly, we have a standard means of sharing data between programs and between machines on the Internet: XML. Without XML or a similar standard, no two programs could share information -- the fundamental syntax used to format data must be the same for data portability. Of course, you may not be able to interpret that data, but you can at least read it in. Take a look at such things as SOAP and XBeans to see how XML facilitates interoperability (see Resources).

Now that we've had a group hug and agreed that XML is, or should be, the common language of program data interchange, I'd like to discuss the converse: When does using XML make no sense? First, I need to remind you what XML looks like and how it differs from other data formats. Given that background, I can then ask a series of questions that may prove useful when determining a data format for your project. Finally, I'll demonstrate my main proposition: XML makes a lousy human interface.

You say "date-uh," I say "dat-uh"
XML is a means of highlighting the structure of data to make it easier for a computer program to examine that data. It is not, of course, the first data format. The venerable comma-separated-values (CSV) format has been used for decades to describe rows of data. For example, here are three rows of integers that might describe three date records:


8, 17, 1964
12, 30, 1975
9, 1, 1970
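
Reading this format really does take only a few lines. Here is a minimal sketch in Python, assuming each row is month, day, year (the order is my assumption; nothing in the CSV itself says so, which is exactly the format's weakness). The function name read_dates is made up for illustration.

```python
import csv
import io
from datetime import date

# The three sample rows, assumed to be month, day, year.
CSV_TEXT = """8, 17, 1964
12, 30, 1975
9, 1, 1970
"""

def read_dates(text):
    """Turn 'month, day, year' rows into date objects; position gives meaning."""
    dates = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        month, day, year = (int(field) for field in row)
        dates.append(date(year, month, day))
    return dates

for d in read_dates(CSV_TEXT):
    print(d.isoformat())
```

Swap two columns in the file and this reader will silently produce wrong dates or fail, because the only thing identifying a field is where it sits on the line.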

CSV is pretty hard to beat in terms of readability and implementation simplicity (surely you had to write a program to read CSV data in your first computer class). The problem is that CSV imposes a strict ordering of the data and CSV cannot easily describe nested structures or elements of varying types. Adding curly braces to indicate nested, aggregate data (as in C and VRML) improves expressibility while leaving the data human readable. For example, to associate an identifier with each of the rows of data, you might write:


{{8, 17, 1964}, instructor}
{{12, 30, 1975}, student}
{{9, 1, 1970}, student}

This format is still pretty easy to parse, but it continues to impose a strict ordering of the data. The way around the ordering limitation is to label all of the data, such as:


{date={m=8, d=17, y=1964}, title=instructor}
{date={d=30, m=12, y=1975}, title=student}
{title=student, date={m=9, d=1, y=1970}}

While the data is now position independent (you can see that I have shuffled some of the elements), the redundant labels increase storage costs and make it much harder to parse.
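
To give a feel for the extra parsing work, here is a rough sketch of a recursive reader for the labeled format, written in Python. The grammar is my guess from the three sample lines (a value is either a bare word or a brace-enclosed list of name=value pairs); the function names are invented, and error handling is minimal.

```python
import re

# A token is one of { } = , or a run of other non-space characters.
TOKEN = re.compile(r"\s*([{}=,]|[^\s{}=,]+)")

def tokenize(text):
    return TOKEN.findall(text)

def parse_value(tokens, pos):
    """Return (value, next_pos); a value is an atom or a {name=value, ...} group."""
    if tokens[pos] != "{":
        return tokens[pos], pos + 1
    record = {}
    pos += 1  # consume "{"
    while tokens[pos] != "}":
        name = tokens[pos]
        if tokens[pos + 1] != "=":
            raise ValueError(f"expected '=' after {name!r}")
        record[name], pos = parse_value(tokens, pos + 2)
        if tokens[pos] == ",":
            pos += 1  # consume the separator
    return record, pos + 1  # consume "}"

def parse_record(line):
    value, _ = parse_value(tokenize(line), 0)
    return value

print(parse_record("{title=student, date={m=9, d=1, y=1970}}"))
```

Because every field carries its label, the result is a dictionary and the order of fields on the line no longer matters. The price is a tokenizer and a recursive function where the CSV reader needed neither.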

Nowadays, most of us would encode the data in XML form something like the following:


<record>
    <date><m>8</m> <d>17</d> <y>1964</y></date> <title>instructor</title>
</record>
<record>
    <date><d>30</d> <m>12</m> <y>1975</y></date> <title>student</title>
</record>
<record>
    <title>student</title> <date><m>9</m> <d>1</d> <y>1970</y></date>
</record>
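
With a stock parser, reading these records is mostly glue code. The following Python sketch uses the standard xml.etree.ElementTree module; the wrapping records element is something I add because the listing has no single root element, and read_records is a made-up name.

```python
import xml.etree.ElementTree as ET

# Records modeled on the listing; note the differing element order.
XML_TEXT = """
<record>
    <date><m>8</m> <d>17</d> <y>1964</y></date> <title>instructor</title>
</record>
<record>
    <date><d>30</d> <m>12</m> <y>1975</y></date> <title>student</title>
</record>
<record>
    <title>student</title> <date><m>9</m> <d>1</d> <y>1970</y></date>
</record>
"""

def read_records(text):
    """Pull each record out of the tree by element name, not by position."""
    # An XML document needs exactly one root, so wrap the fragments.
    root = ET.fromstring("<records>" + text + "</records>")
    records = []
    for rec in root.findall("record"):
        when = rec.find("date")
        records.append({
            "title": rec.findtext("title"),
            "m": int(when.findtext("m")),
            "d": int(when.findtext("d")),
            "y": int(when.findtext("y")),
        })
    return records

for record in read_records(XML_TEXT):
    print(record)
```

The parser does the hard part, but you still have to walk the tree and yank out each value yourself.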

My point is that there are an infinite number of description formats (languages) with varying degrees of readability, ease of implementation, efficiency, generality, expressiveness, and so on. Nonetheless, we are rapidly converging on complete XML format dominance.

XML complexity and interprogram communication
The explosion of the Web and resulting pervasive familiarity with HTML has led us to XML, which is really just another structured data format. Unfortunately, computer time and space efficiency suffers tremendously the more we mark up our data. On the other hand, the data is in a standard form that is readable by any program with an XML parser. You need to resolve this trade-off between simplicity and standardization on a case-by-case basis. When in doubt, you should probably use XML. Another obvious rule is that if your data is highly structured, use XML. For example, when exporting jGuru FAQ content (see Resources) to our partners, we provide an XML data file. Conversely, if your data is not that complex, XML may be an inferior choice. By asking a few simple questions, you can usually decide between XML and another data format for your program. If you answer "yes" to any of the four questions in this section, you might want to consider trading standard XML data for a simpler format.

Would an XML parser far outweigh the rest of the program?
If your programming task and associated data are fairly simple, why burden yourself with a huge XML parser and the glue code necessary to yank out the data from the resulting tree? Remember, your goal is to get the job done, not entertain yourself playing with DOM trees. The more components you have in your program, the more likely it is to fail. Now, on the other hand, if your program already has an XML parser you might as well use it to be consistent within the program.

Recall also that most programming languages have small built-in parsers for non-XML data formats. Java, for example, has a standard way to read in a property file and also a StreamTokenizer class that makes it easy to pull apart simple data strings.

Will your program run on a small machine or on a huge data set?
If your data is not highly structured/nested and is very large relative to your machine, avoid XML because of the extra disk and memory storage you would need. When I worked at a supercomputing center back in 1993, the physicists often stored terabyte files (that's 1000 gigabytes per file!). Adding XML markup to the data would have made those files truly gargantuan. I dare say that getting the entire tree into memory would be a challenge even today on a parallel supercomputer. The extra processing time to parse the XML markup would also be prohibitive. Granted, you rarely need to simulate high energy laser fusion fluid dynamics, but even mundane files such as logs can get big.

Would an XML data format prevent you from using the myriad of line-based tools like grep, sed, awk, and wc?
If you say ls instead of dir, then you're probably familiar with the Unix line-based tools like grep and wc. I can safely say that I am crippled without grep and feel naked (believe me, that ain't pretty) without sed, awk, and wc. Storing information as position-dependent data with one record per line is a huge advantage over marked-up XML data because of all the amazing translations and operations you can perform on the data from the command line or with simple scripts. I don't need to fire up a development environment and write a program using an XML parser just to examine my data.

Consider the jGuru Web site, which generates many logs and transaction records. I elected to write the records out one per line without markup information -- the item's position on the line determines what it is. For example, site login events are written to a file in this form:


timestamp: user-id

Because the format is so simple, I can operate on the data with all the Unix tools. If I want to know how many times user 1290 has logged in today, I say something like the following:


$ grep 1290 login.log | wc -l

which filters the log file for any line that has 1290 in it and then counts how many lines resulted. If I want a histogram of all user logins today, I invoke the following magic:


$ awk '{print $2;}' login.log | sort | uniq -c | sort -r -n

Not everybody is so familiar with the Unix data tools, but the point is that choosing a simple data format lets me play with the data without having to resort to a program.

If I had stored the data with XML records like this:


<login><timestamp>2001-04-06</timestamp><id>1290</id></login>

it would be a lot harder to play with that data without writing a program.
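
For comparison, here is roughly the program you would end up writing to get the same login histogram from the XML records. It is a Python sketch under the assumption that the log holds one login element per line; the sample lines and the user ID 77 are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample data: one <login> element per line.
LOG_LINES = [
    "<login><timestamp>2001-04-06</timestamp><id>1290</id></login>",
    "<login><timestamp>2001-04-06</timestamp><id>77</id></login>",
    "<login><timestamp>2001-04-06</timestamp><id>1290</id></login>",
]

def login_histogram(lines):
    """Count logins per user ID, most frequent first."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        login = ET.fromstring(line)  # parse one record
        counts[login.findtext("id")] += 1
    return counts.most_common()

for user_id, total in login_histogram(LOG_LINES):
    print(total, user_id)
```

It is not a long program, but it is a program: a file to write, save, and debug, where the line-oriented log needed one pipeline typed at the prompt.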

Is your program a "one-off," a truly unique application?
You can never predict what will happen to your program or data in the future (just ask the COBOL programmers from the 1960s), but occasionally you have to write a program that performs a task you never expect to repeat. For example, if you are massaging the schema of your database when upgrading your software, it's reasonable to expect that the schema converter will probably not be used again (though you might later borrow some of the code). When dumping information from the database to restructure it for the new schema, you should optimize for speed of execution and ease of implementation, not for XML compliance.

Syntax is not semantics
Finally, as far as the interprogram communications are concerned, I'd like to remind you that syntax does not equal semantics. The semantics (meaning) of the data totally depends on your application. A parser handles only the syntax (format). Consider that all human languages use the exact same data format -- a string of characters (sentences) or a stream of sounds (speech). But, if you have ever asked for directions in Paris as a non-French speaker, you know that communication requires far more than agreeing to communicate by spoken utterances (helpful hint to the traveler: wave your hands a lot when you talk; it seems to help).

Just because something is in XML format doesn't mean that you can use a generic XML parser to understand the input. Trying to jam space-probe spectrographic data from the moons of Jupiter into your company's accounting program may be amusing, but it will probably irritate the accountants. It would be like cutting your Apache configuration file and pasting it into your graphics program. A common data format is great, but a syntax standard does not imply all programs will be able to understand all data.

XML is a poor human interface
Until now, I have discussed only data formats for conversations between programs. Aside from the caveats in the previous section, XML should be a safe bet for most of your program-to-program data format needs. What about programs, specifications, initialization files, and the like that are conversations between a human and a computer? In this section, I hope to convince you that humans should not have to write and grok XML. Besides the many existing standard special-purpose languages that provide superior interfaces, XML is about as far away from natural human language as you can get.

My argument is simple: Humans have an innate ability to apply structure to a stream of characters (sentences); therefore, adding markup symbols can only make it harder for us to read and more laborious to type. The problem is that most programmers have very little experience designing and parsing computer languages. Rather than spending the time to design and parse a human-friendly language, programmers are using the fastest path to providing a specification language and implementation: "Oh, use XML. Done." And that's OK, but I want programmers to recognize that they are providing an inferior interface when they take that easy route. Don't believe me? In the remainder of this article, I provide stark side-by-side comparisons of specialized human-targeted languages and their unnatural XML-structured equivalents.

Let's start with a simple arithmetic expression. Which is easier to read and write, the special-purpose syntax humans have used for at least a thousand years, or the XML equivalent?

Mathematics:

3+4*5

XML syntax:

<add>
  <int>3</int>
  <mult>
    <int>4</int><int>5</int>
  </mult>
</add>

Surely 3+4*5 is easier to read and write. Humans have built specialized domain-specific languages precisely because they describe problems in their domain efficiently (note the vast number of special-purpose programming languages such as PostScript, Perl, Mathematica, and so on). The above XML specification is a representation of the parse tree (the structure) of the expression -- remember sentence diagramming? Language parsers convert input sentences into parse trees before processing because explicit parse trees are much easier for a computer to deal with than the implied structure of sentences, which humans understand so readily. Typing the explicit structure avoids the need for a specialized parser in your program, but it places a great burden on the user.
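
To make the trade concrete, here is a small recursive-descent sketch in Python that lets the human type 3+4*5 and has the computer derive the tree. It handles only non-negative integers, + and *, with no parentheses and no error handling; the function names are my own, and the element names follow the XML shown above.

```python
import re
import xml.etree.ElementTree as ET

def tokenize(text):
    """Split the input into integer and operator tokens."""
    return re.findall(r"\d+|[+*]", text)

def parse_expr(tokens):
    """expr := term ('+' term)*"""
    node = parse_term(tokens)
    while tokens and tokens[0] == "+":
        tokens.pop(0)
        parent = ET.Element("add")
        parent.append(node)
        parent.append(parse_term(tokens))
        node = parent
    return node

def parse_term(tokens):
    """term := int ('*' int)*   (binds tighter than '+')"""
    node = parse_int(tokens)
    while tokens and tokens[0] == "*":
        tokens.pop(0)
        parent = ET.Element("mult")
        parent.append(node)
        parent.append(parse_int(tokens))
        node = parent
    return node

def parse_int(tokens):
    leaf = ET.Element("int")
    leaf.text = tokens.pop(0)
    return leaf

def to_xml(expression):
    """Parse the expression and serialize its parse tree as XML text."""
    return ET.tostring(parse_expr(tokenize(expression)), encoding="unicode")

print(to_xml("3+4*5"))
```

The precedence of * over + lives in the grammar (a term is parsed before an expr finishes), which is the structure a human reader supplies without thinking.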

Lest you think expressions are too simple to be an appropriate example, consider the customized query language I designed and implemented for pulling data from jGuru's object database. Listing 1 shows a sample query.

Listing 1. A concise query not in XML

query type Person props (email,firstname,lastname) where "EID>100"

Humans certainly prefer typing that concise one-line query sentence rather than the equivalent XML in Listing 2.

Listing 2. The same query in XML
<query>
  <type>Person</type>
  <props>
    <prop id="email"/>
    <prop id="firstname"/>
    <prop id="lastname"/>
  </props>
  <cond>
    <gt>
      <prop id="EID"/>
      <int>100</int>
    </gt>
  </cond>
</query>

Naturally, my specialized parser converts the query into a tree structure in memory that is structured just like the XML. The point is that humans get to type the simple query and the computer does the work of explicitly deriving the structure. Note that the result of this query, a set of objects, is sent back to the client as serialized XML data since it represents program-to-program communication and is an extremely appropriate use of XML.
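
As an illustration only, here is a toy translator in Python that turns the one-line query into a tree shaped like Listing 2. It is not the jGuru parser: the grammar is guessed from the single example, it accepts only a "property > integer" condition, and query_to_tree is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Guessed grammar: query type NAME props (a,b,...) where "PROP>INT"
QUERY = re.compile(
    r'query\s+type\s+(\w+)\s+props\s*\(([^)]*)\)\s+where\s+"(\w+)\s*>\s*(\d+)"'
)

def query_to_tree(text):
    """Build the explicit tree that the human did not have to type."""
    match = QUERY.fullmatch(text.strip())
    if match is None:
        raise ValueError("unsupported query: " + text)
    type_name, prop_list, cond_prop, cond_value = match.groups()

    query = ET.Element("query")
    ET.SubElement(query, "type").text = type_name
    props = ET.SubElement(query, "props")
    for name in prop_list.split(","):
        ET.SubElement(props, "prop", id=name.strip())
    gt = ET.SubElement(ET.SubElement(query, "cond"), "gt")
    ET.SubElement(gt, "prop", id=cond_prop)
    ET.SubElement(gt, "int").text = cond_value
    return query

tree = query_to_tree(
    'query type Person props (email,firstname,lastname) where "EID>100"'
)
print(ET.tostring(tree, encoding="unicode"))
```

A real query language needs a proper grammar rather than one regular expression, but even this sketch shows where the work belongs: in the program, once, instead of in every query a person types.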

What about larger specifications? They too are more easily understood using human-oriented languages. Consider the course descriptor file that we use at jGuru to combine various course modules into a complete course. Listing 3 shows a small description for a JavaIntro course.

Listing 3. A short non-XML description of an intro to Java course

course JavaIntrocourse {
  title = "Java Language Essentials"
  caption = "Core features of the Java programming language"
  mmlvers = "2"
  content {
    intro = "intro.mml"
    variables = {useApplets="false", genTM="true", genIBM="true"}
    modules = { "JavaIntro.mod", "vsCOBOL.mod" }
  }
}

There is a lot of information in the description, but it is really just a bunch of assignment statements and lists of strings. Even nonprogrammers can look at this and get the basic idea. I would guess the equivalent XML in Listing 4 would present legibility problems to most nonprogrammers.

Listing 4. The XML equivalent to Listing 3

<course>
  <title>Java Language Essentials</title>
  <caption>Core features of the Java programming language</caption>
  <mmlvers>2</mmlvers>
  <content>
    <intro>intro.mml</intro>
    <variables>
      <var id="useApplets">false</var>
      <var id="genTM">true</var>
      <var id="genIBM">true</var>
    </variables>
    <modules>
      <module>JavaIntro.mod</module>
      <module>vsCOBOL.mod</module>      
    </modules>
  </content>
</course>

Certainly, the XML specification is harder to read than the more classical specification even for programmers. There is so much XML "noise" that the data no longer jumps out at you. Yes, with experience, reading XML becomes easier (I even got good at reading hex memory dumps when I wrote device drivers for industrial robots), but which would you rather type in? You can get used to anything, but why should you have to get used to grokking computer-friendly data when there are more human-friendly alternatives?

It is worthwhile to mention that we use an XML-compliant HTML-like markup language for writing the actual course module text because we need to embed structure within the English prose that is clearly separable from the English. Without a markup language, the module parser could not distinguish between the course content and the section markers and so on. It is sometimes hard to read the raw module source (again, because of the XML noise), but no other domain-specific embedded language made sense.

For many specification problems, natural languages, such as English, would provide the most natural written interface. Unfortunately, natural languages are ambiguous and extremely difficult to recognize. With a little work, however, you can define a simplified subset that is unambiguous, but still expressive. Consider an adventure game language where you might say:

Tease the nice velociraptor

Any English speaker will be able to parse the sentence above even if they have never heard of a velociraptor. How would you like to play a game where you had to type Listing 5 instead?

Listing 5. An XML rendering of Tease the nice velociraptor

<command>
  <verb>Tease</verb>
  <object>
    <article>the</article>
    <nounmodifier>
      <adjective>nice</adjective>
      <noun>velociraptor</noun>
    </nounmodifier>
  </object>
</command>

Whoa! What a fistful. Humans neither need nor want markup structure tags to grok sentences. When typing, I want to say "Tease the nice velociraptor". When the game records my command history, on the other hand, it might store commands as XML to prevent having to reparse the commands.
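
Recovering that structure from the plain sentence takes very little code when the language is a restricted subset of English. The Python sketch below assumes a made-up grammar of my own: a verb, an optional article, any number of adjectives, and a final noun. The function name parse_command is hypothetical.

```python
ARTICLES = {"a", "an", "the"}

def parse_command(sentence):
    """Assumed grammar: verb [article] adjective* noun."""
    words = sentence.split()
    if len(words) < 2:
        raise ValueError("expected a verb followed by an object")
    command = {
        "verb": words[0],
        "article": None,
        "adjectives": [],
        "noun": words[-1],  # the last word names the object
    }
    middle = words[1:-1]
    if middle and middle[0].lower() in ARTICLES:
        command["article"] = middle.pop(0)
    command["adjectives"] = middle  # whatever remains modifies the noun
    return command

print(parse_command("Tease the nice velociraptor"))
```

Like the English speaker who has never heard of a velociraptor, the parser needs no dictionary entry for the noun; the word's position in the sentence is enough.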

I'll close my argument with an analogy to another human-to-program interface language: human speech. For a computer program, looking at a digitized speech signal (a stream of numbers) and trying to extract a sequence of English words is extremely difficult. On the other hand, humans have evolved over millions of years to understand speech without effort, hence we find it a particularly satisfying interface. Any change to your natural way of speaking reduces the efficiency of the interface. Imagine having to "mark up" your speech by providing extra information to an unsophisticated recognition program. Unfortunately, early commercial voice-recognition programs required you to do exactly that: You had to pause between spoken words! The pauses remove word boundary uncertainties -- one of the biggest problems -- by explicitly saying the equivalent of:


<word>stupid</word><word>computer</word>

Using XML markup for human-computer languages is analogous to making you pause between spoken words and is equally annoying.

Summary
XML is just another data format, albeit an important and sophisticated standard format. In most cases, it makes sense to store or transmit your data in XML format with a few exceptions:

  • When your problem is very simple
  • When it would generate humongous files
  • When your app is a "one-off"
  • When you need to use Unix line-oriented text processing tools

There is room for discussion concerning the use of XML for inter-program communication, but when it comes to human-computer communication such as programming languages or configuration files, XML provides the least natural human interface possible.

My argument boils down to one of human vs. computer hardware. Humans deal especially well with implied structure whereas computers, which were designed to be good at what we are not, prefer explicit structure. The closer your computer language is to natural language, the more natural it will be for a human, but the harder it will be to implement. A good compromise in this tug-o-war is to use a subset of natural language possibly with some hints in the form of punctuation, mathematics being the most obvious and useful example. To my amazement, this classical approach has lost dominance to XML-based explicit structure languages whose form is trivial to recognize (download a free standard XML parser), but that are extremely unnatural and laborious to type and read. Where you strike the balance in your interface language has a lot to do with your experience and available resources, but I hope you at least recognize that computer-friendly XML syntax is not human friendly.

Let me leave you with some advice: learn about languages, their design and implementation. Consider that XML itself exists to "fix" SGML's linguistic complexity and implementation difficulties. Skill with computer languages is the single most useful weapon you can acquire because it covers just about every application of computing. As the primary developer of ANTLR, a popular parser/translator generator (see Resources), I receive questions from an amazingly broad group of users: biologists doing DNA pattern recognition, NASA scientists automatically building communication libraries from deep space probe specification RTF documents, people building configuration files for every conceivable kind of program, and so on. The jGuru.com portal uses many languages and parsers from object-schema specifications to HTML sanitizers. The point is that computer language skills enable you to produce extremely flexible and powerful software, not just compilers for new programming languages. And, most importantly with respect to my focus here, you will be able to produce human-friendly text interfaces.

Resources

About the author
Terence Parr is a co-founder and chief scientist at jGuru.com, where he built the current version of the jGuru portal. When not writing code, he secretly writes naughty limericks about his coworkers and bangs out some pretty mean boogie-woogie piano. Terence is also the principal force behind the widely used ANTLR language translator-generator tool and has made fundamental contributions to the computer language research community. Terence holds a Ph.D. in computer engineering from Purdue University.

