My job is in the subfield of programming that deals with searching, retrieval and metadata, especially as they relate to libraries. That means that what I deal with is mostly bibliographic metadata: sets of fields that describe books or journal articles. For example, the federated search system that we provide, while not in any way limited to searching for and presenting results of this kind, has tended to be used primarily in the library domain, so I spend a lot of my time dealing with bibliographic data.
It’s a jungle out there. The dominant electronic format for bibliographic information is, still, by far, the ancient and faintly comical MARC (MAchine-Readable Cataloging) format, or rather, the MARC family of similar but subtly incompatible formats. MARC originated in the 1960s at the Library of Congress, literally as a way to encode the information on physical catalogue cards.
The MARC formats are too horrible to contemplate in detail, and I would not wish to impose on non-library people the horrors of their details. Suffice to say that fields are numbered rather than named (numbered from 001 to 999 inclusive), that fields have subfields named with any single character (which may or may not be ASCII, and if it is outside the ASCII range may or may not be encoded as UTF-8), and that — to give just one popular example — the field that you and I might call “title” is instead “field 245, subfield a”. In most MARC variants, anyway. So long as one or both of the two available “indicators” doesn’t change the semantics.
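For the morbidly curious, the shape just described can be sketched as a pair of Python dataclasses. This is a simplification under stated assumptions: the class and method names are invented here, the sketch ignores the low-numbered control fields that carry no subfields at all, and it makes no attempt to model how indicators alter a field's meaning.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Subfield:
    code: str   # a single character such as "a"
    value: str


@dataclass
class DataField:
    tag: str          # three digits, "001" to "999"; a number, not a name
    ind1: str = " "   # first indicator (blank when unused)
    ind2: str = " "   # second indicator (blank when unused)
    subfields: List[Subfield] = field(default_factory=list)

    def first(self, code: str) -> Optional[str]:
        """Return the value of the first subfield with this code, if any."""
        for subfield in self.subfields:
            if subfield.code == code:
                return subfield.value
        return None


# What you and I would call "the title" is field 245, subfield a.
title_field = DataField(tag="245", subfields=[Subfield("a", "Some title")])
print(title_field.first("a"))
```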
So MARC is justly reviled — increasingly, even within the libraries themselves, and certainly outside them. In these enlightened days, we all want to express bibliographic data in a simple format that is both human-readable and machine parseable, and that uses actual, you know, field names to name the fields. In short, something in XML or YAML or similar. And after all, how hard can it be?
Representing a journal article
Take, for example, this perfectly simple reference to the paper describing and naming the awesome dinosaur Xenoposeidon:
Taylor, Michael P. and Darren Naish. 2007. An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England. Palaeontology 50 (6): 1547-1564. doi:10.1111/j.1475-4983.2007.00728.x
I won’t trouble you with how hideous this looks in the ubiquitous ISO 2709 encoding of MARC, because now that we live in the glorious 21st century we can represent it in XML. You’d surely just do something like this:
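A minimal sketch of the sort of thing I mean, wrapped in a few lines of Python to show that it parses and that the fields fall straight out. The element names are invented for illustration and belong to no real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, made up for this sketch; not from any real schema.
NAIVE_RECORD = """\
<article>
  <author>Taylor, Michael P.</author>
  <author>Naish, Darren</author>
  <year>2007</year>
  <title>An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England</title>
  <journal>Palaeontology</journal>
  <volume>50</volume>
  <issue>6</issue>
  <pages>1547-1564</pages>
  <doi>10.1111/j.1475-4983.2007.00728.x</doi>
</article>
"""

article = ET.fromstring(NAIVE_RECORD)

# Single-valued fields are looked up by name; authors may repeat.
single_valued = ("year", "title", "journal", "volume", "issue", "pages", "doi")
record = {name: article.findtext(name) for name in single_valued}
record["authors"] = [author.text for author in article.findall("author")]

print(record["authors"])
print(record["title"])
```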
I mean, it’s not hard, is it? Maybe shove in a namespace or something and you’re done.
It’s easy, right?
(At this point, those of you who know what’s coming have my permission to break down and weep openly. We’ll wait.)
MARCXML — why do you taunt me, cruel fate?
The library community’s initial brilliant solution to the problem of expressing bibliographic data in XML was to take the whole crumbling MARC edifice and schlump it down in the middle of the XML swamp and declare the job done. The result was MARCXML, a format that combines the simplicity and elegance of classic MARC with the concise syntax of XML to yield fragments such as the following, which expresses the title — just the title — of the record above:
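To give a flavour of it, here is a hedged Python sketch that embeds a MARCXML-style title field and digs the title back out. The namespace and the element and attribute names are those of the MARC 21 "slim" schema; the indicator values are my assumption about how this particular title would be catalogued, and the helper function is invented for the example.

```python
import xml.etree.ElementTree as ET

MARCXML_NS = "http://www.loc.gov/MARC21/slim"

# Indicator values here are assumed for illustration, not taken from a real record.
FRAGMENT = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="3">
    <subfield code="a">An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England</subfield>
  </datafield>
</record>
"""


def subfield_values(record, tag, code):
    """Return the text of every subfield `code` inside every field `tag`."""
    namespaces = {"marc": MARCXML_NS}
    path = f"marc:datafield[@tag='{tag}']/marc:subfield[@code='{code}']"
    return [element.text or "" for element in record.findall(path, namespaces)]


record = ET.fromstring(FRAGMENT)
titles = subfield_values(record, "245", "a")
print(titles)
```

Note that even asking for "the title" means knowing the magic number 245 and the magic letter a.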
Part of me is impressed, and … Let’s just leave it at that.
So no-one loves MARCXML. To call it the worst of both worlds seems insufficiently harsh — it seems like you ought to need to take the worst from a lot more worlds than two to come up with something quite so pessimal as MARCXML. On top of everything else, it’s horrifically slow to parse — so much so that when we switched our federated search system across to using a customised alternative (which we called TurboMARC), we found that our XML parsing time dropped by a factor of four to five.
So there has to be a better way to represent data about documents — something not mired in 1960s card-cataloguing practices.
And there is.
You’re not going to like it.
Tune in next time for whole new vistas of vileness.