[This is part one of a three-part series. When you’re done here, read on to part 2 and part 3.]
My job is in the subfield of programming that relates to searching, retrieval and metadata, especially as they relate to libraries. That means that what I deal with is mostly bibliographic metadata — sets of fields that describe books or journal articles. For example, the federated search system that we provide, while not in any way limited to searching for and presenting results of this kind, has tended to be used primarily in the library domain, so I spend a lot of my time dealing with bibliographic data.
It’s a jungle out there. The dominant electronic format for bibliographic information is still, by far, the ancient and faintly comical MARC (MAchine-Readable Cataloging) format, or rather, the MARC family of similar but subtly incompatible formats. MARC originated in the 1960s at the Library of Congress, literally as a way to encode the information on physical catalogue cards.
The MARC formats are too horrible to contemplate in detail, and I would not wish to impose the horrors of their details on non-library people. Suffice it to say that fields are numbered rather than named (from 001 to 999 inclusive); that fields have subfields named with any single character (which may or may not be ASCII, and if it is outside the ASCII range may or may not be encoded as UTF-8); and that — to give just one popular example — the field that you and I might call “title” is instead “field 245, subfield a”. In most MARC variants, anyway. So long as neither of the two available “indicators” changes the semantics.
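For a taste, here is a made-up title field in the usual human-readable MARC display convention: the three-digit tag, the two indicators, then subfields introduced by dollar signs. (The record is invented; the shape is the point.)

    245 10 $a Some perfectly ordinary title : $b with a subtitle / $c by Some Author.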
So MARC is justly reviled — increasingly, even within the libraries themselves, and certainly outside them. In these enlightened days, we all want to express bibliographic data in a simple format that is both human-readable and machine-parseable, and that uses actual, you know, field names to name the fields. In short, something in XML or YAML or similar. And after all, how hard can it be?
Representing a journal article
Take, for example, this perfectly simple reference to the paper describing and naming the awesome dinosaur Xenoposeidon:
Taylor, Michael P. and Darren Naish. 2007. An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England. Palaeontology 50 (6): 1547-1564. doi:10.1111/j.1475-4983.2007.00728.x
I won’t trouble you with how hideous this looks in the ubiquitous ISO 2709 encoding of MARC, because now that we live in the glorious 21st century we can represent it in XML. You’d surely just do something like this:
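(Element names invented on the spot, which is rather the point:)

    <!-- a naive, obvious rendering; element names are made up -->
    <article>
      <authors>
        <author>Taylor, Michael P.</author>
        <author>Naish, Darren</author>
      </authors>
      <year>2007</year>
      <title>An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England</title>
      <journal>Palaeontology</journal>
      <volume>50</volume>
      <issue>6</issue>
      <pages>1547-1564</pages>
      <doi>10.1111/j.1475-4983.2007.00728.x</doi>
    </article>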
I mean, it’s not hard, is it? Maybe shove in a namespace or something and you’re done.
It’s easy, right?
(At this point, those of you who know what’s coming have my permission to break down and weep openly. We’ll wait.)
MARCXML — why do you taunt me, cruel fate?
The library community’s initial brilliant solution to the problem of expressing bibliographic data in XML was to take the whole crumbling MARC edifice and schlump it down in the middle of the XML swamp and declare the job done. The result was MARCXML, a format that combines the simplicity and elegance of classic MARC with the concise syntax of XML to yield fragments such as the following, which expresses the title — just the title — of the record above:
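(Reconstructed here from the Library of Congress’s MARC21-slim schema; the indicator values are representative:)

    <!-- per the MARC21-slim schema; indicator values representative -->
    <record xmlns="http://www.loc.gov/MARC21/slim">
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England.</subfield>
      </datafield>
    </record>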
Part of me is impressed, and … Let’s just leave it at that.
So no-one loves MARCXML. To call it the worst of both worlds seems insufficiently harsh — it seems like you ought to need to take the worst from a lot more worlds than two to come up with something quite so pessimal as MARCXML. On top of everything else, it’s horrifically slow to parse — so much so that when we switched our federated search system across to using a customised alternative (which we called TurboMARC), we found that our XML parsing time dropped by a factor of four to five.
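(The trick, roughly: TurboMARC moves the tag numbers and subfield codes into the element names themselves, so a streaming parser can dispatch on the element name instead of grubbing through attribute values. Give or take the exact element names, the title field above comes out something like:)

    <!-- TurboMARC-style rendering; element names approximate -->
    <d245 i1="1" i2="0">
      <sa>An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England.</sa>
    </d245>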
So there has to be a better way to represent data about documents — something not mired in 1960s card-cataloguing practices.
And there is.
You’re not going to like it.
Tune in next time for whole new vistas of vileness.
That one is pretty good. Also, I appreciate the new vocab — “pessimal”. (I would normally have used the understated “sub-optimal” and a roll of the eyes.)
“Pessimal” has to remain a notional thing, like “infinity”. A design may be optimal, impossible to improve upon, but no matter how bad it is, you can always find a way to make it worse, and (this bit is deep) you will always be able to find someone who would argue in favor of doing so.
Reminds me of a recent client who turned my simple JSON response data:
{
  "Error": 4,
  "ErrorText": "Failed Validation",
  "ValidateError": {
    "Field 1": false,
    "Field 2": false
  }
}
Into:
{
  "Return": {
    "Type": "device.list API 2.0",
    "ResponseSummary": {
      "StatusCode": 4,
      "ErrorMessage": "Failed Validation"
    },
    "Results": {
      "ValidateError": {
        "Field 1": false,
        "Field 2": false
      }
    }
  }
}
I don’t think I even have to explain the silliness here. What’s worse is that he arrived at this format by misunderstanding the return data from the Bing Maps API: he tried to emulate it, got it wrong, and then insisted it had to be this way for JavaScript to be able to parse it.
I’ve seen that MARCXML approach elsewhere, with a third party we deal with – “we have a new web-service-based XML interface”. Us – great, we can ditch a whole load of costly-to-maintain low-level comms code, and the ISDN line we keep around just to talk to you; send us the spec… hmm, this looks exactly as if you’ve taken your old interface and put a tag around every field.
My favourite bit was that in the old interface, variable-length items would be represented by a count field at the start, which would tell you how many repetitions of the fixed-size block would immediately follow the count field.
In the XML version, rather than representing this using nested tags, they of course implemented it as:
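(A sketch of the shape of it; tag names invented for illustration:)

    <!-- tag names invented; note the count retained redundantly, blocks left flat -->
    <itemCount>3</itemCount>
    <item>first fixed-size block</item>
    <item>second fixed-size block</item>
    <item>third fixed-size block</item>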
Never underestimate an institution’s desire to take a wonderful modern information tool and re-implement their old, terrible information system with it. I work in government IT, and it’s practically the law that you have to at least consider, when spending time and money on a new system, recreating the old one.
Wow, that’s bad XML there. However, I have this nagging feeling that you could find enough bad XML like MARCXML out there to write a whole book.
Then you could catalog THAT book with MARCXML. :-)
Gee, and they missed a chance to call it MARCy Markup Language!