I only meant to write two articles on the difficulty of representing a journal article reference in a standard XML format. But an epilogue is warranted because, well, surely there has to be a standard way to do this.
Well, let’s step back a bit from the detail of XML representation. Let’s just look at cataloguing rules.
For the last forty years, librarians have sworn by (and in some cases at) the Anglo-American Cataloguing Rules, Second Edition, known universally as AACR2. This is a set of specifications that describe exactly how to take down the title, publication year and other relevant details of books, articles and other documents, and it forms the semantic basis for actual machine-readable formats like MARC.
Of course things have changed in the last 40 years, and AACR2 is seen as rather archaic. Accordingly, in 1997(!) it was decided that instead of revising these rules to a Third Edition, a more radical reworking would be needed for the 21st century, and so what would have become AACR3 was instead called Resource Description and Access, or RDA for short.
Well, the final draft of the RDA specification was made freely available in November 2008 for constituency review, so we can just look at that and see how it says to handle journal title, volume number, start-page and so on.
The download is a 35 MB zip archive, which when unpacked proves to contain 54 PDFs: 52 that, together, make up the actual specification, and two slightly different versions of the table of contents. If I tell you that (the more recent version of) the table of contents alone weighs in at 113 pages, you might get some idea of the size of the whole thing.
As best I can tell, the current total is about 1640 pages, so if you printed it out single-sided on standard 80 gsm paper, it would be 17 cm thick (nearly 7 inches) and weigh something over 8 kg (18 pounds). But that’s with loads of stuff left out: chapters 12-15, 23 and 33-37 inclusive are all represented by single-page placeholders saying “To be developed after the first release of RDA in 2009”, so I think we can expect that 1640 pages to grow quite a bit.
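For anyone who wants to check the stack-of-paper arithmetic, here is a rough sketch. It assumes single-sided printing on A4 sheets and a sheet thickness of about 0.1 mm, which is a typical ballpark for 80 gsm paper rather than a measured value, so the thickness comes out a shade under the 17 cm quoted above.

```python
# Rough estimate of the printed size of a 1640-page document.
# Assumptions (not from the RDA draft): single-sided A4, ~0.1 mm per sheet.
PAGES = 1640
GSM = 80                      # paper weight in grams per square metre
A4_AREA_M2 = 0.210 * 0.297    # area of one A4 sheet in square metres
SHEET_THICKNESS_MM = 0.1      # assumed thickness of one 80 gsm sheet

sheets = PAGES                # single-sided: one sheet per page
thickness_cm = sheets * SHEET_THICKNESS_MM / 10
weight_kg = sheets * A4_AREA_M2 * GSM / 1000

print(f"{thickness_cm:.1f} cm thick, {weight_kg:.1f} kg")
```

Printing double-sided would halve both figures, which is still a lot of paper.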
For comparison, the complete and unexpurgated edition of Les Miserables that I bought a couple of weeks ago comes in at 1376 pages; the 2006 paperback edition of War and Peace is 1475 pages; and the single-volume Lord of the Rings 50th anniversary edition (including The Fellowship of the Ring, The Two Towers, The Return of the King and the extremely voluminous appendices) is a relative lightweight at 1137 pages.
So the question is: does the 1640-page RDA specification tell me anything about how to encode, say, a journal title?
And the answer is: how the heck would I know? My only approach to figuring out an answer would be to search through 52 PDFs, which is not my idea of a good time.
How difficult can it be? I mean, really?
We all know what the references in bibliographies look like: you have author names, a publication date, a title, maybe a DOI; then for an article in a journal you have a journal title, volume number and maybe issue number, and start and end pages; for a chapter in an edited volume, you have editor names and the volume name; and both books and chapters have a publisher name and place, and maybe a page-count.
That’s it. Fourteen fields. One fewer than the Dublin Core. You could write a specification for representing these things in XML in ONE PAGE. One. Not 1640. One.
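To show how little is needed, here is a minimal sketch in Python that turns those fourteen fields into XML. The element names, the `reference` wrapper and the `reference_to_xml` function are all made up for illustration; they are not taken from RDA, MARC, Dublin Core or any other standard, and the sample data is a placeholder, not a real article.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names, one per field listed above (fourteen in all).
FIELDS = [
    "author", "date", "title", "doi",                       # common to everything
    "journal", "volume", "issue", "start-page", "end-page",  # journal articles
    "editor", "book-title",                                  # chapters in edited volumes
    "publisher", "place", "page-count",                      # books and chapters
]


def reference_to_xml(ref):
    """Serialise a dict of bibliographic fields as a <reference> element.

    A value may be a list (e.g. several authors), in which case one
    element is emitted per item. Unknown field names are rejected.
    """
    unknown = set(ref) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    root = ET.Element("reference")
    for name in FIELDS:  # emit fields in a fixed, predictable order
        if name not in ref:
            continue
        values = ref[name] if isinstance(ref[name], list) else [ref[name]]
        for value in values:
            ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder data, purely for illustration.
example = {
    "author": ["A. N. Author", "B. Other"],
    "date": 2009,
    "title": "An Example Article",
    "journal": "Journal of Examples",
    "volume": 12,
    "start-page": 34,
    "end-page": 56,
}
print(reference_to_xml(example))
```

A real format would still have to settle things like name ordering and date syntax, but none of that changes the basic shape: a flat list of optional, occasionally repeated fields.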
I think I speak for all right-thinking people when I say: *headdesk*.