Bibliographic data, part 3: Has anyone, anywhere, ever read the whole of the RDA specification?

[This article concludes what’s turned out to be a three-part series.  You may wish to read part 1 and part 2 before this one.]

I only meant to write two articles on the difficulty of representing a journal article reference in a standard XML format.  But an epilogue is warranted because, well, surely there has to be a standard way to do this.

Well, let’s step back a bit from the detail of XML representation.  Let’s just look at cataloguing rules.

For the last forty years, librarians have sworn by (and in some cases at) the Anglo-American Cataloguing Rules, Second Edition, known universally as AACR2.  This is a set of specifications that describes exactly how to record the title, publication year and other relevant details of books, articles and other documents, and it forms the semantic basis for actual machine-readable formats like MARC.

Of course things have changed in the last 40 years, and AACR2 is seen as rather archaic.  Accordingly, in 1997(!) it was decided that instead of revising these rules to a Third Edition, a more radical reworking would be needed for the 21st century, and so what would have become AACR3 was instead called Resource Description and Access, or RDA for short.

Well, the final draft of the RDA specification was made freely available in November 2008 for constituency review, so we can just look at that and see how it says to handle journal title, volume number, start-page and so on.

Except …

The download is a 35 MB zip archive, which when unpacked proves to contain 54 PDFs: 52 that, together, make up the actual specification, and two slightly different versions of the table of contents.  If I tell you that (the more recent version of) the table of contents alone weighs in at 113 pages, you might get some idea of the size of the whole thing.

As best I can tell, the current total is about 1640 pages, so if you printed it out on standard 80 gsm paper, it would be 17 cm thick (nearly 7 inches) and weigh something over 8 kg (18 pounds).  But that’s with loads of stuff left out: chapters 12-15, 23 and 33-37 inclusive are all represented by single-page placeholders saying “To be developed after the first release of RDA in 2009”, so I think we can expect that 1640 pages to grow quite a bit.
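For anyone who wants to check the printing arithmetic, here is a rough back-of-envelope sketch.  It assumes single-sided A4 printing and a sheet thickness of about 0.1 mm for 80 gsm paper; neither figure comes from the RDA download itself, and the function name is just for illustration.

```python
# Back-of-envelope estimate of the printed size of a 1640-page document.
# Assumptions (mine, not from the RDA download): single-sided A4 printing,
# and roughly 0.1 mm per sheet for ordinary 80 gsm office paper.

PAGES = 1640
GSM = 80                      # paper weight in grams per square metre
A4_AREA_M2 = 0.210 * 0.297    # an A4 sheet is 210 mm x 297 mm
SHEET_THICKNESS_MM = 0.1      # assumed typical thickness of one sheet


def printed_stack(pages, sheets_per_page=1.0):
    """Return (thickness in cm, weight in kg) of the printed stack."""
    sheets = pages * sheets_per_page
    thickness_cm = sheets * SHEET_THICKNESS_MM / 10
    weight_kg = sheets * A4_AREA_M2 * GSM / 1000
    return thickness_cm, weight_kg


thickness_cm, weight_kg = printed_stack(PAGES)
print(f"{thickness_cm:.1f} cm thick, {weight_kg:.1f} kg")
```

Under those assumptions the stack comes out at a little over 16 cm and a little over 8 kg, which is in line with the figures above.  Printing double-sided (sheets_per_page=0.5) would halve both numbers.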

For comparison, the complete and unexpurgated edition of Les Misérables that I bought a couple of weeks ago comes in at 1376 pages; the 2006 paperback edition of War and Peace is 1475 pages; and the single-volume Lord of the Rings 50th anniversary edition (including The Fellowship of the Ring, The Two Towers, The Return of the King and the extremely voluminous appendices) is a relative lightweight at 1137 pages.

So the question is: does the 1640-page RDA specification tell me anything about how to encode, say, a journal title?

And the answer is: how the heck would I know?  My only approach to figuring out an answer would be to search through 52 PDFs, which is not my idea of a good time.

How difficult can it be?  I mean, really?

We all know what the references in bibliographies look like: you have author names, a publication date, a title, maybe a DOI; then for an article in a journal you have a journal title, volume number and maybe issue number, and start and end pages; for a chapter in an edited volume, you have editor names and the volume name; and both books and chapters have a publisher name and place, and maybe a page-count.

That’s it.  Fourteen fields.  One fewer than the Dublin Core.  You could write a specification for representing these things in XML in ONE PAGE.  One.  Not 1640.  One.
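To make the fourteen-field claim concrete, here is a minimal sketch of what such a record might look like, with a trivial serialiser to XML.  The class name, field names and element names are all hypothetical, invented for this illustration; they are not taken from RDA, MARC, Dublin Core or any other standard, and the sample reference is made up.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass
class Reference:
    """The fourteen fields listed above; every name here is hypothetical."""
    author: List[str] = field(default_factory=list)
    date: Optional[str] = None
    title: Optional[str] = None
    doi: Optional[str] = None
    # Journal articles
    journal_title: Optional[str] = None
    volume: Optional[str] = None
    issue: Optional[str] = None
    start_page: Optional[str] = None
    end_page: Optional[str] = None
    # Chapters in edited volumes
    editor: List[str] = field(default_factory=list)
    book_title: Optional[str] = None
    # Books and chapters
    publisher: Optional[str] = None
    place: Optional[str] = None
    page_count: Optional[str] = None


def to_xml(ref: Reference) -> str:
    """Serialise a Reference as one flat XML element per populated field."""
    root = ET.Element("reference")
    for f in fields(ref):
        value = getattr(ref, f.name)
        if value is None:
            continue
        # List-valued fields (author, editor) become repeated elements.
        values = value if isinstance(value, list) else [value]
        for v in values:
            ET.SubElement(root, f.name).text = str(v)
    return ET.tostring(root, encoding="unicode")


# A made-up journal article, purely for demonstration.
example = Reference(
    author=["A. N. Author", "A. N. Other"],
    date="2010",
    title="An Example Article",
    journal_title="Journal of Examples",
    volume="12",
    issue="3",
    start_page="45",
    end_page="67",
)
print(to_xml(example))
```

This is a sketch, not a proposal for a real format: it ignores name parsing, character encoding policy, and everything else a real specification would have to pin down.  But it does suggest that the core data model fits comfortably on a page.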

I think I speak for all right-thinking people when I say: *headdesk*.

15 responses to “Bibliographic data, part 3: Has anyone, anywhere, ever read the whole of the RDA specification?”

  1. Pingback: Bibliographic data, part 2: Dublin Core’s dirty little secret | The Reinvigorated Programmer

  2. But will the XML spec last? Can it be used for archiving, referencing, cross-referencing, and will it still be usable in 40+ years?

    Can it be universally adopted by libraries and archives, and be 100% clear in all respects?

    Will it be forward-compatible, will it be backward-compatible (very important, considering the body of works that we have accumulated in the last few thousand years)?

    Will it be universal to all types of media and publication formats possible?

  3. Philip, the problem isn’t “will the XML spec last?” The problem is that if the alternative is a specification that is 1600 pages long, no implementation is going to get it exactly right. So it doesn’t matter how clear it is, how forward compatible, or how universal — the larger the spec, the more likely it is that it is implemented incorrectly.

  4. Have you looked at MODS XML or Zotero RDF? They both seem quite sensible and useful compared to the horrors you’ve described. TEI XML also has elements for describing bibliographic data as well as the structure of the text itself, but is probably overkill if you only want bibliographic data.

  5. Gavin, yes I know about MODS and I think it’s the most promising of the alternatives out there: I do actually plan to write a followup on that subject, though from my experience looking into this a while back it still doesn’t make this trivial task as trivial as it ought to be.

    I didn’t know about the Zotero format, though. I tried to take a look at that, but … Do you have a reference for Zotero’s format? From my initial grepping around, it doesn’t look promising, indeed the “dev/data model” page says “Up to now there is no full reference of item types, fields and user data—just try to find out using the user interface or dig into SQL!”

  6. Anything that requires less than 1500 pages and/or 50 PDFs and doesn’t come with two separate tables of contents isn’t worth saying, much less writing.

    Somebody wise once said that the documentation for a format, as a product of the organization which produced it, is a mirror of that organization. Or something along those lines, anyway. Case in point: librarians seem to organize in byzantine structures that have the serious ambition of rivaling even the most convoluted bureaucracy hitherto known to man.

  7. Pingback: Bibliographic data, part 1: MARC and its vile progeny | The Reinvigorated Programmer

  8. @Florian, I believe you’re talking about Conway’s Law: “…organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.”

  9. Thank you for a brilliant series of posts. I’ll second the recommendation for MODS. Although it betrays some of its MARC heritage, it at least lets you encode all the parts of a citation. You won’t be able to encode an article in just 14 fields, but the bigger vocabulary gives it more expressive power. I wouldn’t want to use a metadata format that only let me describe articles, even if it was exquisitely tailored to do so.

  10. I would hand you a nice shiny OPAC which you have won, but you wouldn’t want it, so here are a few internets instead. This is brilliant.

    Phillip: Chances are XML will be a lot more usable in 40 years than MARC. If I were a betting woman, that’s definitely where I’d place my bet.

  11. Pingback: Datos bibliográficos, parte 1: MARC y su vil progenie | BaDoc

  12. Pingback: Tweets that mention Bibliographic data, part 3: Has anyone, anywhere, ever read the whole of the RDA specification? | The Reinvigorated Programmer -- Topsy.com

  13. If you’re a data guy, you really don’t need to mess with the RDA text. Instead, take a look at http://metadataregistry.org/rdabrowse.htm. These are the RDA Vocabularies, done by the DCMI/RDA Task Group, and lurching towards official status, probably later this year or early in 2011. For a description of what’s been done to create these, see: http://dlib.org/dlib/january10/hillmann/01hillmann.html

  14. Pingback: Semantic mapping is hard | The Reinvigorated Programmer

  15. The RDA will become a mandatory international standard, required for all scholarly projects. In a coincidental event slightly later, no reference will ever be checked again by humans, and a new dark age of ignorance and bad scholarship will follow, until civilization collapses excepting only the Amish! :-)
