Bibliographic data, part 2: Dublin Core’s dirty little secret

[This is part two in a series — you should read part 1 first for context and then you might go on to part 3.]

The Dublin Core — metadata made dumb

Just when librarians were in despair of ever getting their data out to the world in a form it could understand, along came the Dublin Core (DC for short) — a simple set of fifteen metadata elements (contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type) that could be used to describe “document-like objects” such as books, journal articles and web pages.

Everyone in the library world got really excited about the Dublin Core for about three weeks in 1999, before realising that you can’t actually do anything with those elements beyond expressing author (called “creator”), title and date. Everything else was too vague to be of any use: coverage, anyone? Relation? Format?

If you don’t believe me, try translating a reference to a journal article into DC — for example, this one that we used in the previous article:

Taylor, Michael P. and Darren Naish. 2007. An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England. Palaeontology 50 (6): 1547-1564. doi:10.1111/j.1475-4983.2007.00728.x

I can easily see how to map the author, date and articleTitle, but not the journalTitle, volume, issue, startPage, endPage or DOI. So only three of the nine elements, one third, are representable.
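That tally can be written down as a rough Python sketch. The field labels (articleTitle, journalTitle and so on) are the informal names used in the paragraph above, not part of any standard, and the mapping is an assumption: it follows the count given here rather than any official crosswalk.

```python
# Citation fields, using the informal labels from the paragraph above.
citation = {
    "author": ["Taylor, Michael P.", "Naish, Darren"],
    "date": "2007",
    "articleTitle": "An unusual new neosauropod dinosaur from the Lower "
                    "Cretaceous Hastings Beds Group of East Sussex, England",
    "journalTitle": "Palaeontology",
    "volume": "50",
    "issue": "6",
    "startPage": "1547",
    "endPage": "1564",
    "DOI": "10.1111/j.1475-4983.2007.00728.x",
}

# The only mappings onto the fifteen core elements that seem obvious.
# Assumption: this mirrors the tally in the text, not an official crosswalk.
OBVIOUS_DC_MAPPING = {
    "author": "creator",
    "date": "date",
    "articleTitle": "title",
}

mapped = [field for field in citation if field in OBVIOUS_DC_MAPPING]
unmapped = [field for field in citation if field not in OBVIOUS_DC_MAPPING]

print(f"{len(mapped)} of {len(citation)} fields map cleanly: {mapped}")
print(f"No obvious home for: {unmapped}")
```

Everything in the second list is exactly what a reader needs in order to find the article on a shelf.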

The Dublin Core people quickly realised that while the fifteen core elements are OK for describing web-pages (which is frankly what they were designed for, despite all the “cross-domain” rhetoric), they were not much use for describing, well, anything else. Not beyond the absolute basics, anyway.

[Oh, and by the way: there was, and is, no standard XML format for Dublin Core, merely guidelines for how to roll your own. Just in case you were wondering. There are standard element names to use (e.g. <dc:title>) but no standard wrapper element to represent the record as a whole.]

Qualified Dublin Core — metadata made slightly less dumb

The solution to the paucity of Dublin Core elements was this thing called “qualified Dublin Core” (although that term doesn’t seem to be used much any more), in which the fifteen core elements are qualified to make them more specific — for example, dateAccepted, dateAvailable and dateCopyrighted are refinements of the core element date. According to the Dublin Core’s own dumb down principle, “a client should be able to ignore any qualifier and use the value as if it were unqualified […] Qualification is therefore supposed only to refine, not extend the semantic scope of an Element.” Sounds good, right?

Except:

  • There is still no canonical XML representation for Dublin Core records, only canonical XML element names for Dublin Core elements.
  • The XML representation of dateAccepted is not, as you might expect, <dc:date type="accepted"> but <dcterms:dateAccepted>, which means you can’t implement the dumb down principle just by discarding qualifiers: you need to encode specific knowledge of how to map “qualified” to core DC elements in your application. In other words, “qualified Dublin Core” is not qualified at all.
  • The dcterms namespace has its own instances of the fifteen core elements, so when you want to add a contributor, you have to choose (how?) between <dc:contributor> and <dcterms:contributor>.
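To illustrate the second and third points, here is a minimal Python sketch of what "dumbing down" ends up looking like in an application. The function name dumb_down and the REFINES table are made up for this example, and the table is only a small hand-picked subset of the refinement relationships published in the DCMI terms; a real application would need the full list.

```python
# The fifteen core element names.
CORE = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}

# Hand-maintained knowledge: which core element each refinement narrows.
# Assumption: an illustrative subset only, not the complete DCMI list.
REFINES = {
    "dateAccepted": "date",
    "dateSubmitted": "date",
    "dateCopyrighted": "date",
    "issued": "date",
    "available": "date",
    "isPartOf": "relation",
    "abstract": "description",
    "alternative": "title",
    "bibliographicCitation": "identifier",
}


def dumb_down(qualified_name: str) -> str:
    """Map a 'dc:xxx' or 'dcterms:xxx' name to its core 'dc:xxx' element."""
    _prefix, _, local = qualified_name.partition(":")
    if local in CORE:
        # dcterms duplicates of the core fifteen collapse onto dc.
        return f"dc:{local}"
    if local in REFINES:
        return f"dc:{REFINES[local]}"
    raise KeyError(f"no known core element for {qualified_name!r}")


for name in ("dcterms:issued", "dcterms:contributor", "dcterms:isPartOf"):
    print(name, "->", dumb_down(name))
```

Nothing in the element name itself tells the client that issued is a kind of date; the knowledge lives entirely in the lookup table.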

All of this, inexplicable though it may appear, would perhaps be tolerable, were it not for the core incompetence of the Dublin Core model. And here at last we come to the promised Dirty Little Secret …

Even qualified Dublin Core can’t describe a journal article

When I first heard this, I flatly refused to believe it. It seemed impossible that anyone could design a metadata element set for describing documents and have it not able to describe a journal article. But it is, amazingly, quite true. When I made my best effort to render the reference above into Qualified Dublin Core, I found that I was able to represent only one additional field (the DOI, and that not very well) beyond the three basic elements (authors, date, title) that basic Dublin Core allowed me.

Critics, with the exception of Oscar Wilde, seem mostly to agree that the death of Little Nell (in Dickens’s Old Curiosity Shop) is one of the saddest passages in literature. Personally, I lean more towards the separation of Rose and Doctor at the end of Doomsday (you know, before the They Can Never See Each Other Again Because The Path Between Universes Has Closed Forever thing got downgraded to She Can’t Appear Again Until The Fourth Season Due To Other Work Commitments). Others may cite the ending of Old Yeller or the departure of the ring-bearers to the Undying Lands after the scouring of the Shire. But for me, the most tragic document ever written is Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata: four and a half thousand words of desperate flailing that could have been summarised as “Don’t even bother trying, it just doesn’t work”.

Turns out that the Qualified Dublin Core solution to the problem of citing journal articles was to add — get ready for this — a bibliographicCitation element. Oh, joy! And so the introduction of the Guidelines document concludes with the observation:

Before the introduction of the Dublin Core term ‘bibliographicCitation’ it was not obvious how to describe fully a journal article using Dublin Core metadata. There was no suitable Dublin Core property to capture the journal title, as distinct from the article title, or the volume, issue and page details, other than as part of a general description.

Thank heavens that’s changed! Now, instead of shoving the journal title, volume, issue and page details into an undifferentiated lump of text in the description field, we can shove the journal title, volume, issue and page details into an undifferentiated lump of text in the bibliographicCitation field!

This, let me remind you, in a specification that includes SEVENTY data elements — the original fifteen core elements, plus 55 added in Qualified DC, of which 15 are duplicates of the originals. And in those 70 elements they couldn’t make room for journal title? Seriously?

The official, sanctioned, allegedly interoperable encoding of my perfectly simple article citation into Dublin Core

Here it is, folks, based on the Guide. Read it and weep:

<mikesMadeUpNamespace:article
xmlns:mikesMadeUpNamespace="whatever"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/">
<dc:creator>Michael P. Taylor</dc:creator>
<dc:creator>Darren Naish</dc:creator>
<dcterms:issued>2007</dcterms:issued>
<dc:title>An unusual new neosauropod dinosaur
from the Lower Cretaceous Hastings Beds Group
of East Sussex, England.</dc:title>
<dcterms:isPartOf>urn:ISSN:0081-0239</dcterms:isPartOf>
<dc:publisher>Blackwell</dc:publisher>
<dc:type>Text</dc:type>
<dcterms:bibliographicCitation>
Palaeontology 50(6), 1547-1564. (2007)
</dcterms:bibliographicCitation>
<dc:identifier>info:doi:10.1111/j.1475-4983.2007.00728.x</dc:identifier>
</mikesMadeUpNamespace:article>

It makes me want to cry.

Note that:

  • There is still no standard XML format for Dublin Core records, so I had to make up my own wrapper element (which of course can’t be in either of the two DC namespaces).
  • For the actual elements, I am supposed to use a mixture of elements from dc and dcterms namespaces.
  • The element containing the publication date is not called publicationDate or datePublished, nor even issuedDate or dateIssued, but just issued — unlike, for example, dateSubmitted or dateAccepted.
  • The best I can do by way of trying to express the journal title is to use the dcterms:isPartOf element and give it the ISSN of the journal (wrapped up as a URI), in the hope that whoever uses this record will go and look that ISSN up to find out what journal it pertains to.
  • The publisher is considered an important part of the citation (unlike, say, the journal title, volume, issue or page-range) despite the fact that journal-article citations never include the publisher.
  • It’s considered important to state that the type of the referenced item is Text.
  • The type “Text” is drawn from a vocabulary whose URI is known (I got it from the Guide) but I couldn’t figure out what XML attribute I am supposed to use to point to that URI.
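To see what a consumer actually gets out of such a record, here is a rough Python sketch using the standard library's xml.etree.ElementTree. It works on a trimmed copy of the record (title shortened), and it assumes straight quotes and the dc and dcterms prefixes bound to the usual DCMI namespace URIs. The wrapper prefix m and the wanted list are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the record; the wrapper namespace is invented.
RECORD = """\
<m:article xmlns:m="whatever"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <dc:creator>Michael P. Taylor</dc:creator>
  <dc:creator>Darren Naish</dc:creator>
  <dcterms:issued>2007</dcterms:issued>
  <dc:title>An unusual new neosauropod dinosaur</dc:title>
  <dcterms:isPartOf>urn:ISSN:0081-0239</dcterms:isPartOf>
  <dc:publisher>Blackwell</dc:publisher>
  <dcterms:bibliographicCitation>
    Palaeontology 50(6), 1547-1564. (2007)
  </dcterms:bibliographicCitation>
  <dc:identifier>info:doi:10.1111/j.1475-4983.2007.00728.x</dc:identifier>
</m:article>
"""

root = ET.fromstring(RECORD)

# Collect values by local element name; repeated elements become lists.
fields = {}
for child in root:
    # ElementTree reports tags as "{namespace-uri}localname".
    local = child.tag.split("}", 1)[1]
    fields.setdefault(local, []).append(child.text.strip())

# The parts of a journal citation we would like as separate fields.
wanted = ["journalTitle", "volume", "issue", "startPage", "endPage"]
missing = [name for name in wanted if name not in fields]

print("Structured fields:", sorted(fields))
print("Still buried in free text:", missing)
print("The blob:", fields["bibliographicCitation"][0])
```

The parser happily hands back creators, a date and a title, but every one of the wanted fields is only reachable by picking apart the bibliographicCitation string.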

And of course all of this is on top of the utterly baffling brain-damage that is the bibliographicCitation element. And by the way, if the sample bibliographicCitation above doesn’t seem too dreadful to you, then consider this sample Big Undifferentiated Blob Of Text, straight from the Guide:

Proceedings of the International Conference on Dublin Core and metadata for e-communities, 2002; DC-2002: Metadata for e-Communities: Supporting Diversity and Convergence, Florence, Italy, 13-17 October 2002, pp 71-80

bibliographicCitation format: the pain, the glory, the other pain

But at least the client software can reliably parse the journal title, volume, issue, start-page and end-page out of the bibliographicCitation, right? I mean, it must be in a standard format, right?

Right?

Viewers of a sensitive disposition might wish to look away now.

Here’s what section 2.2 of the Guide says:

Plain text citations may be according to a recognised citation style. Several styles were reviewed by the DCMI Citation Working Group, and are listed on a Citation Styles page, but there is no particular recommendation for choice of style.

And indeed the two sample bibliographicCitation examples above are in noticeably different formats, even allowing that one is for a journal article and the other for a paper in a proceedings volume. For example, the date is parenthesised in the former but not in the latter.

Oh, and from section 2.1:

Other details of the resource, such as its title and creators, will be described using the usual Dublin Core properties. Optionally, but redundantly, these details may be included in the citation as well.

In other words, any old crap can be shoved in the bibliographicCitation field.

So let’s review: the official way to represent journal title, volume number, issue number, start page and end page in the 70-element Qualified Dublin Core set is: jam them, and quite possibly some other data you happen to have lying around, together into a text blob in any format you happen to feel like.

Of course, for a computer reading the XML to make any use of this information, it will need to parse the bibliographicCitation to figure out what the journal title, etc., are. But since that field can contain any combination of elements in any format, any parser will need to try all sorts of heuristics to match the format and figure out which bits represent what data. Which of course is exactly what you’d have to do if all you had to work with was the plain-text citation that we started this article with, long, long ago.
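Here is a minimal Python sketch of what that heuristic parsing looks like in practice. The two regular expressions are guesses at two possible layouts, not anything the Guide specifies, and the function name parse_citation is made up for this example.

```python
import re

# Guess 1: "Journal VOL(ISSUE), START-END. (YEAR)"
JOURNAL_STYLE = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>\d+)\s*\((?P<issue>\d+)\),\s*"
    r"(?P<start>\d+)-(?P<end>\d+)\.\s*\((?P<year>\d{4})\)$"
)

# Guess 2: anything that at least ends in "pp START-END".
PAGES_ONLY = re.compile(r"pp\.?\s*(?P<start>\d+)-(?P<end>\d+)$")


def parse_citation(text):
    """Try each known layout in turn; return a dict of fields, or None."""
    text = " ".join(text.split())  # normalise whitespace and line breaks
    match = JOURNAL_STYLE.match(text)
    if match:
        return match.groupdict()
    match = PAGES_ONLY.search(text)
    if match:
        return match.groupdict()
    return None


article = "Palaeontology 50(6), 1547-1564. (2007)"
proceedings = (
    "Proceedings of the International Conference on Dublin Core and "
    "metadata for e-communities, 2002; DC-2002: Metadata for e-Communities: "
    "Supporting Diversity and Convergence, Florence, Italy, "
    "13-17 October 2002, pp 71-80"
)
# A hypothetical third layout, equally permitted, that neither guess covers.
other_style = "Palaeontology 50: 1547-1564"

for blob in (article, proceedings, other_style):
    print(parse_citation(blob))
```

The first blob yields all six fields, the second yields only the page range, and the third yields nothing at all. Every new citation style met in the wild means another regular expression.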

To summarise: Qualified Dublin Core, with its 70 fields, is no more useful for expressing journal-article citations than plain text.

Oh, am I shouting? Sorry.

Appendix. Don’t even get me started on the use of the OpenURL 1.0 (ANSI/NISO Z39.88) ContextObject KEV format as an alternative for the content of the <bibliographicCitation> field

Having written that heading, I feel no need to expand further on it.

OK, I’m out of here. I need to take a shower.

Tune in next time for yet more pain.


23 responses to “Bibliographic data, part 2: Dublin Core’s dirty little secret”

  1. Pingback: Bibliographic data, part 1: MARC and its vile progeny | The Reinvigorated Programmer

  2. One would think they could learn a thing or two from BibTeX …

  3. Pingback: Bibliographic data, part 3: Has anyone, anywhere, ever read the whole of the RDA specification? | The Reinvigorated Programmer

  4. You’re supposed to use RDF (1.0) to encode DC. That way, OWL gives you the subtype and equality information

  5. Could this be because your standard brick-and-mortar library catalogs physical items? I mean, it makes sense that a library can lend out its print copy of a journal issue, but how does a library lend out a journal article?

  6. Thanks, all, for some interesting comments.

    Chris, I don’t think bringing on RDF and OWL helps very much here, because the actual fieldnames just don’t exist. By saying that a journal article isPartOf a journal issue, you can then use the dc:title element to specify the journal name, but (A) that doesn’t help with the volume number, issue number, and page-range; and (B) it’s not really right anyway, since that specific issue doesn’t have a title, it’s an instance of a journal that has a title. If you have to say that the article isPartOf an issue and that the issue isInstanceOf a journal just so you can give the journal title, then things have gone very, very wrong: it’s information architecture astronautics gone wild.

    Khairul, you are almost certainly right that the poor support for journal articles in library standards goes back to brick-and-mortar days. But (A) that excuse would have been less unacceptable in 2000 than it is now, and (B) that doesn’t explain the lack of volume and issue fields, which you’d surely need for cataloguing hardcopy issues.

  7. Unless there exist journals that can have multiple issues in a single day, a date and title uniquely identify a journal issue. For retrieval purposes, I suppose that’s all that is needed.

  8. The date that is included in citations is almost always a year alone — not 12 March 1968, but just 1968. And the great majority of journals publish more than one issue a year, so “date” as given is certainly not an adequate substitute for volume and issue.

  9. To go from citation to journal issue, one could pull up all the issues for the year and count, but that does sound rather inefficient.

  10. Mike: Sure, I appreciate RDF/OWL don’t solve your main problem, just wanted to address some of your sub-problems!

  11. Pingback: Datos bibliográficos, parte 1: MARC y su vil progenie | BaDoc

  12. Douglas Campbell

    Looking at the background discussions, the DC Citation Guidelines were developed in response to the question: How do I quickly add the journal details into an article’s DC record? It seems no one has ever actually said to DC that they need to encode a full journal article citation – maybe you should?

    Since those guidelines came out, a couple of schemas have come out that play nicely with DC – Bibliographic Ontology http://bibliontology.com/ and PRISM http://www.prismstandard.org/ – they might do the trick?

  13. Pingback: PSNC Digital Libraries Team » Dublin Core’s dirty little secret

  14. Sorry… I’m rather late to this.

    It’s very funny… but I think you are confused. (Being confused about DC is fine by the way… most people are (including me)).

    DC has evolved to be used as an RDF vocabulary.

    It didn’t start out that way of course… because the original 15 ‘elements’ pre-dated RDF. It started out as a set of ‘attributes’ to be used in the HTML meta tag, flirted briefly with XML (not least in the form of the OAI-PMH) and finally emerged into the butterfly of an RDF vocabulary.

    I use the word ‘butterfly’ loosely.

    To complain that DC doesn’t work as an XML language is like complaining that concrete makes bad cakes. You’re right… but so what!?

    Part of the reason you are confused is because DCMI itself is confused. The long history of DC has left a lot of people with differing views about where DC sits in that HTML/XML/RDF spectrum. Indeed, many people ‘inside’ the DC camp consider that DC should function across the whole piste. The trouble is, in trying to do so DC becomes jack of all trades, master of none.

    My view (a view that is shared by a few others but that is also violently disagreed with by many others) is that DC now has to be viewed pretty much solely as an RDF vocabulary. The properties are now declared using RDFS for example. If viewed in that way… many of your complaints above disappear. Sure, DC doesn’t have all the properties necessary to capture a full citation. So what? It was never intended to. The whole point of RDF is that others can come up with such a set of properties and use them inter-mixed with DC (or on their own) as necessary.

    Small pieces loosely joined and all that.

    All IMHO of course.

    (BTW, I hate bibliographicCitation as well).

  15. I don’t know, Andy. This “it’s just for RDF” thing sounds like post-hoc pleading to me. There is literally nothing about RDF on the Dublin Core home page http://dublincore.org/ — if that really was its whole raison d’etre, wouldn’t you expect to see it at least mentioned?

    And let’s not forget that even if you leap into the RDF swamp and model stuff like journal title as the title of a separate “The Journal” object that is linked to the article using isPartOf, that still doesn’t get you anywhere near everything you need. Even a fully RDF-bedrangled bibliographic reference would be missing such core information as the volume, issue and page-range.

    All of this comes from the horrible tendency of library scientists such as your good self and, well, me, to want to model everything. This always, always results in confusion, yet we never seem to learn: OpenURL 1.0, RDA, FRBR, Dublin Core/RDF scholarly references … All that infrastructure, all that learning curve, and still we can’t do the trivial thing that the RIS format has been happily doing since the dawn of time:

    TY  - JOUR
    AU  - Taylor, Michael P.
    AU  - Naish, Darren
    PY  - 2007
    TI  - An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England
    JO  - Palaeontology
    VL  - 50
    IS  - 6
    SP  - 1547
    EP  - 1564
    ID  - doi:10.1111/j.1475-4983.2007.00728.x
    ER  -
    

    There — was that really so hard? I’m not saying this is a good format, but it does at least allow you to Say What You Mean, Simply And Directly.

  16. Pingback: Semantic mapping is hard | The Reinvigorated Programmer

  17. Pingback: links for 2010-10-29 « sonofbluerobot

  18. Pingback: spitting my tea and thinking of Grant Campbell. | librarian @large

  19. Best written, most depressing article I’ve read about libraries and metadata, sigh!

  20. Thanks for those, I guess, kind words :-)

  21. XML is a great concept and integrations are a doable rational sell.

    It just barely holds up afterwards. Choices are involved, non-obvious and meaning appears at runtime, not notation time. Preparing information for simple selection, by agreeing on field types and names, ehm, well, – you have to know why you want to select the field.
