Semantic mapping is hard

A while back, I wrote about the MARC format that is widely used in libraries to represent bibliographic data, and the much simpler Dublin Core set of 15 data elements (creator, title, date, etc.) that can also be used to describe documents — although, as it turns out, inadequately.

There is an official MARC to Dublin Core Crosswalk (i.e. a mapping from MARC elements to corresponding Dublin Core elements) developed and maintained by the Library of Congress.  Today I learned, from a CODE4LIB mailing-list message, that the crosswalk does not map any MARC tag to the DC Creator element.  “Creator” is what the Dublin Core set calls the author; so if you have a MARC record describing The Lord of the Rings, and translate it to Dublin Core using LC’s official mapping, the resulting record will not tell you that J. R. R. Tolkien is the author.

How is this possible, you may ask?  In a followup CODE4LIB message, uber-librarian Karen Coyle speculates:

I don’t actually know why, but I can imagine a plausible answer: the MARC record does not distinguish between contributors and creators sufficiently well to separate out the x00 fields between them. Either everyone is a creator, or everyone is a contributor, or the main entry (100) is treated as a creator and everyone else a contributor. Whichever it is, some portion of the mapping will be wrong. What would be interesting would be a study that would show the relative ratios of right to wrong in the three (or more?) scenarios.

And this in a nutshell is the whole problem of semantic mapping.  It’s hard.  Semantics are slippery.  We may think we know what we mean by “author”, but when you try to codify it precisely enough for a computer to automatically translate the “authors” expressed in one format into another format, it turns out that the concept is complex.  So complex that the RDA specification — the current set of rules describing how bibliographic data should be recorded for computers — is 1640 pages long.  [Special bonus stupidity: since I wrote the article about RDA, the specification is no longer freely available, so the correct way of recording an author is now a secret unless you want to stump up $195.  True.]
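To make Coyle’s three scenarios concrete, here is a minimal Python sketch.  It assumes a heavily simplified record, just a list of (tag, name) pairs, and it assumes that “the x00 fields” means the 100 and 700 personal-name fields.  The function names and the sample names are hypothetical, made up for illustration; none of this comes from the LC crosswalk or from any real MARC library.

```python
# Hypothetical, heavily simplified MARC record: a list of (tag, name) pairs.
# Real MARC fields also carry indicators and subfields, ignored here.
Record = list[tuple[str, str]]
DublinCore = dict[str, list[str]]

# Assumption: only the 100 (main entry) and 700 (added entry) name fields matter.
NAME_TAGS = {"100", "700"}


def all_creators(record: Record) -> DublinCore:
    """Scenario 1: everyone named in the record is a creator."""
    names = [name for tag, name in record if tag in NAME_TAGS]
    return {"creator": names, "contributor": []}


def all_contributors(record: Record) -> DublinCore:
    """Scenario 2: everyone named in the record is a contributor."""
    names = [name for tag, name in record if tag in NAME_TAGS]
    return {"creator": [], "contributor": names}


def main_entry_is_creator(record: Record) -> DublinCore:
    """Scenario 3: the 100 field is the creator, everyone else a contributor."""
    return {
        "creator": [name for tag, name in record if tag == "100"],
        "contributor": [name for tag, name in record if tag == "700"],
    }


# Made-up record: two co-authors and a cover artist.
sample: Record = [
    ("100", "Author, First"),
    ("700", "Author, Second"),
    ("700", "Artist, Cover"),
]

for strategy in (all_creators, all_contributors, main_entry_is_creator):
    print(strategy.__name__, strategy(sample))
```

Each function is trivial to write, and each gets this record wrong in its own way: the first promotes the cover artist to creator, the second demotes both authors, and the third demotes the second author.  Nothing in the simplified record tells the code which of the two 700 entries is a co-author.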

Now here’s the thing: people tend to talk about mapping data back and forth between MARC and XML formats (such as XML-encoded Dublin Core) as though it’s a trivial thing.  “What the heck”, they say, “We have the MARC::Record library for reading the MARC, and the XML::Writer library for writing the XML, so How Hard Can It Be?”  All you need to do is write a little glue layer, right?  while (get a marc record) { write an XML record }  Bam, done.

Wrong wrong wrong wrong wrong!  That’s dealing with the syntactic differences between MARC and DC-XML.  But that was never the problem.

The problem has always been to do with meaning.  It’s slippery stuff.
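For what it’s worth, here is roughly what that glue layer looks like, sketched in Python with the standard library rather than the Perl modules mentioned above.  The record format (a list of (tag, value) pairs), the wrapper element name “record”, the function name and the one-line mapping table are all assumptions made for illustration.  The table deliberately contains no entry for the 100 field, mirroring the crosswalk’s lack of a Creator mapping.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping table. All the hard decisions live here, not in the loop.
# There is deliberately no entry for tag 100, so no dc:creator is ever written.
TAG_TO_DC = {"245": "title"}


def marc_to_dc_xml(record):
    """Serialise a simplified record (list of (tag, value) pairs) as DC-XML."""
    root = ET.Element("record")  # assumed wrapper element
    for tag, value in record:
        dc_name = TAG_TO_DC.get(tag)
        if dc_name is None:
            continue  # unmapped fields are silently dropped
        child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


records = [
    [("100", "Tolkien, J. R. R."), ("245", "The Lord of the Rings")],
]

# The "while (get a marc record) { write an XML record }" part.
for record in records:
    print(marc_to_dc_xml(record))
```

The loop and the serialisation really are the easy part.  The output is perfectly well-formed XML, and it still does not mention Tolkien, because deciding what goes into TAG_TO_DC is the semantic problem, and no XML library can make that decision for you.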

Douglas Hofstadter (best known for the classic book Gödel, Escher, Bach) wrote an entire volume, which I am keen to read, on the problems of translation.  It’s called Le Ton Beau de Marot, and even the title itself is a multilingual pun.  My memory is playing tricks on me here, because although I’ve never read that, I know I’ve read something, somewhere, by Hofstadter about this particular pun; maybe it was mentioned in GEB?  He approaches the problem of translation using a different example in his fascinating anthology Metamagical Themas, on pages 23-24:

I wonder what literalists […] would suggest as the proper translation of the title of the book All the President’s Men (a book about the downfall of President Nixon, a downfall that none of the people around him could prevent). Would they say that Tous les hommes du Président fills the bill admirably? Back-translated rather literally, it means “All the men of the President”. It completely lacks the allusion (the reference by similarity of form) to the nursery rhyme “Humpty Dumpty”. Is that dispensable? In my opinion, hardly. To me, the essence of the title resides in that allusion. To lose that allusion is to deflate the title totally.

Of course, what do I mean by “that allusion”? Do I wish the French title to contain, somehow, an allusion to an English nursery rhyme? That would be rather pointless. Well, then, do I want the French title to allude to the French version of “Humpty Dumpty”? It all depends on how well known it is. But given that Humpty Dumpty is practically an unknown figure to French-speaking people, it seems that something else is wanted. Any old French nursery rhyme? Obviously not. The critical allusion is to the lines:

All the King’s horses / And all the King’s men / Couldn’t put Humpty together again.

Are there — anywhere in French literature — lines with a similar import? If not, how about in French popular songs? In French proverbs? Fairy Tales?

This is tough stuff; and I want to suggest that distinguishing between MARC’s and Dublin Core’s concepts of authors and contributors is scarcely less so.  And of course, author/contributor is just the tip of the iceberg.  What can we usefully say about gathering the various relevant MARC fields into the Dublin Core field “extent”?  How about “relation”?  How about “resource type”?  Ugh.  These are deep waters.

And this is why library technologists are completely missing the point when they argue — as they so often do — that all their problems would be solved if only they could get rid of all those antiquated MARC records and replace them with nice, shiny XML.  The MARC format was never the issue.  The issue is semantics, and no amount of XML tools is going to solve that for you.

Cataloguing is hard.  If you still doubt it, check out yet another recent CODE4LIB message, this one by cataloguer Kelly McGrath on the subtleties of MARC’s notion of “geographical subdivision” and on subject headings where not all components are established.  I won’t even try to summarise it here: it’s more than 2000 words long.

There is a great quote somewhere by (I am pretty sure) Joel Spolsky, which I wanted to use to close this article.  Unfortunately, I can’t find it.  It was something along the following lines: People have a tendency to abstract a problem until it’s very simple, then solve the very simple part, then sit around patting each other on the back, all the while leaving the hard part unsolved, and indeed unaddressed.  (The uncritical adoption of XML is often a sign of this syndrome.)

If anyone knows the quote I am thinking of, please mention it in the comments!

7 responses to “Semantic mapping is hard”

  1. Great article. (Of course, this relates back to AI as well, talking about underestimating difficulties of problems.)

    I don’t know if this is exactly the quote you’re thinking of, but Joel did write an article about Leaky Abstractions:

    http://www.joelonsoftware.com/articles/LeakyAbstractions.html

    including this law:

    All non-trivial abstractions, to some degree, are leaky.

  2. “Le Ton Beau de Marot” could be a pun in French, but I didn’t know it was multilingual. It literally means “the fine sound of Marot”, but is phonetically identical to “Le Tombeau de Marot”, which means “Marot’s tomb” – as in “Le Tombeau de Couperin”, by Maurice Ravel.

    Ravel also set the French Fairy Stories of Mother Goose to music (in Ma mère l’Oye) and geese feature prominently in both English and French nursery rhymes. This in turn has given rise to various spurious publications in the past tracing the origin of the Mother Goose stories, some very much tongue-in-cheek, but among which can be found such multilingual delights as: “Goussets, goussets, Gandhara, oui châles à oindre . . ”

  3. Not wishing to hijack this thread, but if you like this sort of stuff (and I do) check out Kevin Rice’s “Anguish Languish” site: http://www.justanyone.com/allanguish.html

  4. So today I learned (I’m a newb) that modernization of important data sets is nearly impossible and that all abstractions are leaky. If this is the eventual destination for coders farther along, the journey seems like it’s gonna be rough.

  5. Adam,

    You have taken the first step into a larger world. But let us say it’s tough rather than rough. That needn’t be discouraging — after all, what’s worth doing that isn’t tough? Writing a program, learning to play an instrument, building a marriage, getting fit, losing weight, writing a blog, raising children … Life is challenges. Accepting that makes them easier to meet.

  6. Clearly this is difficult, but “the relative ratios of right to wrong” might be a way to make useful, if still imperfect, progress.
    I recently encountered Intuition & Data Driven Machine Learning on igvita.com, which says remarkably simple learning models with really large data sets are the basis of Google Translate. That’s not the same as a professional translator, and this wouldn’t yield the quality of a professional librarian, but it might help.

    There is the problem of the lifetime of digital formats to consider. I’d guess that XML data is much more common than MARC data, and so I’d suppose that software to use it will more likely be ported to newer systems as a result. As if to prove its point about this sort of thing, The Dead Media Project seems to be off-line, and was last time I checked as well.

  7. Your note on translation reminded me of the Japanese translator in David Lodge’s hilarious Small World, an Arthurian saga set in the world of the MLA. If I remember correctly, All’s Well That Ends Well turned into The Oar Accustomed to the Water in Japanese. The translator was primarily translating the works of a 1950s/1960s English “angry young man” author and always asking questions about things that would be painfully obvious to anyone living in England, but obscure in Japan. (e.g. What is jam-butty?)
