Category Archives: Formats

All the cool kids are using JSON instead of XML

My colleague Kurt Nordstrom mentioned a few days ago that there was this period of time when XML-based everything was the future. It was going to solve all our problems. Let’s use XML for everything!

Now, of course, we’ve all seen past the crazily naive idea that anything as mundane as the XML meta-format could make any real difference to anything, since it’s just a solution to the easy part of every task (syntax) and leaves the hard part to be done (semantics). No, we’re much more sophisticated than that now. Now we realise that JSON is the metaformat that will make everything suddenly work.

Continue reading

When good standards go bad

Years ago, I was on the editorial board for versions 1.1 and 1.2 of an informal standard called SRU. It defined a way to do IR queries over HTTP with XML payloads: you’d send a URL, and it would send back an XML document describing the search result (hit count, that kind of thing) and containing payload records.
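To make that request/response shape concrete, here is a minimal sketch in Python. The base URL, the sample response and the helper names `build_sru_url` and `parse_sru_response` are all made up for illustration; the parameter and element names follow SRU 1.1 as best I recall it, so check them against the specification before relying on them.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace used by SRU 1.1 responses (assumed from memory of the spec).
SRW_NS = {"srw": "http://www.loc.gov/zing/srw/"}


def build_sru_url(base_url, cql_query, max_records=10):
    """Build a searchRetrieve URL for a hypothetical SRU server."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


# An invented response: 42 hits in total, two records returned in this page.
SAMPLE_RESPONSE = """\
<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:version>1.1</srw:version>
  <srw:numberOfRecords>42</srw:numberOfRecords>
  <srw:records>
    <srw:record>
      <srw:recordPacking>xml</srw:recordPacking>
      <srw:recordData><title>First hit</title></srw:recordData>
    </srw:record>
    <srw:record>
      <srw:recordPacking>xml</srw:recordPacking>
      <srw:recordData><title>Second hit</title></srw:recordData>
    </srw:record>
  </srw:records>
</srw:searchRetrieveResponse>
"""


def parse_sru_response(xml_text):
    """Return (hit_count, payload_elements) from an SRU response document."""
    root = ET.fromstring(xml_text)
    hit_count = int(root.findtext("srw:numberOfRecords", namespaces=SRW_NS))
    payloads = []
    # Each recordData element wraps one embedded payload record.
    for data in root.iterfind("srw:records/srw:record/srw:recordData", SRW_NS):
        payloads.extend(list(data))
    return hit_count, payloads
```

Calling `build_sru_url("http://sru.example.org/db", "dinosaur")` gives the kind of URL a client would send, and `parse_sru_response(SAMPLE_RESPONSE)` pulls out the hit count along with the embedded payload elements.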


Since the payload records themselves were also in XML, it was often convenient to just embed them right in the response, where they could be extracted by XSLT or similar.

Continue reading

Shift-Ctrl-V: paste without formatting

This is the single most useful keyboard shortcut I’ve ever found, but it seems no-one knows about it; at least, if they do, they didn’t tell me at any point in the first 33 years of my computer career.

If you copy (Ctrl-C) from a web-page, or a Word/OpenOffice document, or any other source that has formatting, then when you paste the copied material into a document that supports formatting (such as another Word/OpenOffice document or a WordPress post), the formatting — or at least, a broken attempt at it — will be pasted in. That usually means that you get fonts you didn’t want, that are inconsistent with the rest of the document. I imagine this is the cause of most of the horrible Font Soup you see in too many MS-Word documents.

This is a proper pain if, to pick a purely hypothetical example, you’re putting together a book based on your own blog-posts.

So instead, use paste-without-formatting. In OpenOffice (and its derivatives: NeoOffice, LibreOffice, etc.) this is on the Edit menu as Paste Special…, but you can use the shortcut Shift-Ctrl-V (or Apple-Ctrl-V on a Mac) to invoke it. It pops up a small menu of paste styles: double-click on Unformatted Text and you’re done.

It turns out that although there is no corresponding menu-item in the WordPress composer, the keyboard shortcut works — in fact, it works even better than it does in OpenOffice, as it doesn’t bother with the menu and just does the paste as unformatted text.

Shift-Ctrl-V: it’s your friend.

Misunderstanding natural monopoly

Seth Godin’s blog is a great source of pithy, wise, generous insights. I read pretty much every entry, and often find myself going “Huh! I’d never thought of it that way”. But as usual, I’m only going to blog about him when I disagree.


Continue reading

Semantic mapping is hard

A while back, I wrote about the MARC format that is widely used in libraries to represent bibliographic data, and the much simpler Dublin Core set of 15 data elements (creator, title, date, etc.) that can also be used to describe documents — although, as it turns out, inadequately.

There is an official MARC to Dublin Core Crosswalk (i.e. a mapping from MARC elements to corresponding Dublin Core elements) developed and maintained by the Library of Congress. Today I learned, from a CODE4LIB mailing-list message, that the crosswalk does not map any MARC tag to the DC Creator element. “Creator” is what the Dublin Core set calls the author; so if you have a MARC record describing The Lord of the Rings, and translate it to Dublin Core using LC’s official mapping, the resulting record will not tell you that J. R. R. Tolkien is the author.
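A toy sketch in Python shows how a crosswalk with no route to Creator loses the author. The three-entry table below is invented for illustration and is not the Library of Congress table; in particular, sending MARC tag 100 (the personal-name main entry) to "contributor" is an assumption made for the example, and `marc_to_dc` is a hypothetical helper.

```python
# Illustrative only: NOT the official LC crosswalk.
TOY_CROSSWALK = {
    "100": "contributor",  # personal-name main entry, i.e. the author (assumed target)
    "245": "title",        # title statement
    "260": "publisher",    # publication information
}


def marc_to_dc(marc_fields):
    """Translate (tag, value) pairs into a dict of Dublin Core elements."""
    dc = {}
    for tag, value in marc_fields:
        element = TOY_CROSSWALK.get(tag)
        if element is not None:  # unmapped tags are silently dropped
            dc.setdefault(element, []).append(value)
    return dc


record = [
    ("100", "Tolkien, J. R. R."),
    ("245", "The Lord of the Rings"),
]
dc_record = marc_to_dc(record)
```

Because no entry in the table has "creator" as its target, `dc_record` never gains a creator element, whatever the MARC record contains.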

Continue reading

Bibliographic data, part 3: Has anyone, anywhere, ever read the whole of the RDA specification?

[This article concludes what’s turned out to be a three-part series.  You may wish to read part 1 and part 2 before this one.]

I only meant to write two articles on the difficulty of representing a journal article reference in a standard XML format.  But an epilogue is warranted because, well, surely there has to be a standard way to do this.

Well, let’s step back a bit from the detail of XML representation.  Let’s just look at cataloguing rules.

Continue reading

Bibliographic data, part 2: Dublin Core’s dirty little secret

[This is part two in a series — you should read part 1 first for context and then you might go on to part 3.]

The Dublin Core — metadata made dumb

Just when librarians were in despair of ever getting their data out to the world in a form it could understand, along came the Dublin Core (DC for short) — a simple set of fifteen metadata elements (contributor, coverage, creator, date, description, format, identifier, language, publisher, relation, rights, source, subject, title, and type) that could be used to describe “document-like objects” such as books, journal articles and web pages.

Everyone in the library world got really excited about the Dublin Core for about three weeks in 1999, before realising that you can’t actually do anything with those elements beyond expressing author (called “creator”), title and date. Everything else was too vague to be of any use: coverage, anyone? Relation? Format?

Continue reading

Bibliographic data, part 1: MARC and its vile progeny

[This is part one of a three-part series.  When you’re done here, read on to part 2 and part 3.]

My job is the subfield of programming that relates to searching, retrieval and metadata, especially as it relates to libraries. That means that what I deal with is mostly bibliographic metadata: sets of fields that describe books or journal articles. For example, the federated search system that we provide, while not in any way limited to searching for and presenting results of this kind, has tended to be used primarily in the library domain, so I spend a lot of my time dealing with bibliographic data.

It’s a jungle out there. The dominant electronic format for bibliographic information is, still, by far, the ancient and faintly comical MARC (MAchine-Readable Cataloging) format, or rather, the MARC family of similar but subtly incompatible formats. MARC originated in the 1960s at the Library of Congress, literally as a way to encode the information on physical catalogue cards.

Continue reading

The second stupidest thing in the world

If DVD region locking is the third stupidest thing in the world, then the second stupidest thing is …

The wildly differing submission formats of academic journals.

Continue reading