9 September 2010
I’ve been debating some issues on metadata, authorship and publishing on Twitter today, and it seems the right thing to do is to clarify my thoughts by noting them down somewhere.
First of all, this is all mostly based on my experience, which, as everybody knows, is an extremely unreliable way of learning, full of magical thinking, superstitions and misunderstandings. Also, these are half-random thoughts, in no particular order. Most of it is likely to make little to no sense.
With that caveat…
The first issue is that there’s metadata and then there’s metadata. On the one hand you have metadata on the work as a whole, and on the other you have in-content metadata, which is a form of structural markup. Those two groups of metadata require two different approaches in terms of authorship.
The second issue is that any metadata left to the publisher is likely to be outsourced to an underpaid data-centre in India, or left to an unpaid intern or an overworked librarian/information architect, and to be full of errors.
The third issue is that any metadata the publisher leaves to others (like libraries or distributors) is likely to be outsourced to an underpaid data-centre in India, or left to an unpaid intern or an overworked librarian/information architect, and to be full of errors.
All joking aside, my argument is that while metadata on the work level can be authored by anybody integral to the production workflow (i.e. anybody on the publisher level), in-content metadata has to be done by the author.
In-content, structural markup metadata is a way of exposing the author’s understanding of the text via commonly recognised conventions. Marking up the references, or which piece of dialogue belongs to whom, is no different from the markup that indicates where a chapter begins or ends, or which part is a header, or which part is a blockquote.
One hypothetical example would be marking up the text of a novel so that every piece of dialogue has a known speaker who is identified via some sort of markup convention.[1] A ‘specialist’ in metadata who is given the task of marking this up (AKA The Intern) would have no understanding of when the author wants the identity of the speaker to be ambiguous.
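A minimal sketch of what such a convention could look like, using only the Python standard library. The convention itself is an assumption made up for illustration: the `dialogue` class, the `data-speaker` attribute and the sample sentences are all hypothetical. The point it tries to show is that leaving out the speaker is a decision only the author can make, and that the markup can record it as deliberate.

```python
from html.parser import HTMLParser

# Hypothetical convention: <q class="dialogue" data-speaker="...">.
# A missing data-speaker means the author left the speaker ambiguous on purpose.
SAMPLE = """
<p><q class="dialogue" data-speaker="anna">Did you lock the door?</q></p>
<p><q class="dialogue" data-speaker="ben">I thought you did.</q></p>
<p><q class="dialogue">Someone did.</q></p>
"""


class DialogueParser(HTMLParser):
    """Collect (speaker, text) pairs from elements marked as dialogue."""

    def __init__(self):
        super().__init__()
        self.lines = []       # list of (speaker or None, text)
        self._speaker = None
        self._buffer = None   # None while outside a dialogue element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "q" and "dialogue" in (attrs.get("class") or "").split():
            self._speaker = attrs.get("data-speaker")
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "q" and self._buffer is not None:
            self.lines.append((self._speaker, "".join(self._buffer).strip()))
            self._buffer = None


def dialogue_by(speaker, lines):
    """Return the lines spoken by `speaker`; None selects the ambiguous ones."""
    return [text for who, text in lines if who == speaker]


parser = DialogueParser()
parser.feed(SAMPLE)
parser.close()

print(dialogue_by("anna", parser.lines))
print(dialogue_by(None, parser.lines))
```

Somebody other than the author filling in a `data-speaker` value for the third line would be replacing the ambiguity with a guess, which is exactly the problem described above.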
So, to summarise: Metadata is a way of exposing the author’s understanding of the text via conventions and if the author isn’t in charge of metadata, every instance of ambiguity will be replaced by the publisher’s, editor’s or intern’s interpretation. IMHO, obviously.
Which brings me to the issue of metadata formats. In-content metadata formats can go two ways. One way is for them to be geared towards easy integration in text-based markup formats like Markdown or reStructuredText[2], easy hand-coding, or easy integration in the thoroughly dumb template systems that dominate the web.
The second way is to integrate them into GUI software applications, word processors, outliners, etc.
The first way needs simple formats that rely on markup conventions. Microformats are a pretty exact match to the problem.
The second way needs a highly structured format that is transformable between formats, easily extracted, and backed by widely deployed, well-tested tool-chains. RDFa is a pretty exact match to that problem.
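To make the contrast a little more concrete, here is a rough sketch that states the same fact both ways and pulls it back out with the Python standard library. The `vcard` and `fn` class names come from the hCard microformat, and the RDFa fragment uses RDFa 1.0 style attributes with the Dublin Core `creator` property. Everything else is an assumption for illustration: the sample sentence, the `#quote-1` identifier and the `AttributeText` helper are made up, and the extraction is nowhere near a conforming RDFa processor (it ignores scoping and only copes with well-formed fragments like these).

```python
from html.parser import HTMLParser

# Microformat style: agreed class names ("vcard", "fn") carry the meaning.
MICROFORMAT = '<p class="vcard">Quoted from <span class="fn">Jane Doe</span>.</p>'

# RDFa style: the vocabulary is declared with a prefix, and the statement
# names its subject ("about") and its predicate ("property") explicitly.
RDFA = (
    '<p xmlns:dc="http://purl.org/dc/elements/1.1/" about="#quote-1">'
    'Quoted from <span property="dc:creator">Jane Doe</span>.</p>'
)


class AttributeText(HTMLParser):
    """Collect (attribute value, text) for each element carrying `attribute`.

    Hypothetical helper: assumes well-formed markup without void elements.
    """

    def __init__(self, attribute):
        super().__init__()
        self.attribute = attribute
        self.found = []
        self._open = []  # stack of (attribute value or None, text parts)

    def handle_starttag(self, tag, attrs):
        self._open.append((dict(attrs).get(self.attribute), []))

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for _, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        value, parts = self._open.pop()
        if value is not None:
            self.found.append((value, "".join(parts)))


def collect(markup, attribute):
    parser = AttributeText(attribute)
    parser.feed(markup)
    parser.close()
    return parser.found


# Microformat: look for the agreed class name, nothing more.
names = [text for value, text in collect(MICROFORMAT, "class")
         if "fn" in value.split()]

# RDFa: expand the prefixed property into a full predicate URI and
# emit (subject, predicate, object) triples.
namespace = collect(RDFA, "xmlns:dc")[0][0]
subject = collect(RDFA, "about")[0][0]
triples = [(subject, namespace + prop.split(":", 1)[1], text)
           for prop, text in collect(RDFA, "property")]

print(names)
print(triples)
```

The microformat version is easy to type by hand and easy to drop into a template, but its meaning lives entirely in the convention. The RDFa version is more verbose to write, yet it yields a statement that tools can transform or merge without knowing the convention in advance.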
[1] This could be used later on in various ways. Search that is scoped to just dialogue, or to a specific character’s dialogue. Aggregating the dialogue of one character over a series. Cues for reading software to change gender or tone. Or simply to allow the reader to find out who the speaker is for sure.
[2] reStructuredText, with its Roles and Interpreted Text, is a much more capable and extensible markup language than many give it credit for.
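As a rough illustration of the roles and interpreted text mentioned in the second note, the sketch below declares two custom roles with the standard `.. role::` directive and marks up dialogue with them. The role names `anna` and `unattributed` and the sample sentences are hypothetical, and the regular expression is a simplified stand-in for a real reStructuredText parser such as Docutils; it only recognises the `:role:` prefix form.

```python
import re

# reStructuredText source. ".. role::" declares a custom role; the role
# names "anna" and "unattributed" are hypothetical examples.
SOURCE = '''\
.. role:: anna
.. role:: unattributed

:anna:`"Did you lock the door?"` she asked.

:unattributed:`"Someone did."`
'''

# Interpreted text with an explicit role prefix looks like :role:`text`.
# Simplified: real role names and inline markup rules are more permissive.
ROLE_PATTERN = re.compile(r":([A-Za-z][\w-]*):`([^`]+)`")


def interpreted_text(source):
    """Return (role, text) pairs for every prefixed interpreted-text span."""
    return ROLE_PATTERN.findall(source)


for role, text in interpreted_text(SOURCE):
    print(role, text)
```

The markup stays readable as plain text and is easy to type by hand, which is what the first kind of format needs.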