Thursday, July 07, 2005

Metadata

Why does metadata frustrate me?

I've looked at dozens of metadata formats over time, most associated with online learning. And in the field of online learning, at least, the metadata developed simply does not address the needs of a data environment.

Let me explain what I mean. In database management there is a concept known as normalization. Expressed most simply, the idea is that in a fully normalized database, no piece of information is ever stored in more than one location.

In database design, there are degrees of normalization. The reason for this is that normalization of a database involves a trade-off with speed and clarity. For example, in a fully normalized database, the string 'Alberta' would only be stored once, and all instances of 'Alberta' in an address would be a pointer to this string in the 'provinces' table. But this makes data harder to understand, and involves an extra lookup each time an address is displayed.
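To make the trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names, and the second (Calgary) address, are assumptions made up for illustration; only the 'Alberta' example and the 'provinces' table come from the text above.

```python
import sqlite3

# In-memory database; the schema is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE provinces (
        id   INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL
    );
    CREATE TABLE addresses (
        id          INTEGER PRIMARY KEY,
        street      TEXT NOT NULL,
        city        TEXT NOT NULL,
        province_id INTEGER NOT NULL REFERENCES provinces(id)
    );
""")

# The string 'Alberta' is stored exactly once.
conn.execute("INSERT INTO provinces (id, name) VALUES (1, 'Alberta')")

# Each address holds a pointer (province_id), not the string itself.
conn.executemany(
    "INSERT INTO addresses (street, city, province_id) VALUES (?, ?, ?)",
    [
        ("11829-118 Street", "Edmonton", 1),
        ("100 Main Street", "Calgary", 1),  # made-up second address
    ],
)


def display_addresses(conn):
    """Build display strings; note the extra lookup (JOIN) this requires."""
    rows = conn.execute(
        "SELECT a.street, a.city, p.name "
        "FROM addresses a JOIN provinces p ON p.id = a.province_id "
        "ORDER BY a.id"
    )
    return [", ".join(row) for row in rows]


addresses = display_addresses(conn)
```

The cost shows up in display_addresses: every display needs a join, and the addresses table on its own no longer reads as plain addresses.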

Still, some degree of normalization is to be desired. For example, the entry '11829-118 Street, Edmonton, Alberta T3C 2P4' should never be entered in a database more than once. This is a highly specific entry, and one that is (typically) liable to change or to be mistyped. When something like this is entered more than once, the reliability of the database information decreases dramatically.

To turn now to metadata, normalization (in my view) amounts to this (Stephen's first rule of metadata): metadata for a given entity should never be stored in more than one place.

If we have, for example, metadata about a given person (say, me), then it should be stored in one and only one location. (That does not mean that it cannot be aggregated or mirrored, but it does mean that there is one and only one location that would constitute the source of information about this person, and that aggregators and mirrors would update on a regular basis from this source.)

The reason for this should be clear. With each additional location for original metadata about a person, the probability of error in that metadata increases dramatically. If the metadata changes (say, the person moves, or changes email address) then each instance of original metadata means an additional instance of data that must be updated.

Following from this first principle is a natural second principle: metadata for a given entity should not contain metadata for a second entity. The reason for this is that, with few exceptions, entities do not exist in a 1:1 relationship with each other. For example, an author writes many papers. A paper may have many authors.

If you accept these two principles, and if you want to show a relationship between entities (say, author of a paper) then you are committed to a mechanism whereby the metadata for one entity needs to be able to refer to the metadata for another entity. There are mechanisms for this, for example, the rdf:resource attribute.
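As a rough sketch of what the two principles imply for a data model, the Python below keeps person records and paper records apart and lets a paper hold only references to its creators. The class names, the example.org addresses, the paper titles, the second author and the email values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    uri: str    # the one location that is the source for this person
    name: str
    email: str


@dataclass
class Paper:
    uri: str
    title: str
    creators: list = field(default_factory=list)  # URIs only, never names


# Each person is described exactly once (first rule).
people = {
    "http://www.downes.ca/my.foaf": Person(
        "http://www.downes.ca/my.foaf", "Stephen Downes", "old@example.org"),
    "http://example.org/people/ann.foaf": Person(
        "http://example.org/people/ann.foaf", "Ann Author", "ann@example.org"),
}

# Papers carry no person metadata, only pointers to it (second rule).
papers = {
    "http://example.org/papers/1": Paper(
        "http://example.org/papers/1", "Paper One",
        ["http://www.downes.ca/my.foaf", "http://example.org/people/ann.foaf"]),
    "http://example.org/papers/2": Paper(
        "http://example.org/papers/2", "Paper Two",
        ["http://www.downes.ca/my.foaf"]),
}


def creator_emails(paper_uri):
    """Follow each creator reference to the person record."""
    return [people[uri].email for uri in papers[paper_uri].creators]


def papers_by(person_uri):
    """The many-to-many relationship, read from the other direction."""
    return [p.title for p in papers.values() if person_uri in p.creators]


# A change of email address is made in one place only.
people["http://www.downes.ca/my.foaf"].email = "new@example.org"
```

Because neither paper stores the email address itself, both see the change the next time the reference is followed.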

But they are almost never used. Indeed, in learning metadata, there is generally no means of referring to one entity from within the metadata of another entity.

What do they do instead?

Some of the time, they include other metadata in the resource metadata. For example, learning object metadata, instead of pointing to contributor metadata, embeds vCard information within the metadata. In all such instances, the learning object metadata should be pointing to person or organizational metadata.

Much of the time, they use bare strings (aka Language Strings) to store metadata. In some cases this is appropriate: the description of a resource within the resource metadata, for example. But to do as RSS and Dublin Core do, to actually use a string to designate the 'creator' of a resource, is to embed external metadata into the current metadata.

The problem with this - and the reason why normalization is so important in database management - is that such metadata constitutes an ambiguous reference. It's not simply that it may falsely describe the properties of the external entity, it's that it may fail to uniquely determine the external entity at all. When the 'Creator' of an object is 'Stephen Downes', do we mean the researcher in Moncton, the restaurant critic in Melbourne or the professor in Utah?

In some cases, such as some more recently proposed metadata, a half-step is taken through the use of an identifier. For example, the creator of a post in a discussion may be identified not only by 'name' but also by 'identifier', where the 'identifier' is "an unambiguous reference to the message Creator within the environment." Better, but by no means perfect.

If the context of application is within a specific environment, then XML is not necessary at all. Put all the data into the environment database, at which point the identifier becomes the primary key in the appropriate database. All identifiers are known to be unique within the database, and hence uniqueness of reference and (by additional lookups) metadata about the identified entity can be obtained.

But what happens when we start thinking about exchanging data between environments? The identifier is at this point no unique reference at all, unless the different databases have some sort of common system of identification. Moreover, merely having the identifier for an entity in an external database offers no additional information about that entity.

In practice, of course, when we use the designation 'identifier' we probably mean something like 'handle' or 'purl' or some two part reference, the first part a reference to the external database in question, and the second to a unique entity within that database (the second part may additionally resolve to a table:key pair, depending on the database).

But in practice, what we want to happen is for this identifier, when dereferenced, to yield metadata about the entity in question. For after all, an identifier is no use if it is not also a pointer of some sort, and in particular, a pointer to information about the entity it identifies. If it doesn't do that, it may as well be a string literal (and would be about as useful).

Now at this juncture, you can either (a) go through a web services search to find out the location and API for the external database in question, or (b) treat the identifier as a URL and access the remote information directly. In my world, what we do is the latter - because it's simple, direct, and intuitive. For after all, if the metadata for each entity has a unique location, then identity and location resolve into the same entity.
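Option (b) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are made up, and the fetch step is a stand-in dictionary so the example runs offline; a real fetch would issue an HTTP request and parse whatever format is returned. The 'city' field simply echoes the 'researcher in Moncton' mentioned earlier.

```python
from urllib.parse import urlparse


def is_dereferenceable(identifier):
    """True if the identifier looks like a web location we could fetch."""
    parts = urlparse(identifier)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def dereference(identifier, fetch):
    """Treat the identifier as a URL and return the metadata found there.

    `fetch` is any callable taking a URL and returning metadata.
    """
    if not is_dereferenceable(identifier):
        # A bare string identifies nothing we can look up.
        raise ValueError(f"not a location: {identifier!r}")
    return fetch(identifier)


# Stand-in for the remote source (hypothetical content).
FAKE_WEB = {
    "http://www.downes.ca/my.foaf": {"name": "Stephen Downes", "city": "Moncton"},
}


def fake_fetch(url):
    return FAKE_WEB[url]


creator = dereference("http://www.downes.ca/my.foaf", fake_fetch)
```

By contrast, is_dereferenceable("Stephen Downes") is false: the string literal names someone, but gives no way to find out which Stephen Downes is meant.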

So we come back to: the external entity is named with an identifier, which is the location of the metadata for the external entity. Which means that (in a core implementation), no additional metadata about the external entity needs to be stored in the metadata of the current entity.

What this does, in practice, is make metadata dead simple.

Take, for example, a discussion post. The metadata describing this post can be reduced to some simple RSS-like metadata, with some simple references to external entities.

A discussion post is, after all, in nature not distinct from a web post. Hence we use:
  • title - name of the post
  • link - location of the post text or file
  • description - summary of the post
Optionally, we also use some Dublin Core RSS extensions:
  • dc:date - date the post was created
  • dc:keywords - keywords
To identify the creator, we refer to the creator's metadata description
  • dc:creator rdf:resource=http://creator.com/my.foaf
And finally, some discussion-specific metadata:
  • disc:replyto rdf:resource=http://some.board.com/previous_post.htm
  • disc:forum rdf:resource=http://some.forum.com/thread
(Strictly speaking, this second element is optional, as the first post in a thread would be a 'replyto' the thread itself, which in turn would be a 'replyto' the forum itself, which would be a root element.)
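The parenthetical above suggests that the forum can be recovered by following 'replyto' references until an entity with no 'replyto' is reached. A minimal Python sketch of that walk follows; the thread and forum addresses are hypothetical, and the lookup is an in-memory dictionary standing in for fetching each entity's metadata.

```python
# Hypothetical metadata store: each entity records only what it replies to.
ENTITIES = {
    "http://myforum.com/posts/123.xml": {"replyto": "http://myforum.com/posts/111.xml"},
    "http://myforum.com/posts/111.xml": {"replyto": "http://myforum.com/threads/7.xml"},
    "http://myforum.com/threads/7.xml": {"replyto": "http://myforum.com/forum.xml"},
    "http://myforum.com/forum.xml": {"replyto": None},  # root element
}


def ancestry(uri, lookup):
    """Return the chain post -> ... -> thread -> forum by following replyto."""
    chain = [uri]
    seen = {uri}
    while True:
        parent = lookup(chain[-1]).get("replyto")
        if parent is None:
            return chain  # reached the root
        if parent in seen:
            raise ValueError(f"replyto cycle at {parent}")
        seen.add(parent)
        chain.append(parent)


chain = ancestry("http://myforum.com/posts/123.xml", ENTITIES.__getitem__)
forum = chain[-1]
```

The walk costs one lookup per level, which is the same speed-for-redundancy trade-off as in normalization; a site that wants the forum in a single step can still carry the optional 'forum' reference.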

And after that, anything else is optional. For example, some sites may want a 'role' element for some of these creators or posts. That's fine, but that would be a locally defined implementation, and not something that characterizes discussions in general.

In other words, every entity in a discussion forum would look like this:

<item>
<title>My first Post</title>
<description>This is my first post</description>
<link>http://myforum.com/posts/123.htm</link>
<dc:creator resource="http://www.downes.ca/my.foaf" />
<dc:date>11 Jun 2005 11:15 ADT</dc:date>
<disc:replyto resource="http://myforum.com/posts/111.xml" />
</item>

Anything over and above this basic formulation is not only needless complexity, it is in addition redundant and hence increases the possibility of error. Moreover, such additional information is likely to be domain-specific and hence would restrict the application of such metadata to a single element.
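To show how little a consumer needs to do with an item like this, here is a hedged Python sketch using the standard library's xml.etree.ElementTree. It assumes the item is well-formed and that the 'dc' and 'disc' prefixes are declared; the 'disc' namespace address below is a placeholder invented for the example, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The item from above, with namespace declarations added so it parses.
# The disc namespace URI is a made-up placeholder.
ITEM_XML = """
<item xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:disc="http://example.org/disc#">
  <title>My first Post</title>
  <description>This is my first post</description>
  <link>http://myforum.com/posts/123.htm</link>
  <dc:creator resource="http://www.downes.ca/my.foaf" />
  <dc:date>11 Jun 2005 11:15 ADT</dc:date>
  <disc:replyto resource="http://myforum.com/posts/111.xml" />
</item>
"""


def split_item(item):
    """Separate an item's own literal values from its pointers elsewhere."""
    literals, references = {}, {}
    for child in item:
        # ElementTree tags look like '{namespace}local'; keep the local part.
        name = child.tag.rsplit("}", 1)[-1]
        target = child.get("resource")
        if target is not None:
            references[name] = target  # metadata lives at that location
        else:
            literals[name] = (child.text or "").strip()
    return literals, references


item = ET.fromstring(ITEM_XML.strip())
literals, references = split_item(item)
```

Everything in literals describes the post itself; everything in references is a location to be dereferenced for metadata about some other entity.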

This, sadly, is not the model undertaken in learning metadata, which is why - no matter what standards are actually approved - it's all going to have to be redone in a few years when something like distributed metadata finally becomes a reality. Not long from now.

4 comments:

  1. "When something like distributed metadata finally becomes a reality"

    Have we not had RDF and OWL since Feb 2004?

  2. Longer, if you consider the lead-in time. But though what I'm talking about is RDF-like, it neither requires nor assumes the rigor of RDF. As for OWL, I find ontologies per se of limited significance.

  3. I love your information on Database Management I bookmarked your blog and will be back soon. If you want, check out my blog on Database Management Exposed - please come by

  4. I really liked the information on your blog about Database Management I have my own Database Management Exposedblog if you would like to come and see what I have on mine

