
5.1 Metadata in RSS 0.9x

As all good tutorials on the subject will tell you, metadata is data about data. In the case of RSS 0.92, this includes the name of the author of the feed, the date the channel was last updated, and so on. In Example 5-1, the bold code is the metadata. You could remove this data, and the feed itself would still both parse and be useful to the reader when displayed as HTML. The metadata is in the background, silent, but meaningful to those who can see it.

Example 5-1. The metadata within an RSS 0.92 feed
<rss version="0.92">
<channel>
  <title>RSS0.92 Example</title> 
  <link>http://www.oreilly.com/example/index.html</link> 
  <description>This is an example RSS0.92 feed</description> 
  <language>en-gb</language> 
  <copyright>Copyright 2002, O'Reilly and Associates.</copyright> 
  <managingEditor>editor@oreilly.com</managingEditor> 
  <webMaster>webmaster@oreilly.com</webMaster> 
  <pubDate>03 Apr 02 1500 GMT</pubDate>
  <lastBuildDate>03 Apr 02 1500 GMT</lastBuildDate>
  <docs>http://backend.userland.com/rss091</docs>
  <skipDays>
    <day>Monday</day>
  </skipDays>
  <skipHours>
    <hour>20</hour>
  </skipHours>
  <cloud domain="http://www.oreilly.com" port="80" path="/RPC2" 
registerProcedure="pleaseNotify" protocol="XML-RPC" />

  <image>
    <title>RSS0.92 Example</title> 
    <url>http://www.oreilly.com/example/images/logo.gif</url> 
    <link>http://www.oreilly.com/example/index.html</link>
    <width>88</width> 
    <height>31</height> 
    <description>The World's Leading Technical Publisher</description>
  </image>
  <textInput>
    <title>Search</title>
    <description>Search the Archives</description>
    <name>query</name>
    <link>http://www.oreilly.com/example/search.cgi</link>
  </textInput>
   
  <item>
    <title>The First Item</title> 
    <link>http://www.oreilly.com/example/001.html</link> 
    <description>This is the first item.</description>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.oreilly.com/001.mp3" length="54321" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">
    Business/Industries/Publishing/Publishers/Nonfiction/</category>
  </item>
   
  <item>
    <title>The Second Item</title> 
    <link>http://www.oreilly.com/example/002.html</link> 
    <description>This is the second item.</description>
    <source url="http://www.anothersite.com/index.xml">Another Site</source>
    <enclosure url="http://www.oreilly.com/002.mp3" length="54321" type="audio/mpeg"/>
    <category domain="http://www.dmoz.org">
    Business/Industries/Publishing/Publishers/Nonfiction/</category>
  </item>
</channel>
</rss>

With this sort of simple metadata, written in the grammar of RSS 0.92's XML format, we are making simple statements. Take the first line of metadata in Example 5-1, the language element:

<channel>
...
<language>en-gb</language>
...
</channel>

Here we see the language element with a value of en-gb. The language element is a subelement of channel, so a simple translation of the XML into English could read, "The object called channel has a subelement called language whose value is en-gb."

This phrase is grammatically and semantically correct, but it lacks a certain poetry. We can rewrite it with something more friendly: "The channel's language is en-gb."

Now that's more like it. We've created a statement of fact from the metadata: "The language of the channel is British English."
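The same translation can be done mechanically. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module on a trimmed-down copy of the channel from Example 5-1. The function name describe_language is our own invention for illustration and is not part of any RSS toolkit.

```python
import xml.etree.ElementTree as ET

# A trimmed-down copy of the channel from Example 5-1.
FEED = """<rss version="0.92">
<channel>
  <title>RSS0.92 Example</title>
  <language>en-gb</language>
</channel>
</rss>"""


def describe_language(feed_xml):
    """Turn the channel's language element into a plain-English statement."""
    root = ET.fromstring(feed_xml)        # the <rss> element
    channel = root.find("channel")        # language sits one level below channel
    language = channel.findtext("language")
    return f"The channel's language is {language}."


print(describe_language(FEED))
```

Note that the program only knows where to look because we told it: the names channel and language are hardcoded by a human who already understands what they mean.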

So far, so easy, you say. Well, you're quite right; metadata is all about making statements. With the simple metadata present in RSS 0.9x, we do it all the time:

  <language>en-gb</language> 
  <copyright>Copyright 2002, O'Reilly and Associates.</copyright> 
  <managingEditor>editor@oreilly.com</managingEditor> 
  <webMaster>webmaster@oreilly.com</webMaster> 
  <pubDate>03 Apr 02 1500 GMT</pubDate>
  <lastBuildDate>03 Apr 02 1500 GMT</lastBuildDate>

From these lines, we see that the feed is in British English, that it is copyright 2002, O'Reilly and Associates, that the managing editor is editor@oreilly.com, and so on.

You will notice, alas, that all is not perfect with this syntax. For example, the managing editor is defined as editor@oreilly.com. To you and me, it is obvious that this is an email address for a person, and we can act accordingly, but to a machine — a search engine, for example — it is a general email address at best and just a string at worst. Either way, no one can tell anything at all about the managing editor. Herein lies a problem.

Let's recap. The simple metadata found in RSS 0.9x makes a simple statement based on its element, the element's value, and the place of the element within the document. We know the language element refers to the channel that is one level above it within the XML document. We also know that in our example the value of language is en-gb, and by understanding what the element and its value mean we can make the statement that the channel is written in British English.

Going back to our childhood grammar classes, we can see that this is a simple subject/predicate/object sentence:

The channel (subject) has the language (predicate) British English (object).

This sort of statement is called a triple. Remember this word; we'll need it later. Now, these simple triples work well for most things within RSS 0.9x, but they somewhat limit us to raw data values: things such as dates and language codes that are unambiguous and easily understood. Triples do not help us one bit when we're talking about abstract concepts, such as subjects, or when we're referring to other entities, such as people.

Plus, and this is key, without human interaction, the combination of an arbitrary element name, value, and position within the document would be meaningless. If we disregard our ability to read English, we find we cannot tell what any of the element names refer to, and we cannot understand their values. As it stands, RSS 0.9x's metadata cannot be understood by machines, and the triple, though elegant, is very limited when you take the human out of the equation. Without machine comprehension, we lose a great deal of potential utility from our RSS feeds.
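To see how little a machine actually gets from these triples, here is a rough sketch in Python that reads a cut-down channel and emits one (subject, predicate, object) tuple per simple element. The helper name channel_triples and the choice to label the subject with the bare string "channel" are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET

# A cut-down channel using elements from Example 5-1.
FEED = """<rss version="0.92">
<channel>
  <language>en-gb</language>
  <managingEditor>editor@oreilly.com</managingEditor>
  <pubDate>03 Apr 02 1500 GMT</pubDate>
  <image>
    <url>http://www.oreilly.com/example/images/logo.gif</url>
  </image>
</channel>
</rss>"""


def channel_triples(feed_xml):
    """Yield a (subject, predicate, object) tuple for each simple channel element."""
    channel = ET.fromstring(feed_xml).find("channel")
    for child in channel:
        # Skip containers such as <image>: they hold child elements, not a plain value.
        if len(child) == 0 and child.text and child.text.strip():
            yield ("channel", child.tag, child.text.strip())


for triple in channel_triples(FEED):
    print(triple)
```

Every part of every tuple is just a string. Nothing in the output says that managingEditor relates to a person, or that en-gb is a language code rather than an arbitrary label; that knowledge still lives only in the reader's head.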

To start rectifying this situation, we need to define exactly what every word in the statement means. To do this, we must introduce the Uniform Resource Identifier (URI).

5.1.1 Using URIs in RSS

The URI is a string of characters that identifies a resource. This resource can be anything that has an identity, whether it is tangible or not: a person, a book, a standard, a web site, a service, an email address, and so on. For example:

mailto:ben@benhammersley.com

The URI for me

http://www.w3.org/1999/xhtml

The URI for the concept of XHTML

pop://pop.example.org

The URI for an example POP mailbox

You'll notice that these look very similar to URLs — the standard hyperlinks. You're right; URLs are a subset of URIs. There are, however, some major differences between the two.

Primarily, even though many URIs are named after, and closely resemble, network-contactable URLs, this does not mean that the resources they identify are retrievable via that network method: a person can be represented by a URI that looks like a URL, but pointing a browser at it will not retrieve the person. A concept — the XML standard, for example — has its own URI that starts with http://, but typing it into your address bar will not download the XML standard into your machine.

A URI simply provides a unique identifier for the resource, whatever it is. Granted, wherever possible, the URI will give you something useful (documentation on the resource, usually) if it is treated like a URL, but this is not in any way necessary.
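The idea that a URI identifies without necessarily retrieving can be sketched in a few lines of Python. The lookup table below reuses the three example URIs given earlier; the table itself and the identify function are hypothetical, standing in for whatever a real application knows about its resources.

```python
from urllib.parse import urlsplit

# The three example URIs from the text, with the resource each one identifies.
RESOURCES = {
    "mailto:ben@benhammersley.com": "the author",
    "http://www.w3.org/1999/xhtml": "the concept of XHTML",
    "pop://pop.example.org": "an example POP mailbox",
}


def identify(uri):
    """Look a URI up by exact string match; nothing is fetched."""
    return RESOURCES.get(uri)


for uri in RESOURCES:
    # urlsplit only parses the string; it never opens a network connection.
    print(f"{urlsplit(uri).scheme:6} {uri} identifies {identify(uri)}")
```

The program never touches the network. The http:// URI does its job as an identifier purely by being a unique string, which is all that is required of it.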

Now, by allowing resources to be defined, we can make our metadata more robust. Let's reconsider the managingEditor example:

<managingEditor>editor@oreilly.com</managingEditor>

At the moment, we can't make any form of definitive statement about this, bar what we understand from being able to read English. We can't say for sure what managingEditor actually means (what context is this in?), nor can we understand what the value denotes. Is it an email address we may freely contact, or is it something else? We just can't tell.

If we can assign URIs to each of the resources in this statement, we can give it more meaning:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
         xmlns:RSS091="http://purl.org/rss/1.0/modules/rss091#"
         xmlns:rss="http://purl.org/rss/1.0/">  
   
  <rss:channel rdf:about="http://www.example.org/example.rss">
    <RSS091:managingEditor>editor@oreilly.com</RSS091:managingEditor>
  </rss:channel>
   
</rdf:RDF>

In this example, we introduce a few more concepts, which we'll discuss in the next section. In the meantime, if you look at the emphasized code, you'll see that the channel gains a URI, denoted by the rdf:about="" attribute, and the managingEditor element becomes RSS091:managingEditor.

This immediately gives more context to the metadata. First, the channel is uniquely defined. Second, the managingEditor element is associated with a concept of RSS091, which itself is given a URI to identify it uniquely. Third, the concept of a channel is associated with its own URI. From this information, we can make the following assertion:

The channel (where the concept of channel is identified by the URI http://purl.org/rss/1.0/, and the channel itself is identified by the URI http://www.example.org/example.rss) has an attribute called managingEditor (which is part of a concept as defined by the URI http://purl.org/rss/1.0/modules/rss091#), whose value is editor@oreilly.com.

Because we can know what the managingEditor element means in the context of the resource represented by the URI http://purl.org/rss/1.0/modules/rss091# (it's the guy in charge of the site the feed is from, but you'll have to wait until Chapter 7 to see why), we can now understand what the statement means. Even better than that, we can start to make definitive statements about the metadata within a document, and hence about the document itself. We, and machines too, can definitively state that the managing editor of this feed has the email address editor@oreilly.com, because we've defined all the terms we are using. There is no ambiguity as to what each phrase means or to what it refers.
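As a rough illustration of what a machine can now extract, the sketch below feeds the RDF snippet above to Python's standard xml.etree.ElementTree module. It is not a real RDF parser and handles only this one simple shape of document. The function name qualified_triples is ours, and joining the namespace URI to the element name to form the predicate's URI follows the usual RDF/XML convention.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:RSS091="http://purl.org/rss/1.0/modules/rss091#"
         xmlns:rss="http://purl.org/rss/1.0/">
  <rss:channel rdf:about="http://www.example.org/example.rss">
    <RSS091:managingEditor>editor@oreilly.com</RSS091:managingEditor>
  </rss:channel>
</rdf:RDF>"""


def qualified_triples(rdf_xml):
    """Return (subject URI, predicate URI, value) for each property of each resource."""
    triples = []
    for resource in ET.fromstring(rdf_xml):
        # The rdf:about attribute carries the URI of the thing being described.
        subject = resource.get(f"{{{RDF_NS}}}about")
        for prop in resource:
            # ElementTree stores namespaced tags as "{namespace-uri}localname".
            namespace, local = prop.tag[1:].split("}")
            triples.append((subject, namespace + local, prop.text.strip()))
    return triples


for triple in qualified_triples(DOC):
    print(triple)
```

Compared with the bare strings of RSS 0.9x, the subject and the predicate are now both URIs, so two programs that have never met can agree on exactly which channel and which notion of managingEditor is meant.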

You can't have failed to notice the additional lines of code within the example. This was your first look at RDF. Much of the rest of this book deals with RDF, so let's take a look at it in some detail.
