
13.9 Merging RDF/RSS Files

In addition to reading RSS, you can create applications that write it, or that both read and write RDF/RSS. I use an application that does just this for my own Burningbird web sites.

I have several different types of sites at burningbird.net, each with its own RSS/RDF file. People could subscribe to each individual file, but I wanted to give my readers an option to subscribe to one main RDF/RSS file that contains the 10 most recent entries across the entire Burningbird network. To do this, I created an application in Perl that reads in the individual RSS files and merges all the items into one associative array. It then sorts the items in descending order by date and "skims" the 10 most recent entries. Finally, it uses the data making up these entries to create a brand-new RDF/RSS file, hosted at the main RSS file location, http://burningbird.net/index.rdf.

Because I don't always have access to the most recent version of Python or access to a Tomcat server, I couldn't use either my Java- or my Python-based solution at my main web site. Instead, I wrote an application in Perl, making use of a very handy RSS Perl module, XML::RSS, originally developed by Jonathan Eisenzopf and now maintained at SourceForge.

You can access the source and documentation for using XML::RSS at http://perl-rss.sourceforge.net/.

The XML::RSS Perl module provides an object that can read in an RSS file in either RDF/RSS format or the non-RDF format. The data is then accessible via associative arrays (or dictionaries, for Pythonistas), using the RSS predicates as keys to find each value. For instance, after reading in an RDF/RSS file, you can access the dc:date field for an individual item using code similar to the following:

 $dt = $item->{'dc'}->{'date'};

All items are held in an array that is referenced through the key items on the RSS object, and each element of that array is an associative array holding one item. Elements associated with a particular namespace, such as dc, form yet another associative array attached to each item.
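To make this layout concrete, the following sketch builds by hand a structure shaped like the one just described, rather than parsing a real feed. The titles, links, and dates are made-up sample values, and only core Perl is used, so it runs without XML::RSS installed.

```perl
use strict;
use warnings;

# Hand-built stand-in for a parsed feed (all sample values are hypothetical).
# 'items' holds a reference to an array; each element is a hash reference.
my $feed = {
    items => [
        {
            title => 'First entry',
            link  => 'http://example.com/1',
            dc    => { date => '2003-05-01T10:00:00-06:00', creator => 'someone' },
        },
        {
            title => 'Second entry',
            link  => 'http://example.com/2',
            dc    => { date => '2003-05-02T09:30:00-06:00', creator => 'someone' },
        },
    ],
};

# Walk the items and pull the dc:date out of the nested 'dc' hash.
my @dates;
foreach my $item (@{ $feed->{'items'} }) {
    push @dates, $item->{'dc'}->{'date'};
    print "$item->{'title'}: $item->{'dc'}->{'date'}\n";
}
```

The same two-level lookup, first the namespace prefix and then the element name, applies to any other namespaced element stored on an item.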

The application starts by opening a file that contains the list of all of my index.rdf files and reading the filenames (each on a separate line) into an array:

my $rdffile = "/home/shelleyp/www/work/cronapp/indexfiles.txt";
open(my $dat, '<', $rdffile) || die("could not open $rdffile: $!");
my @files = <$dat>;
close($dat);
chomp(@files);   # strip the trailing newline from each filename
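If the list file might contain blank lines or stray whitespace, a slightly more defensive reader can help. The sketch below is one possible approach, not part of the application in this section; the helper name read_file_list and the sample paths are hypothetical, and an in-memory filehandle stands in for the real list file.

```perl
use strict;
use warnings;

# Hypothetical helper: read one filename per line from an open filehandle,
# dropping newlines, surrounding whitespace, and blank lines.
sub read_file_list {
    my ($fh) = @_;
    my @files;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    # trim stray whitespace
        next if $line eq '';        # skip blank lines
        push @files, $line;
    }
    return @files;
}

# Demonstrate with an in-memory filehandle instead of a real file on disk.
my $sample = "/tmp/site1/index.rdf\n\n  /tmp/site2/index.rdf  \n";
open(my $fh, '<', \$sample) or die "could not open in-memory handle: $!";
my @files = read_file_list($fh);
close($fh);

print "$_\n" for @files;
```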

The application cycles through all the files in the array, creating an instance of XML::RSS to process the data in each. Each individual item within the file is loaded into an associative array, using the item's timestamp as the key:

foreach my $file (@files) {
  my $rss = new XML::RSS;

  $rss->parsefile($file);

  foreach my $item (@{$rss->{'items'}}) {
    my $dt = $item->{'dc'}->{'date'};
    $arry{$dt} = $item;
  }
}
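One thing to watch with a hash keyed by timestamp: two items that carry exactly the same dc:date share a key, so the later one replaces the earlier one. The following sketch illustrates the effect using hand-built feed structures (all values are hypothetical) and shows a plain array as one alternative that keeps every item. It is an illustration only, not the approach used by the application in this section.

```perl
use strict;
use warnings;

# Two hand-built stand-ins for parsed feeds (all values hypothetical).
# The second item in each feed carries the same timestamp.
my @feeds = (
    { items => [
        { title => 'A1', dc => { date => '2003-05-01T10:00:00-06:00' } },
        { title => 'A2', dc => { date => '2003-05-02T08:00:00-06:00' } },
    ] },
    { items => [
        { title => 'B1', dc => { date => '2003-05-03T12:00:00-06:00' } },
        { title => 'B2', dc => { date => '2003-05-02T08:00:00-06:00' } },
    ] },
);

my %by_date;      # hash keyed by timestamp, as in the text
my @all_items;    # alternative: plain array that keeps every item

foreach my $feed (@feeds) {
    foreach my $item (@{ $feed->{'items'} }) {
        $by_date{ $item->{'dc'}->{'date'} } = $item;    # later item replaces earlier
        push @all_items, $item;
    }
}

printf "hash holds %d items, array holds %d items\n",
    scalar(keys %by_date), scalar(@all_items);
```

For feeds where timestamps include seconds, collisions are probably rare, which may be why a hash is good enough in practice.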

A new RDF/RSS object is created and the header information is provided:

my $rss = new XML::RSS (version => '1.0');

 $rss->channel(
   title        => "Burningbird Network",
   link         => "http://burningbird.net",
   description  => "Burningbird: Burning online since 1995",
   dc => {
     subject    => "writing,technology,art,photography,science,environment,politics",
     creator    => 'shelleyp@burningbird.net',
     publisher  => 'shelleyp@burningbird.net',
     rights     => 'Copyright 1995-2003, Shelley Powers, Burningbird',
     language   => 'en-us',
   },
   syn => {
     updatePeriod     => "hourly",
     updateFrequency  => "1",
     updateBase       => "1901-01-01T00:00+00:00",
   },
 );

 $rss->image(
   title  => "Burningbird",
   url    => "http://burningbird.net/mm/birdflame.gif",
   link   => "http://burningbird.net/",
   dc => {
     creator  => "Shelley Powers",
   },
 );

Once the items are loaded, their timestamp keys are sorted in descending order into a regular array, which is then used to loop through only the top 10 (most recent) items. As each item is accessed, it's used to build a new item within the new RDF/RSS object. When the processing is finished, the generated RDF/RSS object is serialized to a file. Example 13-8 shows the code for the complete application.
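The sort-and-skim step can be sketched on its own. This version uses a small hand-built hash with hypothetical titles and dates, and it caps the count so that it still behaves sensibly when fewer items exist than were requested. It assumes, as the string sort does, that every timestamp uses the same format and time zone offset.

```perl
use strict;
use warnings;

# Hypothetical items keyed by timestamp; fewer than the requested total.
my %arry = (
    '2003-05-01T10:00:00-06:00' => { title => 'oldest' },
    '2003-05-03T12:00:00-06:00' => { title => 'newest' },
    '2003-05-02T08:00:00-06:00' => { title => 'middle' },
);
my $total = 10;

# A string sort orders these correctly only because every timestamp
# shares one format and one offset.
my @keys = reverse sort keys %arry;

# Cap the count at the number of keys actually available.
my $count  = @keys < $total ? scalar(@keys) : $total;
my @recent = @keys[0 .. $count - 1];

print "$arry{$_}->{'title'}\n" for @recent;
```

Taking an array slice rather than indexing inside a counter loop is a design choice; either works once the count is capped.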

Example 13-8. Perl application that merges the entries from several different RDF/RSS files, creating a new RDF/RSS file from the results
#!/usr/bin/perl -w

##################################################
# merge RDF/RSS files
# Author: Shelley Powers
##################################################

use lib '.';
use strict;
use XML::RSS;
use HTML::Entities;

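# read in list of RDF/RSS files, one filename per line
my $rdffile = "/home/shelleyp/www/work/cronapp/indexfiles.txt";
open(my $dat, '<', $rdffile) || die("could not open $rdffile: $!");
my @files = <$dat>;
close($dat);
chomp(@files);   # strip the trailing newline from each filename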

# how many items to include
my $total = 10;

# hash of all RSS items, keyed by timestamp
my %arry;

# read in each RDF/RSS file, load items into hash
foreach my $file (@files) {
  my $rss = new XML::RSS;

  $rss->parsefile($file);

  foreach my $item (@{$rss->{'items'}}) {
    my $dt = $item->{'dc'}->{'date'};
    $arry{$dt} = $item;
  }
}

# sort descending order by timestamp
my @keys = reverse(sort(keys %arry));

# create new RDF/RSS file
# create header
my $rss = new XML::RSS (version => '1.0');
$rss->channel(
   title        => "Burningbird Network",
   link         => "http://burningbird.net",
   description  => "Burningbird: Burning online since 1995",
   dc => {
     subject    => "writing,technology,art,photography,science,environment,politics",
     creator    => 'shelleyp@burningbird.net',
     publisher  => 'shelleyp@burningbird.net',
     rights     => 'Copyright 1995-2003, Shelley Powers, Burningbird',
     language   => 'en-us',
   },
   syn => {
     updatePeriod     => "hourly",
     updateFrequency  => "1",
     updateBase       => "1901-01-01T00:00+00:00",
   },
 );

 $rss->image(
   title  => "Burningbird",
   url    => "http://burningbird.net/mm/birdflame.gif",
   link   => "http://burningbird.net/",
   dc => {
     creator  => "Shelley Powers",
   },
 );

# add items, stopping early if fewer than $total exist
my $i = 0;

while ($i < $total && $i < @keys) {
  my $key = $keys[$i];

  # build new RSS item
  $rss->add_item(
     title        => encode_entities($arry{$key}->{'title'}),
     description  => encode_entities($arry{$key}->{'description'}),
     link         => $arry{$key}->{'link'},
     dc => {
         subject  => $arry{$key}->{'dc'}->{'subject'},
         creator  => $arry{$key}->{'dc'}->{'creator'},
         date     => $arry{$key}->{'dc'}->{'date'},
     },
  );
  $i++;
}
$rss->save('/home/shelleyp/www/index.rdf');

The application is then run as a scheduled hourly task, which is more than frequent enough.

As you can see, when you use a specialized API, regardless of the language, your task is greatly simplified. Trying to code this all by hand using regular expressions or even an XML processor would take at least twice as much code, and three times the work. You get a lot of return for a little investment in using a specialized XML vocabulary organized with the RDF metamodel.
