
9.1 Using RSS Feeds Inside Another Site

Because RSS feeds change independently of the rest of the content within your page, the most sensible method of displaying the feed is to treat it as an inclusion. There are two ways of doing this: as a server-side include (SSI), in which your server inserts the parsed RSS feed into the correct place inside your page, or as a client-side include, in which you rely on your user's browser to do the same.

Server-side inclusion depends on settings within your server's configuration. If you have neither control of your own server, nor a friendly system administrator who might be bribed to turn it on for you, you're out of luck here. Client-side inclusion depends on your user's browser allowing the execution of JavaScript. Most do, but some people keep it turned off for security reasons. Anyone using browsers without JavaScript (and this might include PDAs and web-television systems) will not be able to see the feed. You will have to account for this in your design process.

Deciding which method to use is really a matter of how much access you have to the technology that runs your site. At the end of the day, the feed must be parsed into something readable, and this is always done on a server, whether yours or someone else's. So, if you don't have full access to CGI-Bin, you'll be relying on a third-party parsing service. These services give you a line of JavaScript to include in your page's code, which acts as the client-side inclusion. A third-party service is a good fit, for example, for blogs hosted on services that do not allow scripting.

If you have CGI-Bin access, but no server-side inclusion allowed by your host's server settings, you'll need to combine server-side parsing with client-side inclusion. Those of us lucky enough to have full access to our servers can do whatever we like, with all of the toys available and waiting. The most fun, and certainly the most flexible method, is to use server-side parsing and server-side inclusion in combination.
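
One way to picture the middle case, server-side parsing with client-side inclusion, is a CGI-style script that prints JavaScript rather than XHTML. The sketch below assumes the feed has already been parsed into a list of title/link pairs. The sub name items_to_javascript and the sample data are hypothetical, chosen for this illustration only.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn parsed items into JavaScript that writes
# an XHTML list into whichever page includes the script.
sub items_to_javascript {
    my @items = @_;

    # Build the XHTML fragment one line at a time
    my @lines = ('<ul>');
    foreach my $item (@items) {
        push @lines, '<li><a href="' . $item->{'link'} . '">'
                   . $item->{'title'} . '</a></li>';
    }
    push @lines, '</ul>';

    # Wrap each line in document.write(), escaping quotes and
    # backslashes so the line is a valid JavaScript string
    my $js = '';
    foreach my $line (@lines) {
        (my $escaped = $line) =~ s/(['\\])/\\$1/g;
        $js .= "document.write('$escaped');\n";
    }
    return $js;
}

# Sample data standing in for a parsed feed
my @items = (
    { title => "They're Heeereeee",
      link  => 'http://rss.benhammersley.com/archives/001104.html' },
);

# A CGI script sends a header, a blank line, and then the body
print "Content-type: text/javascript\n\n";
print items_to_javascript(@items);
```

The including page would then point a script element at wherever this lives on your server, for instance src="/cgi-bin/feed.cgi" (a made-up path), and the browser does the rest.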

In this section, we'll discuss all the options. But first, let's talk about parsing RSS into something more readable.

9.1.1 Parsing RSS as Simply as Possible

The disadvantage of RSS's split into two separate but similar specifications is that we can never be sure which of the standards our desired feeds will arrive in. If we restrict ourselves to using only RSS 0.9x, it is very likely that the universe will conspire to make the most interesting stuff available solely in RSS 1.0, or vice versa. So, no matter what we want to do with the feed, our approach must be able to handle both standards with equal aplomb. With that in mind, simple parsing of RSS can be done in three different ways:

  • XML parsing

  • Regular expressions

  • XSLT transformations

9.1.1.1 XML parsers

XML parsers are useful tools to have around when dealing with either RSS 0.9x or 1.0. RSS 0.9x is quite a simple format, and using a full-fledged XML parser on it can seem like stirring soup with a cement mixer, but a parser has a distinct advantage over the other methods: future-proofing. Depending on how you architect your code, the use of a proper parser may save you a lot of time should the specifications change, or if new elements are introduced. This is especially useful with RSS 1.0 and its ever-growing raft of modules, and it is the method I recommend. For the majority of purposes, the simplest XML parsers are perfectly adequate. The Perl module XML::Simple is a good example. Example 9-1 is a simple script that uses XML::Simple to parse both RSS 0.9x and RSS 1.0 feeds into XHTML that is ready for server-side inclusion.

Example 9-1. Using XML::Simple to parse RSS
#!/usr/local/bin/perl
   
use strict;
use warnings;
   
use LWP::Simple;
use XML::Simple;
   
my $url=$ARGV[0];
   
# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";
   
# Parse the XML; ForceArray keeps a lone item as a one-element list
my $parser = XML::Simple->new(  );
my $rss = $parser->XMLin($feed_to_parse, ForceArray => ['item']);
   
# Decide on name for outputfile
my $outputfile = "$rss->{'channel'}->{'title'}.html";
   
# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;
   
# Open the output file
open (OUTPUTFILE, '>', $outputfile) or die "I can't open $outputfile: $!";
   
# Print the Channel Title
print OUTPUTFILE '<div class="channelLink">'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";
   
# Print the channel items
print OUTPUTFILE '<div class="linkentries">'."\n"."<ul>";
print OUTPUTFILE "\n";
   
foreach my $item (@{$rss->{channel}->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
print OUTPUTFILE "</ul>\n</div>\n";
  
# Close the OUTPUTFILE
close (OUTPUTFILE);

This script highlights various issues regarding the parsing of RSS, so it is worth dissecting closely. We start with the opening statements:

#!/usr/local/bin/perl
   
use strict;
use warnings;
   
use LWP::Simple;
use XML::Simple;
   
my $url=$ARGV[0];
   
# Retrieve the feed, or die gracefully
my $feed_to_parse = get ($url) or die "I can't get the feed you want";

This is nice and standard Perl—the usual use strict; and use warnings; for good programming karma. Next, we load the two necessary modules: XML::Simple we are aware of already, and LWP::Simple is used to retrieve the RSS feed from the remote server. This is indeed what we do next, taking the command-line argument as the URL for the feed we want to parse. We place the entire feed in the scalar $feed_to_parse, ready for the next section of the script:

# Parse the XML; ForceArray keeps a lone item as a one-element list
my $parser = XML::Simple->new(  );
my $rss = $parser->XMLin($feed_to_parse, ForceArray => ['item']);

This section creates a new XML::Simple object and calls it $parser. It then reads the retrieved RSS feed and parses it into a tree, with the root of the tree called $rss. This tree is actually a set of nested hashes, with the element names as hash keys and repeated elements, such as item, gathered into arrays. In other words, we can do this:

# Decide on name for outputfile
my $outputfile = "$rss->{'channel'}->{'title'}.html";
   
# Replace any spaces within the title with an underscore
$outputfile =~ s/ /_/g;
   
# Open the output file
open (OUTPUTFILE, '>', $outputfile) or die "I can't open $outputfile: $!";

Here we take the value of the title element within the channel, add the string .html, and make it the value of $outputfile. This is for a simple reason: I wanted to make the user interface to this script as simple as possible. You can change it to allow the user to input the output filename themselves, but I like the script to work one out automatically from the title element. Of course, many title elements use spaces, which makes a nasty mess of filenames, so we use a regular expression to replace spaces with underscores. We then open up the file handle, creating the file if necessary.
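
Replacing only spaces leaves other awkward characters, such as slashes or colons, in the filename. A slightly more defensive variant, offered as a sketch rather than as part of the original script, collapses every run of characters outside a conservative whitelist into a single underscore. The sub name title_to_filename and the fallback name feed are assumptions made for this example.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Hypothetical helper: build an output filename from a channel title
sub title_to_filename {
    my ($title) = @_;
    $title = '' unless defined $title;

    # Collapse anything other than letters, digits, and hyphens
    # into a single underscore
    (my $name = $title) =~ s/[^A-Za-z0-9-]+/_/g;

    # Trim underscores left at either end
    $name =~ s/^_+|_+$//g;

    # Fall back to a fixed name if nothing usable remains
    $name = 'feed' if $name eq '';

    return "$name.html";
}

print title_to_filename('Content Syndication with RSS'), "\n";
print title_to_filename('News: today/tomorrow'), "\n";
```

Whether you want this stricter behavior depends on where the files end up. For a handful of feeds you have chosen yourself, the one-line substitution may well be enough.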

With a file ready for filling, and an RSS feed parsed in memory, let's fill in some of the rest:

# Print the Channel Title
print OUTPUTFILE '<div class="channelLink">'."\n".'<a href="';
print OUTPUTFILE "$rss->{'channel'}->{'link'}".'">';
print OUTPUTFILE "$rss->{'channel'}->{'title'}</a>\n</div>\n";

Here we start to make the XHTML version. We take the link and title elements from the channel and create a title that is a hyperlink to the destination of the feed. We wrap it in a div, so that we can format it later with CSS, and include some newlines to make the XHTML source as pretty as can be:

# Print the channel items
print OUTPUTFILE '<div class="linkentries">'."\n"."<ul>";
print OUTPUTFILE "\n";
   
foreach my $item (@{$rss->{channel}->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a></li>\n";
}
           
print OUTPUTFILE "</ul>\n</div>\n";
  
# Close the OUTPUTFILE
close (OUTPUTFILE);

The last section of the script deals with the biggest issue for all RSS parsing: the differences between RSS 0.9x and RSS 1.0. With XML::Simple, or any other tree-based parser, this is especially crucial, because the item element appears in a different place in each specification. Remember: in RSS 0.9x, item is a subelement of channel, but in RSS 1.0, item and channel are siblings, both sitting directly under the root rdf:RDF element.

So, in the preceding snippet you can see two foreach loops. The first one takes care of RSS 0.9x feeds, and the second covers RSS 1.0. Either way, the items are encased inside another div and made into a ul unordered list. The script finishes by closing the file handle. Our work is done.
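
The same idea can be folded into a single helper, so that the printing code is written only once. The following is a minimal sketch, assuming XML::Simple is installed; the sub name feed_items and both inline feeds are invented for illustration. It also copes with the case where the parser hands back a lone item as a hash rather than as an array of hashes.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

use XML::Simple;

# Hypothetical helper: return every item, wherever the format puts it
sub feed_items {
    my ($rss) = @_;

    # RSS 0.9x keeps items inside channel; RSS 1.0 keeps them at the root
    my @places = ($rss->{'channel'}->{'item'}, $rss->{'item'});

    my @items;
    foreach my $found (@places) {
        next unless defined $found;
        # A lone item may arrive as a hash rather than an array of hashes
        push @items, ref($found) eq 'ARRAY' ? @$found : $found;
    }
    return @items;
}

# Two tiny made-up feeds, inlined so the example needs no network
my $rss_091 = <<'END_091';
<rss version="0.91">
  <channel>
    <title>Example 0.91</title>
    <link>http://www.example.org/</link>
    <item>
      <title>Only story</title>
      <link>http://www.example.org/1.html</link>
    </item>
  </channel>
</rss>
END_091

my $rss_10 = <<'END_10';
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.example.org/">
    <title>Example 1.0</title>
    <link>http://www.example.org/</link>
  </channel>
  <item rdf:about="http://www.example.org/a.html">
    <title>Story A</title>
    <link>http://www.example.org/a.html</link>
  </item>
  <item rdf:about="http://www.example.org/b.html">
    <title>Story B</title>
    <link>http://www.example.org/b.html</link>
  </item>
</rdf:RDF>
END_10

foreach my $xml ($rss_091, $rss_10) {
    my $rss = XML::Simple->new()->XMLin($xml);
    foreach my $item (feed_items($rss)) {
        print "$item->{'title'} <$item->{'link'}>\n";
    }
}
```

With a helper like this, the two foreach loops in the script could collapse into one loop over feed_items($rss), at the cost of a little extra indirection.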

Running this from the command line, with the RSS feed from http://rss.benhammersley.com/index.xml, produces the result shown in Example 9-2.

Example 9-2. Content_Syndication_with_RSS.html
<div class="channelLink">
<a href="http://rss.benhammersley.com/">Content Syndication with XML and RSS</a>
</div>
<div class="linkentries">
<ul>
<li><a href="http://rss.benhammersley.com/archives/001150.html">PHP parsing of RSS</a></li>
<li><a href="http://rss.benhammersley.com/archives/001146.html">RSS for Pocket PC</a></li>
<li><a href="http://rss.benhammersley.com/archives/001145.html">Syndic8 is One</a></li>
<li><a href="http://rss.benhammersley.com/archives/001141.html">RDF mod_events</a></li>
<li><a href="http://rss.benhammersley.com/archives/001140.html">RSS class for cocoa</a></li>
<li><a href="http://rss.benhammersley.com/archives/001131.html">Creative Commons RDF</a></li>
<li><a href="http://rss.benhammersley.com/archives/001129.html">RDF events in Outlook.</a></li>
<li><a href="http://rss.benhammersley.com/archives/001128.html">Reading Online News</a></li>
<li><a href="http://rss.benhammersley.com/archives/001115.html">Hep messaging server</a></li>
<li><a href="http://rss.benhammersley.com/archives/001109.html">mod_link</a></li>
<li><a href="http://rss.benhammersley.com/archives/001107.html">Individual Entries as RSS 1.0</a></li>
<li><a href="http://rss.benhammersley.com/archives/001105.html">RDFMap</a></li>
<li><a href="http://rss.benhammersley.com/archives/001104.html">They're Heeereeee</a></li>
<li><a href="http://rss.benhammersley.com/archives/001077.html">Burton Modules</a></li>
<li><a href="http://rss.benhammersley.com/archives/001076.html">RSS within XHTML documents UPDATED</a></li>
</ul>
</div>

We can then include this inside another page using server-side inclusion (see later in this chapter).

After all our detailing of additional elements, I hear you cry, where are they? Well, including extra elements in a script of this sort is rather simple. Here I've taken another look at the second foreach loop from our previous example. Notice the two additions: the if block that tests for dc:creator, and the line that prints the description:

foreach my $item (@{$rss->{'item'}}) {
    next unless defined($item->{'title'}) && defined($item->{'link'});
    print OUTPUTFILE '<li><a href="';
    print OUTPUTFILE "$item->{'link'}";
    print OUTPUTFILE '">';
    print OUTPUTFILE "$item->{'title'}</a>";
    # New: print the author, if the item has a dc:creator element
    if ($item->{'dc:creator'}) {
        print OUTPUTFILE '<span class="dccreator">Written by ';
        print OUTPUTFILE "$item->{'dc:creator'}";
        print OUTPUTFILE '</span>';
    }
    # New: print the description beneath the link
    print OUTPUTFILE "<ol><blockquote>$item->{'description'}</blockquote></ol>";
    print OUTPUTFILE "\n</li>\n";
}

This section now looks inside the RSS feed for a dc:creator element and displays it if it finds one. It also retrieves the contents of the description element and displays it as a nested item in the list. You might want to change this formatting, obviously.

By repeating the if block, it is easy to add support for different elements as you see fit, and it's also simple to give each new element its own div or span class to control the on-screen formatting. For example:

if ($item->{'dc:creator'}) {
    print OUTPUTFILE '<span class="dccreator">Written by ';
    print OUTPUTFILE "$item->{'dc:creator'}";
    print OUTPUTFILE '</span>';
}
if ($item->{'dc:date'}) {
    print OUTPUTFILE '<span class="dcdate">Date: ';
    print OUTPUTFILE "$item->{'dc:date'}";
    print OUTPUTFILE '</span>';
}
if ($item->{'annotate:reference'}) {
    print OUTPUTFILE '<span class="annotation"><a href="';
    print OUTPUTFILE "$item->{'annotate:reference'}->{'rdf:resource'}";
    print OUTPUTFILE '">Comment on this</a></span>';
}

Installing XML Parsers with Expat

Most XML parsers found in scripting languages (Perl, Python, etc.) are really interfaces for Expat, the powerful XML parsing library. They therefore require Expat to be installed. Expat is available from http://expat.sourceforge.net/ and is released under the MIT License.

The final extension in the preceding code prints the contents of the annotate:reference element. This element, as we mentioned in Chapter 7, carries a single rdf:resource attribute. Note the way we get XML::Simple to read the attribute: it just treats the attribute as another leaf on the tree, so you call it in the same way you would a subelement. You can use the same syntax for any attribute-only element.
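
To see the attribute lookup in isolation, here is a small sketch that parses a single made-up RSS 1.0 item. It assumes XML::Simple is installed; the URLs are placeholders, and the feed is hand-written for this example rather than taken from a real source.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

use XML::Simple;

# A made-up RSS 1.0 document with one attribute-only element
my $xml = <<'END_ITEM';
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:annotate="http://purl.org/rss/1.0/modules/annotate/"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://www.example.org/story.html">
    <title>A story</title>
    <link>http://www.example.org/story.html</link>
    <annotate:reference rdf:resource="http://www.example.org/comments/1"/>
  </item>
</rdf:RDF>
END_ITEM

# ForceArray makes item an array even though there is only one
my $rss  = XML::Simple->new()->XMLin($xml, ForceArray => ['item']);
my $item = $rss->{'item'}->[0];

# The attribute is read exactly as a subelement would be
print "$item->{'annotate:reference'}->{'rdf:resource'}\n";

# The same goes for the rdf:about attribute on the item itself
print "$item->{'rdf:about'}\n";
```

Because attributes and subelements share one hash, an attribute and a child element with the same name would collide. That does not arise in this example, but it is worth remembering when you meet less tidy modules.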

9.1.2 Regular Expressions

Using regular expressions to parse RSS may seem a little brutish, but it does have two advantages. First, it sidesteps the differences between the standards, because a pattern that matches an item does not care where in the document that item sits. Second, it is much easier to install: it requires no XML parsing modules, or any dependencies thereof.

Regular expressions, however, are not pretty. Consider Example 9-3, which is a section from Rael Dornfest's lightweight RSS aggregator, Blagg.

Example 9-3. A section of code from Blagg
# Feed's title and link
my($f_title, $f_link) = ($rss =~ m#<title>(.*?)</title>.*?<link>(.*?)</link>#ms);
   
# RSS items' title, link, and description
   
while ( $rss =~ m{<item(?!s).*?>.*?(?:<title>(.*?)</title>.*?)?(?:<link>(.*?)</link>.*?)?(?:<description>(.*?)</description>.*?)?</item>}mgis ) {
     my($i_title, $i_link, $i_desc, $i_fn) = ($1||'', $2||'', $3||'', undef);
   
     # Unescape &amp; &lt; &gt; to produce useful HTML
     my %unescape = ('&lt;'=>'<', '&gt;'=>'>', '&amp;'=>'&', '&quot;'=>'"'); 
     my $unescape_re = join '|' => keys %unescape;
     $i_title && $i_title =~ s/($unescape_re)/$unescape{$1}/g;
     $i_desc && $i_desc =~ s/($unescape_re)/$unescape{$1}/g;
   
     # If no title, use the first 50 non-markup characters of the description
     unless ($i_title) {
          $i_title = $i_desc;
          $i_title =~ s/<.*?>//msg;
          $i_title = substr($i_title, 0, 50);
          }
          next unless $i_title;

While this looks pretty nasty, it is actually an efficient way of stripping the data out of the RSS file, even if it is potentially much harder to extend. If you are really into regular expressions and do not mind having a very specialized, hard-to-extend system, their simplicity may be for you. They certainly have their place.
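
For comparison, here is a stripped-down, self-contained sketch of the same technique. It is not Blagg's code: the sub name extract_items and the sample feed are invented for this example, and the patterns assume that titles and links are plain text rather than CDATA sections.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull title/link pairs out of raw RSS with regexes.
# It never looks at where item sits in the document, so the same code
# serves RSS 0.9x and RSS 1.0 alike.
sub extract_items {
    my ($rss) = @_;
    my @items;

    # \b stops the pattern from matching the RSS 1.0 <items> element
    while ($rss =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my $body = $1;
        my ($title) = $body =~ m{<title>(.*?)</title>}is;
        my ($link)  = $body =~ m{<link>(.*?)</link>}is;
        next unless defined($title) && defined($link);
        push @items, { title => $title, link => $link };
    }
    return @items;
}

# A tiny made-up RSS 1.0 feed
my $sample = <<'END_RSS';
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.example.org/">
    <title>Example</title>
    <link>http://www.example.org/</link>
    <items><rdf:Seq><rdf:li rdf:resource="http://www.example.org/1.html"/></rdf:Seq></items>
  </channel>
  <item rdf:about="http://www.example.org/1.html">
    <title>Fish &amp; chips</title>
    <link>http://www.example.org/1.html</link>
  </item>
</rdf:RDF>
END_RSS

print "$_->{title} <$_->{link}>\n" for extract_items($sample);
```

Note that this sketch leaves entities such as &amp;amp; exactly as it found them, whereas the Blagg excerpt unescapes them. Which you want depends on what you do with the text next.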
