What is RSS?(2)

What is RSS?
by Mark Pilgrim

Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code will need to be aware of:

  1. The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether and blindly look for useful information inside it.

  2. RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace. The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.

    We can go in one of two ways here: if we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML and RDF). If or when it does, we'll miss it.

    If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0 feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (Not shown here: RSS 0.90 feeds also use a namespace, though not the same one as RSS 1.0, so what we really need is a list of namespaces to search.)

  3. Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.

  4. Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, so we're going to ignore it and assume that the order of the items within the RSS feed is given by the order of the item elements.
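
To make the second point concrete, here is a small sketch (written for Python 3, unlike the listings later in this article) of why prefix guessing can miss data. The feed fragment is made up for illustration: it binds Dublin Core to the unusual prefix "meta" rather than the conventional "dc", which is perfectly valid XML.

```python
from xml.dom import minidom

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical RSS 1.0 fragment: Dublin Core is bound to the prefix
# "meta" instead of the conventional "dc".
FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:meta="http://purl.org/dc/elements/1.1/"
    xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://www.example.com/">
    <title>Example</title>
  </channel>
  <item rdf:about="http://www.example.com/a">
    <title>First article</title>
    <meta:creator>A. Author</meta:creator>
  </item>
</rdf:RDF>"""

doc = minidom.parseString(FEED)

# Prefix-based guess: look for the literal qualified name "dc:creator".
by_prefix = doc.getElementsByTagName("dc:creator")

# Namespace-aware lookup: match on namespace URI plus local name.
by_namespace = doc.getElementsByTagNameNS(DC_NS, "creator")

print(len(by_prefix))     # 0: the guess misses the author
print(len(by_namespace))  # 1: the namespace-aware search finds it
```

The prefix is only a local abbreviation; the namespace URI is what identifies the element, so matching on the URI keeps working whatever prefix a feed chooses.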

But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of the same feed:

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
      <dc:creator>Will Provost</dc:creator>
      <dc:date>2002-12-04</dc:date>    
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
      <dc:creator>Priya Lakshminarayanan</dc:creator>
      <dc:date>2002-12-04</dc:date>    
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
      <dc:creator>Antoine Quint</dc:creator>
      <dc:date>2002-12-04</dc:date>    
    </item>
  </channel>
</rss>

As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but it's not RDF. Like RSS 0.91, there is no default namespace and items are back inside the channel. If our code is liberal enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0 should not present any additional wrinkles.

How can I read RSS?

Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install PyXML first.)

from xml.dom import minidom
import urllib

def load(rssURL):
  return minidom.parse(urllib.urlopen(rssURL))

This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.
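
The listing above targets the Python of its day. On Python 3, urllib.urlopen has moved to urllib.request.urlopen, so an equivalent might look like the following sketch. The name rss_url and the with-block are our choices, not part of the original listing.

```python
from urllib.request import urlopen
from xml.dom import minidom


def load(rss_url):
    """Fetch rss_url and return the parsed minidom Document."""
    # urlopen returns a file-like response object, which minidom.parse
    # accepts directly; the with-block closes it when parsing is done.
    with urlopen(rss_url) as response:
        return minidom.parse(response)
```

Because urlopen also understands file: URLs, the same function can be pointed at a feed saved on disk, which is handy for experimenting without a network connection.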

The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any number of namespaces. Python's XML library includes a getElementsByTagNameNS method, which takes a namespace and a tag name, so we'll use that to make our code general enough to handle RSS 0.91 through 0.94 and 2.0 (which have no default namespace), RSS 1.0, and even RSS 0.90. This method finds all elements with a given name, anywhere within a node. That's a good thing; it means that we can search for item elements within the root node and always find them, whether they are inside or outside the channel element.

DEFAULT_NAMESPACES = \
  (None, # RSS 0.91, 0.92, 0.93, 0.94, 2.0
  'http://purl.org/rss/1.0/', # RSS 1.0
  'http://my.netscape.com/rdf/simple/0.9/' # RSS 0.90
  )

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
  for namespace in possibleNamespaces:
    children = node.getElementsByTagNameNS(namespace, tagName)
    if len(children): return children
  return []
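
As a quick check of the idea, here is a self-contained Python 3 sketch of the same fallback search, run against two tiny made-up feeds: one RSS 2.0 (items inside the channel, no namespace) and one RSS 1.0 (items outside the channel, default namespace). The name get_elements and both feed fragments are ours, for illustration only.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,                                      # RSS 0.91-0.94, 2.0
    "http://purl.org/rss/1.0/",                # RSS 1.0
    "http://my.netscape.com/rdf/simple/0.9/",  # RSS 0.90
)


def get_elements(node, tag_name, namespaces=DEFAULT_NAMESPACES):
    """Return the matches from the first namespace that yields any."""
    for namespace in namespaces:
        found = node.getElementsByTagNameNS(namespace, tag_name)
        if found:
            return found
    return []


RSS20 = """<rss version="2.0"><channel><title>Feed</title>
<item><title>Inside the channel</title></item></channel></rss>"""

RSS10 = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://www.example.com/"><title>Feed</title></channel>
<item rdf:about="http://www.example.com/a"><title>Outside the channel</title></item>
</rdf:RDF>"""

results = {}
for label, source in (("RSS 2.0", RSS20), ("RSS 1.0", RSS10)):
    items = get_elements(minidom.parseString(source), "item")
    # Record how many items were found and which namespace matched.
    results[label] = (len(items), items[0].namespaceURI)
    print(label, results[label])
```

In both cases one item is found; only the namespace that matched differs (None for the RSS 2.0 fragment, the RSS 1.0 namespace for the other).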

Finally, we need two utility functions to make our lives easier. First, our getElementsByTagName function will return a list of elements, but most of the time we know there's only going to be one. An item only has one title, one link, one description, and so on. We'll define a first function that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a textOf function that returns the entire text of a particular XML element.

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
  children = getElementsByTagName(node, tagName, possibleNamespaces)
  return len(children) and children[0] or None

def textOf(node):
  return node and "".join([child.data for child in node.childNodes]) or ""
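
One caveat worth illustrating: textOf reads the data attribute of every child node, so it assumes an element's children are all text. A slightly more defensive variant, sketched here for Python 3 under the name text_of (ours, not the article's), keeps only text and CDATA children and skips comments and nested elements.

```python
from xml.dom import Node, minidom


def text_of(node):
    """Concatenate the text and CDATA children of node; '' if node is None."""
    if node is None:
        return ""
    return "".join(
        child.data
        for child in node.childNodes
        if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
    )


# A made-up description mixing plain text, a CDATA section and a comment.
doc = minidom.parseString(
    "<item><description>Plain, <![CDATA[<b>escaped</b>]]> and "
    "<!-- a comment -->done</description></item>"
)
description = doc.getElementsByTagName("description")[0]

print(text_of(description))  # Plain, <b>escaped</b> and done
print(repr(text_of(None)))   # ''
```

Whether to skip or recurse into nested elements is a design choice; skipping them is the simpler option and matches what the simple feeds in this article need.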

That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:

DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

if __name__ == '__main__':
  import sys
  rssDocument = load(sys.argv[1])
  for item in getElementsByTagName(rssDocument, 'item'):
    print 'title:', textOf(first(item, 'title'))
    print 'link:', textOf(first(item, 'link'))
    print 'description:', textOf(first(item, 'description'))
    print 'date:', textOf(first(item, 'date', DUBLIN_CORE))
    print 'author:', textOf(first(item, 'creator', DUBLIN_CORE))
    print

Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date:
author:

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date:
author:

For both the sample RSS 1.0 feed and sample RSS 2.0 feed, we also get dates and authors for each item. We reuse our custom getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but they are not widely deployed in public RSS feeds.)

Here's the output against our sample RSS 1.0 feed:

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date: 2002-12-04
author: Antoine Quint

Running against our sample RSS 2.0 feed produces the same results.
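
For readers on Python 3, the whole approach can be condensed into one self-contained sketch that parses an inline feed instead of downloading one. The helper names (get_elements, first_text, parse_items) are ours, and SAMPLE is a trimmed copy of the RSS 2.0 feed shown earlier, with the description left out to demonstrate the empty-string fallback.

```python
from xml.dom import Node, minidom

DEFAULT_NAMESPACES = (None, "http://purl.org/rss/1.0/",
                      "http://my.netscape.com/rdf/simple/0.9/")
DUBLIN_CORE = ("http://purl.org/dc/elements/1.1/",)


def get_elements(node, tag_name, namespaces=DEFAULT_NAMESPACES):
    """Return the matches from the first namespace that yields any."""
    for namespace in namespaces:
        found = node.getElementsByTagNameNS(namespace, tag_name)
        if found:
            return found
    return []


def first_text(node, tag_name, namespaces=DEFAULT_NAMESPACES):
    """Text of the first matching descendant, or '' if there is none."""
    found = get_elements(node, tag_name, namespaces)
    if not found:
        return ""
    return "".join(child.data for child in found[0].childNodes
                   if child.nodeType in (Node.TEXT_NODE,
                                         Node.CDATA_SECTION_NODE))


def parse_items(document):
    """Return one dict per item, holding the five fields printed above."""
    return [
        {
            "title": first_text(item, "title"),
            "link": first_text(item, "link"),
            "description": first_text(item, "description"),
            "date": first_text(item, "date", DUBLIN_CORE),
            "author": first_text(item, "creator", DUBLIN_CORE),
        }
        for item in get_elements(document, "item")
    ]


SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>XML.com</title>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <dc:creator>Will Provost</dc:creator>
      <dc:date>2002-12-04</dc:date>
    </item>
  </channel>
</rss>"""

records = parse_items(minidom.parseString(SAMPLE))
for record in records:
    for field, value in record.items():
        print(field + ":", value)
```

Returning dictionaries rather than printing directly is only a convenience: it makes the result easy to test or to hand to whatever displays the feed.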

This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.
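
Until then, it helps to know how such a feed shows up in code. The following Python 3 sketch is not the technique promised for next month; it only shows that minidom refuses an ill-formed document by raising ExpatError, which a caller can catch and report. The function name try_parse and the broken feed are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def try_parse(xml_text):
    """Return (document, None) on success or (None, message) on failure."""
    try:
        return minidom.parseString(xml_text), None
    except ExpatError as error:
        return None, str(error)


# An unescaped ampersand, a typical mistake in template-generated feeds.
BROKEN = ("<rss version='0.91'><channel>"
          "<title>News & Views</title></channel></rss>")
GOOD = BROKEN.replace("&", "&amp;")

for label, source in (("broken", BROKEN), ("good", GOOD)):
    document, problem = try_parse(source)
    print(label, "->", problem if problem else "parsed fine")
```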

  • Your article helped me -
    2005-03-06 08:25:14 katykoot

    I just want to make sure I did it right. I am hoping that my feed.rss will be picked up by news agencies and syndicated. Can you check it for me...and by looking at this - is this the correct way to be doing a press release...by rss? versus prweb.com?


    http://www.amethystlive.com/feed.rss

  • My Company doesn't want RSS
    2004-04-07 08:47:00 Raleigh Swick

    After trying to get my company to get RSS feeds.... their reply:


    >> However, we are holding off on deploying them in a widespread fashion
    >> while we craft a general strategy for syndication and mobility. Though
    >> RSS feeds may be useful in news aggregators, and could increase page
    >> views to articles, the risk is that they may actually reduce page
    >> views to our own index pages, which carry display advertising that RSS
    >> inherently cannot.


    How can I argue this? What to say?

    • My Company doesn't want RSS
      2004-08-07 20:09:08 Thogek

      Note that an RSS feed often contains only the opening paragraph (or short teaser/summary) of each posted article (or announcement or whatever). The bulk of the article is generally kept on the Web site, the URL for which is included in the RSS feed, so that users who are interested can click through and read the whole thing. So, your RSS feed is basically another way for users to subscribe to announcements of new articles, features, etc., for which they still have to come to your site, and view your pages (and ads) in order to view the whole article. (Kinda like direct-emailing of new article announcements, but easier.)

    • My Company doesn't want RSS
      2004-06-11 09:09:38 prakashnambiar

      Hey, you can deliver an advert with the image/logo of your RSS feed, and you can still use the comments tag!

  • Please don't break XML!
    2003-01-03 01:35:35 Henri Sivonen

    Using the namespace prefixes instead of proper namespace processing is dirty. Getting a namespace-aware XML parser is not that hard. Please don't break namespaces by using prefix-based guessing.


    Even more worrying is the last sentence: "Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML."


    What's there to tackle? The only correct way to handle ill-formed XML is to firmly reject it. Please enforce the XML well-formedness requirements in order to protect XML from degenerating into tag soup.

    • Please don't break XML!
      2005-01-19 04:20:03 Looking_past_XML

      Oh my freakin' god, XML is not a sacred standard, shit happens, the official W3C docs encourage browser/user-agent creators to attempt to properly render imperfect html/xhtml.


      The spirit of RFCs and protocols that has made the internet work(able) is:
      "Be liberal in what you accept and conservative in what you send" and its variations, by Jon Postel.


      Also, TOG et al would probably assail a system that was so anal and rigid and non-resilient (and they would say lazy) that it couldn't route around some minor formatting transgressions and give the user 50%, 80% or whatever percent of the feed that it could decipher.
      But this brings up the elephant that no one is allowed to talk about - XML and its main parsers are extremely brittle and complex.


      Flame on...

      • Please don't break XML!
        2005-01-19 04:33:16 Looking_past_XML

        Since people may not get the "TOG" reference:
        TOG is a usability expert who puts much more responsibility on system creators for making systems that "Just Work++" than many programmers would like. After reading too much of his stuff, you start thinking, "Wow, programs should do a much better job for the user in many cases." http://www.asktog.com/Bughouse/10MostPersistentBugs.html

    • Please don't break XML!
      2005-01-11 05:27:50 despil

      I absolutely agree.


      What is the sense in making standards if we throw them out the window so easily?


      Either don't make standards or use them.
      There is no third way.

  • RDF makes life difficult
    2002-12-19 12:39:02 Mario Diana

    If there is some reason that sites wish to use RDF, they ought to include an XSL transformation of the document to RSS. Is that really so difficult?


    I was writing a Web service to gather an RSS feed from a client and return transformed HTML. When I ran into RDF, I was completely thrown. (Okay, maybe I live under a rock.)


    RDF is for machines; RSS is far more human-friendly. It's a pain to have to deal with it if you're not interested in its features.

    • RDF makes life difficult
      2004-09-24 13:06:41 kes

      Can you tell me more about the project you are working on? It sounds really interesting, and I would love to learn more. Thanks.

  • So that's what it is ;-)
    2002-12-19 11:59:51 Danny Ayers

    Good piece, refreshingly practical. Also a refreshingly balanced comparison between the different formats, though (predictable quibble) the recommendation of RSS 2.0 for "general-purpose, metadata-rich syndication" seems a little strange when RSS 1.0-based feeds can be much more general purpose and metadata rich, thanks to RDF.