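What Is RSS

Original post, June 10, 2003, 12:12:00

RSS (Rich Site Summary or RDF Site Summary) is a technology for syndicating website content. It originated in the browser "news channel" feature, but it has since found much broader use in areas such as enterprise portals and enterprise application integration (EAI).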

————————————————

What is RSS?
By Mark Pilgrim

RSS is a format for syndicating news and the content of news-like sites, including major news sites like Wired, news-oriented community sites like Slashdot, and personal weblogs. But it's not just for news. Pretty much anything that can be broken down into discrete items can be syndicated via RSS: the "recent changes" page of a wiki, a changelog of CVS checkins, even the revision history of a book. Once information about each item is in RSS format, an RSS-aware program can check the feed for changes and react to the changes in an appropriate way.

RSS-aware programs called news aggregators are popular in the weblogging community. Many weblogs make content available in RSS. A news aggregator can help you keep up with all your favorite weblogs by checking their RSS feeds and displaying new items from each of them.

A brief history

But coders beware. The name "RSS" is an umbrella term for a format that spans several different versions of at least two different (but parallel) formats. The original RSS, version 0.90, was designed by Netscape as a format for building portals of headlines to mainstream news sites. It was deemed overly complex for its goals; a simpler version, 0.91, was proposed and subsequently dropped when Netscape lost interest in the portal-making business. But 0.91 was picked up by another vendor, UserLand Software, which intended to use it as the basis of its weblogging products and other web-based writing software.

In the meantime, a third, non-commercial group split off and designed a new format based on what they perceived as the original guiding principles of RSS 0.90 (before it got simplified into 0.91). This format, which is based on RDF, is called RSS 1.0. But UserLand was not involved in designing this new format, and, as an advocate of simplifying 0.90, it was not happy when RSS 1.0 was announced. Instead of accepting RSS 1.0, UserLand continued to evolve the 0.9x branch, through versions 0.92, 0.93, 0.94, and finally 2.0.

What a mess.

So which one do I use?

That's 7 -- count 'em, 7! -- different formats, all called "RSS". As a coder of RSS-aware programs, you'll need to be liberal enough to handle all the variations. But as a content producer who wants to make your content available via syndication, which format should you choose?

RSS versions and recommendations

| Version | Owner | Pros | Status | Recommendation |
|---|---|---|---|---|
| 0.90 | Netscape | | Obsoleted by 1.0 | Don't use |
| 0.91 | UserLand | Drop dead simple | Officially obsoleted by 2.0, but still quite popular | Use for basic syndication. Easy migration path to 2.0 if you need more flexibility |
| 0.92, 0.93, 0.94 | UserLand | Allows richer metadata than 0.91 | Obsoleted by 2.0 | Use 2.0 instead |
| 1.0 | RSS-DEV Working Group | RDF-based, extensibility via modules, not controlled by a single vendor | Stable core, active module development | Use for RDF-based applications or if you need advanced RDF-specific modules |
| 2.0 | UserLand | Extensibility via modules, easy migration path from 0.9x branch | Stable core, active module development | Use for general-purpose, metadata-rich syndication |

What does RSS look like?

Imagine you want to write a program that reads RSS feeds, so that you can publish headlines on your site, build your own portal or homegrown news aggregator, or whatever. What does an RSS feed look like? That depends on which version of RSS you're talking about. Here's a sample RSS 0.91 feed (adapted from XML.com's RSS feed):

<rss version="0.91">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
    </item>
  </channel>
</rss>

Simple, right? A feed comprises a channel, which has a title, link, description, and (optional) language, followed by a series of items, each of which has a title, link, and description.

Now look at the RSS 1.0 version of the same information:

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
>
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
    <title>Normalizing XML, Part 2</title>
    <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
    <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
    <dc:creator>Will Provost</dc:creator>
    <dc:date>2002-12-04</dc:date>    
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/som.html">
    <title>The .NET Schema Object Model</title>
    <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
    <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
    <dc:creator>Priya Lakshminarayanan</dc:creator>
    <dc:date>2002-12-04</dc:date>    
  </item>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <title>SVG's Past and Promising Future</title>
    <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
    <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
    <dc:creator>Antoine Quint</dc:creator>
    <dc:date>2002-12-04</dc:date>    
  </item>
</rdf:RDF>

Quite a bit more verbose. People familiar with RDF will recognize this as an XML serialization of an RDF document; the rest of the world will at least recognize that we're syndicating essentially the same information. In fact, we're including a bit more information: item-level authors and publishing dates, which RSS 0.91 does not support.

Despite being RDF/XML, RSS 1.0 is structurally similar to previous versions of RSS -- similar enough that we can simply treat it as XML and write a single function to extract information out of either an RSS 0.91 or RSS 1.0 feed. However, there are some significant differences that our code will need to be aware of:

  1. The root element is rdf:RDF instead of rss. We'll either need to handle both explicitly or just ignore the name of the root element altogether and blindly look for useful information inside it.

  2. RSS 1.0 uses namespaces extensively. The RSS 1.0 namespace is http://purl.org/rss/1.0/, and it's defined as the default namespace. The feed also uses http://www.w3.org/1999/02/22-rdf-syntax-ns# for the RDF-specific elements (which we'll simply be ignoring for our purposes) and http://purl.org/dc/elements/1.1/ (Dublin Core) for the additional metadata of article authors and publishing dates.

    We can go in one of two ways here: if we don't have a namespace-aware XML parser, we can blindly assume that the feed uses the standard prefixes and default namespace and look for item elements and dc:creator elements within them. This will actually work in a large number of real-world cases; most RSS feeds use the default namespace and the same prefixes for common modules like Dublin Core. This is a horrible hack, though. There's no guarantee that a feed won't use a different prefix for a namespace (which would be perfectly valid XML and RDF). If or when it does, we'll miss it.

    If we have a namespace-aware XML parser at our disposal, we can construct a more elegant solution that handles both RSS 0.91 and 1.0 feeds. We can look for items in no namespace; if that fails, we can look for items in the RSS 1.0 namespace. (Not shown here: RSS 0.90 feeds also use a namespace, though not the same one as RSS 1.0. So what we really need is a list of namespaces to search.)

  3. Less obvious but still important, the item elements are outside the channel element. (In RSS 0.91, the item elements were inside the channel. In RSS 0.90, they were outside; in RSS 2.0, they're inside. Whee.) So we can't be picky about where we look for items.

  4. Finally, you'll notice there is an extra items element within the channel. It's only useful to RDF parsers, and we're going to ignore it and assume that the order of the items within the RSS feed is given by the order of the item elements.
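To make the second point concrete, here is a minimal sketch (Python 3, standard library only) of why matching on prefixes is fragile. The miniature feed and its unusual prefixes (r, rss, dublin) are made up for illustration; only the namespace URIs come from the sample above.

```python
from xml.dom import minidom

RSS10_NS = "http://purl.org/rss/1.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical feed: the same namespaces as the sample, bound to unusual prefixes.
FEED = """<r:RDF
  xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rss="http://purl.org/rss/1.0/"
  xmlns:dublin="http://purl.org/dc/elements/1.1/">
  <rss:item r:about="http://www.xml.com/pub/a/2002/12/04/svg.html">
    <rss:title>SVG's Past and Promising Future</rss:title>
    <dublin:creator>Antoine Quint</dublin:creator>
  </rss:item>
</r:RDF>"""

doc = minidom.parseString(FEED)

# Prefix-based lookup (the hack) compares qualified names, so it finds nothing.
by_prefix = doc.getElementsByTagName("dc:creator")

# Namespace-aware lookup ignores the prefix and matches on the namespace URI.
by_namespace = doc.getElementsByTagNameNS(DC_NS, "creator")
items = doc.getElementsByTagNameNS(RSS10_NS, "item")

print(len(by_prefix), len(by_namespace), len(items))  # prints "0 1 1"
```

The feed is perfectly valid, yet the prefix-based search misses the author entirely, which is the failure mode described above.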

But what about RSS 2.0? Luckily, once we've written code to handle RSS 0.91 and 1.0, RSS 2.0 is a piece of cake. Here's the RSS 2.0 version of the same feed:

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
      <dc:creator>Will Provost</dc:creator>
      <dc:date>2002-12-04</dc:date>    
    </item>
    <item>
      <title>The .NET Schema Object Model</title>
      <link>http://www.xml.com/pub/a/2002/12/04/som.html</link>
      <description>Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.</description>
      <dc:creator>Priya Lakshminarayanan</dc:creator>
      <dc:date>2002-12-04</dc:date>    
    </item>
    <item>
      <title>SVG's Past and Promising Future</title>
      <link>http://www.xml.com/pub/a/2002/12/04/svg.html</link>
      <description>In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.</description>
      <dc:creator>Antoine Quint</dc:creator>
      <dc:date>2002-12-04</dc:date>    
    </item>
  </channel>
</rss>

As this example shows, RSS 2.0 uses namespaces like RSS 1.0, but it's not RDF. As in RSS 0.91, there is no default namespace and the items are back inside the channel. If our code is liberal enough to handle the differences between RSS 0.91 and 1.0, RSS 2.0 should not present any additional wrinkles.
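If you would rather handle the root element explicitly than ignore it, something like the following sketch could tell the two families apart. The function name feedFlavor and its return labels are hypothetical, chosen only for this illustration; it looks at nothing but the root element, so it cannot distinguish RSS 0.90 from 1.0.

```python
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def feedFlavor(document):
    """Return a rough label for the feed family, based only on the root element."""
    root = document.documentElement
    if root.namespaceURI is None and root.localName == "rss":
        # RSS 0.91 through 0.94 and 2.0 carry the version in an attribute.
        return "rss " + (root.getAttribute("version") or "unknown")
    if root.namespaceURI == RDF_NS and root.localName == "RDF":
        # RSS 0.90 and 1.0 both use an rdf:RDF root element.
        return "rdf"
    return "unknown"

sample = minidom.parseString('<rss version="2.0"><channel/></rss>')
print(feedFlavor(sample))  # prints "rss 2.0"
```

Note that the check compares the namespace URI and local name, never the rdf: prefix, for the reasons given in the namespace discussion.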

How can I read RSS?

Now let's get down to actually reading these sample RSS feeds from Python. The first thing we'll need to do is download some RSS feeds. This is simple in Python; most distributions come with both a URL retrieval library and an XML parser. (Note to Mac OS X 10.2 users: your copy of Python does not come with an XML parser; you will need to install PyXML first.)

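from xml.dom import minidom
try:
  from urllib.request import urlopen  # Python 3
except ImportError:
  from urllib import urlopen  # Python 2

def load(rssURL):
  # Download the feed and parse it into a DOM document.
  return minidom.parse(urlopen(rssURL))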

This takes the URL of an RSS feed and returns a parsed representation of the DOM, as native Python objects.

The next bit is the tricky part. To compensate for the differences in RSS formats, we'll need a function that searches for specific elements in any number of namespaces. Python's XML library includes a getElementsByTagNameNS method, which takes a namespace and a tag name, so we'll use that to make our code general enough to handle RSS 0.91 through 0.94 and 2.0 (which have no default namespace), RSS 1.0, and even RSS 0.90. This function will find all elements with a given name, anywhere within a node. That's a good thing; it means that we can search for item elements within the root node and always find them, whether they are inside or outside the channel element.

DEFAULT_NAMESPACES = (
  None, # RSS 0.91, 0.92, 0.93, 0.94, 2.0
  'http://purl.org/rss/1.0/', # RSS 1.0
  'http://my.netscape.com/rdf/simple/0.9/' # RSS 0.90
  )

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
  for namespace in possibleNamespaces:
    children = node.getElementsByTagNameNS(namespace, tagName)
    if len(children): return children
  return []
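As a quick check of that claim, the sketch below (Python 3) runs the same search against two made-up miniature feeds, one with items inside the channel and one with items outside it in the RSS 1.0 namespace. The feed strings are invented for the test; the function is restated so the block runs on its own.

```python
from xml.dom import minidom

DEFAULT_NAMESPACES = (
    None,                                      # RSS 0.91 through 0.94, 2.0
    "http://purl.org/rss/1.0/",                # RSS 1.0
    "http://my.netscape.com/rdf/simple/0.9/",  # RSS 0.90
)

def getElementsByTagName(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
    # Same logic as above: return the matches from the first namespace that has any.
    for namespace in possibleNamespaces:
        children = node.getElementsByTagNameNS(namespace, tagName)
        if len(children):
            return children
    return []

# Made-up feed with items inside the channel and no namespace (0.91 style).
RSS091 = """<rss version="0.91"><channel><title>t</title>
  <item><title>a</title></item><item><title>b</title></item>
</channel></rss>"""

# Made-up feed with the item outside the channel, in the RSS 1.0 default namespace.
RSS10 = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"><channel><title>t</title></channel>
  <item><title>a</title></item>
</rdf:RDF>"""

count091 = len(getElementsByTagName(minidom.parseString(RSS091), "item"))
count10 = len(getElementsByTagName(minidom.parseString(RSS10), "item"))
print(count091, count10)  # prints "2 1"
```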

Finally, we need two utility functions to make our lives easier. First, our getElementsByTagName function will return a list of elements, but most of the time we know there's only going to be one. An item only has one title, one link, one description, and so on. We'll define a first function that returns the first element of a given name (again, searching across several different namespaces). Second, Python's XML libraries are great at parsing an XML document into nodes, but not that helpful at putting the data back together again. We'll define a textOf function that returns the entire text of a particular XML element.

def first(node, tagName, possibleNamespaces=DEFAULT_NAMESPACES):
  # Return the first matching element, or None if there is none.
  children = getElementsByTagName(node, tagName, possibleNamespaces)
  return children[0] if children else None

def textOf(node):
  # Join the text of all child nodes; a missing node yields "".
  if node is None: return ""
  return "".join([child.data for child in node.childNodes])
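To see why joining the child nodes matters, consider a description that mixes ordinary text, an entity, and a CDATA section. The parser is free to hand that back as more than one node, and textOf glues the pieces together again. This is a small Python 3 sketch with a made-up element; textOf is restated with the same behavior so the block runs on its own.

```python
from xml.dom import minidom

def textOf(node):
    # Equivalent to textOf above: join the data of every child node.
    if node is None:
        return ""
    return "".join([child.data for child in node.childNodes])

# Made-up item: text with an entity, followed by a CDATA section.
doc = minidom.parseString(
    "<item><description>Tom &amp; Jerry, <![CDATA[<b>live</b>]]></description></item>"
)
description = doc.getElementsByTagName("description")[0]

text = textOf(description)
missing = textOf(None)  # what we get for an element the feed did not include

print(text)  # prints "Tom & Jerry, <b>live</b>"
```

One caveat: this assumes every child is a text or CDATA node. A description containing real child elements has no data attribute on those children, so a more defensive version would have to skip or recurse into them.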

That's it. The actual parsing is easy. We'll take a URL on the command line, download it, parse it, get the list of items, and then get some useful information from each item:

DUBLIN_CORE = ('http://purl.org/dc/elements/1.1/',)

if __name__ == '__main__':
  import sys
  rssDocument = load(sys.argv[1])
  for item in getElementsByTagName(rssDocument, 'item'):
    # Single-argument print(...) behaves the same on Python 2 and Python 3.
    print('title: ' + textOf(first(item, 'title')))
    print('link: ' + textOf(first(item, 'link')))
    print('description: ' + textOf(first(item, 'description')))
    print('date: ' + textOf(first(item, 'date', DUBLIN_CORE)))
    print('author: ' + textOf(first(item, 'creator', DUBLIN_CORE)))
    print('')

Running it with our sample RSS 0.91 feed prints only title, link, and description (since the feed didn't include any other information on dates or authors):

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss091.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date:
author:

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date:
author:

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date:
author:

For both the sample RSS 1.0 feed and sample RSS 2.0 feed, we also get dates and authors for each item. We reuse our custom getElementsByTagName function, but pass in the Dublin Core namespace and appropriate tag name. We could reuse this same function to extract information from any of the basic RSS modules. (There are a few advanced modules specific to RSS 1.0 that would require a full RDF parser, but they are not widely deployed in public RSS feeds.)

Here's the output against our sample RSS 1.0 feed:

$ python rss1.py http://www.xml.com/2002/12/18/examples/rss10.xml.txt
title: Normalizing XML, Part 2
link: http://www.xml.com/pub/a/2002/12/04/normalizing.html
description: In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.
date: 2002-12-04
author: Will Provost

title: The .NET Schema Object Model
link: http://www.xml.com/pub/a/2002/12/04/som.html
description: Priya Lakshminarayanan describes in detail the use of the .NET Schema Object Model for programmatic manipulation of W3C XML Schemas.
date: 2002-12-04
author: Priya Lakshminarayanan

title: SVG's Past and Promising Future
link: http://www.xml.com/pub/a/2002/12/04/svg.html
description: In this month's SVG column, Antoine Quint looks back at SVG's journey through 2002 and looks forward to 2003.
date: 2002-12-04
author: Antoine Quint

Running against our sample RSS 2.0 feed produces the same results.

This technique will handle about 90% of the RSS feeds out there; the rest are ill-formed in a variety of interesting ways, mostly caused by non-XML-aware publishing tools building feeds out of templates and not respecting basic XML well-formedness rules. Next month we'll tackle the thorny problem of how to handle RSS feeds that are almost, but not quite, well-formed XML.
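In the meantime, a program can at least notice an ill-formed feed instead of crashing on it. The sketch below (Python 3) is only a detection step, not a repair strategy; the helper name tryParse and the broken sample feed are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def tryParse(feedText):
    """Return (document, None) on success or (None, error message) on failure."""
    try:
        return minidom.parseString(feedText), None
    except ExpatError as error:
        # minidom reports well-formedness problems as ExpatError.
        return None, str(error)

# Made-up feed with an unescaped ampersand, a typical template mistake.
BROKEN = "<rss version='0.91'><channel><title>News & Views</title></channel></rss>"

document, problem = tryParse(BROKEN)
if document is None:
    print("skipping ill-formed feed:", problem)
```

An aggregator could use a check like this to log and skip a bad feed while still processing the others.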
