feedparser in Python

醉小义

Article link: https://blog.csdn.net/qq_30638831/article/details/80008786

This article introduces what the Python library feedparser does and how to use it, showing how to parse RSS and Atom feeds and retrieve titles, links, and entries.

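Reference: http://blog.csdn.net/lanchunhui/article/details/51020566
feedparser is a Python feed-parsing library that handles RSS, CDF, and Atom. With it we can get the title, link, and entries from any RSS or Atom feed.
RSS (Really Simple Syndication) is a format for describing and syndicating website content; you can think of it as a service for personalized content delivery. It addresses the problem of browsing the web aimlessly. It does not go out of date: the greater the information overload, the more its value shows. The web is flooded with junk information, and every day we take in far too much that we do not care about. Letting the information you care about come to you, with all of it being what you actually need, is the point of RSS.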

The parse() function

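The core function of feedparser is parse(), which parses the feed at a given URL.
Every RSS and Atom feed contains a title (d.feed.title) and a list of entries (d.entries).
Each entry usually has a summary (d.entries[i].summary), or a description tag containing the entry's actual text (d.entries[i].description).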

>>> import feedparser
>>> d = feedparser.parse('http://feed.cnblogs.com/blog/sitehome/rss')

d.feed

The value stored under feed is itself a dictionary.

>>> d['feed']['title']
'博客园_首页'
>>> d.feed.title    # attribute-style access
'博客园_首页'
>>> d.feed.subtitle
'代码改变世界'
>>> d.feed.link
'uuid:1b90fd0c-6080-4ea5-86b1-b87c64b95d69;id=4466'

d.entries

This attribute is a list holding the feed's entries.

>>> type(d.entries)    # the type is list
<class 'list'>
>>> len(d.entries)   # 20 entries in total
20
>>> [e.title for e in d.entries][:5]         # titles of the first 5 entries
['僵尸进程 - 乌龟运维', '深入浅出 spring-data-elasticsearch - 基本案例详解（三 - 泥瓦匠BYSocket', 'js继承 - huanglei-', 'ionic 使用了 crosswalkwebview 所产生的bug 及 解决方案 - FEer_llx', '关于并发你真的了解吗？（二） - 心灬无痕']
>>> d.entries[0].summary   # summary of the first entry; d.entries[0].description returns the same text
'在UNIX系统中，僵尸进程是指完成执行（通过exit系统调用，或运行时发生致命错误或收到终止信号所致）但在操作系统的进程表中仍 然有一个表项（进程控制块PCB），处于”终止状态“的进程。这发生于子进程需要保留表项以允许其父进程读取子进程的exit status：一旦退出态通过wait系统调用读取，僵尸'
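
The access patterns above can be gathered into a small helper. This is a minimal sketch, not part of feedparser: the function name feed_overview, the limit parameter, and the hand-built sample dictionary are illustrative assumptions. It relies only on dict-style access with defaults, which the parse() result supports alongside attribute access.

```python
def feed_overview(parsed, limit=5):
    """Collect the feed title and the first `limit` entries from a parse() result.

    `parsed` is assumed to be dict-like, as the result of feedparser.parse() is.
    """
    feed = parsed.get("feed", {})
    entries = parsed.get("entries", [])
    overview = {"title": feed.get("title", ""), "entries": []}
    for entry in entries[:limit]:
        # Not every feed supplies every field, so fall back to empty strings.
        overview["entries"].append(
            {
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "summary": entry.get("summary", ""),
            }
        )
    return overview


# Hand-built stand-in for a parse() result (hypothetical data),
# so the sketch runs without network access.
sample = {
    "feed": {"title": "Sample Feed"},
    "entries": [
        {"title": "First entry title", "link": "http://example.org/entry/3"},
    ],
}
print(feed_overview(sample))
```

With feedparser installed, the same function should accept a real result, for example feed_overview(feedparser.parse('http://feed.cnblogs.com/blog/sitehome/rss')).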

It bills itself as a "Universal feed parser" that handles RSS 0.9x, RSS 1.0, RSS 2.0, CDF, Atom 0.3, and Atom 1.0 feeds. Project page:

https://pypi.python.org/pypi/feedparser/

Basic usage

>>> import feedparser
>>> d = feedparser.parse("http://feedparser.org/docs/examples/atom10.xml")
>>> d['feed']['title']             # feed data is a dictionary
u'Sample Feed'
>>> d.feed.title                   # get values attr-style or dict-style
u'Sample Feed'
>>> d.channel.title                # use RSS or Atom terminology anywhere
u'Sample Feed'
>>> d.feed.link                    # resolves relative links
u'http://example.org/'
>>> d.feed.subtitle                 # parses escaped HTML
u'For documentation <em>only</em>'
>>> d.channel.description          # RSS terminology works here too
u'For documentation <em>only</em>'
>>> len(d['entries'])              # entries are a list
1
>>> d['entries'][0]['title']       # each entry is a dictionary
u'First entry title'
>>> d.entries[0].title             # attr-style works here too
u'First entry title'
>>> d['items'][0].title            # RSS terminology works here too
u'First entry title'
>>> e = d.entries[0]
>>> e.link                         # easy access to alternate link
u'http://example.org/entry/3'
>>> e.links[1].rel                 # full access to all Atom links
u'related'
>>> e.links[0].href                # resolves relative links here too
u'http://example.org/entry/3'
>>> e.author_detail.name           # author data is a dictionary
u'Mark Pilgrim'
>>> e.updated_parsed              # parses all date formats
(2005, 11, 9, 11, 56, 34, 2, 313, 0)
>>> e.content[0].value             # sanitizes dangerous HTML
u'<div>Watch out for <em>nasty tricks</em></div>'
>>> d.version                      # reports feed type and version
u'atom10'
>>> d.encoding                     # auto-detects character encoding
u'utf-8'
>>> d.headers.get('Content-type')  # full access to all HTTP headers
u'application/xml'
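
The time tuple returned by e.updated_parsed is often easier to work with as a datetime. The following is a hedged sketch using only the standard library: the helper name parsed_date_to_datetime is hypothetical, and it assumes the tuple is expressed in UTC, which is how the feedparser documentation describes its parsed dates.

```python
import calendar
from datetime import datetime, timezone


def parsed_date_to_datetime(parsed_date):
    """Convert a 9-item time tuple (assumed to be in UTC) to an aware datetime."""
    timestamp = calendar.timegm(parsed_date)
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# The tuple shown for e.updated_parsed in the session above.
updated = (2005, 11, 9, 11, 56, 34, 2, 313, 0)
print(parsed_date_to_datetime(updated).isoformat())  # 2005-11-09T11:56:34+00:00
```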

A typical item:

<item>
<title><![CDATA[厦门公交车放火案死者名单公布<br/>警方公布嫌犯犯罪证据]]></title>
<link>http://www.infzm.com/content/91404</link>
<description><![CDATA[6月11日下午，厦门BRT公交车放火案47名死亡者名单公布。厦门政府新闻办6月10日发布消息称，有证据表明，陈水总携带汽油上了闽DY7396公交车。且有多名幸存者指认其在车上纵火，致使整部车引起猛烈燃烧。经笔迹鉴定，陈水总6月7日致妻、女的两封绝笔书系陈水总本人所写。]]></description>
<category>南方周末-热点新闻</category>
<author>infzm</author>
<pubDate>2013-06-11 11:24:32</pubDate>
</item>
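
To see which fields such an item carries, it can be read with the standard library's xml.etree.ElementTree. This is only an illustration of the item's structure, not how feedparser works internally; the shortened English text in the sample and the helper name item_to_dict are assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened stand-in for the item above (placeholder text, same layout).
ITEM_XML = """<item>
<title><![CDATA[Headline<br/>Subheadline]]></title>
<link>http://www.infzm.com/content/91404</link>
<description><![CDATA[Body text of the item.]]></description>
<category>Hot news</category>
<author>infzm</author>
<pubDate>2013-06-11 11:24:32</pubDate>
</item>"""


def item_to_dict(xml_text):
    """Map each child element of an <item> to its text content."""
    item = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in item}


fields = item_to_dict(ITEM_XML)
# This pubDate uses a plain "YYYY-MM-DD HH:MM:SS" layout, so strptime reads it directly.
published = datetime.strptime(fields["pubDate"], "%Y-%m-%d %H:%M:%S")
print(fields["title"], published.year)
```

The CDATA sections come back as plain text, so markup such as <br/> inside the title is preserved as characters and not parsed as an element.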

What feedparser.parse() returns:

>>> d = feedparser.parse(' ')
>>> print(d)
{'feed': {}, 'encoding': u'utf-8', 'bozo': 1, 'version': u'', 'namespaces': {}, 'entries': [], 'bozo_exception': SAXParseException('no element found',)}

As the output shows, the result is a dictionary; feed is also a dictionary, and entries is a list.
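
The bozo and bozo_exception keys in that output suggest a simple check before using a result. A minimal sketch under stated assumptions: the helper name describe_parse_result is hypothetical, the sample dictionary is hand-built to mirror the output above, and ValueError stands in for the SAXParseException shown there.

```python
def describe_parse_result(parsed):
    """Return a short status string for a feedparser-style result dictionary."""
    if parsed.get("bozo"):
        # A truthy bozo flag signals a problem with the feed;
        # bozo_exception, when present, explains why.
        reason = parsed.get("bozo_exception", "unknown problem")
        return f"malformed feed: {reason}"
    return f"ok: {len(parsed.get('entries', []))} entries"


# Hand-built stand-in mirroring the output above (hypothetical exception type).
empty_result = {
    "feed": {},
    "bozo": 1,
    "entries": [],
    "bozo_exception": ValueError("no element found"),
}
print(describe_parse_result(empty_result))  # malformed feed: no element found
```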