Beautiful Soup: simple examples

screaming

A very slick scraping package for Python: Beautiful Soup examples

2013-11-05 10:24

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.

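First, the official documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

It is worth reading the package documentation when you have time.

Beautiful Soup has one very important advantage over other HTML parsers: the HTML is broken down into objects, so the whole document can be navigated with dictionary-style and list-style access.

Compared with scrapers built on regular expressions, it spares you the high cost of learning regular expressions.

Compared with XPath-based parsing, it likewise saves learning time, even though XPath is already fairly simple. (The Scrapy crawling framework uses XPath.)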

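Installation

On Debian or Ubuntu Linux you can run (on current releases the Python 3 package is named python3-bs4):

 apt-get install python-bs4

You can also install it with Python's own packaging tools (pip is the usual choice today; easy_install is deprecated):

 easy_install beautifulsoup4

 pip install beautifulsoup4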

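Usage overview

Here is how to use BeautifulSoup.

Parsing HTML is about extracting data, which mainly comes down to three things:

1: Getting the content of a given tag.

 <p>hello, watsy</p><br><p>hello, beautiful soup.</p>

2: Getting the attributes of a given tag.

 <a href="http://blog.csdn.net/watsy">watsy's blog</a>

3: Locating the tags in the first place, which is what the search methods are for.

The examples use the sample document from the official documentation:
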
 html_doc = """  
 <html><head><title>The Dormouse's story</title></head>  
 <body>  
 <p class="title"><b>The Dormouse's story</b></p>  
   
 <p class="story">Once upon a time there were three little sisters; and their names were  
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,  
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and  
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;  
 and they lived at the bottom of a well.</p>  
   
 <p class="story">...</p>  
 """  

Pretty-printed output:

 from bs4 import BeautifulSoup
 # Name the parser explicitly; html.parser ships with Python.
 soup = BeautifulSoup(html_doc, "html.parser")
   
 print(soup.prettify())  
 # <html>  
 #  <head>  
 #   <title>  
 #    The Dormouse's story  
 #   </title>  
 #  </head>  
 #  <body>  
 #   <p class="title">  
 #    <b>  
 #     The Dormouse's story  
 #    </b>  
 #   </p>  
 #   <p class="story">  
 #    Once upon a time there were three little sisters; and their names were  
 #    <a class="sister" href="http://example.com/elsie" id="link1">  
 #     Elsie  
 #    </a>  
 #    ,  
 #    <a class="sister" href="http://example.com/lacie" id="link2">  
 #     Lacie  
 #    </a>  
 #    and  
 #    <a class="sister" href="http://example.com/tillie" id="link3">
 #     Tillie  
 #    </a>  
 #    ; and they lived at the bottom of a well.  
 #   </p>  
 #   <p class="story">  
 #    ...  
 #   </p>  
 #  </body>  
 # </html>  

Getting the content of a given tag

 soup.title  
 # <title>The Dormouse's story</title>  
   
 soup.title.name  
 # u'title'  
   
 soup.title.string  
 # u'The Dormouse's story'  
   
 soup.title.parent.name  
 # u'head'  
   
 soup.p  
 # <p class="title"><b>The Dormouse's story</b></p>  
   
 soup.a  
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>  

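The example above shows four things:

1: Getting a tag

soup.title

2: Getting the tag's name

soup.title.name

3: Getting the content of the title tag

soup.title.string

4: Getting the name of the title tag's parent

soup.title.parent.name

As you can see, the usage is very object-oriented.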
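The four accessors above can be tried on a trimmed copy of the sample document. This is a minimal sketch, assuming Python 3 and the built-in html.parser; the variable names are illustrative only. It also shows that asking for a tag that does not exist returns None rather than raising.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the sample document, just enough for the four accessors.
html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

title = soup.title                 # 1: the first <title> tag
tag_name = title.name              # 2: the tag's name
tag_text = title.string            # 3: the tag's single text child
parent_name = title.parent.name    # 4: the name of the enclosing tag

# Attribute-style access is shorthand for find(); a missing tag gives None.
missing = soup.table

print(tag_name, str(tag_text), parent_name, missing)
```

Because a missing tag yields None, chained access such as soup.table.name fails with AttributeError, so it is worth checking for None when the page structure is not guaranteed.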

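Extracting tag attributes

Next, how to extract attributes such as href.

 soup.p['class']
 # [u'title']  (class is a multi-valued attribute, so bs4 returns a list)

Attributes are read with dictionary-style access:

soup.tag['attribute_name']

 <a href="http://blog.csdn.net/watsy">watsy's blog</a>

The most common case is probably extracting a link like the one above.

The code is:

 soup.a['href']

Pretty easy, right?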
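Square-bracket access raises KeyError when the attribute is absent, which matters when looping over real pages. The sketch below, assuming Python 3 and a small made-up snippet, contrasts it with Tag.get() and collects every href that is actually present.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second link deliberately has no href attribute.
html = ('<p class="story">'
        '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> '
        '<a id="link2">Lacie</a></p>')
soup = BeautifulSoup(html, "html.parser")

first = soup.a
href = first["href"]       # dictionary-style lookup
classes = first["class"]   # multi-valued attribute, returned as a list
attrs = first.attrs        # all attributes as a plain dict

second = soup.find_all("a")[1]
missing_href = second.get("href")   # get() returns None instead of raising

# href=True keeps only the <a> tags that have an href attribute at all.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```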

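Searching and filtering

Now for the important part: searching the whole document and extracting what matches.

The soup object provides find and find_all for searching. Internally, find is implemented by calling find_all with a limit of 1, so only find_all is covered here.

 def find_all(self, name=None, attrs={}, recursive=True, text=None,
                  limit=None, **kwargs):

A look at the parameters:

name is the tag name and attrs is a dictionary of attribute filters. recursive chooses whether to search all descendants or only direct children, text matches against string content (Beautiful Soup 4.4.0 and later also accept it under the name string), and limit caps the number of results. **kwargs takes further attribute filters as keyword arguments, such as id="link2".

Examples: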

 # By tag name
 soup.find_all('b')
 # [<b>The Dormouse's story</b>]

 # By regular expression
 import re
 for tag in soup.find_all(re.compile("^b")):
     print(tag.name)
 # body
 # b

 for tag in soup.find_all(re.compile("t")):
     print(tag.name)
 # html
 # title

 # By list of names
 soup.find_all(["a", "b"])
 # [<b>The Dormouse's story</b>,
 #  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 # By function
 def has_class_but_no_id(tag):
     return tag.has_attr('class') and not tag.has_attr('id')

 soup.find_all(has_class_but_no_id)
 # [<p class="title"><b>The Dormouse's story</b></p>,
 #  <p class="story">Once upon a time there were...</p>,
 #  <p class="story">...</p>]

 # By tag name and CSS class
 soup.find_all("p", "title")
 # [<p class="title"><b>The Dormouse's story</b></p>]

 # Filter by tag
 soup.find_all("a")
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 # Filter by attribute
 soup.find_all(id="link2")
 # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

 # Filter text by regular expression
 soup.find(text=re.compile("sisters"))
 # u'Once upon a time there were three little sisters; and their names were\n'
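The examples above do not exercise limit, recursive, attrs or the class_ keyword, so here is a short sketch of those, assuming Python 3 and a shortened version of the sample document (the relative hrefs are made up). It also checks that find returns the same object as the first find_all result.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>T</title></head><body>
<p class="story"><a class="sister" id="link1" href="/elsie">Elsie</a>
<a class="sister" id="link2" href="/lacie">Lacie</a>
<a class="sister" id="link3" href="/tillie">Tillie</a></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# limit stops the search after the given number of matches.
first_two = soup.find_all("a", limit=2)

# attrs takes a dict; class_ avoids clashing with the Python keyword.
by_attrs = soup.find_all("a", attrs={"id": "link3"})
by_class = soup.find_all("a", class_="sister")

# recursive=False looks only at direct children; <title> sits under <head>,
# not directly under <html>, so nothing matches.
direct_titles = soup.html.find_all("title", recursive=False)

# find() gives the first match, or None; find_all() gives a (possibly empty) list.
first_link = soup.find("a")
same = soup.find_all("a", limit=1)[0]
no_tag = soup.find("table")
no_tags = soup.find_all("table")

print([a["id"] for a in first_two], len(by_class), direct_titles)
```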

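Getting contents and strings

Getting a tag's string

 title_tag = soup.title
 title_tag.string
 # u"The Dormouse's story"

Note that .string returns a NavigableString, which keeps a reference to the whole parse tree. In practice, convert it with unicode(title_tag.string) (str(title_tag.string) on Python 3) to get a plain string object.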
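A small sketch of that conversion, assuming Python 3 (where str() plays the role of unicode()); the variable names are illustrative:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<title>The Dormouse's story</title>", "html.parser")

raw = soup.title.string      # NavigableString, still attached to the tree
plain = str(raw)             # plain str, safe to store or pass around

is_navigable = isinstance(raw, NavigableString)
is_plain = type(plain) is str
parent_name = raw.parent.name   # the NavigableString still knows its parent tag

print(is_navigable, is_plain, parent_name)
```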

The .strings attribute returns a generator that iterates over every text string beneath a tag (here, the whole soup):

 for string in soup.strings:  
     print(repr(string))  
 # u"The Dormouse's story"  
 # u'\n\n'  
 # u"The Dormouse's story"  
 # u'\n\n'  
 # u'Once upon a time there were three little sisters; and their names were\n'  
 # u'Elsie'  
 # u',\n'  
 # u'Lacie'  
 # u' and\n'  
 # u'Tillie'  
 # u';\nand they lived at the bottom of a well.'  
 # u'\n\n'  
 # u'...'  
 # u'\n'  
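The output above is full of whitespace-only strings. As a possible follow-up, the sketch below uses .stripped_strings and get_text(), both part of Beautiful Soup 4, on a shortened made-up paragraph; it assumes Python 3.

```python
from bs4 import BeautifulSoup

html = "<p>Once upon a time\n<a>Elsie</a>,\n<a>Lacie</a> and\n<a>Tillie</a>.</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node exactly as it appears, newlines included.
all_strings = list(soup.strings)

# .stripped_strings trims each string and skips the ones that end up empty.
clean = list(soup.stripped_strings)

# get_text() joins the strings; here with a space separator and stripping.
text = soup.get_text(" ", strip=True)

print(clean)
print(text)
```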

Getting contents

.contents returns a tag's child nodes as a list.

 head_tag = soup.head  
 head_tag  
 # <head><title>The Dormouse's story</title></head>  
   
 head_tag.contents  
 # [<title>The Dormouse's story</title>]
   
 title_tag = head_tag.contents[0]  
 title_tag  
 # <title>The Dormouse's story</title>  
 title_tag.contents  
 # [u'The Dormouse's story']  
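The list returned by .contents mixes tags and text nodes, which is easy to overlook. A minimal sketch on a made-up paragraph, assuming Python 3, that also shows why .string is None when a tag has more than one child:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>Hello, <b>Beautiful</b> Soup</p>", "html.parser")
p = soup.p

# Three children: a text node, the <b> tag, and another text node.
nodes = p.contents

# .children iterates over the same nodes; keep only the real tags.
child_tags = [child.name for child in p.children if isinstance(child, Tag)]

whole = p.string      # None: <p> has more than one child
only = p.b.string     # <b> has exactly one text child

print(len(nodes), child_tags, whole, only)
```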

That should cover the essentials. Everything else can be learned from the documentation.

Summary

In practice, usage mostly comes down to:

 soup = BeautifulSoup(data, "html.parser")
 soup.title
 soup.p['title']
 divs = soup.find_all('div', content='tpc_content')
 divs[0].contents[0].string
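To tie the summary together, here is an end-to-end sketch under stated assumptions: the page markup is hypothetical, and the div elements are matched by a class named tpc_content (using class_), which may differ from the attribute the original snippet targets. It assumes Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical page; the class name tpc_content is assumed for illustration.
data = """<html><head><title>Demo page</title></head><body>
<div class="tpc_content"><p>First post</p></div>
<div class="sidebar">ignored</div>
<div class="tpc_content"><p>Second post</p></div>
</body></html>"""

soup = BeautifulSoup(data, "html.parser")

page_title = soup.title.string

# Select only the divs carrying the class of interest, then pull out their text.
divs = soup.find_all("div", class_="tpc_content")
posts = [div.get_text(strip=True) for div in divs]

print(page_title, posts)
```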