Using Beautiful Soup


星辰学院


Reposted from: http://cuiqingcai.com/

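Introduction to Beautiful Soup

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is so simple, a complete application takes very little code.
Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.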
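The encoding behaviour described above can be checked with a small sketch. The markup, the variable names, and the choice of the standard-library "html.parser" are assumptions made for illustration; they are not part of the original example.

```python
from bs4 import BeautifulSoup

# Raw bytes, as they might arrive from the network. The <meta> tag declares the encoding.
raw = ('<html><head><meta charset="utf-8"></head>'
       '<body><p>café</p></body></html>').encode("utf-8")

# "html.parser" is the parser bundled with Python (an assumption; lxml would also work).
soup = BeautifulSoup(raw, "html.parser")

text = soup.p.string      # decoded to a Unicode string
encoded = soup.encode()   # serialized back to bytes, UTF-8 by default

print(text)
print(type(encoded).__name__)
```

Inside the tree everything is a Unicode string; bytes only reappear when the document is serialized with encode().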

Example

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
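soup = BeautifulSoup(html, "lxml")  # name the parser explicitly (requires the lxml package)
# A local HTML file can be used to create the object as well:
# soup = BeautifulSoup(open('index.html'), "lxml")
print(soup.prettify())  # pretty-print the contents of the soup object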

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


Note: if the parser argument is omitted, as in BeautifulSoup(html), Beautiful Soup prints a warning like the one below and falls back to the best parser installed on the system:

/home/fisher/soft/anaconda2/lib/python2.7/site-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")

  markup_type=markup_type))

The four kinds of objects in Beautiful Soup

Tag, NavigableString, BeautifulSoup, and Comment.

Tag: accessing a tag as an attribute of the soup object

This is a convenient way to view a tag together with all of its contents.

print(soup.title)
print(soup.a)

<title>The Dormouse's story</title> 
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>

Note that this returns only the first matching tag in the whole document. To get every matching tag, a different method is needed.
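A minimal sketch of the difference, using a made-up three-link snippet and the standard-library "html.parser" (both are assumptions for illustration): attribute access returns the first match, while find_all(), covered later in this article, returns all of them.

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> <a id="link2">Lacie</a> <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a                   # first <a> in document order only
all_links = soup.find_all("a")   # every <a>, in document order

print(first["id"])
print([a["id"] for a in all_links])
```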

A Tag has two important attributes: name and attrs.

print(soup.a)
print(soup.a.name)
print(soup.a.attrs)
print(soup.a['href'])
# Alternative: get() returns None instead of raising KeyError when the attribute is missing
print(soup.a.get('href'))

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
a
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
http://example.com/elsie
http://example.com/elsie

Individual attributes of a tag can be read on their own, and they can also be modified or deleted.
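Modifying and deleting attributes works like editing a dictionary. The sketch below uses a single made-up link and the standard-library "html.parser"; the new attribute values are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
link = soup.a

link["href"] = "http://example.com/elsie-new"  # modify an existing attribute
link["title"] = "first sister"                 # add a new attribute
del link["id"]                                 # delete an attribute

print(link.attrs)
print(link.get("id"))  # the attribute is gone, so get() returns None
```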

To get the text inside a tag, use .string. Ordinary text has type bs4.element.NavigableString; comment text has type bs4.element.Comment.

print(soup.p.string)
print(type(soup.p.string))

The Dormouse's story
<class 'bs4.element.NavigableString'>

If the content of a tag is a comment, it needs special handling.

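import bs4

print(soup.a)
print(soup.a.string)        # prints the comment text " Elsie "
print(type(soup.a.string))
# Check the type of the string first to decide whether it should be printed.
if not isinstance(soup.a.string, bs4.element.Comment):
    print(soup.a.string)    # not reached here, because the string is a Comment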

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
<class 'bs4.element.Comment'>
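The same type check can be wrapped in a small helper so that comments never leak into scraped text. This is only a sketch: visible_string is a hypothetical helper name, and the two-link markup and the "html.parser" choice are assumptions.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<p><a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

def visible_string(tag):
    """Return tag.string as plain text, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)

names = [visible_string(a) for a in soup.find_all("a")]
print(names)
```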

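Also note how .string is resolved. If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string returns the text of that innermost tag.
If a tag contains more than one child node, it is ambiguous which child .string should refer to, so .string is None.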

print(soup.head.string)
print(soup.html.string)

The Dormouse's story
None
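The three cases can be seen side by side in a short sketch. The markup is made up for illustration and parsed with the standard-library "html.parser".

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p><b>one</b></p><p>two<i>three</i></p></div>", "html.parser")
first_p, second_p = soup.find_all("p")

print(first_p.b.string)            # no child tags: the text itself
print(first_p.string)              # exactly one child tag: passes through to <b>
print(second_p.string)             # two child nodes: None
print("".join(second_p.strings))   # .strings still reaches every piece of text
```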

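For a tag that contains several children, loop over .strings or .stripped_strings. The difference is that the latter skips whitespace-only strings and strips leading and trailing whitespace from the rest. Passing each string through repr() shows its raw form. For example: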

for s in soup.html.strings:
    print(repr(s))
print('Stripped:')
for s in soup.html.stripped_strings:
    print(repr(s))  # clearly the more useful output

"The Dormouse's story"
'\n'
'\n'
"The Dormouse's story"
'\n'
'Once upon a time there were three little sisters; and their names were\n'
',\n'
'Lacie'
' and\n'
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
'...'
'\n'
Stripped:
"The Dormouse's story"
"The Dormouse's story"
'Once upon a time there were three little sisters; and their names were'
','
'Lacie'
'and'
'Tillie'
';\nand they lived at the bottom of a well.'
'...'

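Traversing the document tree with .contents and .children

.contents returns a tag's direct children as a list; .children returns an iterator over the same nodes.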

print(soup.head.contents[0])
for child in soup.body.children:
    print(child)  # each child is either a Tag or a NavigableString (the newlines between tags)

<title>The Dormouse's story</title>


<p class="title" name="dromouse"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
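A short sketch of how the two attributes differ in practice, using a made-up list and the standard-library "html.parser" (both assumptions). Whitespace between tags shows up as a string child, so filtering on Tag is often useful.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<ul><li>a</li> <li>b</li></ul>", "html.parser")
ul = soup.ul

print(type(ul.contents).__name__)   # a real list, so indexing and len() work
print(len(ul.contents))             # two <li> tags plus the space between them

# .children is consumed by iterating; keep only the Tag children here
child_tags = [c for c in ul.children if isinstance(c, Tag)]
print([str(t.string) for t in child_tags])
```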

.contents and .children cover only a tag's direct children; .descendants iterates recursively over all of a tag's descendants.

for child in soup.body.descendants:
    print(child)

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
 Elsie 
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...
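To make the contrast concrete, the sketch below counts direct children and descendants of the same tag. The nested markup is made up for illustration and parsed with the standard-library "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<div><p><b>bold</b></p></div>", "html.parser")
div = soup.div

children = list(div.children)         # only the <p>
descendants = list(div.descendants)   # <p>, then <b>, then the text "bold"

# Show tags by name and strings by their text
labels = [d.name if isinstance(d, Tag) else str(d) for d in descendants]
print(len(children), len(descendants))
print(labels)
```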

Parent and ancestor nodes

print(soup.body.parent.name)
print('All ancestors:')
for parent in soup.head.title.string.parents:
    print(parent.name)

html
All ancestors:
title
head
html
[document]

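Sibling nodes

Use the .next_sibling and .previous_sibling attributes. The plural forms, .next_siblings and .previous_siblings, iterate over all following or all preceding siblings.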
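A minimal sketch of the sibling attributes. The markup is made up and deliberately contains no whitespace between tags; in real pages the sibling of a tag is often a whitespace string rather than the next tag. The "html.parser" choice is also an assumption.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><a>one</a><b>two</b><i>three</i></p>", "html.parser")
middle = soup.b

print(middle.previous_sibling)   # the <a> tag just before <b>
print(middle.next_sibling)       # the <i> tag just after <b>

following = [t.name for t in soup.a.next_siblings]   # everything after <a>
print(following)
print(soup.i.next_sibling)       # the last child has no next sibling: None
```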

Next and previous elements

The .next_element and .previous_element attributes follow parse order regardless of nesting level. For example:

print(soup.head)
print(soup.head.next_element)

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

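Searching the document tree

This section focuses on the find_all(name, attrs, recursive, text, **kwargs) method.

The find_all() method

It searches a tag's descendants for nodes that match the given filters and returns them as a list.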

a. The name argument finds all tags with the given name; string nodes are ignored automatically. The simplest filter is a string. For example:

soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

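B. Passing a regular expression: if you pass a compiled regular expression, Beautiful Soup filters tag names with its search() method. The example below finds every tag whose name starts with b, so both <body> and <b> are found: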

import re
for tag in soup.find_all(re.compile(r"^b")):
    print(tag.name)

body
b
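Because the pattern is applied with search semantics, an unanchored pattern matches anywhere in the tag name. The sketch below contrasts an anchored and an unanchored pattern on a made-up document parsed with the standard-library "html.parser" (both assumptions).

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>T</title></head><body><b>x</b></body></html>",
    "html.parser",
)

# Anchored: only names that start with "b"
anchored = [t.name for t in soup.find_all(re.compile(r"^b"))]
# Unanchored: any name that contains the letter "t"
unanchored = [t.name for t in soup.find_all(re.compile(r"t"))]

print(anchored)
print(unanchored)
```

If a prefix match is what you want, keep the ^ anchor in the pattern.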

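C. Passing a list
If you pass a list, Beautiful Soup returns everything that matches any item in the list. The code below finds all <a> tags and all <b> tags in the document: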

soup.find_all(['a','b'])

[<b>The Dormouse's story</b>,
 <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

D. Passing True
True matches any value. The code below finds every tag in the document but returns no string nodes:

for tag in soup.find_all(True):
    print(tag.name)

html
head
title
body
p
b
p
a
a
a
p

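E. Passing a function
If none of the other filters fits, you can define a function that takes an element as its only argument. The function should return True if the element matches and False otherwise.

The function below checks the current element and returns True if it has a class attribute but no id attribute: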

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

[<p class="title" name="dromouse"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were\n<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,\n<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and\n<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;\nand they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

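b. Keyword arguments: any argument whose name is not one of find_all()'s built-in parameter names is treated as a filter on the tag attribute of that name. For example, if you pass an argument named id, Beautiful Soup filters on each tag's "id" attribute.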

soup.find_all(id='link2')

[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
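Keyword filters accept the same kinds of values as name. The sketch below filters one attribute with a regular expression and another with True; the markup is made up and parsed with the standard-library "html.parser" (both assumptions).

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://example.com/elsie" id="link1">Elsie</a>'
        '<a href="http://example.com/lacie" id="link2">Lacie</a>'
        '<b>no attributes</b>')
soup = BeautifulSoup(html, "html.parser")

by_regex = soup.find_all(href=re.compile("elsie"))   # href contains "elsie"
with_id = soup.find_all(id=True)                      # any tag that has an id at all

print([t["id"] for t in by_regex])
print([t["id"] for t in with_id])
```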

Several filters can be combined:

soup.find_all("a", class_='sister')
# class is a reserved word in Python, so the keyword is spelled class_ instead

[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

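To search for tags with special attributes, such as HTML5 data-* attributes whose names are not valid keyword arguments, pass a dictionary as the attrs argument of find_all():

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(attrs={"data-foo": "value"})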

[<div data-foo="value">foo!</div>]

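c. The text argument accepts the same kinds of values as name, but it matches against string content rather than tag names. The limit argument caps the number of results returned.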
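A short sketch of both arguments. It assumes Beautiful Soup 4.4.0 or later, where the text argument is also available under the name string (the spelling used here); the markup and the "html.parser" choice are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

html = "<p><a>Elsie</a><a>Lacie</a><a>Tillie</a></p>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Lacie")               # exact string match
pattern = soup.find_all(string=re.compile(r"^T"))   # strings starting with "T"
first_two = soup.find_all("a", limit=2)             # stop after two results

print(exact)
print(pattern)
print([str(a.string) for a in first_two])
```

With string (or text) alone, the results are the matching strings themselves rather than the tags that contain them.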

d. The recursive argument

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
For example:

<html><head><title>The Dormouse's story</title></head>

...

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

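CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements through soup.select(), which returns a list.

(1) Finding by tag name

print(soup.select('title'))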
#[<title>The Dormouse's story</title>]

print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
#[<b>The Dormouse's story</b>]

(2) Finding by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Finding by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

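(4) Combined selectors
Combined selectors follow the same rules as in a CSS file, where tag names, class names, and ids are combined. For example, to find the element with id link1 inside a p tag, separate the two parts with a space:

print(soup.select('p #link1'))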
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Finding direct children:

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]

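(5) Finding by attribute
A selector can also include an attribute test, written in square brackets. The attribute belongs to the same node as the tag, so there must be no space between them; otherwise nothing will match.

print(soup.select('a[class="sister"]'))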
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

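Attribute tests can likewise be combined with the selectors above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are not.

print(soup.select('p a[href="http://example.com/elsie"]'))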
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
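The selector forms above can be mixed freely in one expression. The sketch below is a self-contained illustration on made-up markup, parsed with the standard-library "html.parser"; it also uses select_one(), which returns the first match instead of a list. Recent Beautiful Soup releases rely on the soupsieve package for select(), which is normally installed alongside bs4.

```python
from bs4 import BeautifulSoup

html = ('<div id="main"><p class="story">'
        '<a class="sister" id="link1" href="/elsie">Elsie</a>'
        '<a class="sister" id="link2" href="/lacie">Lacie</a>'
        '</p></div>')
soup = BeautifulSoup(html, "html.parser")

# Tag plus class on the same node, joined by the direct-child combinator
sisters = soup.select("p.story > a.sister")
print([a["id"] for a in sisters])

# Attribute test, first match only
lacie = soup.select_one('a[href="/lacie"]')
print(lacie["id"])

# Descendant combinator with an id on the target node
elsie_text = soup.select("div a#link1")[0].get_text()
print(elsie_text)
```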
