beautifulsoup函数

最新推荐文章于 2024-03-19 20:34:59 发布

weixin_41611045

最新推荐文章于 2024-03-19 20:34:59 发布

阅读量1.1k

点赞数 1

分类专栏：自然语言处理

原文链接：https://zhuanlan.zhihu.com/p/28759710

版权

自然语言处理专栏收录该内容

9 篇文章

订阅专栏

1、beautifulsoup
beautifulsoup是一个对网页进行解析转换的包，可以将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象
例如：

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html,'lxml')  #创建 beautifulsoup 对象
print(soup)

结果：

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>

即将html文档变为很规整的形式：
我们可以很容易获取到网页的各个指标：

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

首先我们声明了一个变量html，它是一个HTML字符串，但是注意到，它并不是一个完整的HTML字符串，和标签都没有闭合，但是我们将它当作第一个参数传给BeautifulSoup对象，第二个参数传入的是解析器的类型，在这里我们使用lxml，这样就完成了BeaufulSoup对象的初始化，将它赋值给soup这个变量。
那么接下来我们就可以通过调用soup的各个方法和属性对这串HTML代码解析了。

我们首先调用了prettify()方法，这个方法可以把要解析的字符串以标准的缩进格式输出，在这里注意到输出结果里面包含了和标签，也就是说对于不标准的HTML字符串BeautifulSoup可以自动更正格式，这一步实际上不是由prettify()方法做的，这个更正实际上在初始化BeautifulSoup时就完成了。
2、标签选择器
刚才我们选择元素的时候直接通过调用标签的名称就可以选择节点元素了，然后再调用string属性就可以得到标签内的文本了，这种选择方式速度非常快，如果单个标签结构话层次非常清晰，可以选用这种方式来解析。
例：

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

运行结果：

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

在这里我们依然选用了刚才的HTML代码，我们首先打印输出了title标签的选择结果，输出结果正是title标签加里面的文字内容。接下来输出了它的类型，是bs4.element.Tag类型，这是BeautifulSoup中的一个重要的数据结构，经过选择器选择之后，选择结果都是这种Tag类型，它具有一些属性比如string属性，调用Tag的string属性，就可以得到节点的文本内容了，所以接下来的输出结果正是节点的文本内容。

接下来我们又尝试选择了head标签，结果也是标签加其内部的所有内容，再接下来选择了p标签，不过这次情况比较特殊，我们发现结果是第一个p标签的内容，后面的几个p标签并没有选择到，也就是说，当有多个标签时，这种选择方式只会选择到第一个匹配的标签，其他的后面的标签都会忽略。
（3）获取各个属性内容
可以利用string属性获取节点元素包含的文本内容，比如上面的文本我们获取第一个p标签的文本：

print(soup.p.string)

运行结果：

The Dormouse's story

3.2、嵌套选择
在上面的例子中我们知道每一个返回结果都是bs4.element.Tag类型，它同样可以继续调用标签进行下一步的选择，比如我们获取了head节点元素，我们可以继续调用head来选取其内部的head节点元素。
例

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

运行结果：

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

第一行结果是我们调用了head之后再次调用了title来选择的title节点元素，然后我们紧接着打印输出了它的类型，可以看到它仍然是bs4.element.Tag类型，也就是说我们在Tag类型的基础上再次选择得到的依然还是Tag类型，每次返回的结果都相同，所以这样我们就可以这样做嵌套的选择了。
最后输出了一下它的string属性，也就是标签里的文本内容
(4)子节点和子孙节点

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

则soup的结果为：

<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
            and
            <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
<p class="story">...</p>
</body></html>

打印子节点：

for i, child in enumerate(soup.p.children):
    print(i, child)

结果：

0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.

方法2：

print(soup.p.contents)

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

还是同样的HTML文本，在这里我们调用了children属性来进行选择，返回结果可以看到是生成器类型，所以接下来我们用for循环输出了一下相应的内容，内容其实是一样的，只不过children返回的是生成器类型，而contents返回的是列表类型
4.2、如果我们要得到所有的子孙节点的话可以调用descendants属性。

for i, child in enumerate(soup.p.descendants):
    print(i, child)

结果：

<generator object descendants at 0x10650e678>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

可以看出descendants这个函数对于每一个子节点都会进行一次遍历。
5、父节点和祖先节点

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

运行结果：

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>

在这里我们选择的是第一个a标签的父节点元素，很明显它的父节点是p标签，输出结果便是p标签及其内部的内容。
注意到这里输出的仅仅是a标签的直接父节点，而没有再向外寻找父节点的祖先节点，如果我们要想获取所有的祖先节点，可以调用parents属性。

html = """
<html>
    <body>
        <p class="story">
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

运行结果：

<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]

返回结果是一个生成器类型，我们在这里用列表输出了它的索引和内容。
6、兄弟节点的获取

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

运行结果

Next Sibling 
            Hello            
Prev Sibling 
            Once upon a time there were three little sisters; and their names were
            
Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

可以看到在这里我们调用了四个不同的属性，next_sibling和previous_sibling分别可以获取节点的下一个和上一个兄弟元素，next_siblings和previous_siblings则分别返回所有前面和后面的兄弟节点的生成器。
二、方法选择器
前面我们所讲的选择方法都是用.这种运算符来选择元素的，这种选择方法非常快，但是如果要进行比较复杂的选择的话则会比较繁琐，不够灵活。所以BeautifulSoup还为我们提供了一些查询的方法，比如find_all()、find()等方法，我们可以调用方法然后传入相应等参数就可以灵活地进行查询了。

最常用的查询方法莫过于find_all()和find()了，下面我们对它们的用法进行详细的介绍。
1、findall

find_all(name , attrs , recursive , text , **kwargs)
find_all，顾名思义，就是查询所有符合条件的元素，可以给它传入一些属性或文本来得到符合条件的元素，功能十分强大。

我们通过一个实际的例子来感受一下：

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
#找出所有包含ul的内容
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

运行结果

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

因为都是Tag类型，所以我们依然可以进行嵌套查询，还是同样的文本，在这里我们查询出所有ul标签后再继续查询其内部的li标签。由于列表中的结果还是tag，所以可以继续循环查找：

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

运行结果：

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

attrs

除了根据标签名查询，我们也可以传入一些属性来进行查询，我们用一个实例感受一下：

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

运行结果：

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

在这里我们查询的时候传入的是attrs参数，参数的类型是字典类型，比如我们要查询id为list-1的节点，那就可以传入attrs={‘id’: ‘list-1’}的查询条件，得到的结果是列表形式，包含的内容就是符合id为list-1的所有节点，上面的例子中符合条件的元素个数是1，所以结果是长度为1的列表。

对于一些常用的属性比如id、class等，我们可以不用attrs来传递，比如我们要查询id为list-1的节点，我们可以直接传入id这个参数，还是上面的文本，我们换一种方式来查询。

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

运行结果：

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

在这里我们直接传入id='list-1’就可以查询id为list-1的节点元素了。而对于class来说，由于class在python里是一个关键字，所以在这里后面需要加一个下划线，class_=‘element’，返回的结果依然还是Tag组成的列表。
2.2、text
text参数可以用来匹配节点的文本，传入的形式可以是字符串，可以是正则表达式对象，我们用一个实例来感受一下：

import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))