Using Parsing Libraries, Part 2

Article link: https://blog.csdn.net/weixin_44595464/article/details/103230214

Using Beautiful Soup

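Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document. Learning Beautiful Soup can save you hours or even days of work.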

1. Installation

Install with pip (recommended)

pip3 install beautifulsoup4

Install from a wheel

Download the wheel file from PyPI at https://pypi.python.org/pypi/beautifulsoup4, then install the wheel file with pip.

2. Parsers

Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it also supports third-party parsers such as lxml.

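Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Poor tolerance of malformed documents in Python versions before 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Fast; the only supported XML parser	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses the document the way a browser does; produces valid HTML5	Slow; requires an external Python package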

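As the comparison shows, lxml is the recommended parser because it is faster. On Python 2 versions before 2.7.3 and Python 3 versions before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not robust enough.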
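
Since lxml is an optional dependency, a script may want to fall back to the built-in parser when lxml is missing. The following is a minimal sketch of that idea, not part of the Beautiful Soup API: the helper name make_soup and its preferred parameter are made up for illustration. It relies on Beautiful Soup raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: parse markup with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("none of the requested parsers is installed")


# The markup is deliberately left unclosed; either parser repairs it.
soup = make_soup("<p>Hello<b>world")
print(soup.b.string)
```

Listing html.parser last means the helper always succeeds on a standard Python installation, at the cost of slower parsing when lxml is absent.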

3. Basic usage

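First we declare a variable html that holds an HTML string. Note that it is not a complete HTML document: the body and html nodes are never closed. We pass it as the first argument to BeautifulSoup, and the parser type as the second argument (lxml here). This initializes the BeautifulSoup object, which we assign to the variable soup. As the output shows, the parser closes the missing tags automatically.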


html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # print the parsed document with standard indentation
print(soup.title.string)  # print the text content of the title node

Output

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

4. Kinds of objects

1 Tag

A Tag object corresponds to a tag in the original XML or HTML document.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

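A Tag has many methods and attributes, most of which belong to navigating and searching the document tree. For now, here are the two most important features of a tag: its name and its attributes.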

2 Name

Every tag has a name, available as .name:

tag.name
# 'b'

If you change a tag's name, the change shows up in any HTML generated from the current Beautiful Soup object:

tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>

3 Attributes

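A tag may have any number of attributes. The tag <b class="boldest"> has one attribute, "class", whose value is "boldest". You access a tag's attributes the same way you access a dictionary: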

tag['class']
# ['boldest']  (class is a multi-valued attribute, so Beautiful Soup 4 returns a list)

You can also get all of a tag's attributes as a dictionary through .attrs:

tag.attrs
# {'class': ['boldest']}

A tag's attributes can be added, removed, or modified. Again, this works exactly like a dictionary:

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>

del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
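
The dictionary-style access above can be summarized in one short sketch. The markup and variable names below are made up for illustration, and html.parser is used only so the example needs no extra install.

```python
from bs4 import BeautifulSoup

# Illustrative markup: one link with three attributes.
soup = BeautifulSoup('<a href="/a" class="sister big" id="link1">Elsie</a>', "html.parser")
tag = soup.a

href = tag["href"]                       # single-valued attribute: a plain string
classes = tag["class"]                   # multi-valued attribute: a list of strings
missing = tag.get("title")               # absent attribute: None instead of KeyError
fallback = tag.get("title", "untitled")  # absent attribute with a default value
has_id = tag.has_attr("id")              # membership test without fetching the value

print(href, classes, missing, fallback, has_id)
```

Using .get() or .has_attr() is the safer choice when scraping real pages, where an attribute may be present on some tags and absent on others.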

5. Nodes

Child nodes

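A tag may contain strings and other tags; these are all children of the tag. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.
Note: strings in Beautiful Soup do not support these attributes, because a string cannot have children.
A tag's .contents attribute returns its children as a list; .children yields the same nodes as an iterator.
.contents and .children contain only a tag's direct children. For example, the <head> tag has a single direct child, the <title> tag.
The .descendants attribute iterates recursively over all of a tag's descendants.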
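
The difference between direct children and descendants is easiest to see on a small document. The sketch below uses a trimmed-down version of the earlier example markup and html.parser; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(doc, "html.parser")
head = soup.head

# .contents is a list holding only the direct children of <head>.
direct = head.contents

# .children iterates over the same direct children.
child_names = [child.name for child in head.children]

# .descendants also walks into <title> and yields the string inside it.
descendants = list(head.descendants)

print(len(direct), child_names, len(descendants))
```

Here <head> has one direct child (the <title> tag) but two descendants: the <title> tag and the string it contains.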

Parent nodes

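Every tag and every string has a parent: the tag that contains it.
.parent
Use the .parent attribute to get an element's parent.
.parents
Use an element's .parents attribute to iterate over all of its ancestors, from the nearest one outward.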
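
A short sketch of walking upward through the tree, using made-up markup and html.parser. Note that the top-level BeautifulSoup object reports its name as "[document]" and has no parent.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><b>text</b></p></body></html>", "html.parser")
b = soup.b

parent_name = b.parent.name           # the tag that directly contains <b>
string_parent = b.string.parent.name  # a string's parent is the tag that holds it

# .parents yields every ancestor in turn, ending with the BeautifulSoup object.
ancestors = [ancestor.name for ancestor in b.parents]

top = soup.parent                     # the document object itself has no parent

print(parent_name, string_parent, ancestors, top)
```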

Sibling nodes

Consider a simple example:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>", 'lxml')
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

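The <b> and <c> tags are at the same level: they are children of the same element, so they are called siblings. When a document is pretty-printed, siblings share the same indentation level. You can use this relationship in code as well.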

.next_sibling and .previous_sibling
Use the .next_sibling and .previous_sibling attributes to move between siblings in the document tree:

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>

The <b> tag has a .next_sibling but no .previous_sibling, because it is the first among its siblings. Likewise, the <c> tag has a .previous_sibling but no .next_sibling:

print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None

The strings "text1" and "text2" in the example are not siblings, because they have different parents:

sibling_soup.b.string
# 'text1'

print(sibling_soup.b.string.next_sibling)
# None

In real documents, a tag's .next_sibling or .previous_sibling is usually a string of text or whitespace rather than another tag.
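
The sketch below illustrates that point with made-up markup and html.parser: the node right after the first link is the text between the two links, so reaching the next tag takes either a second step or find_next_sibling(), which skips non-matching nodes.

```python
from bs4 import BeautifulSoup

doc = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
first = soup.a

between = first.next_sibling               # the text node between the two links
second = first.next_sibling.next_sibling   # two steps to reach the next <a>
via_find = first.find_next_sibling("a")    # jumps straight to the next <a> tag

print(repr(between), second["id"], via_find["id"])
```

find_next_sibling() is usually the more robust choice, since it does not depend on how much text or whitespace happens to sit between the tags.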

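.next_siblings and .previous_siblings
Use the .next_siblings and .previous_siblings attributes to iterate over the siblings of the current node. Here soup is the three-sisters document parsed in section 3: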

for sibling in soup.a.next_siblings:
    print(repr(sibling))
    # ',\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ' and\n'
    # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
    # ';\nand they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
    # ' and\n'
    # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
    # ',\n'
    # <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
    # 'Once upon a time there were three little sisters; and their names were\n'
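
Because the sibling iterators yield text nodes as well as tags, code that only cares about tags usually filters them. The following is one possible sketch with made-up markup and html.parser; checking isinstance(node, Tag) is an illustrative choice, not the only way to do it.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc = (
    '<p>Names: <a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>; the end.</p>'
)
soup = BeautifulSoup(doc, "html.parser")

# Keep only the Tag siblings that follow the first link, skipping text nodes.
first = soup.find(id="link1")
following_ids = [node["id"] for node in first.next_siblings if isinstance(node, Tag)]

# previous_siblings walks backward, so the nearest tag comes first.
last = soup.find(id="link3")
preceding_ids = [node["id"] for node in last.previous_siblings if isinstance(node, Tag)]

print(following_ids, preceding_ids)
```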