Parsing with BeautifulSoup

光尘92

Article link: https://blog.csdn.net/hanli1992/article/details/82462763

Official documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

I. Parsers

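Parser	Typical usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Not very lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only XML parser currently supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses pages the way a web browser does; produces valid HTML5	Very slow; requires an external Python package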

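lxml is the recommended parser because it is faster. On Python 2 releases earlier than 2.7.3 and Python 3 releases earlier than 3.2.2, installing lxml or html5lib is essential, because the HTML parser built into the standard library of those releases is not robust enough.

Note: if an HTML or XML document is not well-formed, different parsers may return different results.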
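
To make the note above concrete, here is a minimal sketch that feeds the same broken fragment to each parser and records the result. The fragment "<a></p>" and the variable names are illustrative choices, not part of the original article, and lxml and html5lib are optional packages that may not be installed, so the sketch skips any parser that is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# An unclosed <a> followed by a stray </p>: invalid HTML on purpose.
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output if output is not None else "(not installed)")
```

With html.parser the stray </p> is dropped and the result is <a></a>; the other two parsers, when available, wrap the fragment in additional document structure, which is why the choice of parser should be stated explicitly.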

Take the document from "Alice in Wonderland" as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

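II. Kinds of objects

Beautiful Soup converts a complex HTML document into a tree structure. Every node is a Python object, and all of them fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

1. Tag

A Tag corresponds to a tag in the HTML document, for example: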

<title>The Dormouse's story</title>

The title tag together with its content is a Tag:

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

A Tag has two important attributes: name and attrs

Name

tag.name
# 'b'

Attributes

tag['class']
# or
tag.get('class')
# ['boldest']
# class is a multi-valued attribute, so its value is a list

tag.attrs
# {'class': ['boldest']}
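
The two ways of reading an attribute behave differently when the attribute is absent. The following sketch illustrates this; the id attribute and the fallback value "n/a" are added here only for the example, and html.parser is used so that no extra package is needed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', "html.parser")
tag = soup.b

class_value = tag["class"]   # multi-valued attribute: a list of class names
id_value = tag["id"]         # single-valued attribute: a plain string

# tag.get() returns None (or a supplied default) for a missing attribute...
missing = tag.get("title")
fallback = tag.get("title", "n/a")

# ...whereas subscript access raises KeyError, like a dict.
try:
    tag["title"]
    raised = False
except KeyError:
    raised = True

print(class_value, id_value, missing, fallback, raised)
```

When scraping pages whose markup is not guaranteed, tag.get() is usually the safer choice because a missing attribute does not interrupt the loop.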

2. NavigableString

A NavigableString is the text contained inside a tag

tag.string
# 'Extremely bold'

type(tag.string)
# <class 'bs4.element.NavigableString'>

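3. BeautifulSoup

The BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object; it is a special kind of Tag

type(soup.name)
# <class 'str'>

soup.name
# '[document]'

soup.attrs
# {} (an empty dict)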

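4. Comment

A Comment is a special type of NavigableString that holds a comment from the document. Its value does not include the comment delimiters, so it can be mistaken for ordinary text and cause trouble if it is not handled separately

soup = BeautifulSoup('<a><!-- Elsie --></a>', 'lxml')
soup.a.string
# ' Elsie '

type(soup.a.string)
# <class 'bs4.element.Comment'>

Check the type before using the string:

from bs4.element import Comment

if isinstance(soup.a.string, Comment):
    print(soup.a.string)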

III. Traversing the document tree

Child nodes

Tag names

soup.title
# <title>The Dormouse's story</title>

# chain the names to drill down
soup.body.b
# <b>The Dormouse's story</b>

Attribute-style access only returns the first tag with that name:

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# find_all() returns every matching tag
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents

The .contents attribute returns a tag's child nodes as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
#[<title>The Dormouse's story</title>]

.children

A tag's .children generator lets you loop over its child nodes:

title_tag = head_tag.contents[0]
for child in title_tag.children:
    print(child)
    # The Dormouse's story

.contents and .children only cover a tag's direct children

.descendants

The .descendants attribute iterates recursively over all of a tag's descendants

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

.string

If a tag has a single child of type NavigableString, .string returns that child

title_tag.string
# "The Dormouse's story"

If a tag has exactly one child node and that child is itself a tag, .string on the outer tag returns the same value as the child's .string

head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.string
# "The Dormouse's story"

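.strings

When a tag contains more than one string, .string is None; iterate over .strings to get them all

.stripped_strings

Same as .strings, but with surrounding whitespace stripped and whitespace-only strings skipped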
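
The difference between the two generators is easiest to see side by side. This is a small sketch on a shortened paragraph modelled on the "Alice" document; the fragment and variable names are made up for the example, and html.parser is used to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

html = """<p class="story">Once upon a time
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# .strings yields every string as parsed, including line breaks.
raw = list(soup.p.strings)

# .stripped_strings trims each string and skips whitespace-only ones.
clean = list(soup.p.stripped_strings)

print(raw)
print(clean)
```

Here clean is ['Once upon a time', 'Elsie', ',', 'Lacie'], while raw keeps the trailing newlines and the final whitespace-only string.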

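Parent nodes

.parent

The string inside the document's title also has a parent, the <title> tag:

title_tag.string.parent
# <title>The Dormouse's story</title>

The parent of a top-level node such as <html> is the BeautifulSoup object

The .parent of the BeautifulSoup object is None

.parents

Iterates over all of an element's ancestors, from its parent up to the BeautifulSoup object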
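
As a short illustration of .parents, the sketch below walks upward from a link and collects the name of each ancestor. The one-line document is a made-up stand-in for the "Alice" example, parsed with html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><p><a id='link1'>Elsie</a></p></body></html>", "html.parser"
)
link = soup.a

# Walk from the immediate parent up to the BeautifulSoup object itself.
names = [parent.name for parent in link.parents]
print(names)

# The BeautifulSoup object is the root, so it has no parent.
root_parent = soup.parent
print(root_parent)
```

The last name in the list is '[document]', the name of the BeautifulSoup object, which matches the statement that top-level nodes hang off that object.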

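Sibling nodes

.next_sibling and .previous_sibling

Note: in real documents, a tag's .next_sibling or .previous_sibling is usually a string or whitespace, because the whitespace and line breaks between tags are nodes too

.next_siblings and .previous_siblings

Iterate over all of the siblings that follow or precede a node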
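
The whitespace caveat is a common source of surprises, so here is a minimal sketch of it. The two-link fragment is invented for the example; find_next_sibling() is shown as one way to skip over text nodes and reach the next tag.

```python
from bs4 import BeautifulSoup

html = """<p><a id="link1">Elsie</a>,
<a id="link2">Lacie</a></p>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find(id="link1")

# The immediate sibling is the text between the two tags, not the second <a>.
text_sibling = first.next_sibling

# find_next_sibling("a") skips text nodes and returns the next <a> tag.
tag_sibling = first.find_next_sibling("a")

# The first link is the first child of <p>, so nothing precedes it.
before = first.previous_sibling

print(repr(text_sibling), tag_sibling["id"], before)
```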

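Next and previous elements

.next_element and .previous_element

These cover every node in parse order, regardless of nesting level

Take this head node:

<head><title>hello world</title></head>

Its next element is the title tag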

soup.head.next_element
#<title>hello world</title>

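Here is the last <a> tag in the "Alice" document. Its .next_element is whatever was parsed immediately after the opening <a> tag, not the rest of the sentence that follows the tag, so the result is the string "Tillie":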

last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_a_tag.next_element
# 'Tillie'

.next_elements and .previous_elements

Iterate over everything that was parsed after or before a node
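
To contrast parse order with sibling order, the sketch below lists everything that follows the last link. The fragment is a shortened, invented version of the end of the "Alice" document, parsed with html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p><a id='link3'>Tillie</a>; and they lived.</p><p>...</p>", "html.parser"
)
last_a = soup.find("a", id="link3")

# Sibling order: the node that follows <a> inside the same parent.
sibling = last_a.next_sibling

# Parse order: first the string inside <a>, then the text after it,
# then the next <p> tag, then that tag's own string.
following = [str(element) for element in last_a.next_elements]

print(repr(sibling))
print(following)
```

The string inside the tag comes first in following, even though it is a child rather than a sibling, because it was the next thing the parser encountered.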

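IV. Searching the document tree

1. find_all(name, attrs, recursive, string, limit, **kwargs)

find_all() looks through the tag descendants of the current tag and returns those that match the filters

1) The name argument

a. A string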

soup.find_all('b')
# [<b>The Dormouse's story</b>]

b. A regular expression

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

c. A list

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

d. True

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

e. A function

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

2) Keyword arguments

When searching on a named attribute, the value can be a string, a regular expression, a list, or True

soup.find_all(id='link2')
# or
soup.find_all(attrs={'id': 'link2'})
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

If you pass an href argument, Beautiful Soup filters on each tag's "href" attribute

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Passing several keyword arguments filters on several attributes at once

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Because class is a reserved word in Python, using class as a keyword argument is a syntax error. Use the class_ argument to search for tags with a given CSS class

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3) The string argument

Accepts a string, a regular expression, a list, or True

soup.find_all(string="Elsie")
# ['Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

4) The limit argument

Limits the number of results returned

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

5) The recursive argument

With recursive=False, only direct children are searched

Given this document:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

1.1 Calling a tag is like calling `find_all()`

soup.find_all("a")
is equivalent to:
soup("a")

soup.title.find_all(string=True)
is equivalent to:
soup.title(string=True)

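2. find(name, attrs, recursive, string, **kwargs)

find_all() always returns a list, even when it contains only one element

find() returns the first matching result directly

When nothing matches, find_all() returns an empty list and find() returns None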
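
The practical consequence of these return types is that a find() result should be checked for None before it is used. A minimal sketch, using a made-up one-line document and the hypothetical tag name "nosuchtag" to force a miss:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>", "html.parser"
)

as_list = soup.find_all("title", limit=1)  # a list with one element
single = soup.find("title")                # the element itself

empty = soup.find_all("nosuchtag")         # no match: empty list
nothing = soup.find("nosuchtag")           # no match: None

# Guard against None before calling methods on a find() result.
text = single.get_text() if single is not None else ""
print(as_list, single, empty, nothing, text)
```

Chaining calls such as soup.find("nosuchtag").get_text() raises AttributeError, which is why the guard matters in scraping code.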

3. find_parents() and find_parent()

4. find_next_siblings() and find_next_sibling()

5. find_previous_siblings() and find_previous_sibling()

6. find_all_next() and find_next()

7. find_all_previous() and find_previous()

Note: the methods in items 2 through 7 accept the same filters as find_all(); the single-result variants return the first match instead of a list
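
Since these methods are only listed by name, here is a brief sketch of a few of them starting from the middle link of a simplified "Alice" paragraph. The fragment and variable names are chosen for the example; each call searches in a different direction from the same starting tag.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find("a", id="link2")

parent = middle.find_parent("p")                # upward through ancestors
next_link = middle.find_next_sibling("a")       # later siblings
prev_link = middle.find_previous_sibling("a")   # earlier siblings
later_strings = middle.find_all_next(string=True)  # everything parsed afterwards
earlier_link = middle.find_previous("a")        # everything parsed before

print(parent["class"], next_link["id"], prev_link["id"])
print(later_strings, earlier_link["id"])
```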

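V. CSS selectors

select() is an alternative that achieves much the same as find_all(). As when writing CSS, a tag name is written bare, a class name is prefixed with "." and an id is prefixed with "#"; soup.select() filters elements using this syntax

1. Searching by tag name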

soup.select("title")
# [<title>The Dormouse's story</title>]

# descend level by level
soup.select("html head title")
# [<title>The Dormouse's story</title>]

soup.select("head > title")
# [<title>The Dormouse's story</title>]

2. Searching by class name

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

3. Searching by id

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

4. Combined search

soup.select('p #link1')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

5. Searching by attribute

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

6. Searching by attribute value

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

7. Finding sibling tags

8. Querying with several CSS selectors at once

9. Searching by language setting
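
Items 7 to 9 are listed without examples, so the following sketch shows one selector for each. The markup, including the three lang paragraphs, is invented for the illustration, and it assumes a Beautiful Soup release whose select() supports these selectors (recent releases delegate CSS matching to the soupsieve package).

```python
from bs4 import BeautifulSoup

html = """<p class="story"><a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a></p>
<p lang="en">Hello</p><p lang="en-us">Howdy</p><p lang="fr">Bonjour</p>"""
soup = BeautifulSoup(html, "html.parser")

# 7. Siblings: "~" matches all later siblings, "+" only the next one.
later = [a["id"] for a in soup.select("#link1 ~ .sister")]
adjacent = [a["id"] for a in soup.select("#link1 + .sister")]

# 8. Several selectors at once, separated by commas.
both = [a["id"] for a in soup.select("#link1, #link2")]

# 9. Language: "|=" matches "en" itself and values starting with "en-".
english = [p.get_text() for p in soup.select("p[lang|=en]")]

print(later, adjacent, both, english)
```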

VI. Output

Beautiful Soup converts HTML entities such as "&ldquo;" into Unicode characters

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')
str(soup)
# '“Dammit!” he said.'

Pretty-printing

prettify() formats the Beautiful Soup parse tree into a Unicode string in which each XML/HTML tag gets its own line

Both the BeautifulSoup object and any of its tag nodes can call prettify()
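
A minimal sketch of both ways of calling prettify(), on the whole document and on a single tag. The markup is an assumption made for the example, and the exact indentation of the result may vary between Beautiful Soup releases, so no expected output is shown.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

# Format the whole document: nested tags are indented on separate lines.
pretty_doc = soup.prettify()

# Format a single tag only.
pretty_tag = soup.i.prettify()

print(pretty_doc)
print(pretty_tag)
```

prettify() is meant for reading a document's structure; because it inserts whitespace, str(soup) is the better choice when the output has to stay faithful to the original markup.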

get_text()

Returns all the text inside a tag, including text in descendant tags, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'lxml')

soup.get_text()
# '\nI linked to example.com\n'

soup.get_text("|", strip=True)
# 'I linked to|example.com'

soup.i.get_text()
# 'example.com'
