BeautifulSoup Study


snowtower2010

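Contents

Definition and background
Parsers
Basic usage
Kinds of objects
Navigating the tree
Searching the tree
Modifying the tree
Output
- prettify()
- get_text()
Choosing a parser
Comparing objects for equality
Copying Beautiful Soup objects
Parsing only part of a document
Diagnosing with diagnose()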

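Definition and background

Beautiful Soup is a Python library for pulling data out of HTML and XML files. This post condenses the key points of the official documentation into a quick reference for browsing and review.
For details, see the official Beautiful Soup documentation, which is available in both Chinese and English.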

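Parsers

Beautiful Soup supports the HTML parser in Python's standard library, and it also supports several third-party parsers.

Parser	Typical usage	Advantages	Disadvantages
Python's html.parser	BeautifulSoup(markup, "html.parser")	Included in the standard library; decent speed; lenient	Not as lenient in Python versions before 2.7.3 or 3.2.2
lxml's HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient	Requires an external C library
lxml's XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	Very fast; the only XML parser currently supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Extremely lenient; parses pages the same way a web browser does; creates valid HTML5	Slow; requires an external Python package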
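
As a rough sketch of how the choice in the table plays out in code, the snippet below tries lxml first and falls back to the built-in html.parser when lxml is not installed. The helper name make_soup and the preference order are assumptions made for illustration, not something Beautiful Soup prescribes. FeatureNotFound is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: the name and the preference order are illustrative.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")


soup = make_soup("<p>hi</p>")
print(soup.p.string)
```

Naming the parser explicitly, as this helper does, keeps results consistent across machines that have different parsers installed.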

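Basic usage

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open file handle.

from bs4 import BeautifulSoup

# Pass an open file handle...
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'lxml')

# ...or pass the markup as a string.
soup = BeautifulSoup("<html>data</html>", 'html5lib')

First the document is converted to Unicode, and HTML entities are converted to Unicode characters. Beautiful Soup then parses the document with the best available parser; if you name a parser explicitly, it uses that one instead.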
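
A small sketch of both points, assuming only the built-in html.parser is available: the entities in the markup come back as ordinary Unicode characters, and a file-like object works the same way as a string. io.StringIO stands in for a real file here so the example runs without any file on disk.

```python
import io

from bs4 import BeautifulSoup

# Markup passed as a string: the &eacute;, &amp; and &egrave; entities
# are converted to the characters they stand for.
soup = BeautifulSoup("<p>caf&eacute; &amp; cr&egrave;me</p>", "html.parser")
text = soup.p.string
print(text)

# Markup passed as a file-like object; io.StringIO stands in for a real file.
fake_file = io.StringIO("<title>From a file handle</title>")
file_soup = BeautifulSoup(fake_file, "html.parser")
print(file_soup.title.string)
```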

Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. Every node is one of four kinds of object:

Tag
NavigableString
BeautifulSoup
Comment

Tag

A Tag object corresponds to an XML or HTML tag in the original document.

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

Name

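Every tag has a name, accessible as .name. You can assign to it directly, and the change shows up in any markup Beautiful Soup generates.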

Attributes

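A tag may have any number of attributes. You treat a tag's attributes like a dictionary, and you can reach the dictionary itself as .attrs. Attributes can be added, removed, and modified.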

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
tag.name
# 'b'
tag.name = "blockquote"
tag
# <blockquote class="boldest">Extremely bold</blockquote>
tag['class']
# ['boldest']
tag.attrs
# {'class': ['boldest']}
tag['id'] = 1
tag['class'] = 'verybold'
tag
#<blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['id']
tag
#<blockquote class="verybold">Extremely bold</blockquote>

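Multi-valued attributes

HTML 4 defines a few attributes that can have multiple values. HTML5 removes a couple of them but defines a few more. The most common multi-valued attribute is class (a tag can have more than one CSS class); others include rel, rev, accept-charset, headers, and accesskey. Beautiful Soup presents the value of a multi-valued attribute as a list. If an attribute looks like it has more than one value but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup leaves it as a string. When you assign a list to a multi-valued attribute, the values are joined into a single space-separated value on output.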

css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
css_soup.p['class']
# ['body', 'strikeout']
id_soup = BeautifulSoup('<p id="my id"></p>','lxml')
id_soup.p['id']
# 'my id'
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>','lxml')
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.a)
# <a rel="index contents">homepage</a>

If you parse a document as XML, there are no multi-valued attributes.

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'
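
Two related knobs are worth a quick sketch. It assumes a reasonably recent Beautiful Soup 4 release (the multi_valued_attributes constructor argument is documented from 4.8.0 onward): get_attribute_list() always hands back a list, and multi_valued_attributes=None switches the list splitting off for HTML as well.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout" id="my id"></p>'

# Default HTML behaviour: class is split into a list, id stays a string.
soup = BeautifulSoup(markup, "html.parser")
as_list = soup.p["class"]

# get_attribute_list() always returns a list, multi-valued or not.
id_list = soup.p.get_attribute_list("id")

# multi_valued_attributes=None turns the splitting off for HTML too.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
as_string = plain_soup.p["class"]

print(as_list, id_list, as_string)
```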

NavigableString

A string corresponds to a bit of text within a tag. Beautiful Soup uses the NavigableString class to wrap these bits of text.

tag.string
# 'Extremely bold'
type(tag.string)
# bs4.element.NavigableString

You can't edit a string in place, but you can replace one string with another using replace_with().

tag.string.replace_with("no longer bold")
tag
# <b class="boldest">no longer bold</b>

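A NavigableString can't contain anything, neither another string nor a tag, so it doesn't support the .contents or .string attributes or the find() method.
If you want to use a NavigableString outside of Beautiful Soup, convert it to a plain Unicode string with str() (unicode() in Python 2). Otherwise the string keeps a reference to the entire Beautiful Soup parse tree even after you are done with Beautiful Soup, which wastes memory.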
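
A minimal sketch of that conversion, assuming Python 3: the NavigableString still knows its parent tag, while the result of str() is an ordinary string with no link back to the tree.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

navigable = soup.b.string   # still attached to the parse tree
plain = str(navigable)      # detached, ordinary Python string

print(type(navigable).__name__, navigable.parent.name)
print(type(plain).__name__, hasattr(plain, "parent"))
```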

BeautifulSoup

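The BeautifulSoup object represents the parsed document as a whole. For most purposes you can treat it as a Tag object. Because it doesn't correspond to an actual HTML or XML tag, it has no attributes, and its .name is the special value "[document]".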

soup.name
# '[document]'

Comments and other special strings

A Comment object is just a special type of NavigableString. When it appears as part of an HTML document, it is displayed with special formatting.

markup = "<b><!--Hey,buddy.Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup,'lxml')
comment = soup.b.string
comment
# 'Hey,buddy.Want to buy a used parser?'
type(comment)
# bs4.element.Comment
print(soup.b.prettify())
# <b>
#  <!--Hey,buddy.Want to buy a used parser?-->
# </b>

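Beautiful Soup also defines classes for other things that may appear in an XML document: CData, ProcessingInstruction, Declaration, and Doctype. Like Comment, they are subclasses of NavigableString that add some extra formatting around the string.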

from bs4 import CData
cdata = CData('A CData block')
comment.replace_with(cdata)
print(soup.b.prettify())
# <b>
#  <![CDATA[A CData block]]>
# </b>
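
Because Comment is a subclass of NavigableString, an isinstance() check is enough to tell comments apart from ordinary text. The sketch below, using made-up markup, collects the comments and then the visible strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>Visible text<!-- hidden note --> and more</p>"
soup = BeautifulSoup(markup, "html.parser")

# A function passed as string= receives each string; keep only the comments.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

# Every string under <p> that is not a comment.
visible = [s for s in soup.p.find_all(string=True) if not isinstance(s, Comment)]

print(comments)
print(visible)
```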

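Navigating the tree

There are two ways to look at the tree. The first is the family-tree view: children and descendants, parents and ancestors, and siblings. The second is the parse-order view: elements in the linear order in which the parser encountered them, where each element only has a previous and a next element.
The examples use the "three sisters" document: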

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

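Going down

Navigating by tag name

Use attribute access with the tag's name. This reaches both direct children and deeper descendants. If several tags share the name, you get only the first one; to get all of them, use find_all().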

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

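.contents, .children and .descendants

.contents returns a tag's direct children as a list; a string (NavigableString) has no .contents because it can't contain anything. .children gives the same direct children as an iterator instead of a list. .descendants iterates recursively over all of a tag's descendants: its children, its children's children, and so on.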

soup.head.contents
# [<title>The Dormouse's story</title>]
for child in soup.head.children:
    print(child)
#  <title>The Dormouse's story</title>
for child in soup.head.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
type(soup.children)
# list_iterator
type(soup.descendants)
# generator

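Leaf nodes: .string, .strings and .stripped_strings

If a tag has only one child, and that child is a NavigableString, the child is available as .string.
If a tag's only child is another tag, and that tag has a .string, the parent tag is considered to have the same .string as its child.
Otherwise, when a tag contains more than one thing, .string is None.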

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>
title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.string
# "The Dormouse's story"
head_tag.string
# "The Dormouse's story"
soup.html.string is None
# True

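To get multiple strings, use .strings or .stripped_strings; both are generators. The latter skips whitespace-only strings and strips leading and trailing whitespace from the rest.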

for string in soup.strings:
    print(repr(string))
# '\n'
# "The Dormouse's story"
# '\n'
# '\n'
# "The Dormouse's story"
# '\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n'
# '...'
# '\n'
for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'

Going up

.parent

Use the .parent attribute to reach an element's parent. The parent of the BeautifulSoup object itself is None.

title_tag.parent
# <head><title>The Dormouse's story</title></head>
title_tag.string.parent
# <title>The Dormouse's story</title>
print(soup.parent)
# None

.parents

Use .parents to iterate over all of an element's ancestors, from its parent up to the top of the tree.

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
for parent in link.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name) 
# p
# body
# html
# [document]

Going sideways: siblings

For example, the <b> and <c> tags below are siblings: both are direct children of the same <a> tag.

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

.next_sibling and .previous_sibling

Use these to move between page elements on the same level of the parse tree.

sibling_soup.b
# <b>text1</b>
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>

.next_siblings and .previous_siblings

Use these to iterate over all the siblings that follow or precede an element.

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# ',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# ';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# ',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# 'Once upon a time there were three little sisters; and their names were\n'

Going back and forth: elements in parse order

<html><head><title>The Dormouse's story</title></head>
<p class="title"><b>The Dormouse's story</b></p>

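An HTML parser turns this markup into a series of events: open an <html> tag, open a <head> tag, open a <title> tag, add a string, close the <title> tag, and so on. Beautiful Soup offers tools for reconstructing the initial parse of the document.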

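.next_element and .previous_element

The .next_element attribute of a string or tag points to whatever was parsed immediately afterwards. It might be the same as .next_sibling, but in many cases it is different.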

last_a_tag = soup.find("a", id="link3")
last_a_tag
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_a_tag.next_sibling
# ';\nand they lived at the bottom of a well.'
last_a_tag.next_element
# 'Tillie'
last_a_tag.next_element.next_element
# ';\nand they lived at the bottom of a well.'
last_a_tag.previous_sibling
# ' and\n'
last_a_tag.previous_element
# ' and\n'
last_a_tag.previous_sibling.previous_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
last_a_tag.previous_element.previous_element
# 'Lacie'

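The .next_element of the <a> tag is whatever was parsed immediately after the <a> tag opened, which is the string "Tillie", not the rest of the sentence. In the original markup, "Tillie" appears before the semicolon: the parser encountered the <a> tag, then the string "Tillie", then the closing </a> tag, then the semicolon and the rest of the sentence.
.previous_element works the same way in reverse.
The key difference: the element attributes reflect parse order, while the sibling attributes reflect the hierarchy between elements.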

.next_elements and .previous_elements

These are iterators. Looping over them moves forward or backward through the document in the order it was parsed.
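
A short sketch of the difference between parse order and sibling order, using made-up markup rather than the three sisters document. The describe() helper is hypothetical and only there to make the output readable.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p><a>one</a>, <b>two</b></p>", "html.parser")


def describe(element):
    """Show tags as <name> and strings as plain text (illustrative helper)."""
    return f"<{element.name}>" if isinstance(element, Tag) else str(element)


# Everything parsed after the opening <a> tag, in parse order.
forward = [describe(e) for e in soup.a.next_elements]
print(forward)   # ['one', ', ', '<b>', 'two']

# The siblings of <a>, by contrast, never descend into a tag.
sideways = [describe(e) for e in soup.a.next_siblings]
print(sideways)  # [', ', '<b>']
```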

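Searching the tree

The two main methods are find() and find_all().

Filters

A filter is passed as an argument to a search method to narrow down what it matches. There are several kinds of filter, and they can be used alone or in combination.

A string

A string filter matches against tag names. If you pass in a byte string, Beautiful Soup assumes it is encoded as UTF-8; pass in a Unicode string to avoid decoding problems.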

soup.find_all('b')
# [<b>The Dormouse's story</b>]

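A regular expression

If you pass in a regular expression object, Beautiful Soup filters tag names using its search() method. The examples below find tags whose names start with "b", then tags whose names contain "t".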

import re
for tag in soup.find_all(re.compile('^b')):
    print(tag.name)
# body
# b
for tag in soup.find_all(re.compile('t')):
    print(tag.name)
# html
# title

A list

Matches any tag whose name exactly equals any string in the list.

soup.find_all(['a','b'])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

The value True matches every tag in the document, but none of the text strings.

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

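A function

You can pass a function as the filter. It must take a single argument (the tag) and return True if the tag matches and False otherwise.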

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were three little ...</p>,
# <p class="story">...</p>]

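To filter on a specific attribute such as href, pass a function as that attribute's value. The function then receives the attribute value, not the whole tag; if a tag lacks the attribute, the function is called with None. The not_lacie function below matches tags that have an href attribute whose value does not contain "lacie".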

def not_lacie(href):
    return href and not re.compile('lacie').search(href)
soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Functions can be as complicated as you need. For example, this one finds tags that are surrounded by string objects.

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
           and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# body
# p
# a
# a
# a
# p

find_all()

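find_all(name, attrs, recursive, string, limit, **kwargs) looks through a tag's descendants and retrieves all of them that match your filters.

The name argument

Pass a value for name to consider only tags with that name. Text strings have no name, so they are skipped automatically.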

soup.find_all('b')
# [<b>The Dormouse's story</b>]

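Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name.

If you pass a value for an argument called id, Beautiful Soup filters against each tag's id attribute.
Passing id=True matches every tag that has an id attribute, whatever its value. You can filter on several attributes at once.
Some attributes, such as the HTML5 data-* attributes, have names that can't be used as keyword arguments; put them in a dictionary and pass it as attrs={"data-foo": "value"}.
name can't be used as a keyword argument either, because Beautiful Soup already uses it for the tag name; to filter on a name attribute, go through the attrs dictionary.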

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(href=re.compile('elsie'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.find_all(id=True)
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all(href=re.compile('elsie'), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
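
The attrs cases from the list above are not covered by those examples, so here is a small sketch of them. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

markup = '<div data-foo="value">foo!</div><input name="email"/>'
soup = BeautifulSoup(markup, "html.parser")

# data-foo is not a valid Python identifier, so it goes through attrs.
data_matches = soup.find_all(attrs={"data-foo": "value"})

# name= as a keyword means the tag name, so this looks for <email> tags.
by_tag_name = soup.find_all(name="email")

# Filtering on the name *attribute* has to go through attrs as well.
by_name_attr = soup.find_all(attrs={"name": "email"})

print(data_matches, by_tag_name, by_name_attr)
```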

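Searching by CSS class

class is a reserved word in Python, so using class as a keyword argument is a syntax error. There are two workarounds: pass the class through attrs, or use the keyword argument class_.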

soup.find_all(class='sister')
# SyntaxError: invalid syntax
soup.find_all(class_='sister')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

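Because class is a multi-valued attribute, you can match against any single one of a tag's CSS classes, or against the exact string value of the whole class attribute; in the latter case the order of the classes matters. To match several classes in any order, use a CSS selector.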

css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
css_soup.find_all('p', attrs={"class":"strikeout body"})
# []
css_soup.find_all('p', attrs={"class":"body strikeout"})
# [<p class="body strikeout"></p>]
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]

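The string argument

Use string to search for strings instead of tags; it is also often combined with other arguments to narrow down which tags match. Like name and the keyword arguments, it accepts a string, a regular expression, a list, a function, or True.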

soup.find_all(string="Elsie")
# ['Elsie']
soup.find_all(string=["Tillie","Lacie"])
# ['Lacie', 'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    return s == s.parent.string
soup.find_all(string=is_the_only_string_within_a_tag)
# ["The Dormouse's story",
# "The Dormouse's story",
# 'Elsie',
# 'Lacie',
# 'Tillie',
# '...']

soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

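The limit argument

This works like the LIMIT keyword in SQL: it caps the number of results, and Beautiful Soup stops searching as soon as it has found that many.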

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive argument

The default is True, which searches all descendants. With recursive=False, only the tag's direct children are searched.

soup.find_all("a",recursive=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all("a",recursive=False)
# []

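Note: find() and find_all() are the only search methods that support the recursive argument.

Calling a tag is like calling find_all()

Because find_all() is used so often, there is a shortcut for it: calling a BeautifulSoup object or a Tag object as if it were a function is the same as calling find_all() on it.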

soup.find_all('b')
# [<b>The Dormouse's story</b>]
soup('b')
# [<b>The Dormouse's story</b>]
soup.title.find_all(string=True)
# ["The Dormouse's story"]
soup.title(string=True)
# ["The Dormouse's story"]

find()

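find(name, attrs, recursive, string, **kwargs) returns the first match. It is nearly equivalent to find_all(..., limit=1); the difference is that find() returns the result itself, while find_all() returns a list containing the single result. When nothing matches, find() returns None and find_all() returns an empty list [].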

soup.find_all('title',limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>
soup.find_all('haha',limit=1)
# []
print(soup.find('haha'))
# None

Other search methods

find_parents(name, attrs, string, limit, **kwargs)
find_parent(name, attrs, string, **kwargs)
Based on .parent and .parents: search among an element's parent and ancestors.

find_next_siblings(name, attrs, string, limit, **kwargs)
find_next_sibling(name, attrs, string, **kwargs)
Based on .next_siblings: search among the siblings that come after an element.

find_previous_siblings(name, attrs, string, limit, **kwargs)
find_previous_sibling(name, attrs, string, **kwargs)
Based on .previous_siblings: search among the siblings that come before an element.

find_all_next(name, attrs, string, limit, **kwargs)
find_next(name, attrs, string, **kwargs)
Based on .next_elements: find the next matching element, or all matching elements, that were parsed after this one.

find_all_previous(name, attrs, string, limit, **kwargs)
find_previous(name, attrs, string, **kwargs)
Based on .previous_elements: find the previous matching element, or all matching elements, that were parsed before this one.
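
These methods take the same filters as find_all(), only the direction of the search changes. The sketch below runs one or two of each family against a small made-up document; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = (
    '<div id="box">'
    '<p id="first">one <a href="#x">link</a></p>'
    '<p id="second">two</p>'
    '</div>'
)
soup = BeautifulSoup(markup, "html.parser")
link = soup.a
first, second = soup.find_all("p")

# Upwards, following .parents
nearest_div = link.find_parent("div")["id"]                         # 'box'
ancestors_with_id = [t["id"] for t in link.find_parents(id=True)]   # ['first', 'box']

# Sideways, following .next_siblings and .previous_siblings
younger = first.find_next_sibling("p")["id"]                        # 'second'
older = second.find_previous_sibling("p")["id"]                     # 'first'

# In parse order, following .next_elements and .previous_elements
next_p = link.find_next("p")["id"]                                  # 'second'
later_strings = link.find_all_next(string=True)                     # ['link', 'two']
earlier_string = link.find_previous(string=True)                    # 'one '

print(nearest_div, ancestors_with_id, younger, older)
print(next_p, later_strings, repr(earlier_string))
```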

CSS Selector

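Beautiful Soup supports the most commonly used CSS selectors: call select() on a BeautifulSoup object or a Tag.
CSS selectors are flexible enough to deserve a write-up of their own; only the basics are outlined here.

Find tags by name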

soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find descendants and direct children

soup.select("body a")
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find the siblings of a tag

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class

soup.select(".sister")
soup.select("[class*='sis']")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by ID

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Other selectors

These include selecting by attribute and by attribute value.
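
A brief sketch of attribute-based selectors, using standard CSS attribute syntax on made-up markup. It assumes a Beautiful Soup 4 installation where select() is available (recent releases delegate CSS matching to the soupsieve package, which is normally installed alongside bs4).

```python
from bs4 import BeautifulSoup

markup = (
    '<a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/tillie" id="link3">Tillie</a>'
    '<a id="nohref">No link</a>'
)
soup = BeautifulSoup(markup, "html.parser")

# Tags that have the attribute at all
has_href = [a["id"] for a in soup.select("a[href]")]

# Exact value, suffix match, and substring match
exact = [a["id"] for a in soup.select('a[href="http://example.com/elsie"]')]
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]
contains = [a["id"] for a in soup.select('a[href*=".com/el"]')]

print(has_href, exact, ends_with, contains)
```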

select_one()

Returns only the first matching element (the element itself, not a list of length 1).
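
A minimal comparison of select_one() and select() on made-up markup, including the case where nothing matches:

```python
from bs4 import BeautifulSoup

markup = '<p><a class="sister" id="link1">Elsie</a><a class="sister" id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")

first = soup.select_one(".sister")      # the first matching Tag
as_list = soup.select(".sister")        # every matching Tag, in a list
missing = soup.select_one(".brother")   # None when nothing matches

print(first["id"], [a["id"] for a in as_list], missing)
```

Since select_one() can return None, check the result before indexing into it.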

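Modifying the tree

To modify the tree, first locate the element, then change it.

Task	How
Rename a tag	Assign to tag.name
Add or remove an attribute, or change its value	Assign to tag['attr']; del tag['attr']
Change .string	Assign to tag.string; any existing contents, including other tags, are replaced
Append to a tag's contents	tag.append(string / NavigableString(string) / Comment(string))
Add a new tag	tag.append(soup.new_tag("tagName", attr="str"))
Insert at a position	tag.insert(n, tag/Comment/string/NavigableString) inserts at index n of the tag's contents; insert_before() inserts immediately before the element; insert_after() inserts immediately after it
Clear the contents	tag.clear()
Extract an element	PageElement.extract() removes a tag or string from the tree and returns it
Delete an element	Tag.decompose() removes a tag from the tree and destroys it
Replace an element	PageElement.replace_with() returns the element that was replaced
Wrap an element	PageElement.wrap() wraps a string or tag in the given tag: p.string.wrap(soup.new_tag("b")); p.wrap(soup.new_tag("div"))
Unwrap	Tag.unwrap(), the opposite of wrap()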
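
To make the table concrete, here is a sketch that chains several of these operations on a made-up one-line document and records the markup after each step. The step variables exist only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="note">Hello <i>world</i></p>', "html.parser")
tag = soup.p

# Rename the tag, add one attribute, delete another.
tag.name = "div"
tag["id"] = "greeting"
del tag["class"]
step1 = str(soup)   # <div id="greeting">Hello <i>world</i></div>

# Create a new tag and append it to the contents.
bang = soup.new_tag("b")
bang.string = "!"
tag.append(bang)
step2 = str(soup)   # <div id="greeting">Hello <i>world</i><b>!</b></div>

# extract() removes <i> from the tree and returns it.
removed = soup.i.extract()
step3 = str(soup)   # <div id="greeting">Hello <b>!</b></div>

# unwrap() replaces <b> with its contents.
soup.b.unwrap()
step4 = str(soup)   # <div id="greeting">Hello !</div>

# wrap() puts the element inside a newly created tag.
tag.wrap(soup.new_tag("section"))
step5 = str(soup)   # <section><div id="greeting">Hello !</div></section>

print(step5)
```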

Output

prettify()

Formats the Beautiful Soup parse tree into a nicely indented Unicode string, with each HTML/XML tag and each string on its own line.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'lxml')
soup.prettify()
# '<html>\n <body>\n  <a href="http://example.com/">\n   I linked to\n   <i>\n    example.com\n   </i>\n  </a>\n </body>\n</html>'
print(soup.prettify())
'''
<html>
 <body>
  <a href="http://example.com/">
   I linked to
   <i>
    example.com
   </i>
  </a>
 </body>
</html>
'''

get_text()

Returns all the text in a document or beneath a tag as a single Unicode string.

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup,'lxml')
soup.get_text()
# 'I linked to example.com'
soup.i.get_text()
# 'example.com'
soup.get_text("|")
# 'I linked to |example.com'
soup.get_text(strip = True)
# 'I linked toexample.com'
soup.get_text("|", strip = True)
# 'I linked to|example.com'
[text for text in soup.stripped_strings]
# ['I linked to', 'example.com']

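Choosing a parser

This ties back to the Parsers section above, whose table lists each parser's characteristics.
The parsers build trees in different ways and follow different standards, so the same markup can produce different results. Consider these examples.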

BeautifulSoup("<a><b /></a>", 'lxml')
# <html><body><a><b></b></a></body></html>
BeautifulSoup("<a><b /></a>", 'xml')
# <?xml version="1.0" encoding="utf-8"?>
# <a><b/></a>

BeautifulSoup("<a></p>", "lxml")
# <html><body><a></a></body></html>
BeautifulSoup("<a></p>", "html5lib")
# <html><head></head><body><a><p></p></a></body></html>
BeautifulSoup("<a></p>", "html.parser")
# <a></a>

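Comparing objects for equality

Beautiful Soup considers two NavigableString or Tag objects equal when they represent the same HTML or XML markup; test this with ==. To check whether two variables refer to exactly the same object, use is.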

markup = "<p>I want <b>pizza</b> and more <b>pizza</b>!</p>"
soup = BeautifulSoup(markup, 'html.parser')
first_b,second_b = soup.find_all('b')
first_b == second_b
# True
first_b is second_b
# False

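Copying Beautiful Soup objects

copy.copy() creates a copy of any Tag or NavigableString. The copy is equal to the original but is a different object. The other difference is that the original is part of the document tree while the copy is completely detached from it, as though extract() had been called on it.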

import copy
p_copy = copy.copy(soup.p)
soup.p == p_copy
# True
soup.p is p_copy
# False
print(p_copy.parent)
# None

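Parsing only part of a document

If all you want is the <a> tags, parsing the entire document first is a waste of time and memory. It is much faster to ignore everything that isn't an <a> tag in the first place.
The SoupStrainer class lets you choose which parts of an incoming document are parsed. Create a SoupStrainer and pass it to the BeautifulSoup constructor as the parse_only argument.
Note that this does not work with the html5lib parser, which must parse the whole document and rearranges the parse tree as it works.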
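
A minimal sketch of parse_only, using a shortened version of the three sisters markup and the built-in html.parser. Only the <a> tags end up in the tree, so a search for <p> finds nothing.

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""

# Keep only <a> tags while parsing.
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

ids = [a["id"] for a in soup.find_all("a")]
paragraph = soup.find("p")   # None: the <p> tag was never put in the tree

print(ids, paragraph)
```

A SoupStrainer accepts the same kinds of filter as find_all(), so the choice of "a" here is just one possibility.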

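Diagnosing with diagnose()

If you want to know what Beautiful Soup is doing with a document, pass the document to the diagnose() function (new in Beautiful Soup 4.2.0). Beautiful Soup prints a report showing how the different parsers handle the document and tells you if you are missing a parser that it could be using. The output may help you find the cause of a problem.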
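
A small sketch of calling diagnose(), which lives in the bs4.diagnose module. Because the function prints its report instead of returning it, the example captures standard output; capturing it this way is a choice made here for illustration, and the exact report text varies with the installed version and parsers.

```python
import contextlib
import io

from bs4.diagnose import diagnose

markup = "<a><b /></a>"

# diagnose() prints its report, so capture stdout to keep it as a string.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    diagnose(markup)
report = buffer.getvalue()

# Show just the first line of the report.
print(report.splitlines()[0])
```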
