Common BeautifulSoup Methods Explained

Latest recommended article published on 2024-04-24 13:35:29

abolbee


Original link: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


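Parsers

A parser is what actually parses the document. This article does not compare the parsers; the suggested order of preference is lxml, html5lib, then html.parser from the Python standard library (the first two must be installed separately).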
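
The preference order above can be sketched as a small fallback helper. The name `make_soup` is hypothetical, and the sketch assumes that Beautiful Soup raises `bs4.FeatureNotFound` when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Parsers in the suggested order of preference.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in PREFERRED_PARSERS:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Different parsers can build different trees from invalid markup, so naming one parser explicitly is the safer choice when results must be reproducible across machines.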

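Beautiful Soup objects

Beautiful Soup turns a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

Tag object
Corresponds to a tag in the original XML or HTML document.
Its main features are its name and its attributes.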

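from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
tag.name
# 'b'
tag.name = "blockquote"  # rename the tag
tag['class']
# ['boldest']
tag.attrs
# {'class': ['boldest']}
# Attributes can be added, removed, or modified, exactly like dictionary entries
tag['id'] = 1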

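NavigableString, a string that can be navigated
Strings are usually contained inside tags. Beautiful Soup wraps the strings inside a tag in the NavigableString class, which supports many of the attributes described under navigating and searching the tree.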

type(tag.string)
# <class 'bs4.element.NavigableString'>

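BeautifulSoup object
Represents the entire content of a document. Most of the time it can be treated as a Tag object, and it supports most of the methods described under navigating and searching the tree.

Comment, for comments and other special strings
A Comment is a special kind of NavigableString that holds a comment from the document.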
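
To make the four kinds of object concrete, here is a minimal sketch that parses a one-line document and checks the type of each node. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<b><!--a comment-->bold</b>", "html.parser")

tag = soup.b                # Tag
comment = tag.contents[0]   # Comment (a NavigableString subclass)
text = tag.contents[1]      # plain NavigableString

for node in (soup, tag, comment, text):
    print(type(node).__name__, repr(str(node)))

# The BeautifulSoup object is itself a Tag, which is why it supports
# the same navigation and search methods.
print(isinstance(soup, Tag))
```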

Navigating the tree

Using tag names

# The first tag with the given name
soup.head
soup.body.b
# All tags with the given name
soup.find_all('a')

# A tag's children, returned as a list
head_tag = soup.head
head_tag.contents

# Parent nodes
title_tag = soup.title
title_tag.string.parent  # the <title> tag that contains the string
link = soup.a
for parent in link.parents:
    print(parent.name)

# Iterate over a tag's direct children
for child in title_tag.children:
    print(child)
# Iterate recursively over all of a tag's descendants
for child in head_tag.descendants:
    print(child)
len(list(soup.children))
len(list(soup.descendants))

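# If a tag has a single NavigableString child, .string returns that child
title_tag = head_tag.contents[0]
title_tag.string  # "The Dormouse's story"
# If a tag's only child is another tag that has a .string, the parent's .string is that same string
head_tag.string  # "The Dormouse's story"
# When a tag contains more than one string, iterate over them with .strings
for string in soup.strings:
    print(repr(string))
# .stripped_strings skips whitespace-only strings and strips the rest
for string in soup.stripped_strings:
    print(repr(string))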

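# Siblings of the current node; .next_element is the next node in parse order, which is not necessarily a sibling
last_a_tag = soup.find("a", id="link3")
last_a_tag.next_sibling
last_a_tag.next_element

Note: if a tag has more than one child, .string is None, because it is unclear which child's content it should refer to.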
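
The navigation attributes above can be tried on a small self-contained document. The markup below is a shortened stand-in invented for this sketch, written without line breaks so that no whitespace-only strings appear in the tree.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="story">'
    '<a id="link1" href="http://example.com/elsie">Elsie</a>, '
    '<a id="link2" href="http://example.com/lacie">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

head_tag = soup.head
title_tag = head_tag.contents[0]      # the only child of <head>
link = soup.find("a", id="link1")

# Names of every ancestor, from the nearest up to the document itself.
parent_names = [parent.name for parent in link.parents]

# .string follows a chain of single children, but is None when a tag
# has several children.
head_string = head_tag.string
p_string = soup.p.string

# .next_sibling stays on the same level; .next_element follows parse order.
after_link = link.next_sibling        # the ", " between the two links
inside_link = link.next_element       # the "Elsie" string inside the link

texts = list(soup.stripped_strings)

print(parent_names)
print(repr(head_string), repr(p_string))
print(repr(after_link), repr(inside_link))
print(texts)
```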

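Getting the text inside a tag, returned as a Unicode string

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()    # '\nI linked to example.com\n'
soup.i.get_text()  # 'example.com'

# Join the pieces of text with a separator
soup.get_text("|")  # '\nI linked to |example.com|\n'
# Strip whitespace from the start and end of each piece
soup.get_text("|", strip=True)  # 'I linked to|example.com'
[text for text in soup.stripped_strings]  # ['I linked to', 'example.com']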

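Searching the tree

1. find_all() searches all descendants of the current tag
find_all(name, attrs, recursive, string, limit, **kwargs)
name: finds every tag with that name; accepts a string, a regular expression, a list, or True.
keyword arguments: filter on attributes such as id or href.
string: searches the string content of the document instead of tags.
limit: caps the number of results returned.
recursive=False: searches only the tag's direct children.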

import re
from bs4 import BeautifulSoup

# html_doc is the HTML string to parse; the outputs below assume the "three sisters" sample from the official documentation
soup = BeautifulSoup(html_doc, 'html.parser')
# Strings
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.find_all("p", "title")  # [<p class="title"><b>The Dormouse's story</b></p>]
soup.find_all(string="Elsie")  # returns the matching strings
soup.find_all("a", string="Elsie")  # returns the matching <a> tags
soup.find_all(string=re.compile("Dormouse"))

# Regular expressions
soup.find(string=re.compile("sisters"))
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# A list matches anything that matches any one of its items
soup.find_all(["a", "b"])
# True matches every tag, but no string nodes
for tag in soup.find_all(True):
    print(tag.name)

# Searching by CSS class
soup.find_all("a", class_="sister")
soup.find_all(class_=re.compile("itl"))
soup.find_all("a", attrs={"class": "sister"})

# Note: these two lines are equivalent, because calling a tag is shorthand for find_all()
soup.title.find_all(string=True)
soup.title(string=True)
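
For a version of these calls that runs on its own, the sketch below defines `html_doc` inline, using a shortened form of the sample document from the official documentation, and exercises the `name`, keyword, `string`, `limit`, and `recursive` arguments.

```python
import re

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""
soup = BeautifulSoup(html_doc, "html.parser")

# name: every <a> tag.
link_ids = [a["id"] for a in soup.find_all("a")]

# limit: stop after the first two matches.
first_two = [a["id"] for a in soup.find_all("a", limit=2)]

# keyword argument: filter on the id attribute.
by_id = soup.find_all(id="link2")

# string: returns the matching strings, not the tags around them.
elsie_strings = soup.find_all(string="Elsie")

# regular expression on the tag name: names starting with "b".
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# recursive=False: <title> is a grandchild of <html>, so nothing matches.
direct_titles = soup.html.find_all("title", recursive=False)

print(link_ids, first_two, b_names, direct_titles)
```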

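2. find() returns only one result
find(name, attrs, recursive, string, **kwargs)

# These two lines are nearly equivalent
soup.find('title')
soup.find_all('title', limit=1)

The only difference is that find_all() returns a list containing a single element, whereas find() returns the element itself. When nothing matches, find_all() returns an empty list and find() returns None.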
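
Because find() can return None, calling a method on its result without a check is a common source of AttributeError. A minimal sketch of the difference, using markup invented for illustration:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body></body></html>",
    "html.parser",
)

title = soup.find("title")                   # the Tag itself
title_list = soup.find_all("title", limit=1) # a one-element list

missing = soup.find("table")                 # None: no <table> in the document
missing_list = soup.find_all("table")        # empty list

# Guard against None before using the result of find().
table_text = missing.get_text() if missing is not None else ""

print(title, title_list, missing, missing_list, repr(table_text))
```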

3. CSS selectors

soup.select("title")
# Descend through tags level by level
soup.select("html head title")  # [<title>The Dormouse's story</title>]
# Direct children of a tag
soup.select("head > title")  # [<title>The Dormouse's story</title>]
# Sibling tags
soup.select("#link1 ~ .sister")  # every later sibling with class sister (link2 and link3)
soup.select("#link1 + .sister")  # [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
# By CSS class; the following two lines are equivalent
soup.select("[class~=sister]")
soup.select(".sister")
# By the presence of an attribute
soup.select('a[href]')
# By the value of an attribute
soup.select('a[href="http://example.com/elsie"]')
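
The selectors above can be checked against a small document. This sketch uses invented markup and a hypothetical helper `ids` to keep the output short; it also shows `select_one()`, which returns the first match or None instead of a list.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>Demo</title></head><body><p class=\"story\">"
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and '
    '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Hypothetical helper: reduce a list of tags to their id attributes."""
    return [tag["id"] for tag in tags]


child_titles = soup.select("head > title")           # direct child
later_siblings = ids(soup.select("#link1 ~ .sister"))  # all later siblings
adjacent = ids(soup.select("#link1 + .sister"))        # the next sibling only
with_href = ids(soup.select("a[href]"))
exact_href = ids(soup.select('a[href="http://example.com/elsie"]'))

first_sister = soup.select_one(".sister")  # first match
no_table = soup.select_one("table")        # None when nothing matches

print(later_siblings, adjacent, with_href, exact_href)
```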

4. Other search methods
find_parents() and find_parent()
find_next_siblings() and find_next_sibling()
find_previous_siblings() and find_previous_sibling()
find_all_next() and find_next()
find_all_previous() and find_previous()
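
These methods take the same filter arguments as find_all() and find(), but search in a different direction from the starting node. A minimal sketch on invented markup, starting from the middle link:

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    "</p>"
)
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

# Upwards: the nearest enclosing <p>.
parent_class = link2.find_parent("p")["class"]

# Sideways: siblings on the same level of the tree.
next_id = link2.find_next_sibling("a")["id"]
prev_id = link2.find_previous_sibling("a")["id"]

# In parse order: everything after (or before) the node, children included.
strings_after = link2.find_all_next(string=True)
next_a = link2.find_next("a")["id"]
prev_a = link2.find_previous("a")["id"]

print(parent_class, next_id, prev_id, strings_after, next_a, prev_a)
```

Note that find_all_next() walks the document in parse order, so its results start with the string inside the starting tag itself.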

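This article lists only some commonly used functions and concepts. For other specific topics, such as encoding issues and common errors, see the official documentation and related material:

1. Beautiful Soup official documentation
2. CSS selectors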
