bs4

Latest recommended article published 2024-01-23 17:55:23

陈起之 (no longer working in the IT industry)

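Note: bs4 (Beautiful Soup 4) is a powerful parsing library. It parses a web page by working with the page's structure and the attributes of its nodes.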

bs4 code is very concise.
Example:

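from bs4 import BeautifulSoup

# Sample input; any HTML string works here
html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, 'lxml')
result = soup.title.string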

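Note: html is the HTML being parsed and result is the parsed result. title selects the document's title node, and string returns only the text inside title, without the surrounding HTML tags.
Example: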

# Result without .string
<title>The Dormouse's story</title>
# Result with .string
The Dormouse's story
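
The difference between a node and its string can be checked with a minimal sketch like the one below. It assumes the built-in 'html.parser' backend instead of 'lxml' so that no extra dependency is needed; the variable names title_node, title_text, and mixed are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
# Assumption: 'html.parser' is used here only to avoid the lxml dependency
soup = BeautifulSoup(html, "html.parser")

title_node = soup.title          # the whole <title> node, tags included
title_text = soup.title.string   # only the text inside the node

print(title_node)   # <title>The Dormouse's story</title>
print(title_text)   # The Dormouse's story

# .string is None when a node has more than one child,
# because there is no single string to return
mixed = BeautifulSoup("<p>a<b>b</b></p>", "html.parser")
print(mixed.p.string)   # None
```

When a node mixes text and nested tags, get_text() is the usual way to collect all of the text at once.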

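Getting node attributes

Note: once a node is selected, bs4 can also read that node's attributes.
Example: getting the class attribute of the p tag.

Method 1: use the attrs dictionary: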

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story ppppp</b></p>
<p class="story">Once upon a time there were three little sister</p>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.p.attrs['class'])	# ['title'] (class can hold several values, so a list is returned)

Method 2: index the node directly with square brackets:

from bs4 import BeautifulSoup
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story ppppp</b></p>
<p class="story">Once upon a time there were three little sister</p>
"""
soup = BeautifulSoup(html, 'lxml')

print(soup.p['class'])	# ['title'] (class can hold several values, so a list is returned)
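
Both methods raise KeyError when the attribute does not exist. A minimal sketch of the safer get() lookup follows; it assumes the built-in 'html.parser' backend and a shortened version of the HTML above.

```python
from bs4 import BeautifulSoup

html = '<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
# Assumption: 'html.parser' is used here only to avoid the lxml dependency
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p["class"])    # ['title']  (multi-valued attribute, returned as a list)
print(p["name"])     # dromouse   (the HTML attribute called "name")
print(p.name)        # p          (the tag name, not the attribute)
print(p.get("id"))   # None       (missing attribute, no exception)

try:
    p["id"]          # square-bracket access on a missing attribute
except KeyError:
    print("p has no id attribute")
```

Note that p.name is the tag name, while p["name"] is the HTML attribute of the same name.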

Several bs4 navigation attributes

Note: every example in the table below is based on the sample code that follows the table.

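Name	Purpose	Example	What the example does	Notes
contents	Get the direct child nodes	soup.p.contents	Returns the direct children of the first p node
children	Get the direct child nodes	soup.p.children	Returns the direct children of the first p node	contents returns a list; children returns an iterator
descendants	Get all descendant nodes	soup.p.descendants	Returns every node nested under the first p node	Unlike children, descendants also walks into nested tags and yields each nested tag and text node; children stops at the direct children
parents	Get all ancestors of a node	soup.a.parents	Returns each ancestor of the first a node, from its parent up to the document root	Returns a generator; use parent (singular) for the direct parent only
next_sibling	Get the next sibling of a node	soup.a.next_sibling	Returns the node that immediately follows the first a node at the same level
previous_siblings	Get all preceding siblings of a node	soup.a.previous_siblings	Returns the nodes that come before the first a node at the same level	Returns a generator; use previous_sibling (singular) for the nearest one only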

Sample code:

from bs4 import BeautifulSoup
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
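
The attributes in the table can be tried out with a sketch like the following. To keep the results easy to read it assumes a compact, whitespace-free version of the story paragraph and the built-in 'html.parser' backend; with the indented HTML above, the whitespace between tags also shows up as text nodes and siblings.

```python
from bs4 import BeautifulSoup, Tag

# Assumption: a compact variant of the sample paragraph, without indentation
html = (
    '<p class="story">Once '
    '<a class="sister" id="link1"><span>Elsie</span></a>, '
    '<a class="sister" id="link2">Lacie</a> and '
    '<a class="sister" id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
p = soup.p

direct = p.contents                  # list of direct children
children_list = list(p.children)     # same nodes, produced by an iterator
all_nodes = list(p.descendants)      # also includes <span> and nested text

# Ancestors of the first <a>, from nearest to the document root
ancestor_names = [parent.name for parent in soup.a.parents]

# The node right after the first <a> is a text node, not the next <a>
after_first_a = soup.a.next_sibling

# Tags that come before the last <a>, nearest first
earlier_links = [
    node["id"]
    for node in soup.find(id="link3").previous_siblings
    if isinstance(node, Tag)
]

print(len(direct), len(all_nodes))   # 7 11
print(ancestor_names)                # ['p', '[document]']
print(repr(after_first_a))           # ', '
print(earlier_links)                 # ['link2', 'link1']
```

Sibling attributes return text nodes as well as tags, which is why the isinstance check is used to keep only tags.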

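Method selectors

find_all:
Note: finds all elements that match the given conditions.
Syntax: find_all(name, attrs, recursive, text, **kwargs). In Beautiful Soup 4.4.0 and later the text argument is called string; text is the older name for the same argument.

The parameters are as follows (again using the sample code above):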

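Name	Description	Example	What the example does
name	Tag name to match	soup.find_all(name='a')	Finds all a nodes
attrs	Attributes to match	soup.find_all(attrs={'id': 'link1'})	Finds the nodes whose id attribute is link1
text	Matches the text of a node; accepts a string or a regular expression	soup.find_all(text=re.compile('sisters'))	Returns the text nodes that contain "sisters"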
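
A minimal sketch of these find_all parameters in use is shown below. It assumes a compact version of the story paragraph, the built-in 'html.parser' backend, and the newer string keyword in place of text.

```python
import re

from bs4 import BeautifulSoup

# Assumption: a compact variant of the sample paragraph, without indentation
html = (
    '<p class="story">Once '
    '<a class="sister" id="link1"><span>Elsie</span></a>, '
    '<a class="sister" id="link2">Lacie</a> and '
    '<a class="sister" id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

# name: match by tag name
links = soup.find_all(name="a")

# attrs: match by attribute values
by_id = soup.find_all(attrs={"id": "link1"})

# string (formerly text): match text nodes, here with a regular expression
texts = soup.find_all(string=re.compile("ie"))

# recursive=False: look only at direct children of the node
nested_spans = soup.p.find_all("span", recursive=False)

print([a["id"] for a in links])   # ['link1', 'link2', 'link3']
print(len(by_id))                 # 1
print(texts)                      # ['Elsie', 'Lacie', 'Tillie']
print(nested_spans)               # []
```

The last call returns an empty list because the span sits inside an a node and is therefore not a direct child of p.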

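Other similar methods:

Name	Description
find()	Returns the first element that matches the conditions
find_parents()	Returns all matching ancestor nodes
find_parent()	Returns the direct parent node
find_next_siblings()	Returns all following sibling nodes
find_next_sibling()	Returns the next sibling node
find_previous_siblings()	Returns all preceding sibling nodes
find_previous_sibling()	Returns the previous sibling node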
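
The methods in the table can be exercised with a sketch like this one. It assumes a compact version of the story paragraph and the built-in 'html.parser' backend, and it passes a tag name to each method so that text nodes are skipped.

```python
from bs4 import BeautifulSoup

# Assumption: a compact variant of the sample paragraph, without indentation
html = (
    '<p class="story">Once '
    '<a class="sister" id="link1"><span>Elsie</span></a>, '
    '<a class="sister" id="link2">Lacie</a> and '
    '<a class="sister" id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")        # first match only
last_link = soup.find(id="link3")  # keyword arguments filter on attributes
span = soup.find("span")

# Upwards: nearest matching ancestor, or all matching ancestors
story = span.find_parent("p")
p_ancestors = span.find_parents("p")

# Sideways: siblings after or before a node, filtered to <a> tags
next_id = first_link.find_next_sibling("a")["id"]
later_ids = [a["id"] for a in first_link.find_next_siblings("a")]
prev_id = last_link.find_previous_sibling("a")["id"]
earlier_ids = [a["id"] for a in last_link.find_previous_siblings("a")]

print(first_link["id"])   # link1
print(story["class"])     # ['story']
print(next_id)            # link2
print(later_ids)          # ['link2', 'link3']
print(prev_id)            # link2
print(earlier_ids)        # ['link2', 'link1']
```

find() returns None when nothing matches, so its result is worth checking before indexing into it.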