Finding HTML Content with the bs4 Library

OneTwoThreeGo-1-2

Last modified 2023-04-01 01:51:52

First published 2023-04-01 01:49:15

Article link: https://blog.csdn.net/LXX_1991/article/details/129891481

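This article explains how to use Python's BeautifulSoup library to find content in an HTML document: by tag name, by attribute, with or without recursion, and by string, using either exact or fuzzy matching. It also shows how to combine these searches with regular expressions and how to use find_all, find, and the related methods.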

I. Methods in the bs4 library for finding HTML content:

Method	Description
<tag>.find_all()	Returns a list (a ResultSet, which is a list subclass) holding the search results

Shorthand forms:
<tag>() is equivalent to <tag>.find_all()
soup() is equivalent to soup.find_all()
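
The shorthand can be checked with a minimal sketch. The HTML string below is a trimmed stand-in (an assumption, not the page used later in this article) so that the example runs without network access.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption) so the sketch runs offline.
html = """<html><body>
<p class="course">Courses:
<a class="py1" href="#basic" id="link1">Basic Python</a> and
<a class="py2" href="#advanced" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object directly forwards to find_all().
via_call = soup("a")
via_method = soup.find_all("a")
same_for_soup = via_call == via_method

# The same shorthand works on any Tag, here the <p> element.
p = soup.p
same_for_tag = p("a") == p.find_all("a")

print(same_for_soup, same_for_tag, len(via_call))
```

The shorthand saves typing, but the explicit find_all() spelling is usually easier to read in longer scripts.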

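Parameters of find_all:

Parameter	Description
name	String to match against tag names
attrs	String to match against tag attribute values; a specific attribute can be named
recursive	Whether to search all descendants; defaults to True
string	String to match against the text between a tag's opening and closing tags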

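Each of these can be combined with a regular expression for fuzzy matching, which requires importing the re module.

First, load the HTML content: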

from bs4 import BeautifulSoup
import requests
import re  # regular expression library

r = requests.get("https://python123.io/ws/demo.html")
demo = r.text

soup = BeautifulSoup(demo, "html.parser")

1. name: string to match against tag names

Print all a tags:

print(soup.find_all('a'))

Output:

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

Passing the tag names a and b as a list returns both the a tags and the b tags:

print(soup.find_all(['a', 'b']))

Output:

[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

Passing True matches every tag, so this prints all tag names:

for i in soup.find_all(True):
    print(i.name)

Output:

html
head
title
body
p
b
p
a
a

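Using a regular expression, find all tags whose name starts with b. BeautifulSoup applies the pattern with search(), so an unanchored 'b' would match any tag name that contains b; the ^ anchor restricts the match to names that start with b:

import re
for i in soup.find_all(re.compile('^b')):
    print(i.name)

Output: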

body
b
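
The difference between an anchored and an unanchored pattern does not show up on the demo page, where body and b are the only tag names containing the letter b. A minimal sketch on a made-up document (the table markup is an assumption chosen only to make the difference visible):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with tag names that contain "b" without starting with it.
html = "<html><body><table><tbody><tr><td><b>x</b></td></tr></tbody></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches any tag name containing "b".
contains_b = [t.name for t in soup.find_all(re.compile("b"))]

# Anchored with ^: matches only tag names that start with "b".
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]

print(contains_b)     # ['body', 'table', 'tbody', 'b']
print(starts_with_b)  # ['body', 'b']
```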

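2. attrs: string to match against tag attribute values; a specific attribute can be named

Find p tags whose class is course. A plain string passed as the second positional argument is matched against the tag's CSS class:

print(soup.find_all('p', 'course'))

Output: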

[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
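
The positional form above can also be written with an explicit keyword. A minimal sketch, assuming a trimmed stand-in for the demo page, that compares the three spellings:

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption) so the sketch runs offline.
html = """<body>
<p class="title"><b>Intro</b></p>
<p class="course">Courses: <a class="py1" id="link1" href="#basic">Basic Python</a></p>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# A bare string as the second argument is matched against the CSS class.
positional = soup.find_all("p", "course")

# class is a reserved word in Python, so the keyword is spelled class_.
keyword = soup.find_all("p", class_="course")

# The attrs dictionary names the attribute explicitly.
attrs_dict = soup.find_all("p", attrs={"class": "course"})

print(positional == keyword == attrs_dict)  # True
print(len(positional))                      # 1
```

The keyword and dictionary forms make it clearer to a reader which attribute is being tested.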

Search by a named attribute: find the tags whose id attribute equals link1:

print(soup.find_all(id='link1'))

Output:

[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]

Using a regular expression, find the tags whose id contains link:

import re
for i in soup.find_all(id=re.compile('link')):
    print(i)

Output:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
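
Attribute filters can be combined, and a regular expression can be applied to any attribute, not only id. A minimal sketch; the third link in the stand-in HTML is hypothetical and is there only to show a tag being filtered out:

```python
import re
from bs4 import BeautifulSoup

# Stand-in HTML; the "Back to top" link is a made-up addition.
html = """<p class="course">
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
<a class="py2" href="#top">Back to top</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# Every filter must match: tag name a, id containing "link", class py2.
matches = soup.find_all("a", id=re.compile("link"), class_="py2")
ids = [a["id"] for a in matches]

# The same idea on another attribute: href values that end in digits.
numbered = [a["id"] for a in soup.find_all("a", href=re.compile(r"\d+$"))]

print(ids)       # ['link2']
print(numbered)  # ['link1', 'link2']
```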

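3. recursive: whether to search all descendants; defaults to True

Setting it to False searches only direct children. The a tags sit deeper in the tree (inside p), so none of them is a direct child of soup and the second result is empty: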

print("recursive is True:")
print(soup.find_all('a'))

print("recursive is False:")
# Only the direct children are checked for a tags
print(soup.find_all('a', recursive=False))

Output:

recursive is True:
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
recursive is False:
[]
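
To see recursive=False return something, the search has to start from the element that directly contains the target tags. A minimal sketch on a stand-in document; the span-wrapped link3 is a hypothetical addition that makes the direct-child restriction visible:

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption); link3 is nested one level deeper than link1.
html = """<html><body>
<p class="course">Courses: <a id="link1" href="#basic">Basic Python</a>
<span><a id="link3" href="#extra">Extra</a></span></p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# The soup object's only direct child tag is <html>.
top_level = [t.name for t in soup.find_all(True, recursive=False)]

# Starting from soup, no <a> is a direct child.
from_soup = soup.find_all("a", recursive=False)

# Starting from the <p>, only link1 is a direct child; link3 is a grandchild.
p = soup.find("p", "course")
direct = [a["id"] for a in p.find_all("a", recursive=False)]
nested = [a["id"] for a in p.find_all("a")]

print(top_level)  # ['html']
print(from_soup)  # []
print(direct)     # ['link1']
print(nested)     # ['link1', 'link3']
```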

4. string: string to match against the text between a tag's opening and closing tags

print(soup.find_all(string='Basic Python'))

Output (this is an exact match, so the whole string must be given precisely):

['Basic Python']

Using a regular expression, find all strings that contain Python:

import re
print(soup.find_all(string=re.compile('Python')))

Output:

['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
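
A string search on its own returns the text nodes, not the tags that hold them. A minimal sketch, again on a trimmed stand-in (an assumption), showing how to get back to the enclosing tag and how adding a tag name changes what is returned:

```python
import re
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption) so the sketch runs offline.
html = """<p class="course">Python is a wonderful general-purpose programming language.
<a id="link1" href="#basic">Basic Python</a> and <a id="link2" href="#advanced">Advanced Python</a>.</p>"""
soup = BeautifulSoup(html, "html.parser")

# string alone returns text nodes; .parent gives the tag that holds each one.
strings = soup.find_all(string=re.compile("Python"))
parents = [s.parent.name for s in strings]

# Combined with a tag name, the result is the tags whose text matches.
link_ids = [a["id"] for a in soup.find_all("a", string=re.compile("Python"))]

# A plain string must equal the whole text node, so "Python" alone finds nothing.
exact = soup.find_all(string="Python")

print(parents)   # ['p', 'a', 'a']
print(link_ids)  # ['link1', 'link2']
print(exact)     # []
```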

II. Seven other commonly used extension methods

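Method	Description
<tag>.find()	Searches and returns only the first result, as a single Tag (or None if nothing matches); takes the same filter arguments as <tag>.find_all()
<tag>.find_parent()	Returns the first result among the ancestor nodes, as a single Tag (or None); takes the same filter arguments as <tag>.find_all()
<tag>.find_parents()	Searches the ancestor nodes and returns a list; takes the same filter arguments as <tag>.find_all()
<tag>.find_next_siblings()	Searches the following sibling nodes and returns a list; takes the same filter arguments as <tag>.find_all()
<tag>.find_next_sibling()	Returns the first result among the following sibling nodes, as a single result (or None); takes the same filter arguments as <tag>.find_all()
<tag>.find_previous_siblings()	Searches the preceding sibling nodes and returns a list; takes the same filter arguments as <tag>.find_all()
<tag>.find_previous_sibling()	Returns the first result among the preceding sibling nodes, as a single result (or None); takes the same filter arguments as <tag>.find_all()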
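
A minimal sketch of these methods on a trimmed stand-in for the demo page (the HTML string is an assumption). It uses the find_next_sibling and find_previous_sibling family of method names; the bare next_sibling and previous_sibling names are attributes in bs4 and do not accept filter arguments.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (an assumption) so the sketch runs offline.
html = """<html><body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Courses:
<a class="py1" href="#basic" id="link1">Basic Python</a> and <a class="py2" href="#advanced" id="link2">Advanced Python</a>.</p>
</body></html>"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match as a Tag, or None when nothing matches.
first_link = soup.find("a")
missing = soup.find("table")

# Ancestors: the nearest matching one, then every matching one (nearest first).
parent_class = first_link.find_parent("p")["class"]
ancestors = [t.name for t in first_link.find_parents(["p", "body"])]

# Siblings after link1 and before link2, filtered by tag name.
next_id = first_link.find_next_sibling("a")["id"]
later_ids = [a["id"] for a in first_link.find_next_siblings("a")]
second_link = soup.find(id="link2")
prev_id = second_link.find_previous_sibling("a")["id"]
earlier_ids = [a["id"] for a in second_link.find_previous_siblings("a")]

print(type(first_link).__name__, missing)  # Tag None
print(parent_class, ancestors)             # ['course'] ['p', 'body']
print(next_id, later_ids)                  # link2 ['link2']
print(prev_id, earlier_ids)                # link1 ['link1']
```

Because the single-result methods return None when nothing matches, it is worth checking the result before indexing into it.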