Introduction to bs4 (Beautiful Soup 4)


洗手不上厕所


Article link: https://blog.csdn.net/weixin_50560109/article/details/119208393


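Purpose:

Beautiful Soup 4 (bs4) parses a complex HTML document into a tree in which every node is a Python object. The official documentation names four main kinds of object (Tag, NavigableString, BeautifulSoup, and Comment); these notes also cover a tag's attribute dictionary, giving the five items below:

1. Tag: access a tag by name to get the tag together with its contents (if the tag occurs more than once, only the first one is returned)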

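from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:      # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")     # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# 1. Tag: access a tag by name (if the tag occurs more than once, only the first one is returned)
print(bs.title)                # returns the tag and its contents; the type is Tag


# Output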
<title>百度一下，你就知道</title>

(Screenshot omitted: the title tag in the original HTML document.)
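To make the "first match only" behaviour easy to check without baidu.html, here is a minimal sketch that parses a small inline document. The HTML string and the variable names are made up for illustration; only the BeautifulSoup calls come from the library.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for baidu.html, small enough to read at a glance.
html = (
    "<html><head><title>Demo page</title></head>"
    '<body><a href="/first">first</a><a href="/second">second</a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title           # attribute access returns the first <title> tag
first_link = soup.a              # only the first of the two <a> tags is returned

print(type(title_tag).__name__)  # Tag
print(title_tag.name)            # title
print(first_link)                # <a href="/first">first</a>
```

To get every matching tag rather than the first one, use find_all(), which is covered in the search section.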

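2. NavigableString: the string inside a tag, read through the tag's .string attribute; the return type is NavigableString

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:      # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")     # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# 2. NavigableString: the string inside the tag; the return type is NavigableString
print(bs.title.string)


# Output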
百度一下，你就知道

(Screenshot omitted: the string inside the title tag in the original HTML document.)
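One detail worth knowing about .string is that it only works when the tag has a single child. The sketch below uses a made-up HTML fragment to show this, along with get_text() as an alternative that joins all nested text.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment: <p> holds one string, <div> holds a mix of text and tags.
html = "<p>plain text</p><div>mixed <b>bold</b> text</div>"
soup = BeautifulSoup(html, "html.parser")

text = soup.p.string
print(isinstance(text, NavigableString))  # True
print(text)                               # plain text

print(soup.div.string)      # None, because <div> has more than one child
print(soup.div.get_text())  # mixed bold text
```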

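3. dict: the attributes of a tag, read through the tag's .attrs attribute; the return type is dict (if the tag occurs more than once, only the first one is used)

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:      # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")     # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# 3. dict: the attributes of the first <a> tag; the return type is dict
print(bs.a.attrs)


# Output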
{'href': 'http://news.baidu.com', 'target': '_blank', 'class': ['mnav', 'c-font-normal', 'c-color-t']}

(Screenshot omitted: the first a tag in the original HTML document.)
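Individual attributes can also be read directly from the tag. The following sketch assumes a single made-up link similar to the one in the output above; it shows subscript access, the safer get() lookup, and the fact that class comes back as a list because it may hold several values.

```python
from bs4 import BeautifulSoup

# Hypothetical link modelled on the first <a> tag of the page.
html = '<a class="mnav c-font-normal" href="http://news.baidu.com" target="_blank">news</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

print(isinstance(link.attrs, dict))  # True
print(link["href"])                  # http://news.baidu.com
print(link.get("id"))                # None, get() does not raise for a missing attribute
print(link["class"])                 # ['mnav', 'c-font-normal']
```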

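4. BeautifulSoup: the object that represents the whole document; the return type is BeautifulSoup

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:      # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")     # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# 4. BeautifulSoup: the whole document; the return type is BeautifulSoup
print(bs)


# Output
Note: the output is the entire HTML document; it is too long to show here.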
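Since the full page is too long to print, a short inline document is enough to see what the BeautifulSoup object looks like. This is a sketch with a made-up HTML string; it shows the object's type, its special name, and prettify(), which returns an indented version of the markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document.
html = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # BeautifulSoup
print(soup.name)            # [document]
print(soup.prettify())      # the same markup, one tag per line and indented
```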

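5. Comment: a special kind of NavigableString. When the only content of a tag is an HTML comment, .string returns a Comment object, and printing it shows the comment text without the <!-- --> markers

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:      # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")     # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# 5. Comment: if the content of the first <a> tag is an HTML comment, .string is a Comment and prints without the comment markers
print(bs.a.string)


# Output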
新闻
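Because a Comment prints exactly like ordinary text, the only reliable way to tell the two apart is a type check. The sketch below assumes a made-up fragment with one commented-out link text and one normal link text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment: the first link contains only an HTML comment.
html = '<a href="/news"><!--news--></a><a href="/map">map</a>'
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("a")

print(first.string)                        # news (no comment markers)
print(isinstance(first.string, Comment))   # True
print(isinstance(second.string, Comment))  # False
print(first)                               # <a href="/news"><!--news--></a>
```

A scraper that should skip commented-out text can filter on isinstance(value, Comment) before using the string.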

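Usage:

1. Traversing the document

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:      # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")     # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# Traversing the document
print(bs.head.contents)             # .contents is a list of the direct children of <head>
print(bs.head.contents[1])          # the second element of that list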
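Besides .contents, a tag exposes a few other navigation attributes. The sketch below uses a made-up document written on one line, so no whitespace strings appear between the tags; in a real file, line breaks between tags show up in .contents as extra string entries.

```python
from bs4 import BeautifulSoup

# Hypothetical document with no whitespace between tags.
html = "<html><head><meta charset='utf-8'/><title>Demo</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

child_names = [child.name for child in head.contents]
print(child_names)                   # ['meta', 'title']
print(head.contents[1])              # <title>Demo</title>
print(soup.title.parent.name)        # head
print(len(list(head.descendants)))   # 3: <meta>, <title>, and the string "Demo"
```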

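2. Searching the document

1) find_all(): find every match

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:   # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# 1) find_all(): find every match
t_list = bs.find_all("head")             # a string filter matches tags whose name is exactly "head"; the result behaves like a list
print(t_list)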
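The exact-match rule is easiest to see with tag names that share letters. In this sketch (the fragment and variable names are made up), "a" does not match article, and passing a list matches any of the listed names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with similar tag names.
html = "<div><a>one</a><b>two</b><a>three</a><article>four</article></div>"
soup = BeautifulSoup(html, "html.parser")

only_a = soup.find_all("a")          # exact name match, so <article> is not included
a_or_b = soup.find_all(["a", "b"])   # a list matches any of the given names

print(len(only_a))   # 2
print(len(a_or_b))   # 3
print(soup.find("a"))  # <a>one</a>, find() returns only the first match
```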

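2) Regular expression search: pass a compiled pattern; Beautiful Soup tests it against each tag name with the pattern's search() method, so every tag whose name contains a match is returned

from bs4 import BeautifulSoup
import re

with open("baidu.html", "rb") as file:      # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")     # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

t_list = bs.find_all(re.compile("a"))       # every tag whose name contains the letter "a" (for example a, head, span)
print(t_list)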
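Because search() matches anywhere in the name, a bare pattern like "a" is broader than it looks. The sketch below, on a made-up document, compares it with an anchored pattern that matches only the a tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document with several tag names.
html = "<html><head><title>t</title></head><body><a>x</a><span>y</span><p>z</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

contains_a = [tag.name for tag in soup.find_all(re.compile("a"))]
exactly_a = [tag.name for tag in soup.find_all(re.compile("^a$"))]

print(contains_a)  # ['head', 'a', 'span']
print(exactly_a)   # ['a']
```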

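3) Function: pass a function, and find_all() returns the tags for which that function returns True

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:   # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

def name_is_exists(tag):
    return tag.has_attr("name")          # True when the tag has a name attribute

t_list = bs.find_all(name_is_exists)     # pass the function itself without calling it; find_all() calls it once per tag and supplies the tag argument
print(t_list)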
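The reason no argument is written after the function name is that find_all() is handed the function object and does the calling itself. The sketch below, using a made-up form, shows that the result is the same as calling the function by hand on every tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two of the three tags carry a name attribute.
html = '<form><input name="q"/><input id="btn"/><meta name="desc" content="x"/></form>'
soup = BeautifulSoup(html, "html.parser")

def has_name(tag):
    # find_all() passes each Tag in the document to this function in turn.
    return tag.has_attr("name")

via_callback = soup.find_all(has_name)
# Roughly what find_all() does internally: visit every tag and keep the ones that pass.
by_hand = [tag for tag in soup.find_all(True) if has_name(tag)]

print([tag.name for tag in via_callback])  # ['input', 'meta']
print(via_callback == by_hand)             # True
```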

4) kwargs: search by keyword arguments

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:   # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

t_list = bs.find_all(id="head")                     # tags whose id is "head" (each is returned with its children)
t_list = bs.find_all(href="http://news.baidu.com")  # tags whose href is exactly this URL
t_list = bs.find_all(class_=True)                   # every tag that has a class attribute
t_list = bs.find_all(string="hao123")               # strings equal to "hao123"; this argument was named text before bs4 4.4.0
t_list = bs.find_all("a", limit=3)                  # <a> tags, stopping after the first 3 matches
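The keyword filters above can be tried on a small fragment to see what each one returns. This is a sketch with made-up markup; it assumes bs4 4.4.0 or later for the string argument.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the navigation links of the page.
html = (
    '<div id="head">'
    '<a class="mnav" href="http://news.baidu.com">news</a>'
    '<a class="mnav" href="https://www.hao123.com">hao123</a>'
    '<a href="/map">map</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="head")             # the single <div>
with_class = soup.find_all(class_=True)      # the two links that have a class
by_string = soup.find_all(string="hao123")   # the matching string, not the tag
limited = soup.find_all("a", limit=2)        # stops after two <a> tags

print(len(by_id), len(with_class), len(limited))  # 1 2 2
print(by_string)                                  # ['hao123']
```

Note that class_ carries a trailing underscore because class is a reserved word in Python.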

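5) CSS selectors

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:   # open an HTML file in binary mode
    html = file.read()
bs = BeautifulSoup(html, "html.parser")  # create a BeautifulSoup object; the first argument is the markup to parse, the second names the parser

# 5) CSS selectors
t_list = bs.select("title")             # select by tag name
t_list = bs.select(".mnav")             # select by class name (no space after the dot)
t_list = bs.select("#u1")               # select by id
t_list = bs.select("a[class='bri']")    # select by attribute (<a> tags whose class attribute is "bri")
t_list = bs.select("head > title")      # select by child relationship (<title> tags directly inside <head>)
t_list = bs.select(".mnav ~ .bri")      # select by sibling relationship (.bri elements that follow a .mnav sibling)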
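The selectors above can be checked against a small fragment. The markup below is made up and only borrows the class names and id from the examples; select() relies on the soupsieve package, which is installed together with recent versions of bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment reusing the class names and id from the selectors above.
html = (
    "<html><head><title>Demo</title></head><body>"
    '<div id="u1">'
    '<a class="mnav" href="/news">news</a>'
    '<a class="mnav" href="/map">map</a>'
    '<a class="bri" href="/more">more</a>'
    "</div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select("title")))           # 1
print(len(soup.select(".mnav")))           # 2
print(len(soup.select("#u1")))             # 1
print(len(soup.select("a[class='bri']")))  # 1
print(len(soup.select("head > title")))    # 1
print(len(soup.select(".mnav ~ .bri")))    # 1
print(soup.select_one(".mnav")["href"])    # /news, select_one() returns only the first match
```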
