(2) Learning the bs4 Module

难得 yx

Category: Python crawlers

Article link: https://blog.csdn.net/weixin_45649763/article/details/107388422

Contents

Introduction
Applications
Using a regular expression to find content containing specific text
The limit parameter
CSS selectors

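Introduction

Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice and provides idiomatic ways to navigate, search, and modify the document. It can save you hours or even days of work.

Beautiful Soup turns a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.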
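To make the four types concrete, here is a minimal sketch. The inline HTML string and the variable names are made up for illustration and are not taken from baidu.html.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical inline document standing in for a saved HTML file
html = (
    "<html><head><title>Demo</title></head>"
    '<body><a href="/news"><!--News--></a><p>Hello</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title          # a Tag: the element including its markup
title_text = soup.title.string  # a NavigableString: the text inside the tag
comment = soup.a.string         # a Comment: a special kind of NavigableString

print(type(soup).__name__)        # BeautifulSoup
print(type(title_tag).__name__)   # Tag
print(type(title_text).__name__)  # NavigableString
print(type(comment).__name__)     # Comment
```

Because Comment is a subclass of NavigableString, an isinstance check against NavigableString is true for both kinds of string.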

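Extracting a tag together with its contents

### Tag
from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
print(bs.title)  # the first <title> tag, including the tag itself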

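from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
print(bs.head)  # head here means the <head> tag; this prints the tag and everything inside it
print(type(bs.head))  # the object is a Tag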

Note:

Accessing a tag by name this way (for example bs.head or bs.a) returns only the first matching tag in the document.
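A small sketch of that behaviour, using a made-up HTML string with three links: attribute-style access stops at the first match, while find_all collects every match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three <a> tags
html = '<div><a href="/1">one</a><a href="/2">two</a><a href="/3">three</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # only the first <a> in the document
all_links = soup.find_all("a")  # a list of every <a>

print(first_link)      # <a href="/1">one</a>
print(len(all_links))  # 3
```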

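Extracting only the content (string) inside a tag

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
print(bs.title.string)  # only the text inside the tag, without the tag itself
print(type(bs.title.string))  # the object is a NavigableString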


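Getting a tag's attribute values (stored as a dictionary)

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
print(bs.a.attrs)  # the attributes of the first <a> tag, as a dictionary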
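Since attrs is an ordinary dictionary, single attributes can be read from it or from the tag directly. The sketch below uses an invented <a> tag; the attribute values are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with several attributes
html = '<a class="mnav top" href="http://news.example.com" name="tj_news">News</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

print(link.attrs)      # all attributes as a dictionary
print(link["href"])    # index the tag like a dictionary; raises KeyError if missing
print(link.get("id"))  # .get returns None when the attribute is absent
print(link["class"])   # class is multi-valued, so it comes back as a list
```

Using .get is the safer choice when an attribute may be missing on some of the tags you iterate over.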

(Screenshot: the source file)

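Getting the document itself (the BeautifulSoup object represents the whole document)

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
print(type(bs))  # the object is a BeautifulSoup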


from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
print(bs)  # prints the whole parsed document

You can also print other attributes such as bs.name (for the BeautifulSoup object itself this is "[document]").

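Comment: the printed content does not include the comment markers

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
print(bs.a.string)
# When the first <a> holds only an HTML comment, this is a Comment, a special kind of
# NavigableString; its printed content omits the <!-- --> markers
print(type(bs.a.string))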
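Because a comment prints just like ordinary text, it can slip into scraped results unnoticed. A minimal sketch of telling the two apart with isinstance, assuming an invented pair of links where only the first contains a comment:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link holds a comment, the second holds real text
html = '<a href="/news"><!--News--></a><a href="/map">Map</a>'
soup = BeautifulSoup(html, "html.parser")

results = []
for link in soup.find_all("a"):
    text = link.string
    # Record the printed text and whether it is actually a comment
    results.append((str(text), isinstance(text, Comment)))

print(results)  # [('News', True), ('Map', False)]
```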


Applications

Traversing the document

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
# Walk the document tree
print(bs.head.contents)  # the direct children of <head>, stored and printed as a list
print(bs.head.contents[1])  # the second element of that list
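One detail worth knowing about contents: line breaks between tags are children too, which is a likely reason index 1 rather than 0 is used above. The sketch below assumes a made-up <head> with a newline after each tag, and shows how to keep only the tag children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical <head> with a line break after each tag
html = """<head>
<meta charset="utf-8"/>
<title>Demo</title>
</head>"""
soup = BeautifulSoup(html, "html.parser")

contents = soup.head.contents  # list of direct children, whitespace strings included
# .children iterates over the same nodes; keep only the ones that are tags
tag_children = [child for child in soup.head.children if isinstance(child, Tag)]

print(repr(contents[0]))               # '\n'
print(contents[1].name)                # meta
print([t.name for t in tag_children])  # ['meta', 'title']
```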


Searching the document

find_all

String filter: finds the tags whose name matches the string exactly

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
t_list = bs.find_all("a")  # every tag named exactly "a"
print(t_list)


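Regular-expression search (matching uses the search() method)

from bs4 import BeautifulSoup
import re  # the regular-expression module

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
t_list = bs.find_all(re.compile("a"))  # every tag whose name contains the letter a, with its contents
print(t_list)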
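Since the pattern is applied with search(), it matches anywhere in the tag name, so "a" also picks up tags such as head and span. A short sketch on an invented document, comparing the loose pattern with an anchored one:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document containing html, head, title, body, a and span tags
html = (
    "<html><head><title>T</title></head>"
    '<body><a href="/x">x</a><span>s</span></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag name containing the letter a
loose = [tag.name for tag in soup.find_all(re.compile("a"))]
# Anchored with ^ and $: only tags named exactly "a"
strict = [tag.name for tag in soup.find_all(re.compile("^a$"))]

print(loose)   # ['head', 'a', 'span']
print(strict)  # ['a']
```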


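Passing in a function: tags are selected according to what the function returns

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
def name_is_exists(tag):
    return tag.has_attr("name")  # True for tags that carry a name attribute
t_list = bs.find_all(name_is_exists)  # those tags are returned together with their contents
print(t_list)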


for item in t_list:
    print(item)
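A function filter is most useful when the condition cannot be written as a single string or pattern. The following sketch uses invented markup and a hypothetical helper name, has_name_but_no_href, to combine two attribute checks.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two links and one input field
html = (
    '<a name="tj_news" href="/news">News</a>'
    '<a href="/map">Map</a>'
    '<input name="q"/>'
)
soup = BeautifulSoup(html, "html.parser")

def has_name_but_no_href(tag):
    # find_all calls this once per tag and keeps the tag when it returns True
    return tag.has_attr("name") and not tag.has_attr("href")

matches = soup.find_all(has_name_but_no_href)
named = soup.find_all(lambda tag: tag.has_attr("name"))  # a lambda works as well

print([t.name for t in matches])  # ['input']
print([t.name for t in named])    # ['a', 'input']
```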


Keyword arguments (kwargs)

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
# The earlier searches passed a rule (string, regex, or function);
# a keyword argument filters on an attribute value instead
t_list = bs.find_all(id="head")  # tags whose id attribute equals "head"
for item in t_list:
    print(item)


from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
# class is a reserved word in Python, so the keyword is spelled class_;
# True matches every tag that has a class attribute, whatever its value
t_list = bs.find_all(class_=True)
for item in t_list:
    print(item)
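The keyword filters can be compared side by side in one small sketch. The markup and the data-kind attribute are invented; the attrs dictionary is used for data-kind because a name with a hyphen cannot be written as a Python keyword argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with an id, two class values, and a data-* attribute
html = (
    '<div id="head">'
    '<a class="mnav" href="/news">News</a>'
    '<a class="mnav bri" href="/more">More</a>'
    '<a href="/plain" data-kind="x">Plain</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="head")                   # attribute equals a value
by_class = soup.find_all(class_="mnav")            # matches any one of a tag's classes
with_class = soup.find_all(class_=True)            # attribute merely has to exist
by_data = soup.find_all(attrs={"data-kind": "x"})  # names that are not valid keywords

print(len(by_id), len(by_class), len(with_class), len(by_data))  # 1 2 2 1
```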


The text parameter

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
# text matches the strings themselves, so the result holds strings rather than tags
# (newer Beautiful Soup releases call this argument string)
t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
for item in t_list:
    print(item)


Using a regular expression to find content containing specific text

from bs4 import BeautifulSoup
import re

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
# A raw string keeps \d from being read as a Python escape sequence
t_list = bs.find_all(text=re.compile(r"\d"))  # strings that contain at least one digit
for item in t_list:
    print(item)
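Beautiful Soup 4.4.0 and later accept the same filter under the name string, and combining it with a tag name returns the tags instead of the bare strings. The sketch below assumes such a version is installed and uses invented markup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: three links and one span
html = "<a>hao123</a><a>Map</a><a>Tieba</a><span>2020</span>"
soup = BeautifulSoup(html, "html.parser")

# A list matches strings that equal any one of its items
exact = soup.find_all(string=["hao123", "Map"])
# A compiled pattern matches strings that contain a digit anywhere
with_digit = soup.find_all(string=re.compile(r"\d"))
# With a tag name as well, the result is the <a> tags whose text matches
tags_with_digit = soup.find_all("a", string=re.compile(r"\d"))

print(exact)            # ['hao123', 'Map']
print(with_digit)       # ['hao123', '2020']
print(tags_with_digit)  # [<a>hao123</a>]
```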


The limit parameter

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
t_list = bs.find_all("a", limit=3)  # stop after the first 3 <a> tags
for item in t_list:
    print(item)
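A quick check of how limit behaves, on five generated links (the markup is invented). The find method is shown alongside it because it returns the first match directly rather than a list.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: five links numbered 0 to 4
html = "".join(f'<a href="/{i}">link {i}</a>' for i in range(5))
soup = BeautifulSoup(html, "html.parser")

first_three = soup.find_all("a", limit=3)  # at most 3 results, in document order
first_one = soup.find("a")                 # the first match itself, or None if there is none

print([a["href"] for a in first_three])  # ['/0', '/1', '/2']
print(first_one["href"])                 # /0
```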


CSS selectors

from bs4 import BeautifulSoup

# Open the saved page in binary mode; the with block closes the file afterwards
with open("./baidu.html", "rb") as file:
    html1 = file.read()
bs = BeautifulSoup(html1, "html.parser")
t_list = bs.select("title")  # select by tag name
t_list1 = bs.select(".mnav")  # select by class name
t_list2 = bs.select("#u1")  # select by id

for item in t_list:
    print(item)
for item in t_list1:
    print(item)
for item in t_list2:
    print(item)
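Selectors can also be combined, as in a stylesheet. The sketch below uses invented markup that borrows the mnav class and the u1 id from the example above, and assumes a Beautiful Soup version that ships select_one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class and id names from the example above
html = (
    "<head><title>Demo</title></head>"
    '<body><div id="u1">'
    '<a class="mnav" href="/news">News</a>'
    '<a class="mnav" href="/map">Map</a>'
    '<a class="bri" href="/more">More</a>'
    "</div></body>"
)
soup = BeautifulSoup(html, "html.parser")

mnav_links = soup.select(".mnav")         # by class
u1 = soup.select("#u1")                   # by id
children = soup.select("#u1 > a.mnav")    # direct <a class="mnav"> children of #u1
first = soup.select_one("a[href='/map']") # first match for an attribute selector

print(len(mnav_links))   # 2
print(u1[0].name)        # div
print(len(children))     # 2
print(first.get_text())  # Map
```

select always returns a list, even for an id, so index it or use select_one when a single element is expected.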
