Spider2BeautifulSoup（2）__find_all

苏格拉没底——

Published 2020-12-03 18:52:08

Category: Python / Web Scraping / Visualization / Data Analysis

Article link: https://blog.csdn.net/qq_924485343/article/details/110560596

import re
# bs4's BeautifulSoup parses HTML and XML documents (it is not a JSON parser)
from bs4 import BeautifulSoup

# Open the file in binary mode and read it into memory
with open("./baidu.html", "rb") as file:
    html = file.read().decode("utf-8")
# The file is read as bytes, but after decode("utf-8") html is a str
# print("html: ", type(html))

# Build a tree structure of the document in memory
bs = BeautifulSoup(html, "html.parser")


# bs.find_all with a string filter: matches tags whose name is exactly equal to the string
# t_list = bs.find_all("a")
# print(t_list[0].string)
# print(t_list[0].attrs)

# Regular-expression filter: the pattern is applied with its search() method,
# so every tag whose name contains "a" is selected
# t_list = bs.find_all(re.compile("a"))
# print(t_list)

# A function can also be passed in; tags for which it returns True are selected
# def name_is_exists(tag):
#     return tag.has_attr("name")
#
# t_list = bs.find_all(name_is_exists)
# print(t_list)

# Keyword arguments filter on attributes
# t_list = bs.find_all(id="news_banner_data")
# t_list = bs.find_all(href="http://news.baidu.com")  # tags whose href attribute equals this value
# t_list = bs.find_all(class_=True)  # tags that have a class attribute

# The text argument filters on string content (Beautiful Soup 4.4.0+ also accepts it as string=)
# t_list = bs.find_all(text="贴吧")
# t_list = bs.find_all(text=["贴吧", "地图"])
# t_list = bs.find_all(text=re.compile(r"\d"))

t_list = bs.find_all("a", limit=3)  # only the first three <a> tags
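
Since the contents of baidu.html are not shown in the post, the filters above are hard to try out directly. The following is a minimal sketch that runs each kind of find_all filter against a small inline document. SAMPLE_HTML and every variable name in it are hypothetical stand-ins and not the real Baidu page, and string= is used as the newer spelling of the text= argument (this assumes Beautiful Soup 4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for baidu.html; the real page's markup is not shown in the post.
SAMPLE_HTML = """
<html><head><title>demo</title></head>
<body>
<div id="news_banner_data">
<a class="nav" href="http://news.baidu.com" name="tj_news">新闻</a>
<a class="nav" href="http://tieba.baidu.com" name="tj_tieba">贴吧</a>
<a class="nav" href="http://map.baidu.com">地图</a>
<a href="http://www.hao123.com">hao123</a>
</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. String filter: tag name must equal "a" exactly.
all_links = soup.find_all("a")

# 2. Regex filter: tag name must contain "a" (here that is <head> and <a>).
names_with_a = sorted({tag.name for tag in soup.find_all(re.compile("a"))})

# 3. Function filter: keep tags for which the function returns True.
def has_name_attr(tag):
    return tag.has_attr("name")

named_tags = soup.find_all(has_name_attr)

# 4. Keyword arguments: filter on attribute values.
banner = soup.find_all(id="news_banner_data")
news_links = soup.find_all(href="http://news.baidu.com")
tags_with_class = soup.find_all(class_=True)

# 5. string= filters on text content and returns the matching strings, not tags.
exact_text = [str(s) for s in soup.find_all(string="贴吧")]
either_text = [str(s) for s in soup.find_all(string=["贴吧", "地图"])]
text_with_digit = [str(s) for s in soup.find_all(string=re.compile(r"\d"))]

# 6. limit= stops the search after the given number of matches.
first_three = soup.find_all("a", limit=3)

print(len(all_links), names_with_a, len(named_tags))
print(exact_text, either_text, text_with_digit)
print([a["href"] for a in first_three])
```

Note that the string filters return text nodes while the other filters return tags, so a result such as exact_text has to be handled differently from all_links.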
