Concepts and Classification of Scraped Data

ch_zs

Category: Web scraping

Article link: https://blog.csdn.net/qq_43665891/article/details/109598606

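Classification of data in web scraping

A crawler retrieves many different types of data. Each type is extracted and parsed with the method suited to it.

Structured data: JSON, XML, etc.

Handling: convert directly into Python types
Unstructured data: HTML

Handling: regular expressions, XPath, BS4 (Beautiful Soup 4)
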
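
The list above names regular expressions as one way to handle HTML, but the post gives no example of them. Here is a minimal sketch, assuming a small hand-written fragment (the `html` string, the pattern, and the `extract_links` helper are made up for illustration). Regular expressions are fragile on real-world markup, so this approach probably suits only simple, regular pages.

```python
import re

# Hypothetical HTML fragment used only for illustration
html = """
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
"""

# Capture the href value and the link text of every <a> tag
pattern = re.compile(r'<a\s+href="([^"]*)">([^<]*)</a>')


def extract_links(source):
    """Return a list of {"href": ..., "title": ...} dicts found in source."""
    return [{"href": href, "title": title} for href, title in pattern.findall(source)]


links = extract_links(html)
for item in links:
    print(item)
```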
Structured data:

Converting JSON data

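import json

# r is the response of an earlier request for JSON data (e.g. from requests.get)
json_str = r.content.decode()
# Convert the JSON string into a Python object
python_dict = json.loads(json_str)
# Reverse operation (convert the Python dict back into a JSON string)
# json_str = json.dumps(python_dict)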
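
A self-contained sketch of the same round trip, assuming a hard-coded byte string in place of a real response body (the sample payload and the names `raw_bytes` and `round_trip` are made up for illustration):

```python
import json

# Stand-in for r.content: raw bytes as they would arrive over the network
raw_bytes = '{"name": "爬虫", "pages": 4, "tags": ["json", "xml"]}'.encode("utf-8")

json_str = raw_bytes.decode()       # bytes -> str (UTF-8 by default)
python_dict = json.loads(json_str)  # JSON text -> Python dict

# JSON types map onto Python types: object -> dict, array -> list, number -> int/float
print(type(python_dict), python_dict["tags"])

# Reverse operation; ensure_ascii=False keeps non-ASCII characters readable
round_trip = json.dumps(python_dict, ensure_ascii=False)
print(round_trip)
```

Without `ensure_ascii=False`, `json.dumps` escapes non-ASCII characters as `\uXXXX` sequences, which is still valid JSON but harder to read in logs.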

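jsonpath

JsonPath makes it quick to extract values from JSON data.
Install the module:

pip install jsonpath -i https://pypi.tuna.tsinghua.edu.cn/simple

$ root node
@ current node
. or [] child node
.. recursive descent: matches at any depth, regardless of position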

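import jsonpath

# info is the Python dict parsed from the JSON document (e.g. via json.loads)

# 1. Extract the title of the first book
print("\n1. Extract the title of the first book")
ret = jsonpath.jsonpath(info, "$.store.book[0].title")
# Equivalent bracket notation
ret = jsonpath.jsonpath(info, "$['store']['book'][0]['title']")
print(ret)

# 2. Extract the titles of books 2, 3 and 4
print("\n2. Extract the titles of books 2, 3 and 4")
ret = jsonpath.jsonpath(info, "$.store.book[1,2,3].title")
print(ret)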
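
To make the path expressions concrete, the sketch below spells out what they select using plain Python on a made-up `info` dict. The sample data and the `find_all_keys` helper are assumptions for illustration and are not part of the jsonpath module; the helper only imitates the recursive-descent operator for a single key.

```python
# Hypothetical sample data shaped like the store/book document the paths refer to
info = {
    "store": {
        "book": [
            {"title": "Book A", "price": 8.95},
            {"title": "Book B", "price": 12.99},
            {"title": "Book C", "price": 8.99},
            {"title": "Book D", "price": 22.99},
        ],
        "bicycle": {"color": "red", "price": 19.95},
    }
}

# "$.store.book[0].title": walk from the root through each child key
first_title = info["store"]["book"][0]["title"]

# "$.store.book[1,2,3].title": the union [1,2,3] picks several indices
some_titles = [info["store"]["book"][i]["title"] for i in (1, 2, 3)]


def find_all_keys(node, key):
    """Imitate "$..key": collect every value stored under key at any depth."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_all_keys(v, key))
    elif isinstance(node, list):
        for child in node:
            found.extend(find_all_keys(child, key))
    return found


all_prices = find_all_keys(info, "price")
print(first_title, some_titles, all_prices)
```

Note that `jsonpath.jsonpath` wraps its matches in a list, so the first expression would yield a one-element list rather than a bare string.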

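Unstructured data:

XPath syntax

/ selects from the root node, or separates steps in a path
// selects matching nodes anywhere below the current node, regardless of position
@ selects an attribute
text() selects the text
@href selects the link target (the href attribute)

To use XPath in code, install the lxml module: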

pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple

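from lxml import etree

# text is the HTML source string fetched earlier
# etree.HTML parses the string into an Element object
html = etree.HTML(text)

href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")


# Assemble each link into a dict; zip pairs items by position,
# which stays correct even when two links share the same href
for href, title in zip(href_list, title_list):
    item = {"href": href, "title": title}
    print(item)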
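
For a version that runs without extra installs, here is a minimal sketch that uses the standard library's `xml.etree.ElementTree` instead of lxml. This is a swapped-in stand-in, not the lxml approach itself: ElementTree supports only a subset of XPath (no `@href` or `text()` steps) and needs well-formed markup, so the attribute and text are read from each matched element. The `text` fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a downloaded page
text = """
<div>
  <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0"><a href="link3.html">third item</a></li>
  </ul>
</div>
"""

root = ET.fromstring(text)

# ".//" searches all descendants; [@class='item-1'] filters on an attribute value
anchors = root.findall(".//li[@class='item-1']/a")

# Read the attribute and the text from the same element so each pair stays together
items = [{"href": a.get("href"), "title": a.text} for a in anchors]
for item in items:
    print(item)
```

Reading both values from one matched element avoids the risk of two separate result lists drifting out of step when a link has no text.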

BS4 (Beautiful Soup 4)

Install the module:

pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple

Example: searching by tag

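from bs4 import BeautifulSoup

# The HTML source to parse goes here (the sample markup is missing from the original post)
html = '''

'''
# Create the Beautiful Soup object
soup = BeautifulSoup(html, features="lxml")

# Find all a tags; returns a list
ret = soup.find_all('a')

# Attributes can be searched too (class is a Python keyword, so use class_)
ret = soup.find_all(class_='sister')
ret = soup.find_all(id='link2')

# Search by text content
ret = soup.find_all(string='Elsie')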

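Example: searching with CSS selectors

ret = soup.select('title') # tag names are written as-is
ret = soup.select('.sister') # prefix class names with .
ret = soup.select('#link1') # prefix ids with #
ret = soup.select('p #link1') # descendant (hierarchy) selector
ret = soup.select('a[class="sister"]') # attribute selector
ret = soup.select('title')[0].get_text() # get the text content
ret = soup.select('a')[0].get('href') # get an attribute value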
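
A runnable sketch that ties the tag search and the CSS selectors together. The `html` string is a made-up fragment whose ids and classes match the selectors used above, since the sample markup did not survive in the post. It assumes Python's built-in `html.parser` as the parser, so that only beautifulsoup4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up fragment whose ids and classes match the selectors used above
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html, features="html.parser")

# Search by tag name, by class and by id
all_links = soup.find_all("a")
sisters = soup.find_all(class_="sister")
link2 = soup.find_all(id="link2")

# The same kind of lookups written as CSS selectors
title_text = soup.select("title")[0].get_text()
first_href = soup.select("p #link1")[0].get("href")
by_attr = soup.select('a[class="sister"]')

print(len(all_links), title_text, first_href)
```

Both `find_all` and `select` return lists, which is why the examples index with `[0]` before calling `get_text()` or `get()`.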
