Web Crawler Overview
Differences Between a Web Crawler and a Browser
A browser displays data, while a web crawler collects data.
Definition of a Web Crawler
A program that simulates a client sending requests and receiving responses, automatically gathering information from the World Wide Web according to certain rules.
The Purpose of a Web Crawler
To retrieve the information we need from the World Wide Web.
The requests Library
Introduction
requests is an elegant and simple Python HTTP library; its job is to send requests and retrieve response data.
Installation
Run the following command in a terminal:
pip install requests
Usage Steps
- Import the module:
import requests
- Send a GET request and get the response:
response = requests.get('http://www.baidu.com')
- Get data from the response:
print(response.text)
Common Attributes
- response.text: the response body as a str
- response.encoding: the encoding used to decode the raw bytes into text
- response.content: the response body as bytes
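The three attributes above can be seen on a real response. A minimal sketch, assuming network access to www.baidu.com (any reachable site works the same way):

```python
import requests

# Fetch a page and inspect the common response attributes.
response = requests.get('http://www.baidu.com')
print(response.encoding)       # encoding requests inferred from the headers
print(type(response.text))     # <class 'str'>   - body decoded with response.encoding
print(type(response.content))  # <class 'bytes'> - raw, undecoded body
```

Note that `response.text` is simply `response.content` decoded with `response.encoding`; when the inferred encoding is wrong, decoding `response.content` yourself (as in the case study below) is more reliable.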
Case Study
Fetch the homepage of DXY (丁香园). The homepage URL is: https://ncov.dxy.cn/ncovh5/view/pneumonia
# 1. Import the module
import requests
# 2. Send the request and get the response
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
# 3. Get data from the response
# print(response.text)
print(response.content.decode())
The BeautifulSoup Parsing Library
Beautiful Soup: Introduction and Installation
- Introduction to Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML or XML files.
- Installing Beautiful Soup
pip install bs4
pip install lxml
The BeautifulSoup Object: Introduction and Creation
- The BeautifulSoup object
Represents the entire document tree to be parsed; it supports most of the methods for navigating and searching the document tree.
- Creating a BeautifulSoup object
# 1. Import the module
from bs4 import BeautifulSoup
# 2. Create a BeautifulSoup object
soup = BeautifulSoup('<html>data</html>', 'lxml')  # specify lxml as the parser
print(soup)
The find Method of the BeautifulSoup Object
- Purpose of the find method
Searches the document tree.
- Signature
find(self, name=None, attrs={}, recursive=True, text=None, **kwargs)
- Parameters
- name: tag name
- attrs: dictionary of attributes
- recursive: whether to search recursively
- text: search by text content
- Returns
The first matching element object.
- Case study
Goal: extract the title tag and the a tags from the document below.
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title">
<b>The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
- Search by tag name
# 1. Import the module
from bs4 import BeautifulSoup
# 2. Prepare the document string
html = '''
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title">
<b>The Dormouse's story</b>
</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
'''
# 3. Create a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
# 4. Find the title tag
title = soup.find('title')
print(title)
# 5. Find an a tag
a = soup.find('a')
print(a)
# Find all a tags
a_s = soup.find_all('a')
print(a_s)
- Search by attribute
# Find the tag whose id is link1
# Approach 1: pass the attribute as a keyword argument
a = soup.find(id='link1')
print(a)
# Approach 2: pass an attrs dictionary
a = soup.find(attrs={'id': 'link1'})
print(a)
- Search by text
text = soup.find(text='Elsie')
print(text)
- The Tag object
A Tag object corresponds to an XML or HTML tag in the original document. Tags have many methods and attributes for navigating and searching the document tree and for reading tag content.
- Common Tag attributes
- name: the tag's name
- attrs: all of the tag's attributes as a dict
- text: the tag's text content as a string
print(type(a)) # <class 'bs4.element.Tag'>
print('tag name:', a.name)
print('all attributes:', a.attrs)
Case Study
# 1. Import the modules
import requests
from bs4 import BeautifulSoup
# 2. Send the request and fetch the homepage
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()
# print(home_page)
# 3. Extract data with BeautifulSoup
soup = BeautifulSoup(home_page, 'lxml')
script = soup.find(id='fetchIndexMallList')
text = script.text
print(text)
Regular Expressions
Concept and Purpose of Regular Expressions
- Concept
A regular expression is a pattern for matching strings.
- Purpose
- Check whether a string contains a given substring
- Replace matching substrings
- Extract matching substrings from a string
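The three uses above can be sketched with the standard re module (the sample string is made up for illustration):

```python
import re

text = 'order id: 1024, backup id: 2048'

# 1. Check whether the string contains a substring matching the pattern
print(re.search(r'\d+', text) is not None)  # True

# 2. Replace the matching substrings
print(re.sub(r'\d+', 'N', text))            # order id: N, backup id: N

# 3. Extract the matching substrings
print(re.findall(r'\d+', text))             # ['1024', '2048']
```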
Common Regular Expression Syntax
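A few of the most common tokens, demonstrated with re.findall (the sample strings are made up for illustration):

```python
import re

print(re.findall(r'\d', 'a1b22'))    # ['1', '2', '2']  \d: one digit
print(re.findall(r'\d+', 'a1b22'))   # ['1', '22']      +: one or more repetitions
print(re.findall(r'\w', 'a-1_'))     # ['a', '1', '_']  \w: letter, digit, or underscore
print(re.findall(r'a.c', 'abc a\nc'))             # ['abc']          .: any char except \n
print(re.findall(r'a.c', 'abc a\nc', re.DOTALL))  # ['abc', 'a\nc']  DOTALL: . also matches \n
```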
The re.findall() Method
re.findall(pattern, string, flags=0) (important)
- Purpose:
Scans the whole string and returns a list of all matches of pattern.
- Parameters:
- pattern: the regular expression
- string: the string to search in
- flags: matching flags
- Returns
A list of all matches of pattern found in string.
- Example:
re.findall(r"\d", "chuan1zhi2") >> ["1", "2"]
- Behavior of findall() (remember this)
If the regular expression contains no parentheses (), the list holds the text matched by the whole pattern.
If it contains parentheses (), the list holds only the content captured inside them; the parts outside the parentheses just pinpoint where the data sits.
Example:
import re
rs = re.findall("a.+bc", "a\nbc", re.DOTALL)
print(rs)  # ['a\nbc']
rs = re.findall("a(.+)bc", "a\nbc", re.DOTALL)
print(rs)  # ['\n']
Using Raw Strings (r-strings) in Regular Expressions
Using a raw string for the pattern avoids the problems caused by escape characters.
With an r-string pattern, write exactly as many backslashes in the pattern as appear in the string to be matched.
import re
rs = re.findall("a\nb", "a\nb")
print(rs)  # ['a\nb']
rs = re.findall("a\\nb", "a\\nb")
print(rs)  # []
rs = re.findall("a\\\\nb", "a\\nb")
print(rs)  # ['a\\nb']
rs = re.findall(r"a\nb", "a\nb")
print(rs)  # ['a\nb']
Case Study: Extracting the JSON String Containing the Data
# 1. Import the modules
import requests
from bs4 import BeautifulSoup
import re
# 2. Send the request and fetch the homepage
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()
# print(home_page)
# 3. Extract data with BeautifulSoup
soup = BeautifulSoup(home_page, 'lxml')
script = soup.find(id='fetchIndexMallList')
text = script.text
# print(text)
# 4. Extract the JSON string with a regular expression
json_str = re.findall(r'\[.+\]', text)[0]
print(json_str)
The json Module
Introduction to the json Module
- The json module
The json module is part of Python's standard library; it converts between JSON and Python data.
- Mapping between JSON and Python types
json | python
---|---
object | dict
array | list
string | str
number (int) | int
number (real) | float
true | True
false | False
null | None
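The mapping table above can be checked directly with json.loads; a minimal sketch covering one JSON value per row:

```python
import json

# A JSON object holding an array of int, real, string, true, false, null.
data = json.loads('{"a": [1, 2.5, "s", true, false, null]}')
print(data)             # {'a': [1, 2.5, 's', True, False, None]}
print(type(data))       # <class 'dict'> - JSON object -> dict
print(type(data['a']))  # <class 'list'> - JSON array  -> list
```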
JSON to Python
import json
# 1. Convert a JSON string to Python data
# 1.1 Prepare the JSON string
json_str = '''[{"commodityId":"3591210463042011540","commodityLogo":"https://img1.dxycdn.com/2022/1230/768/5942165308020931953-68.png","commodityName":"现货抗原检测试剂盒25人份","cornerMark":"限量抢购中","miniProgramLink":"/pages/common/cms/activity/index?name=jk_zxcgvhbjk&moduleId=3588838986124169233&from=other&chdShareFromId=3589899615392486250&chdShareEntityId=3370441352948220030&chdShareType=2","miniProgramShortLink":"https://dxy.cool/YTGSHf","skuId":"3591210463042011547","price":25800,"discountPrice":2900,"sortId":110,"sellStatus":0},{"commodityId":"3549260783022588925","commodityLogo":"https://img1.dxycdn.com/2022/1205/807/2086858196898476853-68.png","commodityName":"维生素C凝胶糖果60粒","cornerMark":"","miniProgramLink":"/pages/common/cms/activity/index?name=jk_zxcgvhbjk&moduleId=3587310742271172996&from=other&chdShareFromId=3589406510197798564&chdShareEntityId=3370441352948220031&chdShareType=2","miniProgramShortLink":"https://dxy.cool/13pZ2L","skuId":"3551181722177885802","price":5900,"discountPrice":4900,"sortId":90,"sellStatus":0}]'''
# 1.2 Convert the JSON string to Python data
rs = json.loads(json_str)
print(rs)
print(type(rs)) # <class 'list'>
print(type(rs[0])) # <class 'dict'>
# 2. Convert a JSON-format file to Python data
# 2.1 Open a file object pointing at the file
with open('data/test.json') as fp:
    # 2.2 Load the file object and convert it to Python data
    python_list = json.load(fp)
print(python_list)
print(type(python_list))  # <class 'list'>
print(type(python_list[0]))  # <class 'dict'>
Python to JSON
import json
# 1. Convert Python data to a JSON string
# 1.1 Prepare the source data (a JSON string, parsed below to obtain Python data)
json_str = '''[{"commodityId":"3591210463042011540","commodityLogo":"https://img1.dxycdn.com/2022/1230/768/5942165308020931953-68.png","commodityName":"现货抗原检测试剂盒25人份","cornerMark":"限量抢购中","miniProgramLink":"/pages/common/cms/activity/index?name=jk_zxcgvhbjk&moduleId=3588838986124169233&from=other&chdShareFromId=3589899615392486250&chdShareEntityId=3370441352948220030&chdShareType=2","miniProgramShortLink":"https://dxy.cool/YTGSHf","skuId":"3591210463042011547","price":25800,"discountPrice":2900,"sortId":110,"sellStatus":0},{"commodityId":"3549260783022588925","commodityLogo":"https://img1.dxycdn.com/2022/1205/807/2086858196898476853-68.png","commodityName":"维生素C凝胶糖果60粒","cornerMark":"","miniProgramLink":"/pages/common/cms/activity/index?name=jk_zxcgvhbjk&moduleId=3587310742271172996&from=other&chdShareFromId=3589406510197798564&chdShareEntityId=3370441352948220031&chdShareType=2","miniProgramShortLink":"https://dxy.cool/13pZ2L","skuId":"3551181722177885802","price":5900,"discountPrice":4900,"sortId":90,"sellStatus":0}]'''
# 1.2 Convert the JSON string to Python data
rs = json.loads(json_str)
# 1.3 Convert the Python data back to a JSON string
json_str = json.dumps(rs, ensure_ascii=False)
print(json_str)
# 2. Store Python data in a file in JSON format
# 2.1 Open the file object to write to
with open('data/test1.json', 'w') as fp:
    # 2.2 Dump the Python data into the file in JSON format
    json.dump(rs, fp, ensure_ascii=False)
Case Study: Parsing the JSON Data String
# 1. Import the modules
import requests
from bs4 import BeautifulSoup
import re
import json
# 2. Send the request and fetch the homepage
response = requests.get('https://ncov.dxy.cn/ncovh5/view/pneumonia')
home_page = response.content.decode()
# print(home_page)
# 3. Extract data with BeautifulSoup
soup = BeautifulSoup(home_page, 'lxml')
script = soup.find(id='fetchIndexMallList')
text = script.text
# print(text)
# 4. Extract the JSON string with a regular expression
json_str = re.findall(r'\[.+\]', text)[0]
# print(json_str)
# 5. Convert the JSON string to Python data
data = json.loads(json_str)
print(data)