Python Advanced: Web Crawling (Part 2)
The requests module
- Features: a module built for network requests; powerful, simple to use, and efficient
- Purpose: simulates a browser sending requests
- Python's HTTP request modules: urllib and requests
Installation (from the command line):
- Windows: pip install requests
- Linux: sudo pip install requests
How to use it (coding workflow):
a. Specify the URL
   - UA spoofing (set a browser User-Agent)
   - Handle the request parameters
b. Send the request
c. Get the response
d. Store the data
Getting response data: the Response object that requests returns provides several built-in attributes and methods:
- r.text: the response body as a string
- r.json(): parses the response body as JSON (note: a method, not an attribute)
- r.status_code: the HTTP status code of the response
- r.url: the final URL of the request
- r.encoding: the encoding used to decode r.text
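To illustrate these attributes without hitting the network, the sketch below builds a Response object by hand (normally `requests.get()` returns one for you). Setting the private `_content` attribute is purely an illustration trick, not something you would do in real crawling code.

```python
import json
import requests

# Build a Response by hand, no network needed (illustration only)
resp = requests.models.Response()
resp.status_code = 200
resp.url = 'https://example.com/api'   # placeholder URL, not a real endpoint
resp.encoding = 'utf-8'
# _content is private; set here only so .text and .json() have a body
resp._content = json.dumps({'ok': True}).encode('utf-8')

print(resp.status_code)  # 200
print(resp.text)         # the body as a string
print(resp.json())       # the body parsed into a dict
```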
Request methods
The HTTP request methods:
GET
requests.get()
```python
import requests

# 1. Specify the URL
url = "https://www.sogou.com/"
# 2. Send the request
# requests.get() returns a Response object
response = requests.get(url=url)
# 3. Get the response data
page_text = response.text
print(page_text)
# 4. Store the data
with open('./sogou.html', 'w', encoding='utf-8') as p:
    p.write(page_text)
print('Done')
```
During repeated crawls you may run into garbled pages or other anti-crawler behavior. Try adding request headers:
```python
import requests

# 1. Specify the URL
url = "https://www.sogou.com/web"
# Request headers differ per visitor: press F12 and check the Network
# tab to find your own User-Agent, then paste it in place of 'xxx'
header = {'User-Agent': 'xxx'}
# 2. Send the request
# requests.get() returns a Response object
param = {"query": "广州天气"}  # the search query ("Guangzhou weather")
response = requests.get(url=url, params=param, headers=header)
# 3. Get the response data
page_text = response.text
print(page_text)
# 4. Store the data
with open('./sogou1.html', 'w', encoding='utf-8') as p:
    p.write(page_text)
print('Done')
```
POST
The familiar "form submission". The request parameters go in `data`, which can be a dict, a list of tuples, a list, or a JSON string.
requests.post()
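To see how `data` is turned into a form body without actually sending anything, you can prepare a request and inspect it. This is a sketch: the URL is a placeholder, and the request is never sent.

```python
from requests.models import PreparedRequest

# Prepare (but do not send) a POST to inspect how the data dict
# is encoded; example.com is a placeholder URL
req = PreparedRequest()
req.prepare(method='POST', url='https://example.com/submit',
            data={'kw': 'hello'})
print(req.body)                      # kw=hello
print(req.headers['Content-Type'])  # application/x-www-form-urlencoded
```

A dict passed as `data` becomes a URL-encoded form body, and requests sets the Content-Type header accordingly.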
```python
import requests
import json

url = 'https://fanyi.baidu.com/sug'
headers = {'User-Agent': 'xxx'}
# POST request: send the word to translate as form data
word = input('word:')
data = {'kw': word}
# Send the request
resp = requests.post(url=url, data=data, headers=headers)
dic_obj = resp.json()
filename = word + '.json'
with open(filename, 'w', encoding='utf-8') as fp:
    json.dump(dic_obj, fp=fp, ensure_ascii=False)
print('Done')
```
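The `ensure_ascii=False` argument used above matters when the data contains Chinese text: with the default setting, json escapes every non-ASCII character into \uXXXX sequences.

```python
import json

# Compare json output with and without ensure_ascii
data = {'kw': '天气'}
print(json.dumps(data))                      # {"kw": "\u5929\u6c14"}
print(json.dumps(data, ensure_ascii=False))  # {"kw": "天气"}
```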
Without parameters:
https://www.baidu.com
With parameters:
https://www.baidu.com/s?wd=xxx
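You rarely build that query string by hand: the `params` dict is appended to the URL for you. The sketch below only prepares the request to show the resulting URL; nothing is sent.

```python
from requests.models import PreparedRequest

# Prepare (but do not send) a GET to see how params become the query string
req = PreparedRequest()
req.prepare(method='GET', url='https://www.baidu.com/s',
            params={'wd': 'python'})
print(req.url)  # https://www.baidu.com/s?wd=python
```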
Crawling examples
1. Douban
Crawl movie detail data from the Douban movie chart (movies 11 through 21 of any one category).
```python
import requests
import json

url = 'https://movie.douban.com/j/chart/top_list'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
param = {
    'type': '5',              # movie category
    'interval_id': '100:90',  # rating percentile range
    'action': '',
    'start': 10,              # skip the first 10 movies
    'limit': 11               # fetch the next 11
}
response = requests.get(url=url, params=param, headers=headers)
dic_obj = response.json()
filename = 'aa.json'
with open(filename, 'w', encoding='utf-8') as fp:
    json.dump(dic_obj, fp=fp, ensure_ascii=False)
print('Done')
```
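The off-by-one between "the 11th movie" and `start: 10` is easy to get wrong: `start` is 0-based, and `limit` counts items from there. A small hypothetical helper (not part of the Douban API) makes the conversion explicit:

```python
def page_params(first, last):
    # Convert 1-based movie positions (inclusive) into the API's
    # 0-based start plus an item count
    return {'start': first - 1, 'limit': last - first + 1}

print(page_params(11, 21))  # {'start': 10, 'limit': 11}
```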
2. NMPA (National Medical Products Administration) cosmetics licenses
```python
import requests
import json

url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36'}
# Collect company IDs from the first two list pages
id_list = []
detail_list = []
for page in range(1, 3):
    data = {
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': ''
    }
    resp = requests.post(url=url, data=data, headers=headers)
    dic_obj = resp.json()
    for dic in dic_obj['list']:
        id_list.append(dic['ID'])

# Request the detail data for each collected ID
detail_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
for id in id_list:
    print(id)
    detail_data = {'id': id}
    detail = requests.post(url=detail_url, data=detail_data, headers=headers)
    detail_list.append(detail.json())

with open('./许可证详情1.json', 'w', encoding='utf-8') as fp:
    json.dump(detail_list, fp, ensure_ascii=False)
print('Done')
```
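The core of this two-step crawl is the list-to-ID extraction. The offline sketch below runs that step against a made-up payload shaped like the API response (the field names `list` and `ID` match the code above; the values are invented):

```python
# Made-up sample payload with the same shape as the list API's response
sample_page = {'list': [{'ID': 'a1'}, {'ID': 'b2'}]}

# Same extraction step as in the crawl loop above
id_list = [entry['ID'] for entry in sample_page['list']]
print(id_list)  # ['a1', 'b2']
```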