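I. The urllib module
1. What is the urllib module?
urllib is the HTTP request package built into Python's standard library.
Why learn this module?
1. Some older crawler projects are built on it.
2. Some scraping tasks need requests and urllib used together.
3. It ships with the standard library, so nothing has to be installed.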
Example 1
# Save the 'future car' image to a local file
import requests
response = requests.get(
'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fphotocdn.sohu.com%2F20120823%2FImg351337268.jpg&refer=http%3A%2F%2Fphotocdn.sohu.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1621479722&t=5cda4533dad4056d5809bf5f2450a22f').content
with open('未来汽车.jpg', 'wb') as f:
    f.write(response)
Example 2
# Save the 'future car' image to a local file
from urllib.request import urlretrieve
response = urlretrieve(
'https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fphotocdn.sohu.com%2F20120823%2FImg351337268.jpg&refer=http%3A%2F%2Fphotocdn.sohu.com&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=jpeg?sec=1621479722&t=5cda4533dad4056d5809bf5f2450a22f',
'未来汽车.jpg')
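1. The urllib.request module
Python 2: urllib2 and urllib
Python 3: urllib and urllib2 were merged into the urllib package. Commonly used functions:
- urllib.request.urlopen("URL"): sends a request to the site and returns the response
- bytes = response.read()
- string = response.read().decode("utf-8")
- urllib.request.Request("URL", headers=dict): needed because urlopen() on its own cannot set a custom User-Agent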
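The last item in the list can be checked without touching the network. A minimal sketch, assuming an illustrative URL and User-Agent string: it builds the Request object and inspects what it would send.

```python
import urllib.request

# Illustrative values, not taken from any real project
url = 'https://www.baidu.com/'
headers = {'User-Agent': 'Mozilla/5.0'}

# Building a Request does not send anything yet; urlopen(req) would
req = urllib.request.Request(url, headers=headers)

print(req.full_url)      # the URL the request targets
print(req.get_method())  # GET, because no data was attached
# Request stores header names capitalized, so the key is 'User-agent'
print(req.get_header('User-agent'))
```

Passing the finished object to urllib.request.urlopen(req) is what actually performs the request.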
Example 1
# Using urllib.request
# urllib.request.urlopen('URL')
# Purpose: send a request to the site and get the response
import urllib.request
response = urllib.request.urlopen('https://www.baidu.com/')
print(type(response)) # <class 'http.client.HTTPResponse'>
print(response.read())
'''
b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'
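''' # 1. The result is bytes and must be decoded. 2. The content is wrong (the site applies anti-scraping measures), so a User-Agent must be added.
Example 2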
import urllib.request
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request("https://www.baidu.com/",headers=headers)
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))
# The data returned this way is correct
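2. The response object
- read(): reads the content of the server response
- getcode(): returns the HTTP status code
- geturl(): returns the URL that was actually fetched (useful for detecting redirects)
Example
import urllib.parse
import urllib.request
# Send a request to the given URL and get back the server's response (a file-like object)
url = "http://www.baidu.com"
# Encode
newUrl2 = urllib.parse.quote(url)
print(newUrl2) # http%3A//www.baidu.com
# Decode
newUrl1 = urllib.parse.unquote(newUrl2)
print(newUrl1) # http://www.baidu.com
response = urllib.request.urlopen(newUrl1)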
data = response.read()
print(data)
'''
b'<!DOCTYPE html><!--STATUS OK-->\n\n\n <html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta content="always" name="referrer"><meta name="theme-color" content="#2932e1"><meta name="description" content="\xe5\x85\xa8\xe7\x90\x83\xe6\x9c\x80\xe5\xa4\xa7\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe6\x90\x9c\xe7\xb4\xa2\xe5\xbc\x95\xe6\x93\x8e\xe3\x80\x81\xe8\x87\xb4\xe5\x8a\x9b\xe4\xba\x8e\xe8\xae\xa9\xe7\xbd\x91\xe6\xb0\x91\xe6\x9b\xb4\xe4\xbe\xbf\xe6\x8d\xb7\xe5\x9c\xb0\xe8\x8e\xb7\xe5\x8f\x96\xe4\xbf\xa1\xe6\x81\xaf\xef\xbc\x8c\xe6\x89\xbe\xe5\x88\xb0\xe6\x89\x80\xe6\xb1\x82\xe3\x80\x82\xe7\x99\xbe\xe5\xba\xa6\xe8\xb6\x85\xe8\xbf\x87\xe5\x8d\x83\xe4\xba\xbf\xe7\x9a\x84\xe4\xb8\xad\xe6\x96\x87\xe7\xbd\x91\xe9\xa1\xb5\xe6\x95\xb0\xe6\x8d\xae\xe5\xba\x93\xef\xbc\x8c\xe5\x8f\xaf\xe4\xbb\xa5\xe7\x9e\xac\xe9\x97\xb4\xe6\x89\xbe\xe5\x88\xb0\xe7\x9b\xb8\xe5\x85\xb3\xe7\x9a\x84\xe6\x90\x9c\xe7\xb4\xa2\xe7\xbb\x93\xe6\x9e\x9c\xe3\x80\x82"><link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" /><link rel="search" type="application/opensearchdescription+xml" ...
...
'''
# Return the response headers
print(response.info())
'''
Bdpagetype: 1
Bdqid: 0x0932eb0001c8f
Cache-Control: private
Content-Type: text/html;charset=utf-8
Date: Thu, 22 Apr 2021 02:28:48 GMT
Expires: Thu, 22 Apr 2021 02:27:56 GMT
P3p: CP=" OTI DSP COR IVA OUR IND COM "
P3p: CP=" OTI DSP COR IVA OUR IND COM "
Server: BWS/1.1
Set-Cookie: BAIDUID=1153AD40DBAF90F5435353FC10B:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=214453447; path=/; domain=.baidu.com
Set-Cookie: BIDUPSID=1153AD40DBAF90F5432D31787A9FC10B; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: PSTM=163534548; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
Set-Cookie: BAIDUID=1153AD35DBAF90F65468EA7EB4533595:FG=1; max-age=345435300; expires=Fri, 22-Apr-22 02:28:48 GMT; domain=.baidu.com; path=/; version=1; comment=bd
Set-Cookie: BDSVRTM=0; path=/
Set-Cookie: BD_HOME=1; path=/
Set-Cookie: H_PS_PSSID=33345_3345_315353_3364_3465_265740_28657; path=/; domain=.baidu.com
Traceid: 1745674674688087589780678584646519
Vary: Accept-Encoding
Vary: Accept-Encoding
X-Ua-Compatible: IE=Edge,chrome=1
Connection: close
Transfer-Encoding: chunked'''
# Return the status code
print(response.getcode()) # 200
# if response.getcode() == 200 or response.getcode() == 304:
#     process the page content
#     pass
# Return the URL that was actually fetched
print(response.geturl()) # http://www.baidu.com
3. The urllib.parse module
Commonly used functions
- urlencode(dict)
- quote(string) (the argument is a single string)
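The difference between the two functions is easiest to see side by side. A minimal sketch: quote() encodes one string, urlencode() builds a whole query string from a dict, and unquote() reverses the encoding. The pn key is added here only for illustration.

```python
from urllib.parse import quote, unquote, urlencode

word = '羽哥'

# quote() percent-encodes a single string (UTF-8 bytes by default)
encoded = quote(word)
print(encoded)  # %E7%BE%BD%E5%93%A5

# urlencode() takes a dict and builds a complete query string
query = urlencode({'kw': word, 'pn': 50})
print(query)  # kw=%E7%BE%BD%E5%93%A5&pn=50

# unquote() reverses the encoding
print(unquote(encoded) == word)  # True
```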
Example 1
import urllib.request
# How encoding works: each Chinese character becomes three %XX groups (its UTF-8 bytes)
url = 'https://tieba.baidu.com/f?fr=wwwt&ie=utf-8&kw=%E7%BE%BD%E5%93%A5'
url1 = 'https://tieba.baidu.com/f?fr=wwwt&ie=utf-8&kw=羽哥'
# res = urllib.request.urlopen(url1)
# When requesting a URL that contains Chinese characters through urllib, the characters must first be converted to percent-encoded hexadecimal form: %E7%BE%BD%E5%93%A5
import urllib.parse
kw = {'kw': '羽哥'}  # the Tieba URL above uses the kw parameter
result = urllib.parse.urlencode(kw)
print(result) # kw=%E7%BE%BD%E5%93%A5
new_url = 'https://tieba.baidu.com/f?fr=wwwt&ie=utf-8&' + result
Example 2
import urllib.parse
url2 = "https://tieba.baidu.com/f?kw=%E7%BE%BD%E5%93%A5"
# Decode
newUrl = urllib.parse.unquote(url2)
print(newUrl)
'''
https://tieba.baidu.com/f?kw=羽哥'''
# Encode
newUrl2 = urllib.parse.quote(newUrl)
print(newUrl2)
'''
https%3A//tieba.baidu.com/f%3Fkw%3D%E7%BE%BD%E5%93%A5'''
Case study 1: scraping Honor of Kings HD wallpapers
# Scrape Honor of Kings HD wallpapers
# Page analysis:
# Home page: https://pvp.qq.com/web201605/wallpaper.shtml
# First page: https://pvp.qq.com/web201605/wallpaper.shtml
# Last page: https://pvp.qq.com/web201605/wallpaper.shtml  The URL is identical, so the images on later pages are loaded dynamically
# Each page holds 4*5=20 images over 25 pages (page 25 has 6), for 20*24+6=486 HD images in total
# Image 1: http://shp.qpic.cn/ishow/2735042018/1618915966_84828260_2160_sProdImgNo_7.jpg/0
# Image 2: http://shp.qpic.cn/ishow/2735041519/1618485631_84828260_22420_sProdImgNo_7.jpg/0
# Image 3: http://shp.qpic.cn/ishow/2735040920/1617970550_84828260_22886_sProdImgNo_7.jpg/0
# Image 486: http://shp.qpic.cn/ishow/2735122518/1545733077_-888937974_7302_sProdImgNo_7.jpg/0
# No_2 is 1024x768, No_5 is 1440x900, No_7 is 1920x1200; image URLs end in .jpg/0
# Data loaded by JavaScript for page 1: https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=0&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17103347171427601099_1619236609881&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1619236924589
# Data loaded by JavaScript for page 2: https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=1&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17103347171427601099_1619236609882&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1619237018804
# Data loaded by JavaScript for page 25: https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=24&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17103347171427601099_1619236609886&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1619237111998
import os
import re
import urllib.parse
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
class WangzheSpider:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'}
    def get_url(self):
        # Build the 25 paginated API URLs (page=0..24)
        urls = []
        for i in range(25):
            url = 'https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page={}&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17103347171427601099&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735'.format(
                i)
            urls.append(url)
        return urls
    def req_page(self, url):
        # Fetch a URL and return the raw bytes
        response = requests.get(url, verify=False, headers=self.headers).content
        return response
    def write_page(self, image_url, filename):
        print('Saving %s' % filename)
        response = self.req_page(image_url)
        with open(filename, 'wb') as f:
            f.write(response)
        print('%s saved' % filename)
    def main(self):
        urls = self.get_url()
        hero_names = []
        hero_images = []
        for url in urls:
            response = self.req_page(url).decode('utf-8')
            # Both fields are percent-encoded in the response, so unquote every match
            pat1 = r'"sProdImgNo_7":"(.*?)",'
            sProdImgNo_7_list = re.compile(pat1, re.S).findall(response)
            hero_images += [urllib.parse.unquote(item) for item in sProdImgNo_7_list]
            pat2 = r'"sProdName":"(.*?)",'
            sProdName_list = re.compile(pat2, re.S).findall(response)
            hero_names += [urllib.parse.unquote(item) for item in sProdName_list]
        # print(hero_names, hero_images)
        os.makedirs('./image', exist_ok=True)  # make sure the output directory exists
        for name, image in zip(hero_names, hero_images):
            # Rewrite the URL suffix to request the No_5 (1440x900) image ending in /0
            image_url = image.replace('7.jpg/200', '5.jpg/0')
            filename = './image/%s.jpg' % name
            self.write_page(image_url, filename)
if __name__ == '__main__':
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    spider = WangzheSpider()
    spider.main()
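Because the endpoint is called with a jsoncallback parameter, its body is presumably JSONP, that is, JSON wrapped in a function call. As an alternative to regular expressions, the sketch below strips the wrapper and parses the payload with json. The sample string is invented, and the 'List' key is an assumption about the payload layout; only the sProdName and sProdImgNo_7 field names come from the spider above.

```python
import json
import urllib.parse

def parse_jsonp(text):
    """Strip a `callback(...)` wrapper and parse the JSON inside it."""
    start = text.index('(') + 1
    end = text.rindex(')')
    return json.loads(text[start:end])

# Invented sample shaped like a JSONP response; the 'List' key is an assumption
sample = ('jQuery123({"List": [{"sProdName": "%E6%B5%8B%E8%AF%95", '
          '"sProdImgNo_7": "http%3A%2F%2Fexample.com%2Fa_sProdImgNo_7.jpg%2F200"}]});')

data = parse_jsonp(sample)
# Every value is percent-encoded, so unquote each field
items = [
    {key: urllib.parse.unquote(value) for key, value in item.items()}
    for item in data['List']
]
print(items[0]['sProdName'])
print(items[0]['sProdImgNo_7'])
```

Parsing the structure keeps each name paired with its own image URL, which two independent regex passes cannot guarantee if one field is missing from a record.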
4. Sending a POST request
Example: a simple translation tool
# Goal: a simple translation tool
import urllib.request
import urllib.parse
import json
# Ask for the text to translate
content = input("Enter the text to translate: ")
# Target URL for the request
# url = 'https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule' # the '_o' must be removed, otherwise the server returns {"errorCode":50}
url = 'https://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
headers = {
    'User-Agent': '(omitted)'}
# Form data to send
data = {
'i': content,
'from': 'AUTO',
'to': 'AUTO',
'smartresult': 'dict',
'client': 'fanyideskweb',
'salt': '16100656169337',
'sign': '94a4176fmdr3lc0rcb9e8r410ra3a158',
'lts': '1619064616935',
'bv': 'b77c859e9me719deegcc45c5s42bda',
'doctype': 'json',
'version': '2.1',
'keyfrom': 'fanyi.web',
'action': 'FY_BY_REALTlME',
}
data = urllib.parse.urlencode(data)
data = bytes(data, 'utf-8')
req = urllib.request.Request(url, data=data, headers=headers)
res = urllib.request.urlopen(req)
html = res.read().decode('utf-8')
# print(html)
'''{"type":"ZH_CN2EN","errorCode":0,"elapsedTime":0,"translateResult":[[{"src":"你好","tgt":"hello"}]]}''' # This is a JSON-formatted string.
# Parse the data
# JSON str --> Python dict
r_dict = json.loads(html)
print(r_dict['translateResult'][0][0]['tgt'])
'''
Enter the text to translate: 你好
hello'''
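What turns the urllib request above into a POST is the data argument: once a bytes body is attached, Request reports POST as its method. A minimal offline sketch, assuming a hypothetical endpoint and a reduced set of form fields:

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, for illustration only
url = 'https://example.com/translate'
form = {'i': '你好', 'doctype': 'json'}

# urlencode() returns str; the request body must be bytes
body = urllib.parse.urlencode(form).encode('utf-8')
req = urllib.request.Request(url, data=body, headers={'User-Agent': 'Mozilla/5.0'})

print(req.get_method())  # POST, because data was supplied
print(req.data)          # b'i=%E4%BD%A0%E5%A5%BD&doctype=json'
```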
5. Exercise 1: search Baidu for the user's input and save the result page
import urllib.parse
import urllib.request
# url = "https://www.baidu.com/s?wd=%E7%BE%BD%E5%93%A5"
# Build the URL
key = input("Enter the search term: ")
wd = {'wd': key}
result = urllib.parse.urlencode(wd) # percent-encode
url = 'https://www.baidu.com/s?' + result
# Create the request object
headers = {
    'User-Agent': '(omitted)'}
req = urllib.request.Request(url, headers=headers)
# Get the response object
response = urllib.request.urlopen(req)
# Read the data
html = response.read().decode('utf-8')
# Save the data
with open('%s.html' % key, 'w', encoding='utf-8') as f:
    f.write(html)
6. Exercise 2: search a Baidu Tieba forum for the user's input and save several result pages
# Baidu Tieba exercise
# Ask for the forum topic to scrape
# Page through the results from a start page to an end page
# Save the data
import urllib.parse
import urllib.request
# 1. Analyze the pages:
'''
Page 1: https://tieba.baidu.com/f?kw=%E5%92%8C%E5%B9%B3%E7%B2%BE%E8%8B%B1&ie=utf-8&pn=0
Page 2: https://tieba.baidu.com/f?kw=%E5%92%8C%E5%B9%B3%E7%B2%BE%E8%8B%B1&ie=utf-8&pn=50
Last page: https://tieba.baidu.com/f?kw=%E5%92%8C%E5%B9%B3%E7%B2%BE%E8%8B%B1&ie=utf-8&pn=506500
10131 pages in total'''
# url = "https://tieba.baidu.com/f?"
# 2. Build the URL
name = input("Enter the Tieba forum name: ")
begin = int(input("Enter the start page: "))
end = int(input("Enter the end page: "))
kw = {'kw': name}
result = urllib.parse.urlencode(kw)
# 3. Define the request headers once, outside the loop
headers = {
    'User-Agent': '(omitted)'}
for i in range(begin, end + 1):
    pn = (i - 1) * 50
    url = 'https://tieba.baidu.com/f?' + result + '&pn=' + str(pn) # &ie=utf-8 can be omitted
    req = urllib.request.Request(url, headers=headers)
    # 4. Get the response object
    res = urllib.request.urlopen(req, timeout=20)
    # Read the data
    html = res.read().decode('utf-8')
    # Save the data
    with open('page_%d.html' % i, 'w', encoding='utf-8') as f:
        print('Scraping page_%d.html' % i)
        f.write(html)
'''
Enter the Tieba forum name: 和平精英
Enter the start page: 1
Enter the end page: 3
Scraping page_1.html
Scraping page_2.html
Scraping page_3.html
'''
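The paging rule in this exercise is that page n maps to pn = (n - 1) * 50. A minimal sketch that isolates the rule in one function so it can be checked without sending any requests; the names build_page_url and PAGE_SIZE are introduced here for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://tieba.baidu.com/f?'
PAGE_SIZE = 50  # each page advances pn by 50, per the page analysis above

def build_page_url(name, page):
    """Return the list URL for a 1-based page number."""
    pn = (page - 1) * PAGE_SIZE
    return BASE_URL + urlencode({'kw': name, 'pn': pn})

for page in range(1, 4):
    print(build_page_url('和平精英', page))
```

Letting urlencode() assemble both parameters avoids hand-concatenating '&pn=' onto the query string.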
7. Refactoring the code
# Exercise: search a Baidu Tieba forum for the user's input and save several result pages
import os
import urllib.parse
import urllib.request
class BaiduSpider:
    def __init__(self):
        self.headers = {
            'User-Agent': '(omitted)'}
        self.base_url = 'https://tieba.baidu.com/f?'
    def readPage(self, url):
        # Fetch one page and return it as decoded text
        req = urllib.request.Request(url, headers=self.headers)
        res = urllib.request.urlopen(req, timeout=20)
        html = res.read().decode('utf-8')
        return html
    def writePage(self, filename, html):
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(html)
        print('Written successfully')
    def main(self):
        name = input("Enter the Tieba forum name: ")
        begin = int(input("Enter the start page: "))
        end = int(input("Enter the end page: "))
        kw = {'kw': name}
        result = urllib.parse.urlencode(kw)
        os.makedirs('./file', exist_ok=True)  # make sure the output directory exists
        for i in range(begin, end + 1):
            pn = (i - 1) * 50
            url = self.base_url + result + '&pn=' + str(pn)
            html = self.readPage(url)
            filename = './file/page_%d.html' % i
            self.writePage(filename, html)
if __name__ == '__main__':
    spider = BaiduSpider()
    spider.main()
'''
Enter the Tieba forum name: 法拉利
Enter the start page: 5
Enter the end page: 8
Written successfully
Written successfully
Written successfully
Written successfully'''
II. The requests module
1. Installation
- pip install requests
- or install it from within your IDE
2. Common requests functions
- requests.get(URL)
Example 1
import requests
r = requests.get('http://www.baidu.com/').text
print(r) # prints the page content
Example 2
import requests
'''
response = requests.get(url, headers=headers)
1. url is the base URL without query parameters
2. the key-value pairs in params become the query parameters
response = requests.get(url, params=params, headers=headers)
'''
# Sub-example 1
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B
url = 'https://tieba.baidu.com/f?'
params = {'kw': '海贼王', 'pn': '250'}
headers = {
    'User-Agent': '(omitted)'}
response1 = requests.get(url, params=params, headers=headers, verify=False)
# print(response1.text) # returns the HTML of page 6 of the 海贼王 Tieba forum
# Sub-example 2
# https://tieba.baidu.com/f?kw=%E6%B5%B7%E8%B4%BC%E7%8E%8B
url = 'https://tieba.baidu.com/f?kw=海贼王&pn=250'
headers = {
    'User-Agent': '(omitted)'}
response2 = requests.get(url, headers=headers, verify=False)
# print(response2.text) # returns the HTML of page 6 of the 海贼王 Tieba forum
# Sub-example 3
url = 'https://qq.yh31.com/zjbq/2920180.html'
headers = {
    'User-Agent': '(omitted)'}
response3 = requests.get(url, headers=headers, verify=False)
# print(response3.text) # garbled output, e.g. <title>喜羊羊QQ表æƒ...</title>
# print(response3.content.decode('utf-8')) # returns correct HTML
'''
response.content is the raw bytes fetched from the site, with no processing applied
response.text is response.content decoded to str by requests,
which guesses the encoding when the server does not declare one
If the text comes out garbled:
Option 1
response.content.decode('utf-8')
Option 2
response3.encoding='utf-8'
print(response3.text)
'''
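The garbling can be reproduced without any network access. A minimal sketch, assuming the wrong guess is ISO-8859-1 (to my knowledge the fallback requests applies to text responses whose Content-Type carries no charset): the same UTF-8 bytes are decoded once with the wrong codec and once with the right one.

```python
original = '意见反馈'

# What the server sends: UTF-8 bytes (this is what response.content holds)
raw = original.encode('utf-8')

# Decoding with the wrong codec yields mojibake rather than an error
garbled = raw.decode('iso-8859-1')
print(garbled)

# Decoding with the right codec recovers the text
print(raw.decode('utf-8'))
```

Setting response.encoding = 'utf-8' before reading response.text has the same effect as the second decode.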
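3. Attributes of the response object
- response.text: the body as a Unicode string (str)
- response.content: the body as raw bytes
- response.content.decode('utf-8'): decode manually
- response.url: the URL of the response
- response.encoding = 'utf-8' followed by print(response.text): set the encoding first, then read the text
Example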
import requests
r = requests.get('http://www.baidu.com')
print(r.text)
'''
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta
...
...
href=http://jianyi.baidu.com/ class=cp-feedback>æ„è§å馈</a> 京ICPè¯030173å· <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
'''
print(r.content)
'''
b'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min
...
...
src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
'''
print(r.content.decode('utf-8'))
'''
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta
...
...
Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
'''
print(r.url) # http://www.baidu.com/
r.encoding = 'utf-8'
print(r.text)
'''
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta
...
...
Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
'''
4. Sending a POST request with requests
Example 1
import requests
r = requests.post('http://httpbin.org/post', data={'key': 'value'}).text
print(r) # prints the response body
'''
{
"args": {},
"data": "",
"files": {},
"form": {
"key": "value"
},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Content-Length": "9",
"Content-Type": "application/x-www-form-urlencoded",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.25.1",
"X-Amzn-Trace-Id": "Root=1-6083f17f-02c7589e52e792b40c5960af"
},
"json": null,
"origin": "省略",
"url": "http://httpbin.org/post"
}'''
Example 2: a simple translation tool
# A simple translation tool
import requests
import json
# Ask for the text to translate
content = input("Enter the text to translate: ")
# Target URL for the request
# url = 'https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule' # the '_o' must be removed, otherwise the server returns {"errorCode":50}
url = 'https://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
headers = {
    'User-Agent': 'Mozilla/5.0'}
# Form data to send
data = {
'i': content,
'from': 'AUTO',
'to': 'AUTO',
'smartresult': 'dict',
'client': 'fanyideskweb',
'salt': '16190646169357',
'sign': '94a417e26fdc3c0cb59e843108a3a158',
'lts': '1619064616935',
'bv': 'b77c8593ce9e7129dee4cc45ac542b2a',
'doctype': 'json',
'version': '2.1',
'keyfrom': 'fanyi.web',
'action': 'FY_BY_REALTlME',
}
response = requests.post(url, data=data, headers=headers, verify=False)
html = response.text
# print(html)
'''{"type":"ZH_CN2EN","errorCode":0,"elapsedTime":0,"translateResult":[[{"src":"你好","tgt":"hello"}]]}''' # This is a JSON-formatted string.
# Parse the data
# JSON str --> Python dict
r_dict = json.loads(html)
print(r_dict['translateResult'][0][0]['tgt'])
'''
Enter the text to translate: 猫
The cat'''
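Example 3: a simple translation tool via JavaScript reverse engineering
Step 1: identify the URL: https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
Step 2: inspect and analyze the parameters sent with the request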
i: 中国
from: AUTO
to: AUTO
smartresult: dict
client: fanyideskweb
salt: 16195811191097
sign: ffc58c4904e3b84538f1a324b00a141d
lts: 1619581119109
bv: b77c8593ce9e7129dee4cc45ac542b2a
doctype: json
version: 2.1
keyfrom: fanyi.web
action: FY_BY_REALTlME
The parameters that need to be solved are these three:
salt: 16195811191097
sign: ffc58c4904e3b84538f1a324b00a141d
lts: 1619581119109
In the Initiator column, double-click fanyi.min.js:1, then click {} to pretty-print the JavaScript file.
Press Ctrl+F and search for 'salt'.
First: r = "" + (new Date).getTime()
Paste (new Date).getTime() into the Console to inspect it; it is a 13-digit millisecond timestamp.
Python code that reproduces the timestamp:
import time
r = str(int(time.time()*1000))
Next: i = r + parseInt(10 * Math.random(), 10), where parseInt(10 * Math.random(), 10) is a random integer from 0 to 9.
Reproducing i:
import random
i = random.randint(0, 9)
i = r + str(i)
Finally: sign: n.md5("fanyideskweb" + e + i + "Tbh5E8=q6U3EXe+&L[4c@") is an MD5 hash. To find out what e is, set a breakpoint and inspect it; it turns out to be the input text.
Reproducing sign:
import hashlib
def data_new(e):
    str_sign = "fanyideskweb" + e + i + "Tbh5E8=q6U3EXe+&L[4c@"
    md5 = hashlib.md5()
    md5.update(str_sign.encode())
    sign = md5.hexdigest()
    # print(sign) # e8b710fe24c560f01dbb1f724899bdfd
    data = {
        'i': e,
        'from': 'AUTO',
        'to': 'AUTO',
        'smartresult': 'dict',
        'client': 'fanyideskweb',
        'salt': i,
        'sign': sign,
        'lts': r,
        'bv': 'b77c8593ce9e7129dee4cc45ac542b2a',
        'doctype': 'json',
        'version': '2.1',
        'keyfrom': 'fanyi.web',
        'action': 'FY_BY_REALTlME',
    }
    return data
data = data_new(e)
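The three steps can also be collected into one function that takes the timestamp and random digit as optional arguments, which makes the result reproducible for testing. This is a sketch under the assumptions of the analysis above; the function name make_sign_fields and its parameters are introduced here, and the output is not claimed to match what the server currently expects.

```python
import hashlib
import random
import time

CLIENT = 'fanyideskweb'
SECRET = 'Tbh5E8=q6U3EXe+&L[4c@'  # constant copied from the JavaScript analysed above

def make_sign_fields(text, now_ms=None, digit=None):
    """Return the lts, salt and sign fields for the given input text."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)   # 13-digit millisecond timestamp
    if digit is None:
        digit = random.randint(0, 9)       # one random digit, 0 to 9
    lts = str(now_ms)
    salt = lts + str(digit)
    sign = hashlib.md5((CLIENT + text + salt + SECRET).encode('utf-8')).hexdigest()
    return {'lts': lts, 'salt': salt, 'sign': sign}

# Fixed inputs make the result repeatable
fields = make_sign_fields('中国', now_ms=1619581119109, digit=7)
print(fields['lts'], fields['salt'], len(fields['sign']))
```

Passing explicit values avoids relying on module-level variables such as i and r, so the function can be called once per request.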
# A simple translation tool that keeps the '_o' and reproduces the JavaScript signing
# Analysis:
# 'salt': '16190646169357',
# 'sign': '94a417e26fdc3c0cb59e843108a3a158',
# 'lts': '1619064616935',
import random
import time
import requests
import json
import hashlib
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
e = input("Enter the text to translate: ")
r = str(int(time.time()*1000))  # lts: 13-digit millisecond timestamp
i = r + str(random.randint(0, 9))  # salt: the timestamp plus one random digit
# print(r,i)
# Target URL
url = 'https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
    'Referer': 'https://fanyi.youdao.com/',
    'Cookie': 'your cookie'
}
# Form data to send
def data_new(e):
    # sign is the MD5 hex digest of client + text + salt + secret
    str_sign = "fanyideskweb" + e + i + "Tbh5E8=q6U3EXe+&L[4c@"
    md5 = hashlib.md5()
    md5.update(str_sign.encode())
    sign = md5.hexdigest()
    # print(sign) # e8b710fe24c560f01dbb1f724899bdfd
    data = {
        'i': e,
        'from': 'AUTO',
        'to': 'AUTO',
        'smartresult': 'dict',
        'client': 'fanyideskweb',
        'salt': i,
        'sign': sign,
        'lts': r,
        'bv': 'b77c8593ce9e7129dee4cc45ac542b2a',
        'doctype': 'json',
        'version': '2.1',
        'keyfrom': 'fanyi.web',
        'action': 'FY_BY_REALTlME',
    }
    return data
data = data_new(e)
response = requests.post(url, data=data, headers=headers, verify=False)
html = response.text
print(html)
'''{"translateResult":[[{"tgt":"The fox","src":"狐狸"}]],"errorCode":0,"type":"zh-CHS2en","smartResult":{"entries":["","[脊椎] fox\r\n"],"type":1}}''' # This is a JSON-formatted string.
# Parse the data
# JSON str --> Python dict
r_dict = json.loads(html)
print(r_dict['translateResult'][0][0]['tgt'])
'''
Enter the text to translate: 狐狸
The fox'''
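5. Setting a proxy in requests
# Proxy IPs
# When a crawler requests a site too frequently in a short time, or for other reasons, the site may identify it as a crawler.
# Switching IPs through a proxy is one way to cope with such anti-scraping measures.
# Purpose: 1. hide the real IP  2. cope with anti-scraping measures
# Proxy anonymity levels: 1. Transparent: the server knows a proxy is in use and knows your real IP  2. Anonymous: the server knows a proxy is in use but not your real IP
# 3. Elite (high anonymity): the server knows neither that a proxy is in use nor your real IP
# Using the Wandou (豌豆) IP proxy service: 1. register  2. set the whitelist (add your own public IP)  3. click Tools, then Extract API
import requests
url = 'http://httpbin.org/ip'
# Configure the proxies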
ips = [
'223.240.245.57:23564',
'223.241.51.205:3617',
'114.232.64.153:36410',
'183.141.100.99:3617',
'61.191.85.17:36410',
'114.98.148.7:36410',
'60.174.189.138:766',
'117.57.21.134:3617',
'117.70.39.253:5412',
'114.227.163.5:766',
'125.123.120.238:36410',
'114.100.1.181:3617',
'42.59.102.21:23564',
'183.92.238.218:36410',
'121.233.207.1:5412',
'223.240.247.104:3617',
'60.174.188.26:36410',
'182.87.241.109:766',
'114.227.11.169:766',
'218.91.0.33:894',
]
available_list = []
for ip in ips:
    print(ip)
    try:
        # Include the scheme in the proxy URL, as the requests documentation does
        response = requests.get(url, proxies={'http': 'http://' + ip}, timeout=0.5)
        print(response.text)
        available_list.append(ip)
    except Exception as e:
        print("Request failed:", e)
print(available_list)
'''
['114.232.64.153:36410']'''
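Once the working addresses are known, each request can pick one at random. A minimal offline sketch: the helper build_proxies is introduced here for illustration, and the addresses are placeholders from a documentation range rather than real proxies.

```python
import random

def build_proxies(address):
    """Turn 'host:port' into the proxies mapping that requests accepts."""
    proxy_url = 'http://' + address
    # Route both plain and TLS traffic through the same HTTP proxy
    return {'http': proxy_url, 'https': proxy_url}

# Placeholder addresses, not real proxies
pool = ['192.0.2.10:8080', '192.0.2.11:8080']
proxies = build_proxies(random.choice(pool))
print(proxies)
# requests.get(url, proxies=proxies, timeout=5) would then go through the proxy
```

Including the 'https' key matters when the target URL uses HTTPS, because requests selects the proxy by the scheme of the URL being fetched.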
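6. Handling untrusted SSL certificates
What is an SSL certificate?
- An SSL certificate is a kind of digital certificate, comparable to an electronic copy of a driver's license, passport, or business license. Because it is installed on the server, it is also called an SSL server certificate. It follows the SSL protocol and is issued by a trusted certificate authority (CA) after the CA has verified the server's identity. It provides server authentication and encryption of data in transit.
Test site: https://inv-veri.chinatax.gov.cn/
Example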
import requests
# response = requests.get('https://inv-veri.chinatax.gov.cn/').text
# print(response)
'''
requests.exceptions.SSLError: HTTPSConnectionPool(host='inv-veri.chinatax.gov.cn', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1091)')))'''
response = requests.get('https://inv-veri.chinatax.gov.cn/', verify=False).text
print(response) # with certificate verification disabled, the page content is returned
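7. Cookies
Cookie: information recorded on the client side that identifies the user. HTTP is a stateless protocol: client and server interact only for the duration of one request/response exchange, so on the next request the server would treat the client as a new one. For the server to know that a request comes from the same user as before, the client's information has to be stored somewhere.
Uses:
1. Simulating a logged-in session
Simulate being logged in to Zhihu
Target URL: 'https://www.zhihu.com/hot'
Send the request and get the response
Example 1
# Simulate being logged in to Zhihu
# Target url=https://www.zhihu.com/hot
# Send the request and get the response
import requests
url = 'https://www.zhihu.com/hot'
headers = {
    'Cookie': '(omitted)',
    'User-Agent': '(omitted)'
}
response = requests.get(url, headers=headers).text
print(response) # without a login cookie the logged-in page is not returned; adding 'Cookie' returns the expected data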
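Instead of pasting the whole cookie string into the headers, the string copied from the browser can be split into a dict and passed through the cookies parameter of requests. A minimal offline sketch using the standard library; the helper name and the cookie values are invented for illustration.

```python
from http.cookies import SimpleCookie

def cookie_header_to_dict(header):
    """Parse a 'name=value; name2=value2' Cookie header into a dict."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}

# Invented cookie string, for illustration only
cookies = cookie_header_to_dict('sessionid=abc123; theme=dark')
print(cookies)  # {'sessionid': 'abc123', 'theme': 'dark'}
# requests.get(url, cookies=cookies, headers=headers) would send them
```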
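2. Countering anti-scraping measures
The 12306 website
Ticket search: Hangzhou to Shanghai on the 5th, then click Search
Question 1: why does the page show data that is absent from the page source?
Conclusion: when data is visible on the page but missing from the source, the server most likely delivered it in additional requests after the initial page load.
Question 2: how do we locate a keyword such as G9314?
The page as a whole does not reload but part of it changes, which indicates AJAX.
Solutions:
1. Find the real data endpoint, here the query request
2. Use selenium
Example 2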
import re
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
class Ticket12306:
    def __init__(self):
        self.headers = {
            'User-Agent': 'Mozilla/5.0 '
        }
    def requests_url(self, cookie, url):
        # Merge the Cookie header with the default headers without mutating the caller's dict
        headers = {**cookie, **self.headers}
        response = requests.get(url, headers=headers, verify=False)
        response.encoding = 'utf-8'
        return response
    def get_station(self, cookie, url):
        response = self.requests_url(cookie, url)
        # \u4e00-\u9fa5 matches Chinese characters, so each match pairs a Chinese station name with its uppercase code
        station = re.findall(r'([\u4e00-\u9fa5]+)\|([A-Z]+)', response.text)
        # Convert the list of pairs to a dict (name -> code)
        station_data = dict(station)
        # Swap keys and values to get code -> name
        station_names = {code: name for name, code in station_data.items()}
        return station_names
    def main(self, cookie_1, url_1, cookie_2, url_2):
        response = self.requests_url(cookie_1, url_1)
        json_tickets = response.json()
        data_list = json_tickets['data']['result']
        station_names = self.get_station(url=url_2, cookie=cookie_2)
        for item in data_list:
            data = item.split('|')
            # Reformat the date from YYYYMMDD to YYYY-MM-DD
            chars = list(data[13])
            chars.insert(4, "-")
            chars.insert(7, "-")
            data[13] = ''.join(chars)
            print("Train: " + data[3],
                  "Departure station: " + station_names[data[6]],
                  "Arrival station: " + station_names[data[7]],
                  "Departure time: " + data[8],
                  "Arrival time: " + data[9],
                  "Duration: " + data[10],
                  "Bookable: " + data[11],
                  "Origin station: " + station_names[data[4]],
                  "Terminal station: " + station_names[data[5]],
                  "Travel date: " + data[13],
                  "Business class: " + data[32],
                  "First class: " + data[31],
                  "Second class: " + data[30],
                  "Deluxe soft sleeper: " + data[21],
                  "Soft sleeper: " + data[23],
                  "EMU sleeper: " + data[33],
                  "Hard sleeper: " + data[28],
                  "Soft seat: " + data[24],
                  "Hard seat: " + data[29])
if __name__ == '__main__':
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    get_ticket_12306 = Ticket12306()
    cookie_1 = {
        'Cookie': '(omitted)', }
    url_1 = 'https://kyfw.12306.cn/otn/leftTicket/query?leftTicketDTO.train_date=2021-05-05&leftTicketDTO.from_station=HZH&leftTicketDTO.to_station=SHH&purpose_codes=ADULT'
    cookie_2 = {
        'Cookie': '(omitted)'}
    url_2 = 'https://kyfw.12306.cn/otn/resources/js/framework/station_name.js?station_version=1.9188'
    get_ticket_12306.main(cookie_1, url_1, cookie_2, url_2)
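Summary:
Each record is a string delimited by '|', so we need to know where each individual field sits.
Splitting on '|' produces a list, and the list indices then identify each field for the later logic.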
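The index lookup can be made more readable by naming the positions once. A minimal sketch, assuming the field positions used in the example above (3 for the train number, 8 and 9 for the times, 10 for the duration, 13 for the date); the record itself is invented.

```python
# Indices taken from the example above; only a few fields are named here
FIELD_INDEX = {'train': 3, 'depart_time': 8, 'arrive_time': 9,
               'duration': 10, 'date': 13}

def parse_record(record):
    """Split one '|'-delimited record and pick out the named fields."""
    parts = record.split('|')
    row = {name: parts[index] for name, index in FIELD_INDEX.items()}
    d = row['date']  # 'YYYYMMDD' -> 'YYYY-MM-DD'
    row['date'] = d[:4] + '-' + d[4:6] + '-' + d[6:]
    return row

# Invented record with the same layout (35 fields, mostly empty)
parts = [''] * 35
parts[3], parts[8], parts[9], parts[10], parts[13] = 'G9314', '07:00', '08:05', '01:05', '20210505'
row = parse_record('|'.join(parts))
print(row)
```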
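8. Sessions
Session: information recorded on the server side that identifies the user; here "session" means a conversation between client and server
Case study: getting past the 12306 image CAPTCHA
URL: https://kyfw.12306.cn/otn/resources/login.html
1. Correct account, wrong password, wrong CAPTCHA
2. Correct account, wrong password, correct CAPTCHA
3. Correct account, correct password, correct CAPTCHA: OK
Case 1:
Inspect the JavaScript request loaded when the CAPTCHA is wrong
Result returned: /**/jQuery191018415675635795536_1619398855906({result_message: "验证码校验失败", result_code: "5"}); (the message means the CAPTCHA check failed)
Case 2:
Request URL: https://kyfw.12306.cn/passport/captcha/captcha-check?callback=jQuery19109716061695448353_1619405746616&answer=197%2C45&rand=sjrand&login_site=E&_=1619405746618
Parameters sent:
callback: jQuery19109716061695448353_1619405746616
answer: 197,45
rand: sjrand
login_site: E
_: 1619405746618
Result returned: /**/jQuery19109716061695448353_1619405746616({"result_message":"验证码校验成功","result_code":"4"}); (the message means the CAPTCHA check succeeded)
An XHR request named login is sent at the same time:
Request URL: https://kyfw.12306.cn/passport/web/login
Parameters sent: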
sessionId: 01d2TIqaddEzxCU28_GKB5Vcx6pP744fcOUyRfChk3c8ipKNCQPUrw6EEG98nN5ql6XKac_gGEGflLST6xpxnGguWMsIsoEuz0kKp9vrymPCFPwwIWh5-mCKTHbuJ6JjJm9GOGs3FoKnRG4ekQumkiHipl-wh1fnhhp61Bca3DA7Eovt_bdEryA7r-P1XrPVhVRegW3nON-AG5VHfGAR7ESg
sig: 05XqrtZ0EaFgmmqIQes-s-CJXzPZeUryxboUG9ElN6m-Gluzj13p46YPFqGVUE13mwXLW9LePExNtTkfJbYwQx-SiDQkK0HgJuFMYzM4p78PFxKeRNvi0NcYUY_IvyYkChfVWcqh3BWyF92Tiszkl7vqhX7-KltDfOK_bDcSEC2-Bm7srz5Pm38t5tc6pY-tmg-CO_6Z8xNxewxRapD0iP30diKryST_sDaSZDYJNFYHFaUJU2g-Dpi_XenL-nsYWqCD7RBriG6_I3-IMPUHLq6d5yFpBFfH7act7AMeQErOAkktFlZ9147ZpgWCtCYmyosyaBjFn8j4_HQW9ZQlh_Agxq8w7fEASqbOQNfLm2HUM1Z6zD-wn314_uKIkFv2QiTQSNCXnM8LKGpZ9NRO_5J3FdUaNyYgPBu0uZ1chQAtaDXVkPG-z0HdogKCoeBSAyBEdv5Sx7EdbjOaTUSbuyiuhheYynx6CpZ6ZE0aItv3A
if_check_slide_passcode_token: FFFF0N000000000085DE:1619405993688:0.096504509317295041
scene: nc_login
tk:
username: 18582868483
password: @grRrViQiBQgpTr59DNzcVw==
appid: otn
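Logic: the CAPTCHA must be verified before the username and password can be submitted.
1. Identify the target URL: https://kyfw.12306.cn/passport/captcha/captcha-check
2. Send a POST request with this data:
callback: jQuery19109716061695448353_1619405746616
answer: 197,45
rand: sjrand
login_site: E
_: 1619405746618
3. Get the 12306 CAPTCHA image
Method 1:
Right-click the image in the page and copy the image address:
img_url=''
# base64 is a pseudo-encryption: it is not an encryption algorithm at all, the data merely looks like ciphertext
# It is a way of representing arbitrary binary data with 64 characters
# The 64 characters are A-Z, a-z, 0-9, + and /
import base64
img = ''
img_data = base64.b64decode(img) # returns bytes
print(type(img_data)) # <class 'bytes'>
with open('code.png', 'wb') as fn:
    fn.write(img_data)
'''
The copied address is a base64-encoded image.
Decoding it as-is raises binascii.Error: Incorrect padding
Remove the leading data:image/jpg;base64, first
'''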
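Removing the prefix can be done in code rather than by hand. A minimal sketch, using stand-in bytes instead of a real CAPTCHA image; the helper name decode_data_uri is introduced here for illustration.

```python
import base64

def decode_data_uri(data_uri):
    """Return the raw bytes of a 'data:<mime>;base64,<payload>' URI."""
    header, _, payload = data_uri.partition(',')
    if not header.endswith(';base64'):
        raise ValueError('not a base64 data URI')
    return base64.b64decode(payload)

# Stand-in bytes instead of a real CAPTCHA image
fake_image = b'\x89PNG fake image bytes'
data_uri = 'data:image/jpg;base64,' + base64.b64encode(fake_image).decode('ascii')

print(decode_data_uri(data_uri) == fake_image)  # True
```

The returned bytes can be written to code.png exactly as in the snippet above.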
Method 2:
Step 1: get the request URL of the CAPTCHA image. Request URL: https://kyfw.12306.cn/passport/captcha/captcha-image64?login_site=E&module=login&rand=sjrand&1619414089185&callback=jQuery19109716061695448353_1619405746616&_=1619405746621
Step 2: open it in the browser to view the data: https://kyfw.12306.cn/passport/captcha/captcha-image64?login_site=E&module=login&rand=sjrand
Step 3: remove the 64 from the address
Conclusion: https://kyfw.12306.cn/passport/captcha/captcha-image?login_site=E&module=login&rand=sjrand requests the image without base64 encoding
4. Click the correct images
# Getting past the 12306 image CAPTCHA
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
req = requests.session() # keep the session, so cookies persist across requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36'}
def get_img():
    # Download the CAPTCHA image
    pic_response = req.get(
        'https://kyfw.12306.cn/passport/captcha/captcha-image?login_site=E&module=login&rand=sjrand', headers=headers, verify=False).content
    with open('code.png', 'wb') as f:
        f.write(pic_response)
def login():
    # Coordinates are measured from the top-left corner of the CAPTCHA image
    codeStr = input('Enter the CAPTCHA coordinates: ')
    data = {
        'answer': codeStr,
        'rand': 'sjrand',
        'login_site': 'E'
    }
    response = req.post('https://kyfw.12306.cn/passport/captcha/captcha-check', data=data, headers=headers, verify=False)
    print(response.text) # an empty answer returns {"result_message":"验证码校验失败,信息为空","result_code":"8"}
get_img()
login()
'''
Enter the CAPTCHA coordinates: 50,40,185,114
{"result_message":"验证码校验成功","result_code":"4"}'''