3 The Fiddler Packet Capture Tool
3.1 Installation
The installer can be downloaded from the official site: https://www.telerik.com/fiddler. The installation steps are omitted here.
3.2 Configuration
Switch to the HTTPS tab and check the options shown below.
Then click the Actions button and choose Trust Root Certificate in the pop-up.
If the dialog shown below appears, choose Yes.
After configuring, restart Fiddler and restart the browser.
Open Baidu Tieba and note the User-Agent information.
Run the crawler program and note the User-Agent information again.
With Fiddler running, the crawler program may fail certificate verification with an error such as:
SSL: CERTIFICATE_VERIFY_FAILED
In that case, add code that creates an unverified SSL context:
from urllib import request, parse
import ssl

# Load a page
def loadPage(url):
    # Build a request
    req = request.Request(url)
    #print(req)  # <urllib.request.Request object at 0x007B1370>
    # Create an unverified SSL context
    context = ssl._create_unverified_context()
    # Open the response object
    response = request.urlopen(req, context=context)
    #print(response)  # <http.client.HTTPResponse object at 0x01F36BF0>
    # Read the response body
    html = response.read()
    # Decode the returned bytes as UTF-8
    content = html.decode('utf-8')
    return content
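As a side note, `ssl._create_unverified_context()` is a private CPython helper. A rough equivalent, shown here only as a sketch and not as the tutorial's code, is to relax a default context by hand:

```python
import ssl

# Build a context that skips certificate checks, roughly equivalent to
# ssl._create_unverified_context(). Note the order: check_hostname must be
# disabled before verify_mode can be set to CERT_NONE.
context = ssl.create_default_context()
context.check_hostname = False       # do not verify the hostname
context.verify_mode = ssl.CERT_NONE  # do not verify the certificate chain

print(context.verify_mode == ssl.CERT_NONE)  # True
```

Either context can be passed to `request.urlopen(req, context=context)`. Disabling verification is acceptable for local debugging through Fiddler, but should not be left in production code.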
3.3 Anti-Scraping and Countermeasures
Crawler → anti-crawler → anti-anti-crawler
A blog post on anti-crawler techniques: https://segmentfault.com/a/1190000005840672
3.3.1 Changing the User-Agent
Add request headers and modify the User-Agent:
capture the User-Agent that a real browser sends to the server with the packet capture tool,
then add it to the headers parameter of the request.
# Load a page
def loadPage(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
    }
    # Build a request carrying the browser User-Agent
    req = request.Request(url, headers=headers)
    #print(req)  # <urllib.request.Request object at 0x007B1370>
    # Create an unverified SSL context
    context = ssl._create_unverified_context()
    # Open the response object
    response = request.urlopen(req, context=context)
    #print(response)  # <http.client.HTTPResponse object at 0x01F36BF0>
    # Read the response body
    html = response.read()
    # Decode the returned bytes as UTF-8
    content = html.decode('utf-8')
    return content
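Whether the header really ends up on the request can be checked offline. The following is a small sketch; the User-Agent value is made up for illustration and the URL is only a placeholder, nothing is actually fetched:

```python
from urllib import request

# Hypothetical User-Agent value, for illustration only
headers = {'User-Agent': 'Mozilla/5.0 (test)'}
req = request.Request('https://tieba.baidu.com/', headers=headers)

# urllib normalizes header names to 'Capitalized' form internally,
# so the lookup key is 'User-agent'
print(req.get_header('User-agent'))  # Mozilla/5.0 (test)
```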
3.3.2 Randomly Selecting a User-Agent
Consider rotating among different User-Agent strings.
from urllib import request, parse
import ssl
import random

# A list of common User-Agent strings
ua_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12',
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.1 QQBrowser/6.9.11079.201',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)',
]

# Load a page
def loadPage(url):
    # Pick a random User-Agent from ua_list
    userAgent = random.choice(ua_list)
    headers = {
        'User-Agent': userAgent
    }
    # Build a request carrying the chosen User-Agent
    req = request.Request(url, headers=headers)
    #print(req)  # <urllib.request.Request object at 0x007B1370>
    # Create an unverified SSL context
    context = ssl._create_unverified_context()
    # Open the response object
    response = request.urlopen(req, context=context)
    #print(response)  # <http.client.HTTPResponse object at 0x01F36BF0>
    # Read the response body
    html = response.read()
    # Decode the returned bytes as UTF-8
    content = html.decode('utf-8')
    return content
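The rotation step itself can be tried in isolation, without any network access. This is a minimal sketch with made-up placeholder strings; the seed is only there to make the demo repeatable:

```python
import random

# Placeholder User-Agent strings, for illustration only
ua_list = ['UA-Chrome', 'UA-Firefox', 'UA-Safari']

random.seed(42)  # seed only so the demo is repeatable
picks = [random.choice(ua_list) for _ in range(5)]

# Every pick comes from the list; repeats across requests are expected
print(all(p in ua_list for p in picks))  # True
```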
That concludes the material on the Fiddler packet capture tool, crawlers, and anti-crawler countermeasures.