Common Methods for Countering Anti-Crawler Measures

Latest recommended article published on 2024-08-20 19:02:42

Genera1Z

Tags: web crawler

1 Why sites block crawlers

To protect the site's data
To conserve the site's resources

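2 How anti-crawler measures show up

They take roughly three forms.

2.1 No page is returned

For example, the server returns nothing, or deliberately delays its response.

2.2 The returned data is not the requested page

For example, the server returns an error page or a blank page, or returns the same page for every request when several pages are crawled.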
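The symptoms in 2.1 and 2.2 can be checked for automatically before a response is parsed. The following is only a minimal sketch: the helper name `diagnose_response` and the 200-character threshold for a "blank" page are assumptions made for illustration, and real sites usually need site-specific checks.

```python
import hashlib


def diagnose_response(status_code, body, seen_hashes, min_length=200):
    """Return a label for a likely anti-crawler symptom, or 'ok'.

    seen_hashes is a set the caller keeps across requests; it is updated in place.
    min_length is an arbitrary illustrative threshold, not a standard value.
    """
    if status_code >= 400:
        return "error page"
    if len(body.strip()) < min_length:
        return "blank page"
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        # Different URLs yielding identical bodies suggests a decoy page.
        return "repeated page"
    seen_hashes.add(digest)
    return "ok"
```

The caller would create one empty set per crawl and pass it to every call, so that a body seen on an earlier page is flagged when it comes back for a later one.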

2.3 Making the data harder to obtain

For example, requiring a login to view content, and adding a CAPTCHA to the login step.

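3 Methods for countering anti-crawler measures

Common anti-crawler techniques include: checking the User-Agent; monitoring request frequency and count and banning abnormal IPs; requiring CAPTCHAs; and loading content asynchronously with Ajax. The corresponding countermeasures are as follows.

3.1 Modify the request headers

By default, requests identifies itself with a python-requests User-Agent, which is easy to detect. Inspect the default headers first, then send a browser-like User-Agent instead: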

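import requests

link = "https://www.baidu.com"

# Inspect the headers that requests sends by default
r = requests.get(link)
print(r.request.headers)

# Send the same request with a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
r = requests.get(link, headers=headers)
print(r.request.headers)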

You can also build a pool of User-Agent strings and switch between them at random.
It also helps to set Host and Referer in the headers.
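A User-Agent pool combined with Host and Referer might be sketched as follows. The pool contents, the helper name `build_headers`, and the choice to default the Referer to the site's home page are all assumptions made for illustration; none of them comes from the original article.

```python
import random
from urllib.parse import urlsplit

# Illustrative pool; in practice, collect current strings from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def build_headers(url, referer=None):
    """Build headers with a randomly chosen User-Agent, plus Host and Referer."""
    parts = urlsplit(url)
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Host": parts.netloc,
        # Assumption: fall back to the site's home page when no Referer is given.
        "Referer": referer or f"{parts.scheme}://{parts.netloc}/",
    }
```

The result can be passed straight to requests, for example `requests.get(link, headers=build_headers(link))`, so that each request may carry a different User-Agent.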

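3.2 Vary the crawler's request interval

Requests that arrive too densely are easily blocked, so leave a reasonable gap between them. Identical gaps are also easy to recognize, so the interval should include some randomness.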

import random
import time

# Wait a random 1 to 6 seconds so requests are neither too frequent nor evenly spaced
sleep_time = random.randint(1, 5) + random.random()
time.sleep(sleep_time)
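In a real crawl the pause sits between consecutive requests. The sketch below shows one way to arrange that; the function name `crawl_politely`, its parameters, and the injectable `sleep` argument (there only to make the loop easy to test) are assumptions for illustration.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=6.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No need to wait before the very first request.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

With requests, `fetch` could simply be `requests.get`. The default bounds of 1 and 6 seconds mirror the range produced by the snippet above.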

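3.3 Use proxies

A proxy is a network service that lets one network endpoint (the client) connect to another (the server) indirectly.
We can maintain a pool of proxy IPs so that the crawler hides its real IP. Many proxies are available, but their quality varies widely, so they need to be screened, and keeping a proxy pool healthy takes ongoing effort.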

import requests

link = 'http://santostang.com'
# Replace the placeholder with the address of a working proxy
proxies = {'http': 'http://xxx.xxx.xxx.xxx'}
resp = requests.get(link, proxies=proxies)
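Screening and rotation can be combined: pick a proxy at random and discard it when it fails. This is a rough sketch rather than a complete pool manager; the function name `fetch_with_proxy_pool`, the retry limit, and the timeout are assumptions for illustration. The `get` argument is expected to be `requests.get` or any callable that accepts the same `proxies` and `timeout` keyword arguments.

```python
import random


def fetch_with_proxy_pool(url, proxy_pool, get, max_tries=3, timeout=10):
    """Try up to max_tries randomly chosen proxies, dropping the ones that fail.

    proxy_pool is a list of proxy URLs and is modified in place.
    """
    for _ in range(max_tries):
        if not proxy_pool:
            break
        proxy = random.choice(proxy_pool)
        try:
            return get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
        except OSError:
            # requests.RequestException derives from IOError (an alias of OSError),
            # so this also covers proxy and timeout errors raised by requests.
            proxy_pool.remove(proxy)
    raise RuntimeError("no working proxy found")
```

A call would look like `fetch_with_proxy_pool(link, pool, requests.get)`, where `pool` is a list of proxy addresses you have collected yourself.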

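3.4 Imitate human behavior

Use Selenium together with PhantomJS to drive a headless browser, so that pages load and run their JavaScript as they would for a human visitor.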

