Configuring an Anti-Crawler Policy in Nginx

Latest recommended article published on 2023-03-20 09:57:00

shangrila_kun

Category: Nginx. Tags: Nginx, anti-crawler, spider, anti-scraping

Article link: https://blog.csdn.net/shangrila_kun/article/details/89501343

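Well-behaved crawlers such as Baiduspider help a site's search ranking. Others ignore the robots rules and scrape site content maliciously, and these need to be blocked. One way to do this is to deny requests from specific User-Agent (UA) strings.

Create the configuration file

(For example, go to the conf directory under the nginx installation directory and create agent_deny.conf.)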

# Block scraping tools such as Scrapy (~* is a case-insensitive match)
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

# Block the listed UAs and requests with an empty UA (~ is a case-sensitive match).
# Keep the pattern on a single line: a line break inside the quotes becomes part of the regex.
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$") {
    return 403;
}
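
Before deploying the rules, it can help to check offline which UA strings they would catch. The following is a minimal sketch, not part of the original article: it approximates the two rules with grep -E (nginx itself uses PCRE, so edge cases may differ), and the helper name ua_blocked and the shortened pattern lists are assumptions made for illustration.

```shell
#!/bin/sh
# Offline approximation of the two nginx rules using grep -E.
# Assumption: only a shortened subset of the UA list is reproduced here.

CI_PATTERN='Scrapy|Curl|HttpClient'            # first rule, case-insensitive (~*)
CS_PATTERN='WinHttp|YYSpider|MJ12bot|Java|^$'  # second rule, case-sensitive (~)

# ua_blocked UA: exit status 0 if either rule would match the given UA string.
ua_blocked() {
    printf '%s\n' "$1" | grep -Eiq -- "$CI_PATTERN" && return 0
    printf '%s\n' "$1" | grep -Eq -- "$CS_PATTERN"
}

# Try a few sample UAs, including an empty one.
for ua in 'YYSpider' 'curl/7.29.0' '' 'Baiduspider'; do
    if ua_blocked "$ua"; then
        echo "blocked: '$ua'"
    else
        echo "allowed: '$ua'"
    fi
done
```

With these sample inputs, the first three UAs are reported as blocked and Baiduspider as allowed. Note that 'curl/7.29.0' is caught only because the first rule ignores case.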

Then insert the following line into the server block of the site's configuration:

include agent_deny.conf;

Reload nginx:

/usr/local/nginx/sbin/nginx -s reload

Testing

Use curl -A to simulate a crawler request, for example:

curl -I -A 'YYSpider' www.haoeasy.cn

Result

[root@izwz93bcx7adgtozg4rvanz conf]# curl -I -A 'YYSpider' www.haoeasy.cn
HTTP/1.1 403 Forbidden
Server: nginx/1.12.0
Date: Wed, 24 Apr 2019 11:35:21 GMT
Content-Type: text/html
Content-Length: 169
Connection: keep-alive

Simulate a request with an empty UA:

curl -I -A' ' www.haoeasy.cn

Result

[root@izwz93bcx7adgtozg4rvanz conf]# curl -I -A' ' www.haoeasy.cn
HTTP/1.1 403 Forbidden
Server: nginx/1.12.0
Date: Wed, 24 Apr 2019 11:36:06 GMT
Content-Type: text/html
Content-Length: 169
Connection: keep-alive

Simulate a crawl by Baiduspider:

curl -I -A 'Baiduspider' www.haoeasy.cn

[root@izwz93bcx7adgtozg4rvanz conf]# curl -I -A 'Baiduspider' www.haoeasy.cn
HTTP/1.1 200 OK
Server: nginx/1.12.0
Date: Wed, 24 Apr 2019 11:36:47 GMT
Content-Type: text/html
Content-Length: 612
Last-Modified: Fri, 12 Apr 2019 13:49:36 GMT
Connection: keep-alive
ETag: "5cb09770-264"
Accept-Ranges: bytes
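
The three manual checks above can be combined into one small script. This is a hedged sketch rather than something from the original article: the function names status_for_ua and verdict are hypothetical, the host is passed as the first argument, and treating any 2xx or 3xx response as "allowed" is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the HTTP status code the server returns for each UA.
# Usage: sh check_ua.sh www.haoeasy.cn
HOST="${1:-}"

# status_for_ua HOST UA: send a HEAD request with the given UA, print only the status code.
status_for_ua() {
    curl -s -o /dev/null -I -A "$2" -w '%{http_code}' "$1"
}

# verdict CODE: map a status code to a short label.
verdict() {
    case "$1" in
        403) echo "blocked" ;;
        2??|3??) echo "allowed" ;;
        *) echo "unexpected ($1)" ;;
    esac
}

# Only contact the server when a host was given on the command line.
if [ -n "$HOST" ]; then
    for ua in 'YYSpider' 'Scrapy/2.0' 'Baiduspider'; do
        code=$(status_for_ua "$HOST" "$ua")
        printf '%-12s %s %s\n' "$ua" "$code" "$(verdict "$code")"
    done
fi
```

If the rules are active, the first two UAs should come back as 403 and Baiduspider should not; a code of 000 means curl could not reach the host at all.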

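UA types

FeedDemon             content scraping
BOT/0.1 (BOT for JCE) SQL injection
CrawlDaddy            SQL injection
Java                  content scraping
Jullo                 content scraping
Feedly                content scraping
UniversalFeedParser   content scraping
ApacheBench           CC attack tool
Swiftbot              useless crawler
YandexBot             useless crawler
AhrefsBot             useless crawler
YisouSpider           useless crawler (now acquired by UC Shenma Search; this spider can be allowed)
JikeSpider            useless crawler
MJ12bot               useless crawler
ZmEu                  phpMyAdmin vulnerability scanning
WinHttp               scraping, CC attacks
EasouSpider           useless crawler
HttpClient            TCP attack
Microsoft URL Control scanning
YYSpider              useless crawler
jaunty                WordPress brute-force scanner
oBot                  useless crawler
Python-urllib         content scraping
Indy Library          scanning
FlightDeckReports Bot useless crawler
Linguee Bot           useless crawler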
