Background
Spiders and crawlers that hit a site frequently can waste a lot of server bandwidth and resources. By checking the User-Agent and blocking these spiders in nginx or Apache, you can save some traffic and also fend off some malicious access.
nginx
1. Go to the /opt/sudytech/nginx/conf directory and edit nginx.conf, adding the following configuration inside the server block:
# Block scraping by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}
# Block the listed UAs and requests with an empty UA (User-Agent)
if ($http_user_agent ~* "BaiduSpider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|LinkpadBot|Ezooms|360Spider|^$") {
    return 403;
}
# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
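For orientation, here is a minimal sketch of where these rules sit in nginx.conf; the listen port, server_name, and location are placeholders rather than values from the original setup. All three if blocks go directly inside the server block:
http {
    server {
        listen       80;
        server_name  localhost;

        # UA and request-method filtering rules go here
        if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
            return 403;
        }
        # ... the other two if blocks ...

        location / {
            root   html;
            index  index.html;
        }
    }
}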
2. Restart (or reload) nginx so the configuration takes effect.
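For example (a minimal sketch; the sbin path is an assumption based on the conf path above), verify the configuration syntax first and then reload, which applies the new rules without dropping connections:
[root@localhost conf]# /opt/sudytech/nginx/sbin/nginx -t
[root@localhost conf]# /opt/sudytech/nginx/sbin/nginx -s reload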
3. Test with curl: the -A option lets you claim any User-Agent string for the request, and the -v option prints the request headers, including User-Agent:
[root@localhost conf]# curl -v -I -A "BaiduSpider" 192.168.11.134
.................
> User-Agent: BaiduSpider
> Host: 192.168.11.134
> Accept: */*
>
< HTTP/1.1 403 Forbidden
HTTP/1.1 403 Forbidden
..........
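Two more illustrative checks against the same test host: the first hits the empty-UA branch (^$), since -A "" makes curl omit the User-Agent header entirely, and the second uses a method outside GET|HEAD|POST:
[root@localhost conf]# curl -I -A "" 192.168.11.134
[root@localhost conf]# curl -i -X OPTIONS 192.168.11.134
Both requests should come back with HTTP/1.1 403 Forbidden.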
apache
Add the following rules to the site's Apache configuration (for example the VirtualHost block or .htaccess, with mod_rewrite enabled):
# Block the listed UAs and requests with an empty UA (User-Agent)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (^$|360Spider) [NC]
RewriteRule ^(.*)$ - [F]
# Block TRACE, TRACK, and OPTIONS requests
RewriteEngine On
RewriteCond %{REQUEST_METHOD} ^(TRACE|TRACK|OPTIONS)
RewriteRule .* - [F]
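To verify the Apache rules, a quick check with curl (assuming the same test host as in the nginx example); both requests should be refused with 403 via the [F] flag:
[root@localhost conf]# curl -I -A "360Spider" 192.168.11.134
[root@localhost conf]# curl -i -X TRACE 192.168.11.134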