Recently the site became very slow: CPU usage and overall server load were extremely high. The access logs showed a swarm of unknown spiders constantly crawling the site, which experience said was the cause. I wrote blocking rules for my setup, and after applying them the load dropped back down. Below is a summary of how to block unknown spider UAs under IIS, nginx, and Apache.
Note: adjust the UA list for your own situation by deleting or adding entries. The rules I provide include many rarely-seen spider UAs that you will almost never need; if your site is special and needs certain spiders to crawl it, review the rules carefully and remove the corresponding UA entries.
Tested OK on IIS 7.5.
Block access by UA pattern, returning a 403:
<rule name="NoUserAgent" stopProcessing="true">
<match url=".*" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="pattern1|pattern2|pattern3" />
</conditions>
<action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="You did not present a User-Agent header which is required for this site" />
</rule>
For example, to block an empty UA (matched by `^$`) together with other patterns — do not start the pattern with a stray `|`, since an empty alternative matches every UA:
<add input="{HTTP_USER_AGENT}" pattern="^$|pattern2|pattern3" />
For example, to block specific UAs plus an empty UA:
<add input="{HTTP_USER_AGENT}" pattern="^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot" />
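These `pattern` attributes are ordinary regular expressions, so a stray leading `|` adds an empty alternative that matches every UA and would lock out all visitors. A minimal Python sketch (sample UA strings are illustrative) of how the alternation above classifies requests:

```python
import re

# Same alternation as the IIS rule above; ^$ catches an empty UA.
# IIS rewrite conditions are case-insensitive by default, hence re.IGNORECASE.
pattern = re.compile(
    r"^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|"
    r"EC2LinkFinder|jikespider|Purebot|MJ12bot",
    re.IGNORECASE,
)

def blocked(user_agent: str) -> bool:
    """True if this UA would receive the 403 response."""
    return pattern.search(user_agent) is not None

print(blocked(""))                                          # True: empty UA
print(blocked("Mozilla/5.0 (compatible; MJ12bot/v1.4.8)"))  # True
print(blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/120"))  # False
```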
Block specific spiders:
<rewrite>
<rules>
<rule name="Block Some Ip Adresses OR Bots" stopProcessing="true">
<match url="(.*)" />
<conditions logicalGrouping="MatchAny">
<add input="{HTTP_USER_AGENT}" pattern="SpiderName" ignoreCase="true" /> <!-- block a specific spider -->
<add input="{HTTP_USER_AGENT}" pattern="^$" /> <!-- block empty-UA requests -->
<add input="{REMOTE_ADDR}" pattern="a single IP, or an IP range as a regex" />
</conditions>
<!-- You can also use <action type="AbortRequest" /> in place of the line below -->
<action type="CustomResponse" statusCode="403" statusReason="Access is forbidden." statusDescription="Access is forbidden." />
</rule>
</rules>
</rewrite>
Block everything except one file (here robots.txt — `negate="true"` inverts the match, so the 403 fires for every other URL):
<rule name="Block spider">
<match url="(^robots.txt$)" ignoreCase="false" negate="true" /> <!-- 403 for everything except robots.txt -->
<action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>
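Because of `negate="true"`, this rule is effectively an allowlist: the match succeeds for every URL that is not robots.txt, and those are the requests that receive the 403. A sketch of the inverted match (paths and status codes are illustrative):

```python
import re

allowed = re.compile(r"^robots\.txt$")

def status(path: str) -> int:
    # negate="true": the 403 action fires for every path that does NOT match.
    return 200 if allowed.match(path) else 403

print(status("robots.txt"))  # 200: the one exempted file
print(status("index.html"))  # 403: everything else is refused
```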
1. nginx: to block junk spiders, add the following to your nginx configuration file.
#Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#Block the listed UAs and empty-UA requests
if ($http_user_agent ~ "opensiteexplorer|MauiBot|FeedDemon|SemrushBot|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|semrushbot|alphaseobot|semrush|Feedly|UniversalFeedParser|webmeup-crawler|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
return 403;
}
#Block request methods other than GET/HEAD/POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}
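Note that nginx `~` is a case-sensitive regex match while `~*` is case-insensitive, which is why the list above spells out both `SemrushBot` and `semrushbot`. A rough Python equivalent of the three checks, with an abridged, illustrative UA list:

```python
import re

UA_BLOCK = re.compile(r"MauiBot|SemrushBot|semrushbot|AhrefsBot|MJ12bot|^$")  # '~' : case-sensitive
SCRAPER  = re.compile(r"Scrapy|Curl|HttpClient", re.IGNORECASE)               # '~*': case-insensitive
METHODS  = re.compile(r"^(GET|HEAD|POST)$")

def decide(method: str, user_agent: str) -> int:
    """Return the status code the three nginx 'if' blocks would produce."""
    if SCRAPER.search(user_agent):
        return 403
    if UA_BLOCK.search(user_agent):
        return 403
    if not METHODS.match(method):       # '!~' inverts the match
        return 403
    return 200

print(decide("GET", "Mozilla/5.0 Chrome/120"))  # 200
print(decide("GET", "scrapy/2.11"))             # 403: '~*' ignores case
print(decide("GET", "mauibot/1.0"))             # 200: '~' is case-sensitive, lowercase slips through
print(decide("DELETE", "Mozilla/5.0"))          # 403: method not whitelisted
```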
2. IIS 7/8/10 and later: create a web.config file in the site root and add the following:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<system.webServer>
<rewrite>
<rules>
<rule name="Block spider">
<match url="(^robots.txt$)" ignoreCase="false" negate="true" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$"
ignoreCase="true" />
</conditions>
<action type="AbortRequest" />
</rule>
</rules>
</rewrite>
</system.webServer>
</configuration>
3. Apache: add the following rules to your .htaccess file:
<IfModule mod_rewrite.c>
RewriteEngine On
#Block spider
RewriteCond %{HTTP_USER_AGENT} "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" [NC]
RewriteRule !(^robots\.txt$) - [F]
</IfModule>
Note: the rules above block a default set of unknown spiders; to block others, add them following the same pattern.
Appendix: names of the major spiders:
Google spider: googlebot
Baidu spider: baiduspider
Baidu mobile spider: baiduboxapp
Yahoo spider: slurp
Alexa spider: ia_archiver
MSN spider: msnbot
Bing spider: bingbot
AltaVista spider: scooter
Lycos spider: lycos_spider_(t-rex)
AllTheWeb spider: fast-webcrawler
Inktomi spider: slurp
Youdao spider: YodaoBot and OutfoxBot
热土 spider: Adminrtspider
Sogou spider: sogou spider
SOSO spider: sosospider
360 Search spider: 360spider
Common junk UAs seen around the web:
Content scrapers:
FeedDemon
Java
Jullo
Feedly
UniversalFeedParser
SQL injection:
BOT/0.1 (BOT for JCE)
CrawlDaddy
Useless crawlers:
EasouSpider
Swiftbot
YandexBot
AhrefsBot
jikeSpider
MJ12bot
YYSpider
oBot
CC (HTTP flood) attack tools:
ApacheBench
WinHttp
TCP attacks:
HttpClient
Scanners:
Microsoft URL Control
ZmEu (phpMyAdmin scanner)
jaunty