Blocking spam spider UAs on IIS 6/IIS 7+, Nginx, and Apache to reduce server load (including how to deny access to a given User-Agent on IIS 7.5)

Recently my site became very slow: CPU usage was extremely high and overall server load was too. Checking the logs, I found a number of unknown spiders constantly crawling the site. Experience told me this was the cause, so I wrote blocking rules for my situation, and after applying them the load dropped. Below is a summary of how to block unknown spider UAs under IIS, Nginx, and Apache.

Note: adjust the UA list to your own situation by deleting or adding entries. The rules I provide include uncommon spider UAs that you will almost never need. If your site is special and needs certain spiders to crawl it, review the rules carefully and remove those UAs from the pattern.

Tested OK on IIS 7.5

Deny access to UAs matching the given patterns, returning status code 403:

<rule name="NoUserAgent" stopProcessing="true">
  <match url=".*" />
  <conditions>
    <!-- replace pattern1/pattern2/pattern3 with the UA substrings to block -->
    <add input="{HTTP_USER_AGENT}" pattern="pattern1|pattern2|pattern3" />
  </conditions>
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="Requests from this User-Agent are not allowed on this site" />
</rule>

For example, to block only the empty UA:

<add input="{HTTP_USER_AGENT}" pattern="^$" />

For example, to block specific UAs as well as the empty UA:

<add input="{HTTP_USER_AGENT}" pattern="^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot" />
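
When composing these patterns, be careful never to start the alternation with a bare |: the empty alternative matches every UA and would block all visitors (which is why the patterns above start with a real pattern or with ^$). IIS URL Rewrite uses ECMAScript regular expressions, but alternation behaves the same way in Python's regex engine, so you can sanity-check a pattern locally. A minimal sketch:

import re

# A leading "|" creates an empty alternative that matches ANY string:
bad  = re.compile("|AhrefsBot|MJ12bot")
good = re.compile("^$|AhrefsBot|MJ12bot")

print(bool(bad.search("Mozilla/5.0 (ordinary browser)")))   # True  -- would block everyone
print(bool(good.search("Mozilla/5.0 (ordinary browser)")))  # False -- normal browsers pass
print(bool(good.search("AhrefsBot/7.0")))                   # True  -- listed spider blocked
print(bool(good.search("")))                                # True  -- empty UA blocked by ^$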

Blocking specific spiders (and, optionally, IP addresses):

<rewrite>
  <rules>
    <rule name="Block Some IP Addresses OR Bots" stopProcessing="true">
      <match url="(.*)" />
      <conditions logicalGrouping="MatchAny">
        <add input="{HTTP_USER_AGENT}" pattern="spider name here" ignoreCase="true" /> <!-- blocks a specific spider -->
        <add input="{HTTP_USER_AGENT}" pattern="^$" /> <!-- blocks requests with an empty UA -->
        <add input="{REMOTE_ADDR}" pattern="a single IP, or a regex for an IP range" />
      </conditions>
      <!-- You can also use <action type="AbortRequest" /> in place of the line below -->
      <action type="CustomResponse" statusCode="403" statusReason="Access is forbidden." statusDescription="Access is forbidden." />
    </rule>
  </rules>
</rewrite>
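
With logicalGrouping="MatchAny", the rule fires as soon as any single condition matches. A rough Python illustration of that logic; the spider name and the IP prefix here are hypothetical placeholders:

import re

def blocked(user_agent: str, remote_addr: str) -> bool:
    # MatchAny semantics: block if ANY one condition is true
    conditions = (
        re.search("BadSpider", user_agent, re.IGNORECASE) is not None,  # hypothetical spider name
        user_agent == "",                                               # empty UA, i.e. ^$
        re.match(r"203\.0\.113\.", remote_addr) is not None,            # hypothetical IP prefix
    )
    return any(conditions)

print(blocked("BadSpider/1.0", "198.51.100.7"))  # True: the UA condition matches
print(blocked("Mozilla/5.0",   "203.0.113.9"))   # True: the IP condition matches
print(blocked("Mozilla/5.0",   "198.51.100.7"))  # False: nothing matches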

Blocking everything except a given file. Note that negate="true" inverts the URL match, so this rule returns 403 for every URL other than robots.txt; on its own it would also lock out normal visitors, which is why the complete web.config in section 2 below combines it with UA conditions:

<rule name="Block spider">
  <match url="(^robots\.txt$)" ignoreCase="false" negate="true" /> <!-- everything except robots.txt returns 403 -->
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>
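
A tiny sketch to make the inverted match concrete:

import re

def returns_403(url: str) -> bool:
    matches = re.match(r"^robots\.txt$", url) is not None
    return not matches  # negate="true": the rule fires when the URL does NOT match

print(returns_403("robots.txt"))  # False -- still served, so crawlers can read it
print(returns_403("index.html"))  # True  -- everything else gets 403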






1. Nginx: to block spam spiders, put the following into your nginx configuration (these if blocks belong inside a server block):
# Block fetches from scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

# Block the listed UAs, as well as requests with an empty UA
if ($http_user_agent ~ "opensiteexplorer|MauiBot|FeedDemon|SemrushBot|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|semrushbot|alphaseobot|semrush|Feedly|UniversalFeedParser|webmeup-crawler|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 403;
}

# Block request methods other than GET/HEAD/POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}



2. For IIS 7/IIS 8/IIS 10 and later, create a web.config file in the site's root directory and put the following into it:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <system.webServer>
    <rewrite>
      <rules>
        <rule name="Block spider">
          <match url="(^robots\.txt$)" ignoreCase="false" negate="true" />
          <conditions>
            <add input="{HTTP_USER_AGENT}" pattern="MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" ignoreCase="true" />
          </conditions>
          <action type="AbortRequest" />
        </rule>
      </rules>
    </rewrite>
  </system.webServer>
</configuration>



3. For Apache, add the following rules to the .htaccess file (mod_rewrite must be enabled):

<IfModule mod_rewrite.c>
RewriteEngine On
#Block spider
RewriteCond %{HTTP_USER_AGENT} "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" [NC]
RewriteRule !(^robots\.txt$) - [F]
</IfModule>
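
Whichever of the three configurations you use, verify after deploying that blocked UAs really receive a 403 while a normal browser UA still gets through. A quick test sketch; the URL is a placeholder, so point it at your own server:

import urllib.error
import urllib.request

URL = "http://127.0.0.1/"  # placeholder -- replace with your own site's address

for ua in ("Mozilla/5.0 (Windows NT 10.0)", "AhrefsBot", "Scrapy/2.6", ""):
    req = urllib.request.Request(URL, headers={"User-Agent": ua})
    try:
        code = urllib.request.urlopen(req, timeout=5).status
    except urllib.error.HTTPError as e:
        code = e.code
    print(repr(ua), "->", code)  # the two spiders and the empty UA should print 403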



Note: these rules block a default set of obscure spiders; to block additional spiders, add their UAs to the pattern in the same way.

For reference, the UA names of the major spiders:

Google spider: googlebot
Baidu spider: baiduspider
Baidu mobile spider: baiduboxapp
Yahoo spider: slurp
Alexa spider: ia_archiver
MSN spider: msnbot
Bing spider: bingbot
AltaVista spider: scooter
Lycos spider: lycos_spider_(t-rex)
AllTheWeb spider: fast-webcrawler
Inktomi spider: slurp
Youdao spiders: YodaoBot and OutfoxBot
热土 spider: Adminrtspider
Sogou spider: sogou spider
SOSO spider: sosospider
360 Search spider: 360spider
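
Before deploying a blocklist, it is worth checking that the pattern does not accidentally catch one of the legitimate spiders above. A minimal sketch using a shortened excerpt of the pattern from the configs above:

import re

# Shortened excerpt of the blocklist pattern used in the configs above
BLOCK = re.compile(r"AhrefsBot|MJ12bot|SemrushBot|Bytespider|Scrapy|^$", re.IGNORECASE)

LEGIT = ["googlebot", "baiduspider", "bingbot", "sogou spider", "360spider"]
for ua in LEGIT:
    assert not BLOCK.search(ua), f"pattern would block legitimate spider: {ua}"
print("no legitimate spider is caught by the blocklist")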




Common spam UAs seen on the web

Content scraping:

FeedDemon
Java
Jullo
Feedly
UniversalFeedParser

SQL injection:

BOT/0.1 (BOT for JCE)
CrawlDaddy

Useless crawlers:

EasouSpider
Swiftbot
YandexBot
AhrefsBot
jikeSpider
MJ12bot
YYSpider
oBot

CC attack tools:

ApacheBench
WinHttp

TCP attack tools:

HttpClient

Scanners:

Microsoft URL Control
ZmEu (phpMyAdmin scanner)
jaunty



