Recently the site became very slow: CPU usage and overall server load were extremely high. The access logs showed a swarm of unknown spiders constantly crawling the site, which experience said was the cause. I wrote blocking rules for my setup, and after applying them the load dropped back down. Below is a summary of how to block unknown spider UAs under IIS, nginx, and Apache.
Note: adjust the UA list for your own situation by deleting or adding entries. The rules I provide include many rarely-seen spider UAs that you will almost never need; if your site is special and needs certain spiders to crawl it, review the rules carefully and remove the corresponding UA entries.
Tested OK on IIS 7.5.
Block access by UA pattern, returning a 403:
<rule name="NoUserAgent" stopProcessing="true">
<match url=".*" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="pattern1|pattern2|pattern3" />
</conditions>
<action type="CustomResponse" statusCode="403" statusReason="Forbidden: Access is denied." statusDescription="You did not present a User-Agent header which is required for this site" />
</rule>
For example, to block an empty UA (matched by `^$`) together with other patterns — do not start the pattern with a stray `|`, since an empty alternative matches every UA:
<add input="{HTTP_USER_AGENT}" pattern="^$|pattern2|pattern3" />
For example, to block specific UAs plus an empty UA:
<add input="{HTTP_USER_AGENT}" pattern="^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot" />
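These `pattern` attributes are ordinary regular expressions, so a stray leading `|` adds an empty alternative that matches every UA and would lock out all visitors. A minimal Python sketch (sample UA strings are illustrative) of how the alternation above classifies requests:

```python
import re

# Same alternation as the IIS rule above; ^$ catches an empty UA.
# IIS rewrite conditions are case-insensitive by default, hence re.IGNORECASE.
pattern = re.compile(
    r"^$|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|"
    r"EC2LinkFinder|jikespider|Purebot|MJ12bot",
    re.IGNORECASE,
)

def blocked(user_agent: str) -> bool:
    """True if this UA would receive the 403 response."""
    return pattern.search(user_agent) is not None

print(blocked(""))                                          # True: empty UA
print(blocked("Mozilla/5.0 (compatible; MJ12bot/v1.4.8)"))  # True
print(blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/120"))  # False
```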
Block specific spiders:
<rewrite>
<rules>
<rule name="Block Some Ip Adresses OR Bots" stopProcessing="true">
<match url="(.*)" />
<conditions logicalGrouping="MatchAny">
<add input="{HTTP_USER_AGENT}" pattern="SpiderName" ignoreCase="true" /> <!-- block a specific spider -->
<add input="{HTTP_USER_AGENT}" pattern="^$" /> <!-- block empty-UA requests -->
<add input="{REMOTE_ADDR}" pattern="a single IP, or an IP range as a regex" />
</conditions>
<!-- You can also use <action type="AbortRequest" /> in place of the line below -->
<action type="CustomResponse" statusCode="403" statusReason="Access is forbidden." statusDescription="Access is forbidden." />
</rule>
</rules>
</rewrite>
Block everything except one file (here robots.txt — `negate="true"` inverts the match, so the 403 fires for every other URL):
<rule name="Block spider">
<match url="(^robots.txt$)" ignoreCase="false" negate="true" /> <!-- 403 for everything except robots.txt -->
<action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden" />
</rule>
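Because of `negate="true"`, this rule is effectively an allowlist: the match succeeds for every URL that is not robots.txt, and those are the requests that receive the 403. A sketch of the inverted match (paths and status codes are illustrative):

```python
import re

allowed = re.compile(r"^robots\.txt$")

def status(path: str) -> int:
    # negate="true": the 403 action fires for every path that does NOT match.
    return 200 if allowed.match(path) else 403

print(status("robots.txt"))  # 200: the one exempted file
print(status("index.html"))  # 403: everything else is refused
```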
1. nginx: to block junk spiders, add the following to your nginx configuration file.
#Block scraping tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
return 403;
}
#Block the listed UAs and empty-UA requests
if ($http_user_agent ~ "opensiteexplorer|MauiBot|FeedDemon|SemrushBot|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|semrushbot|alphaseobot|semrush|Feedly|UniversalFeedParser|webmeup-crawler|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" ) {
return 403;
}
#Block request methods other than GET/HEAD/POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
return 403;
}
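Note that nginx `~` is a case-sensitive regex match while `~*` is case-insensitive, which is why the list above spells out both `SemrushBot` and `semrushbot`. A rough Python equivalent of the three checks, with an abridged, illustrative UA list:

```python
import re

UA_BLOCK = re.compile(r"MauiBot|SemrushBot|semrushbot|AhrefsBot|MJ12bot|^$")  # '~' : case-sensitive
SCRAPER  = re.compile(r"Scrapy|Curl|HttpClient", re.IGNORECASE)               # '~*': case-insensitive
METHODS  = re.compile(r"^(GET|HEAD|POST)$")

def decide(method: str, user_agent: str) -> int:
    """Return the status code the three nginx 'if' blocks would produce."""
    if SCRAPER.search(user_agent):
        return 403
    if UA_BLOCK.search(user_agent):
        return 403
    if not METHODS.match(method):       # '!~' inverts the match
        return 403
    return 200

print(decide("GET", "Mozilla/5.0 Chrome/120"))  # 200
print(decide("GET", "scrapy/2.11"))             # 403: '~*' ignores case
print(decide("GET", "mauibot/1.0"))             # 200: '~' is case-sensitive, lowercase slips through
print(decide("DELETE", "Mozilla/5.0"))          # 403: method not whitelisted
```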
2. IIS 7/8/10 and later: create a web.config file in the site root and add the following:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<system.webServer>
<rewrite>
<rules>
<rule name="Block spider">
<match url="(^robots.txt$)" ignoreCase="false" negate="true" />
<conditions>
<add input="{HTTP_USER_AGENT}" pattern="MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$"
ignoreCase="true" />
</conditions>
<action type="AbortRequest" />
</rule>
</rules>
</rewrite>
</system.webServer>
</configuration>
3. Apache: add the following rules to your .htaccess file:
<IfModule mod_rewrite.c>
RewriteEngine On
#Block spider
RewriteCond %{HTTP_USER_AGENT} "MegaIndex|MegaIndex.ru|BLEXBot|Qwantify|qwantify|semrush|Semrush|serpstatbot|hubspot|python|Bytespider|Go-http-client|Java|PhantomJS|SemrushBot|Scrapy|Webdup|AcoonBot|AhrefsBot|Ezooms|EdisterBot|EC2LinkFinder|jikespider|Purebot|MJ12bot|WangIDSpider|WBSearchBot|Wotbox|xbfMozilla|Yottaa|YandexBot|Jorgee|SWEBot|spbot|TurnitinBot-Agent|mail.RU|perl|Python|Wget|Xenu|ZmEu|^$" [NC]
RewriteRule !(^robots\.txt$) - [F]
</IfModule>
Note: the rules above block a default set of unknown spiders; to block others, add them following the same pattern.
Appendix: names of the major spiders:
Google spider: googlebot
Baidu spider: baiduspider
Baidu mobile spider: baiduboxapp
Yahoo spider: slurp
Alexa spider: ia_archiver
MSN spider: msnbot
Bing spider: bingbot
AltaVista spider: scooter
Lycos spider: lycos_spider_(t-rex)
AllTheWeb spider: fast-webcrawler
Inktomi spider: slurp
Youdao spider: YodaoBot and OutfoxBot
热土 spider: Adminrtspider
Sogou spider: sogou spider
SOSO spider: sosospider
360 Search spider: 360spider
Common junk UAs seen around the web:
Content scrapers:
FeedDemon
Java
Jullo
Feedly
UniversalFeedParser
SQL injection:
BOT/0.1 (BOT for JCE)
CrawlDaddy
Useless crawlers:
EasouSpider
Swiftbot
YandexBot
AhrefsBot
jikeSpider
MJ12bot
YYSpider
oBot
CC (HTTP flood) attack tools:
ApacheBench
WinHttp
TCP attacks:
HttpClient
Scanners:
Microsoft URL Control
ZmEu (phpMyAdmin scanner)
jaunty