On the larbin user agent and robots.txt settings

Latest recommended article published 2020-12-11 08:15:47

coder_WeiSong

Category: web crawlers. Tags: larbin, http

Article link: https://blog.csdn.net/coder_weisong/article/details/11675133

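This article describes how to change the user agent of the larbin crawler and how to set up robots.txt to stop crawlers from fetching a site's content, protecting the site from unwanted access.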

Changing larbin's user agent

larbin obeys robots.txt by default, so it cannot download Baidu Baike as configured out of the box. Baidu Baike's robots.txt reads as follows:
User-agent: Baiduspider
Allow: /
Disallow: /w?

User-agent: Googlebot
Allow: /
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: MSNBot
Allow: /

User-Agent: YoudaoBot
Allow: /

User-agent: *
Disallow: /

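Every crawler other than the few named in this robots.txt is refused. There are two workarounds:
1. Simply change userAgent in larbin.conf to the name of one of those crawlers.
2. Leave userAgent unchanged and modify the code so that obeying robots.txt becomes a configurable option.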
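
As a rough illustration of why the default user agent is refused, and of what the two workarounds amount to, the sketch below feeds the rules quoted above to Python's standard `urllib.robotparser`. This is not larbin's code: the `may_fetch` helper, its `obey_robots` flag, and the sample URL are hypothetical names introduced only for this example. Note also that `urllib.robotparser` applies the first matching rule within a record, which may differ from how larbin or a particular search engine evaluates the same file.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the Baidu Baike robots.txt quoted above.
BAIKE_ROBOTS = """\
User-agent: Baiduspider
Allow: /
Disallow: /w?

User-agent: Googlebot
Allow: /
Disallow: /update
Disallow: /history
Disallow: /usercard
Disallow: /usercenter

User-agent: MSNBot
Allow: /

User-Agent: YoudaoBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BAIKE_ROBOTS.splitlines())


def may_fetch(url, user_agent, obey_robots=True):
    """Hypothetical helper: return True if the crawler should fetch url.

    obey_robots=False mimics workaround 2 (compliance made configurable).
    """
    if not obey_robots:
        return True
    return parser.can_fetch(user_agent, url)


URL = "http://baike.baidu.com/view/1.htm"  # sample URL, assumed for illustration

print(may_fetch(URL, "larbin"))                     # falls through to "User-agent: *"
print(may_fetch(URL, "Baiduspider"))                # workaround 1: borrow a listed name
print(may_fetch(URL, "larbin", obey_robots=False))  # workaround 2: skip the check
```

With the default name the request falls under the catch-all `User-agent: *` record and is denied; the other two calls are permitted.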

Setting up robots.txt to keep crawlers from fetching pages

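How many search engine robots (crawling spiders) are at work on the Internet worldwide? That is hard to answer. Many people build their own robots to steal other people's information, and many others build robots for other kinds of gain. These are all junk robots: they eat up a site's bandwidth, and the site's user information may already have been stolen. Here, drawing on a few tips from colleagues abroad, I describe how to add a robots.txt to your own site and configure it, so you can say goodbye to junk search engine robots for good.
First, open Notepad and copy the code below. About the code: it lists 126 internationally recognized junk search engine robots, spiders, search agents, and the like, and uses Disallow: / to bar them from crawling the site entirely.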

User-agent: larbin
Disallow: /

User-agent: b2w/0.1
Disallow: /

User-agent: Copernic
Disallow: /

User-agent: psbot
Disallow: /

User-agent: Python-urllib
Disallow: /

User-agent: Googlebot-Image
Disallow: /

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /
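
To sanity-check a blocklist like the one above without typing each record by hand, one option is to generate the records from a list of names and run them through Python's standard `urllib.robotparser`. This is only a sketch: `build_blocklist` and the placeholder URL are hypothetical, and only a few of the listed names are used.

```python
from urllib.robotparser import RobotFileParser

# A few of the user agents listed above.
BLOCKED_AGENTS = ["larbin", "psbot", "Python-urllib", "EmailCollector", "EmailSiphon"]


def build_blocklist(agents):
    """Hypothetical helper: emit one 'Disallow: /' record per user agent."""
    return "\n".join(f"User-agent: {name}\nDisallow: /\n" for name in agents)


robots_txt = build_blocklist(BLOCKED_AGENTS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

URL = "http://example.com/index.html"  # placeholder URL
for agent in ["larbin", "EmailSiphon", "SomeOtherBot"]:
    # Listed agents are denied; an unlisted agent is allowed because
    # the file has no "User-agent: *" record.
    print(agent, parser.can_fetch(agent, URL))
```

The file only affects crawlers that choose to honor it; as the first section shows, a crawler can change its user agent or skip the check altogether.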