Python 3 Web Crawler Development in Practice

lxcl96

Tags: python, web crawler

Article link: https://blog.csdn.net/qq_38946031/article/details/116231449

Analyzing the Robots Protocol

The book uses Jianshu as its example for analyzing a robots.txt file.

robots.txt

Jianshu's robots.txt file reads as follows:

# See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
#
# To ban all spiders from the entire site uncomment the next two lines:
User-agent: *
Disallow: /search
Disallow: /convos/
Disallow: /notes/
Disallow: /admin/
Disallow: /adm/
Disallow: /p/0826cf4692f9
Disallow: /p/d8b31d20a867
Disallow: /collections/*/recommended_authors
Disallow: /trial/*
Disallow: /keyword_notes
Disallow: /stats-2017/*

User-agent: trendkite-akashic-crawler
Request-rate: 1/2 # load 1 page per 2 seconds
Crawl-delay: 60

User-agent: YisouSpider
Request-rate: 1/10 # load 1 page per 10 seconds
Crawl-delay: 60

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 2/1 # load 2 page per 1 seconds
Crawl-delay: 10
Allow: /

User-agent: Mediapartners-Google
Allow: /
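
To see what these rules should yield, here is a minimal offline sketch that feeds an abridged copy of the listing straight to RobotFileParser.parse(). Nothing is fetched from the network. The ROBOTS_TXT constant and the variable names are my own, chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Abridged copy of the rules listed above (assumption: trimmed for brevity,
# not downloaded from the site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /notes/
Disallow: /admin/

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 2/1
Crawl-delay: 10
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

article = 'http://www.jianshu.com/p/b67554025d7d'
search = 'http://www.jianshu.com/search?q=python&page=l&type=collections'

print(rp.can_fetch('*', article))         # True: no Disallow rule matches this path
print(rp.can_fetch('*', search))          # False: blocked by "Disallow: /search"
print(rp.can_fetch('Cliqzbot', article))  # False: blocked by "Disallow: /"
print(rp.crawl_delay('Googlebot'))        # 10
```

The first two lines match the True, False that the book reports for the same two URLs.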

Crawler code:

The book reports True, False for this code, but when I actually ran it I always got False, False:

from urllib.robotparser import RobotFileParser
import urllib.request

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')

# This step is essential, even though it returns nothing
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=l&type=collections"))
print(rp.mtime())

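So I started looking into the read() method of RobotFileParser and found that it calls urlopen. I suspected that urlopen sends the request without browser-like headers, so the server identifies it as a crawler and refuses access.

# Try opening this link with urlopen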
f = urllib.request.urlopen('http://www.jianshu.com/robots.txt')
print(f)
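
When urlopen is refused it raises urllib.error.HTTPError instead of returning a response, so print(f) never runs in that case. A small probe that returns the status code either way makes the comparison easier. This is only a sketch: the function name probe and its opener parameter are hypothetical helpers of mine, added so that the function can be exercised without network access.

```python
import urllib.error
import urllib.request

BROWSER_UA = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'


def probe(url, headers=None, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, whether or not the request succeeds."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with opener(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is on the exception
        return err.code


# Usage (needs network access, so it is left commented out here):
# print(probe('http://www.jianshu.com/robots.txt'))
# print(probe('http://www.jianshu.com/robots.txt', {'User-Agent': BROWSER_UA}))
```

If the two calls return different codes, the missing User-Agent header is the likely cause.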

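Sure enough, that is what happened.

There are two ways to work around this:

①: Modify the read() method so that it passes urlopen a Request object carrying headers instead of the bare URL:

The original source of read() is: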

    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        try:
            f = urllib.request.urlopen(self.url)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())
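
The source above also suggests why both checks came back False: on a 401 or 403, read() sets disallow_all, and can_fetch() then returns False for every URL. The sketch below simulates that branch by setting the flag by hand. It assumes the server answered with one of those two codes, which this post does not state.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')

# Simulate the 401/403 branch of read() without touching the network.
rp.disallow_all = True

print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))   # False
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python'))  # False
print(rp.mtime())  # 0, because parse() was never called
```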

The modified read() method:

    def read(self):
        """Reads the robots.txt URL and feeds it to the parser."""
        # Build a Request carrying a browser-like User-Agent header
        req = urllib.request.Request(self.url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'})
        try:
            # urlopen is the call that raises HTTPError, so it belongs inside the try
            f = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code in (401, 403):
                self.disallow_all = True
            elif err.code >= 400 and err.code < 500:
                self.allow_all = True
        else:
            raw = f.read()
            self.parse(raw.decode("utf-8").splitlines())
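
Editing the standard library file in place affects every program on the machine and is lost when Python is upgraded. One alternative, sketched here and not taken from the book, is to subclass RobotFileParser and override read(). The class name HeaderRobotFileParser and its headers parameter are hypothetical.

```python
import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser


class HeaderRobotFileParser(RobotFileParser):
    """RobotFileParser whose read() sends caller-supplied headers (hypothetical helper)."""

    def __init__(self, url='', headers=None):
        super().__init__(url)
        self.headers = dict(headers or {})

    def read(self):
        req = urllib.request.Request(self.url, headers=self.headers)
        try:
            f = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            # Same status handling as the standard read()
            if err.code in (401, 403):
                self.disallow_all = True
            elif 400 <= err.code < 500:
                self.allow_all = True
        else:
            with f:
                raw = f.read()
            self.parse(raw.decode('utf-8').splitlines())


# Usage (needs network access, so it is left commented out here):
# rp = HeaderRobotFileParser('http://www.jianshu.com/robots.txt',
#                            {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'})
# rp.read()
# print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
```

The calling code stays the same as with the stock parser; only the constructor takes the extra headers argument.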

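Running it again: success!

②: Skip read() and use the parse() method instead, that is:

Download robots.txt yourself with the headers attached, then hand its lines to parse() for analysis.

The code is as follows: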

from urllib.robotparser import RobotFileParser
import urllib.request

rps = RobotFileParser()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
# Build the request with a browser-like User-Agent, then download robots.txt
request = urllib.request.Request('http://www.jianshu.com/robots.txt', headers=headers)
lines = urllib.request.urlopen(request).read().decode('utf-8').splitlines()

# Parse the robots.txt content
rps.parse(lines)
# With the headers sent through Request, the first check now returns True
print(rps.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rps.can_fetch('*', "http://www.jianshu.com/search?q=python&page=l&type=collections"))
print(rps.mtime())

The result: success.
