Python Web Crawler Reading Notes (3)

FSexperience

Category: Web crawlers

Article link: https://blog.csdn.net/FSexperience/article/details/83896641


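The previous note covered the link crawler. The book goes on to describe several additional features that make the crawler more useful when crawling other websites.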

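1. Parsing robots.txt

We need to parse the robots.txt file to avoid downloading URLs that are disallowed for crawling. Python's built-in robotparser module (urllib.robotparser in Python 3) makes this easy.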

>>> from urllib import robotparser  # Python 3; in Python 2 use "import robotparser"

>>> rp = robotparser.RobotFileParser()

>>> rp.set_url('http://example.webscraping.com/robots.txt')

>>> rp.read()

>>> url = 'http://example.webscraping.com'

>>> user_agent = 'BadCrawler'

>>> rp.can_fetch(user_agent, url)

False

>>> user_agent = 'GoodCrawler'

>>> rp.can_fetch(user_agent, url)

True

The robotparser module first loads the robots.txt file, then uses the can_fetch() function to determine whether the given user agent is allowed to access a page.
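The same check can be tried without network access by feeding rules straight to the parser. The sketch below assumes Python 3 and uses a made-up robots.txt (the `ROBOTS_TXT` string and the `/trap` and `/index` paths are illustrative, not the real contents of the example site).

```python
from urllib import robotparser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
# parse() accepts the file's lines directly, instead of set_url() + read().
rp.parse(ROBOTS_TXT.splitlines())

url = 'http://example.webscraping.com/index'
bad_allowed = rp.can_fetch('BadCrawler', url)    # matched by the BadCrawler entry
good_allowed = rp.can_fetch('GoodCrawler', url)  # falls back to the "*" entry
trap_allowed = rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/trap')
print(bad_allowed, good_allowed, trap_allowed)   # False True False
```

Parsing from a string like this is also a convenient way to unit-test crawler logic against different robots.txt rules.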

To integrate this into the crawler, we add the check inside the crawl loop.

while crawl_queue:
    url = crawl_queue.pop()
    # check whether robots.txt allows this URL to be crawled
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print('Blocked by robots.txt:', url)

2. Supporting proxies

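# Python 3 module names; the Python 2 equivalents are urllib2 and urlparse
import urllib.error
import urllib.parse
import urllib.request

def download(url, user_agent='wswp', proxy=None, num_retries=2):
    print('Downloading:', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)

    opener = urllib.request.build_opener()
    if proxy:
        # send requests for this URL's scheme through the proxy
        proxy_params = {urllib.parse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib.request.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry only on 5xx server errors
                html = download(url, user_agent, proxy, num_retries - 1)
    return html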
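The proxy setup can be checked on its own, without downloading anything. The sketch below is a variation, not the book's code: it assumes Python 3, uses a hypothetical helper named `build_proxy_opener`, and a placeholder proxy address. It passes the handler to `build_opener()` rather than adding it afterwards.

```python
import urllib.parse
import urllib.request


def build_proxy_opener(url, proxy=None):
    """Return an opener that sends requests for url's scheme through proxy."""
    if not proxy:
        return urllib.request.build_opener()
    proxy_params = {urllib.parse.urlparse(url).scheme: proxy}
    # A ProxyHandler passed to build_opener() is used in place of the default one.
    return urllib.request.build_opener(urllib.request.ProxyHandler(proxy_params))


# Placeholder proxy address; nothing is contacted until opener.open() is called.
opener = build_proxy_opener('http://example.webscraping.com',
                            proxy='http://127.0.0.1:8080')
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

Keying the mapping by the URL's scheme means an `http` proxy is only applied to `http` URLs; an `https` URL would need its own entry.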

3. Throttling downloads

We throttle the crawler by adding a delay between two consecutive downloads.

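import datetime
import time
import urllib.parse  # Python 3; the Python 2 equivalent is urlparse

class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # amount of delay (in seconds) between downloads to each domain
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urllib.parse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)

        if self.delay > 0 and last_accessed is not None:
            # total_seconds() keeps fractions and whole days, unlike .seconds
            elapsed = (datetime.datetime.now() - last_accessed).total_seconds()
            sleep_secs = self.delay - elapsed
            if sleep_secs > 0:
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()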

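The Throttle class records when each domain was last accessed. If the time elapsed since the last access is shorter than the specified delay, it sleeps for the remainder.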

throttle = Throttle(delay)

...

throttle.wait(url)

result = download(url, user_agent, proxy=proxy, num_retries=num_retries)
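The per-domain behaviour is easy to observe with a short delay. The sketch below is a variation rather than the book's class: it assumes Python 3, uses `time.monotonic()` instead of `datetime` so that system clock changes do not affect the measurement, and returns the time slept purely for inspection. The name `MonotonicThrottle` and the second domain are made up for the example.

```python
import time
from urllib.parse import urlparse


class MonotonicThrottle:
    """Variation on Throttle that measures elapsed time with time.monotonic()."""

    def __init__(self, delay):
        self.delay = delay   # minimum seconds between requests to one domain
        self.domains = {}    # domain -> monotonic timestamp of last access

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        slept = 0.0
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (time.monotonic() - last_accessed)
            if sleep_secs > 0:
                time.sleep(sleep_secs)
                slept = sleep_secs
        self.domains[domain] = time.monotonic()
        return slept  # returned only so the behaviour is easy to inspect


throttle = MonotonicThrottle(delay=0.2)
first = throttle.wait('http://example.webscraping.com/a')   # first visit: no sleep
second = throttle.wait('http://example.webscraping.com/b')  # same domain: sleeps
other = throttle.wait('http://other.example.com/')          # new domain: no sleep
print(first, round(second, 1), other)
```

Because timestamps are stored per domain, a crawler that alternates between several sites is only slowed down when it returns to one it visited too recently.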

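4. Avoiding spider traps

So far, our crawler follows every link it has not visited before. However, some websites generate page content dynamically, which can produce an unlimited number of pages. For example, if a site has an online calendar with links to the next month and the next year, the next month's page will in turn link to the month after it, and the chain of pages never ends. This situation is known as a spider trap.

A simple way to avoid getting stuck in a spider trap is to track how many links were followed to reach the current page, which we call its depth. Once the maximum depth is reached, the crawler stops adding links from that page to the queue. To implement this, we change the seen variable: it used to record only the links that had been visited, and now becomes a dictionary that also records the depth at which each page was found.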

def link_crawler(..., max_depth=2):
    # seen maps each URL to the depth at which it was found
    seen = {}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)

To disable this feature, set max_depth to a negative number; the current depth will then never be equal to it.
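The effect of the depth limit can be seen on a tiny simulated trap. In the sketch below, `crawl_with_depth_limit` and `calendar_links` are hypothetical names: the first is a stripped-down crawl loop with no downloading, and the second stands in for a calendar site where every month page links to the next one.

```python
def crawl_with_depth_limit(seed_url, get_links, max_depth=2):
    """Return {url: depth} for every page reached within max_depth links."""
    crawl_queue = [seed_url]
    seen = {seed_url: 0}  # the seed page is at depth 0
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        if depth != max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
    return seen


def calendar_links(url):
    """Hypothetical spider trap: every month page links to the next month."""
    month = int(url.rsplit('/', 1)[1])
    return ['/calendar/%d' % (month + 1)]


seen = crawl_with_depth_limit('/calendar/0', calendar_links, max_depth=2)
print(sorted(seen.items()))
# [('/calendar/0', 0), ('/calendar/1', 1), ('/calendar/2', 2)]
```

Without the depth check this loop would never terminate on such a site, since each page always yields one unseen link.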
