Yugong Moves Mountains Diary · 8

Latest recommended article published on 2020-05-18 20:53:34

Python_G－Dragon

Category: Diary · Tags: python

Article link: https://blog.csdn.net/python_g_dragon/article/details/105149770

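Back again o(^▽)o
Today I studied from the web-scraping book I bought. The book isn't really written for complete beginners, but scraping has been my main focus for the past month, so the first chapter didn't give me much trouble. There were a few tricky spots, but nothing a quick Baidu search couldn't solve.
First I'll go over the new things I ran into today, then compare the scraping approach I learned before with the one from today.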

New concepts

hasattr syntax

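This built-in function checks whether an object has a given attribute and returns True or False.
hasattr(object, name)
object is the object to inspect
name is the attribute name, given as a string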
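To make this concrete, here is a minimal sketch. The Page class and its url attribute are made up for illustration and do not come from the book. It also shows getattr with a default, which is a common companion when an attribute might be missing.

```python
class Page:
    """Hypothetical example class, used only to demonstrate hasattr."""

    def __init__(self, url):
        self.url = url


page = Page("http://example.com")

has_url = hasattr(page, "url")      # 'url' was set in __init__
has_title = hasattr(page, "title")  # 'title' was never defined

# getattr with a third argument returns a fallback instead of raising
title = getattr(page, "title", "untitled")

print(has_url, has_title, title)
```

Note that the name has to be passed as a string, so hasattr(page, "url") works while hasattr(page, url) would look up a variable called url first.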

The count() function in the itertools module

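It returns an iterator that yields start, start + step, start + 2 * step, and so on without ever stopping.
itertools.count(start, step)
I can't explain it very well myself, so here's an example.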

import itertools

i = 0
for item in itertools.count(3, 2):  # start at 3, step by 2
    i += 1
    if i > 10:
        break  # count() never ends on its own
    print(item)

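Output: 3, 5, 7, ..., 21, ten numbers in total. (Don't forget the break, otherwise the loop runs forever. And if you just call print(itertools.count(3, 2)), it unfortunately only prints 'count(3, 2)'.)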
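As a possible alternative to counting by hand and breaking, itertools.islice can cut a fixed number of items off an endless iterator. This is just a sketch of the same ten numbers; the variable name first_ten is my own.

```python
import itertools

# Take only the first 10 values from the infinite counter 3, 5, 7, ...
first_ten = list(itertools.islice(itertools.count(3, 2), 10))

print(first_ten)
```

This keeps the stopping condition in one place, so there is no loop counter to forget.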

Comparing the two approaches

The way I fetched pages before:

import requests
from fake_useragent import UserAgent


def get_html(url):
    """Fetch url with a random User-Agent; give up after 3 failed attempts."""
    count = 0
    while True:
        # Pick a fresh random User-Agent for every attempt
        headers = {
            'User-Agent': UserAgent().random
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            return response
        count += 1
        if count == 3:
            return None

The way I do it now:

import urllib.request
from urllib.error import URLError, HTTPError, ContentTooShortError


def download(url, user_agent='wswp', num_retries=2, charset='utf-8'):
    print('Downloading:', url)
    request = urllib.request.Request(url)
    request.add_header('User-agent', user_agent)
    try:
        resp = urllib.request.urlopen(request)
        # Prefer the charset declared in the response headers
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None
        if num_retries > 0:
            # Retry only on 5xx server errors; a plain URLError has no 'code'
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass every argument along so num_retries lands in the right slot
                return download(url, user_agent, num_retries - 1, charset)
    return html
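The hasattr check in the urllib version is there because only some of the caught exceptions carry an HTTP status code. Here is a small offline sketch of that retry decision. The helper name should_retry and the example URLs and messages are my own assumptions, and the errors are built by hand rather than coming from a real request.

```python
from urllib.error import HTTPError, URLError


def should_retry(error, num_retries):
    """Hypothetical helper: retry only on 5xx errors while retries remain."""
    return (
        num_retries > 0
        and hasattr(error, "code")  # URLError has no 'code', HTTPError does
        and 500 <= error.code < 600
    )


# Hand-built errors standing in for what urlopen might raise
server_error = HTTPError("http://example.com", 503, "Service Unavailable", None, None)
not_found = HTTPError("http://example.com", 404, "Not Found", None, None)
no_network = URLError("connection refused")

print(should_retry(server_error, 2))
print(should_retry(not_found, 2))
print(should_retry(no_network, 2))
```

Without the hasattr guard, reading error.code on a plain URLError would raise AttributeError inside the except block.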

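Of course the two aren't really alike; they differ in both functionality and robustness. My point is that one uses urllib.request, which ships with Python, while the other uses the requests library, which has to be installed separately. Judging by how concise the code is, I lean towards requests.
The first snippet pulls in the fake_useragent library to handle the User-Agent header, while the second simply defines its own user agent string.
That's all for today's post.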