Web Scraping with urllib

dianqian4038

Tags: web scraping, python

Original link: http://www.cnblogs.com/jyh-py-blog/p/9977221.html

I. The request module

1.urlopen() -- returns an HTTPResponse object

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None)

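Parameters: url is required.

① data: when supplied, the request becomes a POST; the value must be converted to bytes first.

② timeout: the timeout in seconds. If no response arrives in time, an exception is raised. The underlying exception type is socket.timeout; when the timeout happens while connecting, it arrives wrapped in a URLError whose reason is socket.timeout. Defaults to the global default timeout.

③ Others: context sets the SSL configuration (an ssl.SSLContext); cafile and capath point to CA certificates (useful for HTTPS); cadefault is deprecated.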
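As a minimal sketch of the data and timeout parameters: the helper names build_post_data and fetch are made up for illustration, and decoding the body as UTF-8 is an assumption about the server, not something urlopen() guarantees.

```python
import socket
import urllib.error
import urllib.parse
import urllib.request


def build_post_data(fields):
    """Serialize a dict into the bytes payload that urlopen() expects for data."""
    return urllib.parse.urlencode(fields).encode("utf-8")


def fetch(url, fields=None, timeout=5.0):
    """Return the response body as text, or None if the request timed out.

    Hypothetical helper: passing fields turns the request into a POST.
    """
    data = build_post_data(fields) if fields is not None else None
    try:
        with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
            # Assumption: the server answers with UTF-8 text.
            return resp.read().decode("utf-8")
    except socket.timeout:
        # Timeout raised directly, e.g. while reading the response body.
        return None
    except urllib.error.URLError as exc:
        # Timeout during connection setup arrives wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return None
        raise


print(build_post_data({"name": "jyh", "age": 18}))
```

Nothing is sent over the network until fetch() is called; the print only shows what the POST body would look like.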

2.The Request class -- adds header information so the request can pass as a browser

class Request:
    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):

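Parameters: url is required.

① data: must be bytes.

② headers: a dict of request headers; changing User-Agent lets the request pass as a browser. Headers can also be added afterwards with the instance method add_header(k,v).

③ origin_req_host: the host name or IP address of the requester.

④ unverifiable: True when the request is unverifiable, meaning the user had no chance to approve it (for example, an image fetched automatically). Defaults to False.

⑤ method: the HTTP request method, such as GET, POST or PUT.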
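The parameters above can be sketched as follows. The User-Agent string and the form field are placeholder values chosen for illustration; no request is actually sent here.

```python
import urllib.parse
import urllib.request

# Placeholder form data and browser-like User-Agent (illustrative values).
data = urllib.parse.urlencode({"name": "jyh"}).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}

req = urllib.request.Request(
    "http://www.baidu.com/",
    data=data,
    headers=headers,
    method="POST",
)
# Same effect as putting the header in the dict above.
req.add_header("Accept-Language", "en-US")

print(req.get_method())
# Request stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen(req) would then send the request with these headers.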

3.Advanced usage

Use Handlers to build an Opener: authentication (HTTPBasicAuthHandler), proxies (ProxyHandler), and cookies (HTTPCookieProcessor).
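A rough sketch of combining the three handlers into one Opener. The local URL, the credentials and the proxy address are all hypothetical placeholders; the opener is only constructed, not used.

```python
import http.cookiejar
import urllib.request

# Authentication: credentials for a hypothetical protected site.
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://localhost:5000/", "username", "password")
auth_handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# Proxy: a hypothetical local HTTP proxy.
proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:9743"})

# Cookies: the jar fills up as responses set cookies.
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

opener = urllib.request.build_opener(auth_handler, proxy_handler, cookie_handler)
print(type(opener).__name__)
```

Calling opener.open(url) instead of urlopen(url) would route the request through all three handlers.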

II. The error module

The error module defines the exceptions raised by the request module.

Base class: URLError; subclass: HTTPError
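Because HTTPError is a subclass of URLError, it is usually caught first. A minimal sketch, where fetch_status is a made-up helper name; the last line uses an unsupported scheme so that a URLError can be shown without touching the network.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return a short description of what happened when opening url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"OK {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return f"HTTPError {exc.code}: {exc.reason}"
    except urllib.error.URLError as exc:
        # No usable response: unknown scheme, DNS failure, refused connection...
        return f"URLError: {exc.reason}"


# "foo" is not a scheme urllib can open, so this fails locally with URLError.
print(fetch_status("foo://example.invalid/"))
```

If the two except clauses were swapped, the URLError branch would also swallow every HTTPError and the status code would never be reported.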

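III. The parse module

1.urlparse() -- returns a ParseResult object (scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The result above comes from parsing http://www.baidu.com/index.html;user?id=5#comment. When allow_fragments is False and the URL contains no params or query, the fragment is parsed as part of path.

ParseResult is actually a tuple, so its fields can be read by index or by attribute name, e.g. result[0], result.scheme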

def urlparse(url, scheme='', allow_fragments=True):

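Parameters: url is required.

① scheme: the default protocol (http, https); it only takes effect when the url itself contains no scheme.

② allow_fragments: whether to parse the fragment separately. Defaults to True; when False, the fragment is not split off and stays attached to the preceding component.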
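The behavior described above can be checked with a short sketch; the variable names are arbitrary.

```python
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
# Fields are reachable by attribute name or by index.
print(result.scheme, result[1], result.path)
print(result.params, result.query, result.fragment)

# The scheme argument is only a fallback for URLs that carry none.
no_scheme = urlparse("www.baidu.com/index.html", scheme="https")
has_scheme = urlparse("http://www.baidu.com", scheme="https")
print(no_scheme.scheme, has_scheme.scheme)

# With allow_fragments=False the fragment stays inside path here,
# because this URL has no params and no query.
kept = urlparse("http://www.baidu.com/index.html#comment", allow_fragments=False)
print(kept.path, repr(kept.fragment))
```

Note that without a leading scheme or "//", the host name in no_scheme ends up in path rather than netloc.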

2.urlunparse()

Builds a url from an iterable such as ['http','www.baidu.com','index.htm','user','a=6','comment']; its length must be exactly 6.
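A small sketch using the list from the text, plus a check of what happens when the length is wrong (the wrong_length_rejected flag is just an illustrative name).

```python
from urllib.parse import urlunparse

parts = ["http", "www.baidu.com", "index.htm", "user", "a=6", "comment"]
url = urlunparse(parts)
print(url)

# An iterable with fewer (or more) than 6 items cannot be unpacked.
try:
    urlunparse(["http", "www.baidu.com", "index.htm"])
    wrong_length_rejected = False
except ValueError:
    wrong_length_rejected = True
print(wrong_length_rejected)
```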

3.urlsplit()

Similar to urlparse(), except that params is not parsed separately; it stays merged into path.

-- returns a SplitResult object (scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
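To see the difference from urlparse(), the same URL can be run through both functions; a minimal sketch:

```python
from urllib.parse import urlparse, urlsplit

url = "http://www.baidu.com/index.html;user?id=5#comment"

split_result = urlsplit(url)
parse_result = urlparse(url)

# urlsplit leaves ";user" inside path and returns 5 fields.
print(split_result.path, len(split_result))
# urlparse separates it into params and returns 6 fields.
print(parse_result.path, parse_result.params, len(parse_result))
```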

4.urlunsplit()

Similar to urlunparse().

Builds a url from an iterable such as ['http','www.baidu.com','index.htm','a=6','comment']; its length must be exactly 5.
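A short sketch with the list from the text; the round trip through urlsplit() is added only to show that the two functions mirror each other.

```python
from urllib.parse import urlsplit, urlunsplit

parts = ["http", "www.baidu.com", "index.htm", "a=6", "comment"]
url = urlunsplit(parts)
print(url)

# Splitting the result and joining it again gives the same URL back.
round_trip = urlunsplit(urlsplit(url))
print(round_trip == url)
```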

5.urljoin()

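Builds a url: the method analyzes the scheme, netloc and path of base and uses them to fill in whatever the new link is missing. Parts that the new link already has are kept as they are.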

def urljoin(base, url, allow_fragments=True):
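A few illustrative cases of how the missing parts get filled in; the example.com link and the page names are placeholders.

```python
from urllib.parse import urljoin

base = "http://www.baidu.com/a/b.html"

# Relative file name: scheme, netloc and directory come from base.
relative = urljoin(base, "FAQ.html")
# Absolute path: only scheme and netloc come from base.
rooted = urljoin(base, "/c.html")
# Query only: the whole path of base is kept.
query_only = urljoin(base, "?page=2")
# Complete URL: nothing is taken from base.
absolute = urljoin(base, "https://example.com/FAQ.html")

for joined in (relative, rooted, query_only, absolute):
    print(joined)
```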

6.urlencode()

def urlencode(query, doseq=False, safe='', encoding=None, errors=None,
              quote_via=quote_plus):

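Very useful when building GET request parameters: it serializes a dict into a query string.

from urllib.parse import urlencode

# Query parameters to serialize
params = {'name': 'jyh', 'age': 18}
base = 'http://www.baidu.com?'
# urlencode turns the dict into "name=jyh&age=18"
url = base + urlencode(params)
print(url)

# ---> http://www.baidu.com?name=jyh&age=18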

7.parse_qs()

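Deserialization: converts GET request parameters back into a dict. Each value in the dict is a list, because a key may appear more than once.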
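For instance, a quick sketch with a plain query string and one with a repeated key (the tag key is an arbitrary example):

```python
from urllib.parse import parse_qs

query = "name=jyh&age=18"
parsed = parse_qs(query)
print(parsed)

# A repeated key collects all of its values in one list.
repeated = parse_qs("tag=a&tag=b")
print(repeated)
```

All values come back as strings, so age has to be converted with int() if a number is needed.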

8.parse_qsl()

Converts GET parameters into a list of (key, value) tuples.
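A minimal sketch; feeding the pairs back into urlencode() is an extra step shown only to illustrate that the list form keeps the original order.

```python
from urllib.parse import parse_qsl, urlencode

query = "name=jyh&age=18"
pairs = parse_qsl(query)
print(pairs)

# urlencode() also accepts a list of pairs, so the round trip is lossless here.
rebuilt = urlencode(pairs)
print(rebuilt)
```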

9.quote()

Converts content into URL-encoded (percent-encoded) form.

quote('壁纸') --->%E5%A3%81%E7%BA%B8

10.unquote()

URL decoding; the inverse of quote().
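quote() and unquote() can be sketched as a round trip. The search URL below is only an illustration of where an encoded keyword typically ends up; the wd parameter name is an assumption.

```python
from urllib.parse import quote, unquote

keyword = "壁纸"
encoded = quote(keyword)
print(encoded)

# Illustrative URL; the "wd" parameter name is an assumption.
url = "https://www.baidu.com/s?wd=" + encoded
print(url)

decoded = unquote(encoded)
print(decoded == keyword)
```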

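IV. The robotparser module -- the Robots protocol

The Robots protocol (robots.txt) specifies which pages may be crawled and which may not; the robotparser module parses such a file and answers whether a given URL may be fetched.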
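A small sketch of checking URLs against a set of rules. The rules and the example.com URLs are invented for illustration, and they are fed in through parse() so that nothing is downloaded; for a real site one would normally call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, supplied as lines instead of being downloaded.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
]

parser = RobotFileParser()
parser.parse(rules)

public_ok = parser.can_fetch("*", "http://example.com/public/page.html")
private_ok = parser.can_fetch("*", "http://example.com/private/data.html")
print(public_ok, private_ok)
```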

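References for this article: [1] Cui Qingcai. Python 3 Web Crawler Development in Practice (Python3网络爬虫开发实战) [M]. Beijing: Posts & Telecom Press, 2018: 102-122.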