Web Scraping (73): Basic Usage of urllib

weixin_39686230

Article link: https://blog.csdn.net/weixin_39686230/article/details/111621355

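Starting today, we will work through some basic, commonly used packages from scratch, as preparation for writing real crawlers. If you are interested, feel free to follow along.

My machine runs Windows 10 with an Anaconda 3.7 environment. The urllib package directory contains, among other things:

__pycache__: a folder that holds the .pyc files produced when the package's .py sources are compiled to bytecode.

__init__.py: the package's initialization file. It serves as the package's entry point and is executed the first time the package is imported.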
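
To see this layout on your own machine, the package directory can be listed directly. This is a small sketch that assumes a standard CPython installation in which urllib is an ordinary directory on disk; the exact file list may differ between Python versions.

```python
import os
import urllib

# urllib.__file__ points at the package's __init__.py
package_dir = os.path.dirname(urllib.__file__)

# Collect the entries of the package directory in a stable order
entries = sorted(os.listdir(package_dir))

print(package_dir)
for name in entries:
    print(name)
```

The __pycache__ folder only appears once at least one of the modules has been imported and compiled.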

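urllib is the HTTP request library bundled with Python 3. It needs no separate installation, which makes it a convenient starting point for writing crawlers.

urllib consists of four modules:

request: opens URLs and sends requests

parse: parses and builds URLs

error: defines the exceptions raised by request

robotparser: parses robots.txt files

The sections below cover request, parse, and error in turn; for reasons of space, only the most commonly used parts are discussed.

For full details, see the official documentation: https://docs.python.org/3.7/library/urllib.html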
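
robotparser gets no section of its own in this article, so here is a minimal sketch of how it is typically used. The robots.txt content and the crawler name "my-crawler" are made up for illustration, and the rules are fed in from a string so that the example needs no network access.

```python
import urllib.robotparser

# Hypothetical robots.txt content, inlined so the example runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse() takes an iterable of lines

# can_fetch(useragent, url) tells us whether the rules allow fetching the URL
allowed_public = parser.can_fetch("my-crawler", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("my-crawler", "http://www.example.com/private/data.html")

print(allowed_public, allowed_private)
```

Against a real site you would normally call set_url() with the address of its robots.txt and then read() instead of parse().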

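II. Using urllib

Before we begin, here is a website that is handy for testing:

http://www.httpbin.org/

It echoes back information about the request it receives, which makes it well suited for practice.

Now let's get started.

1. The request module

request is the most important module in urllib. It is used to send requests and receive responses.

(1) urlopen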

urllib.request.urlopen()

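urlopen is one of the most frequently used functions in the request module. Its common parameters are:

url: required; a string giving the target URL (a Request object is also accepted)

data: the request body, typically form data

This parameter defaults to None, in which case urllib sends a GET request

When it is set, urllib sends a POST request and carries the form data in this parameter, which must be of type bytes

timeout: optional; how long to wait, in seconds. If no response arrives within that time, an exception is raised

For HTTP and HTTPS URLs, urlopen returns an http.client.HTTPResponse object. Its commonly used methods are:

geturl(): returns the URL of the resource actually retrieved (after any redirects)

getcode(): returns the HTTP status code

getheaders(): returns all response headers as a list of (name, value) tuples

getheader(header): returns the value of the named response header

read(): returns the response body as bytes; you usually call decode('utf-8') to turn it into a str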
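
The parameters and response methods above can be tried without touching a real website. The sketch below starts a throwaway HTTP server on 127.0.0.1 (the HelloHandler class and its response body are invented for this example) and fetches from it with urlopen. It reads the status through the status attribute, which holds the same value that getcode() returns.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short UTF-8 body."""

    def do_GET(self):
        body = "hello, urllib".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

# Ignore any system-wide proxy so the request reaches the local server directly
urllib.request.install_opener(
    urllib.request.build_opener(urllib.request.ProxyHandler({}))
)

# The response is a context manager, so the connection is closed automatically
with urllib.request.urlopen(url, timeout=5) as response:
    status = response.status
    content_type = response.getheader("Content-Type")
    text = response.read().decode("utf-8")  # bytes -> str

server.shutdown()
server.server_close()

print(status, content_type, text)
```

Using the with statement is a design choice rather than a requirement; it simply guarantees that the response is closed even if decoding fails.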

Example 1: sending a GET request

In [1]: import urllib.request

In [2]: url = 'http://www.baidu.com/'

In [3]: response = urllib.request.urlopen(url)

In [4]: type(response)
Out[4]: http.client.HTTPResponse

In [5]: response.geturl()
Out[5]: 'http://www.baidu.com/'

In [6]: response.getcode()
Out[6]: 200

In [7]:  response.getheaders()
Out[7]:
[('Bdpagetype', '1'),
 ('Bdqid', '0xbd15b751005a282a'),
 ('Cache-Control', 'private'),
 ('Content-Type', 'text/html'),
 ('Cxy_all', 'baidu+3cbc531560effd0810042d01f6fc0680'),
 ('Date', 'Sun, 08 Mar 2020 02:33:59 GMT'),
 ('Expires', 'Sun, 08 Mar 2020 02:33:47 GMT'),
 ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'),
 ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'),
 ('Server', 'BWS/1.1'),
 ('Set-Cookie',
  'BAIDUID=FF888AB675F1A23341FB377B116D82A0:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'BIDUPSID=FF888AB675F1A23341FB377B116D82A0; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'PSTM=1583634839; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'),
 ('Set-Cookie',
  'BAIDUID=FF888AB675F1A23331F924C08EC63A7D:FG=1; max-age=31536000; expires=Mon, 08-Mar-21 02:33:59 GMT; domain=.baidu.com; path=/; version=1; comment=bd'),
 ('Set-Cookie', 'delPer=0; path=/; domain=.baidu.com'),
 ('Set-Cookie', 'BDSVRTM=0; path=/'),
 ('Set-Cookie', 'BD_HOME=0; path=/'),
 ('Set-Cookie',
  'H_PS_PSSID=30971_1447_21104_30823_26350; path=/; domain=.baidu.com'),
 ('Traceid', '1583634839059064193013624997806205446186'),
 ('Vary', 'Accept-Encoding'),
 ('X-Ua-Compatible', 'IE=Edge,chrome=1'),
 ('Connection', 'close'),
 ('Transfer-Encoding', 'chunked')]

In [8]: response.getheader('Connection')
Out[8]: 'close'


In [9]: print(response.read().decode('utf-8'))
<!DOCTYPE html>
<!--STATUS OK-->
... (rest of the page source omitted)

Example 2: sending a POST request

urllib.parse.urlencode(): URL-encodes the data; in practice it turns a dict into a str

encode('utf-8'): turns the str into bytes

In [11]: import urllib.request

In [12]: import urllib.parse

In [13]: url = 'http://www.httpbin.org/post'

In [14]: params = {
    ...:     'from':'AUTO',
    ...:     'to':'AUTO'
    ...: }

In [15]: data = urllib.parse.urlencode(params).encode('utf-8')

In [16]: response = urllib.request.urlopen(url=url, data=data)

In [17]: html = response.read().decode('utf-8')

In [18]: print(html)
# {
#   "args": {}, 
#   "data": "", 
#   "files": {}, 
#   "form": { # this is the form data we sent
#     "from": "AUTO", 
#     "to": "AUTO"
#   }, 
#   "headers": {
#     "Accept-Encoding": "identity", 
#     "Connection": "close", 
#     "Content-Length": "17", 
#     "Content-Type": "application/x-www-form-urlencoded", 
#     "Host": "www.httpbin.org", 
#     "User-Agent": "Python-urllib/3.6"
#   }, 
#   "json": null, 
#   "origin": "116.16.107.180", 
#   "url": "http://www.httpbin.org/post"
# }
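
The two conversion steps in this example can be checked offline. The sketch below repeats them with the same form data and builds a Request object only to inspect it; nothing is sent. Note that the length of the encoded body matches the Content-Length value of 17 that httpbin echoed back above.

```python
import urllib.parse
import urllib.request

params = {"from": "AUTO", "to": "AUTO"}

query = urllib.parse.urlencode(params)  # dict -> str
body = query.encode("utf-8")            # str -> bytes, ready to pass as data

# Building a Request does not send anything; it lets us see what would be sent
request = urllib.request.Request("http://www.httpbin.org/post", data=body)

print(type(query).__name__, type(body).__name__)
print(query, len(body))
print(request.get_method())
```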

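(2) The Request object

We can also pass a Request object to urllib.request.urlopen() instead of a URL string.

Why bother with a Request object? Because the parameters shown so far give us no way to set request headers, and headers matter a great deal for a crawler.

Many websites first inspect the User-Agent header to decide whether a request comes from a crawler.

By setting the User-Agent header ourselves, we can make the crawler look like a browser and get past this basic check.

Here is a site that lists commonly seen User-Agent strings:

https://techblog.willshouse.com/2012/01/03/most-common-user-agents/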

urllib.request.Request()

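Its parameters are:

url: the target URL

data: the form data to submit with a POST request; defaults to None

headers: a dict of request headers to send; defaults to {}

origin_req_host: the host name or IP address of the original request; defaults to None

unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it; defaults to False

method: the request method as a string; defaults to None, in which case GET is used when data is None and POST otherwise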

In [1]: import urllib.request

In [2]: url = 'http://www.baidu.com/'

In [3]: headers = {
   ...:     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
   ...: }

In [4]: req = urllib.request.Request(url, headers=headers, method='GET')

In [5]: response = urllib.request.urlopen(req)

In [6]: html = response.read().decode('utf-8')

In [7]: print(html)
<!DOCTYPE html>
<!--STATUS OK-->
... (rest of the page source omitted)
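
How the data, headers, and method parameters interact can be checked without sending anything, because a Request object can be inspected before it is opened. In this sketch the User-Agent value is a shortened, illustrative string and the URL is only a placeholder.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64)"  # shortened, illustrative value
url = "http://www.httpbin.org/get"                   # placeholder; no request is sent

plain = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
with_body = urllib.request.Request(url, data=b"k=v")
explicit = urllib.request.Request(url, data=b"k=v", method="PUT")

# No data -> GET, data -> POST, explicit method wins over both
methods = [plain.get_method(), with_body.get_method(), explicit.get_method()]

# Request stores header names via str.capitalize(), hence "User-agent"
stored_name = list(plain.headers)[0]
stored_value = plain.get_header("User-agent")

print(methods)
print(stored_name, stored_value)
```

The capitalized key matters only when you read headers back from the Request; on the wire, header names are case-insensitive.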

(3) Using cookies

What is a cookie?

A cookie is data that a website stores on the user's machine in order to identify the user and track the session.

① Getting cookies

In [9]: import urllib.request

In [10]: import http.cookiejar

In [11]: cookie = http.cookiejar.CookieJar()

In [12]: cookie_handler = urllib.request.HTTPCookieProcessor(cookie)

In [13]: opener = urllib.request.build_opener(cookie_handler)

In [14]: response = opener.open('http://www.baidu.com')

In [15]: for item in cookie:
    ...:         print(item.name + '=' + item.value)
    ...:
BAIDUID=159700AA1F13A9B0E9BD35925136E444:FG=1
BIDUPSID=159700AA1F13A9B0697A44EC94CE7B47
H_PS_PSSID=30975_1441_21109_30995_30824_30717
PSTM=1583635407
BDSVRTM=0
BD_HOME=1
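
To see what HTTPCookieProcessor does beyond collecting cookies, the sketch below runs the same opener twice against a throwaway local server. The CookieHandler class and the cookie name "visit" are invented for this example. The server records the Cookie header of each request, which shows the jar sending back on the second request what it received on the first.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # Cookie header seen by the server, one entry per request


class CookieHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: records the incoming Cookie header and sets one cookie."""

    def do_GET(self):
        received.append(self.headers.get("Cookie"))
        self.send_response(200)
        self.send_header("Set-Cookie", "visit=1; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), CookieHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_port

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),          # bypass any system proxy for the local server
    urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies automatically
)

opener.open(url, timeout=5).close()  # first request: no cookie yet, server sets one
opener.open(url, timeout=5).close()  # second request: the jar sends the cookie back

server.shutdown()
server.server_close()

jar_contents = [(c.name, c.value) for c in jar]
print(received)
print(jar_contents)
```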


② Using cookies

In [17]: import urllib.request

In [18]: import http.cookiejar

In [19]: cookie = http.cookiejar.MozillaCookieJar('cookie.txt')

In [20]: cookie_handler = urllib.request.HTTPCookieProcessor(cookie)

In [21]: opener = urllib.request.build_opener(cookie_handler)

In [22]: response = opener.open('http://www.baidu.com')

In [23]: cookie.save(ignore_discard=True,ignore_expires=True)

In [24]: cookie = http.cookiejar.MozillaCookieJar()

In [25]: cookie.load('cookie.txt',ignore_discard=True,ignore_expires=True)

In [26]: cookie_handler = urllib.request.HTTPCookieProcessor(cookie)

In [27]: opener = urllib.request.build_opener(cookie_handler)

In [28]: response = opener.open('http://www.baidu.com')
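
The save and load steps can be exercised on their own. In the sketch below the cookie is built by hand with a hypothetical make_cookie helper (normally a server would set it), written to a temporary file, and read back into a fresh jar. It also shows that load() fills the jar in place and returns None, so its result should not be assigned back to the jar variable.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper for this example)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), "cookie.txt")

# Save: ignore_discard=True keeps session cookies, which would otherwise be skipped
saved = http.cookiejar.MozillaCookieJar(path)
saved.set_cookie(make_cookie("session_id", "abc123", ".example.com"))
saved.save(ignore_discard=True, ignore_expires=True)

# Load into a fresh jar; load() modifies the jar in place
loaded = http.cookiejar.MozillaCookieJar()
result = loaded.load(path, ignore_discard=True, ignore_expires=True)

names = [c.name for c in loaded]
print(result, names)
```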

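(4) Using proxies

If a single IP address sends a large number of requests in a short time, some websites will classify it as a crawler and block it.

Rotating requests across several proxy IP addresses is a common way around this check. The following sites list free proxies:

Xici Proxy (西刺代理): http://www.xicidaili.com/nn/

Yun Proxy (云代理): http://www.ip3366.net/free/

Kuaidaili (快代理): https://www.kuaidaili.com/free/

Note that free proxies are generally very unstable and the lists change constantly, so it is best to write your own crawler to collect and check them.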

In [36]: import urllib.request

In [37]: import random

In [38]: ip_list = [
    ...:     {'http':'61.135.217.7:80'},
    ...:     {'http':'182.88.161.204:8123'}
    ...: ]

In [39]: proxy_handler = urllib.request.ProxyHandler(random.choice(ip_list))

In [40]: opener = urllib.request.build_opener(proxy_handler)

In [41]: response = opener.open('http://www.httpbin.org/ip')
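
The random selection can be wrapped in a small function so that a fresh opener is built for each request. This is only a sketch: build_proxy_opener is a hypothetical helper, the addresses are the ones from the session above and are very likely no longer reachable, and no request is actually sent.

```python
import random
import urllib.request

# Addresses copied from the session above; treat them as placeholders
PROXY_POOL = [
    {"http": "61.135.217.7:80"},
    {"http": "182.88.161.204:8123"},
]


def build_proxy_opener(pool):
    """Pick one proxy at random and return (opener, chosen_proxy)."""
    chosen = random.choice(pool)
    handler = urllib.request.ProxyHandler(chosen)
    return urllib.request.build_opener(handler), chosen


opener, chosen = build_proxy_opener(PROXY_POOL)

# The opener now contains our ProxyHandler instead of the default one
proxy_handlers = [
    h for h in opener.handlers if isinstance(h, urllib.request.ProxyHandler)
]
print(chosen, proxy_handlers[0].proxies)
```

In real use you would call opener.open(url, timeout=...) inside a try block and move on to another proxy when the request fails, since free proxies drop out frequently.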

2. The parse module

The parse module is used to work with URLs.

(1) quote

If you put Chinese characters directly into a URL, urlopen fails with a UnicodeEncodeError, as shown below:

In [43]: import urllib.request

In [44]: url = 'https://www.baidu.com/s?wd=爬虫'

In [45]: response = urllib.request.urlopen(url)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-45-1b2a8a7379ce> in <module>
----> 1 response = urllib.request.urlopen(url)

This is where quote comes in. It percent-encodes special characters, turning the URL above into a valid one:

In [48]: import urllib.parse

In [49]: url = 'https://www.baidu.com/s?wd=' + urllib.parse.quote('爬虫')

In [50]: url
Out[50]: 'https://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB'

In [51]: response = urllib.request.urlopen(url)
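
quote has a few relatives that are worth knowing. The sketch below shows the round trip through unquote, the effect of the safe parameter, and how quote_plus differs; the sample string "a b/c" is arbitrary.

```python
import urllib.parse

keyword = "爬虫"
encoded = urllib.parse.quote(keyword)    # percent-encode the UTF-8 bytes
decoded = urllib.parse.unquote(encoded)  # and back again

path_like = urllib.parse.quote("a b/c")        # "/" is kept by default (safe="/")
strict = urllib.parse.quote("a b/c", safe="")  # encode "/" as well
query_like = urllib.parse.quote_plus("a b/c")  # space becomes "+", "/" is encoded

print(encoded, decoded)
print(path_like, strict, query_like)
```

As a rule of thumb, quote suits path segments, while quote_plus matches the encoding used for form data and query strings.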

(2) urlencode

We already used urlencode in the POST example earlier in this article; let's go over it again.

In short, urlencode turns a dict into a URL-encoded query string (a str). For example:

In [53]: import urllib.parse

In [54]: params = {
    ...:     'from':'AUTO',
    ...:     'to':'AUTO'
    ...: }

In [55]: data = urllib.parse.urlencode(params)

In [56]: data
Out[56]: 'from=AUTO&to=AUTO'
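
urlencode also takes care of percent-encoding, so it can replace the manual string concatenation used with quote earlier. The sketch below builds a search URL with a Chinese keyword, shows the doseq option for repeated keys, and uses parse_qs to go in the opposite direction. The parameter names "pn" and "tag" are chosen only for illustration.

```python
import urllib.parse

params = {"wd": "爬虫", "pn": 10}        # non-string values are converted with str()
query = urllib.parse.urlencode(params)
url = "https://www.baidu.com/s?" + query

# doseq=True expands a list value into one key=value pair per element
multi = urllib.parse.urlencode({"tag": ["a", "b"]}, doseq=True)

# parse_qs reverses the encoding; every value comes back as a list of strings
back = urllib.parse.parse_qs(query)

print(url)
print(multi)
print(back)
```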

(3) urlparse

urlparse parses a URL and returns a ParseResult object.

The result behaves like a 6-tuple that mirrors the general structure of a URL: scheme://netloc/path;parameters?query#fragment

In [58]: import urllib.parse

In [59]: url = 'http://www.example.com:80/python.html?page=1&kw=urllib'

In [60]: url_after = urllib.parse.urlparse(url)

In [61]: url_after
Out[61]: ParseResult(scheme='http', netloc='www.example.com:80', path='/python.html', params='', query='page=1&kw=urllib', fragment='')

In [62]: url_after.port
Out[62]: 80
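
Because ParseResult is a named tuple, it can be unpacked, queried by attribute, and modified with _replace. The sketch below uses the same URL as above to pull out the query parameters, build the URL of the next page, and resolve a relative link with urljoin. The relative path "images/logo.png" is a made-up example.

```python
import urllib.parse

url = "http://www.example.com:80/python.html?page=1&kw=urllib"
parts = urllib.parse.urlparse(url)

# Tuple unpacking follows the order scheme, netloc, path, params, query, fragment
scheme, netloc, path, params, query, fragment = parts
host = parts.hostname                        # netloc without the port
query_dict = urllib.parse.parse_qs(parts.query)

# Swap in a new query string and reassemble the URL
next_query = urllib.parse.urlencode({"page": 2, "kw": "urllib"})
next_url = parts._replace(query=next_query).geturl()

# Resolve a relative link against the page URL
absolute = urllib.parse.urljoin(url, "images/logo.png")

print(host, query_dict)
print(next_url)
print(absolute)
```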

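3. The error module

The error module defines the exceptions used for error handling. Its two most important classes are URLError and HTTPError.

Note that HTTPError is a subclass of URLError, so the except clause for HTTPError must come before the one for URLError. A typical pattern looks like this: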

In [64]: import urllib.request

In [65]: import urllib.error

In [66]: import socket

In [67]: try:
    ...:     response = urllib.request.urlopen('http://www.httpbin.org/get', timeout=0.1)
    ...: except urllib.error.HTTPError as e:
    ...:     print("Error Code: ", e.code)
    ...:     print("Error Reason: ", e.reason)
    ...: except urllib.error.URLError as e:
    ...:     if isinstance(e.reason, socket.timeout):
    ...:         print('Time out')
    ...: else:
    ...:     print('Request Successfully')
    ...:
Time out
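
Why the order of the except clauses matters can be shown without any network access. In the sketch below, hand-built exceptions stand in for real failures, and describe is a hypothetical helper that uses the same clause order as the session above. If the URLError clause came first, it would also catch the HTTPError and the status code would never be reported.

```python
import socket
import urllib.error
from email.message import Message


def describe(exc):
    """Classify an exception using the same except order as the session above."""
    try:
        raise exc
    except urllib.error.HTTPError as e:  # subclass first
        return "HTTP error %d: %s" % (e.code, e.reason)
    except urllib.error.URLError as e:   # then the base class
        if isinstance(e.reason, socket.timeout):
            return "timed out"
        return "URL error: %s" % e.reason


# Hand-built exceptions so the sketch runs offline; the URL is only a label here
http_404 = urllib.error.HTTPError(
    "http://www.httpbin.org/status/404", 404, "NOT FOUND", Message(), None
)
timed_out = urllib.error.URLError(socket.timeout("timed out"))
refused = urllib.error.URLError("connection refused")

is_subclass = issubclass(urllib.error.HTTPError, urllib.error.URLError)
results = [describe(http_404), describe(timed_out), describe(refused)]

print(is_subclass)
print(results)
```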
