Python Web Crawlers (4): urllib

止步听风

Category: Web crawlers. Tags: urllib, cookie, proxyHandler

Article link: https://blog.csdn.net/SAKURASANN/article/details/106107835

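This article covers the main functions of the urllib library.

urllib can simulate a browser to send network requests, and it can save the data a server returns. It consists of the following modules: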

Module	Description
urllib.request	Opens and reads URLs
urllib.error	Contains the exceptions raised by urllib.request
urllib.parse	Parses URLs
urllib.robotparser	Parses robots.txt files
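
Of the four modules, urllib.robotparser is not used again in this article, so here is a minimal sketch of it. The robots.txt rules and the example.com URLs are made up for illustration; a real crawler would normally point the parser at a site with set_url() and read() instead of passing lines by hand.

```python
from urllib import robotparser

# Hypothetical robots.txt content, supplied inline so no network is needed.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(robots_lines)  # parse the rules directly from a list of lines

# can_fetch(useragent, url) reports whether the rules allow the fetch.
allowed_public = rp.can_fetch("*", "http://example.com/public/page.html")
allowed_private = rp.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

Checking can_fetch() before each request is a simple way for a crawler to respect a site's crawling rules.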

Urllib

Common functions

The most frequently used functions in urllib are:

urlopen

def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False, context=None):

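This function sends a request to a URL. Its main parameters are:

url: the URL to request
data: the data to send with the request, typically URL-encoded bytes. If this parameter is set, the request becomes a POST request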
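
The effect of the data parameter can be checked without sending anything, by building Request objects and asking which HTTP method they would use. This is only a sketch: the /post path on httpbin.org is an assumed placeholder, and no request is actually made.

```python
from urllib import parse, request

url = "http://httpbin.org/post"  # placeholder endpoint, never contacted here

# Without data, the request defaults to GET.
get_req = request.Request(url)

# data must be bytes, so the encoded form string is converted with encode().
form = parse.urlencode({"wd": "python"}).encode("utf-8")
post_req = request.Request(url, data=form)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(post_req.data)          # b'wd=python'
```

Passing either Request object to urlopen() would then send it with the method shown.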

For http/https URLs, the official docstring describes the return value as follows:

This function always returns an object which can work as a context
manager and has methods such as

* geturl() - return the URL of the resource retrieved, commonly used to
  determine if a redirect was followed

* info() - return the meta-information of the page, such as headers, in the
  form of an email.message_from_string() instance (see Quick Reference to
  HTTP Headers)

* getcode() - return the HTTP status code of the response.  Raises URLError
  on errors.

For HTTP and HTTPS URLs, this function returns a http.client.HTTPResponse
object slightly modified. In addition to the three new methods above, the
msg attribute contains the same information as the reason attribute ---
the reason phrase returned by the server --- instead of the response
headers as it is specified in the documentation for HTTPResponse.

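In other words, for HTTP and HTTPS URLs the return value is an http.client.HTTPResponse object. HTTPResponse is a class defined in the http.client module of Python's standard library (it derives from io.BufferedIOBase), so its file-like methods such as read(), readline() and readlines() are available, along with getcode().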

from urllib import request

response = request.urlopen('https://www.baidu.com/')
print(type(response))
print(response.read())

Output:

<class 'http.client.HTTPResponse'>
b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

The b prefix in front of the printed result of response.read() means the value is a bytes object, Python's type for raw binary data.
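
To work with the page as text, the bytes have to be decoded with the page's character encoding. A small sketch, using a hand-written bytes literal in place of a real response and assuming the page is UTF-8:

```python
# Sample bytes standing in for what response.read() returns.
raw = b"<title>\xe7\x99\xbe\xe5\xba\xa6</title>"

# decode() turns bytes into str; the encoding must match the page's charset.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)  # bytes str
print(text)
```

If the encoding is wrong, decode() raises UnicodeDecodeError or produces garbled text, so it is worth checking the charset declared by the page or the Content-Type header.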

request.Request

The constructor of this class is:

def __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None):

The headers argument sets the request headers, which lets the crawler present itself as a browser.

from urllib import request

url = 'http://www.baidu.com/s?wd=python'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
req = request.Request(url,headers=headers)
response = request.urlopen(req)
print(response.read())

Output (in this run Baidu returned its security verification page rather than search results):

b'<!DOCTYPE html>\n<html lang="zh-CN">\n<head>\n    <meta charset="utf-8">\n    <title>\xe7\x99\xbe\xe5\xba\xa6\xe5\xae\x89\xe5\x85\xa8\xe9\xaa\x8c\xe8\xaf\x81</title>\n    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n    <meta name="apple-mobile-web-app-capable" content="yes">\n    <meta name="apple-mobile-web-app-status-bar-style" content="black">\n    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">\n    <meta name="format-detection" content="telephone=no, email=no">\n    <link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">\n    <link rel="icon" sizes="any" mask href="https://www.baidu.com/img/baidu.svg">\n    <meta http-equiv="X-UA-Compatible" content="IE=Edge">\n    <meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">\n    <link rel="stylesheet" href="https://wappass.bdimg.com/static/touch/css/api/mkdjump_8befa48.css" />\n</head>\n<body>\n    <div class="timeout hide">\n        <div class="timeout-img"></div>\n        <div class="timeout-title">\xe7\xbd\x91\xe7\xbb\x9c\xe4\xb8\x8d\xe7\xbb\x99\xe5\x8a\x9b\xef\xbc\x8c\xe8\xaf\xb7\xe7\xa8\x8d\xe5\x90\x8e\xe9\x87\x8d\xe8\xaf\x95</div>\n        <button type="button" class="timeout-button">\xe8\xbf\x94\xe5\x9b\x9e\xe9\xa6\x96\xe9\xa1\xb5</button>\n    </div>\n    <div class="timeout-feedback hide">\n        <div class="timeout-feedback-icon"></div>\n        <p class="timeout-feedback-title">\xe9\x97\xae\xe9\xa2\x98\xe5\x8f\x8d\xe9\xa6\x88</p>\n    </div>\n\n<script src="https://wappass.baidu.com/static/machine/js/api/mkd.js"></script>\n<script src="https://wappass.bdimg.com/static/touch/js/mkdjump_6003cf3.js"></script>\n</body>\n</html><!--25127207760471555082051323-->\n<script> var _trace_page_logid = 2512720776; </script>'
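
The headers attached to a Request can be inspected before anything is sent. The sketch below uses a shortened, made-up User-Agent string. One detail worth knowing is that Request normalizes header names with str.capitalize(), so User-Agent is stored as User-agent.

```python
from urllib import request

req = request.Request(
    "http://www.baidu.com/s?wd=python",
    headers={"User-Agent": "Mozilla/5.0 (example UA string)"},  # illustrative value
)
# Headers can also be added after construction.
req.add_header("Referer", "http://www.baidu.com/")

print(req.header_items())

# Lookups must use the normalized spelling of the header name.
ua = req.get_header("User-agent")
ua_wrong_case = req.get_header("User-Agent")
print(ua, ua_wrong_case)
```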

urlretrieve

def urlretrieve(url, filename=None, reporthook=None, data=None):

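This function saves the content at the requested URL to a local file named filename. It returns a tuple (filename, headers), where headers is the same object that info() returns for the response.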

from urllib import request

request.urlretrieve('http://www.baidu.com/','saved.html')
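
The return value of urlretrieve can be seen offline by retrieving a file:// URL. This is a sketch only: the temporary file names are arbitrary, and a local file stands in for a web page.

```python
import os
import tempfile
from pathlib import Path
from urllib import request

with tempfile.TemporaryDirectory() as tmp:
    # Create a small local "page" to download.
    source = Path(tmp) / "source.html"
    source.write_text("<html>hello</html>", encoding="utf-8")
    target = os.path.join(tmp, "saved.html")

    # urlretrieve returns the local file name and the response headers.
    saved_path, headers = request.urlretrieve(source.as_uri(), target)
    saved_text = Path(saved_path).read_text(encoding="utf-8")

print(os.path.basename(saved_path))
print(saved_text)
```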

urlencode

def urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus):

The official docstring describes this function as follows:

Encode a dict or sequence of two-element tuples into a URL query string.

That is, the function encodes a dictionary, or a sequence of two-element tuples, into a URL query string.

from urllib import parse

di = {'名字':'zhangsan',
      '性别':'男'}
di_encode = parse.urlencode(di)
print(di_encode)

Output:

%E5%90%8D%E5%AD%97=zhangsan&%E6%80%A7%E5%88%AB=%E7%94%B7
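
A few other inputs that urlencode accepts, as a short sketch with made-up parameter names: a sequence of pairs, a list value expanded with doseq=True, and the quote_via parameter from the signature above.

```python
from urllib import parse

# A sequence of two-element tuples keeps its order.
pairs = parse.urlencode([("wd", "python"), ("page", 2)])

# doseq=True expands a list value into one key=value pair per item.
multi = parse.urlencode({"tag": ["a", "b"]}, doseq=True)

# The default quote_via=quote_plus encodes a space as '+'.
spaced = parse.urlencode({"q": "web crawler"})

# quote_via=parse.quote encodes a space as '%20' instead.
percent = parse.urlencode({"q": "web crawler"}, quote_via=parse.quote)

print(pairs)    # wd=python&page=2
print(multi)    # tag=a&tag=b
print(spaced)   # q=web+crawler
print(percent)  # q=web%20crawler
```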

parse_qs

def parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace'):

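Where there is encoding there is also decoding, and parse_qs can be regarded as the inverse of urlencode. Each value in the returned dictionary is a list, because a key may appear more than once in a query string. The encoding parameter of parse_qs defaults to utf-8.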

from urllib import parse

di = {'名字':'zhangsan',
      '性别':'男'}
di_encode = parse.urlencode(di)
print(di_encode)
di_qs = parse.parse_qs(di_encode)
print(di_qs)

Output:

%E5%90%8D%E5%AD%97=zhangsan&%E6%80%A7%E5%88%AB=%E7%94%B7
{'名字': ['zhangsan'], '性别': ['男']}
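
Related decoding helpers in urllib.parse behave slightly differently, which the following sketch illustrates with a made-up query string: parse_qs groups repeated keys and drops blank values by default, parse_qsl returns ordered pairs, and unquote decodes a single percent-encoded string.

```python
from urllib import parse

qs = "wd=python&tag=a&tag=b&empty="

# Repeated keys are collected into one list; blank values are dropped.
as_dict = parse.parse_qs(qs)

# keep_blank_values=True keeps 'empty' with an empty string.
with_blank = parse.parse_qs(qs, keep_blank_values=True)

# parse_qsl returns a list of (key, value) pairs in order.
as_pairs = parse.parse_qsl(qs)

# unquote decodes one percent-encoded string.
decoded = parse.unquote("%E7%94%B7")

print(as_dict)
print(with_blank)
print(as_pairs)
print(decoded)
```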

urlparse

def urlparse(url, scheme='', allow_fragments=True):

This function parses a URL into the following six components:

<scheme>://<netloc>/<path>;<params>?<query>#<fragment>

The return value is a ParseResult, a named tuple holding those six components.

from urllib import parse

url = 'http://www.baidu.com/s?wd=python'
url_parse = parse.urlparse(url)
print(url_parse)

Output:

ParseResult(scheme='http', netloc='www.baidu.com', path='/s', params='', query='wd=python', fragment='')
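
Because the result is a named tuple, each component can be read by attribute. The sketch below uses a made-up example.com URL that fills in all six parts, and also shows the derived hostname and port attributes.

```python
from urllib import parse

result = parse.urlparse("http://www.example.com:8080/path;type=a?wd=python#top")

print(result.scheme)    # http
print(result.netloc)    # www.example.com:8080
print(result.path)      # /path
print(result.params)    # type=a
print(result.query)     # wd=python
print(result.fragment)  # top

# hostname and port are derived from netloc.
print(result.hostname, result.port)

# geturl() reassembles the components into a URL string.
rebuilt = result.geturl()
```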

urlsplit

def urlsplit(url, scheme='', allow_fragments=True):

This function parses a URL into the following five components:

<scheme>://<netloc>/<path>?<query>#<fragment>

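The return value is a SplitResult, a named tuple holding those five components. Compared with urlparse, this function does not split out the params part.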

from urllib import parse

url = 'http://www.baidu.com/s?wd=python'
url_parse = parse.urlsplit(url)
print(url_parse)

Output:

SplitResult(scheme='http', netloc='www.baidu.com', path='/s', query='wd=python', fragment='')
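
The difference between the two functions only shows up when the URL actually contains a ;params part. A short sketch with a made-up example.com URL:

```python
from urllib import parse

url = "http://www.example.com/path;type=a?wd=python#top"

parsed = parse.urlparse(url)
split = parse.urlsplit(url)

# urlparse separates the params from the path.
print(parsed.path, parsed.params)  # /path type=a

# urlsplit leaves them inside the path.
print(split.path)                  # /path;type=a

print(len(parsed), len(split))     # 6 5
```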

ProxyHandler

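Some websites use anti-crawler mechanisms that monitor the requests coming from each IP address. If the traffic from an address looks abnormal, the site restricts access from that IP. When building a crawler, a proxy can be used to avoid this problem.

In urllib, a proxy server is configured with ProxyHandler: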

from urllib import request

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}

# no proxy
response = request.urlopen('http://httpbin.org/ip')
print(response.read())

# using a proxy
proxy = request.ProxyHandler({"http" : "125.126.120.169:60004"})
opener = request.build_opener(proxy)
req = request.Request('http://httpbin.org/ip',headers=headers)
response = opener.open(req)
print(response.read())

Sending a GET request to httpbin.org/ip returns the IP address the request came from, so the output of the code above is:

b'{\n  "origin": "223.90.237.229"\n}\n'
b'{\n  "origin": "125.126.120.169"\n}\n'

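ProxyHandler is a class. Its constructor takes a dictionary that maps a protocol name to a proxy address.

One interesting observation: when the code above is run while connected to a VPN, both requests print the same external IP address.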
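
How ProxyHandler and build_opener fit together can be checked without contacting any proxy. In this sketch the address 192.0.2.10:8080 is a placeholder from a documentation-only range, not a working proxy.

```python
from urllib import request

# Map each protocol to a proxy address (placeholder value, not a real proxy).
proxies = {"http": "http://192.0.2.10:8080"}

proxy_handler = request.ProxyHandler(proxies)
opener = request.build_opener(proxy_handler)

# The opener now carries the handler and would route http:// requests through it.
has_proxy = proxy_handler in opener.handlers
print(has_proxy)
print(proxy_handler.proxies)

# Optional: request.install_opener(opener) would make plain urlopen() calls
# use this opener as well.
```

Using opener.open() keeps the proxy local to that opener, while install_opener() applies it to every later urlopen() call in the process.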

Cookie

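In Chrome, the cookies stored by the browser can be viewed under Settings -> Advanced -> Site Settings -> Cookies.
HTTP/HTTPS requests are stateless by default, so without an extra mechanism a logged-in user would have to supply credentials again on every request. That would badly hurt the user experience, and cookies exist to solve this problem.
After the first login, the server sends some data (the cookie) to the browser, which stores it locally. When the user later sends another request to the same server, the stored cookie is sent along, so the login details do not need to be entered again.
Not everything can be stored in a cookie. The amount of data a cookie can hold is limited; the limit varies by browser but is typically about 4 KB.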

Cookie format

Set-Cookie: NAME=VALUE; Expires/Max-age=DATE; Path=PATH; Domain=DOMAIN_NAME; SECURE

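NAME: the name of the cookie.
VALUE: the value of the cookie.
Expires/Max-age: when the cookie expires (Expires takes a date, Max-Age a number of seconds).
Path: the path the cookie applies to.
Domain: the domain the cookie applies to.
SECURE: whether the cookie is sent only over HTTPS.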
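
The fields listed above can be pulled out of a cookie string with the standard library's http.cookies module, which the article itself does not use. A minimal sketch with a made-up cookie:

```python
from http.cookies import SimpleCookie

# Hypothetical value of a Set-Cookie header.
header_value = "sessionid=abc123; Max-Age=3600; Path=/; Domain=example.com; Secure"

cookies = SimpleCookie()
cookies.load(header_value)

# Each cookie is a Morsel: .value holds VALUE, and attributes are read by key.
morsel = cookies["sessionid"]
print(morsel.value)
print(morsel["max-age"], morsel["path"], morsel["domain"])
print(bool(morsel["secure"]))
```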

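Accessing pages while logged in

Sending an http/https request as a logged-in user requires the cookie. There are two ways to handle this:

Use the browser's developer tools (F12) to copy the cookie of the logged-in session and put it in the headers
Use a cookie library to manage it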

from urllib import request

url = 'https://www.douban.com/'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
           'Referer':'https://www.douban.com/',
           'Cookie':'paste the Cookie string copied from your browser here'}

req = request.Request(url,headers=headers)
response = request.urlopen(req)
with open('saved.html','w',encoding='utf-8') as fp:
    fp.write(response.read().decode('utf-8'))

This approach saves the logged-in HTML page directly to a local file. Alternatively, a cookie library can manage the cookies for you.

http.cookiejar

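The main cookie-related classes in this module are CookieJar, FileCookieJar, MozillaCookieJar and LWPCookieJar.

CookieJar	Manages HTTP cookie values: stores the cookies generated by HTTP requests and adds cookies to outgoing HTTP requests. All cookies are kept in memory, so they are lost once the CookieJar instance is destroyed.
FileCookieJar	Subclass of CookieJar. Retrieves cookie information and stores it in a file; it can also read cookies back from a file.
MozillaCookieJar	Subclass of FileCookieJar. Creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
LWPCookieJar	Subclass of FileCookieJar. Creates a FileCookieJar instance compatible with the libwww-perl Set-Cookie3 file format.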

from urllib import request,parse
from http.cookiejar import CookieJar

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

login_url = "http://www.renren.com/ajaxLogin/login"
target_url = 'http://www.renren.com/880151247/profile'
user_info = {"email": "username", "password": "password"}

cookie = CookieJar()
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)

data = parse.urlencode(user_info).encode('utf-8')
req = request.Request(login_url,data=data,headers=headers)
response = opener.open(req)

req = request.Request(target_url,headers=headers)
response = opener.open(req)
with open('saved.html','w',encoding='utf-8') as fp:
    fp.write(response.read().decode('utf-8'))

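The program above saves a specific user's profile page. That page cannot be reached without being logged in, so the program first logs in through the opener; the CookieJar keeps the cookies from the login response and the opener sends them with the second request.

I originally wanted to use the same approach to access my personal page on Douban, but the request fails with a "missing parameter" error, and I have not worked out what went wrong.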
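
What the CookieJar does for each request can be seen offline by putting a cookie into a jar by hand and asking the jar to add its Cookie header to two requests. This is a sketch under assumptions: make_cookie is a hypothetical helper, and the cookie name, value and example.com/example.org hosts are made up. In the login program the jar is filled from the server's responses instead.

```python
from http.cookiejar import Cookie, CookieJar
from urllib import request


def make_cookie(name, value, domain):
    """Hypothetical helper: build a simple session cookie for `domain`."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = CookieJar()
jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))

same_site = request.Request("http://example.com/profile")
other_site = request.Request("http://example.org/profile")

# The jar only attaches cookies whose domain and path match the request.
jar.add_cookie_header(same_site)
jar.add_cookie_header(other_site)

print(same_site.get_header("Cookie"))
print(other_site.get_header("Cookie"))
```

HTTPCookieProcessor performs this step automatically for every request sent through the opener, and also stores cookies from each response.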

Saving cookies to a local file

The cookies of a web page can also be saved locally with a file-backed cookie jar such as MozillaCookieJar:

from urllib import request
from http.cookiejar import MozillaCookieJar

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

url = "https://www.baidu.com"

cookie = MozillaCookieJar('cookie.txt')
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)

req = request.Request(url,headers=headers)
response = opener.open(req)

cookie.save(ignore_discard=True,ignore_expires=True)

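This saves the cookies of the visited page to a local file named cookie.txt. Passing ignore_discard=True also saves cookies that are marked to be discarded at the end of the session, and ignore_expires=True also saves cookies that have already expired.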

Loading cookies from a local file

Since cookies can be saved to a local file, they can also be loaded back from it:

from urllib import request
from http.cookiejar import MozillaCookieJar

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
}

url = "https://www.baidu.com"

cookie = MozillaCookieJar('cookie.txt')
cookie.load(ignore_discard=True,ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)

req = request.Request(url,headers=headers)
response = opener.open(req)

This loads the locally stored cookies into the newly created MozillaCookieJar object.
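
The save and load steps can be tried together without any network access. The sketch below assumes a hand-made session cookie (make_cookie is a hypothetical helper and the cookie values are made up) and writes to a temporary directory. It also shows why ignore_discard=True matters: a session cookie is skipped by a plain load().

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar


def make_cookie(name, value, domain):
    """Hypothetical helper: build a session cookie (no expiry) for `domain`."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookie.txt")

    # Save: ignore_discard=True is needed to write a session cookie.
    saved = MozillaCookieJar(path)
    saved.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
    saved.save(ignore_discard=True, ignore_expires=True)

    # A plain load() skips cookies marked as discardable.
    strict = MozillaCookieJar(path)
    strict.load()

    # Loading with the same flags restores the session cookie.
    loaded = MozillaCookieJar(path)
    loaded.load(ignore_discard=True, ignore_expires=True)

strict_names = [c.name for c in strict]
loaded_names = [c.name for c in loaded]
print(strict_names, loaded_names)
```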