Python3 网络爬虫开发实战 (Python 3 Web Crawler Development in Practice): Reading Notes, Part 2

艾尔伯特想变瘦

Published 2021-10-29 21:05:14

Tags: python, crawler, programming language

Article link: https://blog.csdn.net/m0_50710793/article/details/121037202

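3.1 Using urllib

urllib contains four modules:
request: the most basic HTTP request module, used to simulate sending a request. Just as when you type a URL into a browser and press Enter, you only need to pass in the URL and any extra parameters to reproduce the whole process.
error: the exception-handling module. If a request fails, you can catch the exception and then retry or take other action so that the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: mainly used to read a site's robots.txt file and determine which pages may be crawled and which may not. It is rarely used.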
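
As a rough illustration of the two utility modules, the sketch below splits and joins URLs with urllib.parse and checks crawl rules with urllib.robotparser. Nothing is fetched from the network; the robots.txt rules and the example.com URLs are made up for the example.

```python
from urllib.parse import urlparse, urljoin
from urllib.robotparser import RobotFileParser

# parse: split a URL into its components
parts = urlparse("https://www.python.org/downloads/?lang=en#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# parse: resolve a relative link against a base URL
joined = urljoin("https://www.python.org/downloads/", "../about/")
print(joined)

# robotparser: feed in robots.txt rules (hypothetical ones here) and ask
# whether a given user agent may fetch a given URL
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
private_ok = rules.can_fetch("*", "https://example.com/private/data")
index_ok = rules.can_fetch("*", "https://example.com/index.html")
print(private_ok, index_ok)
```

In a real crawler the rules would normally come from the site itself, through the parser's set_url() and read() methods, rather than from a hard-coded list.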

3.1.1 Sending Requests

1. urlopen()

The urllib.request module can also handle authentication, redirects, browser cookies, and more.

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

The output is the source code of the Python home page.
HTTPResponse = urllib.request.urlopen('website')
Passing a URL to urllib.request.urlopen() returns an HTTPResponse object, which mainly provides:
the methods read(), readinto(), getheader(), getheaders(), and fileno(), and the attributes msg, version, status, reason, debuglevel, and closed.

print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

200
[('Connection', 'close'), ('Content-Length', '50780'), ('Server', 'nginx'), ('Content-Type', 'text/html; charset=utf-8'), ('X-Frame-Options', 'DENY'), ('Via', '1.1 vegur, 1.1 varnish, 1.1 varnish'), ('Accept-Ranges', 'bytes'), ('Date', 'Fri, 29 Oct 2021 09:01:31 GMT'), ('Age', '32'), ('X-Served-By', 'cache-bwi5128-BWI, cache-hkg17922-HKG'), ('X-Cache', 'HIT, HIT'), ('X-Cache-Hits', '3, 92'), ('X-Timer', 'S1635498092.822928,VS0,VE0'), ('Vary', 'Cookie'), ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]
nginx

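status: the status code of the response.
getheaders(): all of the response headers.
getheader('Server'): the value of the Server response header. The result is nginx, meaning the server is built on Nginx.

The urllib.request.urlopen() API:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

How the parameters are used:
data: the data to send with the request. If data is passed, urlopen() sends the request with POST instead of GET. The value must be bytes, so convert the parameters first with bytes() (or str.encode()).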

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
print(response.read())

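Note: this code passes one parameter, word, whose value is hello. It has to be converted to bytes, which is done with bytes(). The first argument to bytes() must be a string, so urlencode() from urllib.parse turns the parameter dictionary into a string; the second argument specifies the encoding, here utf8.
Also, the URL used above ends in /post. httpbin provides it for testing POST requests, and it echoes back information about the request, including the parameters we sent.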
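
The two points above (dictionary to bytes, and GET turning into POST once data is supplied) can be checked without sending anything. This is a small sketch; the extra "note" field is made up to show how spaces are encoded.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"word": "hello", "note": "a b"}  # "note" is a made-up extra field
encoded = urlencode(form)                # dict -> query-string text
data = encoded.encode("utf-8")           # text -> bytes, same as bytes(encoded, encoding="utf8")
print(encoded, type(data))

# Without data the method defaults to GET; with data it becomes POST
get_request = Request("https://www.httpbin.org/get")
post_request = Request("https://www.httpbin.org/post", data=data)
print(get_request.get_method(), post_request.get_method())
```

Request objects are covered in the next subsection; they are used here only because get_method() reports the method without opening a connection.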

timeout: the request timeout, in seconds. If the request takes longer than this, an exception is raised.

# If a page takes too long to respond, skip it.

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen("https://www.httpbin.org/get",timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT!')

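Combining timeout with exception handling in this way works well.
cafile: the CA certificate file (deprecated since Python 3.6 in favor of context, and removed in Python 3.13).
capath: the path to a directory of CA certificates (deprecated and removed along with cafile).
~~cadefault=False: deprecated; it can be ignored.~~
context: specifies the SSL settings and must be an ssl.SSLContext object; for example, it can be configured to skip certificate verification.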
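
A minimal sketch of the context parameter, assuming the goal is to skip certificate verification for a test server with a self-signed certificate. It only builds the ssl.SSLContext; the commented-out urlopen() call and its URL are placeholders. Disabling verification removes protection against man-in-the-middle attacks, so it is best kept to testing.

```python
import ssl

# Start from the default (verifying) context, then relax it
insecure_context = ssl.create_default_context()
insecure_context.check_hostname = False       # must be turned off before verify_mode
insecure_context.verify_mode = ssl.CERT_NONE  # do not verify the server certificate

print(insecure_context.check_hostname, insecure_context.verify_mode)

# Hypothetical usage (placeholder URL):
# import urllib.request
# response = urllib.request.urlopen("https://self-signed.example", context=insecure_context)
```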

2. Request

request = urllib.request.Request('https://www.python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

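Passing a Request object to urlopen() makes the request a standalone object and allows richer, more flexible configuration of its parameters.

How is a Request object constructed?
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

How the parameters are used:
url: the URL to request; required.
data: must be bytes. If the data is a dictionary, encode it first with urlencode() from urllib.parse.
headers: a dictionary of request headers. They can be supplied through the headers parameter when constructing the request, or added afterwards by calling add_header() on the request instance.
The most common use of request headers is to set User-Agent so that the crawler looks like a browser; the default User-Agent is Python-urllib.
origin_req_host: the host name or IP address of the requester.
unverifiable: whether the request is unverifiable; defaults to False. A request is unverifiable when the user had no option to approve it, for example an image fetched automatically while loading an HTML document.
method: a string naming the HTTP method to use, such as GET, POST, or PUT.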

from urllib import request, parse

url = 'https://www.httpbin.org/post'
headers = {
    # Disguise the request as a browser; the same User-Agent is added below via add_header()
    # 'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form_data = {
    'name1': 'Germey',
    'name2': 'China'
}
# Encode the form fields as bytes, build the POST request, then add one more header
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
print(request.urlopen(req).read().decode('utf-8'))

The result:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name1": "Germey", 
    "name2": "China"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "24", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)", 
    "X-Amzn-Trace-Id": "Root=1-617bdb3b-7a590c707e494c5552c2d142"
  }, 
  "json": null, 
  "origin": "113.54.234.17", 
  "url": "https://httpbin.org/post"
}
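
The headers httpbin echoes back can also be inspected locally, before anything is sent. The sketch below builds a similar request and reads the headers back from the Request object. One detail worth knowing is that Request stores header names capitalized, so 'User-Agent' is looked up as 'User-agent'.

```python
from urllib.request import Request

req = Request(
    "https://www.httpbin.org/post",
    data=b"name1=Germey",
    headers={"Host": "httpbin.org"},
    method="POST",
)
req.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")

# Header names are normalized with str.capitalize() when they are stored
print(req.get_method())
print(req.has_header("User-agent"), req.get_header("User-agent"))
print(sorted(req.header_items()))
```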

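3. Advanced Usage

A Handler is a more advanced tool. Think of Handlers as processors of various kinds: some handle login authentication, some handle cookies, some handle proxy settings. Used well, they can do almost anything an HTTP request involves.

About Handlers:

BaseHandler in the urllib.request module is the parent class of all other Handlers. It defines the basic interface, and subclasses implement methods such as default_open() and protocol_request() (where protocol stands for a scheme such as http).

The following classes work with it; the Handlers among them inherit from BaseHandler:
HTTPDefaultErrorHandler: handles HTTP error responses; errors are raised as HTTPError exceptions.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies; when none are given, it falls back to the proxy settings found in the environment.
HTTPPasswordMgr: manages passwords by keeping a table of user names and passwords (a helper used by the authentication handlers rather than a BaseHandler subclass itself).
HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, this handler can take care of it.

To use these advanced features there is one more important class: OpenerDirector, or Opener for short. The urlopen() function used earlier goes through a default Opener that urllib provides.
An Opener has an open() method whose return value is the same type as that of urlopen() (HTTPResponse).

How do Handlers and Openers relate?
Handlers are used to build an Opener.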
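
To make the Handler/Opener relationship concrete, here is a toy sketch that needs no network. EchoHandler and the echo:// scheme are invented for this illustration and are not part of urllib; the point is only that an Opener built with build_opener() dispatches each request to a handler method named after the URL scheme (here echo_open).

```python
import io
from email.message import Message
from urllib.request import BaseHandler, build_opener
from urllib.response import addinfourl


class EchoHandler(BaseHandler):
    """Hypothetical handler for a made-up echo:// scheme (illustration only)."""

    def echo_open(self, req):
        # The Opener calls "<scheme>_open" on its handlers, so this method
        # receives every request whose URL starts with echo://
        body = f"you asked for {req.host}".encode("utf-8")
        headers = Message()
        headers["Content-Type"] = "text/plain; charset=utf-8"
        return addinfourl(io.BytesIO(body), headers, req.full_url, code=200)


# Build an Opener from the Handler, then use open() just like urlopen()
opener = build_opener(EchoHandler())
response = opener.open("echo://hello")
echo_text = response.read().decode("utf-8")
print(response.code, echo_text)
```

The built-in handlers plug in the same way. The examples that follow pass HTTPBasicAuthHandler, ProxyHandler, and HTTPCookieProcessor to build_opener() instead of a home-made class.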

Worked examples:
(1) What if a page requires a user name and password before it can be viewed? Use HTTPBasicAuthHandler.

from urllib.request import HTTPBasicAuthHandler,HTTPPasswordMgrWithDefaultRealm, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None,url,username,password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

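Note: HTTPPasswordMgrWithDefaultRealm() creates a password manager object that stores the user names and passwords for HTTP requests. It is mainly used in two scenarios:
verifying the user name and password for an authenticating proxy (ProxyBasicAuthHandler())
verifying the user name and password for a web client (HTTPBasicAuthHandler())
The user name and password are added through the password manager's add_password() method.
The manager is then wrapped in an HTTPBasicAuthHandler, and build_opener() turns that Handler into an Opener that supplies the credentials whenever it sends a request. Calling open() on the Opener then passes authentication and returns the page source.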
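
How the password manager decides which credentials apply can be sketched without a server. The example below reuses the URL and credentials from the code above; the realm name and the example.com URL are made up. Passing None as the realm to add_password() registers the credentials as the default for any realm under that URL.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

manager = HTTPPasswordMgrWithDefaultRealm()
# realm=None: use these credentials for any realm under this URL
manager.add_password(None, "http://localhost:5000/", "username", "password")

# A URL below the registered one, in any realm, finds the credentials
inside = manager.find_user_password("Some Realm", "http://localhost:5000/admin/page")
# A URL on a different host does not
outside = manager.find_user_password("Some Realm", "http://example.com/")
print(inside, outside)
```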

(2) Crawlers can rarely avoid proxies. A proxy can be added like this:

# Proxy:
from urllib.request import ProxyHandler,build_opener
from urllib.error import URLError

proxy_handler = ProxyHandler({
    'http' : 'http://127.0.0.1:9743',
    'https' : 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

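Explanation: a proxy has been set up locally, listening on port 9743.
ProxyHandler takes a dictionary whose keys are protocol types (http or https) and whose values are proxy URLs; several proxies can be added.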
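
ProxyHandler applies to every request an Opener sends. As a possible alternative when only a single request should go through the proxy, Request has a set_proxy() method. The sketch below only inspects the Request object and sends nothing; it assumes the same local proxy on port 9743, and the target URL is just an example.

```python
from urllib.request import Request

req = Request("http://www.httpbin.org/get")
host_before, selector_before = req.host, req.selector
print(host_before, selector_before)

# Route just this request through the local proxy
req.set_proxy("127.0.0.1:9743", "http")
host_after, selector_after = req.host, req.selector
print(host_after, selector_after)
```

After set_proxy(), the connection target becomes the proxy and the selector becomes the full original URL, which is what an HTTP proxy expects to receive.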

(3) Cookies: how do we retrieve the cookies a website sets?

#Cookies
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)

First create a CookieJar object and pass it to HTTPCookieProcessor to get a handler; pass that handler to build_opener() to get an opener; finally call the opener's open() method.

Saving the cookies to a file:

The code:

# Save the cookies to a file
import http.cookiejar, urllib.request
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

Result:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	1667047477	BAIDUID	DFC46C706123184F12D91B470F042A74:FG=1
.baidu.com	TRUE	/	FALSE	3782995124	BIDUPSID	DFC46C706123184FC7F75862490CA964
.baidu.com	TRUE	/	FALSE	3782995124	PSTM	1635511477
www.baidu.com	FALSE	/	FALSE	1635511777	BD_NOT_HTTPS	1

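Here CookieJar is replaced with MozillaCookieJar, a subclass of it that is used when writing files. It handles cookie-related file operations such as loading and saving, and it stores cookies in the format used by Mozilla-type browsers.
Note: LWPCookieJar can also load and save cookies, but it writes a different format from MozillaCookieJar: an LWP-format file.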
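
The difference between the two file formats can be seen offline with a hand-built cookie. This is only a sketch: the cookie name, value, and domain are made up, and the files are written to a temporary directory rather than to cookies.txt.

```python
import os
import tempfile
import time
from http.cookiejar import Cookie, LWPCookieJar, MozillaCookieJar


def make_cookie(name, value):
    """Build a cookie by hand (made-up values) so that no request is needed."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=".example.com", domain_specified=True, domain_initial_dot=True,
        path="/", path_specified=True,
        secure=False, expires=int(time.time()) + 3600,
        discard=False, comment=None, comment_url=None, rest={},
    )


first_lines = {}
restored_values = {}
with tempfile.TemporaryDirectory() as folder:
    for jar_class in (MozillaCookieJar, LWPCookieJar):
        path = os.path.join(folder, jar_class.__name__ + ".txt")
        jar = jar_class(path)
        jar.set_cookie(make_cookie("session", "abc123"))
        jar.save(ignore_discard=True, ignore_expires=True)

        # The first line of the file identifies the format
        with open(path, encoding="utf-8") as f:
            first_lines[jar_class.__name__] = f.readline().strip()

        # Load the file into a fresh jar of the same class
        fresh = jar_class()
        fresh.load(path, ignore_discard=True, ignore_expires=True)
        restored_values[jar_class.__name__] = {c.name: c.value for c in fresh}

print(first_lines)
print(restored_values)
```

Each jar class reads back only its own format, which is why the fresh jar is created from the same class that wrote the file.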

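__Extension:__ modify the program above so that cookies.txt is written in LWP format, then load the cookies from that file and use them to fetch the source of the Baidu home page.

# Save the cookies to a file in LWP format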
import http.cookiejar, urllib.request
filename = 'cookies.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)


# Load the cookies from the LWP-format file and reuse them for a new request
#import http.cookiejar,urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookies.txt',ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))

The results:
The cookies.txt file:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="84435D1D81BF61B164B5772BA3946998:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2022-10-29 13:00:06Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=84435D1D81BF61B1F6A8014EFE69789C; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-11-16 16:14:13Z"; version=0
Set-Cookie3: PSTM=1635512406; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-11-16 16:14:13Z"; version=0
Set-Cookie3: BD_NOT_HTTPS=1; path="/"; domain="www.baidu.com"; path_spec; expires="2021-10-29 13:05:06Z"; version=0

Printed output:

<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>