Learning Web Scraping: Using the Basic Libraries

Author: Raymone_

Category: Learning Web Scraping
Tags: web scraping, urllib, requests, regular expressions

Article link: https://blog.csdn.net/u012470887/article/details/98039643

1. Using urllib
2. Using Requests
3. Regular expressions
4. Scraping the Maoyan movie rankings

1. Using urllib

1.1 Sending requests

1.1.1 urlopen()

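The urllib.request module provides the most basic way to build HTTP requests. It can simulate the way a browser issues a request, and it also handles authentication, redirects, browser cookies, and more.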

# Fetch the Python home page
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))

<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    ... (the document is long, so the rest is omitted here)

Use type() to print the type of the response:

print(type(response))

<class 'http.client.HTTPResponse'>

The response is an HTTPResponse object. Its main methods are read(), readinto(), getheader(name), getheaders(), and fileno(); its main attributes are msg, version, status, reason, debuglevel, and closed.

response.status                   # status code of the response

response.getheaders()            # response headers

[('Server', 'nginx'),
 ('Content-Type', 'text/html; charset=utf-8'),
 ('X-Frame-Options', 'DENY'),
 ('Via', '1.1 vegur'),
 ('Via', '1.1 varnish'),
 ('Content-Length', '48402'),
 ('Accept-Ranges', 'bytes'),
 ('Date', 'Fri, 26 Jul 2019 02:17:41 GMT'),
 ('Via', '1.1 varnish'),
 ('Age', '1290'),
 ('Connection', 'close'),
 ('X-Served-By', 'cache-iad2132-IAD, cache-hnd18731-HND'),
 ('X-Cache', 'MISS, HIT'),
 ('X-Cache-Hits', '0, 1457'),
 ('X-Timer', 'S1564107461.004033,VS0,VE0'),
 ('Vary', 'Cookie'),
 ('Strict-Transport-Security', 'max-age=63072000; includeSubDomains')]

response.getheader('Server')      # value of the Server response header

'nginx'
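The same attributes and methods can be tried without Internet access. The following is a minimal sketch, assuming a throwaway local server: `HelloHandler` and its response body are made up for illustration, and the request goes through an opener with an empty `ProxyHandler` so that any proxy configured in the environment is not used for the loopback address.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

opener = build_opener(ProxyHandler({}))  # empty dict: never use a proxy
url = 'http://127.0.0.1:%d/' % server.server_port
with opener.open(url, timeout=5) as response:
    status = response.status
    reason = response.reason
    content_type = response.getheader('Content-Type')
    body = response.read().decode('utf-8')

server.shutdown()
server.server_close()

print(status, reason)
print(content_type)
print(body)
```

Reading the body inside the `with` block matters: once the response is closed, `read()` no longer returns the data.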

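The API of the urlopen function:

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

The data parameter
1. data is optional. If you pass it, it must be bytes, so use bytes() to convert the parameters to a byte string first.
2. If data is passed, the request method is no longer GET but POST.
3. The first argument of bytes() is a str; use urlencode() from the urllib.parse module to turn the parameter dict into that string. The second argument specifies the encoding.
4. In the result below, the submitted parameter appears in the form field, which shows that the request simulated a form submission and sent its data with POST.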

import urllib.parse
import urllib.request
data = bytes(urllib.parse.urlencode({
   'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)            # pass the data parameter
print(data)
print(response.read().decode('utf-8'))

b'word=hello'
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "json": null, 
  "origin": "119.123.40.149, 119.123.40.149", 
  "url": "https://httpbin.org/post"
}
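The switch from GET to POST can also be checked without sending anything. The sketch below builds two Request objects (the class is covered in the next subsection) and asks each one which method it would use; the form field is the same one as in the example above.

```python
from urllib import parse, request

# Encode the form fields the same way as above.
payload = parse.urlencode({'word': 'hello'}).encode('utf-8')

get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=payload)

print(payload)                 # b'word=hello'
print(get_req.get_method())    # GET
print(post_req.get_method())   # POST
```

Nothing is sent here; `get_method()` only reports what the request would do when opened.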

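The timeout parameter
1. timeout sets a time limit in seconds. If the request gets no response within that time, an exception is raised.
2. If it is omitted, the global default timeout is used.
3. It works for HTTP, HTTPS, and FTP requests.
4. A try/except statement lets you skip a page that times out: catch URLError and check whether its reason is a socket.timeout.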

import socket
import urllib.error
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

TIME OUT
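One caveat with the pattern above: a timeout that happens while connecting is wrapped in URLError, but a timeout that happens later, while the body is being read, can surface as a bare socket.timeout (an alias of TimeoutError since Python 3.10). A small helper, here called `is_timeout` purely for illustration, can cover both cases. The sketch exercises it with hand-built exceptions instead of real requests.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc is a timeout, raised directly or wrapped in URLError."""
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


# Hand-built exceptions standing in for what urlopen() might raise.
wrapped = urllib.error.URLError(socket.timeout('timed out'))
bare = socket.timeout('timed out')
refused = urllib.error.URLError(ConnectionRefusedError('connection refused'))

print(is_timeout(wrapped))   # True
print(is_timeout(bare))      # True
print(is_timeout(refused))   # False
```

In real code the helper would sit in an `except (urllib.error.URLError, socket.timeout) as e:` clause.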

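Other parameters
1. context: must be an ssl.SSLContext instance; it specifies the SSL settings.
2. cafile and capath: point to a file containing CA certificates and to a directory of CA certificate files, respectively. Both have been deprecated since Python 3.6 in favor of context.
3. cadefault: deprecated and ignored; it defaults to False.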
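As a minimal sketch of the context parameter, the code below creates a default SSL context with ssl.create_default_context() and wraps urlopen() in a helper. The helper name `fetch` and its default timeout are assumptions made for this example; the helper is defined but not called, so nothing is fetched.

```python
import ssl
import urllib.request

# A default context verifies certificates and host names.
context = ssl.create_default_context()


def fetch(url, timeout=10):
    """Hypothetical helper: open url using the verified TLS context above."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()


print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

A call such as `fetch('https://www.python.org')` would then use these settings for the TLS handshake.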

1.1.2 Request

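The constructor of the Request class:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

url: the URL to request; the only required parameter
data: same as the data parameter of urlopen
headers: the request headers, as a dict. They can be passed directly through the headers parameter when the request is built, or added later by calling add_header() on the request instance
origin_req_host: the host name or IP address of the requester
unverifiable: whether the request is unverifiable; defaults to False. An unverifiable request is one whose URL the user had no chance to approve, for example an image that is fetched automatically while a page loads
method: the request method (GET, POST, PUT, and so on)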

# Build a request with several arguments
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {'name': 'Germey'}  # not named "dict", which would shadow the built-in type
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"
  }, 
  "json": null, 
  "origin": "119.123.40.149, 119.123.40.149", 
  "url": "https://httpbin.org/post"
}

# Add headers with the add_header() method
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
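To see what add_header() actually stored, the request can be inspected before it is sent. This is a minimal offline sketch that reuses the URL and form field from the example above. Note that urllib stores header names capitalized, so the key becomes `User-agent`.

```python
from urllib import parse, request

data = parse.urlencode({'name': 'Germey'}).encode('utf-8')
req = request.Request('http://httpbin.org/post', data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')

print(req.get_method())               # POST
print(req.has_header('User-agent'))   # True
print(req.get_header('User-agent'))   # Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)
print(req.header_items())
```

Header names are case-insensitive on the wire, so the capitalization only matters when looking a header up on the Request object.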

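1.1.3 Advanced usage (cookie handling, proxy settings, etc.): Handler

What is a Handler
- A Handler can be thought of as a processor dedicated to one task, such as login authentication, cookies, or proxy settings.
- The BaseHandler class in urllib.request is the parent class of all other Handlers. It provides the most basic methods, such as default_open() and protocol_request().
- Examples of Handler subclasses: HTTPDefaultErrorHandler (handles HTTP error responses), HTTPRedirectHandler (handles redirects), HTTPCookieProcessor (handles cookies), ProxyHandler (configures proxies; when given no argument it reads the proxy settings from the environment), and HTTPBasicAuthHandler (manages authentication). HTTPPasswordMgr manages passwords; it is a helper class used by the authentication handlers, not a Handler itself.
- Another important class is OpenerDirector, referred to here as an Opener. urlopen() is in effect a ready-made Opener provided by urllib. An Opener has an open() method that returns the same kind of response as urlopen(), and Openers are built from Handlers.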
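How an Opener finds the right Handler can be sketched with a toy example. An Opener dispatches on method names of the form `<protocol>_open`, so a Handler that defines `echo_open` gets called for URLs with the scheme `echo`. Both the `echo` scheme and the `EchoHandler` class below are invented for this illustration and are not part of urllib.

```python
import io
from email.message import Message
from urllib.request import BaseHandler, build_opener
from urllib.response import addinfourl


class EchoHandler(BaseHandler):
    """Hypothetical handler for a made-up 'echo' URL scheme."""

    def echo_open(self, req):
        # The name follows the <protocol>_open convention, so the opener
        # calls this method for every URL whose scheme is 'echo'.
        body = b'handled ' + req.full_url.encode('utf-8')
        return addinfourl(io.BytesIO(body), Message(), req.full_url)


opener = build_opener(EchoHandler())
response = opener.open('echo://hello')
text = response.read().decode('utf-8')
print(text)  # handled echo://hello
```

The built-in Handlers work the same way: HTTPHandler, for instance, supplies `http_open`, which is why an Opener built with build_opener() can open ordinary http:// URLs.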
Authentication
- For pages that pop up a prompt asking for a username and password, use HTTPBasicAuthHandler:

from urllib.request import HTTPBasicAuthHandler, HTTPPasswordMgrWithDefaultRealm, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000'

p = HTTPPasswordMgrWithDefaultRealm()          # first create an HTTPPasswordMgrWithDefaultRealm object
p.add_password(None, url, username, password)  # register the username and password with add_password()
auth_handler = HTTPBasicAuthHandler(p)         # create an HTTPBasicAuthHandler, passing it the password manager
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
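The example above needs a server running on localhost:5000. The credential lookup itself can be checked offline, which is a quick way to confirm that a password was registered for the right URL. A minimal sketch, using the same placeholder credentials as above and an arbitrary realm name:

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

url = 'http://localhost:5000'
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, 'username', 'password')  # None means: any realm

# Any realm and any path under the registered URL match.
match = p.find_user_password('Any Realm', url + '/admin')
# A different host does not match.
miss = p.find_user_password('Any Realm', 'http://example.com/')

print(match)  # ('username', 'password')
print(miss)   # (None, None)
```

Passing None as the realm is what makes the "WithDefaultRealm" variant convenient: the credentials apply regardless of the realm the server announces.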

Proxies
- Use ProxyHandler to add a proxy:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})  # ProxyHandler takes a dict: the keys are protocol names, the values are proxy URLs
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
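When ProxyHandler is created without an argument, it picks up proxies from environment variables such as `http_proxy`; passing an empty dict turns proxying off. The sketch below shows both variants and uses getproxies() to display what urllib detects. It sets `http_proxy` for the current process only, reusing the example address from above, and opens no connection.

```python
import os
from urllib.request import ProxyHandler, build_opener, getproxies

# Assumption for this demo: pretend the environment configures a proxy.
proxy_url = 'http://127.0.0.1:9743'
os.environ['http_proxy'] = proxy_url

detected = getproxies()          # what urllib reads from the environment
print(detected.get('http'))      # http://127.0.0.1:9743

env_opener = build_opener(ProxyHandler())       # uses the detected proxies
direct_opener = build_opener(ProxyHandler({}))  # ignores them: direct connection
```

`direct_opener` is the one to use when a request must bypass whatever proxy the environment defines.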

Cookies
- Get a site's cookies
- Save the cookies in the cookie file format used by Mozilla browsers
- Load a cookie file and use it

# Get a site's cookies
import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()           # create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name+'='+item.value)

BIDUPSID=F930DC7B2787707F675C4D0962FBE525
PSTM=1564122900
BD_NOT_HTTPS=1

# Save the cookies in the Mozilla browser cookie file format
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)         # use LWPCookieJar instead to save in LWP format
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Load the cookie file and use it
cookie = http.cookiejar.MozillaCookieJar(filename)
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))

<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
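The save and load steps can be tried without contacting any site. The sketch below builds a session cookie by hand, which real code rarely needs to do since cookies normally come from responses, then writes it to a temporary file and reads it back. The helper `make_cookie`, the cookie name and value, and the domain are all made up for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (illustration only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, 'cookies.txt')

    jar = http.cookiejar.MozillaCookieJar(filename)
    jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
    # Session cookies are marked "discard", so ignore_discard=True is
    # needed for them to be written at all.
    jar.save(ignore_discard=True, ignore_expires=True)

    loaded = http.cookiejar.MozillaCookieJar()
    loaded.load(filename, ignore_discard=True, ignore_expires=True)
    restored = {c.name: c.value for c in loaded}

print(restored)  # {'session': 'abc123'}
```

The same `ignore_discard` and `ignore_expires` flags must be passed to load(), otherwise session cookies and expired cookies are skipped when the file is read.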

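1.2 Handling exceptions

The error module of urllib defines the exceptions raised by the request module. When something goes wrong, the request module raises one of the exceptions defined in the error module.

1.2.1 URLError

The URLError class inherits from OSError and is the base class of the exceptions in the error module. It has one attribute, reason, which holds the cause of the error: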

from urllib import request, error
try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)  # reason holds the cause of the failure
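URLError and its reason attribute can also be observed without a network connection. A minimal sketch: opening a file:// URL that points to a missing file makes urllib raise URLError, with the underlying operating system error stored in reason. The path below is hypothetical and is assumed not to exist.

```python
from urllib import error, request

# Hypothetical path, assumed not to exist on this machine.
missing_url = 'file:///no/such/dir/urlerror-demo.txt'

caught = None
try:
    request.urlopen(missing_url)
except error.URLError as e:
    caught = e
    print(type(e.reason).__name__)  # the wrapped OS-level error type
    print(e.reason)

# URLError is itself a subclass of OSError.
print(issubclass(error.URLError, OSError))  # True
```

Because URLError derives from OSError, an `except OSError` clause would catch it as well, but catching URLError keeps the intent explicit.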
