Essential for Python Crawlers: The urllib Library in Detail


卷儿哥

Category: Python. Tags: python, http, cookie, crawler

Article link: https://blog.csdn.net/DahlinSky/article/details/104454971

The urllib Library in Detail

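Writing crawlers is one of the more basic uses of Python. Crawling boils down to a few steps: download the HTML or other data, extract what we need from it with various string-parsing techniques (data cleaning included), and finally save the data we worked so hard to extract. That data can then feed analysis, prediction, and so on. In short, crawling is the foundation: without data, even the cleverest cook cannot make a meal without rice.
The first thing to get to know is urllib, which ships with Python's standard library and is one of the essential libraries for fetching HTML from the network. Most people use a third-party library instead, because the official one is awkward to use and feels almost hostile by design. I will cover the well-known third-party library requests in the next chapter.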

1. Importing the libraries

The first step is of course to import the libraries we need. Their names mostly tell you what they do.

import socket
from http import cookiejar
from http.cookiejar import CookieJar
from urllib import request, parse, error
from urllib.parse import urlparse, urlunparse, urlsplit, urlunsplit, urljoin, urlencode, parse_qs, parse_qsl, quote, \
unquote
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener, ProxyHandler, urlopen
from urllib.robotparser import RobotFileParser

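2. Basic usage: sending a request

All request examples use http://httpbin.org for testing. You can also run the service on your own machine; there are plenty of tutorials, for example my post on setting up an httpbin service with Docker.

# Send the request
request_url = "http://httpbin.org"
response = request.urlopen(request_url)
# Read the response body and decode it to a string
html = response.read().decode('utf-8')
# Get the response status code
response_status = response.status
# Get all response headers
response_headers = response.getheaders()
# Get the Server field from the response headers
response_status_Server = response.getheader('Server')
print(response_status_Server)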
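
Hard-coding 'utf-8' works for httpbin, but other sites may declare a different charset. Here is a minimal sketch, not part of the original script, of a helper that decodes the body with the charset named in the response's Content-Type header and falls back to UTF-8. The name fetch_text and its parameters are my own, and the data: URL is there only so the example runs without network access.

```python
from urllib.request import urlopen


def fetch_text(url, timeout=5, default_charset='utf-8'):
    """Fetch a URL and decode the body with the charset the response declares."""
    # The with-block closes the response even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# urllib handles data: URLs itself, so this line needs no network.
print(fetch_text('data:text/plain;charset=utf-8,hello'))
```

The same helper should work for http:// URLs such as http://httpbin.org, since the HTTP response headers expose the same get_content_charset() method.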

3. Passing parameters

# The data parameter
request_url = "http://httpbin.org/post"
# urlencode the form fields, then convert the string to bytes
data = bytes(parse.urlencode({'word': 'hello'}), encoding='utf8')
response = request.urlopen(request_url, data=data, timeout=2)
html = response.read()
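
To see what the data argument changes without touching the network, the sketch below builds two Request objects and inspects them. The /get URL is my own addition for contrast; nothing is actually sent.

```python
from urllib import parse, request

fields = {'word': 'hello'}
# urlencode returns a str; urlopen and Request expect bytes for data.
data = parse.urlencode(fields).encode('utf-8')

get_req = request.Request('http://httpbin.org/get')
post_req = request.Request('http://httpbin.org/post', data=data)

print(data)                    # b'word=hello'
print(get_req.get_method())    # GET
print(post_req.get_method())   # POST, because data is not None
```

In other words, supplying data is what turns the request into a POST when no method is given explicitly.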

4. Sending a complete request

A standard request script, with HTTP request headers as well as parameters and an explicit request method.

# A complete request
request_url = "https://python.org"
# Add HTTP request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36',
}
# Parameters to send
param = {
    'name': 'Dahlin'
}
data = bytes(parse.urlencode(param), encoding='utf8')
# POST request
request_R = request.Request(url=request_url, data=data, headers=headers, method='POST')
request_R.add_header('Accept-Language', 'zh-CN,zh;q=0.9')
response = request.urlopen(request_R)
html = response.read().decode('utf-8')
print(html)
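
One detail that is easy to trip over: Request stores header names in capitalized form, and get_header() does an exact-match lookup. A small offline sketch follows; the agent string demo-agent/1.0 is made up for illustration.

```python
from urllib import request

req = request.Request(
    url='http://httpbin.org/post',
    data=b'name=Dahlin',
    headers={'User-Agent': 'demo-agent/1.0'},  # hypothetical agent string
    method='POST',
)
req.add_header('Accept-Language', 'zh-CN,zh;q=0.9')

# Names are stored via str.capitalize(), e.g. 'User-agent'.
print(req.header_items())
print(req.get_header('User-agent'))   # demo-agent/1.0
print(req.get_header('User-Agent'))   # None: the lookup is case-sensitive
```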

5. Authenticated requests

Some URLs require you to supply user credentials before they respond.

# Authentication
username = 'username'
password = 'password'
url = 'http://localhost:500/'

p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

result = opener.open(url)
html = result.read().decode('utf-8')
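
The password manager can be inspected on its own, which helps when a request is not being authenticated as expected. This is a rough sketch that needs no server; the realm name and the example.com URL are placeholders of mine.

```python
from urllib.request import HTTPPasswordMgrWithDefaultRealm

p = HTTPPasswordMgrWithDefaultRealm()
# Realm None means "use these credentials for any realm under this URL".
p.add_password(None, 'http://localhost:500/', 'username', 'password')

inside = p.find_user_password('Some Realm', 'http://localhost:500/admin/index.html')
outside = p.find_user_password('Some Realm', 'http://example.com/')

print(inside)    # ('username', 'password')
print(outside)   # (None, None)
```

Credentials registered for a URL prefix apply to every path below it, and never to other hosts.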

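6. Requests through a proxy

If you crawl too fast, the server may block your IP, much like putting it on a blacklist. Professional crawler scripts therefore usually send requests through IPs drawn from a proxy pool. For personal study, crawling slowly is enough to stay out of trouble.

# Plain proxy: one entry per URL scheme
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})

# Private proxy that requires credentials
# authproxy_handler = ProxyHandler({"http": "username:password@61.135.217.7:80"})
opener = build_opener(proxy_handler)
# request_url and headers are the ones defined in section 4
req = request.Request(request_url, headers=headers)
response = opener.open(req)
html = response.read().decode('utf-8')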

7. Using cookies

# Iterate over the cookies the server set
cookies = CookieJar()
handler = request.HTTPCookieProcessor(cookies)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookies:
    print(item.name+"="+item.value)

# Save cookies to a file
cookies_file = 'cookies.txt'
# cookie = cookiejar.MozillaCookieJar(cookies_file)
cookie = cookiejar.LWPCookieJar(cookies_file)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# Load cookies from the file and use them
cookie = cookiejar.LWPCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
html = response.read().decode('utf8')
print(html)
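
Why pass ignore_discard=True? Session cookies are marked as "discard", and by default they are skipped on both save and load. The sketch below shows this with a hand-built cookie and a temporary file, so it runs offline. The helper names, the cookie values, and the example.com domain are all assumptions made for illustration.

```python
import os
import tempfile
from http.cookiejar import Cookie, LWPCookieJar


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (no expiry, discard=True)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(jar, path, keep_session):
    """Save the jar, load it into a fresh jar, and return what survived."""
    jar.save(path, ignore_discard=keep_session, ignore_expires=keep_session)
    loaded = LWPCookieJar()
    loaded.load(path, ignore_discard=keep_session, ignore_expires=keep_session)
    return {c.name: c.value for c in loaded}


path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
jar = LWPCookieJar()
jar.set_cookie(make_session_cookie('session', 'abc123', 'example.com'))

print(round_trip(jar, path, keep_session=False))   # {}
print(round_trip(jar, path, keep_session=True))    # {'session': 'abc123'}
```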

8. Parsing and building URLs

# Parse a URL into six components
url = 'http://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url)
print(type(result), result)
print(result.scheme, result.netloc, result[0], result[1], sep='\n')

# Build a URL from six components
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
# http://www.baidu.com/index.html;user?a=6#comment

# Split a URL into five components (params stay inside path)
print(urlsplit(url))
# SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')

# Build a URL from five components
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
# http://www.baidu.com/index.html?a=6#comment

# Join a base URL with a relative reference
print(urljoin('http://www.baidu.com', '?category=2'))
# http://www.baidu.com?category=2

# Encode a dict as a query string
params = {
    'name': 'dahlin',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
# http://www.baidu.com?name=dahlin&age=22

# Decode a query string into a dict or a list of pairs
query = 'name=dahlin&age=22'
print(parse_qs(query))
# {'name': ['dahlin'], 'age': ['22']}
print(parse_qsl(query))
# [('name', 'dahlin'), ('age', '22')]

# Percent-encode and decode non-ASCII text
keyword = '壁纸'
url = 'https://www.baidu.com/s?wd='+quote(keyword)
print(url)
# https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
print(unquote(url))
# https://www.baidu.com/s?wd=壁纸
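
These functions combine nicely. As an illustration, here is a small helper of my own (set_query_params is not part of urllib) that adds or overrides query parameters on an existing URL by splitting it, editing the query, and putting it back together.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_params(url, **params):
    """Return url with the given query parameters added or replaced."""
    parts = urlsplit(url)
    # dict() keeps only the last value of a repeated key; fine for a sketch.
    query = dict(parse_qsl(parts.query))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


print(set_query_params('http://www.baidu.com/s?wd=python&pn=10', pn=20))
# http://www.baidu.com/s?wd=python&pn=20
print(set_query_params('https://www.baidu.com/s', wd='壁纸'))
# https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8
```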

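9. The robots protocol

The robots protocol, also known as robots.txt, is an ASCII-encoded text file stored in a site's root directory. It usually tells crawlers which URLs may be fetched and which may not. It is not an enforced specification, only an established convention, something like an ethical constraint.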

"""
robots.txt
User-agent:*
Disallow:/
Allow: /public/
"""

# Check against the robots rules whether a URL may be fetched
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
# False
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&type=collections'))
# False
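
RobotFileParser can also be fed rule lines directly, which makes it easy to experiment offline. One thing I noticed when trying the sample rules shown earlier: as far as I can tell, the standard-library parser applies the first rule that matches, so an Allow line placed after "Disallow: /" never takes effect. A sketch, with example.com as a placeholder domain and allowed as a helper name of my own:

```python
from urllib.robotparser import RobotFileParser


def allowed(lines, url, agent='*'):
    """Parse robots.txt lines and report whether agent may fetch url."""
    rp = RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch(agent, url)


disallow_first = ['User-agent: *', 'Disallow: /', 'Allow: /public/']
allow_first = ['User-agent: *', 'Allow: /public/', 'Disallow: /']
public_url = 'http://example.com/public/index.html'

print(allowed(disallow_first, public_url))                    # False
print(allowed(allow_first, public_url))                       # True
print(allowed(allow_first, 'http://example.com/private.html'))  # False
```

Other crawlers may resolve conflicting rules differently, so treat this as a description of Python's parser only.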

# Parse robots.txt from downloaded text, handling request errors
try:
    rp = RobotFileParser()
    rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
    # Check against the robots rules whether a URL may be fetched
    print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
    print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&type=collections'))
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it has to be caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Request timed out:', e.reason)
    else:
        print(e.reason)
except Exception as e:
    print(e)
else:
    print('Request Successfully')

# Output recorded in the original post (the server answered 403 Forbidden):
# Forbidden
# 403
# Server: Tengine
# Date: Sun, 23 Feb 2020 03:22:51 GMT
# Content-Type: text/html
# Content-Length: 593
# Connection: close
# Vary: Accept-Encoding
# Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
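
The order of the except clauses matters because HTTPError is a subclass of URLError. The following sketch puts the same classification into a function and exercises it with hand-built exceptions, so no request is made. describe_error is a name I made up, and the exceptions are constructed manually purely for demonstration.

```python
import socket
from urllib import error


def describe_error(exc):
    """Return a short label for an exception raised while fetching a URL."""
    # Test HTTPError before URLError: every HTTPError is also a URLError.
    if isinstance(exc, error.HTTPError):
        return 'HTTP {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return 'timeout'
        return 'URL error: {}'.format(exc.reason)
    return 'other: {}'.format(exc)


# Hand-built exceptions standing in for real failures.
forbidden = error.HTTPError('http://www.jianshu.com/robots.txt', 403, 'Forbidden', {}, None)
timed_out = error.URLError(socket.timeout('timed out'))

print(describe_error(forbidden))   # HTTP 403: Forbidden
print(describe_error(timed_out))   # timeout
```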
