Web Crawlers: Using the Basic Libraries (Part 1)
The urllib Library
urllib contains 4 modules: request, error, parse, and robotparser
- request: the most basic HTTP request module, used to simulate sending requests
- error: the exception-handling module; after catching an exception we can retry or take other action
- parse: a utility module that provides many URL-handling methods
- robotparser: parses robots.txt files to determine whether a site may be crawled
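As a quick taste of the last module, robotparser can be exercised without any network access by feeding rule lines to parse() directly; the example.com URLs below are placeholders:

```python
import urllib.robotparser

# RobotFileParser normally reads robots.txt via set_url()/read(),
# but parse() also accepts the rule lines directly, which makes
# an offline demonstration possible.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(rp.can_fetch('*', 'http://example.com/page.html'))          # True
print(rp.can_fetch('*', 'http://example.com/private/page.html'))  # False
```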
The request Module
The urllib.request module simulates a browser initiating a request; it can also handle authorization, redirects, browser cookies, and more
urlopen()
Performs the simplest kind of page fetch: a GET request
# Basic usage
import urllib.request
response = urllib.request.urlopen('https://www.csdn.net/')
print(response.status) # status holds the response status code
print(response.getheaders()) # getheaders() returns the response headers
print(response.getheader('Server')) # getheader() fetches the value of the Server response header
# Output
200
[('Server', 'openresty'), ('Date', 'Fri, 27 Mar 2020 16:37:45 GMT'), ('Content-Type', 'text/html; charset=UTF-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'close'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'uuid_tt_dd=10_30735233500-1585327064557-153579; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Set-Cookie', 'dc_session_id=10_1585327064557.148005; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Vary', 'Accept-Encoding'), ('Strict-Transport-Security', 'max-age=31536000')]
openresty
# The urlopen() API:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
# The data parameter
# If data is supplied it must be a byte stream, i.e. of type bytes; use the bytes() method to convert it if necessary
# When data is passed, the request method becomes POST
# Example
import urllib.request
import urllib.parse
# urllib.parse.urlencode() serializes a parameter dict into a string; the second argument of bytes() specifies the encoding
data = bytes(urllib.parse.urlencode({'world':'hello'}),encoding='utf8')
response = urllib.request.urlopen('https://httpbin.org/post',data=data)
print(response.read())
# Output
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "world": "hello"\n }, \n "headers": {\n "Accept-Encoding": "identity", \n "Content-Length": "11", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "Python-urllib/3.7", \n "X-Amzn-Trace-Id": "Root=1-5e7e2f16-649bbc200c1d94606b3ffa60"\n }, \n "json": null, \n "origin": "183.50.62.150", \n "url": "https://httpbin.org/post"\n}\n'
# The timeout parameter
# Sets a timeout in seconds; if no response arrives within the limit, an exception is raised
# Example
import urllib.request
response = urllib.request.urlopen('https://httpbin.org/post',timeout=1)
print(response.read())
# Output
# Only the error message is shown here
urllib.error.URLError: <urlopen error _ssl.c:1029: The handshake operation timed out>
# A try-except block lets us skip a page that takes too long to respond
# Example
import urllib.request
import urllib.error
import socket
try:
    response = urllib.request.urlopen('https://httpbin.org/post',timeout=0.1)
except urllib.error.URLError as e:
    # isinstance() checks whether an object is of a given type
    # Here we check whether e.reason is a socket.timeout
    if isinstance(e.reason,socket.timeout):
        print("TIME OUT!")
# Output
TIME OUT!
# The context parameter specifies SSL settings and must be of type ssl.SSLContext
# cafile and capath specify the CA certificate file and the CA certificate directory, respectively
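A minimal offline sketch of the context parameter (the urlopen() call itself is shown commented out, since it needs network access):

```python
import ssl
import urllib.request

# ssl.create_default_context() builds an SSLContext with the system's
# trusted CA certificates loaded; cafile/capath can alternatively be
# passed to create_default_context() itself.
context = ssl.create_default_context()
print(isinstance(context, ssl.SSLContext))  # True
print(context.check_hostname)               # True: hostname verification is on

# With network access it would then be used as:
# response = urllib.request.urlopen('https://www.python.org', context=context)
```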
Request()
Used to add Headers and other information to a request
# Basic usage
import urllib.request
request = urllib.request.Request("https://www.csdn.net/")
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
# Output
The page source of https://www.csdn.net/ (omitted here)
# The Request API
class urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)
# url and data work the same as in urlopen()
# headers is a dict of request headers; pass it via the headers argument when constructing the Request, or add entries afterwards with add_header()
# origin_req_host is the host name or IP address of the requester
# unverifiable indicates whether the request is unverifiable; default False
# method indicates the HTTP method the request uses, such as GET or POST
# Example
from urllib import request, parse
url='http://httpbin.org/post'
headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)','Host':'httpbin.org'} # specify User-Agent and Host
dict = {'name':'Germey'}
data = bytes(parse.urlencode(dict),encoding='utf-8')
req = request.Request(url=url,data=data,headers=headers,method='POST') # headers could also be added with add_header()
response = request.urlopen(req)
print(response.read().decode('utf-8'))
# Output
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "X-Amzn-Trace-Id": "Root=1-5e7ebda0-b070af1f8f50d9efbc3b7965"
  },
  "json": null,
  "origin": "183.50.61.7",
  "url": "http://httpbin.org/post"
}
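The add_header() alternative mentioned above can be sketched as follows; the Request object can be inspected with get_method() and get_header() before it is ever sent, so no network access is needed:

```python
from urllib import request, parse

# Build the same POST request, but attach the header with add_header()
# instead of the headers= argument.
url = 'http://httpbin.org/post'
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
print(req.get_method())              # POST
# add_header() stores header keys capitalized, hence 'User-agent' here:
print(req.get_header('User-agent'))  # Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)
```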
Handler
The BaseHandler class in urllib.request is the parent class of all other Handlers
Handlers can be thought of as processors for particular tasks, such as login authentication, cookies, and proxy settings
- Opener
An Opener provides an open() method whose return type is similar to urlopen()'s; it is built from Handlers
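A small sketch of the relationship: build_opener() chains Handlers into an OpenerDirector, and install_opener() makes that Opener the global one that urlopen() uses behind the scenes:

```python
import urllib.request

# With no arguments, build_opener() assembles the default handler chain
# (HTTPHandler, HTTPRedirectHandler, HTTPErrorProcessor, ...).
opener = urllib.request.build_opener()
print(isinstance(opener, urllib.request.OpenerDirector))  # True

# install_opener() registers it globally, so later urllib.request.urlopen()
# calls go through this opener and its handlers.
urllib.request.install_opener(opener)
```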
- Examples:
- Authentication
# Example
# Requesting pages that require a username and password
# HTTPBasicAuthHandler handles this
from urllib.request import HTTPPasswordMgrWithDefaultRealm,HTTPBasicAuthHandler,build_opener
from urllib.error import URLError
username = 'username'
password = 'password'
url = 'https://www.zhihu.com/login/phone_num'
p = HTTPPasswordMgrWithDefaultRealm() # instantiate an HTTPPasswordMgrWithDefaultRealm object
p.add_password(None,url,username,password) # add the username and password with add_password()
auth_handler = HTTPBasicAuthHandler(p) # build a Handler that performs authentication
opener = build_opener(auth_handler) # build an Opener from the Handler
try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
# Output
<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">知乎 - 有问题,上知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="description" property="og:description" content="有问题,上知乎。知乎,可信赖的问答社区,以让每个人高效获得可信赖的解答为使命。知乎凭借认真、专业和友善的社区氛围,结构化、易获得的优质内容,基于问答的内容生产方式和独特的社区机制,吸引、聚集了各行各业中大量的亲历者、内行人、领域专家、领域爱好者,将高质量的内容透过人的节点来成规模地生产和分享。用户通过问答等交流方式建立信任和连接,打造和提升个人影响力,并发现、获得新机会。"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png" sizes="152x152"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-120.b3e6278d.png" sizes="120x120"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-76.7a750095.png" sizes="76x76"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-60.a4a761d4.png" sizes="60x60"/><link rel="shortcut icon" type="image/x-icon" href="https://static.zhihu.com/static/favicon.ico"/><link rel="search" type="application/opensearchdescription+xml" href="https://static.zhihu.com/static/search.xml" title="知乎"/><link rel="dns-prefetch" href="//static.zhimg.com"/><link rel="dns-prefetch" href="//pic1.zhimg.com"/><link rel="dns-prefetch" href="//pic2.zhimg.com"/><link rel="dns-prefetch" href="//pic3.zhimg.com"/><link rel="dns-prefetch" href="//pic4.zhimg.com"/><style>
.u-safeAreaInset-top {
height: constant(safe-area-inset-top) !important;
height: env(safe-area-inset-top) !important;
}
.u-safeAreaInset-bottom {
height: constant(safe-area-inset-bottom) !important;
height: env(safe-area-inset-bottom) !important;
}
</style><link href="https://static.zhihu.com/heifetz/main.app.86dc12ecb6d4cae00fdc.css" rel="stylesheet"/><link href="https://static.zhihu.com/heifetz/main.sign-page.069b34d1856ce2eeb081.css" rel="sty
- Proxies
# Adding proxies
from urllib.error import URLError
from urllib.request import ProxyHandler,build_opener
# Assumes a local proxy running on port 9743
# ProxyHandler takes a dict whose keys are protocol names and whose values are proxy URLs; multiple proxies can be added
proxy_handler = ProxyHandler({
    'http':'http://127.0.0.1:9743',
    'https':'https://127.0.0.1:9743'
})
# Build the Opener
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
# Output
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
- Cookies
# Handling cookies also requires a Handler
import http.cookiejar,urllib.request
# Declare a CookieJar object
cookie = http.cookiejar.CookieJar()
# Build a Handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+'='+item.value)
# Output
# The name and value of each cookie
BAIDUID=9CE7D1C738C792DCB4A682912DD2FD56:FG=1
BIDUPSID=9CE7D1C738C792DCE682EBB4D11A996D
H_PS_PSSID=30971_1446_31123_21120_30823
PSTM=1585367396
BDSVRTM=0
BD_HOME=1
# We can also write the cookies out to a file
import http.cookiejar,urllib.request
filename = 'cookie.txt'
# MozillaCookieJar handles the interaction between cookies and files, such as reading and saving cookies
# It saves cookies in the Mozilla-browser cookie format
# Here it generates a cookie.txt file
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
cookie_file = open(filename)
for lines in cookie_file.readlines():
    print(lines)
cookie_file.close()
# Output
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.
.baidu.com TRUE / FALSE 1616904468 BAIDUID 1E763C2A7DEA6598F459E3CC2780EECE:FG=1
.baidu.com TRUE / FALSE 3732852115 BIDUPSID 1E763C2A7DEA659809BBBC0BF9CDC747
.baidu.com TRUE / FALSE H_PS_PSSID 30973_1458_31045_21113_31051_30824_26350_22160
.baidu.com TRUE / FALSE 3732852115 PSTM 1585368468
www.baidu.com FALSE / FALSE BDSVRTM 0
www.baidu.com FALSE / FALSE BD_HOME 1
# LWPCookieJar can likewise read and save cookies, storing them in the libwww-perl (LWP) format
import http.cookiejar,urllib.request
filename = 'cookie1.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
cookie_file = open(filename)
for lines in cookie_file.readlines():
    print(lines)
cookie_file.close()
# Output
#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="335C106DE8D30F26AE2B446B5EA0C5EB:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2021-03-28 04:12:25Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=335C106DE8D30F26276CDC5076B8F1AB; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2088-04-15 07:26:32Z"; version=0
Set-Cookie3: H_PS_PSSID=30962_1457_31122_21126_30824; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1585368744; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2088-04-15 07:26:32Z"; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=1; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
# Making use of a saved cookie file
import http.cookiejar,urllib.request
filename = 'cookie1.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)
# load() reads the local file and recovers the cookies
cookie.load('cookie1.txt',ignore_expires=True,ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
# This prints the source of the Baidu homepage
print(response.read().decode('utf-8'))
# This approach can be used to configure the vast majority of requests
# Output
<html>
<head>
<script>
location.replace(location.href.replace("https://","http://"));
</script>
</head>
<body>
<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
The error Module
URLError
The URLError class comes from urllib's error module; it inherits from OSError and is the base class of the module's exceptions
It can handle exceptions raised by the request module
# Example
from urllib import request,error
try:
    response = request.urlopen('https://joker.com/index.html')
except error.URLError as e:
    print(e.reason)
# Output
Not Found
HTTPError
A subclass of URLError, used to handle HTTP request errors
It has three attributes: code (the HTTP status code), reason (the cause of the error), and headers (the response headers)
# Example
from urllib import request,error
try:
    response = request.urlopen('https://joker.com/index.html')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers,sep='\n')
# Output
Not Found
404
Server: nginx
Date: Sat, 28 Mar 2020 07:37:55 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: Joker_Session=sjpf9uffn808s4lqqg1n5i44s8; path=/; secure; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Set-Cookie: ZLBSYS=WEB3; expires=Sat, 28-Mar-2020 07:47:55 GMT; path=/; secure; HttpOnly
# Because URLError is the parent class of HTTPError, it is best to catch the subclass error first and read its code, reason, headers, and so on;
# if the error is not an HTTPError, catch the parent URLError and print its reason; finally, handle the success case with else
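That ordering can be sketched as follows; the .invalid top-level domain is reserved and never resolves, so this example deterministically takes the URLError branch:

```python
from urllib import request, error

try:
    response = request.urlopen('http://nonexistent.invalid/index.html')
except error.HTTPError as e:
    # Subclass first: only HTTPError carries code and headers
    print('HTTPError:', e.code, e.reason)
except error.URLError as e:
    # Anything else that went wrong at the URL/network level
    print('URLError:', e.reason)
else:
    # Runs only when no exception was raised
    print('Request succeeded:', response.status)
```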
The parse Module
urlparse()
Identifies a URL and splits it into segments
# Example
from urllib.parse import urlparse
# urlparse() splits a URL into 6 parts:
# before '://' is scheme (the protocol); up to the first '/' is netloc (the domain); after the domain comes path (the access path)
# after ';' is params (parameters); after '?' is query (query conditions, typically used in GET-style URLs); after '#' is fragment (an anchor that jumps straight to a position inside the page)
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result)
# Output
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
# The urllib.parse.urlparse() API
urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)
# urlstring: the URL to be parsed
# scheme: the default protocol (such as http or https)
# If a link carries no protocol information, scheme is used as the default protocol
# Example
from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(type(result),result)
# Output
<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
# If the URL itself carries a protocol, its own scheme is returned instead
# allow_fragments controls whether the fragment is ignored; if it is False, the fragment part is ignored and parsed as part of the path, params, or query
# Example
from urllib.parse import urlparse
result = urlparse('https://www.baidu.com/index.html#comment',allow_fragments=False)
print(result)
# Output
# When the URL contains neither params nor query, the fragment is parsed as part of the path
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
# ParseResult is actually a named tuple, so its contents can be accessed by index or by attribute name
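For example, both access styles return the same values:

```python
from urllib.parse import urlparse

# ParseResult is a named tuple: index 0 is scheme, 1 is netloc,
# 2 is path, 3 is params, 4 is query, 5 is fragment.
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result.scheme, result[0])  # http http
print(result.netloc, result[1])  # www.baidu.com www.baidu.com
```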
urlunparse()
urlunparse() is the counterpart of urlparse(); it accepts an iterable whose length must be exactly 6; anything shorter or longer raises an exception
# Example
from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))
# Output
http://www.baidu.com/index.html;user?a=6#comment
urlsplit()
Similar to urlparse(), but it does not parse the params part separately and returns only 5 results
# Example
from urllib.parse import urlsplit
result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)
# Output
SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
# SplitResult is also a named tuple, and its contents can likewise be accessed by index or attribute name
urlunsplit()
Similar to urlunparse(), except that the iterable passed in has length 5: the params part is not included
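A quick sketch of urlunsplit() with a 5-element iterable:

```python
from urllib.parse import urlunsplit

# Exactly 5 parts: scheme, netloc, path, query, fragment (no params)
data = ['https', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))  # https://www.baidu.com/index.html?a=6#comment
```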
urljoin()
With urljoin(), we supply a base_url (base link) as the first argument and a new link as the second;
the method analyzes the scheme, netloc, and path of base_url and fills in whatever the new link is missing
# Example
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html','https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com','?category=2#comment'))
print(urljoin('www.baidu.com#comment','?category=2'))
# Output
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
urlencode()
Constructs GET request parameters
# Example
from urllib.parse import urlencode
params = {
'name': 'joker',
'age': '30'
}
base_url = 'http://www.baidu.com?'
# urlencode() serializes params into GET request parameters
url = base_url + urlencode(params)
print(url)
# Output
http://www.baidu.com?name=joker&age=30
parse_qs()
The opposite of urlencode(): it deserializes a GET query string back into a dict
# Example
from urllib.parse import parse_qs
query = 'name=joker&age=18'
# parse_qs() deserializes query into a dict
print(parse_qs(query))
# Output
{'name': ['joker'], 'age': ['18']}
parse_qsl()
Similar to parse_qs(), but it converts the parameters into a list of tuples
# Example
from urllib.parse import parse_qsl
query = 'name=joker&age=18'
# parse_qsl() deserializes query into a list of tuples
print(parse_qsl(query))
# Output
[('name', 'joker'), ('age', '18')]
quote()
Converts content into the URL-encoded (percent-encoded) format
When a URL carries Chinese parameters it can end up garbled; quote() converts the Chinese characters into URL encoding
# Example
from urllib.parse import quote
keyword = '博客'
url = 'https://www.baidu.com/s?wd='+quote(keyword)
print(url)
# Output
https://www.baidu.com/s?wd=%E5%8D%9A%E5%AE%A2
unquote()
Decodes a URL-encoded string back to its original form
# Example
from urllib.parse import unquote
url = 'https://www.baidu.com/s?wd=%E5%8D%9A%E5%AE%A2'
print(unquote(url))
# Output
https://www.baidu.com/s?wd=博客