Web Crawler Basics: Core Libraries (Part 1)

The urllib Library

urllib contains four modules: request, error, parse, and robotparser.

  • request: the most basic HTTP request module, used to simulate sending requests
  • error: the exception-handling module; catch an exception, then retry or take other action
  • parse: a utility module providing many URL-processing methods
  • robotparser: parses robots.txt files to determine whether a site may be crawled (see the sketch below)
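The robotparser module is not covered further in this part; as a minimal sketch (the target site is just an example), it can be used like this:

# robotparser sketch: check robots.txt before crawling
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.csdn.net/robots.txt')  # point at the site's robots.txt
rp.read()  # download and parse it
# can_fetch(useragent, url) tells us whether that agent may crawl the URL
print(rp.can_fetch('*', 'https://www.csdn.net/'))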
The request Module

The urllib.request module simulates the way a browser initiates a request; it can also handle authorization, redirects, browser cookies, and more.

urlopen()

Performs the simplest kind of page fetch: a GET request.

# Basic usage
import urllib.request

response = urllib.request.urlopen('https://www.csdn.net/')
print(response.status)  # status is the response status code
print(response.getheaders())  # getheaders() returns all response headers
print(response.getheader('Server'))  # getheader() returns the value of the Server header

# Output
200
[('Server', 'openresty'), ('Date', 'Fri, 27 Mar 2020 16:37:45 GMT'), ('Content-Type', 'text/html; charset=UTF-8'), ('Transfer-Encoding', 'chunked'), ('Connection', 'close'), ('Vary', 'Accept-Encoding'), ('Set-Cookie', 'uuid_tt_dd=10_30735233500-1585327064557-153579; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Set-Cookie', 'dc_session_id=10_1585327064557.148005; Expires=Thu, 01 Jan 2025 00:00:00 GMT; Path=/; Domain=.csdn.net;'), ('Vary', 'Accept-Encoding'), ('Strict-Transport-Security', 'max-age=31536000')]
openresty


# The urlopen() API:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

# The data parameter
# If data is supplied it must be a byte stream, i.e. of type bytes; other content must be converted with the bytes() method
# When data is passed, the request method becomes POST
# Example
import urllib.request
import urllib.parse

# urllib.parse.urlencode() converts a parameter dict into a query string; the second argument specifies the encoding
data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('https://httpbin.org/post', data=data)
print(response.read())

# Output
b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "world": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "11", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.7", \n    "X-Amzn-Trace-Id": "Root=1-5e7e2f16-649bbc200c1d94606b3ffa60"\n  }, \n  "json": null, \n  "origin": "183.50.62.150", \n  "url": "https://httpbin.org/post"\n}\n'

# The timeout parameter
# Sets a timeout in seconds; if no response arrives within that time, an exception is raised

# Example
import urllib.request

response = urllib.request.urlopen('https://httpbin.org/get', timeout=1)
print(response.read())
# Output
# (only the error message is shown here)
urllib.error.URLError: <urlopen error _ssl.c:1029: The handshake operation timed out>

# With try-except we can skip a page when it takes too long to respond
# Example
import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen('https://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    # isinstance() checks whether an object is of a given type
    # here we check whether e.reason is a socket.timeout
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT!")

# Output
TIME OUT!

# The context parameter specifies SSL settings; it must be an ssl.SSLContext instance
# cafile and capath specify a CA certificate file and a CA certificate directory, respectively
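A minimal sketch of the context parameter (building an ssl.SSLContext and handing it to urlopen(); relaxing certificate verification is shown purely to illustrate the knobs):

import ssl
import urllib.request

# build a default SSLContext, then loosen verification just for demonstration
context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE  # do not do this in production

response = urllib.request.urlopen('https://www.csdn.net/', context=context)
print(response.status)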
Request()

Used to add Headers and other information to a request.

# Basic usage
import urllib.request

request = urllib.request.Request("https://www.csdn.net/")
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
# Output
# the HTML source of the page, fetched with the constructed Request

# The Request API
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
# url and data are the same as for urlopen()
# headers is a dict of request headers; pass it when constructing the Request, or add entries later with add_header()
# origin_req_host is the requester's host name or IP address
# unverifiable indicates whether the request is unverifiable; default False
# method is a string indicating the HTTP method to use, e.g. GET or POST

# Example
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)', 'Host': 'httpbin.org'}  # set User-Agent and Host
params = {'name': 'Germey'}
data = bytes(parse.urlencode(params), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')  # headers can also be added with add_header()
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# Output
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/4.0(compatible;MISE 5.5;Windws NT)", 
    "X-Amzn-Trace-Id": "Root=1-5e7ebda0-b070af1f8f50d9efbc3b7965"
  }, 
  "json": null, 
  "origin": "183.50.61.7", 
  "url": "http://httpbin.org/post"
}
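For reference, a minimal sketch of the add_header() alternative mentioned in the comments above (the same request, with headers attached one by one):

from urllib import request, parse

url = 'http://httpbin.org/post'
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
# add_header() takes a header name and a value
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
req.add_header('Host', 'httpbin.org')
response = request.urlopen(req)
print(response.read().decode('utf-8'))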

Handler

The BaseHandler class in urllib.request is the parent class of all other Handlers.
A Handler can be thought of as a processor for one specific concern: login authentication, cookies, proxy settings, and so on.

  • Opener

An Opener is built out of Handlers; its open() method returns the same kind of response object as urlopen().

  • Examples:
  1. Authentication
# Example
# Some pages require a username and password before they can be requested
# HTTPBasicAuthHandler handles this
from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'https://www.zhihu.com/login/phone_num'

p = HTTPPasswordMgrWithDefaultRealm()  # instantiate an HTTPPasswordMgrWithDefaultRealm object
p.add_password(None, url, username, password)  # register the username and password with add_password()
auth_handler = HTTPBasicAuthHandler(p)  # build a Handler that performs the authentication
opener = build_opener(auth_handler)  # build an Opener from the Handler

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)
# Output
<!doctype html>
<html lang="zh" data-hairline="true" data-theme="light"><head><meta charSet="utf-8"/><title data-react-helmet="true">知乎 - 有问题,上知乎</title><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1"/><meta name="renderer" content="webkit"/><meta name="force-rendering" content="webkit"/><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"/><meta name="google-site-verification" content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg"/><meta name="description" property="og:description" content="有问题,上知乎。知乎,可信赖的问答社区,以让每个人高效获得可信赖的解答为使命。知乎凭借认真、专业和友善的社区氛围,结构化、易获得的优质内容,基于问答的内容生产方式和独特的社区机制,吸引、聚集了各行各业中大量的亲历者、内行人、领域专家、领域爱好者,将高质量的内容透过人的节点来成规模地生产和分享。用户通过问答等交流方式建立信任和连接,打造和提升个人影响力,并发现、获得新机会。"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png" sizes="152x152"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-120.b3e6278d.png" sizes="120x120"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-76.7a750095.png" sizes="76x76"/><link data-react-helmet="true" rel="apple-touch-icon" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-60.a4a761d4.png" sizes="60x60"/><link rel="shortcut icon" type="image/x-icon" href="https://static.zhihu.com/static/favicon.ico"/><link rel="search" type="application/opensearchdescription+xml" href="https://static.zhihu.com/static/search.xml" title="知乎"/><link rel="dns-prefetch" href="//static.zhimg.com"/><link rel="dns-prefetch" href="//pic1.zhimg.com"/><link rel="dns-prefetch" href="//pic2.zhimg.com"/><link rel="dns-prefetch" href="//pic3.zhimg.com"/><link rel="dns-prefetch" href="//pic4.zhimg.com"/><style>
.u-safeAreaInset-top {
  height: constant(safe-area-inset-top) !important;
  height: env(safe-area-inset-top) !important;
  
}
.u-safeAreaInset-bottom {
  height: constant(safe-area-inset-bottom) !important;
  height: env(safe-area-inset-bottom) !important;
  
}
</style><link href="https://static.zhihu.com/heifetz/main.app.86dc12ecb6d4cae00fdc.css" rel="stylesheet"/><link href="https://static.zhihu.com/heifetz/main.sign-page.069b34d1856ce2eeb081.css" rel="sty
  2. Proxies
# Adding a proxy
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# assume a local proxy is running on port 9743
# ProxyHandler takes a dict: keys are protocol names, values are proxy URLs; multiple proxies can be registered
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
# build the Opener
opener = build_opener(proxy_handler)

try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)
# Output
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
  3. Cookies
# Handling cookies also requires a Handler
import http.cookiejar, urllib.request

# declare a CookieJar object
cookie = http.cookiejar.CookieJar()
# build a Handler with HTTPCookieProcessor
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + '=' + item.value)
    
# Output
# the name and value of each cookie
BAIDUID=9CE7D1C738C792DCB4A682912DD2FD56:FG=1
BIDUPSID=9CE7D1C738C792DCE682EBB4D11A996D
H_PS_PSSID=30971_1446_31123_21120_30823
PSTM=1585367396
BDSVRTM=0
BD_HOME=1

# Cookies can also be written out to a file
import http.cookiejar, urllib.request

filename = 'cookie.txt'
# MozillaCookieJar handles reading cookies from and saving cookies to a file
# It stores cookies in the Mozilla-browser cookie format
# Here it generates a cookie.txt file
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

cookie_file = open("E:\\python_work\\cookie.txt")
for lines in cookie_file.readlines():
    print(lines)
cookie_file.close()
# 运行结果

# Netscape HTTP Cookie File

# http://curl.haxx.se/rfc/cookie_spec.html

# This is a generated file!  Do not edit.



.baidu.com	TRUE	/	FALSE	1616904468	BAIDUID	1E763C2A7DEA6598F459E3CC2780EECE:FG=1

.baidu.com	TRUE	/	FALSE	3732852115	BIDUPSID	1E763C2A7DEA659809BBBC0BF9CDC747

.baidu.com	TRUE	/	FALSE		H_PS_PSSID	30973_1458_31045_21113_31051_30824_26350_22160

.baidu.com	TRUE	/	FALSE	3732852115	PSTM	1585368468

www.baidu.com	FALSE	/	FALSE		BDSVRTM	0

www.baidu.com	FALSE	/	FALSE		BD_HOME	1

# LWPCookieJar can likewise read and save cookies, in the libwww-perl (LWP) cookie file format
import http.cookiejar,urllib.request

filename = 'cookie1.txt'

cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

cookie_file = open("E:\\python_work\\cookie1.txt")
for lines in cookie_file.readlines():
    print(lines)
cookie_file.close()

# Output
#LWP-Cookies-2.0

Set-Cookie3: BAIDUID="335C106DE8D30F26AE2B446B5EA0C5EB:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2021-03-28 04:12:25Z"; comment=bd; version=0

Set-Cookie3: BIDUPSID=335C106DE8D30F26276CDC5076B8F1AB; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2088-04-15 07:26:32Z"; version=0

Set-Cookie3: H_PS_PSSID=30962_1457_31122_21126_30824; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0

Set-Cookie3: PSTM=1585368744; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2088-04-15 07:26:32Z"; version=0

Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

Set-Cookie3: BD_HOME=1; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

# Making use of a saved cookie file
import http.cookiejar, urllib.request

filename = 'cookie1.txt'

cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

# load() reads the local file and restores the cookie contents
cookie.load('cookie1.txt', ignore_expires=True, ignore_discard=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
# this prints the source of the Baidu homepage
print(response.read().decode('utf-8'))
# Most request configuration needs can be met this way

# Output
<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
The error Module
URLError

URLError comes from urllib's error module; it inherits from OSError and is the base class of the module's exceptions.
It can be used to handle exceptions raised by the request module.

# Example
from urllib import request, error
try:
    response = request.urlopen('https://joker.com/index.html')
except error.URLError as e:
    print(e.reason)
# Output
Not Found
HTTPError

A subclass of URLError, used for HTTP request errors.
It has three attributes: code (the HTTP status code), reason (the cause of the error), and headers (the response headers).

# Example
from urllib import request, error
try:
    response = request.urlopen('https://joker.com/index.html')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
# Output
Not Found
404
Server: nginx
Date: Sat, 28 Mar 2020 07:37:55 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: Joker_Session=sjpf9uffn808s4lqqg1n5i44s8; path=/; secure; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate
Pragma: no-cache
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Set-Cookie: ZLBSYS=WEB3; expires=Sat, 28-Mar-2020 07:47:55 GMT; path=/; secure; HttpOnly
# Since URLError is the parent class of HTTPError, catch the subclass error first to get its
# code, reason, headers, etc.; if the error is not an HTTPError, catch the parent URLError and
# print its reason; finally, use an else clause for the normal case, as sketched below.
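A minimal sketch of that pattern (reusing the example URL above):

from urllib import request, error

try:
    response = request.urlopen('https://joker.com/index.html')
except error.HTTPError as e:
    # HTTPError first: it carries code, reason, and headers
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    # anything else (DNS failure, timeout, ...) lands here
    print(e.reason)
else:
    print('Request Successfully')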
The parse Module
urlparse()

Identifies a URL and splits it into its component parts.

# Example
from urllib.parse import urlparse

# urlparse splits a URL into 6 parts:
# scheme (protocol) before '://', netloc (domain) up to the first '/', then path (access path);
# params follow ';', query follows '?' (query conditions, typically in GET-style URLs),
# and fragment follows '#' (an anchor for jumping straight to a position inside the page)

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
print(type(result),result)
# Output
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

# The urllib.parse.urlparse() API
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
# urlstring: the URL to parse

# scheme: the default protocol (e.g. http or https)
# If a URL carries no protocol information, scheme is used as the default
# Example
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(type(result), result)
# Output
<class 'urllib.parse.ParseResult'> ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
# If the URL itself carries a protocol, its own scheme is returned

# allow_fragments: whether to parse the fragment; if False, the fragment part is ignored
# and parsed as part of the path, parameters, or query instead
# Example
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
# Output
# when the URL contains no params and no query, the fragment is parsed as part of the path
ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
# ParseResult is a (named) tuple, so its contents can be accessed by index or by attribute name
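For instance, a small sketch reusing the result above:

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html#comment', allow_fragments=False)
# attribute access and index access return the same values
print(result.scheme, result[0], result.netloc, result[1], sep='\n')
# Output: https, https, www.baidu.com, www.baidu.com (one per line)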
urlunparse()

The counterpart of urlparse(): it accepts an iterable whose length must be exactly 6; fewer or more elements raises an exception.

# Example
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
# Output
http://www.baidu.com/index.html;user?a=6#comment
urlsplit()

Similar to urlparse(), but it does not parse params as a separate part, so it returns only 5 results.

# Example
from urllib.parse import urlsplit

result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)
# Output
SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
# SplitResult is likewise a (named) tuple; its contents can be accessed by index or attribute name
urlunsplit()

Similar to urlunparse(), except that the iterable it takes has length 5, with no params part; see the sketch below.
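A minimal sketch:

from urllib.parse import urlunsplit

# scheme, netloc, path, query, fragment -- no params slot
data = ['https', 'www.baidu.com', 'index.html', 'a=6', 'comment']
print(urlunsplit(data))
# Output
https://www.baidu.com/index.html?a=6#comment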

urljoin()

With urljoin() we supply a base_url (base link) as the first argument and the new link as the second;
the method analyzes the scheme, netloc, and path of base_url and fills in whatever the new link lacks.

# Example
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
# Output
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
urlencode()

Constructs GET request parameters.

# Example
from urllib.parse import urlencode

params = {
    'name': 'joker',
    'age': '30'
}
base_url = 'http://www.baidu.com?'
# urlencode() serializes params into GET request parameters
url = base_url + urlencode(params)
print(url)
# Output
http://www.baidu.com?name=joker&age=30
parse_qs()

The opposite of urlencode(): it deserializes GET request parameters back into a dict.

# Example
from urllib.parse import parse_qs

query = 'name=joker&age=18'
# parse_qs() deserializes query into a dict
print(parse_qs(query))
# Output
{'name': ['joker'], 'age': ['18']}
parse_qsl()

Similar to parse_qs(), but it converts the parameters into a list of tuples.

# Example
from urllib.parse import parse_qsl

query = 'name=joker&age=18'
# parse_qsl() deserializes query into a list of tuples
print(parse_qsl(query))
# Output
[('name', 'joker'), ('age', '18')]
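A small usage note: each tuple is a key-value pair, so the result converts straight back into a plain dict:

from urllib.parse import parse_qsl

print(dict(parse_qsl('name=joker&age=18')))
# Output
{'name': 'joker', 'age': '18'}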
quote()

Converts content into URL-encoded form.
When a URL carries Chinese parameters it may come out garbled; this method converts Chinese characters into URL encoding.

# Example
from urllib.parse import quote

keyword = '博客'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)
# Output
https://www.baidu.com/s?wd=%E5%8D%9A%E5%AE%A2
unquote()

Decodes a URL-encoded string.

# Example
from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%8D%9A%E5%AE%A2'
print(unquote(url))
# Output
https://www.baidu.com/s?wd=博客
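One more detail worth knowing: quote() also takes a safe parameter (default '/') listing characters that are left unencoded; pass safe='' to encode slashes as well:

from urllib.parse import quote

path = '/s/博客'
print(quote(path))           # '/' is kept by default (safe='/')
print(quote(path, safe=''))  # '/' is encoded as %2F too
# Output
# /s/%E5%8D%9A%E5%AE%A2
# %2Fs%2F%E5%8D%9A%E5%AE%A2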