Relearning Web Scraping: urllib

Shao0000

Article link: https://blog.csdn.net/weixin_42903932/article/details/85260582

urllib:

urllib is Python's built-in HTTP request library. It consists of four modules: request, error, parse, and robotparser.

request module:

1. urlopen()

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))
# Result: an HTTPResponse object is returned
# <class 'http.client.HTTPResponse'>

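2. Request

import urllib.request

# Build the request object
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Request takes six parameters. origin_req_host is the host name or IP address of the requester. unverifiable indicates whether the request is unverifiable and defaults to False. data must be a bytes object: if you have a dictionary, first convert it to a string with urlencode(), then encode it with bytes().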
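
To make the data conversion concrete, here is a minimal sketch that builds a Request without sending it. The form fields, the httpbin.org URL, and the User-Agent value are assumptions chosen for illustration; they do not come from the original post.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, for illustration only
form = {"name": "germey", "age": 22}

# dict -> query string -> bytes, as described above
data = bytes(urlencode(form), encoding="utf-8")

# Hypothetical target URL and header; nothing is sent over the network here
req = Request(
    url="https://httpbin.org/post",
    data=data,
    headers={"User-Agent": "Mozilla/5.0"},
    method="POST",
)

print(req.data)          # the encoded request body
print(req.get_method())  # the HTTP method that would be used
```

Passing req to urllib.request.urlopen() would actually send the request. That step is left out so the sketch runs offline.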

Advanced usage:

Authentication:

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

# Register the credentials with a password manager
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, username, password)
# Build a handler that performs HTTP basic authentication, then an opener that uses it
auth_handler = HTTPBasicAuthHandler(p)
opener = build_opener(auth_handler)

try:
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

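Here we instantiate an HTTPBasicAuthHandler object whose argument is an HTTPPasswordMgrWithDefaultRealm object. The password manager's add_password() method registers the username and password, which gives us a handler that takes care of authentication. We then build an opener from that handler and call open() on the URL.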
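
The following offline sketch shows what the password manager stores and how the handler ends up inside the opener. The credentials and the localhost URL are placeholders, and no request is sent.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

url = "http://localhost:5000/"  # placeholder server address

manager = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for any realm under this URL
manager.add_password(None, url, "username", "password")

# Look up the credentials the manager would supply for a page below the registered URL
creds = manager.find_user_password("some realm", url + "private/page")
print(creds)

# The opener keeps the authentication handler in its handler chain
opener = build_opener(HTTPBasicAuthHandler(manager))
print(any(isinstance(h, HTTPBasicAuthHandler) for h in opener.handlers))
```

Because the credentials were registered with a realm of None, they act as a fallback for whatever realm the server announces.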

Proxy:

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# Map each URL scheme to the proxy that should handle it
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})

opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

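Cookies:

import http.cookiejar
import urllib.parse
import urllib.request

# Placeholders: the original does not give the login URL, headers, or form fields,
# so the three values below are hypothetical and must be replaced with real ones.
url = 'http://example.com/login'
headers = {'User-Agent': 'Mozilla/5.0'}
form_data = {'username': 'username', 'password': 'password'}

# Create a CookieJar object
cookie_jar = http.cookiejar.CookieJar()
# Create a handler from the CookieJar
handler = urllib.request.HTTPCookieProcessor(cookie_jar)
# Create an opener from the handler
opener = urllib.request.build_opener(handler)
# Build the request object
request = urllib.request.Request(url=url, headers=headers)
# Encode the form data as bytes
form_data = urllib.parse.urlencode(form_data).encode()
# Send the request with the opener; cookies from the response are stored in cookie_jar
response = opener.open(request, form_data)
print(response.read().decode())
# After a successful login, go to the profile page; the opener sends the stored cookies
url = "http://www.renren.com/968904311/profile"
response = opener.open(url)
print(response.read().decode())

For details, see the book python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice), p. 110.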

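error module:

1. URLError:

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)
# Opening a page that does not exist would normally raise an error, but here we catch the URLError exception. The output is:
# Not Found
# The program does not crash; it prints the message above. Catching the exception this way keeps the program from terminating abnormally and handles the error properly.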

2. HTTPError:

HTTPError is a subclass of URLError dedicated to HTTP request errors. It has three attributes: code, reason, and headers.

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    # All three attributes are available on the exception
    print(e.reason, e.code, e.headers, sep='\n')

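For more complete and robust code, the two exception types are usually handled together.

from urllib import request, error

try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

The subclass (HTTPError) must be caught before its parent class (URLError); otherwise the URLError clause would catch HTTP errors first and the HTTPError clause would never run.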
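
The ordering rule can be checked offline by raising the exceptions by hand. This is only a sketch: the describe() helper, the example.com URL, and the error messages are made up for illustration.

```python
from email.message import Message
from urllib.error import HTTPError, URLError


def describe(exc):
    """Raise exc and report which except clause handled it (hypothetical helper)."""
    try:
        raise exc
    except HTTPError as e:   # subclass first
        return f"HTTPError {e.code}: {e.reason}"
    except URLError as e:    # parent class second
        return f"URLError: {e.reason}"


# Hand-built exceptions; no network access is involved
http_exc = HTTPError("http://example.com/missing", 404, "Not Found", Message(), None)
url_exc = URLError("connection refused")

print(describe(http_exc))
print(describe(url_exc))
print(issubclass(HTTPError, URLError))
```

If the two except clauses were swapped, the URLError clause would also match the HTTPError instance, and the status code would never be reported.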

parse module:

1. urlparse()

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
# Output:
# <class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

urlparse() splits the URL into six parts: scheme, netloc, path, params, query, and fragment.
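
As a small illustration of the six parts, the result can be read either by attribute name or by index, since ParseResult behaves like a tuple. The URL below is just a sample value.

```python
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")

# Access by attribute name
print(result.scheme)    # protocol
print(result.netloc)    # domain
print(result.path)      # path
print(result.params)    # parameters after ';'
print(result.query)     # query string after '?'
print(result.fragment)  # anchor after '#'

# Access by index: ParseResult is a 6-item named tuple
print(result[0], result[1])
print(len(result))
```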

2. urlunparse()

from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
# Here data is a list, but you can also use another type, such as a tuple or a specific data structure
# Output:
# http://www.baidu.com/index.html;user?a=6#comment

This constructs the URL from its six components.
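
Since urlunparse() is the inverse of urlparse(), one common pattern is to parse a URL, change a single component, and rebuild it. A minimal sketch, assuming the sample URL below:

```python
from urllib.parse import urlparse, urlunparse

original = "http://www.baidu.com/index.html;user?id=5#comment"
parts = urlparse(original)

# Parsing and unparsing gives back the same URL
print(urlunparse(parts) == original)

# ParseResult is a named tuple, so _replace() returns a copy with one field changed
modified = urlunparse(parts._replace(query="a=6"))
print(modified)
```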

Other functions include urlsplit(), urlunsplit(), urljoin(), urlencode(), parse_qs(), parse_qsl(), quote(), and unquote().
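
The sketch below gives a quick tour of these functions. All URLs and values are sample data chosen for illustration.

```python
from urllib.parse import (
    parse_qs, parse_qsl, quote, unquote,
    urlencode, urljoin, urlsplit, urlunsplit,
)

# urlsplit() is like urlparse() but returns five parts: params stays inside path
parts = urlsplit("http://www.baidu.com/index.html;user?id=5#comment")
print(parts.path)
rebuilt = urlunsplit(parts)

# urljoin() resolves a relative link against a base URL
joined = urljoin("http://www.baidu.com/about/", "faq.html")
print(joined)

# urlencode() serializes a dict into a query string;
# parse_qs() and parse_qsl() turn it back into a dict or a list of pairs
query = urlencode({"name": "germey", "age": 22})
as_dict = parse_qs(query)
as_list = parse_qsl(query)
print(query, as_dict, as_list)

# quote() percent-encodes non-ASCII text; unquote() reverses it
keyword = "壁纸"  # Chinese for "wallpaper"
encoded = quote(keyword)
print(encoded, unquote(encoded))
```

Note that parse_qs() returns every value as a list of strings, because a query string may repeat a key.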

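For details, see the book python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice), p. 114.

robotparser module:

This module is rarely needed in practice. For details, see the same book, p. 119.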
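
For the occasional case where it is needed, here is a minimal offline sketch of urllib.robotparser. The robots.txt rules and the example.com URLs are invented for illustration; a real crawler would load the site's actual robots.txt with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so no network access is needed
parser.parse(rules.splitlines())

allowed = parser.can_fetch("*", "http://example.com/public/page.html")
blocked = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed, blocked)
```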
