Urllib basics (urllib.request, urllib.error, urllib.parse, urllib.robotparser)

Leadingme

Category: Python crawlers. Tags: python, data mining

Article link: https://blog.csdn.net/weixin_43388615/article/details/105087540

Python's built-in HTTP request library

1. urllib.request: the request module

urllib.request.urlopen(url, data=None, timeout=timeout) returns a file-like response object

import socket
import urllib.error
import urllib.parse
import urllib.request

# Example 1 (GET)
response = urllib.request.urlopen("http://leadingme.top", timeout=1)
print(response.read().decode('utf-8'))

# Example 2 (POST): passing data turns the request into a POST
data = bytes(urllib.parse.urlencode({'name': 'leadingme'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)

# Example 3: detect a timeout
try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.10)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT!')

urllib.request.Request(url=url, data=data, headers=headers, method='POST')

from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}
form = {
    'name': 'leadingme'
}
# The request body must be bytes, so encode the form first
data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
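
A Request object can be inspected before it is sent, which is a handy way to check what urlopen will actually transmit. The snippet below is a minimal sketch that stays offline; the shortened User-Agent string and the extra Accept header are assumptions made for illustration.

```python
from urllib import parse, request

form = {'name': 'leadingme'}
data = parse.urlencode(form).encode('utf-8')

req = request.Request(
    url='http://httpbin.org/post',
    data=data,
    headers={'User-Agent': 'Mozilla/5.0'},  # shortened UA, illustration only
    method='POST',
)
req.add_header('Accept', 'application/json')  # headers can also be added later

# Without an explicit method, the presence of data decides GET vs POST.
plain_get = request.Request('http://httpbin.org/get')
implicit_post = request.Request('http://httpbin.org/post', data=data)

print(req.get_method(), req.full_url, req.data)
# Request stores header names capitalized as 'User-agent', not 'User-Agent'.
print(req.get_header('User-agent'))
```

Note that get_header looks names up in the capitalized form that Request stores them in, so 'User-Agent' would not be found here.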

Setting a proxy IP

import urllib.request

# Map each URL scheme to the proxy that should handle it
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1',
    'https': 'https://127.0.0.1'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')

2. urllib.error: the exception-handling module

3. urllib.parse: the URL-parsing module

4. urllib.robotparser: the robots.txt-parsing module
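
urllib.robotparser is not shown anywhere else in this post, so here is a minimal sketch of how it can be used. The robots.txt content and the example.com URLs are made up for illustration, and the rules are parsed from a string rather than fetched, so the snippet runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed = parser.can_fetch('*', 'http://example.com/index.html')
blocked = parser.can_fetch('*', 'http://example.com/private/data.html')
print(allowed, blocked)
```

For a real site, set_url() followed by read() downloads the robots.txt file instead of parsing a local string.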

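Response information methods

response.info(): returns the response's meta information (its headers)
response.getcode() or response.status: returns the HTTP status code of the response
response.geturl(): returns the URL that was actually retrieved
response.getheaders(): returns all response headers as (name, value) pairs
response.getheader("name"): returns the value of a single response header
response.read().decode('utf-8'): returns the response body as text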
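
The methods above can be tried without touching a real website. The sketch below starts a throwaway local server in a background thread and queries it; HelloHandler and the /index.html path are hypothetical stand-ins, and the empty ProxyHandler is only there so a system proxy cannot intercept the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server (hypothetical) that always answers 'hello'."""

    def do_GET(self):
        body = 'hello'.encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example output quiet


# Port 0 asks the OS for any free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/index.html' % server.server_port

# An empty ProxyHandler disables proxy auto-detection for this opener.
opener = request.build_opener(request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status
    final_url = response.geturl()
    content_type = response.getheader('Content-Type')
    header_names = [name for name, _ in response.getheaders()]
    body = response.read().decode('utf-8')

server.shutdown()
server.server_close()
print(status, final_url, content_type, body)
```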

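Handler

Overview
1. A Handler is a helper that takes care of extra work for us, such as FTP or caching; all of these go through a Handler. For example, setting a proxy requires a ProxyHandler.
Proxies
1. A complete proxied request works as follows: the client first connects to the proxy server and then, depending on the proxy protocol in use, asks it either to open a connection to the target server or to fetch a specific resource (for example, a file) from it. In the latter case the proxy server may download the resource into its local cache; if the resource the client wants is already in that cache, the proxy returns the cached copy directly instead of sending a request to the target server.
2. In a crawler, ProxyHandler can be used to set a proxy and hide your own IP address. The crawler can keep switching IPs while it runs; because the server then sees requests coming from different regions, it is less likely to ban them.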
```
from urllib import request
proxy_handler = request.ProxyHandler(  # build a ProxyHandler from the proxy addresses
{'http':'http://127.0.0.1:9743',
'https':'https://127.0.0.1:9743'
})  # in practice this port turned out to be blocked, so this cannot be demonstrated here

opener = request.build_opener(proxy_handler)  # then build an opener that uses the handler

response = opener.open('http://www.baidu.com')
print(response.read())
```
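
The idea of switching IPs between requests can be sketched as one opener per proxy, taken in turn from a list. The proxy addresses, and the helper names opener_for and proxies_of, are hypothetical; nothing is sent over the network here, the snippet only shows how the openers are built.

```python
import itertools
from urllib import request

# Hypothetical proxy addresses; replace with proxies you actually control.
PROXIES = ['http://127.0.0.1:9743', 'http://127.0.0.1:9744']


def opener_for(proxy_url):
    """Build an opener that routes http and https traffic through proxy_url."""
    handler = request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return request.build_opener(handler)


def proxies_of(opener):
    """Return the proxy mapping configured on an opener (empty if none)."""
    for handler in opener.handlers:
        if isinstance(handler, request.ProxyHandler):
            return handler.proxies
    return {}


# Cycle through the list so consecutive requests use different proxies.
rotation = itertools.cycle(PROXIES)
openers = [opener_for(next(rotation)) for _ in range(3)]

for opener in openers:
    print(proxies_of(opener)['http'])
```

In a real crawler each opener.open(url) call would then leave through a different proxy.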

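A cookie is a small piece of text stored on the client to record the user's identity. In a crawler it is mainly used to maintain a logged-in state, which makes it possible to crawl pages that require login.

With cookies, a crawler can stay logged in and keep crawling across requests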

from urllib import request
from http import cookiejar

cookie = cookiejar.CookieJar()   # declare cookie as a CookieJar object
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')   # open the page through the opener

for item in cookie:      # iterate over the cookies the server has set
    print(item.name + '=' + item.value)   # print each cookie as name=value

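Saving cookies: cookies can be saved to a text file. As long as they have not expired, they can be read back from the file and attached to later requests, which keeps the logged-in state alive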

from urllib import request
from http import cookiejar

filename = "cookie.txt"
cookie = cookiejar.LWPCookieJar(filename)  # LWP format; the Mozilla format below is an alternative
# cookie = cookiejar.MozillaCookieJar(filename)
# Both are CookieJar subclasses with a save method that writes the cookies to a text file
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)  # call save to write the file

Loading cookies: read the file back with the class that matches the format it was saved in

from urllib import request
from http import cookiejar

cookie = cookiejar.LWPCookieJar()  # pick the matching format; the file here was saved as LWP
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)  # load is the key step
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
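
The save and load steps can also be checked offline. The sketch below builds one session cookie by hand instead of receiving it from a server; make_cookie, the cookie's name, value and domain, and the temporary file location are all assumptions made for illustration.

```python
import os
import tempfile
from http import cookiejar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (hypothetical helper, illustration only)."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = cookiejar.LWPCookieJar(path)
jar.set_cookie(make_cookie('sessionid', 'abc123', 'example.com'))
# Session cookies are only written when ignore_discard=True.
jar.save(ignore_discard=True, ignore_expires=True)

restored = cookiejar.LWPCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
pairs = [(c.name, c.value) for c in restored]

# Loading without ignore_discard skips the session cookie again.
strict = cookiejar.LWPCookieJar()
strict.load(path)
print(pairs, len(strict))
```

This is why the examples above pass ignore_discard=True to both save and load: login cookies are often session cookies, which would otherwise be dropped.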

Exception handling

from urllib import request, error

# Try to open a page that does not exist
try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')
except error.URLError as e:
    print(e.reason)  # inspecting the reason shows whether this is the exception we expected

from urllib import request, error

# Try to open a page that does not exist
try:
    response = request.urlopen('http://www.cuiqingcai.com/index.html')
except error.HTTPError as e:  # catch HTTPError first, because it is a subclass of URLError
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully!')
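
Why the order of the except clauses matters can be shown without any network access. The sketch below constructs the two exceptions by hand and classifies them; describe is a hypothetical helper, and the URL and messages are made up for illustration.

```python
from email.message import Message
from urllib import error


def describe(exc):
    """Classify an exception, checking the more specific HTTPError first."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP error %d: %s' % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: %s' % exc.reason
    return 'other error: %r' % (exc,)


# Hand-built exceptions standing in for what urlopen would raise.
http_exc = error.HTTPError(
    'http://example.com/missing', 404, 'Not Found', Message(), None)
url_exc = error.URLError('connection refused')

print(describe(http_exc))
print(describe(url_exc))
# Every HTTPError is also a URLError, so testing URLError first would hide it.
print(isinstance(http_exc, error.URLError))
```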

URL parsing

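urlparse: urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)  # arguments: the URL, a default scheme, and whether to split off the part after #

from urllib.parse import urlparse

# The URL has no scheme, so the scheme argument is used
result1 = urlparse('www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result1)

# The URL already has a scheme, so the scheme argument is ignored
result2 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')
print(result2)  # scheme = 'http'

# Using the allow_fragments parameter
result3 = urlparse('http://www.baidu.com/s?#comment', allow_fragments=False)
# False means the part after # is not split off as a fragment; what would have been the fragment is appended to the preceding component instead. Compare with result1 and result2 to see the difference.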
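
To make the three cases concrete, the sketch below runs the same calls and looks at individual fields of the result. The fourth call, with allow_fragments left at its default, is an addition for comparison.

```python
from urllib.parse import urlparse

# No scheme and no '//' in the URL: the default scheme is applied, and the
# host ends up in path because netloc is only recognized after '//'.
r1 = urlparse('www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')

# The URL's own scheme takes priority over the scheme argument.
r2 = urlparse('http://www.baidu.com/s?wd=urllib&ie=UTF-8', scheme='https')

# With allow_fragments=False the '#comment' part stays in the query.
r3 = urlparse('http://www.baidu.com/s?#comment', allow_fragments=False)

# Default behaviour, added for comparison: '#comment' becomes the fragment.
r4 = urlparse('http://www.baidu.com/s?#comment')

print(r1.scheme, repr(r1.netloc), r1.path)
print(r2.scheme, r2.netloc, r2.path)
print(repr(r3.query), repr(r3.fragment))
print(repr(r4.query), repr(r4.fragment))
```

The first result is a common surprise: without a leading `//`, urlparse treats `www.baidu.com/s` as a path, not as a host.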

urljoin: this function joins a base URL with another (possibly relative) URL

from urllib.parse import urljoin
print(urljoin('http://www.baidu.com','FQA.html'))
#http://www.baidu.com/FQA.html
 
print(urljoin('http://www.baidu.com','http://www.caiqingcai.com/FQA.html'))
#http://www.caiqingcai.com/FQA.html
 
>>> from urllib.parse import urljoin
>>> urljoin("http://www.chachabei.com/folder/currentpage.html", "anotherpage.html")
'http://www.chachabei.com/folder/anotherpage.html'
>>> urljoin("http://www.chachabei.com/folder/currentpage.html", "/anotherpage.html")
'http://www.chachabei.com/anotherpage.html'
>>> urljoin("http://www.chachabei.com/folder/currentpage.html", "folder2/anotherpage.html")
'http://www.chachabei.com/folder/folder2/anotherpage.html'
>>> urljoin("http://www.chachabei.com/folder/currentpage.html", "/folder2/anotherpage.html")
'http://www.chachabei.com/folder2/anotherpage.html'
>>> urljoin("http://www.chachabei.com/abc/folder/currentpage.html", "/folder2/anotherpage.html")
'http://www.chachabei.com/folder2/anotherpage.html'
>>> urljoin("http://www.chachabei.com/abc/folder/currentpage.html", "../anotherpage.html")
'http://www.chachabei.com/abc/anotherpage.html'

urlencode: this function converts a dict into GET request parameters (a query string)

from urllib.parse import urlencode
 
params = {
'name':'zhuzhu',
'age':'23'
}
base_url = 'http://www.baidu.com?'

url = base_url + urlencode(params)   # encode the params dict as a query string

print(url)
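
urlencode also escapes characters that are not safe in a query string, and parse_qs reverses the encoding. A minimal sketch follows; the 'q' parameter and its value are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

params = {'name': 'zhuzhu', 'age': '23'}
query = urlencode(params)
url = 'http://www.baidu.com?' + query

# Spaces become '+' and reserved characters such as '&' are percent-encoded.
escaped = urlencode({'q': 'a b&c'})

# parse_qs decodes a query string back into a dict of lists.
decoded = parse_qs(query)
round_trip = parse_qs(escaped)

print(url)
print(escaped)
print(decoded, round_trip)
```

The values come back as lists because a query string may repeat the same key more than once.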
