Python Web Scraping (4): Using and Mastering the Basics of the urllib Library

Latest recommended article published 2022-01-05 20:06:33

William_Tao（攻城狮）

Category: Python web scraping

Article link: https://blog.csdn.net/qq_45353823/article/details/104167865

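The four urllib modules

urllib.request: opens and reads URLs
urllib.error: the exceptions raised by urllib.request
urllib.parse: parses and builds URLs
urllib.robotparser: parses robots.txt files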
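
Of the four modules, urllib.robotparser is the only one that the rest of this article does not use. The following is a minimal sketch of how it might be used. The robots.txt rules and the example.com host are made up for illustration, and the rules are fed in directly so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(useragent, url) reports whether the rules allow the request
allowed_public = parser.can_fetch("*", "http://example.com/index.html")
allowed_private = parser.can_fetch("*", "http://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a real site, set_url() followed by read() would download the rules instead of parse().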

Fetching a page's source code

import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
# Print the HTML source of the Baidu home page
print(response.read().decode('utf-8'))

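POST requests

import urllib.parse
import urllib.request
# urlencode() returns a str, and urlopen() needs bytes, so convert it
data = bytes(urllib.parse.urlencode({"name": "hello"}), encoding='utf-8')
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode('utf-8'))

Here data is the request body. urlopen() expects it as a byte stream (the bytes type), so the URL-encoded string is converted with bytes() first. When data is supplied, the request is sent as a POST instead of a GET.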
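
The effect of data on the request method can be checked without sending anything. This sketch only builds Request objects and inspects them; the variable names are illustrative.

```python
from urllib import parse, request

# urlencode() gives a str; .encode() is equivalent to bytes(..., encoding='utf-8')
payload = parse.urlencode({"name": "hello"}).encode("utf-8")

req_get = request.Request("http://httpbin.org/get")
req_post = request.Request("http://httpbin.org/post", data=payload)

# get_method() reports the method urllib would use for each request
get_method = req_get.get_method()
post_method = req_post.get_method()
print(payload, get_method, post_method)
```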

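Timeout test

import socket
import urllib.error
import urllib.request
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.1)
except urllib.error.URLError as e:
    # e.reason holds the underlying exception; check whether it was a timeout
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")

The timeout argument is given in seconds. If the server has not responded within that time, an exception is raised. A value of 0.1 seconds is short enough that this request will usually time out.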
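
A timeout does not always arrive wrapped in URLError: if it happens while waiting for the response rather than while connecting, socket.timeout can be raised directly. The helper below is a sketch that handles both cases. The name fetch_with_timeout is hypothetical and not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting or sending are wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return None
        raise
    except socket.timeout:
        # Timeouts while waiting for the response can be raised directly
        return None
```

A call such as fetch_with_timeout("http://httpbin.org/get", 0.1) would then return None instead of raising when the server is too slow.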

Responses

1. Response type

import urllib.request
response = urllib.request.urlopen("http://httpbin.org/get")
print(type(response))

The result is: <class 'http.client.HTTPResponse'>

2. Status code
3. Response headers
4. Response body

import urllib.request
response = urllib.request.urlopen("http://www.python.org")
print(response.status)                  # status code
print(response.getheaders())            # all response headers
print(response.getheader('Server'))     # a single header
print(response.read().decode('utf-8'))  # response body
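
To try the status, header and body accessors without depending on an external site, one option is a throwaway local server. This is only a sketch: HelloHandler and its "hello" body are invented for the example, and the empty ProxyHandler is there so that a system proxy does not intercept the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = "hello".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the OS pick a free port for the temporary server
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/" % server.server_address[1]

# An empty ProxyHandler bypasses any proxy configured in the environment
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url) as response:
    status = response.status
    content_type = response.getheader("Content-Type")
    body = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status, content_type, body)
```

Using the response in a with block closes the connection automatically once the body has been read.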

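Requests with multiple parameters

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Named form_data rather than dict to avoid shadowing the built-in dict type
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf-8')
# A Request object bundles the URL, body, headers and method together
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))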
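
A Request object can also be filled in step by step and inspected before it is sent. The sketch below uses add_header() as an alternative to the headers argument and sends nothing over the network. Note that Request stores header names in capitalized form, so the lookup key is "User-agent".

```python
from urllib import parse, request

data = parse.urlencode({"name": "Germey"}).encode("utf-8")
req = request.Request("http://httpbin.org/post", data=data, method="POST")

# add_header() is an alternative to passing a headers dict to the constructor
req.add_header("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)")

method = req.get_method()
host = req.host
# Header names are stored via str.capitalize(), hence "User-agent"
user_agent = req.get_header("User-agent")
print(method, host, user_agent)
```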

Advanced usage: Handler

Some requests need cookies or a proxy. urllib supports these through Handler classes.

import urllib.request
# Assumes a proxy is listening locally on port 9743
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())

Cookies

import http.cookiejar, urllib.request
# Create a CookieJar, wrap it in an HTTPCookieProcessor handler,
# build an opener with build_opener(), then call its open() method
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name + "=" + item.value)

Exception handling

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.html')
except error.HTTPError as e:
    # HTTPError is a subclass of URLError, so it has to be caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful')
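
The same pattern can be wrapped in a small function. This is a sketch only: describe_failure is a hypothetical helper, and the file: URL points at a path assumed not to exist, which makes urllib raise URLError without any network access.

```python
from urllib import error, request


def describe_failure(url):
    """Return a short label describing how fetching url ended."""
    try:
        with request.urlopen(url, timeout=5) as response:
            response.read()
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404
        return "HTTPError %d" % e.code
    except error.URLError as e:
        # No usable answer at all: bad host, refused connection, missing file
        return "URLError: %s" % e.reason
    return "ok"


result = describe_failure("file:///no/such/file/for/this/example")
print(result)
```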

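URL parsing

Parts of a URL:

What parts make up a URL (Uniform Resource Locator)? Take "http://www.baidu.com/index.html?name=mo&age=25#dowell" as an example. A URL can be split into six parts:

1. Scheme (transfer protocol): http, https

2. Domain name: for example, www.baidu.com is the site's host name; baidu.com is the registered domain and www is the host (server) label

3. Port: if omitted, http defaults to port 80 (https defaults to 443)

4. Path: http://www.baidu.com/path1/path1.2, where / denotes the root directory

5. Query parameters: ?name=mo&age=25

6. Fragment (hash): #dowell

Source: https://blog.csdn.net/qq_38990351/article/details/83689928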
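
The six parts listed above can be read off the example URL with urllib.parse. This is a small sketch; parse_qs is used here to turn the query string into a dict.

```python
from urllib.parse import urlparse, parse_qs

parts = urlparse("http://www.baidu.com/index.html?name=mo&age=25#dowell")

scheme = parts.scheme          # 'http'
host = parts.hostname          # 'www.baidu.com'
port = parts.port              # None, because no port is written in the URL
path = parts.path              # '/index.html'
query = parse_qs(parts.query)  # {'name': ['mo'], 'age': ['25']}
fragment = parts.fragment      # 'dowell'
print(scheme, host, port, path, query, fragment)
```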

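urlparse: splitting a URL

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

urlstring: the URL to parse
scheme='': the default scheme to use when the URL has none; ignored if the URL already specifies a scheme
allow_fragments=True: whether to parse the fragment (anchor). The default True splits it out; with False the fragment is not recognized and stays attached to the preceding component

from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# Output: ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
# netloc is empty because the URL does not start with '//', so the host is treated as part of the path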
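
Two variations are worth seeing next to the result above. In this sketch, the first URL starts with '//' so that the host lands in netloc, and the second call passes allow_fragments=False so that the '#comment' part stays inside the query.

```python
from urllib.parse import urlparse

# A leading '//' marks what follows as the network location
with_slashes = urlparse("//www.baidu.com/index.html;user?id=5#comment", scheme="https")
print(with_slashes.netloc, with_slashes.path)

# With allow_fragments=False the fragment is not split off
no_fragment = urlparse("https://www.baidu.com/index.html?id=5#comment", allow_fragments=False)
print(repr(no_fragment.query), repr(no_fragment.fragment))
```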

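urlunparse (assembling a URL)

from urllib.parse import urlunparse
# The six items are scheme, netloc, path, params, query and fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
# Output: http://www.baidu.com/index.html;user?a=6#comment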
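
urlparse and urlunparse are inverses of each other, which makes it easy to change one component of a URL. A brief sketch, relying on the fact that the parse result is a named tuple with a _replace() method:

```python
from urllib.parse import urlparse, urlunparse

original = "http://www.baidu.com/index.html;user?a=6#comment"
parts = urlparse(original)

# Parsing and reassembling gives back the same URL
same = urlunparse(parts)

# Swap only the scheme and rebuild
secure = urlunparse(parts._replace(scheme="https"))
print(same)
print(secure)
```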

urljoin

urljoin takes a base_url (the base link) as its first argument and a new link as its second. It uses the scheme, netloc and path of base_url to fill in whatever the new link is missing, then returns the result.

from urllib.parse import urljoin
print(urljoin("http://www.baidu.com", "FAQ.html"))
print(urljoin("http://www.baidu.com", "https://cuiqinghua.com/FAQ.html"))
print(urljoin("http://www.baidu.com", "?category=2"))
# Output:
# http://www.baidu.com/FAQ.html
# https://cuiqinghua.com/FAQ.html
# http://www.baidu.com?category=2
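
When the base URL has a path of its own, the result depends on whether the new link is relative or starts with a slash. The sketch below uses a made-up /docs/index.html path on the same host to show the three common cases.

```python
from urllib.parse import urljoin

base = "http://www.baidu.com/docs/index.html"

# A bare file name replaces the last segment of the base path
sibling = urljoin(base, "FAQ.html")
# A leading slash restarts from the site root
from_root = urljoin(base, "/FAQ.html")
# '..' climbs one directory before resolving
parent = urljoin(base, "../FAQ.html")

print(sibling)
print(from_root)
print(parent)
```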

urlencode: converting a dict into query parameters

from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
# Output: http://www.baidu.com?name=germey&age=22
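
urlencode also percent-encodes non-ASCII values and, with doseq=True, expands list values into repeated keys. parse_qs goes the other way. A short sketch with illustrative values:

```python
from urllib.parse import urlencode, parse_qs

# Non-ASCII values are percent-encoded as UTF-8
chinese = urlencode({"wd": "壁纸"})

# doseq=True turns a list into one key=value pair per item
repeated = urlencode({"tag": ["a", "b"]}, doseq=True)

# parse_qs decodes a query string; every value comes back as a list of str
decoded = parse_qs("name=germey&age=22")

print(chinese)
print(repeated)
print(decoded)
```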

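quote

quote percent-encodes text so that it is safe to put in a URL. A URL that carries Chinese parameters unencoded may come out garbled, so encode them first.

from urllib.parse import quote
keyword = "壁纸"
url = "https://www.baidu.com/s?wd=" + quote(keyword)
print(url)
# Output: https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8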

unquote

unquote decodes a percent-encoded URL.

from urllib.parse import unquote
url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))
# Output: https://www.baidu.com/s?wd=壁纸
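
quote leaves '/' unescaped by default, which matters when the text is a single query value rather than a path. This sketch compares the default with safe='' and with quote_plus, and checks that unquote reverses quote. The sample strings are illustrative.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

# By default '/' is treated as safe and left alone
default_quote = quote("a b/c")
# safe='' escapes '/' as well
strict_quote = quote("a b/c", safe="")
# quote_plus encodes spaces as '+', the form used in HTML form data
plus_quote = quote_plus("a b")

# Encoding and then decoding returns the original text
round_trip = unquote(quote("壁纸"))
plus_round_trip = unquote_plus(plus_quote)

print(default_quote, strict_quote, plus_quote, round_trip, plus_round_trip)
```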
