A Detailed Guide to urllib, Python's Built-in Module for Web Scraping

最低调的奢华

Category: Web scraping. Tags: python

Article link: https://blog.csdn.net/weixin_46700209/article/details/115904041

1. What is the urllib module?

It is Python's built-in package for making network requests.

2. Why learn this module?

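Many older scraping projects are written with it.
Some scraping tasks need requests and urllib working together.
It is built in, so there is nothing to install.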

3. Downloading one image with requests and one with urllib, for comparison

import requests
url = 'https://alifei03.cfp.cn/creative/vcg/veer/800water/veer-145089182.jpg'
res = requests.get(url)
# The source image is a JPEG, so save it with a .jpg extension
with open('file1.jpg', 'wb') as f:
    f.write(res.content)

With urllib the download itself takes a single call, so in some situations it is the more convenient option:

from urllib import request
url = 'https://alifei03.cfp.cn/creative/vcg/veer/800water/veer-145089182.jpg'
# urlretrieve downloads the URL straight into the named file
request.urlretrieve(url, 'file2.jpg')
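
As a hedged aside: the Python documentation lists urlretrieve as a legacy interface, so the same download can also be written with urlopen. The sketch below is one way to do that. The helper name download and the chunked copy through shutil.copyfileobj are illustrative choices, not something the original comparison uses.

```python
import shutil
import urllib.request


def download(url, dest):
    """Stream the body at `url` into the file `dest` (hypothetical helper)."""
    # urlopen returns a file-like response; copyfileobj copies it in chunks,
    # so the whole image never has to sit in memory at once.
    with urllib.request.urlopen(url) as resp, open(dest, 'wb') as f:
        shutil.copyfileobj(resp, f)
    return dest
```

Calling download(url, 'file3.jpg') with the image URL above should produce the same file as urlretrieve.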

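4.1 urllib.request, and using it to fetch a page's source code

Fetching a page's source code takes the following steps:

1. Create the request object
2. Get the response
3. Read the response object

1. If we simply print the return value of urlopen, we get an object rather than the page source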

import urllib.request
url='https://www.baidu.com/'
response = urllib.request.urlopen(url)
print(response)

Here is the output. It is clearly not what we want, so we need to call one of the object's methods, e.g. read()

<http.client.HTTPResponse object at 0x032458C8>
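
The object printed above is more than a placeholder: it carries the status, the headers, and the body. A minimal sketch of what it exposes, assuming only the standard library; the helper name summarize is made up for illustration.

```python
import urllib.request


def summarize(resp):
    """Collect a few facts about a response object (hypothetical helper)."""
    return {
        'type': type(resp).__name__,
        # getcode() is the same call this article uses later on
        'status': resp.getcode(),
        # headers behaves like an email.message.Message
        'content_type': resp.headers.get_content_type(),
        # read() returns the body as bytes, not str
        'body': resp.read(),
    }
```

A typical call would be `with urllib.request.urlopen(url) as resp: print(summarize(resp))`; the with statement closes the connection once the block ends.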

2. Use the read() method, which only means changing the last line

import urllib.request
url='https://www.baidu.com/'
response = urllib.request.urlopen(url)
print(response.read())

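Here is the output. It is a bytes object rather than a string, and the content is only a short redirect stub, possibly because of an anti-scraping measure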

b'<html>\r\n<head>\r\n\t<script>\r\n\t\tlocation.replace(location.href.replace("https://","http://"));\r\n\t</script>\r\n</head>\r\n<body>\r\n\t<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>\r\n</body>\r\n</html>'

3. The data we got back is of type bytes, so we need to convert it to a string

Adding decode('utf-8') is all it takes to turn the bytes into a string.
decode: bytes >>> str (string type)
encode: str (string type) >>> bytes
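
Both directions can be checked without any network access. The sketch below round-trips a string through encode and decode. decode_body is a hypothetical helper that falls back to UTF-8 and replaces undecodable bytes, which is an assumption about what a scraper would want rather than anything prescribed here.

```python
def decode_body(raw, charset=None):
    """Turn response bytes into str (hypothetical helper)."""
    # Use the declared charset if there is one, otherwise assume UTF-8.
    # errors='replace' swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(charset or 'utf-8', errors='replace')


text = '奔驰'
raw = text.encode('utf-8')           # str -> bytes
print(raw)                           # b'\xe5\xa5\x94\xe9\xa9\xb0'
print(decode_body(raw) == text)      # True: bytes -> str gives the text back
print(decode_body(text.encode('gbk'), 'gbk') == text)  # True for GBK as well
```

With a real response, resp.headers.get_content_charset() returns the charset named in the Content-Type header (or None), which can be passed as the second argument.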

import urllib.request
url='https://www.baidu.com/'
response = urllib.request.urlopen(url)
print(response.read().decode('utf-8'))

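Here is the output. The result is now a string, but it is still far shorter than the real page source. This is probably an anti-scraping measure, so we need to imitate a browser when fetching the page, which means adding a User-Agent (UA)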

<html>
<head>
	<script>
		location.replace(location.href.replace("https://","http://"));
	</script>
</head>
<body>
	<noscript><meta http-equiv="refresh" content="0;url=http://www.baidu.com/"></noscript>
</body>
</html>
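
One likely reason the server treats the script differently from a browser is the User-Agent that urllib sends by default. This sketch only inspects that default locally; whether a particular site actually filters on it is an assumption.

```python
import urllib.request

# build_opener() returns the same kind of opener that urlopen uses internally.
opener = urllib.request.build_opener()

# addheaders holds the headers the opener attaches to every request.
default_headers = dict(opener.addheaders)
print(default_headers['User-agent'])  # e.g. Python-urllib/3.11
```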


4. Adding a User-Agent is enough to get past this check. urllib.request.urlopen has no parameter for headers, however, so we use urllib.request.Request to attach the UA

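urllib.request.Request covers our first step: creating the request object.
All we need to do is put the UA in the headers dictionary.
The complete code below returns the page's source code.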

import urllib.request
url = 'https://www.baidu.com/'
headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.57'
}
# 1. Create the request object
req = urllib.request.Request(url, headers=headers)
# 2. Get the response
response = urllib.request.urlopen(req)
# 3. Read the response object
html = response.read().decode('utf-8')
print(html)

5. Finally, check the status code and the URL of the response

print(response.getcode())
print(response.geturl())
# Output:
200
https://www.baidu.com/
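
Putting the three steps together, a reusable version might look like the sketch below. fetch_html, its timeout default, and the choice to return None on failure are illustrative assumptions; only Request, urlopen, and the urllib.error exceptions come from the standard library.

```python
import urllib.error
import urllib.request

UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36')


def fetch_html(url, user_agent=UA, timeout=10):
    """Fetch `url` with a browser-like User-Agent and return its text."""
    # 1. Create the request object
    req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        # 2. Get the response
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # 3. Read the response, using the declared charset if present
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset, errors='replace')
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 403 or 404
        print('HTTP error:', e.code)
    except urllib.error.URLError as e:
        # No usable answer at all (DNS failure, refused connection, ...)
        print('Connection problem:', e.reason)
    return None
```

HTTPError is caught before URLError because it is a subclass of URLError; in the other order the first handler would swallow both.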

4.2 Calling urllib.request.urlopen directly on a URL that contains Chinese characters raises an error, so we need urlencode or quote from the urllib.parse module

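Each Chinese character has to be converted into a percent sign followed by hexadecimal digits, one %XX group per byte of its UTF-8 encoding.
For example, 奔驰 (Benz) is encoded as %E5%A5%94%E9%A9%B0
You can try this out with the online tool at https://tool.chinaz.com/tools/urlencode.aspx
or find the tool from https://tool.chinaz.com/
url1 = 'https://www.baidu.com/s?wd=奔驰'
We convert a URL containing Chinese characters, like url1, into the encoded URL shown below.
urllib.parse.urlencode(dict)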
import urllib.parse
wd = {'wd':'奔驰'}
result = urllib.parse.urlencode(wd)
base_url = 'https://www.baidu.com/s?'+result
print(base_url)
# Output: the encoded URL
https://www.baidu.com/s?wd=%E5%A5%94%E9%A9%B0
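
urlencode is not limited to a single key. As a small sketch, the query below adds a second parameter, pn, which is a made-up name used only for illustration, and then parses the string back with parse_qs to confirm nothing was lost.

```python
from urllib.parse import parse_qs, urlencode

params = {'wd': '奔驰', 'pn': 10}   # 'pn' is a hypothetical extra parameter
query = urlencode(params)          # non-string values are converted with str()
print(query)                       # wd=%E5%A5%94%E9%A9%B0&pn=10

# parse_qs reverses the encoding; every value comes back as a list of strings
decoded = parse_qs(query)
print(decoded == {'wd': ['奔驰'], 'pn': ['10']})   # True
```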

urllib.parse.quote(string)

import urllib.parse
r = '奔驰'
result = urllib.parse.quote(r)
base_url = 'https://www.baidu.com/s?wd='+result
print(base_url)
# Output: the encoded URL
https://www.baidu.com/s?wd=%E5%A5%94%E9%A9%B0
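
quote has two details worth knowing when building URLs by hand: by default it leaves '/' unescaped, and it writes a space as %20, whereas quote_plus writes a space as '+'. A short sketch with an ASCII sample string chosen for illustration:

```python
from urllib.parse import quote, quote_plus

s = 'a b/c'
print(quote(s))            # a%20b/c    ('/' is in the default safe set)
print(quote(s, safe=''))   # a%20b%2Fc  (nothing is treated as safe)
print(quote_plus(s))       # a+b%2Fc    (form style: space becomes '+')
```

For a query-string value such as the wd parameter above, quote_plus or quote with safe='' is usually the safer choice, since a literal '/' or '&' in the value would otherwise pass through unescaped.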

urllib.parse.unquote()
Going further
Convert a fully percent-encoded image URL from Honor of Kings (王者荣耀) back into the URL we need

import urllib.parse
url = 'http%3A%2F%2Fshp%2Eqpic%2Ecn%2Fishow%2F2735041519%2F1618485629%5F84828260%5F22420%5FsProdImgNo%5F1%2Ejpg%2F200'
img_url = urllib.parse.unquote(url)
print(img_url)
# Output:
http://shp.qpic.cn/ishow/2735041519/1618485629_84828260_22420_sProdImgNo_1.jpg/200
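
Once decoded, a URL can be taken apart with urlsplit, and quoting it again shows which characters quote actually escapes. The sketch uses a short, made-up sample URL on example.com rather than the one above.

```python
from urllib.parse import quote, unquote, urlsplit

encoded = 'http%3A%2F%2Fexample.com%2Fimg%2Fa%5F1.jpg'   # hypothetical sample
decoded = unquote(encoded)
print(decoded)                     # http://example.com/img/a_1.jpg

parts = urlsplit(decoded)
print(parts.netloc, parts.path)    # example.com /img/a_1.jpg

# With safe='' the ':' and '/' are escaped again, but '_' and '.' never are,
# so the result differs from `encoded` (which had %5F for '_').
print(quote(decoded, safe=''))     # http%3A%2F%2Fexample.com%2Fimg%2Fa_1.jpg
```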
