Web Scraping: Network Request Modules (the urllib Module)

猩猩文学

Last modified 2022-05-07 14:40:32


First published 2022-04-01 16:43:06

Article link: https://blog.csdn.net/R71802/article/details/122781900


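This article introduces urllib, Python's built-in network request module: the commonly used urllib.request functions such as urlopen() and the Request object, and how to work with the response object. It explains how to build request headers to avoid anti-scraping checks and how to handle URLs containing Chinese characters with urllib.parse, and closes with a worked example that requests a URL with a Chinese query parameter.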


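Network request modules

The urllib module

urllib is Python's built-in network request module (requests, by contrast, is a third-party package and has to be installed with pip before it can be used).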

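I. Common methods

● urllib.request.urlopen("URL"): sends a request to the site and returns the response
● byte stream = response.read()
● string = response.read().decode("utf-8")
● urllib.request.Request("URL", headers={...}): urlopen() on its own cannot set a custom User-Agent, so the headers are attached to a Request object instead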
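As a minimal sketch of the last bullet, a Request object can be built and inspected without sending anything over the network. The shortened User-Agent string is an assumption made for readability; the URL is the one used later in this article.

```python
import urllib.request

url = "https://www.baidu.com/"
# Shortened, illustrative User-Agent string (assumption, not a full browser UA)
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Nothing is sent yet: Request only bundles the URL and the headers.
req = urllib.request.Request(url, headers=header)

print(req.full_url)      # the target URL stored on the request
print(req.get_method())  # GET, because no data argument was given
# Request stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

The request is only sent once the object is passed to urllib.request.urlopen().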

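II. The response object

● read(): reads the content of the server's response
● getcode(): returns the HTTP status code
● geturl(): returns the URL the data actually came from (useful when the request was redirected)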

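III. Using urllib.request

1. urllib.request.urlopen('URL')

This form stops working as soon as the site checks request-header fields such as the User-Agent (UA).
How do you know when the UA has to be sent?
If the data you scraped differs from the page source you see in the browser, the request has very likely been caught by anti-scraping checks.

urllib.request.urlopen("URL") sends a request to the site and returns the response:

response = urllib.request.urlopen(url)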
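One reason a site can tell such a request apart from a browser is that urllib announces itself with its own User-Agent unless told otherwise. The sketch below only inspects the default headers of a freshly built opener and sends nothing; treating this opener as representative of what urlopen() uses by default is an assumption of the example.

```python
import urllib.request

# build_opener() returns an OpenerDirector with urllib's default settings.
opener = urllib.request.build_opener()

# addheaders is a list of (name, value) tuples added to requests
# that do not already carry that header.
default_headers = dict(opener.addheaders)
print(default_headers)  # the User-agent value starts with "Python-urllib/"
```

A server that filters on the User-Agent can reject or alter responses for this value, which is why the next subsection attaches a browser-like UA.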

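2. urllib.request.urlopen(request object)

1. Build the headers through a request object
res = urllib.request.Request(url, headers=header)
The request object carries the request headers as well as the target URL.

2. Send the request and get the response object
response = urllib.request.urlopen(res)

3. The response object exposes the status code, the response body, and so on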

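# Other data available on the response object
# response.getheaders() returns the response headers
print(response.getheaders())
print(response.getheader('Bdqid'))

# Get the response status code
# (getcode() is the older spelling; response.status holds the same value)
print(response.getcode())
print(response.status)

# Get the URL of the current request
print(response.geturl())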

What matters most in the response object is the page source:

result = response.read().decode('utf-8')
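decode('utf-8') works when the page really is UTF-8. As a hedged sketch, the charset can instead be taken from the Content-Type header when the server declares one. The helper name decode_body is hypothetical, and an email.message.Message object stands in here for response.headers (which offers the same get_content_charset() method) so that the example runs offline.

```python
from email.message import Message


def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes using the declared charset, falling back to `default`."""
    charset = headers.get_content_charset() or default
    # errors="replace" keeps going if a few bytes do not match the charset.
    return raw.decode(charset, errors="replace")


# Stand-in for response.headers of a page served as GBK (assumption for the demo)
gbk_headers = Message()
gbk_headers["Content-Type"] = "text/html; charset=gbk"
gbk_text = decode_body("酷我".encode("gbk"), gbk_headers)

# No Content-Type header at all: the helper falls back to UTF-8.
utf8_text = decode_body("酷我".encode("utf-8"), Message())
print(gbk_text, utf8_text)
```

With a real response this would be called as decode_body(response.read(), response.headers).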

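Sending the request with urlopen and getting the response object

# 1. Create a request object that carries the UA
res = urllib.request.Request(url, headers=header)
# 2. Send the request with urlopen and get the response object
response = urllib.request.urlopen(res)
print("Type of the response object:", type(response))  # Type of the response object: <class 'http.client.HTTPResponse'>
# 3. Read the content from the response object
result = response.read().decode('utf-8')
# print(result)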
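The three steps above can be tried without depending on an external site. The sketch below is an assumption-heavy demo, not part of the original article: EchoHandler is a hypothetical local test server that redirects /old to /echo and replies with the User-Agent it received, and the opener with an empty ProxyHandler is used in place of urlopen() only so that system proxy settings do not interfere with the local request.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /old redirects to /echo, which echoes the UA."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/echo")
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# Same role as urlopen(), but ignoring any proxy configured on the machine.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# 1. Create a request object that carries the UA (illustrative value)
header = {"user-agent": "demo-browser/1.0"}
res = urllib.request.Request(base + "/old", headers=header)
# 2. Send the request and get the response object
with opener.open(res, timeout=5) as response:
    status = response.getcode()
    final_url = response.geturl()  # differs from the requested URL after the redirect
    # 3. Read the content from the response object
    result = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(status, final_url, result)
```

The body shows that the custom User-Agent reached the server, and geturl() reports the /echo address rather than the /old address that was requested.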

Baidu example

import urllib.request
from http.client import HTTPResponse
# url is the target address
url = 'https://www.baidu.com/'
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
}
# response is the response object returned after sending a request to the target url
# response = urllib.request.urlopen(url, headers=header)
# TypeError: urlopen() got an unexpected keyword argument 'headers'
# headers cannot be passed directly to urlopen when sending the request

# .read() reads the content of the response object
# print(response.read())  # returns a bytes object
'''
Printing the response content reveals two problems:
1. the data is not what the browser shows;
2. the data type is wrong: it is a byte stream
'''

# Convert the data to a string
# (read() consumes the body, so call it once and keep the result)
# print(type(response.read())) # <class 'bytes'>
# print(type(response.read().decode('utf-8')))# <class 'str'>
# print(response.read().decode('utf-8'))

# 1. Create a request object that carries the UA
res = urllib.request.Request(url, headers=header)
# 2. Send the request with urlopen and get the response object
response = urllib.request.urlopen(res)
print("Type of the response object:", type(response))  # Type of the response object: <class 'http.client.HTTPResponse'>
# 3. Read the content from the response object
result = response.read().decode('utf-8')
# print(result)



# Other data available on the response object
# response.getheaders() returns the response headers
print(response.getheaders())
print(response.getheader('Bdqid'))

# Get the response status code
print(response.getcode())
print(response.status)

# Get the URL of the current request
print(response.geturl())
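The TypeError quoted in the comments of the example can be reproduced safely: Python rejects the unknown keyword while binding the arguments, before urlopen() does any work, so no request is sent. The shortened User-Agent value is an assumption for brevity.

```python
import urllib.request

url = "https://www.baidu.com/"
header = {"user-agent": "Mozilla/5.0"}  # shortened, illustrative UA

message = None
try:
    # urlopen() has no headers parameter, so this call fails immediately.
    urllib.request.urlopen(url, headers=header)
except TypeError as exc:
    message = str(exc)

print(message)
```

This is the reason the headers have to travel inside a Request object.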

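IV. Using urllib.parse

urllib.parse is mostly used to handle URLs that contain Chinese characters.
Sending a request with urllib to a URL that contains Chinese characters raises: UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-42: ordinal not in range(128)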
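The error message names the 'ascii' codec, which suggests the URL is encoded as ASCII somewhere before it is sent. The sketch below reproduces that failure directly on the string, as an illustration of the likely cause rather than a trace of urllib's internals, and then percent-encodes only the non-ASCII part. The shortened URL and the choice of safe characters are assumptions of this example.

```python
import urllib.parse

url = "https://www.baidu.com/s?wd=酷我"  # shortened version of the article's URL

error_name = None
try:
    url.encode("ascii")  # Chinese characters have no ASCII representation
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

# Keep the URL punctuation as-is and percent-encode everything else that needs it.
safe_url = urllib.parse.quote(url, safe=":/?&=")

print(error_name)
print(safe_url)
```

After quoting, the whole string is plain ASCII and can be passed to Request or urlopen().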

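Common methods

● urlencode(dict)
● quote(string) (the argument here is a string)

Request methods

● GET
The query parameters are visible in the URL.
● POST
1. Pass a data argument to Request: urllib.request.Request(url, data=data, headers=headers)

2. data: the form data must be submitted as bytes, not str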
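A minimal sketch of the POST rules above, built offline: the form dict is urlencoded to a str and then encoded to bytes before it is handed to Request. The endpoint https://example.com/search and the form field are hypothetical placeholders, and nothing is actually sent.

```python
import urllib.parse
import urllib.request

url = "https://example.com/search"  # hypothetical endpoint, never contacted here
headers = {"user-agent": "Mozilla/5.0"}  # shortened, illustrative UA
form = {"wd": "酷我"}  # hypothetical form field

# urlencode() returns a str; .encode() turns it into the bytes that data requires.
data = urllib.parse.urlencode(form).encode("utf-8")

req = urllib.request.Request(url, data=data, headers=headers)

print(type(data))        # <class 'bytes'>
print(req.get_method())  # POST, because data was supplied
```

Passing the request to urllib.request.urlopen(req) would then submit the form.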

1. Handling a dict

# Dict form
org = {"wd":"酷我"}
result = urllib.parse.urlencode(org)
# print(result)  # wd=%E9%85%B7%E6%88%91
new_url = "https://www.baidu.com/s?ie=UTF-8&tn=62095104_35_oem_dg&" + result
# print(new_url)
# exit()

2. Handling a string

# String form
string_org = "酷我"
string_result = urllib.parse.quote(string_org)
# print(string_result)
new_string_url = "https://www.baidu.com/s?ie=UTF-8&tn=62095104_35_oem_dg&wd=" + string_result
# print(new_string_url)
# exit()
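Both approaches end up at the same URL. As a small sketch, the whole query string can also be produced by a single urlencode() call, which avoids concatenating the fixed parameters by hand; moving ie and tn into the dict is a choice made for this example, not something the article does.

```python
import urllib.parse

base = "https://www.baidu.com/s?"

# Dict form: urlencode() builds "key=value" pairs joined with "&".
params = {"ie": "UTF-8", "tn": "62095104_35_oem_dg", "wd": "酷我"}
dict_url = base + urllib.parse.urlencode(params)

# String form: quote() encodes just the value, the rest is written by hand.
string_url = base + "ie=UTF-8&tn=62095104_35_oem_dg&wd=" + urllib.parse.quote("酷我")

print(dict_url)
print(dict_url == string_url)
```

One difference to keep in mind: urlencode() writes a space as +, while quote() writes it as %20, so the two only match exactly when the values contain no spaces.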

Kuwo example

import urllib.parse
import urllib.request
from http.client import HTTPResponse
# url is the target address
url = 'https://www.baidu.com/s?ie=UTF-8&tn=62095104_35_oem_dg&wd=酷我'  # 酷我 percent-encodes to %E9%85%B7%E6%88%91
# Sending a request with urllib to a URL that contains Chinese characters raises: UnicodeEncodeError: 'ascii' codec can't encode characters in position 41-42: ordinal not in range(128)
header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
}

# Use urllib.parse to handle the Chinese characters in the URL

# Dict form
org = {"wd":"酷我"}
result = urllib.parse.urlencode(org)
# print(result)  # wd=%E9%85%B7%E6%88%91
new_url = "https://www.baidu.com/s?ie=UTF-8&tn=62095104_35_oem_dg&" + result
# print(new_url)
# exit()

# String form
string_org = "酷我"
string_result = urllib.parse.quote(string_org)
# print(string_result)
new_string_url = "https://www.baidu.com/s?ie=UTF-8&tn=62095104_35_oem_dg&wd=" + string_result
# print(new_string_url)
# exit()

# 1. Build a request object
res = urllib.request.Request(new_url,headers=header)
# 2. Send the request and get the response
response = urllib.request.urlopen(res)
# 3. Read the content of the response object
print(response.read().decode('utf-8'))

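Extension

https://pvp.qq.com/web201605/wallpaper.shtml

import urllib.parse

# url is the data to be processed
url = 'https%3A%2F%2Fshp%2Eqpic%2Ecn%2Fishow%2F2735032810%2F1648434425%5F1265602313%5F29174%5FsProdImgNo%5F1%2Ejpg%2F200'
res = urllib.parse.unquote(url)
print(res)

Purpose of this extension: if you later come across a URL full of % followed by digits and letters, and pasting it into a new browser window does not open anything, try urllib.parse.unquote(url) and see whether it gives you a real URL (one that does open in a new window).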
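That rule of thumb can be sketched as a small helper. The name maybe_unquote and the regular expression are hypothetical choices for this example, and the encoded string is a percent-encoded form of the wallpaper page address above, written out by hand for the demo.

```python
import re
import urllib.parse

# A percent escape is "%" followed by exactly two hexadecimal digits.
PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def maybe_unquote(url):
    """Decode the URL only if it appears to contain percent escapes."""
    if PERCENT_ESCAPE.search(url):
        return urllib.parse.unquote(url)
    return url


encoded = "https%3A%2F%2Fpvp%2Eqq%2Ecom%2Fweb201605%2Fwallpaper%2Eshtml"
decoded = maybe_unquote(encoded)
untouched = maybe_unquote("https://www.baidu.com/")

print(decoded)
print(untouched)
```

A URL without any escapes passes through unchanged, so the helper is safe to apply to every address collected from a page.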