The urllib Module in Python 3

Latest recommended article published on 2023-07-03 16:34:31

tester_sz

Category: Python

Article link: https://blog.csdn.net/qq_39813400/article/details/103350734

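urllib is Python's built-in library (package) for sending network requests. In Python 2, requests were sent with two separate libraries, urllib and urllib2. Python 3 no longer has urllib2: the two were merged into a single urllib package.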

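1. Modules in urllib

urllib is Python's built-in HTTP request library and needs no installation. It contains four modules:

request: the basic HTTP request module, used to simulate sending requests
error: the exception module; errors raised while sending a request can be caught through it
parse: a utility module with many URL-handling functions, for example splitting, parsing and joining
robotparser: parses a site's robots.txt file to determine which pages may be crawled

The modules used most when simulating requests are urllib.request and urllib.parse.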
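To make the parse and robotparser descriptions more concrete, here is a minimal offline sketch. The example.com URLs, the crawler name "my-crawler" and the robots.txt rules are all made up for illustration; a real crawler would normally load the rules from the site with set_url() and read().

```python
from urllib.parse import urlparse, urljoin, urlunparse
from urllib.robotparser import RobotFileParser

# parse: split a URL into its components
parts = urlparse("https://www.example.com/docs/index.html?lang=en#intro")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# parse: resolve a relative link against a base URL
joined = urljoin("https://www.example.com/docs/index.html", "api/request.html")
print(joined)

# parse: reassemble the components into a URL string
rebuilt = urlunparse(parts)
print(rebuilt)

# robotparser: feed hypothetical robots.txt rules directly (no network access)
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])
can_public = robots.can_fetch("my-crawler", "https://www.example.com/public/page.html")
can_private = robots.can_fetch("my-crawler", "https://www.example.com/private/page.html")
print(can_public, can_private)
```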

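2. Sending requests with urllib

2.1 Methods and attributes provided by urllib

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

urlopen sends the request and, for HTTP and HTTPS URLs, returns an HTTPResponse object with the methods and attributes listed below. (The cafile, capath and cadefault parameters were deprecated in Python 3.6 and removed in Python 3.13; use context instead.)

Methods: read(), info(), geturl(), readinto(), getheader(name), getheaders(), fileno()

Attributes: msg, version, status, reason, debuglevel, closed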

import urllib.request

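response=urllib.request.urlopen('https://www.python.org')  # request the site and get an HTTPResponse object
print(response.read().decode('utf-8'))   # the page content
print(response.getheader('server')) # the value of the Server response header
print(response.getheaders()) # the response headers as a list of (name, value) tuples
print(response.fileno()) # the file descriptor
print(response.version)  # the HTTP protocol version
print(response.status)  # the status code: 200 means success, 404 means the page was not found
print(response.debuglevel) # the debug level
print(response.closed)  # a boolean telling whether the object is closed
print(response.geturl()) # the URL that was retrieved
print(response.url) # the URL that was retrieved
print(response.info()) # the response headers
print(response.getcode()) # the HTTP status code of the response
print(response.msg)  # OK when the request succeeded
print(response.reason) # the status reason phrase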
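The listing above assumes the request succeeds. As a hedged sketch of how the error module and the timeout parameter fit in, the helper below (fetch_text is a name invented for this example, not part of urllib) wraps urlopen in a with-block and catches the two main exception classes. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=5.0, encoding="utf-8"):
    """Return (ok, text): the decoded body on success, an error description otherwise."""
    try:
        # The with-block closes the response even if read() fails
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, response.read().decode(encoding)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500
        return False, "HTTP error {}: {}".format(exc.code, exc.reason)
    except urllib.error.URLError as exc:
        # No usable answer: DNS failure, refused connection, missing file, ...
        return False, "URL error: {}".format(exc.reason)
```

A call such as fetch_text('https://www.python.org') would then return either (True, page_source) or (False, description) instead of raising for these two error types.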

Aside:

str -> (encode) -> bytes, bytes -> (decode) -> str
A string is turned into bytes by encoding it; bytes are turned back into a string by decoding them.
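The round trip can be tried in a few lines. This sketch uses the same city name that appears later in the article and assumes UTF-8 in both directions; decoding with a different codec than the one used for encoding would not give the original text back.

```python
text = "郑州市"

raw = text.encode("utf-8")      # str -> bytes
print(raw)                      # b'\xe9\x83\x91\xe5\xb7\x9e\xe5\xb8\x82'
print(len(text), len(raw))      # 3 characters become 9 bytes in UTF-8

restored = raw.decode("utf-8")  # bytes -> str
print(restored == text)         # True
```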

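2.2 Sending a GET request without parameters

import urllib.request
# send a GET request that carries no parameters
response=urllib.request.urlopen('http://www.baidu.com')
print(response.reason)
# the status attribute holds the status code of the response; 200 means the request succeeded
print(response.status)
# read() returns the response body as bytes, so it has to be decoded (here as UTF-8) to get a str
# print(response.read().decode('utf-8'))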

2.3 Sending a GET request with parameters

import urllib.request
import urllib.parse
#http://www.yundama.com/index/login?username=1313131&password=132213213&utype=1&vcode=2132312

# define the base URL
base_url='http://www.yundama.com/index/login'
# build the parameters as a dict
data_dict={
    "username":"1313131",
    "password":"13221321",
    "utype":"1",
    "vcode":"2132312"
}
# urlencode serializes the dict into a query string, which is then appended to the base URL
data_string=urllib.parse.urlencode(data_dict)
print(data_string)
new_url=base_url+"?"+data_string
response=urllib.request.urlopen(new_url)
print(response.read().decode('utf-8'))

There are three ways to pass parameters in a GET request:

Encode a dict of parameters into a URL query string with urllib.parse.urlencode():

import urllib.parse

if __name__ == '__main__':
    base_url = 'http://httpbin.org/get'
    params = {
        'key1': 'value1',
        'key2': 'value2'
    }
    # the query string must be separated from the path by '?'
    full_url = base_url + '?' + urllib.parse.urlencode(params)
    print(full_url)

Pass the params argument directly to requests.get() (this is the third-party requests library; urllib.request has no get() function):

import requests

if __name__ == '__main__':
    payload = {
        'key1': 'value1',
        'key2': 'value2'
    }
    response = requests.get('http://httpbin.org/get', params=payload)
    print(response.url)

Include the parameters directly in the URL:

http://httpbin.org/get?key2=value2&key1=value1
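Whichever way the URL is built, it can be checked by parsing it back. The following sketch uses a hypothetical helper, build_url, which is not part of urllib; it picks "?" or "&" depending on whether the base URL already has a query string, then reads the parameters back with urlsplit and parse_qs.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_url(base_url, params):
    """Append params to base_url as a query string."""
    query = urlencode(params)
    separator = "&" if "?" in base_url else "?"
    return base_url + separator + query


url = build_url("http://httpbin.org/get", {"q": "python urllib", "page": 2})
print(url)  # http://httpbin.org/get?q=python+urllib&page=2

# Parse the query string back into a dict of lists
parsed = parse_qs(urlsplit(url).query)
print(parsed)  # {'q': ['python urllib'], 'page': ['2']}
```

Note that urlencode escapes the space as "+" and converts the integer to text, so values come back from parse_qs as strings.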

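2.4 Adding request headers and a request body

In most cases adding just the User-Agent header is enough. The headers are also given as a dict.
Some servers use hotlink protection: they check whether the Referer header points to their own site and may refuse to respond if it does not. To get past this, a Referer entry can be added to the headers as well.

Because urllib.request.urlopen() does not accept a headers argument, request headers are set by building a urllib.request.Request object.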

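import urllib.request
# url, reqdata and headers must be defined beforehand; the values here are placeholders
url = 'http://httpbin.org/get'
reqdata = None  # None sends a GET; pass bytes to send a request body
headers = {'User-Agent': 'Mozilla/5.0'}
# build the request
req = urllib.request.Request(url, reqdata, headers)
# headers can also be added afterwards with the Request object's add_header method
# send the request; it now carries the browser identification
html = urllib.request.urlopen(req).read()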
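A Request object can be inspected before it is sent, which is a convenient way to confirm that the headers were set. This sketch sends nothing over the network; the URL and header values are placeholders chosen for the example.

```python
import urllib.request

url = "http://httpbin.org/get"  # placeholder test URL
req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
req.add_header("Referer", "http://httpbin.org/")

# Request stores header names capitalized, e.g. "User-agent" and "Referer"
print(req.get_header("User-agent"))  # Mozilla/5.0
print(req.get_header("Referer"))     # http://httpbin.org/
print(req.has_header("Referer"))     # True
print(req.get_method())              # GET, because no data was supplied
```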

2.5 Building a POST request with parameters

import urllib.request
import urllib.parse
# test URL: http://httpbin.org/post

# define the parameters as a dict
data_dict={"username":"zhangsan","password":"123456"}
# urlencode serializes the dict into a string
data_string=urllib.parse.urlencode(data_dict)
# convert the string to bytes, because the data argument of urlopen must be bytes
last_data=bytes(data_string,encoding='utf-8')
# when data is passed to urlopen, the request is sent as POST instead of GET
response=urllib.request.urlopen("http://httpbin.org/post",data=last_data)
# the parameters show up under "form" in the response, which shows that a form submission was simulated and the data was sent by POST
print(response.read().decode('utf-8'))
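The same steps can be packaged into a Request object, which makes the GET-to-POST switch visible without sending anything. build_form_post is a helper name made up for this sketch, not a urllib function.

```python
import urllib.parse
import urllib.request


def build_form_post(url, fields):
    """Return a Request that will submit fields as a form-encoded POST body."""
    body = urllib.parse.urlencode(fields).encode("utf-8")  # str -> bytes
    return urllib.request.Request(url, data=body)


req = build_form_post("http://httpbin.org/post",
                      {"username": "zhangsan", "password": "123456"})
print(req.data)          # b'username=zhangsan&password=123456'
print(req.get_method())  # POST, because data is not None

# To actually send it: urllib.request.urlopen(req, timeout=10)
```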

3. Additional notes

Putting Chinese characters directly into the URL of a request causes an encoding error. The Chinese keyword has to be URL-encoded first with quote(), which lives in urllib.parse.

import urllib.parse
import urllib.request
city=urllib.parse.quote('郑州市'.encode('utf-8'))
response=urllib.request.urlopen('http://api.map.baidu.com/telematics/v3/weather?location={}&output=json&ak=TueGDhCvwI6fOrQnLM0qmXxY9N0OkOiQ&callback=?'.format(city))
print(response.read().decode('utf-8'))
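What quote() produces can be checked offline. The sketch below percent-encodes the same city name, decodes it again with unquote(), and shows the effect of the safe parameter; the "a/b c" string is only a made-up sample.

```python
from urllib.parse import quote, unquote, urlencode

city = quote("郑州市")  # a str is encoded as UTF-8 by default
print(city)             # %E9%83%91%E5%B7%9E%E5%B8%82
print(unquote(city))    # 郑州市

# "/" is left alone by default; pass safe="" to encode it as well
print(quote("a/b c"))           # a/b%20c
print(quote("a/b c", safe=""))  # a%2Fb%20c

# urlencode escapes every value in a dict the same way (spaces would become "+")
query = urlencode({"location": "郑州市", "output": "json"})
print(query)  # location=%E9%83%91%E5%B7%9E%E5%B8%82&output=json
```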

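4. requests compared with urllib

For web scraping in Python, the requests library is the more commonly recommended choice because it is more convenient than urllib: requests can build and send a GET or POST request in a single call, whereas with urllib.request you usually prepare the pieces yourself first (the encoded parameters or a Request object) and then send them with urlopen.

With get_response being the object returned by requests.get(): get_response.text is a str; get_response.content is bytes and has to be decoded, after which it serves the same purpose as get_response.text; get_response.json() parses the body as JSON and returns the resulting Python object. Note that requests is not a wrapper around the standard-library urllib; it is a third-party package built on urllib3. It offers a higher-level interface for the same tasks, which is why it feels more convenient to use.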
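For comparison, rough urllib counterparts of those three accessors might look like the sketch below. The function names get_content, get_text and get_json are invented for this example, and unlike requests the text helper does not guess the encoding: it simply assumes UTF-8 unless told otherwise.

```python
import json
import urllib.request


def get_content(url, timeout=10):
    """Raw body as bytes (comparable to response.content in requests)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def get_text(url, encoding="utf-8", timeout=10):
    """Body decoded to str (comparable to response.text, with an explicit encoding)."""
    return get_content(url, timeout).decode(encoding)


def get_json(url, timeout=10):
    """Body parsed as JSON (comparable to response.json())."""
    return json.loads(get_content(url, timeout))
```

For instance, get_json('http://httpbin.org/get') would return a dict, assuming the service is reachable and answers with JSON.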
