The Python urllib module

Test_C.

Article link: https://blog.csdn.net/weixin_42544006/article/details/84339063

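Table of Contents

Requesting a page

urllib.request.urlopen(): making an HTTP request

The urlopen() API

The data parameter: urllib.parse.urlencode(dict) converts a dictionary into a query string; data itself must be bytes

The timeout parameter: sets the timeout in seconds. If no response arrives within this time, an exception is raised. If it is not specified, the global default timeout is used.

urllib.request.Request(): adding request headers, sending POST requests, and more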

Requesting a page

urllib.request.urlopen(): making an HTTP request

import urllib.request

html = urllib.request.urlopen('http://www.baidu.com')

# html is an HTTPResponse object
print(html)

# Response status code
print(html.status)

# Response headers
print(html.getheaders())

# Response body
print(html.read().decode('utf-8'))
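The response object also works as a context manager, which closes the connection once the body has been read. Below is a minimal sketch of that pattern. The helper name fetch_text, the 10-second default timeout, and the fallback to UTF-8 when the server declares no charset are assumptions made for illustration, not part of urllib.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body decoded as text (hypothetical helper)."""
    # The with-block closes the response once the body has been read
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header; assume UTF-8 if absent
        charset = response.headers.get_content_charset() or 'utf-8'
        return response.read().decode(charset)
```

A call such as fetch_text('http://www.baidu.com') would return the page source as a str.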

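The `urlopen()` API

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Note: cafile, capath, and cadefault have been deprecated since Python 3.6 and were removed in Python 3.13. On current versions, pass an ssl.SSLContext through context instead.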

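The data parameter: urllib.parse.urlencode(dict) converts a dictionary into a query string; data itself must be bytes

import urllib.parse
import urllib.request

# urlencode() returns a str, so convert it to bytes before passing it as data
data = bytes(urllib.parse.urlencode({'aa':'bb','hello':'世界'}),encoding='utf-8')
# Supplying data turns the request into a POST
html = urllib.request.urlopen('http://httpbin.org/post', data=data)

print(html.read().decode())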



# {
#   "args": {}, 
#   "data": "", 
#   "files": {}, 
#   "form": {
#     "aa": "bb", 
#     "hello": "\u4e16\u754c"
#   }, 
#   "headers": {
#     "Accept-Encoding": "identity", 
#     "Connection": "close", 
#     "Content-Length": "30", 
#     "Content-Type": "application/x-www-form-urlencoded", 
#     "Host": "httpbin.org", 
#     "User-Agent": "Python-urllib/3.5"
#   }, 
#   "json": null, 
#   "origin": "111.197.18.20", 
#   "url": "http://httpbin.org/post"
# }
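To see what actually goes over the wire, the encoding step can be run on its own without sending a request. The sketch below reuses the same dictionary as the example above. parse_qs is included only to show that the encoding is reversible.

```python
from urllib.parse import parse_qs, urlencode

params = {'aa': 'bb', 'hello': '世界'}

# urlencode() percent-encodes non-ASCII values as UTF-8 and returns a str
query = urlencode(params)
print(query)  # aa=bb&hello=%E4%B8%96%E7%95%8C

# urlopen() needs bytes, so encode the string before passing it as data
body = query.encode('utf-8')
print(len(body))  # 30

# parse_qs() reverses the encoding; every value comes back as a list
decoded = parse_qs(query)
```

The body is 30 bytes long, which matches the Content-Length header echoed back by httpbin.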

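The timeout parameter: sets the timeout in seconds. If no response arrives within this time, an exception is raised. If it is not specified, the global default timeout is used.

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'aa':'bb','hello':'世界'}),encoding='utf-8')
# With a timeout this short, the call may well raise a timeout exception
html = urllib.request.urlopen('http://httpbin.org/post', data=data,timeout=0.3)
print(html.read().decode())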
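Because a short timeout is likely to fire, it usually helps to catch the exception instead of letting the script crash. The following is a rough sketch under a few assumptions: the helper name fetch_or_none is made up for illustration, and returning None on failure is only one possible policy.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_none(url, timeout=0.3):
    """Return the response body as bytes, or None if the request fails (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as err:
        # Failures while connecting or sending, including connect timeouts, arrive wrapped in URLError
        print('request failed:', err.reason)
    except socket.timeout:
        # A timeout while waiting for or reading the response is raised directly
        print('request timed out')
    return None
```

Calling fetch_or_none('http://httpbin.org/post', timeout=0.3) then yields either the body or None, so the caller can decide whether to retry.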

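urllib.request.Request(): adding request headers, sending POST requests, and more

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

The first parameter, url, is the URL to request. It is the only required parameter; all the others are optional.
The second parameter, data, must be bytes if it is supplied. If your data is a dictionary, encode it first with urlencode() from the urllib.parse module.
The third parameter, headers, is a dictionary of request headers. You can pass it when constructing the request, or add headers afterwards by calling add_header() on the request instance.
The most common use of request headers is to change User-Agent so that the request looks like it comes from a browser. The default User-Agent is Python-urllib followed by the Python version. For example, to imitate Firefox you can set it to:
Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
The fourth parameter, origin_req_host, is the host name or IP address of the original request initiated by the user.
The fifth parameter, unverifiable, indicates whether the request is unverifiable, meaning the user had no option to approve it. It defaults to False. For example, if an image in an HTML document is fetched automatically and the user had no option to approve that fetch, unverifiable should be True.
The sixth parameter, method, is a string naming the HTTP method to use, such as GET, POST, or PUT.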

import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'

headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Hosts': '123123.org',
    'test':'666'
}
# The data argument must be bytes
data = bytes(urllib.parse.urlencode({'aa':'bb','hello':'世界'}),encoding='utf-8')
rs = urllib.request.Request(url,data,headers,method='POST')
response = urllib.request.urlopen(rs)

print(response.read().decode('utf-8'))
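A Request object can be inspected before it is sent, which is a convenient way to check how the parameters described above interact. The sketch below sends nothing over the network; it only builds requests and reads back their method and headers.

```python
import urllib.parse
import urllib.request

url = 'http://httpbin.org/post'
data = urllib.parse.urlencode({'aa': 'bb'}).encode('utf-8')

# Without data and without method, the request defaults to GET
get_req = urllib.request.Request(url)
print(get_req.get_method())

# With data and no explicit method, the request defaults to POST
req = urllib.request.Request(url, data=data)
print(req.get_method())

# Headers can also be added after construction
req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11')

# Request stores header names capitalized, so look this one up as 'User-agent'
print(req.get_header('User-agent'))

# An explicit method overrides the default
put_req = urllib.request.Request(url, data=data, method='PUT')
print(put_req.get_method())
```

Note that add_header() normalizes the header name with str.capitalize(), which is why the lookup key is 'User-agent' rather than 'User-Agent'.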

Using a proxy

import urllib.request
import re,random

proxy = [{
"ip": "112.245.189.220",
"port": 4243,
"expire_time": "2018-11-22 17:57:02",
"city": "山东省滨州市",
"isp": "联通"
},
{
"ip": "113.121.156.147",
"port": 7889,
"expire_time": "2018-11-22 18:08:10",
"city": "山东省泰安市",
"isp": "电信"
},
{
"ip": "112.240.176.185",
"port": 4213,
"expire_time": "2018-11-22 17:57:01",
"city": "山东省荷泽市",
"isp": "联通"
},
{
"ip": "122.4.40.196",
"port": 3937,
"expire_time": "2018-11-22 18:07:02",
"city": "山东省济南市",
"isp": "电信"
},
{
"ip": "122.7.135.224",
"port": 7889,
"expire_time": "2018-11-22 18:11:33",
"city": "山东省泰安市",
"isp": "电信"
}]

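def test(ip,port):
    # Route both HTTP and HTTPS requests through the given proxy
    proxy_handler = urllib.request.ProxyHandler({
        'http': 'http://%s:%s'%(ip,port),
        'https': 'https://%s:%s'%(ip,port),
    })
    opener = urllib.request.build_opener(proxy_handler)

    response = opener.open('http://www.baidu.com/s?wd=ip')
    # The pattern matches the Chinese label for "local IP" on the Baidu result page
    match = re.search(r'本机IP:&nbsp;(.*?)</td>',response.read().decode('utf-8'),re.S)
    # Guard against a page layout that no longer contains the label
    if match:
        print(match.group(1))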

for i in range(5):
    rdm = random.choice(proxy)
    test(rdm['ip'],str(rdm['port']))

Output (some of the proxy IPs are unstable, so a request can fail with a connection timeout):

122.7.135.224</span>山东省泰安市 电信	    
    
122.4.40.196</span>山东省济南市 电信	    
    
Traceback (most recent call last):
  File "E:\Python\安装目录\lib\urllib\request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "E:\Python\安装目录\lib\http\client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "E:\Python\安装目录\lib\http\client.py", line 1152, in _send_request
    self.endheaders(body)
  File "E:\Python\安装目录\lib\http\client.py", line 1103, in endheaders
    self._send_output(message_body)
  File "E:\Python\安装目录\lib\http\client.py", line 934, in _send_output
    self.send(msg)
  File "E:\Python\安装目录\lib\http\client.py", line 877, in send
    self.connect()
  File "E:\Python\安装目录\lib\http\client.py", line 849, in connect
    (self.host,self.port), self.timeout, self.source_address)
  File "E:\Python\安装目录\lib\socket.py", line 712, in create_connection
    raise err
  File "E:\Python\安装目录\lib\socket.py", line 703, in create_connection
    sock.connect(sa)
TimeoutError: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。
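Since free proxies fail often, one option is to try them in turn and skip the ones that do not respond, instead of letting the first failure abort the script. The sketch below is one way to do that, not a definitive implementation. The function names are hypothetical, the 5-second timeout is arbitrary, and it assumes every proxy speaks plain HTTP, so the same http:// proxy URL is used for both schemes.

```python
import urllib.error
import urllib.request


def build_proxy_opener(ip, port):
    """Build an opener that sends HTTP and HTTPS requests through one proxy."""
    # Assumption: the proxy itself is reached over plain HTTP
    proxy_url = 'http://%s:%s' % (ip, port)
    handler = urllib.request.ProxyHandler({'http': proxy_url, 'https': proxy_url})
    return urllib.request.build_opener(handler)


def fetch_via_first_working_proxy(url, proxies, timeout=5):
    """Try each proxy in turn; return the body from the first that responds, else None."""
    for entry in proxies:
        opener = build_proxy_opener(entry['ip'], entry['port'])
        try:
            with opener.open(url, timeout=timeout) as response:
                return response.read()
        except OSError as err:
            # URLError, timeouts, and connection resets are all OSError subclasses
            print('proxy %s:%s failed: %s' % (entry['ip'], entry['port'], err))
    return None
```

With the proxy list defined earlier, fetch_via_first_working_proxy('http://www.baidu.com/s?wd=ip', proxy) would return the first successful response body, or None if every proxy fails.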

Handling cookies

import urllib.request
import http.cookiejar

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)
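A CookieJar lives only in memory. If the cookies should survive between runs, http.cookiejar.MozillaCookieJar can save them to a file and load them again. The sketch below assumes a hypothetical helper named build_cookie_opener and a caller-chosen file path; it is one possible arrangement rather than the only one.

```python
import http.cookiejar
import os
import urllib.request


def build_cookie_opener(cookie_file):
    """Return (opener, jar); the jar is preloaded from cookie_file if it exists."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    if os.path.exists(cookie_file):
        # ignore_discard=True also restores session cookies, which have no expiry
        jar.load(ignore_discard=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

A typical session would call opener, jar = build_cookie_opener('cookies.txt'), make requests with opener.open(...), and finish with jar.save(ignore_discard=True) so that the next run starts with the same cookies.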

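Downloading files

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

filename: the local path to save to. If it is omitted, urllib creates a temporary file to hold the data.
reporthook: a callback function that is called once when the connection to the server is established and once after each data block has been transferred. It can be used to display the current download progress.
data: the data to POST to the server.
The function returns a two-element tuple (filename, headers): filename is the local path the data was saved to, and headers holds the server's response headers.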

from urllib import request

def Schedule(a, b, c):
    '''
    a: number of data blocks downloaded so far
    b: size of each data block, in bytes
    c: size of the remote file, in bytes
    '''
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%% ' % per, a, b, c)

url = 'https://www.python.org/ftp/python/3.7.1/python-3.7.1-amd64.exe'

local = 'python-3.7.1-amd64.exe'
request.urlretrieve(url, local, Schedule)
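The callback above divides by the remote file size, which urlretrieve reports as -1 when the server sends no Content-Length header. A slightly more defensive variant might look like the sketch below. The names percent_done, report, and download are hypothetical and exist only for this illustration.

```python
from urllib import request


def percent_done(block_num, block_size, total_size):
    """Return download progress as a percentage, or None if the size is unknown."""
    if total_size <= 0:
        # urlretrieve passes -1 when the server sends no Content-Length header
        return None
    return min(100.0, 100.0 * block_num * block_size / total_size)


def report(block_num, block_size, total_size):
    """reporthook callback: print the progress, or the bytes read if the size is unknown."""
    percent = percent_done(block_num, block_size, total_size)
    if percent is None:
        print('read about %d bytes' % (block_num * block_size))
    else:
        print('%.2f%%' % percent)


def download(url, filename):
    """Download url to filename, reporting progress through report()."""
    return request.urlretrieve(url, filename, report)
```

Keeping the arithmetic in percent_done, separate from the printing, makes the progress calculation easy to check without downloading anything.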

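About HTTPS requests

from urllib import request
# Skip certificate verification for HTTPS requests.
# This replaces a private hook in the ssl module, so it affects every HTTPS request in the process.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

base_url = 'https://www.12306.cn/mormhweb/'
response = request.urlopen(base_url)
print(response.read().decode('utf-8'))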
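If verification only needs to be skipped for a single request, a narrower alternative is to build an SSLContext with public ssl APIs and pass it through the context parameter of urlopen(). This is a sketch with hypothetical helper names; skipping verification removes protection against man-in-the-middle attacks, so it is best limited to hosts you already trust.

```python
import ssl
from urllib import request


def unverified_context():
    """Build an SSLContext that skips certificate and host name checks."""
    context = ssl.create_default_context()
    # check_hostname must be switched off before verify_mode can be CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_unverified(url, timeout=10):
    """Fetch one URL without certificate verification; other requests are unaffected."""
    with request.urlopen(url, timeout=timeout, context=unverified_context()) as response:
        return response.read()
```

For example, fetch_unverified('https://www.12306.cn/mormhweb/') would fetch that page while leaving the default verification in place for every other request in the program.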
