The Python urllib Library

SeeUa

urllib is Python's built-in HTTP request library. It consists of four modules:

urllib.request: sending requests
urllib.error: exception handling
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

decode("utf-8"): converts the response bytes into a string, using UTF-8 encoding

import urllib.request
html=urllib.request.urlopen("http://www.baidu.com/")
print(html.read().decode("utf-8"))

Sending a POST request (http://httpbin.org is a site for testing HTTP requests):

import urllib.parse
import urllib.request
data=bytes(urllib.parse.urlencode({"word":"hello"}),encoding="utf-8") # encoding="..." selects the encoding used to produce the bytes
response=urllib.request.urlopen("http://httpbin.org/post",data=data)
print(response.read())

Output:

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "word": "hello"\n  }, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Content-Length": "10", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "json": null, \n  "origin": "113.105.12.153, 113.105.12.153", \n  "url": "https://httpbin.org/post"\n}\n'
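To make the data argument less opaque, here is a minimal offline sketch of what the encoding step produces. The helper name encode_form is hypothetical, and nothing is sent over the network: the Request object is only constructed and inspected.

```python
import urllib.parse
import urllib.request


def encode_form(fields):
    """Turn a dict of form fields into the bytes that urlopen() expects as data."""
    query = urllib.parse.urlencode(fields)   # e.g. "word=hello"
    return query.encode("utf-8")             # same result as bytes(query, encoding="utf-8")


body = encode_form({"word": "hello"})
print(body)        # b'word=hello'
print(len(body))   # 10

# Spaces and non-ASCII characters are percent-encoded.
print(encode_form({"word": "hello world", "lang": "中文"}))
# b'word=hello+world&lang=%E4%B8%AD%E6%96%87'

# Attaching data to a Request (without sending it) switches the method to POST.
req = urllib.request.Request("http://httpbin.org/post", data=body)
print(req.get_method())   # POST
```

The 10-byte body matches the "Content-Length": "10" header echoed back by httpbin in the output above.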

Setting a timeout

	import urllib.error
	import socket
	import urllib.request
	try:
	    response=urllib.request.urlopen("http://httpbin.org/get",timeout=0.1)
	except urllib.error.URLError as e:
	    if isinstance(e.reason,socket.timeout):
	        print("TIME OUT")

Output:

	TIME OUT
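One caveat with the check above: a timeout while connecting is wrapped in URLError, but a timeout that happens while the response is being read can surface as socket.timeout directly. The sketch below handles both cases. The names is_timeout and fetch_or_none are hypothetical helpers, not part of urllib.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc is a timeout, whether or not it is wrapped in URLError."""
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


def fetch_or_none(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError as exc:   # URLError and socket.timeout are both OSError subclasses
        if is_timeout(exc):
            print("TIME OUT")
            return None
        raise                # anything else is a real error; let it propagate
```

A call such as fetch_or_none("http://httpbin.org/get", timeout=0.1) would then behave like the example above, while still raising for failures that are not timeouts.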

Response

Response type

import urllib.request
html=urllib.request.urlopen("http://www.baidu.com/")
print(html)  # print the response object itself rather than its body

<http.client.HTTPResponse object at 0x0000024B3B676080>

Status code and response headers

import urllib.request
html=urllib.request.urlopen("http://www.baidu.com/") 
print(html.status)
print(html.getheaders())
print(html.getheader('server'))


200
[('Bdpagetype', '1'), ('Bdqid', '0x8afac3a8000c1dba'), ('Cache-Control', 'private'), ('Content-Type', 'text/html'), ('Cxy_all', 'baidu+8ec69d29edd1ec53e9faabc8051e2fd7'), ('Date', 'Sun, 17 Mar 2019 07:12:33 GMT'), ('Expires', 'Sun, 17 Mar 2019 07:11:47 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=5F61E86C65F2F415AE669543617A67B2:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=5F61E86C65F2F415AE669543617A67B2; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1552806753; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'delPer=0; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=1438_21118_28558_28607_28584_26350_28604_28606; path=/; domain=.baidu.com'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
BWS/1.1

read() returns the content of the response body:

html.read()
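The response attributes shown above can be tried without internet access. The following sketch starts a throwaway local server (the HelloHandler class and fetch_local function are hypothetical names made up for this illustration) and reads the status, one header, and the body, decoding the body with the charset the server declared instead of hard-coding it.

```python
import http.server
import threading
import urllib.request


class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical handler that always answers with a small UTF-8 text page."""

    def do_GET(self):
        body = "héllo".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_local():
    # Port 0 lets the OS pick any free port.
    server = http.server.HTTPServer(("127.0.0.1", 0), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        # An empty ProxyHandler skips any system proxy for this local request.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            # Use the charset from the Content-Type header, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return (response.status,
                    response.getheader("Content-Type"),
                    response.read().decode(charset))
    finally:
        server.shutdown()
        server.server_close()


print(fetch_local())
```

Using the response in a with statement closes the connection when the block ends, which the shorter examples in this article leave to garbage collection.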

Request

request.Request(url=url, data=data, headers=headers, method="POST")
url: the URL to request
data: the form data to submit, as bytes
headers: the request headers
method: the HTTP method

from urllib import parse,request

url='http://httpbin.org/post'
headers={
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Mobile Safari/537.36',
    'Host':'httpbin.org'
}
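form_data={   # named form_data so the built-in dict is not shadowed
    'name':'Germey'
}
data=bytes(parse.urlencode(form_data),encoding="utf-8")   # parse comes from "from urllib import parse"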
req=request.Request(url=url,data=data,headers=headers,method='POST')
response=request.urlopen(req)
print(response.read().decode('utf-8'))

Output:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "Germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Mobile Safari/537.36"
  }, 
  "json": null, 
  "origin": "113.105.12.153, 113.105.12.153", 
  "url": "https://httpbin.org/post"
}
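A Request object can be inspected before it is sent, which is a convenient way to check what urllib is about to transmit. This is a minimal sketch: the User-Agent value "demo-client/0.1" is a made-up placeholder, and no network request is made.

```python
from urllib import parse, request

form = {"name": "Germey"}
req = request.Request(
    url="http://httpbin.org/post",
    data=parse.urlencode(form).encode("utf-8"),
    headers={"User-Agent": "demo-client/0.1"},   # hypothetical value
    method="POST",
)
# Headers can also be added after construction.
req.add_header("Accept", "application/json")

print(req.full_url)       # http://httpbin.org/post
print(req.get_method())   # POST
print(req.data)           # b'name=Germey'
# Request stores header names via str.capitalize(), so look them up in that form.
print(req.get_header("User-agent"))   # demo-client/0.1
print(req.header_items())
```

The 11-byte body corresponds to the "Content-Length": "11" header in the httpbin output above.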

handler

Proxies

Method 1:

import urllib.request
proxy_handler=urllib.request.ProxyHandler(
{
   'https':'219.131.240.200:9797'  # careful: the key is just 'https', with no dot after it
})
opener=urllib.request.build_opener(proxy_handler,urllib.request.HTTPHandler)
response=opener.open("https://httpbin.org/get")
print(response.read())

Output:

b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "origin": "219.131.240.200, 219.131.240.200", \n  "url": "https://httpbin.org/get"\n}\n'

Method 2:

import urllib.request
proxy_handler=urllib.request.ProxyHandler(
{
   'https':'219.131.240.200:9797'
})
opener=urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
response=urllib.request.urlopen("https://httpbin.org/get")
print(response.read())

Output:

b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Host": "httpbin.org", \n    "User-Agent": "Python-urllib/3.6"\n  }, \n  "origin": "219.131.240.200, 219.131.240.200", \n  "url": "https://httpbin.org/get"\n}\n'

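Note: ProxyHandler picks the proxy by URL scheme. The entry under the 'http' key is used only for http:// URLs, and the entry under the 'https' key only for https:// URLs.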
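Because the proxy is chosen per scheme, a common pattern is to register the same proxy under both keys so that http:// and https:// URLs are both routed through it. The sketch below assumes the proxy accepts both kinds of traffic. The function name build_proxy_opener and the address 127.0.0.1:8080 are placeholders, and nothing is sent.

```python
import urllib.request


def build_proxy_opener(proxy_address):
    """Return (proxies, opener) with one proxy registered for both schemes."""
    proxies = {"http": proxy_address, "https": proxy_address}
    handler = urllib.request.ProxyHandler(proxies)
    return proxies, urllib.request.build_opener(handler)


proxies, opener = build_proxy_opener("127.0.0.1:8080")   # placeholder address
print(sorted(proxies))   # ['http', 'https']
# opener.open("https://httpbin.org/get") would now be routed through the proxy.
```

Passing an empty dict, ProxyHandler({}), does the opposite: it disables proxies, including any picked up from environment variables.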

cookie

Cookies can preserve login session information.
Import http.cookiejar, the library for handling cookies:

import http.cookiejar,urllib.request

cookie =http.cookiejar.CookieJar() # note the capitalization
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+'*'+item.value)

Output:

BAIDUID*C31837787335FED26959A1D8CCE1030F:FG=1
BIDUPSID*C31837787335FED26959A1D8CCE1030F
H_PS_PSSID*1450_21085_28557_28608_28584_26350_28603_28606
PSTM*1552816088
delPer*0
BDSVRTM*0 
BD_HOME*0

Saving cookies to a text file
Method 1:

import http.cookiejar,urllib.request

filename='C:/Users/hanson/Desktop/1/cookie.txt' # where to save the file; a bare file name is saved in the project directory
cookie=http.cookiejar.MozillaCookieJar(filename) # use MozillaCookieJar, a CookieJar subclass, because it provides save()
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

Output:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file!  Do not edit.

.baidu.com	TRUE	/	FALSE	3700363349	BAIDUID	D3E2F4A0A280B33C6E7C5558F8A6DB34:FG=1
.baidu.com	TRUE	/	FALSE	3700363349	BIDUPSID	D3E2F4A0A280B33C6E7C5558F8A6DB34
.baidu.com	TRUE	/	FALSE		H_PS_PSSID	28629_1444_21119_28558_28607_28584_28603_28626_28605
.baidu.com	TRUE	/	FALSE	3700363349	PSTM	1552879705
.baidu.com	TRUE	/	FALSE		delPer	0
www.baidu.com	FALSE	/	FALSE		BDSVRTM	0
www.baidu.com	FALSE	/	FALSE		BD_HOME	0

Method 2:

import http.cookiejar,urllib.request

filename='C:/Users/hanson/Desktop/1/cookie1.txt'
cookie=http.cookiejar.LWPCookieJar(filename)   # use LWPCookieJar instead of MozillaCookieJar
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True,ignore_expires=True)

Output:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="AFA15173D5BB3D6F2CA1645B51A149C4:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-04-05 06:49:31Z"; version=0
Set-Cookie3: BIDUPSID=AFA15173D5BB3D6F2CA1645B51A149C4; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-04-05 06:49:31Z"; version=0
Set-Cookie3: H_PS_PSSID=1438_21113_28558_28607_28584_28604_28625_28606; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: PSTM=1552880127; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2087-04-05 06:49:31Z"; version=0
Set-Cookie3: delPer=0; path="/"; domain=".baidu.com"; path_spec; domain_dot; discard; version=0
Set-Cookie3: BDSVRTM=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0
Set-Cookie3: BD_HOME=0; path="/"; domain="www.baidu.com"; path_spec; discard; version=0

Opening a URL with saved cookies
Load the cookies with the same CookieJar class that was used to save them.

import http.cookiejar,urllib.request

cookie=http.cookiejar.LWPCookieJar() # must match the class that saved the file
cookie.load('C:/Users/hanson/Desktop/1/cookie1.txt',ignore_discard=True,ignore_expires=True)
handler=urllib.request.HTTPCookieProcessor(cookie)
opener=urllib.request.build_opener(handler)
response=opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
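The save-then-load rule can be checked offline. In this sketch a cookie is built by hand to stand in for one set by a server; make_cookie and round_trip are hypothetical helper names, and the cookie values are copied from the output above purely for illustration.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie by hand (stands in for one set by a server)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(jar_class, path):
    """Save one cookie with jar_class, reload it with the same class."""
    saved = jar_class(path)
    saved.set_cookie(make_cookie("BD_HOME", "0", "www.baidu.com"))
    # ignore_discard=True is needed here: session cookies are skipped otherwise.
    saved.save(ignore_discard=True, ignore_expires=True)

    loaded = jar_class()
    loaded.load(path, ignore_discard=True, ignore_expires=True)
    return {c.name: c.value for c in loaded}


with tempfile.TemporaryDirectory() as tmp:
    lwp_path = os.path.join(tmp, "lwp.txt")
    print(round_trip(http.cookiejar.LWPCookieJar, lwp_path))   # {'BD_HOME': '0'}
    try:
        # Loading an LWP file with the other class fails on the file header.
        http.cookiejar.MozillaCookieJar().load(lwp_path)
    except http.cookiejar.LoadError as exc:
        print("format mismatch:", exc)
```

The two classes write different file formats, which is why a file saved by one cannot be read by the other.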

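Exception handling:

Parent class: URLError
Subclass: HTTPError

Because HTTPError is a subclass of URLError, put the except clause for HTTPError before the one for URLError in a try/except block.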
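As a minimal sketch of that try/except structure: the function below catches the more specific HTTPError first and the general URLError second. The name try_fetch and its open_url parameter are hypothetical; the parameter exists only so the error paths can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def try_fetch(url, open_url=urllib.request.urlopen):
    """Return a (kind, detail) pair describing how the request went."""
    try:
        with open_url(url) as response:
            return "ok", response.status
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return "http_error", e.code
    except urllib.error.URLError as e:
        # No usable answer at all: DNS failure, refused connection, and so on.
        return "url_error", str(e.reason)


def fake_404(url):
    # Stand-in for urlopen() that simulates a 404 response.
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)


def fake_refused(url):
    # Stand-in for urlopen() that simulates a failed connection.
    raise urllib.error.URLError("connection refused")


print(try_fetch("http://example.invalid/", fake_404))       # ('http_error', 404)
print(try_fetch("http://example.invalid/", fake_refused))   # ('url_error', 'connection refused')
```

If the two except clauses were swapped, the URLError clause would catch HTTPError as well and the status code would never be reported.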

URL parsing
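URL parsing is handled by the urllib.parse module. As one possible illustration (the example URL, with its query string and fragment, is made up for this sketch), urlparse splits a URL into components, parse_qs decodes a query string, and urljoin resolves a relative reference:

```python
from urllib.parse import parse_qs, urljoin, urlparse

parts = urlparse("http://httpbin.org/get?word=hello#top")
print(parts.scheme)     # http
print(parts.netloc)     # httpbin.org
print(parts.path)       # /get
print(parts.query)      # word=hello
print(parts.fragment)   # top

# parse_qs reverses urlencode; every value comes back as a list.
print(parse_qs(parts.query))   # {'word': ['hello']}

# urljoin resolves a relative reference against a base URL.
print(urljoin("http://httpbin.org/get", "post"))   # http://httpbin.org/post
```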
