Python 3 urllib: A Summary

allen20104245

Article link: https://blog.csdn.net/allen20104245/article/details/105575999

Official documentation: https://docs.python.org/3/library/urllib.html

What is urllib

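urllib is Python's built-in HTTP request library.
It consists of the following modules:
urllib.request: opens and reads URLs
urllib.error: the exceptions raised by urllib.request
urllib.parse: URL parsing
urllib.robotparser: robots.txt parsing (not covered in this article)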

HTTP requests (urllib.request.urlopen)

To send an HTTP request the way a browser does, use the urllib.request module. It not only sends the request but also gives access to the response:

urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

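data must be a bytes object; a string can be converted with bytes() or str.encode(). It is optional. When data is supplied, the request is sent as a POST instead of a GET. Form data is normally encoded in the standard application/x-www-form-urlencoded format.
timeout sets the request timeout, in seconds.
cafile, capath and cadefault specify trusted CA certificates for HTTPS requests. They are rarely needed, have been deprecated since Python 3.6 in favor of context, and were removed in Python 3.13.
context must be an ssl.SSLContext instance; it specifies the SSL settings used for encrypted transport. (Rarely needed.)
Instead of a URL string, a urllib.request.Request object can be passed as the first argument.
For HTTP and HTTPS URLs, the function returns an http.client.HTTPResponse object.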
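
As a minimal sketch of how the timeout and context parameters fit together, the helper below builds a default SSL context and passes it to urlopen. The function name fetch and its 10-second default are illustrative assumptions, not part of urllib, and the function is only defined here, not called, because calling it needs network access.

```python
import ssl
import urllib.request

# A default context verifies server certificates and host names.
context = ssl.create_default_context()


def fetch(url, timeout=10):
    """Hypothetical helper: return (status, body bytes) for url."""
    # The with block closes the response when done.
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.status, resp.read()


print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```

Passing an explicit context is the replacement for the older cafile/capath arguments.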

Using the url parameter

Start with a simple example:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
# response.read() returns the page content as bytes. Printing the response itself, without read(), shows something like:
# <http.client.HTTPResponse object at 0x000001990D44CBA8>

Using the data parameter

The example above fetched the Baidu home page with a GET request. The next one sends a POST request with urllib:

import urllib.request
import urllib.parse

posturl = "http://www.iqianyue.com/mypost/"
postdata = urllib.parse.urlencode({
    "name": "1258ceo@qq.com",
    "pass": "kjsahgjkashg",
    }).encode("utf-8")

# To send a POST, build a urllib.request.Request(post URL, post data).
# Passing the data straight to urlopen also works:
# rst = urllib.request.urlopen(posturl, postdata)
req = urllib.request.Request(posturl, postdata)
rst = urllib.request.urlopen(req).read().decode("utf-8")

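urllib.parse.urlencode converts the POST fields into a form-encoded string, which is then encoded to bytes and passed as the data argument of urllib.request.urlopen (or of Request). That completes a POST request. In short: with data the request is sent as a POST, and without data it is sent as a GET.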
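
The GET/POST rule can be checked without sending anything, since a Request object reports the method it would use. This is a small sketch; the field name and value are made up for illustration, and the URL is never contacted.

```python
import urllib.parse
import urllib.request

url = "http://httpbin.org/post"  # only used to build Request objects here
form = urllib.parse.urlencode({"name": "value"}).encode("utf-8")

get_req = urllib.request.Request(url)              # no data
post_req = urllib.request.Request(url, data=form)  # with data

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)
```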

Using the timeout parameter

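When the network is poor or the server misbehaves, a request can be slow or fail outright, so it is worth setting a timeout on the request instead of letting the program wait indefinitely. For example: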

import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())

With this setting the result normally comes back fine. Next, set timeout to 0.1.
Running the program then raises an error like this:

urllib.error.URLError: <urlopen error timed out>

So the exception needs to be caught. The code becomes:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
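
A timeout does not always arrive wrapped in URLError: one that happens while the response body is being read is raised as a bare socket.timeout. The sketch below puts both checks in one place. The helper name is_timeout is hypothetical, and the exceptions are constructed by hand so that the example runs without a network.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Hypothetical helper: True if exc represents a timed-out request."""
    # Bare timeout, e.g. raised while reading the response body.
    if isinstance(exc, socket.timeout):
        return True
    # Timeout wrapped in URLError, as in the example above.
    return isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, socket.timeout)


wrapped = urllib.error.URLError(socket.timeout("timed out"))
bare = socket.timeout("timed out")
other = urllib.error.URLError("connection refused")

print(is_timeout(wrapped), is_timeout(bare), is_timeout(other))
```

In real code the helper would be called inside an except (urllib.error.URLError, socket.timeout) as e: block.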

Response status code and headers:

import urllib.request

resp = urllib.request.urlopen('http://www.baidu.com')
print(resp.status)
print(resp.getheaders())  # a list of (name, value) tuples
print(resp.getheader('Server'))  # the header name is case-insensitive

200
[('Bdpagetype', '1'), ('Bdqid', '0xde1e34760008c473'), ('Cache-Control', 'private'), ('Content-Type', 'text/html;charset=utf-8'), ('Date', 'Fri, 17 Apr 2020 13:09:05 GMT'), ('Expires', 'Fri, 17 Apr 2020 13:08:46 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=C4333343AD5C44E296B0F47DAA3DF9BF:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=C4333343AD5C44E296B0F47DAA3DF9BF; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1587128945; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BAIDUID=C4333343AD5C44E240DB5BE377DDDB5E:FG=1; max-age=31536000; expires=Sat, 17-Apr-21 13:09:05 GMT; domain=.baidu.com; path=/; version=1; comment=bd'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=1; path=/'), ('Set-Cookie', 'H_PS_PSSID=31351_30969_1423_21126_31341_31270_30823_31163; path=/; domain=.baidu.com'), ('Traceid', '1587128945023877684216005287807132681331'), ('Vary', 'Accept-Encoding'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
BWS/1.1

Headers

To keep crawlers from overloading them, many sites only respond to requests that carry certain header fields. The most common one is User-Agent.

A simple example:

import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

Add headers to the request to customize what is sent to the site:

from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
data = {
    'name': 'zhaofan'
}
data = bytes(parse.urlencode(data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

A second way to add request headers (its advantage is that you can define a dictionary of headers and add them in a loop):

from urllib import request, parse

url = 'http://httpbin.org/post'
form_data = {  # renamed from dict to avoid shadowing the built-in
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
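
The loop mentioned for the second approach could look like the sketch below. The header dictionary custom_headers and its Accept entry are illustrative assumptions; the request is only built and inspected, not sent.

```python
from urllib import request

# Hypothetical set of headers to attach to every request.
custom_headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Accept": "application/json",
}

req = request.Request("http://httpbin.org/post", data=b"name=Germey", method="POST")
for name, value in custom_headers.items():
    req.add_header(name, value)

# Request stores header names in capitalized form, e.g. "User-agent".
print(req.header_items())
print(req.get_header("User-agent"))
```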

A third way:

import urllib.request
url = "http://blog.csdn.net"
# Header format: headers = ("User-Agent", <user agent string>)
headers = ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0")
opener = urllib.request.build_opener()
opener.addheaders = [headers]
print(opener.open(url).read().decode('utf-8'))

Advanced usage: handlers

Proxies - ProxyHandler

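A proxy can be configured with urllib.request.ProxyHandler(). Some sites track how often a given IP address visits within a period and block it when the count gets too high, so crawling such sites may require going through a proxy.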

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:8908',
    'https': 'https://127.0.0.1:8997'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
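
If every request should go through the proxy, the opener can be installed globally so that plain urllib.request.urlopen() calls use it as well. A minimal sketch, reusing the placeholder proxy address from the example above; no request is sent here.

```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({"http": "http://127.0.0.1:8908"})
opener = urllib.request.build_opener(proxy_handler)

# After this call, urllib.request.urlopen() goes through the opener above.
urllib.request.install_opener(opener)

# The handler passed in replaces the default ProxyHandler in the opener.
uses_proxy = proxy_handler in opener.handlers
print(uses_proxy)
print(proxy_handler.proxies)
```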

Cookies - HTTPCookieProcessor

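Cookies commonly hold login information, and crawling some sites requires sending cookies along with the request. The http.cookiejar module is used here to capture and store cookies.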

import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
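
To keep cookies between runs, one option is http.cookiejar.MozillaCookieJar, which can save to and load from a file. The sketch below is an assumption-laden illustration: the cookie is built by hand with made-up values (standing in for one a server would set), and the file lives in a temporary directory.

```python
import http.cookiejar
import os
import tempfile

cookie_path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

jar = http.cookiejar.MozillaCookieJar(cookie_path)
# Hand-built cookie with illustrative values; normally the server sets these.
jar.set_cookie(http.cookiejar.Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=False, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))
# Session cookies are skipped unless ignore_discard=True.
jar.save(ignore_discard=True, ignore_expires=True)

loaded = http.cookiejar.MozillaCookieJar(cookie_path)
loaded.load(ignore_discard=True, ignore_expires=True)
restored = {c.name: c.value for c in loaded}
print(restored)
```

A loaded jar can be passed to urllib.request.HTTPCookieProcessor in the same way as the CookieJar above.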

Exception handling (urllib.error)

When pages are fetched programmatically, some requests fail, for example with 404 or 500 errors.

urllib.error provides two main exceptions:
URLError and HTTPError. HTTPError is a subclass of URLError.

URLError has a single attribute, reason, so when catching it you can only print the error reason, as in the timeout example above.

HTTPError has three attributes: code, reason, and headers, all of which are available when catching it. For example:

import socket
from urllib import request,error

try:
    response = request.urlopen("http://pythonsite.com/111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
    if isinstance(e.reason, socket.timeout):  # e.reason can be inspected further
        print("time out")
else:
    print("request succeeded")
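
Because HTTPError is a subclass of URLError, the more specific exception has to be tested first, which is why the except clauses above are in that order. The sketch below shows the same ordering in a helper. The name describe is hypothetical, and the exceptions are constructed by hand for illustration; normally urlopen raises them.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Hypothetical helper: summarize a urllib exception as a string."""
    # Test HTTPError first: it is a subclass of URLError.
    if isinstance(exc, error.HTTPError):
        return "HTTP {}: {}".format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return "URL error: {}".format(exc.reason)
    return "other error: {}".format(exc)


# Hand-built exceptions with made-up URLs and messages.
not_found = error.HTTPError(
    "http://example.com/missing", 404, "Not Found", Message(), io.BytesIO(b"")
)
unreachable = error.URLError("no route to host")

print(issubclass(error.HTTPError, error.URLError))
print(describe(not_found))
print(describe(unreachable))
```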

URL parsing (urllib.parse)

urlparse(url) splits the given URL into its components.

urlunparse(parts) does the opposite of urlparse: it assembles a URL from its six components.

urljoin(url1, url2) joins two URLs. Where both supply a component, the second URL takes precedence over the first.

urlencode() converts a dictionary into URL query parameters.

import urllib.parse

# Split the given URL into its components
result = urllib.parse.urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result)
# A default scheme can also be specified:
result = urllib.parse.urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
# The parsed scheme is then the one you specified. If the URL already includes a scheme, the scheme argument has no effect.
print(result)

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=123', 'commit']
print(urllib.parse.urlunparse(data))

print(urllib.parse.urljoin('http://www.baidu.com', 'FAQ.html'))
print(urllib.parse.urljoin('http://www.baidu.com', 'https://pythonsite.com/FAQ.html'))
print(urllib.parse.urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))

params = {
    "name": "zhaofan",
    "age": 23,
}
base_url = "http://www.baidu.com?"
url = base_url + urllib.parse.urlencode(params)
print(url)

Result:

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
http://www.baidu.com/index.html;user?a=123#commit
http://www.baidu.com/FAQ.html
https://pythonsite.com/FAQ.html
https://pythonsite.com/FAQ.html
http://www.baidu.com?name=zhaofan&age=23
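
The pieces returned by urlparse are also available as attributes, and the query string can be turned back into a dictionary with parse_qs. The sketch below reuses the URL from the example above; the replacement query values are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

parts = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(parts.netloc, parts.path)

# parse_qs maps each key to a list of values.
query = parse_qs(parts.query)
print(query)

# ParseResult is a named tuple, so _replace returns a modified copy.
updated = parts._replace(query=urlencode({"id": 6, "page": 2}))
new_url = urlunparse(updated)
print(new_url)
```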

urllib.parse.quote (unquote)

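A URL may only contain ASCII characters, so characters outside ASCII, such as Chinese text, cannot be placed in a URL directly. Sometimes they are needed anyway, as in the Baidu search address https://www.baidu.com/s?wd=美食, where the wd parameter after the ? is the search keyword. The solution is to URL-encode such characters into a form that can be transmitted in a URL. In urllib, the quote() function does this.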

from urllib import parse

keyword = '美食'
print(parse.quote(keyword))  # %E7%BE%8E%E9%A3%9F

# To convert the encoded data back, use unquote().
print(parse.unquote('%E7%BE%8E%E9%A3%9F'))  # 美食
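
As a sketch of how this applies to the search address above, urlencode quotes the values for you when building a query string, and quote's safe parameter controls which characters are left alone. The sample string "a b/c" is made up for illustration.

```python
from urllib import parse

# urlencode percent-encodes the value while building the query string.
search_url = "https://www.baidu.com/s?" + parse.urlencode({"wd": "美食"})
print(search_url)  # https://www.baidu.com/s?wd=%E7%BE%8E%E9%A3%9F

# By default quote() leaves "/" unescaped; safe="" escapes it too.
print(parse.quote("a b/c"))           # a%20b/c
print(parse.quote("a b/c", safe=""))  # a%20b%2Fc

# quote_plus() encodes spaces as "+", as used in form data.
print(parse.quote_plus("a b/c"))      # a+b%2Fc
```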
