This section covers the networking facilities of the Python standard library.

1. urlopen

The main classes and functions live in the request.py module of the urllib package, which also supports access over SSL. Let's walk through the main classes and functions, starting with the source of urlopen:
def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
            *, cafile=None, capath=None, cadefault=False):
    global _opener
    if cafile or capath or cadefault:
        if not _have_ssl:
            raise ValueError('SSL support not available')
        context = ssl._create_stdlib_context(cert_reqs=ssl.CERT_REQUIRED,
                                             cafile=cafile,
                                             capath=capath)
        https_handler = HTTPSHandler(context=context, check_hostname=True)
        opener = build_opener(https_handler)
    elif _opener is None:
        _opener = opener = build_opener()
    else:
        opener = _opener
    return opener.open(url, data, timeout)
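Note the `_opener` caching in the source: once an opener is installed, later calls to urlopen() reuse it. A minimal sketch, with no custom handlers, just the default chain:

```python
import urllib.request

# build an opener with the default handler chain and make it the
# one that urlopen() will reuse for every subsequent call
opener = urllib.request.build_opener()
urllib.request.install_opener(opener)

# any handler-specific configuration (proxies, cookies, ...) added
# to this opener now applies to plain urlopen() calls as well
print(callable(opener.open))  # True
```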
Calling urlopen performs a web request directly; the key argument to pass is the URL of the resource you want:
import urllib.request

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    ResponseData = urllib.request.urlopen('http://www.baidu.com/robots.txt')
    strData = ResponseData.read()
    strShow = strData.decode('utf-8')
    if False:
        print(ResponseData.geturl())
    if False:
        print(ResponseData.info())
    else:
        print(ResponseData.__sizeof__())
    print(strShow)
    ResponseData.close()
    print('\nMain Thread Exit :', __name__)
Note that the code above decodes the body as UTF-8, matching the encoding the page is served in.

The output:
Main Thread Run : __main__
32
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
User-agent: *
Disallow: /
Main Thread Exit : __main__
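Rather than hard-coding 'utf-8', the charset can usually be read from the response's Content-Type header via get_content_charset(). A small helper, sketched here against a stub response (the stub is hypothetical; any object with headers and read() works, including the object urlopen() returns, whose headers behave like an email.message.Message):

```python
from email.message import Message

def decode_body(response, fallback='utf-8'):
    # query the charset declared in the Content-Type header, if any
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset)

# stub standing in for the object urlopen() returns
class FakeResponse:
    def __init__(self, body, content_type):
        self.headers = Message()
        self.headers['Content-Type'] = content_type
        self._body = body

    def read(self):
        return self._body

resp = FakeResponse('你好'.encode('gbk'), 'text/plain; charset=gbk')
print(decode_body(resp))  # 你好
```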
2. urlretrieve

The urlretrieve function fetches the content of a web page given its URL and stores it in a local file. It returns a tuple of two items: the first is the local file name, the second is the HTTP response header returned by the web server.
def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.
    ...
    """
Test code:
import urllib.request

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    data = urllib.request.urlretrieve('http://www.baidu.com/robots.txt', 'robots.txt')
    print('--filename--:', data[0])
    print('--response--:', data[1])
    print('\nMain Thread Exit :', __name__)
Output:
Main Thread Run : __main__
--filename--: robots.txt
--response--: Date: Mon, 22 Sep 2014 08:08:05 GMT
Server: Apache
P3P: CP=" OTI DSP COR IVA OUR IND COM "
Set-Cookie: BAIDUID=4FB847BEE916A0F72ABC5093271CD2BC:FG=1; expires=Tue, 22-Sep-15 08:08:05 GMT; max-age=31536000; path=/; domain=.baidu.com; version=1
Last-Modified: Thu, 17 Jul 2014 07:10:38 GMT
ETag: "91e-4fe5e56791780"
Accept-Ranges: bytes
Content-Length: 2334
Vary: Accept-Encoding,User-Agent
Connection: Close
Content-Type: text/plain
Main Thread Exit : __main__
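The third parameter, reporthook, is also worth a mention: urlretrieve calls it after each block is read, which makes a simple progress display easy. A sketch (the hook signature is real; the percentage logic is just one way to use it):

```python
def reporthook(block_num, block_size, total_size):
    # urlretrieve calls this after each block: number of blocks
    # transferred so far, block size in bytes, and total file size
    # (-1 when the server did not send Content-Length)
    if total_size > 0:
        percent = min(100, block_num * block_size * 100 // total_size)
        print('downloaded %d%%' % percent)
        return percent
    return None

# network call, shown for illustration only:
# urllib.request.urlretrieve('http://www.baidu.com/robots.txt',
#                            'robots.txt', reporthook)
```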
3. request_host

The request_host function extracts the host from the URL of a request. Its only parameter is a Request object instance (the Request class itself is introduced below). Here is the function's source:
def request_host(request):
    """Return request-host, as defined by RFC 2965.

    Variation from RFC: returned value is lowercased, for convenient
    comparison.

    """
    url = request.full_url
    host = urlparse(url)[1]
    if host == "":
        host = request.get_header("Host", "")

    # remove port, if present
    host = _cut_port_re.sub("", host, 1)
    return host.lower()
Test code:
import urllib.request

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    Req = urllib.request.Request('http://www.baidu.com/robots.txt')
    host = urllib.request.request_host(Req)
    print(host)
    print('\nMain Thread Exit :', __name__)
Output:
Main Thread Run : __main__
www.baidu.com
Main Thread Exit : __main__
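As the `_cut_port_re.sub` line in the source suggests, request_host also strips an explicit port and lowercases the result, which the following offline check confirms:

```python
import urllib.request

# explicit port and mixed case in the URL...
req = urllib.request.Request('http://WWW.Example.COM:8080/index.html')

# ...but request_host returns the bare, lowercased host
print(urllib.request.request_host(req))  # www.example.com
```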
4. The Request class

Next comes the module's main class, Request (note the capital R — don't get it wrong). First, the source:
class Request:

    def __init__(self, url, data=None, headers={},
                 origin_req_host=None, unverifiable=False,
                 method=None):
        self.full_url = url
        self.headers = {}
        self.unredirected_hdrs = {}
        self._data = None
        self.data = data
        self._tunnel_host = None
        for key, value in headers.items():
            self.add_header(key, value)
        if origin_req_host is None:
            origin_req_host = request_host(self)
        self.origin_req_host = origin_req_host
        self.unverifiable = unverifiable
        if method:
            self.method = method

    @property
    def full_url(self):
        if self.fragment:
            return '{}#{}'.format(self._full_url, self.fragment)
        return self._full_url

    @full_url.setter
    def full_url(self, url):
        # unwrap('<URL:type://host/path>') --> 'type://host/path'
        self._full_url = unwrap(url)
        self._full_url, self.fragment = splittag(self._full_url)
        self._parse()

    @full_url.deleter
    def full_url(self):
        self._full_url = None
        self.fragment = None
        self.selector = ''

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, data):
        if data != self._data:
            self._data = data
            # issue 16464
            # if we change data we need to remove content-length header
            # (cause it's most probably calculated for previous value)
            if self.has_header("Content-length"):
                self.remove_header("Content-length")

    @data.deleter
    def data(self):
        self.data = None

    def _parse(self):
        self.type, rest = splittype(self._full_url)
        if self.type is None:
            raise ValueError("unknown url type: %r" % self.full_url)
        self.host, self.selector = splithost(rest)
        if self.host:
            self.host = unquote(self.host)

    def get_method(self):
        """Return a string indicating the HTTP request method."""
        default_method = "POST" if self.data is not None else "GET"
        return getattr(self, 'method', default_method)

    def get_full_url(self):
        return self.full_url

    def set_proxy(self, host, type):
        if self.type == 'https' and not self._tunnel_host:
            self._tunnel_host = self.host
        else:
            self.type = type
            self.selector = self.full_url
        self.host = host

    def has_proxy(self):
        return self.selector == self.full_url

    def add_header(self, key, val):
        # useful for something like authentication
        self.headers[key.capitalize()] = val

    def add_unredirected_header(self, key, val):
        # will not be added to a redirected request
        self.unredirected_hdrs[key.capitalize()] = val

    def has_header(self, header_name):
        return (header_name in self.headers or
                header_name in self.unredirected_hdrs)

    def get_header(self, header_name, default=None):
        return self.headers.get(
            header_name,
            self.unredirected_hdrs.get(header_name, default))

    def remove_header(self, header_name):
        self.headers.pop(header_name, None)
        self.unredirected_hdrs.pop(header_name, None)

    def header_items(self):
        hdrs = self.unredirected_hdrs.copy()
        hdrs.update(self.headers)
        return list(hdrs.items())
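Two details of the source above are easy to miss: add_header() stores keys through str.capitalize(), so only the first letter stays uppercase, and get_method() defaults to POST exactly when data is set. Both can be checked offline:

```python
import urllib.request

req = urllib.request.Request('http://www.example.com/', data=b'k=v')
print(req.get_method())              # POST, because data is not None

req.add_header('User-Agent', 'test-agent')
# keys go through str.capitalize(), so the stored key is 'User-agent'
print(req.has_header('User-agent'))  # True
print(req.has_header('User-Agent'))  # False

req.data = None                      # clearing data...
print(req.get_method())              # ...switches the default to GET
```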
The class's constructor:

def __init__(self, url, data=None, headers={},
             origin_req_host=None, unverifiable=False,
             method=None):

The key parameters: url is the URL you want to access; data is the POST body to send; headers holds the extra header fields to include in the HTTP request; method selects the HTTP method (GET, POST, and so on). As get_method() above shows, the default is GET, switching to POST when data is supplied.

Creating a Request instance:

Req = urllib.request.Request('http://www.baidu.com/robots.txt')
For example, to add a User-Agent field to the request headers:

USER_AGENT = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
Req = urllib.request.Request(url='http://www.baidu.com/robots.txt', headers=USER_AGENT)
To change the default socket timeout:

import socket
socket.setdefaulttimeout(10)  # 10 s
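The global default set this way applies to every new socket; urlopen() also accepts a per-call timeout argument that overrides it for a single request. The current default can be inspected with getdefaulttimeout():

```python
import socket

socket.setdefaulttimeout(10)       # every new socket times out after 10 s
print(socket.getdefaulttimeout())  # 10.0

# per-call override, network call shown for illustration only:
# urllib.request.urlopen('http://www.baidu.com/robots.txt', timeout=5)
```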
5. Proxies

Next, using a proxy. The proxy configuration and address information must be set up before any web access call is made. Example code:
import socket
import urllib.request

socket.setdefaulttimeout(10)  # 10 s

if __name__ == '__main__':
    print('Main Thread Run :', __name__)
    proxy = urllib.request.ProxyHandler({'http': 'http://www.baidu.com:8080'})
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    content = urllib.request.urlopen('http://www.baidu.com/robots.txt').read()
    print('\nMain Thread Exit :', __name__)
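Note that the proxy address in the example above is only a placeholder; in practice the dict maps each URL scheme to a real proxy address. Construction and installation work offline (only the eventual request touches the network); the 127.0.0.1:8080 address below is likewise hypothetical:

```python
import urllib.request

# hypothetical local proxy, one entry per URL scheme
proxies = {'http': 'http://127.0.0.1:8080',
           'https': 'http://127.0.0.1:8080'}

proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)

# the handler keeps the mapping it was given
print(proxy_handler.proxies == proxies)  # True
```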
6. Error and exception handling

Exception handling for Python's network functions revolves around try and except blocks. One important guideline: wrap each statement that can fail in its own try/except, so you know exactly which call raised. Example:
import urllib.request
from urllib.error import HTTPError, URLError

# USER_AGENT as defined in section 4 above
try:
    reqUrl = urllib.request.Request(url='http://www.baidu.com/robots.txt',
                                    headers=USER_AGENT)
except ValueError:
    # Request() itself raises ValueError for a malformed URL; the
    # network errors below cannot occur until the request is sent
    print('ValueError: bad URL')

try:
    responseData = urllib.request.urlopen(reqUrl)
except HTTPError:
    # HTTPError must be caught first: it is a subclass of URLError,
    # which in turn is a subclass of OSError
    print('urllib.error.HTTPError')
except URLError:
    print('urllib.error.URLError')
except OSError:
    print('urllib.error.OSError')

try:
    pageData = responseData.read()
except OSError:
    # covers HTTPError and URLError as well, since both subclass it
    responseData.close()
    print('urllib.error.OSError')

print(pageData)
responseData.close()
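An HTTPError carries the status code and reason of the failed response, so a handler can report more than a class name. The hierarchy (HTTPError is a URLError, which is an OSError) can be exercised offline by constructing the exception directly; describe_error below is a hypothetical helper, not part of urllib:

```python
from urllib.error import HTTPError, URLError

def describe_error(err):
    # HTTPError must be tested first: it subclasses URLError,
    # which in turn subclasses OSError
    if isinstance(err, HTTPError):
        return 'HTTP %d: %s' % (err.code, err.reason)
    if isinstance(err, URLError):
        return 'URL error: %s' % err.reason
    return 'OS error: %s' % err

# constructed by hand here; urlopen() raises it for 4xx/5xx responses
err = HTTPError('http://www.example.com/', 404, 'Not Found', None, None)
print(describe_error(err))  # HTTP 404: Not Found
```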
7. Notes

The above covers the basic web-access functions and classes; many other functions and methods can achieve the same results, so call whichever suits your needs. Keep in mind these are just my study notes as a beginner, collected here so that newcomers (and my future self) can look things up.