The urllib2 Library: A Translation of the Official Documentation

莫利斯安

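Author: Michael Foord

Introduction:

urllib2 is a Python module for fetching URLs. It offers a very simple interface in the form of the urlopen function, which can fetch URLs using a variety of protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many URL schemes (identified by the string before the colon in the URL; for example, ftp is the scheme of ftp://python.org/) using their associated network protocols (such as FTP and HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations urlopen is very easy to use. But as soon as you run into errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. It is a technical document and not easy to read. This tutorial aims to illustrate how to use urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 documentation, but to supplement it.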

Fetching URLs:

The simplest way to use urllib2 is as follows:

import urllib2
response=urllib2.urlopen('http://www.python.org/')
html=response.read()

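Many uses of urllib2 are that simple (note that instead of an http: URL we could have used a URL starting with ftp:, file:, and so on). However, this tutorial concentrates on HTTP and aims to explain the more complicated cases.
HTTP is based on requests and responses: the client makes requests and the server sends responses. urllib2 mirrors this with a Request object that represents the HTTP request you are making. In its simplest form, you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL. The response is a file-like object, which means you can, for example, call read() on it. (Translator's note: for background on file-like objects, see duck typing and polymorphism.)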

import urllib2
url = 'http://www.voidspace.org.uk'  # the URL we want to fetch
req = urllib2.Request(url)  # create the request
# Next, pass the request to urlopen.
response = urllib2.urlopen(req)
the_page = response.read()  # read() returns the content fetched from the URL

Note that urllib2 uses the same Request interface to handle all URL schemes. For example, we can make an FTP request like so:

req=urllib2.Request('ftp://example.com/')

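In the case of HTTP, a Request object lets you do two extra things. First, you can pass data to be sent to the server. Second, you can pass extra information (metadata) about the data or about the request itself to the server. This information is sent as HTTP headers. Let's look at each of these in turn.

Data:

Sometimes you want to send data to a URL (often the URL refers to a CGI script or another web application). With HTTP, this is usually done with a POST request. This is what your browser often does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to send arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done not by urllib2 but by the urlencode function from the urllib library.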

import urllib
import urllib2
# The URL we want to fetch.
url = 'http://www.voidspace.com.uk/'
# A dict holding the form data.
dict_data = {'user_name': 'aibilim', 'password': 'xxxx', 'language': 'python'}
# Encode dict_data so that it can be passed to the Request.
data_pass = urllib.urlencode(dict_data)
# Passing the encoded data makes this a POST request.
req = urllib2.Request(url, data_pass)
# Now fetch the response.
response = urllib2.urlopen(req)
the_page = response.read()
# Print the page.
print the_page

Note that other encodings are sometimes required.
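To make the encoding step concrete, here is a minimal offline sketch of what urlencode produces. The field names and values are invented for illustration, and the fallback import from urllib.parse is an assumption added so that the snippet also runs on Python 3, where the function lives in a different module.

```python
try:
    from urllib import urlencode        # Python 2, as used in this article
except ImportError:
    from urllib.parse import urlencode  # Python 3 fallback (assumption)

# A list of pairs keeps the field order fixed; a dict works as well.
fields = [('user_name', 'aibilim'), ('language', 'Python 2'), ('q', 'a&b')]
encoded = urlencode(fields)
print(encoded)  # user_name=aibilim&language=Python+2&q=a%26b
```

Spaces become + and reserved characters such as & are percent-escaped, which is why the raw dict cannot be passed to Request directly.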

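If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have side effects: they change the state of the system in some way. Although the HTTP standard makes it clear that POSTs are intended to always cause side effects and GET requests never to cause them, nothing prevents a GET request from having side effects, or a POST request from having none. Data can also be passed in an HTTP GET request by encoding it in the URL itself.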

import urllib
import urllib2
url = 'http://www.baidu.com'
data_dict = {}  # an empty dict
# Add the query parameters to the dict.
data_dict['user_name'] = 'aibilim'
data_dict['password'] = 'xxxx'
data_dict['language'] = 'Python'
# Encode data_dict in the standard way.
data_pass = urllib.urlencode(data_dict)
# Print the URL before appending the encoded data.
print url
# Append the encoded data as a query string.
new_url = url + '?' + data_pass
# Make the request; no data argument is passed, so this is a GET.
req = urllib2.Request(new_url)
response = urllib2.urlopen(req)
the_page = response.read()
print new_url
print the_page
# Remember: the dict must be encoded before it is used in a POST or GET request.

As you can see, a ? is appended to the URL, followed by the encoded data, which together form a new URL.
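The rule that the data argument decides between GET and POST can be checked without sending anything. The sketch below only builds Request objects and inspects them; the URL is hypothetical, and the Python 3 import fallback is an assumption rather than part of the original article.

```python
try:
    import urllib2
    from urllib import urlencode
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2
    from urllib.parse import urlencode

base_url = 'http://www.example.com/search'  # hypothetical URL
query = urlencode([('language', 'Python')])

# No data argument: the parameters travel in the URL and the method is GET.
get_req = urllib2.Request(base_url + '?' + query)
# With a data argument the same parameters travel in the body as a POST.
post_req = urllib2.Request(base_url, query.encode('ascii'))

print(get_req.get_method())    # GET
print(get_req.get_full_url())  # http://www.example.com/search?language=Python
print(post_req.get_method())   # POST
```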

Translated on 2015-10-18 23:15:43. To be continued.

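Headers:

We will discuss one particular HTTP header here to illustrate how to add headers to your HTTP request. Some websites [2] dislike being browsed by programs, or send different versions to different browsers. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, for example Python-urllib/2.7), which may confuse the site or simply not work (as I understand it, the site does not answer our request). A browser identifies itself through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers. The following example makes the same request as above, but identifies our program as a web browser.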

import urllib
import urllib2
url = 'http://www.baidu.com'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
data_dict = {'user_name': 'aibilim', 'password': 'xxxx', 'language': 'Python'}
# Build the headers; the type is dict.
header = {'User-Agent': user_agent}
# Encode data_dict with urllib.urlencode.
data_pass = urllib.urlencode(data_dict)
# Now make the request with both data and headers.
req = urllib2.Request(url, data_pass, header)
response = urllib2.urlopen(req)
the_page = response.read()
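One detail that is easy to trip over when reading headers back from a Request: the object stores each header name via str.capitalize(), so 'User-Agent' is kept as 'User-agent'. The following offline sketch shows this; the URL is hypothetical and the Python 3 import fallback is an assumption.

```python
try:
    import urllib2
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2

ua = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
req = urllib2.Request('http://www.example.com/',  # hypothetical URL
                      headers={'User-Agent': ua})
req.add_header('Accept-Language', 'en')  # headers can also be added later

# Header names are stored capitalized, so look them up in that form.
print(req.has_header('User-agent'))   # True
print(req.get_header('User-agent'))   # Mozilla/5.0 (Windows NT 6.1; Win64; x64)
print(req.get_header('User-Agent'))   # None
print(sorted(req.header_items()))
```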

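The response object also has two useful methods. They are described in the section on info and geturl, which we will come to after looking at exception handling.

Handling Exceptions:

urlopen raises URLError when it cannot handle a response (built-in exceptions such as ValueError and TypeError may also be raised).
HTTPError is a subclass of URLError that is raised in the specific case of HTTP URLs.

URLError:

Often, URLError is raised because there is no network connection (or no route to the specified server), or because the specified server does not exist. In this case, the exception raised will have a reason attribute (as I understand it, it explains the error, hence the name), which is a tuple containing an error code and a text error message.
For example: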

import urllib2
req = urllib2.Request('http://www.xxxxx.com')
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    print e.reason
# The result on my machine:
# [Errno 11002] getaddrinfo failed

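P.S. The URL here differs from the one in the documentation: when I ran the example, the documentation's http://www.pretend_server.org link actually returned content instead of raising an exception, so I replaced it. Also, the documentation omits the urllib2 prefix in its calls.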

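HTTPError:

Every HTTP response from the server contains a numeric status code. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a redirection that asks the client to fetch the document from a different URL, urllib2 will handle that for you). For those it cannot handle, urlopen will raise an HTTPError. Typical errors include 404 (page not found), 403 (request forbidden), and 401 (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes.
The HTTPError instance raised will have an integer code attribute, which corresponds to the error sent by the server.

Error Codes

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes and their descriptions. It is reproduced here for convenience: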

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }

# Translator's note: the messages are self-explanatory, so they were not translated one by one.
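A small sketch of how this table might be used to turn a numeric code into a readable message. The helper name describe is hypothetical, and the fallback import from http.server is an assumption for Python 3, where the same dictionary lives on that module's BaseHTTPRequestHandler.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler  # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler     # Python 3 fallback (assumption)

def describe(code):
    """Return 'code short-message', or a placeholder for unknown codes."""
    short, _long = BaseHTTPRequestHandler.responses.get(code, ('Unknown', ''))
    return '%d %s' % (code, short)

for code in (200, 404, 503):
    print(describe(code))
```

An HTTPError's code attribute can be passed straight to such a helper when logging failures.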

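When an error is raised, the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response for the page returned (just as we used response for the fetched page earlier). This means that, as well as the code attribute, it also has read, geturl, and info methods.
Note that the reason attribute mentioned above belongs to URLError; do not confuse the two.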
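That an HTTPError doubles as a response object can be tried without a network connection by building one by hand. This is only an illustration under stated assumptions: in real use urlopen creates and raises the exception for you, the URL and page body here are invented, and the Python 3 import fallback is not part of the original article.

```python
import io
try:
    import urllib2
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2

# Hand-built error for illustration; normally urlopen raises it for you.
body = io.BytesIO(b'<html>page not found</html>')
err = urllib2.HTTPError('http://www.example.com/fish.html', 404,
                        'Not Found', {}, body)
try:
    raise err
except urllib2.HTTPError as e:
    status = e.code          # numeric status code
    page = e.read()          # the error page, read like any response
    final_url = e.geturl()   # the URL the error belongs to
print(status)  # 404
```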

import urllib2
request = urllib2.Request('http://www.python.org/fish.html')
try:
    response = urllib2.urlopen(request)
except urllib2.HTTPError as e:
    print e.code
    print e.read()
# Below is part of the result:
# 404
# <!doctype html>
# <!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
# <!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->

Solutions:

There are two basic approaches to handling HTTPError and URLError. I recommend the second.
1.

import urllib2
url = 'http://www.xxxxx.com'
req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    print e.code
    print 'we can not fulfill the request \n'
except urllib2.URLError as e:
    print e.reason
    print 'we can not reach a server'
else:
    print('No problem')

Note that HTTPError must be caught first. Because HTTPError is a subclass of URLError, an except URLError clause placed first would also catch every HTTPError.
2.

import urllib2
url = 'http://www.python.org/fish.html'
req = urllib2.Request(url)
try:
    response = urllib2.urlopen(req)
except urllib2.URLError as e:
    # Check 'code' first: only HTTPError has it, while later Python 2.7
    # releases also give HTTPError a 'reason' attribute.
    if hasattr(e, 'code'):
        print 'we can not fulfill the request '
        print 'The error code is %s' % e.code
    elif hasattr(e, 'reason'):
        print 'we can not reach a server '
        print 'The reason is %s' % e.reason
else:
    print('No problem')
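Why the order of the except clauses in the first approach matters can be checked offline. The sketch below raises hand-built exceptions and reports which clause caught them; the helper name classify, the URL, and the messages are hypothetical, and the Python 3 import fallback is an assumption.

```python
try:
    import urllib2
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2

def classify(exc):
    """Raise exc and report which except clause caught it."""
    try:
        raise exc
    except urllib2.HTTPError as e:   # the subclass must come first
        return 'HTTPError %d' % e.code
    except urllib2.URLError as e:    # then the general case
        return 'URLError %s' % e.reason

http_err = urllib2.HTTPError('http://www.example.com/', 404, 'Not Found', {}, None)
url_err = urllib2.URLError('no route to host')

print(issubclass(urllib2.HTTPError, urllib2.URLError))  # True
print(classify(http_err))  # HTTPError 404
print(classify(url_err))   # URLError no route to host
```

If the two clauses were swapped, the first clause would catch both exceptions and the HTTPError branch would never run.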

info and geturl

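The response returned by urlopen (or the HTTPError instance) has two useful methods: info() and geturl().
geturl: returns the real URL of the page fetched. This is useful because urlopen may have followed a redirect, so the URL of the page fetched may differ from the URL requested.
info: returns a dictionary-like object that describes the page fetched, in particular the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include Content-length, Content-type, and so on. See the Quick Reference to HTTP Headers for a concise listing of HTTP headers with brief explanations of their meaning and use.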
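Both methods can be tried without a network connection by fetching a local file through the file: scheme mentioned earlier. This is a rough sketch under several assumptions: the path is a POSIX-style absolute path (a Windows path would need converting to a URL first), the file contents are invented, and the Python 3 import fallback is not part of the original article. For a file: URL the headers are synthesized by urllib2 rather than sent by a server.

```python
import os
import tempfile
try:
    import urllib2
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2

# Write a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'hello urllib2')
os.close(fd)

# Assumes a POSIX-style absolute path such as /tmp/tmpabc.txt.
response = urllib2.urlopen('file://' + path)
fetched_url = response.geturl()      # the URL actually fetched
headers = response.info()            # dictionary-like header object
content_length = headers['Content-length']
body = response.read()
response.close()
os.remove(path)

print(fetched_url)
print(content_length)  # 13
```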

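Openers and Handlers

When you fetch a URL you use an opener (an instance of urllib2.OpenerDirector). Normally we use the default opener via urlopen, but you can create custom openers. Openers use handlers. All the heavy lifting is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, file), or how to handle one aspect of URL opening, such as HTTP redirections or HTTP cookies.
You will want to create your own opener if you want to fetch URLs with specific handlers installed, for example an opener that handles cookies, or an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector and then call add_handler(some_handler_instance) repeatedly to add handlers.
Alternatively, you can use build_opener, a convenience function that creates an opener object with a single call. build_opener adds several handlers by default, but also provides a quick way to add more handlers or override the default ones.
Other kinds of handlers deal with proxies, authentication, and other common but slightly specialised situations.
install_opener makes an opener object the global default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function. In fact, there is no need to call install_opener except as a convenience. (In other words, we can call open ourselves; there is no need to make the opener global and then call urlopen.)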
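The two ways of creating an opener described above can be compared side by side without fetching anything. The sketch below is illustrative only: it inspects the opener's handlers list, the choice of HTTPCookieProcessor as the extra handler is mine, and the Python 3 import fallback is an assumption.

```python
try:
    import urllib2
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2

# Route 1: start from an empty OpenerDirector and add handlers by hand.
manual = urllib2.OpenerDirector()
manual.add_handler(urllib2.HTTPHandler())
manual.add_handler(urllib2.HTTPRedirectHandler())

# Route 2: build_opener adds the default handlers plus any extras passed in.
built = urllib2.build_opener(urllib2.HTTPCookieProcessor())

manual_names = sorted(h.__class__.__name__ for h in manual.handlers)
built_names = sorted(h.__class__.__name__ for h in built.handlers)
print(manual_names)  # ['HTTPHandler', 'HTTPRedirectHandler']
print(built_names)
```

Either opener can then be used through its open method, or made the default with install_opener.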

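Basic Authentication:

To illustrate creating and installing a handler, we will use HTTPBasicAuthHandler. For a more detailed discussion of this subject, including an explanation of how Basic Authentication works, see the Basic Authentication Tutorial.
When authentication is required, the server sends a header (along with the 401 error code) requesting authentication. This specifies the authentication scheme and a realm. The header looks like: WWW-Authenticate: SCHEME realm="REALM".
For example: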

WWW-Authenticate: Basic realm="cPanel Users"

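The client should then retry the request, including the appropriate username and password for the realm as a header in the request.
This is Basic Authentication. To simplify this process, we can create an instance of HTTPBasicAuthHandler and an opener that uses this handler.
HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to usernames and passwords. If you know what the realm is (from the authentication header sent by the server), you can use an HTTPPasswordMgr. Frequently one does not care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm, which lets you specify a default username and password for a URL. These are supplied whenever you have not provided an alternative combination for a specific realm. We indicate this by passing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs deeper than the URL you pass to add_password() will also match.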

import urllib2
# Placeholder values for illustration; replace them with real ones.
username = 'user'
password = 'secret'
a_url = 'http://example.com/foo/bar.html'
# Create a password manager.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
# Use the opener to fetch a URL.
opener.open(a_url)
# Install the opener.
# Now all calls to urllib2.urlopen use the opener.
urllib2.install_opener(opener)

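Note: in the example above we only supplied our HTTPBasicAuthHandler to build_opener. By default, openers have the handlers for normal situations: ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, and HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the http: scheme component, the hostname, and optionally the port number), e.g. "http://example.com/", or an authority (that is, the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter includes a port number). The authority, if present, must not contain the userinfo component; for example, "joe:password@example.com" is not correct.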
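The matching rules for the password manager can be explored offline through its find_user_password method. In this sketch the credentials and URLs are placeholders, the realm name is borrowed from the header example above, and the Python 3 import fallback is an assumption.

```python
try:
    import urllib2
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2

mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# 'user' and 'secret' are placeholder credentials.
mgr.add_password(None, 'http://example.com/foo/', 'user', 'secret')

# A deeper URL under /foo/ matches, whatever realm the server announces.
deeper = mgr.find_user_password('cPanel Users',
                                'http://example.com/foo/bar/spam.html')
# A URL outside /foo/ does not match.
outside = mgr.find_user_password('cPanel Users', 'http://example.com/other/')
print(deeper)   # ('user', 'secret')
print(outside)  # (None, None)
```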

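Proxies:

urllib2 will auto-detect your proxy settings and use them. This is done through ProxyHandler, which is part of the normal handler chain. Normally that is a good thing, but occasionally it may be unhelpful. One way around this is to set up our own ProxyHandler with no proxies defined. The steps are similar to setting up the Basic Authentication handler above.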

proxy_support = urllib2.ProxyHandler({})
opener=urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)

Note: currently, urllib2 does not support fetching https locations through a proxy. However, this can be enabled by extending urllib2; see the recipe for details.
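To see what auto-detection picks up, and what an explicitly empty ProxyHandler holds instead, the following sketch may help. It assumes that proxies are detected from the http_proxy environment variable (as on typical Unix-like systems); the proxy address is hypothetical and set only for the duration of the snippet, and the Python 3 import fallback is an assumption.

```python
import os
try:
    import urllib2
    from urllib import getproxies
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2
    from urllib.request import getproxies

# Hypothetical proxy address, set only for the duration of this sketch.
saved = os.environ.get('http_proxy')
os.environ['http_proxy'] = 'http://proxy.example.com:3128'
try:
    detected = getproxies()   # what a default ProxyHandler would start from
finally:
    if saved is None:
        del os.environ['http_proxy']
    else:
        os.environ['http_proxy'] = saved

no_proxy_handler = urllib2.ProxyHandler({})  # explicit empty mapping
print(detected.get('http'))      # http://proxy.example.com:3128
print(no_proxy_handler.proxies)  # {}
```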

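Sockets and Layers:

The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.
As of Python 2.3, you can specify how long a socket should wait for a response before timing out. This can be useful in applications that have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set a global default timeout for all sockets.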

import urllib2
import socket
# Timeout in seconds.
time_out = 10
socket.setdefaulttimeout(time_out)

# This call to urllib2.urlopen now uses the default timeout
# that we have set in the socket module.
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
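Because the default timeout is global, it affects every socket created afterwards, not only those opened by urllib2. A minimal sketch of that behaviour, which restores the previous setting when done (the 10-second value simply mirrors the example above):

```python
import socket

previous = socket.getdefaulttimeout()  # None means "wait forever"
socket.setdefaulttimeout(10)           # seconds
try:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    new_socket_timeout = sock.gettimeout()  # inherits the global default
    sock.close()
finally:
    socket.setdefaulttimeout(previous)      # restore for the rest of the program
print(new_socket_timeout)  # 10.0
```

Restoring the old value in a finally block keeps the change from leaking into unrelated code.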

Footnotes

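This document was reviewed and revised by John Lee.
[1] For an introduction to the CGI protocol, see Writing Web Applications in Python.
[2] Google, for example.
[3] Browser sniffing is a very bad practice for website design; building sites using web standards is much more sensible. Unfortunately, many sites still send different versions to different browsers.
[4] The user agent for MSIE 6 is 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'.
[5] For details of HTTP headers, see Quick Reference to HTTP Headers.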

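Translator's Remarks:

As a complete beginner learning to write web crawlers, I found that every tutorial required knowing urllib2, so I worked through the documentation myself to understand the library better. My understanding may be badly off in places, and I would be grateful for corrections. Some terms, such as cookie and realm, are left untranslated because I have not found suitable words for them; I am leaving them as they are for now and may revise them once I understand networking better. The article may contain a number of typos, for which I apologize. Feel free to point them out in the comments so that I can correct them; thank you again. Finally, a small complaint: the CSDN editor is rather hard to use and falls well short of the 作业部落 editor.

          Finished on 2015-10-20 21:36:36, by 莫利斯安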
