Basics of Python's urllib and urllib2 libraries

Reposted from: http://blog.csdn.net/tianzhu123/article/details/7193408


urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which makes it possible to fetch URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations, such as basic authentication, cookies, proxies, and so on. These are handled by objects called openers and handlers.
The simplest way to fetch a URL is:

import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()

Much of urllib2's usage is this simple (note that instead of a URL starting with "http:" we could just as well have used one starting with "ftp:", "file:", etc.). The purpose of this tutorial, however, is to explain the more complicated cases involving HTTP. HTTP is based on requests and responses: the client makes a request and the server returns a response. urllib2 mirrors this with the Request object, which represents the HTTP request you are making. In its simplest form, you create a Request object specifying the URL you want to fetch; calling urlopen with this Request returns a response object for the requested URL. The response is a file-like object, which means you can, for example, call .read() on it:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()


Note that urllib2 uses the same Request interface to handle all URL schemes. For example, you can make an FTP request like this: req = urllib2.Request('ftp://example.com/')
For HTTP, the Request object allows you to do two extra things. First, you can send data to the server. Second, you can send extra information (metadata) to the server, either about the data itself or about the request itself; this information is sent as HTTP headers. Let's look at each of these in turn.
Sometimes you want to send data to a URL (often the URL refers to a CGI script or some other web application). With HTTP, this is usually done with a POST request; this is what your browser does when you submit a form you have filled in on the web. Not all POSTs come from HTML forms. The data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done with urllib, not urllib2.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
the_page = response.read()


If you do not pass a data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have side effects: they change the state of the system in some way (for example, by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Although the HTTP standard makes it clear that POSTs often cause side effects and GETs never do, nothing prevents a GET request from having side effects, or a POST request from having none. Data can also be encoded into the URL itself and sent via a GET request. This is done as follows:
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> response = urllib2.urlopen(full_url)

We will discuss one particular HTTP header here to illustrate how to add headers to your HTTP requests. Some websites dislike being browsed by certain programs, or return different versions to different browsers. By default, urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or simply not work. The way a browser identifies itself is through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer.
import urllib
import urllib2
url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

Responses also have two useful methods, info() and geturl(); we will look at them after first seeing what happens when things go wrong.




The following is from the official documentation: https://docs.python.org/2/library/urllib2.html#request-objects


urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])

Open the URL url, which can be either a string or a Request object.

data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format. The urllib2 module sends HTTP/1.1 requests with the Connection: close header included.
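The behaviour of urllib.urlencode() described above can be sketched directly. This is an illustrative snippet (the try/except import lets the same code run on Python 3, where the function lives in urllib.parse; a sequence of 2-tuples is used so the output order is deterministic, which a plain dict does not guarantee on Python 2):

```python
# Sketch of urllib.urlencode() with a sequence of 2-tuples.
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3 fallback

# A sequence of 2-tuples keeps the output order deterministic.
pairs = [('name', 'Somebody Here'), ('language', 'Python')]
encoded = urlencode(pairs)
print(encoded)   # name=Somebody+Here&language=Python
```

Note how the space in 'Somebody Here' is encoded as '+', per the application/x-www-form-urlencoded format.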

The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.

If context is specified, it must be an ssl.SSLContext instance describing the various SSL options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. More information can be found in ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

This function returns a file-like object with three additional methods:

  • geturl() — return the URL of the resource retrieved, commonly used to determine if a redirect was followed
  • info() — return the meta-information of the page, such as headers, in the form of a mimetools.Message instance (see Quick Reference to HTTP Headers)
  • getcode() — return the HTTP status code of the response.

Raises URLError on errors.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).
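The three extra methods on the returned object can be exercised without touching the public network by pointing urlopen() at a throwaway local HTTP server. This is only an illustrative sketch (the handler, payload, and header are made up for the example); the try/except imports let it run on Python 3 as well, where the modules are urllib.request and http.server:

```python
import threading

try:                                   # Python 2
    import urllib2
    from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler
except ImportError:                    # Python 3 fallback
    import urllib.request as urllib2
    from http.server import HTTPServer, BaseHTTPRequestHandler

class EchoHandler(BaseHTTPRequestHandler):
    """Minimal handler that answers every GET with 200 / 'hello'."""
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(b'hello')
    def log_message(self, fmt, *args):   # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), EchoHandler)   # port 0 = pick a free port
thread = threading.Thread(target=server.serve_forever)
thread.daemon = True
thread.start()

url = 'http://127.0.0.1:%d/' % server.server_port
response = urllib2.urlopen(url)

print(response.getcode())                    # 200
print(response.geturl())                     # the URL actually retrieved
print(response.info().get('Content-Type'))  # text/plain
body = response.read()                       # b'hello'

server.shutdown()
```

If the server had issued a redirect, geturl() would show the final URL rather than the one passed in, which is the usual reason for calling it.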

In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is installed by default and makes sure the requests are handled through the proxy.

class urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])

This class is an abstraction of a URL request.

url should be a string containing a valid URL.

data may be a string specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data; the HTTP request will be a POST instead of a GET when the data parameter is provided. data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.urlencode() function takes a mapping or sequence of 2-tuples and returns a string in this format.

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to "spoof" the User-Agent header value, which is used by a browser to identify itself — some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).

The final two arguments are only of interest for correct handling of third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to cookielib.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.

The following methods describe all of Request's public interface, and so all must be overridden in subclasses.

Request. add_data ( data )

Set the Request data to data. This is ignored by all handlers except HTTP handlers — and there it should be a byte string, and will change the request to be POST rather than GET.

Request. get_method ( )

Return a string indicating the HTTP request method. This is only meaningful for HTTP requests, and currently always returns 'GET' or 'POST'.
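A quick sketch of how get_method() ties in with the data argument (the URL is a placeholder; the try/except import provides a Python 3 fallback):

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback

# Without a data argument the request defaults to GET ...
req_get = urllib2.Request('http://www.example.com/')
print(req_get.get_method())    # GET

# ... and supplying data switches it to POST.
req_post = urllib2.Request('http://www.example.com/', b'name=Somebody')
print(req_post.get_method())   # POST
```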

Request. has_data ( )

Return whether the instance has a non-None data.

Request. get_data ( )

Return the instance’s data.

Request. add_header ( key, val )

Add another header to the request. Headers are currently ignored by all handlers except HTTP handlers, where they are added to the list of headers sent to the server. Note that there cannot be more than one header with the same name, and later calls will overwrite previous calls in case the key collides. Currently, this is no loss of HTTP functionality, since all headers which have meaning when used more than once have a (header-specific) way of gaining the same functionality using only one header.
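The "later calls will overwrite previous calls" behaviour can be seen directly. Note that add_header() normalises the key with str.capitalize(), which is why both spellings below collide; the header name is made up for the example, and the try/except import provides a Python 3 fallback:

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback

req = urllib2.Request('http://www.example.com/')
req.add_header('X-Token', 'first')
req.add_header('X-TOKEN', 'second')   # same key after .capitalize() -> overwrites

# Internally the key is stored as 'X-token', so only one header survives.
print(req.get_header('X-token'))   # second
print(req.header_items())
```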

Request. add_unredirected_header ( key, header )

Add a header that will not be added to a redirected request.

New in version 2.4.

Request. has_header ( header )

Return whether the instance has the named header (checks both regular and unredirected).

New in version 2.4.
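The difference between the two kinds of header, and the fact that has_header() sees both, can be sketched like this (the header names are made up; the try/except import provides a Python 3 fallback):

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback

req = urllib2.Request('http://www.example.com/')
req.add_header('X-Regular', 'resent on redirect')
req.add_unredirected_header('X-Once', 'dropped on redirect')

# has_header() checks both the regular and the unredirected headers.
# Keys are normalised with str.capitalize(), hence the spellings below.
print(req.has_header('X-regular'))   # True
print(req.has_header('X-once'))      # True
print(req.has_header('X-missing'))   # False
```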

Request. get_full_url ( )

Return the URL given in the constructor.

Request. get_type ( )

Return the type of the URL — also known as the scheme.

Request. get_host ( )

Return the host to which a connection will be made.

Request. get_selector ( )

Return the selector — the part of the URL that is sent to the server.
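The three accessors above (get_type(), get_host(), get_selector()) split a URL as shown below. They were removed in Python 3.4 in favour of the plain type, host, and selector attributes, so this sketch falls back to the attributes when the methods are missing; the URL is a made-up example:

```python
try:
    import urllib2                      # Python 2
except ImportError:
    import urllib.request as urllib2   # Python 3 fallback

req = urllib2.Request('http://www.example.com:8080/cgi-bin/app?x=1')

# Python 2 exposes these as methods; Python 3 as plain attributes.
scheme = req.get_type() if hasattr(req, 'get_type') else req.type
host = req.get_host() if hasattr(req, 'get_host') else req.host
selector = req.get_selector() if hasattr(req, 'get_selector') else req.selector

print(scheme)     # http
print(host)       # www.example.com:8080
print(selector)   # /cgi-bin/app?x=1
```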

Request. get_header ( header_name, default=None )

Return the value of the given header. If the header is not present, return the default value.

Request. header_items ( )

Return a list of tuples (header_name, header_value) of the Request headers.

Request. set_proxy ( host, type )

Prepare the request by connecting to a proxy server. The host and type will replace those of the instance, and the instance’s selector will be the original URL given in the constructor.

Request. get_origin_req_host ( )

Return the request-host of the origin transaction, as defined by RFC 2965. See the documentation for the Request constructor.

Request. is_unverifiable ( )

Return whether the request is unverifiable, as defined by RFC 2965. See the documentation for the Request constructor.



Also worth reading: http://cuiqingcai.com/947.html — Python Crawler Tutorial, Part 3: Basic Use of the Urllib Library
