Python Web Scraping Study Notes (Part 1)

weixin_39600366

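1. Introduction to urllib2

urllib2 is a Python 2 standard-library module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, and it can fetch URLs over a variety of protocols.

It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies, and proxies.

2. Fetching URLs

The simplest way to use urllib2 looks like this: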

    import urllib2

    # Open the URL and read the whole response body.
    response = urllib2.urlopen('http://python.org/')
    html = response.read()
    print html

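The output is the content of the fetched page.

urllib2 can fetch URLs of various schemes: 'http:' can be replaced with 'ftp:', 'file:', and so on. HTTP is based on requests and responses, and urllib2 mirrors this with a Request object that represents the HTTP request. In its simplest form, you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the requested URL. The response is a file-like object, which means you can call .read() on it: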

    import urllib2

    # Build the request first, then hand it to urlopen.
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
    the_page = response.read()
    print the_page

urllib2 uses the same Request interface for every URL scheme. For example, you can make an FTP request like this:

    req = urllib2.Request('ftp://example.com/')
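
The claim that other schemes work the same way can be checked without any network access by fetching a 'file:' URL. The following is a minimal sketch, not part of the original notes: the temporary file and its contents are made up for illustration, and the try/except import is an assumption that lets the same code run on Python 3, where urllib2 was split into urllib.request and urllib.error.

```python
import os
import tempfile

try:
    # Python 2, as used in these notes
    import urllib2
    from urllib import pathname2url
except ImportError:
    # Python 3: the same names live in urllib.request
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Create a small local file so the example needs no network access.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello from a local file\n')

# Turn the filesystem path into a file: URL.
file_url = 'file:' + pathname2url(path)

req = urllib2.Request(file_url)
response = urllib2.urlopen(req)
try:
    # The response is file-like, exactly as for an HTTP response.
    the_page = response.read()
finally:
    response.close()
    os.remove(path)

print(the_page)
```

Only the URL changes between schemes; the Request, urlopen, and read() calls stay the same.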

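3. Data

Sometimes you want to send data to a URL (often the URL refers to a CGI (Common Gateway Interface) script or another web application).

With HTTP, this is usually done with a POST request. This is what your browser typically does when you submit an HTML form that you have filled in.

Not all POSTs come from forms: you can use a POST to send arbitrary data to your own application.

In the common case of HTML forms, the data needs to be encoded in a standard way and then passed to the Request object as the data argument. The encoding is done with a function from the urllib library, not from urllib2.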

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.urlencode(values)     # encode the form fields
    req = urllib2.Request(url, data)    # supplying data makes this a POST
    response = urllib2.urlopen(req)
    the_page = response.read()

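If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side effects": they change the state of the system in some way.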
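
To see how the data argument selects the HTTP method, you can build two Request objects and ask each one which method it would use. This is a small sketch that sends nothing over the network; the import fallback for Python 3 and the .encode('ascii') call (Python 3 expects bytes for data) are additions that are not in the original notes.

```python
try:
    # Python 2, as used in these notes
    import urllib2
    from urllib import urlencode
except ImportError:
    # Python 3 equivalents
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://www.someserver.com/cgi-bin/register.cgi'

# No data argument: the request will be sent as GET.
get_req = urllib2.Request(url)

# With a data argument: the request will be sent as POST.
data = urlencode({'name': 'Michael Foord'}).encode('ascii')
post_req = urllib2.Request(url, data)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```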

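Although the HTTP standard intends POST requests to cause side effects and GET requests never to do so, nothing prevents a GET request from having side effects or a POST request from having none. Data can also be passed in an HTTP GET request by encoding it in the URL itself.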

    >>> import urllib2
    >>> import urllib
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.urlencode(data)
    >>> print url_values  # The order may differ.
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib2.urlopen(full_url)

The full URL is built by appending a ? to the URL, followed by the encoded values.
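
If the parameter order matters (for logging or tests, say), urlencode also accepts a sequence of (key, value) pairs and keeps their order. The sketch below builds the full URL that way and then decodes the query string again to confirm that nothing was lost. The round-trip check with parse_qs is an addition for illustration; the import fallback is an assumption so the code also runs on Python 3.

```python
try:
    # Python 2
    from urllib import urlencode
    from urlparse import urlparse, parse_qs
except ImportError:
    # Python 3
    from urllib.parse import urlencode, urlparse, parse_qs

# A list of pairs keeps the parameters in a fixed order.
pairs = [('name', 'Somebody Here'),
         ('location', 'Northampton'),
         ('language', 'Python')]

url_values = urlencode(pairs)
full_url = 'http://www.example.com/example.cgi' + '?' + url_values
print(full_url)
# http://www.example.com/example.cgi?name=Somebody+Here&location=Northampton&language=Python

# Decode the query string again; each value comes back as a list.
decoded = parse_qs(urlparse(full_url).query)
print(decoded['name'])
```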

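4. Headers

We will discuss one particular HTTP header here to illustrate how to add headers to your HTTP request. Some websites dislike being browsed by programs, or send different versions of their content to different browsers.

By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work.

A browser identifies itself through the User-Agent header. When you create a Request object, you can pass in a dictionary of headers.

The following example makes the same request as above, but identifies itself with a browser-style User-Agent string.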

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)   # third argument: header dict
    response = urllib2.urlopen(req)
    the_page = response.read()
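
Before sending a request it can be useful to check which headers it will actually carry. The following sketch inspects a Request object without opening a connection. The extra Accept-Language header is a hypothetical example that is not in the original notes, and the import fallback is an assumption so the code also runs on Python 3.

```python
try:
    # Python 2, as used in these notes
    import urllib2
except ImportError:
    # Python 3 equivalent
    import urllib.request as urllib2

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'

# Headers can be given at construction time...
req = urllib2.Request(url, headers={'User-Agent': user_agent})
# ...or added afterwards (hypothetical extra header for illustration).
req.add_header('Accept-Language', 'en')

# Request stores header names capitalized with str.capitalize(),
# so 'User-Agent' must be looked up as 'User-agent'.
print(req.get_header('User-agent'))
print(sorted(req.header_items()))
```

Note the capitalization when reading a header back: looking up 'User-Agent' with get_header returns None, because the stored key is 'User-agent'.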

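5. URLError

urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError and TypeError may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

URLError is often raised because there is no network connection (no route to the specified server), or because the specified server does not exist. In this case, the exception raised has a 'reason' attribute, which carries an error code and a text error message.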

    import urllib2

    req = urllib2.Request('http://www.pretend_server.org')
    try:
        urllib2.urlopen(req)
    except urllib2.URLError as e:
        # reason describes why the request could not be completed
        print e.reason

The output is:

    [Errno -2] Name or service not known
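
The same error-handling pattern can be wrapped in a small helper and tried out offline. In this sketch the helper name fetch is hypothetical, and a 'file:' URL that points at a missing file stands in for an unreachable server, since urlopen reports that failure as a URLError too. The import fallback is an assumption so the code also runs on Python 3.

```python
import os
import tempfile

try:
    # Python 2, as used in these notes
    import urllib2
    from urllib import pathname2url
except ImportError:
    # Python 3 equivalents
    import urllib.request as urllib2
    from urllib.request import pathname2url


def fetch(url):
    """Return (body, None) on success or (None, reason) on URLError."""
    try:
        response = urllib2.urlopen(url)
    except urllib2.URLError as e:
        return None, e.reason
    try:
        return response.read(), None
    finally:
        response.close()


# Point at a file that does not exist inside a fresh, empty directory.
missing_dir = tempfile.mkdtemp()
missing_path = os.path.join(missing_dir, 'no_such_file.txt')
missing_url = 'file:' + pathname2url(missing_path)

body, reason = fetch(missing_url)
os.rmdir(missing_dir)

print(reason)
```

Returning the reason instead of printing it inside the helper leaves the decision about retrying or logging to the caller.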

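6. HTTPError

Every HTTP response from the server contains a numeric "status code".

Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers take care of some of these responses for you (for example, if the response is a "redirection" that asks the client to fetch the document from a different URL, urllib2 handles that for you).

For those it cannot handle, urlopen raises an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).

The error codes are listed below: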

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }
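
A table of this shape ships with the standard library as BaseHTTPRequestHandler.responses, so you do not need to copy it into your own code. The sketch below looks up the short message for a status code. The helper name describe and the fallback text for unknown codes are hypothetical, and the import fallback is an assumption so the code runs on both Python 2 (BaseHTTPServer) and Python 3 (http.server).

```python
try:
    # Python 2
    from BaseHTTPServer import BaseHTTPRequestHandler
except ImportError:
    # Python 3
    from http.server import BaseHTTPRequestHandler

# Mapping of status code to (short message, long message).
responses = BaseHTTPRequestHandler.responses


def describe(code):
    """Return 'code short-message' for a status code, e.g. for logging."""
    short, _long = responses.get(code, ('Unknown', 'Unrecognized status code'))
    return '%d %s' % (code, short)


print(describe(404))  # 404 Not Found
print(describe(999))  # 999 Unknown
```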

When an error occurs, the server responds by returning an HTTP error code and an error page.

You can use the HTTPError instance as the response object for the page that was returned.

This means that, in addition to the code attribute, it also has the read, geturl, and info methods.

    import urllib2

    req = urllib2.Request('http://www.python.org/fish.html')
    try:
        urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        print e.code      # numeric status code
        print e.read()    # body of the error page

Running it prints:

    404
    ...
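
The response-like behaviour of HTTPError can be explored without contacting a server by constructing the exception by hand. In this sketch the function simulate_404 and the error-page body are made up for illustration; a real HTTPError is created by urlopen itself. The import fallback is an assumption so the code also runs on Python 3.

```python
import io

try:
    # Python 2, as used in these notes
    import urllib2
except ImportError:
    # Python 3: urllib.request re-exports HTTPError and URLError
    import urllib.request as urllib2


def simulate_404():
    """Raise an HTTPError the way urlopen would for a missing page."""
    raise urllib2.HTTPError(
        'http://www.python.org/fish.html',        # url
        404,                                      # code
        'Not Found',                              # msg
        {},                                       # headers (empty here)
        io.BytesIO(b'<html>error page</html>'))   # body of the error page


try:
    simulate_404()
except urllib2.HTTPError as e:
    code = e.code             # numeric status code
    body = e.read()           # the error page, read like a normal response
    failed_url = e.geturl()   # URL the error refers to
    e.close()

print(code)
print(body)
```

Because HTTPError is a subclass of URLError, a handler that needs to treat the two differently has to list the except clause for HTTPError first.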
