Steps for Writing a Crawler in Python: Python Web Crawlers (2), Writing Your First Crawler

Latest recommended article published on 2024-03-19 09:59:36

weixin_37988176

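Scraping a website's data usually starts with downloading its pages, a process called crawling. There are three common ways to crawl a website:

Crawling the sitemap

Iterating over the database ID of each page

Following the links on each page

To crawl a page, you first have to download it. The following uses Python's urllib2 module (available in Python 2 only) to download a URL.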

import urllib2

def download(url):
    html = urllib2.urlopen(url).read()
    return html

download('http://www.baidu.com')

The same code with exception handling:

import urllib2

def download(url):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        # Report the failure and return None instead of crashing
        print 'Download error:', e.reason
        html = None
    return html

download('http://www.baidu.com')
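The listings in this post target Python 2, where urllib2 exists. In Python 3 that module was split into urllib.request and urllib.error, so a rough equivalent of the function above might look like the sketch below. It assumes Python 3 and is not part of the original article; note that read() returns bytes there rather than str.

```python
import urllib.error
import urllib.request


def download(url):
    """Return the body of url as bytes, or None if the download fails."""
    print('Downloading:', url)
    try:
        return urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        # URLError also covers HTTPError, its subclass
        print('Download error:', e.reason)
        return None
```

A call such as download('http://www.baidu.com') would then return the raw page bytes, which can be decoded with the page's character encoding if text is needed.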

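1. Retrying downloads

Errors while downloading are common. Sometimes the server returns a 503 Service Unavailable error, and for this kind of error we can retry the download. If the server returns 404, the page does not currently exist, so retrying is pointless. 4xx errors mean something is wrong with the request, while 5xx errors mean something is wrong on the server side, so the download function should retry only on 5xx errors.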

The following code adds support for retrying downloads:

import urllib2

def download(url, num_retries=2):
    print 'Downloading:', url
    try:
        html = urllib2.urlopen(url).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            # Retry only on 5xx (server-side) errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retries-1)
    return html

download('http://httpstat.us/500')

This is the output:

Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error
Downloading: http://httpstat.us/500
Download error: Internal Server Error

After receiving the 500 error, the download function retried twice before giving up.
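To see the retry rule without depending on a live server, the sketch below separates the "is this error worth retrying?" decision from the fetching itself. It assumes Python 3; the names is_retryable, download_with_retries and always_fail are hypothetical helpers invented for this illustration, and the fetch argument stands in for whatever function actually performs the request.

```python
import urllib.error


def is_retryable(error):
    """Return True only for HTTP errors with a 5xx status code."""
    code = getattr(error, 'code', None)
    return code is not None and 500 <= code < 600


def download_with_retries(url, fetch, num_retries=2):
    """Call fetch(url); on a 5xx error, retry up to num_retries times."""
    try:
        return fetch(url)
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        if num_retries > 0 and is_retryable(e):
            return download_with_retries(url, fetch, num_retries - 1)
        return None


def always_fail(code, calls):
    """Hypothetical stub: build a fetch function that always raises HTTP `code`."""
    def fetch(url):
        calls.append(url)  # record every attempt
        raise urllib.error.HTTPError(url, code, 'stub error', None, None)
    return fetch


calls_500, calls_404 = [], []
download_with_retries('http://example.com/a', always_fail(500, calls_500))
download_with_retries('http://example.com/b', always_fail(404, calls_404))
print(len(calls_500), len(calls_404))
```

With the default num_retries=2, the 500 stub is called three times (one attempt plus two retries), while the 404 stub is called once, because a 4xx error is not retried.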

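2. Setting a user agent

By default, urllib2 downloads web content with the user agent Python-urllib/2.7. Some target websites block this default user agent, in which case it needs to be changed.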

import urllib2

def download(url, user_agent='wswp', num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    try:
        # Open the Request object so that the custom header is actually sent
        html = urllib2.urlopen(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            # Retry only on 5xx (server-side) errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, num_retries-1)
    return html

download('http://www.meetup.com/')
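The custom header only takes effect if the Request object, rather than the bare URL string, is what gets opened. The sketch below, which assumes Python 3's urllib.request and uses a hypothetical helper named build_request, builds such a request and inspects the header without touching the network.

```python
import urllib.request


def build_request(url, user_agent='wswp'):
    """Return a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={'User-agent': user_agent})


request = build_request('http://www.meetup.com/')
print(request.get_header('User-agent'))  # prints: wswp
```

Passing this request to urllib.request.urlopen would then send wswp as the user agent instead of the library default.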

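Sitemap crawler:

import re

def crawl_sitemap(url):
    # Download the sitemap file
    sitemap = download(url)
    # Extract the page URLs listed in the <loc> tags
    links = re.findall('<loc>(.*?)</loc>', sitemap)
    # Download each linked page
    for link in links:
        html = download(link)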
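A sitemap is an XML file in which each page URL sits inside a loc element, which is what makes a simple regular expression enough for extraction. The sketch below runs that extraction on a small inline sample so it can be tried offline; it assumes Python 3, and the sample content, the example.com URLs and the helper name extract_sitemap_links are made up for illustration.

```python
import re

# Hypothetical sitemap content, standing in for a downloaded sitemap.xml
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page/1</loc></url>
  <url><loc>http://example.com/page/2</loc></url>
</urlset>"""


def extract_sitemap_links(sitemap):
    """Return the URLs found between <loc> and </loc> tags."""
    return re.findall(r'<loc>(.*?)</loc>', sitemap)


print(extract_sitemap_links(SAMPLE_SITEMAP))
```

A regular expression keeps the example short; for sitemaps with unusual formatting, a real XML parser would probably be more robust.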

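ID iteration crawler:

import itertools

for page in itertools.count(1):
    url = 'http://blog.jobbole.com/%d' % page
    html = download(url)
    if html is None:
        # Stop at the first ID that fails to download
        break
    else:
        # Success: the page can be scraped here
        pass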
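One weakness of the loop above is that it stops at the first missing ID, even if later IDs exist. A possible refinement, which is an addition of this sketch and not something the original listing does, is to stop only after several consecutive failures. The code below assumes Python 3; crawl_by_id, max_errors and FAKE_SITE are hypothetical names, and the dictionary lookup stands in for a real download function.

```python
import itertools


def crawl_by_id(url_template, download, max_errors=1):
    """Download IDs 1, 2, 3, ... until max_errors consecutive misses."""
    pages = {}
    num_errors = 0
    for page in itertools.count(1):
        url = url_template % page
        html = download(url)
        if html is None:
            num_errors += 1
            if num_errors == max_errors:
                break  # too many consecutive misses: assume the end was reached
        else:
            num_errors = 0  # a success resets the counter
            pages[url] = html
    return pages


# Hypothetical site in which ID 3 has been deleted
FAKE_SITE = {
    'http://example.com/1': 'page one',
    'http://example.com/2': 'page two',
    'http://example.com/4': 'page four',
}

print(len(crawl_by_id('http://example.com/%d', FAKE_SITE.get)))
print(len(crawl_by_id('http://example.com/%d', FAKE_SITE.get, max_errors=2)))
```

With max_errors=1 the function behaves like the loop above and finds two pages; with max_errors=2 it skips over the single gap and finds all three.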

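Link crawler:

By following links, our crawler behaves more like an ordinary user. We could easily download every page of a site this way, but usually we are interested only in part of the information on a subset of its pages, so we can use regular expressions to decide which pages to download.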
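As a rough sketch of this idea, the crawler below starts from a seed URL, extracts the links on each downloaded page, and queues only those that match a regular expression. It assumes Python 3; link_crawler, LINK_RE and FAKE_SITE are hypothetical names, the href pattern is a simplification that will miss unusual markup, and the dictionary lookup stands in for a real download function.

```python
import re
from urllib.parse import urljoin

# Simplified pattern for the href value of <a> tags
LINK_RE = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)


def link_crawler(seed_url, link_regex, download):
    """Follow links from seed_url whose absolute URL matches link_regex."""
    queue = [seed_url]
    seen = {seed_url}  # avoid downloading the same URL twice
    pages = {}
    while queue:
        url = queue.pop()
        html = download(url)
        if html is None:
            continue
        pages[url] = html
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)  # resolve relative links
            if re.search(link_regex, link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical three-page site plus an /about link we do not want
FAKE_SITE = {
    'http://example.com/': '<a href="/view/1">one</a> <a href="/about">about</a>',
    'http://example.com/view/1': '<a href="/view/2">two</a> <a href="/">home</a>',
    'http://example.com/view/2': '<a href="/view/1">back</a>',
}

pages = link_crawler('http://example.com/', r'/view/', FAKE_SITE.get)
print(sorted(pages))
```

Here only the seed page and the two /view/ pages are downloaded; the /about link is skipped because it does not match the pattern, and the seen set keeps the crawler from looping between pages that link to each other.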
