The previous article covered how to scrape the content of a single blog post; this article explains how to crawl an entire listing page of posts (one page carries many posts; a Sina blog listing page holds at most 50).
All we need to do is wrap the earlier code in a loop: given the link of the first listing page, extract all the post links from it, then download each one.
# -*- coding: utf-8 -*-
import urllib
import time

url = [''] * 50
# Read the first listing page (a Sina blog listing page holds at most 50 posts)
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
#print con
i = 0
title = con.find(r'<a title=')
#print 'title', title
href = con.find(r'href=', title)
#print 'href', href
html = con.find(r'.html', href)
while title != -1 and href != -1 and html != -1 and i < 50:
    # The URL sits between href=" (6 characters) and the end of .html (5 characters)
    url[i] = con[href + 6:html + 5]
    print url[i]
    # Continue searching after the current match
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
else:
    print 'find end!'
j = 0
while j < 50:
    content = urllib.urlopen(url[j]).read()
    # The last 26 characters of the URL serve as the local file name
    open(r'hanhan/' + url[j][-26:], 'w+').write(content)
    print 'downloading', url[j]
    j = j + 1
    #time.sleep(1)
else:
    print 'download article finished!'
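
For comparison, the same links can be pulled out with a single regular expression instead of chained find calls. This is just a minimal sketch, assuming each post link on the listing page appears in the form <a title="..." href="http://....html">:

# -*- coding: utf-8 -*-
# Alternative sketch: extract the post links with one regular expression
# instead of chained str.find calls. Assumes each post link on the
# listing page looks like <a title="..." href="http://....html">.
import re
import urllib

page = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
# Capture the quoted URL that follows href= and ends with .html
links = re.findall(r'<a title=[^>]*href="(http://[^"]+\.html)"', page)
for link in links:
    print link

To crawl more than one listing page, either version could be wrapped in an outer loop over the page number, assuming the trailing digit in articlelist_1191258123_0_1.html is the page index; enabling the commented-out time.sleep(1) is also a good idea so the downloads don't hammer the server.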
The resulting page: