Crawling All of Han Han's Sina Blog Posts with Python

AC_Dreameng

Published 2016-03-07 22:45:10

Category: Python. Tags: Python crawler, crawling all of Han Han's blog posts with Python

Article link: https://blog.csdn.net/hurmishine/article/details/50822928

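This article shows how to use a Python crawler to loop over every article-list page of Han Han's Sina blog and fetch all of the posts linked from each page.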

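This continues the previous post, where we crawled the posts on the first list page starting from that page's link. It is easy to see that the links to the list pages differ in only one place: the page number. So we only need to wrap the previous post's code in an outer loop to crawl the posts on every list page, which is to say all of the posts.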
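As a rough illustration of that idea, here is a minimal Python 3 sketch (the listing in this post is written for Python 2). The URL pattern and the `<a title=` / `href=` / `.html` search are taken from the post; the function names `list_page_url`, `extract_article_links`, `fetch_text`, and `collect_all_links` are hypothetical, and decoding the pages as UTF-8 is an assumption about the site rather than something the post states.

```python
# Python 3 sketch; the original post targets Python 2 (urllib.urlopen).
from urllib.request import urlopen

# URL pattern taken from the post; only the page number changes.
LIST_URL = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_{page}.html'


def list_page_url(page):
    """Return the article-list URL for a 1-based page number."""
    return LIST_URL.format(page=page)


def extract_article_links(html):
    """Collect article URLs with the same string search the post uses."""
    links = []
    title = html.find('<a title=')
    while title != -1:
        href = html.find('href=', title)
        end = html.find('.html', href) if href != -1 else -1
        if href == -1 or end == -1:
            break
        # Skip 'href=' plus the opening quote (6 chars); keep '.html' (5 chars).
        links.append(html[href + 6:end + 5])
        title = html.find('<a title=', end)
    return links


def fetch_text(url):
    """Download a page; UTF-8 decoding is an assumption, not from the post."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def collect_all_links(last_page=7, fetch=fetch_text):
    """Loop over pages 1..last_page and gather every article link."""
    all_links = []
    for page in range(1, last_page + 1):
        all_links.extend(extract_article_links(fetch(list_page_url(page))))
    return all_links
```

Passing `fetch` as a parameter is only a convenience so the loop can be tried on a saved HTML string without touching the network. The post's own listing follows.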

# -*- coding: utf-8 -*-
# Python 2 code: urllib.urlopen was moved to urllib.request.urlopen in Python 3.
import urllib
import time
url = [' ']*350
page = 1
link = 1
while page <= 7:  # there are currently 7 pages in total
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_'+str(page)+'.html').read()
    i = 0
    title = con.find(r'<a title=')
    href = con.find(r'href=',title)
    html = con.find(r'.html',href)
    while title != -1 and href != -1 and html != -1 and i<350:
        url[i] = con[href + 6:html + 5]

        content = urllib.urlopen(url[i]).read()
        open(r'allbok