Scraping Han Han's Sina Blog Articles with Python 3

Rango_lhl

Category: Python. Tags: python, web crawler

Article link: https://blog.csdn.net/Rango_lhl/article/details/51026668

This script uses Python 3 to find the link to every article on Han Han's Sina blog and to save the body of each article as an HTML file.

# -*- coding: utf-8 -*-
import urllib.request
import re

url = [''] * 350
# A list of 350 empty strings that will hold the link to each blog post
i = 0
page = 1
while page < 8:
    # The article list spans 7 pages, so visit each page in turn
    content = urllib.request.urlopen("http://blog.sina.com.cn/s/articlelist_1191258123_0_" + str(page) + ".html").read()
    # An int cannot be concatenated to a str directly; str() converts it first
    content = content.decode('utf-8')
    # read() returns bytes; decode to str so that find() accepts str patterns
    title = content.find(r'<a title=')
    href = content.find(r'href', title)
    # Once title is found, keep searching from that position onward
    html = content.find(r'html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = content[href + 6:html + 4]
        # Slice out the link: skip the 6 characters of 'href="' and keep the 4 of 'html'
        print(url[i])
        title = content.find(r'<a title=', html)
        href = content.find(r'href', title)
        html = content.find(r'html', href)
        i = i + 1
    else:
        print(str(page) + ' find end!')
    page = page + 1
else:
    print('All Find End')

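# With all article links collected above, download each one and save its body
import os
os.makedirs('hanhan', exist_ok=True)
# Make sure the output directory exists before writing to it
j = 0
while j < 350 and len(url[j]) > 0:
    # There are actually fewer than 350 articles, so stop at the first empty link
    con = urllib.request.urlopen(url[j]).read()
    con = con.decode("utf-8")
    title = re.findall(r'<title>.*</title>', con)
    # Extract the title
    title = re.sub('<title>|</title>', '', title[0])
    # Strip the <title></title> tags; findall returns a list, so title[0] gives re.sub the str it needs
    title = title.replace('*', '')
    # One article title contains '*', which raises an error when saving, so remove it
    text = re.findall(r'<!-- 正文开始 -->.*<!-- 正文结束 -->', con, re.DOTALL)[0]
    # Grab everything between the body-start (正文开始) and body-end (正文结束) comment markers
    with open(r'hanhan/' + str(j) + title + '.html', 'w', encoding='utf-8') as f:
        # Write as UTF-8 so the result does not depend on the platform default encoding
        f.write(text)
    print('downloading', url[j])
    j = j + 1
else:
    print('download finish')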
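
The script above removes only `*` from titles, but other characters can also be rejected in file names (on Windows, for example, `\ / : * ? " < > |`). Below is a minimal sketch of a more general cleanup. The helper name `safe_filename` is hypothetical and not part of the original script.

```python
import re

# Hypothetical helper, not part of the original script.
# Characters that Windows rejects in file names, plus line breaks and tabs.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\r\n\t]')

def safe_filename(title, replacement=''):
    """Return title with characters that are invalid in file names replaced."""
    cleaned = _INVALID_CHARS.sub(replacement, title)
    # Leading/trailing spaces and trailing dots are also troublesome on Windows.
    return cleaned.strip().rstrip(' .')

print(safe_filename('a*b?c: d'))
```

In the download loop, `title = safe_filename(title)` could then stand in for the `title.replace('*', '')` call.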

# Below is the official help text for Python's find method; start/end are optional, and it returns -1 when the substring is not found
Help on method_descriptor:

find(...)
    S.find(sub[, start[, end]]) -> int

    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.

    Return -1 on failure.
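
To illustrate how the optional `start` argument drives the link-extraction loop, here is a small offline sketch. The HTML snippet and the `example.com` addresses are made up for illustration and are not taken from the real blog pages.

```python
# Made-up sample standing in for one article-list page (an assumption).
sample = (
    '<a title="first" target="_blank" href="http://example.com/a_01.html">first</a>'
    '<a title="second" target="_blank" href="http://example.com/b_02.html">second</a>'
)

links = []
title = sample.find('<a title=')
while title != -1:
    # Search for href only after the title position, then for the
    # 'html' suffix only after the href position.
    href = sample.find('href', title)
    html = sample.find('html', href)
    if href == -1 or html == -1:
        break
    # 'href="' is 6 characters long; 'html' is 4 characters long.
    links.append(sample[href + 6:html + 4])
    # Resume the search after the link that was just extracted.
    title = sample.find('<a title=', html)

print(links)
print(sample.find('no such text'))  # find() returns -1 when nothing matches
```

The loop in the script follows the same pattern, with the decoded page content in place of `sample`.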