Saving CSDN Blog Posts as Local Files

网络安全打工人

Article link: https://blog.csdn.net/kevinhanser/article/details/88936860

2019-03-31 21:49:39 [Original]

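1. Environment

I recently noticed that none of the externally hosted images in my CSDN blog posts load anymore. The image host used throughout the blog is https://i.imgur.com/, with image links of the form https://i.imgur.com/WxbYfSu.png. The images were uploaded by the markdown editor I used to write the posts.

At the moment i.imgur.com cannot be reached from IP addresses inside China; I don't know why.

Since the images cannot be loaded from a domestic IP, I decided to go through an overseas proxy and save them locally. When I have time I will write another script to batch-replace the links, because there are quite a lot of posts.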

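There are a few shortcomings:

CSS and JS files are not downloaded. Adding this would be simple: the regular expressions in the code can be copied and extended.
The JS and CSS referenced by the saved HTML file are still loaded from the network, so offline you see only the article content, without the page styling.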
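
As a rough idea of how the regular expressions could be extended to stylesheets and scripts, the sketch below collects their URLs from a page. The helper name find_assets and the sample HTML are made up for illustration and are not part of the script in this post; regexes like these only cope with simple, well-formed tags.

```python
import re

# Hypothetical patterns, not taken from the original script.
CSS_RE = re.compile(r'<link[^>]+href="([^"]+\.css[^"]*)"', flags=re.I)
JS_RE = re.compile(r'<script[^>]+src="([^"]+\.js[^"]*)"', flags=re.I)

def find_assets(html):
    """Return (css_urls, js_urls) found in the page source."""
    return CSS_RE.findall(html), JS_RE.findall(html)

# Made-up sample markup, only to show the call.
sample = (
    '<link rel="stylesheet" href="https://example.com/a.min.css">'
    '<script src="https://example.com/main.js"></script>'
)
css_urls, js_urls = find_assets(sample)
print(css_urls)
print(js_urls)
```

Each URL found this way could then be downloaded and rewritten to a local path in the same way as the images.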

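Enough preamble. The environment is Python 3. The only third-party library the code certainly needs is requests; I have forgotten whether Pillow is required as well.

pip install requests Pillow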

2. Step-by-step breakdown

The URLs are stored in a file (kali.txt in the code below), one per line.
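
A minimal sketch of reading such a URL file while skipping blank lines. The helper name read_urls and the throwaway demo file are assumptions made for illustration; the script itself simply calls readlines() on kali.txt.

```python
import os
import tempfile

def read_urls(path):
    """Return the non-empty lines of the URL file, stripped of whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Small demonstration with a temporary file instead of the real kali.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "urls.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("https://blog.csdn.net/kevinhanser/article/details/88936860\n\n")
    urls = read_urls(demo_path)
print(urls)
```

Dropping blank lines up front means an empty line at the end of the file is never passed to the download step.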

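The getHtml() function fetches a URL through the proxy:

 def getHtml(url):
     # Assumes an HTTP proxy listening locally on port 1080
     proxies = {"https": "http://127.0.0.1:1080"}
     # Without a timeout a stalled connection would block forever
     response = requests.get(url, proxies=proxies, timeout=60)
     return response.text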

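The getImg() function cleans up the page source, saves the article as an HTML file, and downloads its images. It starts by filtering out unwanted markup:

 # Strip the hidden auto-redirect block
 pattern1 = re.compile(r'<div style="display:none;">(.*?)</div>', flags=re.S)
 match1 = pattern1.sub('', body)
 #print(match1)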

The page title is used as the file name and also as the name of the image directory:

 # Use the page title as the file name
 title = ''
 title_list = re.findall('<title>(.*?) - kevinhanser - CSDN博客</title>', match4)
 #print(title_list)
 title = str(title_list[0])
 print(title)
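
Because the title ends up in file and directory names, characters such as / or : in a title can make the later open() or os.makedirs() call fail. The sketch below shows one way to guard against that. The helper safe_title is hypothetical and not part of the script, and replacing the characters with an underscore is an arbitrary choice.

```python
import re

def safe_title(title):
    """Replace characters that are not allowed in Windows or POSIX file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title)
    # Windows also rejects names that end in a dot or a space.
    return cleaned.strip().rstrip('.') or 'untitled'

print(safe_title('kali: nmap/基础?'))
```

The cleaned value would be used wherever the script builds a path from title.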

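The image links in the body are then replaced with local image paths, so the images also load when the article is viewed locally:

 # Locate the images in the body
 #match5 = pattern5.findall(match4)
 pattern5 = re.compile(r'<p>(.*)i\.imgur\.com')
 # Point the image links in the body at the local copies (./image/ + title)
 match5 = pattern5.sub('<img src="'+'./image/'+title, match4)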
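
The pattern above replaces everything from <p> up to the host name, which assumes one image per line. As an alternative sketch (a different technique from the one used in this post), the full image URL can be matched and only the URL rewritten, which also copes with several images on one line. The function name localize_images is made up for illustration.

```python
import re

# Matches links of the form https://i.imgur.com/WxbYfSu.png
IMGUR_RE = re.compile(r'https://i\.imgur\.com/([A-Za-z0-9]+)\.png')

def localize_images(html, title):
    """Rewrite imgur PNG links to ./image/<title>/<name>.png and return the names."""
    names = IMGUR_RE.findall(html)
    # A function is used as the replacement so that backslashes in the title
    # are not interpreted as regex escapes.
    local_html = IMGUR_RE.sub(
        lambda m: './image/' + title + '/' + m.group(1) + '.png', html)
    return local_html, names

sample = ('<p><img src="https://i.imgur.com/WxbYfSu.png">'
          '<img src="https://i.imgur.com/AbCd123.png"></p>')
local_html, names = localize_images(sample, 'demo')
print(local_html)
print(names)
```

The returned names can feed the download loop directly, so the page is scanned with a single pattern.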

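The for loop takes each image name matched by the regular expression, builds the full image URL, downloads it, and saves it under the same file name:

 # Download img_url and save it locally as img_path
 if not os.path.exists(img_path):
     # Read the data first so a failed download leaves no empty file behind
     with urllib.request.urlopen(img_url, timeout=120) as response:
         data = response.read()
     with open(img_path, 'wb') as f_save:
         f_save.write(data)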
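
Note that urllib.request.urlopen() does not see the proxies dictionary passed to requests in getHtml(); it only picks up proxies from the system or environment settings. The sketch below shows one way to send the image download through an explicit proxy with the standard library. The helper names make_opener and save_image are hypothetical, and the proxy address is assumed to be the same local HTTP proxy as above.

```python
import os
import urllib.request

def make_opener(proxy_url=None):
    """Build a urllib opener; with proxy_url set, http(s) requests go through it."""
    handlers = []
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler(
            {"http": proxy_url, "https": proxy_url}))
    return urllib.request.build_opener(*handlers)

def save_image(opener, img_url, img_path, timeout=120):
    """Download img_url to img_path unless the file already exists."""
    if os.path.exists(img_path):
        return False
    os.makedirs(os.path.dirname(img_path) or ".", exist_ok=True)
    # Read everything before opening the target file, so a failed
    # download does not leave an empty file that later runs would skip.
    with opener.open(img_url, timeout=timeout) as response:
        data = response.read()
    with open(img_path, "wb") as f_save:
        f_save.write(data)
    return True

# Example call (assumed proxy address; not executed here):
# opener = make_opener("http://127.0.0.1:1080")
# save_image(opener, "https://i.imgur.com/WxbYfSu.png", "./image/demo/WxbYfSu.png")
```

Passing the opener in as an argument keeps the proxy setting in one place and makes the function easy to try out against a local file URL.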

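3. The code

# coding=utf-8
# Python 3
import os               # file system helpers
import re               # regular expressions
import time
import urllib.request   # used to download the images
import requests         # used to fetch the article pages through the proxy
from PIL import Image   # Image is not used below; the import is kept from the original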

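def getHtml(url):
    # Assumes an HTTP proxy listening locally on port 1080
    proxies = {"https": "http://127.0.0.1:1080"}
    # Without a timeout a stalled connection would block forever
    response = requests.get(url, proxies=proxies, timeout=60)
    return response.text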

def getImg(body):
    # Strip the hidden auto-redirect block
    pattern1 = re.compile(r'<div style="display:none;">(.*?)</div>', flags=re.S)
    match1 = pattern1.sub('', body)
    #print(match1)

    # Strip the recommendations and ads that follow the article
    pattern2 = re.compile(r'</article>(.*)</script>', flags=re.S)
    match2 = pattern2.sub('</div>', match1)
    #print(match2)

    # Strip the CSDN banner at the top of the page
    pattern2_tmp = re.compile(r'b site By baidu end\n    </script>\n    (.*?)    <script src="https://csdnimg.cn/rabbit/exposure-click/main-1.0.6.js"></script>', flags=re.S)
    match2_tmp = pattern2_tmp.sub('', match2)
    #print(match2_tmp)

    # Drop the skin stylesheet so the page uses the full width
    pattern3 = re.compile(r'https://csdnimg.cn/release/phoenix/themes/skin-yellow/skin-yellow-2eefd34acf.min.css', flags=re.S)
    match3 = pattern3.sub('', match2_tmp)
    #print(match3)

    # Rename the container class so the page is not centered at a fixed width
    pattern4 = re.compile(r'container', flags=re.S)
    match4 = pattern4.sub('container_bak', match3)
    #print(match4)
    #container

    # Use the page title as the file name
    title = ''
    title_list = re.findall('<title>(.*?) - kevinhanser - CSDN博客</title>', match4)
    #print(title_list)
    title = str(title_list[0])
    print(title)

    # In Python 2, .decode('string_escape') was needed to get the Chinese text
    #title = str(title_list[0]).decode('string_escape')

    # Collect the image names
    picture_list = re.findall('https://i.imgur.com/(.*?).png', match4)
    picture = picture_list
    #print picture

    # Locate the images in the body
    #match5 = pattern5.findall(match4)
    pattern5 = re.compile(r'<p>(.*)i\.imgur\.com')
    # Point the image links in the body at the local copies (./image/ + title)
    match5 = pattern5.sub('<img src="'+'./image/'+title, match4)

    # Write the body to a file, encoded as UTF-8
    with open(title+'.html', 'w', encoding='utf8') as f:
        f.write(match5)
    
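    # Report posts that contain no images
    if len(picture) == 0:
        print("%s contains no images" % title)

    # Otherwise save the images, using the title as the directory name
    for i in range(len(picture)):
        if not os.path.exists('./image/'+title):
            os.makedirs('./image/'+title)

        #print picture[i]
        img_path = './image/'+title+'/'+picture[i]+'.png'
        img_url = 'https://i.imgur.com/'+picture[i]+'.png'
        print("%s starting..." % img_url)
        #print(img_path)

        # Download img_url and save it locally as img_path.
        # urlopen() does not use the proxies dict from getHtml(); it relies
        # on the system or environment proxy settings.
        if not os.path.exists(img_path):
            # Read the data first so a failed download leaves no empty file behind
            with urllib.request.urlopen(img_url, timeout=120) as response:
                data = response.read()
            with open(img_path, 'wb') as f_save:
                f_save.write(data)

            # Requests sent too quickly raise errors; the caller retries
            #time.sleep(15)

        print(picture[i]+'.png'+" saved")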


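if __name__ == '__main__':

    # kali.txt holds the URLs of the CSDN posts, one per line
    with open("kali.txt", "r") as f:
        URL_list = f.readlines()
    print(URL_list)

    for url in URL_list:
        # Skip blank lines instead of retrying them 30 times
        if not url.strip():
            continue
        # Retry on errors, up to 30 attempts per URL
        attempts = 0
        success = False
        print(url)
        while attempts < 30 and not success:
            try:
                # These two calls do all the work
                body = getHtml(url.strip("\n"))     # fetch the page source
                getImg(body)                        # clean it up and save it
                success = True
            except Exception as err:
                attempts += 1
                print("attempt %d failed: %s" % (attempts, err))
                # Pause for 10 seconds after every fifth failure
                if attempts % 5 == 0:
                    time.sleep(10)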
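
The retry loop above could also be factored into a small reusable helper. This is only a sketch: the name retry and its parameters are made up for illustration, with defaults chosen to mirror the numbers used in the loop (30 attempts, a 10 second pause after every fifth failure).

```python
import time

def retry(func, attempts=30, pause_every=5, pause_seconds=10,
          exceptions=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            # Pause after every pause_every-th failure, except the last one.
            if attempt % pause_every == 0 and attempt < attempts:
                sleep(pause_seconds)
    # All attempts failed: re-raise the last error so the caller sees it.
    raise last_error

# Example call, assuming getHtml and getImg from the script above:
# retry(lambda: getImg(getHtml(url.strip())))
```

Passing the sleep function in as a parameter makes the helper easy to exercise without actually waiting.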