A Complete Guide to Organizing Blog Posts

csuzhucong

Last modified 2023-12-20 10:00:32

Tags: python

First published 2017-12-06 20:33:56

Article link: https://blog.csdn.net/nameofcsdn/article/details/78734818

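I. Getting all blog URLs

1. Get the text that contains all blog links

First, write a blog post that contains links to all of your own posts, then run the following program: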

import re
import urllib.request

def getTxtWithAllUrl():
    url = 'https://blog.csdn.net/nameofcsdn/article/details/109147261'
    html = urllib.request.urlopen(url).read().decode('utf-8')
    return str(html)

There are many other ways to obtain this text, such as reading a local file.

2. Extract all blog links

Dependency: the getTxtWithAllUrl function
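As a minimal sketch of the local-file alternative, the function below returns a file's content as a string so it can be passed on in place of the downloaded page. The name `get_txt_from_file` and its parameters are assumptions made for illustration, not part of the original program.

```python
from pathlib import Path


def get_txt_from_file(path, encoding="utf-8"):
    """Return the whole content of a local text file as a string."""
    # errors="ignore" skips bytes that cannot be decoded instead of raising
    return Path(path).read_text(encoding=encoding, errors="ignore")
```

The returned string can then be handed to whatever code extracts the links, exactly as the downloaded HTML would be.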

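import re
import urllib.request

def getAllUrl(strg):
    # Match both http and https links to this blog's articles
    all_url = re.findall(r'https?://blog\.csdn\.net/nameofcsdn/article/details/[0-9]+', strg, re.IGNORECASE)
    all_url = list(set(all_url))  # remove duplicates
    # Write one link per line; the with-block closes the file
    with open('D:\\csdn.txt', 'w') as fp:
        for each in all_url:
            fp.write(each + '\n')
    print(len(all_url))
    return all_url  # returned so that later sections can compare lists

getAllUrl(getTxtWithAllUrl())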

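This creates the file csdn.txt in the root of drive D and stores all blog links in it.

3. Get all blog IDs

Dependency: csdn.txt on the local D drive

First get all blog links, then run the program below to convert them to IDs: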
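To see the extraction step in isolation, here is a small sketch that runs the same kind of pattern on an in-memory sample instead of a downloaded page. The helper name `extract_urls` and the sample HTML are assumptions for illustration; the result is sorted only to make the output stable.

```python
import re

URL_PATTERN = re.compile(
    r"https?://blog\.csdn\.net/nameofcsdn/article/details/[0-9]+",
    re.IGNORECASE,
)


def extract_urls(text):
    """Return the unique article links found in text, sorted for stable output."""
    return sorted(set(URL_PATTERN.findall(text)))


sample = (
    '<a href="https://blog.csdn.net/nameofcsdn/article/details/78734818">a</a>'
    '<a href="https://blog.csdn.net/nameofcsdn/article/details/109147261">b</a>'
    '<a href="https://blog.csdn.net/nameofcsdn/article/details/78734818">a again</a>'
)
print(extract_urls(sample))  # the repeated link appears only once
```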

# coding=utf-8


def getIdFromUrl(url):
    # The ID is the last path segment; IDs may have 8 or 9 digits
    return url.strip().rsplit('/', 1)[-1]


def printUrl(url):
    print(getIdFromUrl(url))


with open('D:\\csdn.txt', 'r', encoding='utf-8') as fp:
    for each in fp:
        printUrl(each)
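As an alternative sketch, the ID can be pulled out with a regular expression, which also rejects lines that are not article links. The function name `get_id_from_url` is an assumption for illustration.

```python
import re


def get_id_from_url(line):
    """Return the numeric article ID contained in line, or None if there is none."""
    match = re.search(r"/article/details/([0-9]+)", line)
    return match.group(1) if match else None


print(get_id_from_url("https://blog.csdn.net/nameofcsdn/article/details/78734818\n"))
print(get_id_from_url("https://blog.csdn.net/nameofcsdn/article/details/109147261\n"))
```

Returning None for unrelated lines lets the caller skip blank or malformed entries instead of printing a partial string.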

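II. Backing up blogs

1. Automatic backup

Dependency: csdn.txt on the local D drive

Once csdn.txt in the root of drive D holds all the blog links, running the program below fetches every post; a plain-text post could also be written straight to a txt file.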

#coding=utf-8
import urllib.request

with open('D:\\csdn.txt', 'r') as fp:
    lines = fp.readlines()
for line in lines:
    if line == '\n':  # stop at the first blank line
        break
    url = line.strip('\n')
    try:
        urllib.request.urlopen(url).read()  # fetch the page content
    except Exception as e:
        print('failed to open', url, e)
    print(line)

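PS: Do not use this program to inflate view counts, or your account may get banned.

2. Manual backup

Automatic backup is too limited, so I chose manual backup.

Both IE and Chrome can save a web page locally, packed into a single mht or mhtml file. The file opens directly and contains the blog's URL, which is very convenient.

Best of all, I can manage all my posts with folders, independent of CSDN's tag system, which keeps changing and is too hard to use.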
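The step of writing a plain-text post to a txt file is not shown in the program above. The following is only a rough sketch of one way to do it with the standard library's `html.parser`: it keeps the text nodes of a page and skips scripts and styles. It extracts the text of the whole page, not just the article body, and the names `TextExtractor`, `html_to_text` and `save_text` are assumptions for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of html, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def save_text(html, path):
    # path is a hypothetical output location chosen by the caller
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(html_to_text(html))
```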
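An mht/mhtml file is a MIME multipart message, so the blog address it carries can also be read with the standard `email` module rather than by searching the raw text. The sketch below assumes the file content has already been read into a string; the function name `content_locations` and the tiny sample message are assumptions for illustration.

```python
import email


def content_locations(mht_text):
    """Return every Content-Location header found in the message and its parts."""
    msg = email.message_from_string(mht_text)
    return [part["Content-Location"] for part in msg.walk() if part["Content-Location"]]


# A hand-written miniature of a saved page, for demonstration only
sample = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/related; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "Content-Location: https://blog.csdn.net/nameofcsdn/article/details/78734818\n"
    "\n"
    "<html><body>post</body></html>\n"
    "--XYZ--\n"
)
print(content_locations(sample))
```

A real saved page has many parts (images, stylesheets), so the caller would still filter the result for the article link.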

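III. Building a list of all blogs

Because of CSDN's annoying limit (an article may not exceed 64,000 characters), I have no choice but to use the Markdown editor. On the upside, having the Markdown editor generate headings and a table of contents automatically is very convenient.

Dependency: every blog post has a local backup file.

1. Build the list of all blogs

Following Markdown syntax, control the format of the program's output: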

import re, os
import urllib.request

def out(n):
    if n:
        print("#", end='')
        out(n-1)


def readfile(A, file):
    f = open(file, 'r').read()
    try:
        url = re.findall('Content-Location: http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        eachurl = url[0][18:]
    except:
        url = re.findall('http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        eachurl = url[0]
    html = urllib.request.urlopen(eachurl).read().decode('UTF-8')
    title = re.findall('var articleTitle =.*;', str(html))
    eachtitle = title[0]
    aurl = eachtitle[20:-2] + ' '+eachurl
    A.append(aurl)


def outpath(path1, path2, deep):
    path1 = os.path.join(path1, path2)
    mylist = os.listdir(path1)
    out(deep)
    if os.path.isdir(os.path.join(path1, mylist[0])):  # the folder contains only directories
        print(' ', path2)
        for adir in mylist:
            outpath(path1, adir, deep + 1)
    else:  # the folder contains only files
        A = []
        for adir in mylist:
            try:
                readfile(A, os.path.join(path1, adir))
            except:
                A.append(adir)
        print(' ', path2, '  共', len(mylist), '篇')
        A.sort()
        for each in A:
            print(each[:-58] + '    [博客链接](' + each[-58:] + ')')


outpath('D:\\朱聪', '博客备份（2020年10月20日）', 0)  # depth 0 corresponds to path2 = '博客备份', and so on

Pasting the program's output directly into a blog post gives the "All Blogs Navigation" post.

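The link text is "博客链接" ("blog link") rather than "Link" because the word "link" appears in some blog titles, which makes it inconvenient.

2. Optimization: avoiding manual calculation of the directory depth

Replace the function call on the last line with: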
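The output format can be summarized in two small helpers. This is only a sketch: `heading` mirrors the run of `#` characters followed by the folder name, and `link_line` splits a "title url" entry at its last space instead of relying on a fixed URL length. Both function names are assumptions for illustration.

```python
def heading(depth, name):
    """Return a Markdown heading line made of depth '#' characters and the folder name."""
    return "#" * depth + "  " + name


def link_line(entry):
    """Split a 'title url' entry at its last space and format it as a Markdown link line."""
    title, url = entry.rsplit(" ", 1)
    return title + "    [博客链接](" + url + ")"


print(heading(2, "7.6, 从数学到编程"))
print(link_line("Some title https://blog.csdn.net/nameofcsdn/article/details/78734818"))
```

Splitting at the last space works for links of any length, at the cost of assuming that the URL itself contains no space.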

path = r'D:\朱聪\博客备份\7，数学与逻辑\7.6, 从数学到编程'
loc = path.rfind('\\')
path1 = path[0:loc]
outpath(path1, path[loc+1:], len(path1.split('\\'))-2)

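This way you only need to paste in the absolute path each time, with no other steps.

3. Optimization: speeding things up

Opening the web pages one by one is too slow, so I decided to store the information so that the next run does not need to visit the pages again.

After considering several approaches, I chose to rename the files directly, changing each backup file's name to the blog title.

But this raises a problem: some characters cannot be used in file names. So the actual strategy is to rename the files that can be renamed and leave the rest as they are; the files left unrenamed still require a web request on every run.

Among the characters not allowed in file names, the ASCII colon : appears often in my blog titles, while the others are isolated cases, so when renaming I replace the ASCII colon with the full-width colon ：.

Code: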
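The same split can be sketched with the standard `ntpath` module, which applies Windows path rules on any platform. The function name `split_for_outpath` and the `root_depth` parameter are assumptions for illustration; `root_depth=2` corresponds to the two components (`D:` and `朱聪`) above the depth-0 folder in the example path.

```python
import ntpath  # Windows path rules, usable on any platform


def split_for_outpath(path, root_depth=2):
    """Return (parent, leaf, depth) for an absolute Windows path.

    root_depth is the number of path components above the depth-0 folder.
    """
    parent, leaf = ntpath.split(path)
    depth = len(parent.split("\\")) - root_depth
    return parent, leaf, depth


path = r'D:\朱聪\博客备份\7，数学与逻辑\7.6, 从数学到编程'
print(split_for_outpath(path))
```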
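The renaming rule can be sketched as a small function: swap the ASCII colon for the full-width one, and give up if any other character that Windows forbids in file names remains. The name `title_to_filename` and the `.mht` default are assumptions for illustration.

```python
# Characters Windows does not allow in file names (':' is handled separately)
INVALID_CHARS = set('\\/*?"<>|')


def title_to_filename(title, ext=".mht"):
    """Return a file name for title, or None when the title cannot be used."""
    name = title.replace(":", "：")  # ASCII colon -> full-width colon
    if any(ch in INVALID_CHARS for ch in name):
        return None  # caller keeps the old file name
    return name + ext


print(title_to_filename("C++: a note"))
print(title_to_filename("a/b"))
```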

import re, os
import urllib.request

def out(n):
    if n:
        print("#", end='')
        out(n - 1)

def getUrlFromFile(file):
    f = open(file, 'r').read()
    try:
        url = re.findall('Content-Location: http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        ret = url[0][18:]
    except:
        url = re.findall('http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        ret = url[0]
    return ret

def getTitleFromFile(file):
    eachurl = getUrlFromFile(file)
    try:
        html = urllib.request.urlopen(eachurl).read().decode('UTF-8')
    except Exception:
        print(file)  # report which backup file could not be fetched
        raise        # without the page there is no title to extract
    title = re.findall('var articleTitle =.*;', str(html))
    eachtitle = title[0]
    aurl = eachtitle[20:-2]
    return aurl

def outpath(path1, path2, deep):
    path1 = os.path.join(path1, path2)
    mylist = os.listdir(path1)
    out(deep)
    if os.path.isdir(os.path.join(path1, mylist[0])):  # the folder contains only directories
        print(' ', path2)
        for adir in mylist:
            outpath(path1, adir, deep + 1)
    else:  # the folder contains only files
        A = []
        for adir in mylist:
            file = os.path.join(path1, adir)
            filename = getUrlFromFile(file)
            if 'CSDN博客' in adir or 'nameofcsdn' in adir:
                title = getTitleFromFile(file)
            else:
                title = adir[:-4]
            A.append(title + ' ' + filename)
            title = title.replace(':','：')
            try:
                os.replace(file, os.path.join(path1, title + '.mht'))
            except OSError:
                pass  # titles that are not valid file names keep their old name
        print(' ', path2, '  共', len(mylist), '篇')
        A.sort()
        for each in A:
            print(each[:-58] + '    [博客链接](' + each[-58:] + ')')

path = r'D:\朱聪\博客备份\CSDN'
loc = path.rfind('\\')
path1 = path[0:loc]
outpath(path1, path[loc + 1:], len(path1.split('\\')) - 3)

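Before the optimization, scanning 1,300 posts took about 30 minutes; afterwards it takes only 2 minutes.

IV. Getting all blog titles

Dependency: every blog post has a local backup file.

Generate the blog list from the locally backed-up posts: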

import re, os
import urllib.request

def out(n):
    if n:
        print("#", end='')
        out(n-1)


def readfile(A, file):
    f = open(file, 'r').read()
    try:
        url = re.findall('Content-Location: http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        eachurl = url[0][18:]
    except:
        url = re.findall('http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        eachurl = url[0]
    html = urllib.request.urlopen(eachurl).read().decode('UTF-8')
    title = re.findall('var articleTitle =.*;', str(html))
    eachtitle = title[0]
    aurl = eachtitle[20:-2] + ' '+eachurl
    A.append(aurl)


def outpath(path1, path2, deep):
    path1 = os.path.join(path1, path2)
    mylist = os.listdir(path1)
    # out(deep)
    if os.path.isdir(os.path.join(path1, mylist[0])):  # the folder contains only directories
        # print(' ', path2)
        for adir in mylist:
            outpath(path1, adir, deep + 1)
    else:  # the folder contains only files
        A = []
        for adir in mylist:
            try:
                readfile(A, os.path.join(path1, adir))
            except:
                A.append(adir)
        # print(' ', path2, '  共', len(mylist), '篇')
        A.sort()
        for each in A:
            print(each[:-58])
            # print(each[:-58] + '    [博客链接](' + each[-58:] + ')')


outpath('D:\\朱聪', '博客备份（2020年10月18日）', 0)  # depth 0 corresponds to path2 = '博客备份', and so on
# outpath('D:\\朱聪\\博客备份（2020年10月18日）', '9，其他', 1)

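V. How to find which blogs have more than one category

CSDN blogs support custom categories, much like WeChat: a post may belong to no category, to one category, or to several.

After a recent CSDN update it is finally possible to see which posts have no category, directly on the CSDN site.

The question now is how to find which posts have more than one category. I wrote a simple Python program for this.

Dependency: csdn.txt on the local D drive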

# coding=utf-8
import re
import urllib.request
import os

fp = open('D:' + '\\' + 'csdn.txt','r').readlines()
for line in fp:
    if line == '\n':
        break
    try:
        html = urllib.request.urlopen(line.strip('\n')).read().decode('utf-8')
        m = re.findall('个人分类：</span>.*?https://blog.csdn.net/nameofcsdn/article.*?"_blank">.*?</a>', str(html),re.DOTALL)
        print(m[0][121:-10])
        print(line)
    except:
        print(line)

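VI. Finding lost blogs

1. Method one (compare titles; no longer works)

October 20, 2020.

After rewriting the "All Blogs Navigation" post I found that one post was missing: the system counted 1570 posts, but I had 1569 locally.

I went through the posts tagged "new" and did not find it. So, the old approach: data mining, data analysis, bring in Python.

First we need the list of all posts as reported by the system.

(Because 2 posts were under review, the screenshot taken at the time showed 1568.)

The export feature here can only export 1000 entries (CSDN, you really make me battle wits with you every day; finding tricks just to write a few blog posts is more thrilling than an escape room).

So I export once in ascending order and once in descending order, then concatenate the two.

Then generate the list of blog titles from the local backups.

Finally, Excel or a text comparison tool finds the ghost.

It turned out to be a "new" post whose tag had been set wrong: it was tagged "updated" instead.

Why didn't I think of that earlier? Now here I am at half past three in the morning, code finished, repenting.

Never mind. Off to bed!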
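The steps of method one (joining the ascending and descending exports, then comparing against the local titles) can also be done in Python instead of Excel. The sketch below assumes both exports and the local list are already available as lists of title strings; the function names are assumptions for illustration. Note that posts sharing an identical title would be merged into one entry.

```python
def merge_exports(ascending, descending):
    """Combine two partial exports into one list without duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for title in list(ascending) + list(descending):
        if title not in seen:
            seen.add(title)
            merged.append(title)
    return merged


def missing_locally(system_titles, local_titles):
    """Return the titles known to the system that have no local backup."""
    local = set(local_titles)
    return [t for t in system_titles if t not in local]


system = merge_exports(["a", "b", "c"], ["d", "c", "b"])
print(missing_locally(system, ["a", "b", "d"]))
```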

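2. Method two (compare links)

September 11, 2021.

CSDN updated again and the method above stopped working. I truly, truly, truly detest CSDN's endless, repeated updates; it is so annoying.

I found a new method: compare all blog links on the system with all blog links stored locally.

(1) All blog links on the system

Dependency: the getAllUrl function

Visit the new-style home page ("nameofcsdn's blog") and keep pressing [Page Down] until all posts are displayed, then save the page locally with Ctrl+S.

Then open the blog page, view the page source, copy it, and paste it into D:\nameofcsdn.txt.

Then run the program to get the list of all blogs: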

f = open('D:\\nameofcsdn.txt','r',encoding='utf-8',errors='ignore').read()
f = str(f)
getAllUrl(f)

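(2) All blog links stored locally

Dependency: every blog post has a local backup file.

For the local posts, slightly modify the code above so that it outputs only the blog links, not the titles: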

import re, os
import urllib.request

def getUrlFromFile(file):
    f = open(file, 'r').read()
    try:
        url = re.findall('Content-Location: http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        ret = url[0][18:]
    except:
        url = re.findall('http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', str(f))
        ret = url[0]
    return ret


def getTitleFromFile(file):
    eachurl = getUrlFromFile(file)
    try:
        html = urllib.request.urlopen(eachurl).read().decode('UTF-8')
    except Exception:
        print(file)  # report which backup file could not be fetched
        raise        # without the page there is no title to extract
    title = re.findall('var articleTitle =.*;', str(html))
    eachtitle = title[0]
    aurl = eachtitle[20:-2]
    return aurl


def outpath(path1, path2, deep):
    path1 = os.path.join(path1, path2)
    mylist = os.listdir(path1)
    if os.path.isdir(os.path.join(path1, mylist[0])):  # the folder contains only directories
        for adir in mylist:
            outpath(path1, adir, deep + 1)
    else:  # the folder contains only files
        A = []
        for adir in mylist:
            file = os.path.join(path1, adir)
            filename = getUrlFromFile(file)
            if 'CSDN博客' in adir or 'nameofcsdn' in adir:
                title = getTitleFromFile(file)
            else:
                title = adir[:-4]
            A.append(title + ' ' + filename)
        A.sort()
        for each in A:
            each = each[-58:]
            while each[0] == ' ':
                each = each[1:]
            print(each)


path = r'D:\朱聪\博客备份\CSDN'
loc = path.rfind('\\')
path1 = path[0:loc]
outpath(path1, path[loc + 1:], len(path1.split('\\')) - 3)

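3. Method three (an improved method two)

Compute both blog lists directly, then compare them.

Dependencies: the getTxtWithAllUrl and getAllUrl functions

First update D:\nameofcsdn.txt, following the same steps as in method two.

Then run: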

import re
import urllib.request

def showDif(lista,listb):
    for a in lista:
        if a not in listb:
            print(a)

# Note: getAllUrl must return the list of links it extracts for the lines below to work
f = open('D:\\nameofcsdn.txt','r',encoding='utf-8',errors='ignore').read()
lista = getAllUrl(str(f))
listb = getAllUrl(getTxtWithAllUrl())
print(len(lista))
print(len(listb))
showDif(lista,listb)
print("---")
showDif(listb,lista)

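This prints two lists of links. The links in the first list are either invalid or belong to posts missing locally; the links in the second list all belong to private posts.

VII. Getting the links of all blogs that contain code

Method: look for the code tag in the HTML source.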
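With well over a thousand links, testing `a not in listb` against a list scans the list for every element. As a sketch of an alternative, the two-way comparison can be written with sets, which makes each membership test a hash lookup. The function name `two_way_diff` is an assumption for illustration, and the output is sorted only to make it reproducible.

```python
def two_way_diff(list_a, list_b):
    """Return (only_in_a, only_in_b) as sorted lists."""
    set_a, set_b = set(list_a), set(list_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


only_a, only_b = two_way_diff(["u1", "u2", "u3"], ["u2", "u3", "u4"])
print(only_a)
print("---")
print(only_b)
```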

import re
import urllib.request


def getTxtWithAllUrl():
    url = 'https://blog.csdn.net/nameofcsdn/article/details/109147261'
    html = urllib.request.urlopen(url).read().decode('utf-8')
    return str(html)

strg = getTxtWithAllUrl()
all_url = re.findall('http[s]*://blog.csdn.net/nameofcsdn/article/details/[0-9]+', strg, re.IGNORECASE)
all_url = list(set(all_url))
in_list=[]
not_in_list=[]
for url in all_url:
    html = ''
    try:
        html = urllib.request.urlopen(url).read().decode('utf-8')
        print('open succ')
    except:
        print('open fail',end=' ')
        print(url)
    if 'code class' in html:
        in_list.append(url)
    else:
        not_in_list.append(url)

print('in_list')
for url in in_list:
    print(url)
print('not_in_list')
for url in not_in_list:
    print(url)
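A plain substring test can also match the words "code class" when they appear in ordinary text. As a hedged alternative, the sketch below uses the standard `html.parser` to look for an actual `<code>` tag that carries a `class` attribute. The names `CodeTagFinder` and `contains_code` are assumptions for illustration.

```python
from html.parser import HTMLParser


class CodeTagFinder(HTMLParser):
    """Set found to True when a <code> tag with a class attribute is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "code" and any(name == "class" for name, _ in attrs):
            self.found = True


def contains_code(html):
    """Return True if html contains a <code class=...> element."""
    finder = CodeTagFinder()
    finder.feed(html)
    return finder.found


print(contains_code('<pre><code class="language-python">print(1)</code></pre>'))
print(contains_code('<p>the words code class in plain text</p>'))
```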