This post mainly follows the article below to batch-migrate a CSDN blog, i.e. batch-export its posts as md files:
https://blog.csdn.net/pang787559613/article/details/105444286
The original skips over some details, which I fill in here.
How it works
Use a Python scraper to fetch the pages.
Workflow
https://blog.csdn.net/你的博客名/article/list/页码
1. Hit this URL for every page (你的博客名 is your CSDN username, 页码 is the page number) to collect all of your posts.
https://blog.csdn.net/你的博客名/article/details/105360586
2. Each post has a unique id; parse the list pages to get every article's id, publish date, and so on.
https://blog-console-api.csdn.net/v1/editor/getArticle?id=文章id
3. Once you have an article id, the API above returns a JSON blob with everything about the post; parse it for markdowncontent, the title, tags, key, and so on (see the sketch after this list).
4. Save the result as Hexo-style Markdown.
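For reference, here is the rough shape of the getArticle JSON, trimmed down to just the fields the script below actually reads (the values are illustrative), followed by the Hexo front matter the script builds from them in step 4:

{
  "data": {
    "title": "Some post title",
    "tags": "python,scraper",
    "markdowncontent": "# body text ..."
  }
}

---
title: Some post title
date: 2020-04-07
tags:
 - python
 - scraper
---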
Two problems
1. These lines from the original article
reply = requests.get(url)
parse = BeautifulSoup(reply.content, "lxml")
error out at runtime: without a browser User-Agent, CSDN rejects the request, and the "lxml" parser is a separate install (the built-in "html.parser" avoids that dependency).
Fix:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47'}
reply = requests.get(url, headers=headers)
parse = BeautifulSoup(reply.content, "html.parser")
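If it still fails, it helps to check the HTTP status before parsing; a small addition using raise_for_status(), which is standard requests API:

reply = requests.get(url, headers=headers, timeout=10)
reply.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
parse = BeautifulSoup(reply.content, "html.parser")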
2. This block from the original article
headers = {
"cookie":"uuid_tt_dd=10_20621362900-1586421086666-163599; dc_session_id=10_1586421086666.420623; Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac=1586421618; dc_sid=d4ceee41911ac755c162110ff811aee3; __gads=ID=608336bee91baf3d:T=1586421689:S=ALNI_MZozulzITWLw3Hxzo3jrnu5fmz8CA; c_ref=https%3A//blog.csdn.net/pang787559613/article/list/2; c-toolbar-writeguide=1; SESSION=3b5e7c88-b27d-4fcc-a2d5-4e97c1438a3c; UN=pang787559613; Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac=6525*1*10_20621362900-1586421086666-163599!5744*1*pang787559613; announcement=%257B%2522isLogin%2522%253Atrue%252C%2522announcementUrl%2522%253A%2522https%253A%252F%252Fblog.csdn.net%252Fblogdevteam%252Farticle%252Fdetails%252F105203745%2522%252C%2522announcementCount%2522%253A0%252C%2522announcementExpire%2522%253A3600000%257D; UserName=pang787559613; UserInfo=604f13922dc04f2d8071fe0834e95db3; UserToken=604f13922dc04f2d8071fe0834e95db3; UserNick=%E7%AC%91%E8%83%96%E4%BB%94; AU=5FE; BT=1586422795507; p_uid=U000000; Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac=1586422842; TY_SESSION_ID=ffc735f4-f5ae-40ed-9b98-ff01e337bf76; dc_tos=q8ijsv",
"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
}
At first I ignored this. Later runs kept failing, and after an afternoon of digging the conclusion was: the cookie matters. Without it, the API refuses the request and tells you to log in first. The cookie carries your login session, so the scraper doesn't have to log in itself.
Fix:
Find your own cookie. I used Chrome:
1. Open the CSDN homepage and press F12 to open DevTools (if the panel is blank, refresh the page).
2. Scroll down through the request headers and you will find the cookie.
3. Copy that cookie into the script, and you can access your own blog pages without logging in again.
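Instead of pasting the cookie straight into the source, you can read it from an environment variable; a minimal sketch (the variable name CSDN_COOKIE is my own choice, not from the original article):

import os

cookie = os.environ.get("CSDN_COOKIE", "")  # run `export CSDN_COOKIE='...'` first
if not cookie:
    raise SystemExit("Set the CSDN_COOKIE environment variable to your CSDN cookie")
headers = {
    "Cookie": cookie,
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36",
}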
Code
import os
import json
import uuid
import time
import requests
import datetime
from bs4 import BeautifulSoup

def request_blog_list(page):
    """Fetch one page of the blog list.

    Returns the article ids and publish dates, among other things.
    """
    # Replace 你的博客名 with your own CSDN username
    url = f'https://blog.csdn.net/你的博客名/article/list/{page}'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36 Edg/87.0.664.47'}
    reply = requests.get(url, headers=headers)
    parse = BeautifulSoup(reply.content, "html.parser")
    spans = parse.find_all('div', attrs={'class': 'article-item-box csdn-tracking-statistics'})
    blogs = []
    for span in spans[:40]:  # a list page holds at most 40 articles
        try:
            href = span.find('a', attrs={'target': '_blank'})['href']
            # read_num = span.find('span', attrs={'class': 'read_num'}).get_text()
            date = span.find('span', attrs={'class': 'date'}).get_text()
            blog_id = href.split("/")[-1]
            blogs.append([blog_id, date])
        except Exception as e:
            print('Wrong,', e)
    return blogs
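
# Quick sanity check (illustrative, run in a REPL): each entry is a
# [blog_id, date] pair, where the date still carries a time-of-day part:
# >>> request_blog_list(1)
# [['105360586', '2020-04-07 10:15:32'], ...]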
def request_md(blog_id, date):
    """Fetch the JSON payload (including the markdown body) for one article."""
    url = f"https://blog-console-api.csdn.net/v1/editor/getArticle?id={blog_id}"
    headers = {
        # Paste your own cookie here, or the API will ask you to log in first
        'Cookie': '',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'
    }
    data = {"id": blog_id}
    reply = requests.get(url, headers=headers, data=data)
    try:
        write_hexo_md(reply.json(), date)
    except Exception as e:
        print("***********************************")
        print(e)
        print(url)
        # print(reply.json())
def write_hexo_md(data, date):
    """Parse the JSON payload into a Hexo-style markdown file."""
    title = data["data"]["title"]
    # Half-width brackets would break the front matter, swap in full-width ones
    title = title.replace("[", "【")
    title = title.replace("]", "】")
    tags = data["data"]["tags"]
    # # Unique page key, used by analytics and comment systems
    # key = "key" + str(uuid.uuid4())
    date_str = f"{date[0]}-{date[1]}-{date[2]}"
    tag = "tags:\n - " + "\n - ".join(tags.split(","))
    # Hexo front matter: title, date, tags
    header = "---\n" + f"title: {title}\n" + f"date: {date_str}\n" + tag + "\n" + "---\n\n"
    # header = "---\n" + f"title: {title}\n" + tag + "\n" + "---\n\n"
    # Front matter with the title only:
    # header = "---\n" + f"title: {title}\n" + "---\n\n"
    content = data["data"]["markdowncontent"].replace("@[toc]", "")
    # Hexo-style markdown file
    with open(f"blogs/{title}.md", "w", encoding="utf-8") as f:
        f.write(header + content)
    print(f"Wrote {title}")
    # # For a plain migration, keep the body only:
    # with open(f"blogs/{title}.md", "w", encoding="utf-8") as f:
    #     f.write(content)
    # print(f"Wrote {title}")
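
# Note: only [ and ] are sanitized above; a title containing / or other
# characters that are illegal in filenames will still make open() fail,
# so extend the replace() calls if your titles need it.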
def main(total_pages=2):
    """
    Fetch the blog list, including ids and dates.
    Fetch each blog's markdown JSON.
    Save it as Hexo-style markdown.
    """
    os.makedirs("blogs", exist_ok=True)  # the output directory must exist before writing
    blogs = []
    for page in range(1, total_pages + 1):
        blogs.extend(request_blog_list(page))
    for blog in blogs:
        blog_id = blog[0]
        date = blog[1].split()[0].split("-")  # keep the date, drop the time of day
        request_md(blog_id, date)
        time.sleep(1)  # throttle so CSDN doesn't block the scraper

if __name__ == "__main__":
    main()
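To run it: paste your cookie into request_md, replace 你的博客名 with your username, and set total_pages so that total_pages × 40 covers your article count (the script reads at most 40 posts per list page). The exported md files land in the blogs/ directory.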