Scraping a CSDN column with Python and saving it as md files

leyuxuan1230

Tags: python, web scraping

Article link: https://blog.csdn.net/leyuxuan1230/article/details/119754505

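I had some free time today and suddenly felt like scraping some content from CSDN. Since my blog is built with hexo, I wanted to save the articles as md files and read them at my own pace.

First, open a column. Let's pick Jack-Cui's Python3网络爬虫入门.

To begin, let's inspect the elements that make up its table of contents:

See them? These a tags are the entries of the table of contents.

Now, write the code: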

from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent with the request
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
}
url = 'https://blog.csdn.net/c406495762/category_9268672.html'
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
# The article list sits in <div id="column"> -> <ul class="column_article_list">
div = soup.find('div', id='column')
a = div.find('ul', class_='column_article_list').find_all('a')
# Collect the href of every <a> tag in the list
urls = []
for i in a:
    urls.append(i.get('href'))
print(urls)
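
Depending on the page markup, the collected hrefs may contain duplicates, empty values, or relative paths. Here is a minimal sketch of tidying the list before requesting each article, using only the standard library. The helper name `clean_links` and the sample paths are made up for illustration and are not part of the original code.

```python
from urllib.parse import urljoin

def clean_links(hrefs, base):
    """Return absolute, de-duplicated links in their original order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:                 # skip <a> tags without an href
            continue
        full = urljoin(base, href)   # resolve relative paths against the column URL
        if full not in seen:
            seen.add(full)
            result.append(full)
    return result

base = 'https://blog.csdn.net/c406495762/category_9268672.html'
# Made-up sample values: a relative path, a missing href, and a duplicate
sample = ['/c406495762/article/details/1', None,
          'https://blog.csdn.net/c406495762/article/details/1']
print(clean_links(sample, base))
```

Keeping the original order matters here, because the order of the list is the order of the articles in the column.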

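Next, let's convert a single page to md.

This looks hard at first, but we can simply search the web.

There are plenty of approaches. My favorite is the html2text library, which is very easy to use: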

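import html2text

def md(url):
    # Reuse the headers defined above so the request looks like a browser
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the article body and drop the <blockquote> tags before converting
    content = str(soup.find('div', id='article_content')).replace('<blockquote>', '').replace('</blockquote>', '')
    name = soup.find('h1').text
    # hexo front matter
    start = f'''---
title: {name}
---
'''
    end = f'''
Reposted from: {url}
'''
    # Rejoin text that was split across lines right after a hyphen
    text = html2text.html2text(content).replace('-\n', '-')
    w = start + text + end
    # Write as UTF-8 explicitly so Chinese text is saved correctly on any platform
    with open(f'{name}.md', 'w', encoding='utf-8') as f:
        f.write(w)

md('https://jackcui.blog.csdn.net/article/details/109046264')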

Put simply, it only takes one line like this:

html2text.html2text('HTML string')

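After a few more tweaks, it's OK.

Let's paste it into an online editor and take a look:

Why aren't the images showing???

This is probably because CSDN has an anti-scraping check based on the Referer header.

But we can add this line near the top of the file: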

<meta name="referrer" content="no-referrer">
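
For context, Referer is an HTTP request header that tells the server which page triggered the request, and the meta tag above asks the browser not to send it. If you would rather download the images than link to them, the header can also be set by hand. This is a rough sketch with the standard library. The helper name `image_request`, the image URL, and the idea that CSDN would accept the article URL as Referer are all assumptions.

```python
from urllib.request import Request

def image_request(img_url, page_url):
    """Build a request for an image that names the article page as Referer."""
    return Request(img_url, headers={
        'Referer': page_url,          # assumption: the server checks this header
        'User-Agent': 'Mozilla/5.0',
    })

req = image_request('https://example.com/pic.png',
                    'https://jackcui.blog.csdn.net/article/details/58716886')
print(req.get_header('Referer'))
# To actually download the image: urllib.request.urlopen(req).read()
```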

Update the code:

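import os

def d(s, d_list):
    # Remove every substring in d_list from s
    for i in d_list:
        s = s.replace(i, '')
    return s

def md(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    content = d(str(soup.find('div', id='article_content')), ['<blockquote>', '</blockquote>'])
    name = soup.find('h1').text
    # hexo expects the front matter at the very top of the file; the meta tag
    # asks the browser not to send a Referer when it loads the images
    start = f'''---
title: {name}
---
<meta name="referrer" content="no-referrer">

'''
    end = f'''
Reposted from: {url}
'''
    # Rejoin text that was split across lines right after a hyphen
    text = html2text.html2text(content).replace('-\n', '-')
    w = start + text + end
    # Strip characters that are not allowed in file names
    name = d(name, ['\\', '/', '：', ':', '*', '?', '？', '"', '<', '>', '|'])
    print(name)
    # Make sure the output folder exists before writing
    os.makedirs('./md', exist_ok=True)
    with open(f'./md/{name}.md', 'w', encoding='utf-8') as f:
        f.write(w)

md('https://jackcui.blog.csdn.net/article/details/58716886')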
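
The character list passed to `d()` covers the characters Windows forbids in file names, plus the full-width colon and question mark. The same idea can be written as a single regular expression. This is only an alternative sketch: `safe_filename` is a hypothetical name, and unlike `d()` it also trims surrounding whitespace.

```python
import re

# The characters removed by the post's d() call, written as one character class
_FORBIDDEN = re.compile(r'[\\/：:*?？"<>|]')

def safe_filename(title):
    """Strip characters that cannot appear in a file name."""
    return _FORBIDDEN.sub('', title).strip()

print(safe_filename('Python3网络爬虫(一)：What is a crawler?'))
```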

And the images show up:

Now let's put the code together and save every article.

from bs4 import BeautifulSoup
import os
import time

import html2text
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
}

# Step 1: collect the article links from the column page
url = 'https://blog.csdn.net/c406495762/category_9268672.html'
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
div = soup.find('div', id='column')
a = div.find('ul', class_='column_article_list').find_all('a')
urls = []
for i in a:
    urls.append(i.get('href'))
print(urls)

def d(s, d_list):
    # Remove every substring in d_list from s
    for i in d_list:
        s = s.replace(i, '')
    return s

# Step 2: convert one article page to an md file
def md(url):
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    content = d(str(soup.find('div', id='article_content')), ['<blockquote>', '</blockquote>'])
    name = soup.find('h1').text
    # hexo expects the front matter at the very top of the file
    start = f'''---
title: {name}
---
<meta name="referrer" content="no-referrer">

'''
    end = f'''
Reposted from: {url}
'''
    # Rejoin text that was split across lines right after a hyphen
    text = html2text.html2text(content).replace('-\n', '-')
    w = start + text + end
    # Strip characters that are not allowed in file names
    name = d(name, ['\\', '/', '：', ':', '*', '?', '？', '"', '<', '>', '|'])
    print(name)
    with open(f'./md/{name}.md', 'w', encoding='utf-8') as f:
        f.write(w)

# Step 3: save every article, pausing briefly between requests
os.makedirs('./md', exist_ok=True)
for i in urls:
    md(i)
    time.sleep(1)
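
When many articles are fetched in one go, a single failed request would stop the loop above. One possible refinement is sketched here with a stand-in save function so it runs without network access. The names `crawl_all` and `fake_md` are hypothetical and not part of the original code.

```python
import time

def crawl_all(urls, save, delay=1.0):
    """Call save(url) for each URL; collect failures instead of stopping."""
    failed = []
    for u in urls:
        try:
            save(u)
        except Exception as exc:   # keep going if one article fails
            failed.append((u, str(exc)))
        time.sleep(delay)          # pause between requests
    return failed

def fake_md(u):
    # Stand-in for md(); fails on one URL to show the behaviour
    if u.endswith('bad'):
        raise ValueError('download failed')

print(crawl_all(['https://example.com/ok', 'https://example.com/bad'],
                fake_md, delay=0))
```

In real use, `md` would be passed in place of `fake_md`, and the returned list could be retried afterwards.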

Copy the files into hexo's _posts folder, and it's done!
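
hexo reads the block between the two `---` lines as front matter, which is YAML by default, so a title containing a quote or a colon followed by a space can trip the parser. Here is a hedged sketch of building the file header more defensively. `build_post` is a hypothetical helper, and quoting the title with `json.dumps` relies on a JSON string also being a valid YAML double-quoted scalar.

```python
import json

def build_post(title, body, source_url):
    """Assemble front matter, the referrer meta tag, the body and a source line."""
    quoted = json.dumps(title, ensure_ascii=False)  # double-quoted and escaped
    header = (f'---\ntitle: {quoted}\n---\n'
              '<meta name="referrer" content="no-referrer">\n\n')
    footer = f'\nReposted from: {source_url}\n'
    return header + body + footer

post = build_post('Scrapy: a "quick" tour', 'Hello\n', 'https://example.com/a')
print(post)
```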
