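I had some free time today and suddenly felt like scraping some content from CSDN. Since my blog is built with Hexo, I wanted to save the articles as Markdown (.md) files so I can read them at my leisure.
First, open a column. Let's pick Jack-Cui's Python3网络爬虫入门 (Introduction to Python 3 Web Crawling).
To start, let's inspect the elements that make up its table of contents:
See? These a tags are the table-of-contents entries.
Now, let's write the code: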
from bs4 import BeautifulSoup
import requests

# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
}
url = 'https://blog.csdn.net/c406495762/category_9268672.html'
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
# The table of contents sits in <div id="column">, inside <ul class="column_article_list">.
div = soup.find('div', id='column')
a = div.find('ul', class_='column_article_list').find_all('a')
# Collect the link of every article in the column.
urls = []
for i in a:
    urls.append(i.get('href'))
print(urls)
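Depending on how the page is rendered, the collected `href` values could be relative, missing, or repeated. As a minimal sketch under that assumption, the links can be normalized with the standard library before they are used. `normalize_links` is a hypothetical helper that is not part of the original script, and the relative path in the sample input is made up for illustration.

```python
from urllib.parse import urljoin


def normalize_links(hrefs, base_url):
    """Return absolute, de-duplicated URLs in their original order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:  # <a> tags without an href yield None
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


base = 'https://blog.csdn.net/c406495762/category_9268672.html'
links = normalize_links(
    [
        '/c406495762/article/details/58716886',  # made-up relative link
        None,
        'https://jackcui.blog.csdn.net/article/details/109046264',
        '/c406495762/article/details/58716886',  # duplicate of the first
    ],
    base,
)
print(links)
```

Absolute links pass through `urljoin` unchanged, so the helper is harmless when the page already provides full URLs.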
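Next, let's convert a single page to Markdown.
This looks hard at first, but we can just search the web.
There are plenty of approaches. My favorite is the html2text library, which is very simple to use: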
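import html2text

def md(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the article body and drop the <blockquote> tags.
    content = str(soup.find('div', id='article_content')).replace('<blockquote>', '').replace('</blockquote>', '')
    name = soup.find('h1').text
    # Hexo front matter, placed at the top of the file.
    start = f'''---
title: {name}
---
'''
    # Attribution appended at the end of the article.
    end = f'''
Reposted from: {url}
'''
    body = html2text.html2text(content).replace('-\n', '-')
    w = start + body + end
    # Write as UTF-8 so non-ASCII text is saved correctly on any platform.
    with open(f'{name}.md', 'w', encoding='utf-8') as f:
        f.write(w)

md('https://jackcui.blog.csdn.net/article/details/109046264')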
In short, the conversion itself takes just one line of code:
html2text.html2text('HTML string')
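After a few more tweaks, it's OK.
Let's paste it into an online editor and take a look:
Why don't the images show up?
This is probably because CSDN checks the Referer header as a hotlink-protection (anti-scraping) measure.
However, we can add this line near the top of the file, right after the front matter: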
<meta name="referrer" content="no-referrer">
Update the code:
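import os

def d(text, d_list):
    """Remove every substring listed in d_list from text."""
    s = text
    for i in d_list:
        s = s.replace(i, '')
    return s

def md(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    content = d(str(soup.find('div', id='article_content')), ['<blockquote>', '</blockquote>'])
    name = soup.find('h1').text
    # Front matter starts on the first line; the meta tag follows it.
    start = f'''---
title: {name}
---
<meta name="referrer" content="no-referrer">
'''
    end = f'''
Reposted from: {url}
'''
    body = html2text.html2text(content).replace('-\n', '-')
    w = start + body + end
    # Strip characters that are not allowed in file names.
    name = d(name, ['\\', '/', ':', ':', '*', '?', '?', '"', '<', '>', '|'])
    print(name)
    # Make sure the output folder exists before writing.
    os.makedirs('./md', exist_ok=True)
    with open(f'./md/{name}.md', 'w', encoding='utf-8') as f:
        f.write(w)

md('https://jackcui.blog.csdn.net/article/details/58716886')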
Now the images show up:
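The title is used twice in the code above: once as a file name and once inside the front matter. As a rough sketch, the file-name cleanup can be done with a single regular expression, and the title can be written as a double-quoted string so that a colon or quote in it is less likely to break the YAML. `safe_filename` and `front_matter` are hypothetical helpers, not part of the original script, and the sample title is made up.

```python
import json
import re

# The same characters the script removes, including the full-width variants.
FORBIDDEN = r'[\\/::*??"<>|]'


def safe_filename(title):
    """Drop characters that are not allowed in Windows file names."""
    return re.sub(FORBIDDEN, '', title).strip()


def front_matter(title):
    """Build front matter with the title as a double-quoted YAML string."""
    quoted = json.dumps(title, ensure_ascii=False)
    return f'---\ntitle: {quoted}\n---\n'


title = 'Python3: what is a "crawler"?'
print(safe_filename(title))
print(front_matter(title))
```

This relies on JSON-style quoted strings being accepted as YAML scalars, which is an assumption worth checking against the YAML parser your Hexo version uses.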
Let's put the code together and save all the articles.
from bs4 import BeautifulSoup
import requests, html2text, time, os

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
}
url = 'https://blog.csdn.net/c406495762/category_9268672.html'
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'html.parser')
div = soup.find('div', id='column')
a = div.find('ul', class_='column_article_list').find_all('a')
# Collect the link of every article in the column.
urls = []
for i in a:
    urls.append(i.get('href'))
print(urls)

def d(text, d_list):
    """Remove every substring listed in d_list from text."""
    s = text
    for i in d_list:
        s = s.replace(i, '')
    return s

def md(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    content = d(str(soup.find('div', id='article_content')), ['<blockquote>', '</blockquote>'])
    name = soup.find('h1').text
    # Front matter starts on the first line; the meta tag follows it.
    start = f'''---
title: {name}
---
<meta name="referrer" content="no-referrer">
'''
    end = f'''
Reposted from: {url}
'''
    body = html2text.html2text(content).replace('-\n', '-')
    w = start + body + end
    # Strip characters that are not allowed in file names.
    name = d(name, ['\\', '/', ':', ':', '*', '?', '?', '"', '<', '>', '|'])
    print(name)
    # Make sure the output folder exists before writing.
    os.makedirs('./md', exist_ok=True)
    with open(f'./md/{name}.md', 'w', encoding='utf-8') as f:
        f.write(w)

for i in urls:
    md(i)
    # Pause briefly between requests.
    time.sleep(1)
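If one article fails to download or parse, the loop above stops at that point. A minimal sketch of a more forgiving loop, assuming you would rather skip the failed article and carry on: `save_all` is a hypothetical helper, not part of the original script, and the one-second default delay is an arbitrary choice.

```python
import time


def save_all(urls, save_one, delay=1.0):
    """Call save_one(url) for each URL and return the URLs that failed."""
    failed = []
    for index, article_url in enumerate(urls):
        if index:  # no need to wait before the first request
            time.sleep(delay)
        try:
            save_one(article_url)
        except Exception as exc:  # keep going if one article fails
            print(f'skipped {article_url}: {exc}')
            failed.append(article_url)
    return failed
```

It would be called with the list of article links and the `md` function, and the returned list shows which articles to retry.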
After copying the files into Hexo's _posts folder (source/_posts), it's done!