"Fun with Python Crawlers": Building a Blog of 100,000 Posts

Latest recommended article published 2020-11-30 11:56:41

cpfsdzs2014


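Preface

This article uses posts on cnblogs (博客园) as the example. It is for study and reference only; some sites plastered with ads are simply a waste of a crawler's effort.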

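Crawling

Fetch the post with BeautifulSoup
Convert the HTML to Markdown with html2text
Save the Markdown to a local file
Download the images referenced in the Markdown and replace their URLs
Write the result to the database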
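
The five steps above can be chained into a single function. The sketch below is only an illustration: `archive_post` and its parameter names are hypothetical, and each step is passed in as a callable so the flow can be tried with stand-ins, without network or database access. The example.com URL is a placeholder.

```python
def archive_post(url, fetch, to_markdown, save, localize_images, store):
    """Run the five steps in order for a single post URL.

    Every argument after `url` is a callable supplied by the caller;
    the names are hypothetical and only mirror the steps listed above.
    """
    info = fetch(url)                    # 1. fetch the title and HTML body
    md = to_markdown(info["content"])    # 2. convert HTML to Markdown
    md_file = save(md, info["title"])    # 3. write the .md file
    localize_images(md_file)             # 4. download images, rewrite URLs
    # 5. insert a row; this stores the Markdown as converted, so re-read
    #    md_file first if the rewritten image URLs should be stored instead
    store(info["title"], md, url)
    return md_file


# Try the flow with stand-in callables instead of real network/database code
calls = []
result = archive_post(
    "https://example.com/post/1",
    fetch=lambda u: {"title": "demo", "content": "<p>hi</p>"},
    to_markdown=lambda html: "hi",
    save=lambda md, title: title + ".md",
    localize_images=lambda path: calls.append(("images", path)),
    store=lambda title, md, u: calls.append(("db", title, u)),
)
```

Passing the steps in as arguments keeps the ordering visible in one place; in a real script they would be the functions shown in the Code section.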

Tools

Third-party libraries used: BeautifulSoup, html2text, PooledDB (the code below also calls requests)

Code

Fetching a post:

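import requests
from bs4 import BeautifulSoup

# Request headers; the original does not show them, so this value is an assumption
headers = {"User-Agent": "Mozilla/5.0"}

# Fetch the title and body of a post
def getHtml(blog):
    res = requests.get(blog, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Post title, with surrounding whitespace stripped
    title = soup.find('h1', class_='postTitle').text.strip()
    # Post body container
    content = soup.find('div', class_='blogpost-body')
    # Drop the outer DIV and keep only its inner HTML
    content = content.decode_contents(formatter="html")
    info = {"title": title, "content": content}
    return info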

Converting HTML to Markdown:

import html2text
# Uses the open-source third-party library html2text
text_maker = html2text.HTML2Text()
md = text_maker.handle(info['content'])

Saving to a local file:

import codecs
import sys

def createFile(md, title):
    print('System default encoding: {}'.format(sys.getdefaultencoding()))
    save_file = str(title) + ".md"
    print('About to write file: {}'.format(save_file))
    # r+ opens a file for reading and writing, with the pointer at the start.
    # w+ opens a file for reading and writing; an existing file is overwritten, a missing one is created.
    # a+ opens a file for reading and writing in append mode (pointer at the end); a missing file is created.
    with codecs.open(save_file, 'w+', 'utf-8') as f:
        f.write(md)
    print('Finished writing file: {}'.format(save_file))
    return save_file
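
Because `createFile` uses the post title directly as the file name, a title containing characters such as `/` or `:` could fail to save or end up in an unexpected path. Below is a minimal sketch of one way to clean a title first. `safe_filename` is a hypothetical helper, not part of the original code, and the set of characters it replaces is an assumption.

```python
import re


def safe_filename(title, replacement="_"):
    """Return `title` with characters that are unsafe in file names replaced."""
    # Path separators, characters Windows rejects, and control whitespace
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, str(title))
    # Trim surrounding whitespace and trailing dots
    cleaned = cleaned.strip().rstrip(".")
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"


name = safe_filename("a/b: c?")
empty = safe_filename("   ")
```

The cleaned value could then be passed to `createFile` in place of the raw title.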

Downloading images locally and replacing their URLs:

import os
import re
import time
from itertools import chain

import requests

# Assumed pattern (not shown in the original): Markdown images and <img> tags
img_patten = r'!\[.*?\]\((.*?)\)|<img.*?src=[\'\"](.*?)[\'\"].*?>'

def replace_md_url(md_file):
    """
    Download the images referenced in the given Markdown file and replace their URLs.
    """
    if os.path.splitext(md_file)[1] != '.md':
        print('{} is not a Markdown file; skipping.'.format(md_file))
        return
    cnt_replace = 0
    # Store images in a directory named after the current year and month
    dir_ts = time.strftime('%Y%m', time.localtime())
    os.makedirs(dir_ts, exist_ok=True)
    with open(md_file, 'r', encoding='utf-8') as f:  # open as UTF-8
        post = f.read()
    matches = re.compile(img_patten).findall(post)
    for match in chain(*matches):
        if not match:
            continue
        file_name = dir_ts + "/" + match.split('/')[-1]
        # headers: the same request headers used when fetching the post
        img = requests.get(match, headers=headers)
        # 'wb' instead of 'ab', so a re-run does not append to an existing image
        with open(file_name, 'wb') as img_file:
            img_file.write(img.content)
        new_url = "https://blog.52itstyle.vip/{}".format(file_name)
        # Update the URL in the Markdown text
        post = post.replace(match, new_url)
        cnt_replace = cnt_replace + 1
    # If anything was replaced, overwrite the current Markdown file
    if post and cnt_replace > 0:
        url = "https://blog.52itstyle.vip"
        with open(md_file, 'w', encoding='utf-8') as f:
            f.write(post)
        print('{0}: {1} URL(s) replaced with {2}/{3}'.format(os.path.basename(md_file), cnt_replace, url, dir_ts))
    elif cnt_replace == 0:
        print('No URLs to replace in {}'.format(os.path.basename(md_file)))
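
The original does not show how `img_patten` is defined. The sketch below assumes a pattern with two capture groups, one for Markdown image syntax and one for HTML `<img>` tags, and shows why the function flattens the result with `chain`: with two groups, `re.findall` returns one tuple per match, leaving the unused group as an empty string. The pattern, the sample text and the example.com URLs are illustrative assumptions.

```python
import re
from itertools import chain

# Assumed pattern: group 1 = Markdown image URL, group 2 = <img src="..."> URL
img_patten = r'!\[.*?\]\((.*?)\)|<img.*?src=[\'"](.*?)[\'"].*?>'

sample = (
    "![logo](https://example.com/a/logo.png)\n"
    '<img alt="x" src="https://example.com/b/pic.jpg">\n'
)

# One tuple per match; only one of the two groups is non-empty each time
matches = re.findall(img_patten, sample)
# Flatten the tuples and drop the empty strings
urls = [m for m in chain(*matches) if m]
# Local file names, taken from the last path segment of each URL
names = [u.split("/")[-1] for u in urls]
```

Taking only the last path segment means two images with the same file name in different directories would overwrite each other, so a real script may want to add a hash or counter.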

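Writing to the database:

# Write a row to the database
# `mysql` is the author's database helper (not shown here), presumably built on PooledDB
def write_db(title, content, url):
    sql = "INSERT INTO blog (title, content, url) VALUES (%(title)s, %(content)s, %(url)s);"
    param = {"title": title, "content": content, "url": url}
    mysql.insert(sql, param)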
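
The `mysql` object used by `write_db` is not shown in the original. To illustrate the named-parameter insert without a MySQL server, the sketch below swaps in the standard-library `sqlite3` module with an in-memory database. This is a stand-in, not the author's setup. Note that the placeholder style differs: `sqlite3` uses `:title`, whereas MySQL drivers such as PyMySQL use `%(title)s` as in the original. The table layout is an assumption.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL connection pool
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blog (id INTEGER PRIMARY KEY, title TEXT, content TEXT, url TEXT)"
)


def write_db(conn, title, content, url):
    """Insert one post using named parameters (no string formatting of values)."""
    sql = "INSERT INTO blog (title, content, url) VALUES (:title, :content, :url)"
    param = {"title": title, "content": content, "url": url}
    with conn:  # commits on success, rolls back on an exception
        conn.execute(sql, param)


write_db(conn, "It's a test", "# heading", "https://example.com/post/1")
rows = conn.execute("SELECT title, url FROM blog").fetchall()
```

Passing values as parameters rather than formatting them into the SQL string lets the driver handle quoting, which matters here because post titles and bodies routinely contain quotes.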

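Summary

In the Internet era, open blogging communities are certainly convenient, but a post can also disappear at any time, so it is best to keep your own local backup. You can also pick bloggers you like and crawl their posts for your collection. More Python crawler tutorials will follow in later installments; if there is something you would like to see or learn, leave a comment or send me a private message.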

Source: ITPUB Blog, http://blog.itpub.net/69923331/viewspace-2652850/. If you repost this article, please credit the source; otherwise legal liability will be pursued.
