I built a Python tool that crawls articles from major websites and exports them as Markdown

程序员小八

Tags: python, programming languages, front end, web security

Article link: https://blog.csdn.net/z099164/article/details/135240989

Preface

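While slacking off and reading technical articles recently, I realized I had two needs that I'd like to share:

- Crawl articles from the major technical sites and convert them to Markdown, so that high-quality articles are saved locally and survive even if they are taken down for whatever reason.
- Collect the articles I have published in the past. (I never kept local backups of some of them.)

So I got to work, and within a few hours I had written and published an article crawling tool: Article Crawler.

Below I walk through how I built it.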

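Requirements analysis and technology selection

For a crawling task I picked Python without hesitation. It is the first language most people think of for crawlers, and it offers many convenient packages.

Let's break the requirements down first to decide which Python packages we need.

- Crawl an article from a website, which means locating where the article sits in the page. Besides the article, a page may also contain recommendations, author information, ads, and so on, so we fetch the whole page and then search it for the article content.
- Convert the article's HTML to Markdown and write it to a specified local directory.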

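For the first requirement we use the requests and BeautifulSoup packages.

- Use requests to send a request to the target site and fetch its HTML.
- Use BeautifulSoup to find the article content within that HTML.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It can save you hours or even days of work.

For the second requirement we use the html2text package.

- Use html2text to convert the article's HTML into the corresponding Markdown.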

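The technology stack in summary:

Package	Purpose
requests	Sends a request to the target site and fetches its HTML
BeautifulSoup (bs4)	Quickly finds content in the HTML by the given criteria
html2text	Converts the given HTML into Markdown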

Implementation

[Figure: flow chart of the implementation]

I abstracted the whole flow into a class, ArticleCrawler.

The code lives in article_crawler/article_crawler.py.

Its __init__ method looks like this:

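# Excerpt from class ArticleCrawler. The module needs `import os` and
# `import random`, and defines USER_AGENT_LIST elsewhere.
def __init__(self, url, output_folder, tag, class_, id=''):
    self.url = url
    # Pick a random User-Agent for this crawler instance.
    self.headers = {
        'user-agent': random.choice(USER_AGENT_LIST)
    }
    self.tag = tag
    self.class_ = class_
    self.id = id
    # html_str is not a parameter; it is presumably a module-level HTML
    # template with an {article} placeholder (see parse_detail).
    self.html_str = html_str
    if not os.path.exists(output_folder):
        print(f"{output_folder} does not exist, creating it automatically...")
        os.makedirs(output_folder)
    self.output_folder = output_folder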

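url: address of the page to crawl
output_folder: output directory
tag / class_ / id: together they locate the article within the page.
- For example, open the browser console with F12 and suppose the article is wrapped in this tag: <div id="article_content" class="article_content clearfix"></div>
  
  Here tag is div, class_ is article_content clearfix, and id is article_content.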
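
To make the role of tag / class_ / id concrete, here is a minimal sketch that locates such an element using only the standard library's html.parser. It is a stand-in for illustration, not how the tool works (the tool uses BeautifulSoup); the names ArticleLocator and locate_text and the sample page are made up for this example, and class matching is a plain string comparison.

```python
from html.parser import HTMLParser


class ArticleLocator(HTMLParser):
    """Collect the text inside the first element matching tag, class and id."""

    def __init__(self, tag, class_="", id=""):
        super().__init__()
        self.tag, self.class_, self.id = tag, class_, id
        self.depth = 0       # > 0 while we are inside the matched element
        self.found = False   # only the first match is collected
        self.chunks = []

    def _matches(self, tag, attrs):
        attrs = dict(attrs)
        if tag != self.tag:
            return False
        # Empty class_ / id mean "do not filter on this attribute".
        if self.id and attrs.get("id") != self.id:
            return False
        if self.class_ and attrs.get("class") != self.class_:
            return False
        return True

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so we stop at the right end tag.
            if tag == self.tag:
                self.depth += 1
        elif not self.found and self._matches(tag, attrs):
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def locate_text(html, tag, class_="", id=""):
    parser = ArticleLocator(tag, class_, id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


page = """
<html><body>
  <div class="sidebar">Recommended reading</div>
  <div id="article_content" class="article_content clearfix">
    <h1>Title</h1><div><p>Body text.</p></div>
  </div>
  <div class="ad">Advertisement</div>
</body></html>
"""
print(locate_text(page, "div", class_="article_content clearfix", id="article_content"))
```

The sidebar and the ad are skipped because neither carries the requested id and class; only text inside the matching div is collected.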

The class has three main methods:

send_request: sends a request to the target site and fetches its HTML.

def send_request(self, url):
    response = requests.get(url=url, headers=self.headers)
    response.encoding = "utf-8"
    if response.status_code == 200:
        return response
    # Any other status: return None explicitly so callers can check for failure.
    return None
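
The random User-Agent header that send_request passes along can be sketched in isolation. The example below uses the standard library's urllib.request in place of requests so that nothing is actually sent over the network; build_request is a hypothetical helper, and the USER_AGENT_LIST values are placeholders rather than the list the tool ships with.

```python
import random
import urllib.request

# Placeholder values; the real USER_AGENT_LIST is not shown in the article.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]


def build_request(url, user_agents=USER_AGENT_LIST):
    """Build (but do not send) a request carrying a randomly chosen User-Agent."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/article")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```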

parse_detail: uses BeautifulSoup to locate the article and extract its HTML.

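def parse_detail(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    # Filter only on the attributes that were supplied; an empty id or
    # class_ would otherwise be matched literally and find nothing.
    attrs = {}
    if self.id:
        attrs['id'] = self.id
    if self.class_:
        attrs['class_'] = self.class_
    content = soup.find(self.tag, **attrs)
    html = self.html_str.format(article=content)
    self.write_content(html, 'article')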

write_content: writes the HTML and the rendered Markdown text to the output directory output_folder.

def write_content(self, content, name):
    # Create the HTML and MD subfolders if they are missing.
    os.makedirs(os.path.join(self.output_folder, "HTML"), exist_ok=True)
    os.makedirs(os.path.join(self.output_folder, "MD"), exist_ok=True)
    name = self.change_title(name)
    html_path = os.path.join(self.output_folder, "HTML", name + ".html")
    md_path = os.path.join(self.output_folder, "MD", name + ".md")

    with open(html_path, 'w', encoding="utf-8") as f:
        f.write(content)
        print(f"create {name}.html in {self.output_folder} successfully")

    # Convert the in-memory HTML directly instead of re-reading the file.
    markdown_text = html2text.html2text(content)
    with open(md_path, 'w', encoding='utf-8') as file:
        file.write(markdown_text)
        print(f"create {name}.md in {self.output_folder} successfully")
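
write_content calls self.change_title(name), which the article does not show. Since the result is used as a file name, a helper like this presumably strips characters that file systems reject. The sketch below is only an assumption about its behavior, written as a standalone function; the max_length parameter and the "untitled" fallback are invented for the example.

```python
import re


def change_title(title, max_length=100):
    """Turn an article title into a string that is safe to use as a file name."""
    # Replace characters that are invalid in file names on common platforms.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return cleaned[:max_length] or "untitled"


print(change_title('Python: crawl "any" site / save as Markdown?'))
```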

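Optimization

With ArticleCrawler we have to inspect the page ourselves to find the article element and pass tag / class_ / id by hand, which is tedious.

In day-to-day study we keep returning to the same few sites, such as CSDN, Juejin, Zhihu, and Jianshu, so I extracted each of them into its own class, a subclass of ArticleCrawler.

The methods that change are __init__ and parse_detail: the tag / class_ / id values are hard-coded, so nobody has to supply them.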
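
A minimal sketch of what such a subclass might look like. The base class here is a stripped-down stand-in that only stores the locator, the subclass name CSDNArticleCrawler is hypothetical, and the hard-coded values are the ones from the CSDN example earlier in this article; the real subclasses also override parse_detail, which is omitted.

```python
class ArticleCrawler:
    """Stand-in for the real base class: it only stores the locator values."""

    def __init__(self, url, output_folder, tag, class_, id=''):
        self.url = url
        self.output_folder = output_folder
        self.tag = tag
        self.class_ = class_
        self.id = id


class CSDNArticleCrawler(ArticleCrawler):
    """Hypothetical subclass: callers no longer pass tag / class_ / id."""

    def __init__(self, url, output_folder):
        super().__init__(url, output_folder, tag='div',
                         class_='article_content clearfix',
                         id='article_content')


# A registry from type name to class, in the spirit of the class_dic lookup
# that main performs.
class_dic = {'csdn': CSDNArticleCrawler}

crawler = class_dic['csdn'](
    url='https://blog.csdn.net/z099164/article/details/135240989',
    output_folder='out',
)
print(crawler.tag, crawler.class_, crawler.id)
```

Keeping the site-specific values in the subclass constructor means supporting another site only requires a new subclass and one more registry entry.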

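Running from the command line

The tool is used from the command line, so we need a program entry point, the __main__ file.

We use OptionParser to declare the command's options, including the package description, version number, short and long option names, and help text.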

if __name__ == '__main__':
    from optparse import OptionParser

    # prog, description, version and usage are defined elsewhere in this file.
    parser = OptionParser(prog=prog, description=description, version='%prog ' + version, usage=usage)
    parser.add_option("-u", "--url", dest="url", help="crawled url (required)")
    parser.add_option("-t", "--type", dest="type", default="",
                      help="crawled article type [csdn] | [juejin] | [zhihu] | [jianshu]")
    parser.add_option("-o", "--output_folder", dest="output_folder",
                      help="output html / markdown / pdf folder (required)")
    # default="" keeps website_tag consistent with the other locator options.
    parser.add_option("-w", "--website_tag", dest="website_tag", default="",
                      help="tag name of the element wrapping the article (not required if 'type' is specified)")
    parser.add_option("-c", "--class", dest="class_", default="",
                      help="class of the element wrapping the article (not required if 'type' is specified)")
    parser.add_option("-i", "--id", dest="id", default="",
                      help="id of the element wrapping the article (not required if 'type' is specified)")
    options, args = parser.parse_args()
    main()

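Inside main we validate the arguments against the program logic and report errors such as missing or invalid arguments:
- url and output_folder must not be empty
- type / website_tag / class_ / id must not all be empty at the same time
- type must be one of the supported types
- once validation passes, create the matching crawler object and call its start method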

def main():
    url = options.url
    type = options.type
    output_folder = options.output_folder
    website_tag = options.website_tag
    class_ = options.class_
    id = options.id
    if not url:
        parser.error("url must be specified.")
    if not output_folder:
        parser.error("output folder must be specified.")
    # Truthiness also covers options that are None rather than "".
    if not (type or website_tag or class_ or id):
        parser.error("'type', 'website_tag', 'class_', 'id' cannot be empty at the same time.")
    # Check the type only when one was given; otherwise the custom-locator
    # branch below could never be reached.
    if type and type not in ["csdn", "juejin", "zhihu", "jianshu"]:
        parser.error(
            "The current article type is not supported, you need to specify 'class_' or 'id' to locate the position of the article.")
    if type:
        # class_dic (defined elsewhere) maps each type name to its crawler class.
        crawler = class_dic[type](url=url, output_folder=output_folder)
    else:
        crawler = ArticleCrawler(url=url, output_folder=output_folder, tag=website_tag, class_=class_, id=id)
    crawler.start()
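
The validation rules can also be sketched as a pure function, which makes them easy to try without a command line. This is an illustration under assumptions, not the tool's code: validate_options and SUPPORTED_TYPES are hypothetical names, the function returns an error message instead of calling parser.error, and the type is checked only when one is given so that a custom tag / class_ / id locator remains usable.

```python
SUPPORTED_TYPES = ("csdn", "juejin", "zhihu", "jianshu")


def validate_options(url, output_folder, type="", website_tag="", class_="", id=""):
    """Return an error message, or None when the combination of options is valid."""
    if not url:
        return "url must be specified."
    if not output_folder:
        return "output folder must be specified."
    # At least one way of locating the article is required.
    if not (type or website_tag or class_ or id):
        return "'type', 'website_tag', 'class_', 'id' cannot be empty at the same time."
    # An explicit type must be one of the supported sites.
    if type and type not in SUPPORTED_TYPES:
        return "The current article type is not supported."
    return None


print(validate_options("https://example.com/a", "out", type="zhihu"))
print(validate_options("https://example.com/a", "out", website_tag="div", id="post"))
print(validate_options("https://example.com/a", "out"))
print(validate_options("https://example.com/a", "out", type="medium"))
```

The first two calls describe valid invocations and yield None; the last two yield the "cannot be empty" and "not supported" messages respectively.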

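Final result

Finally, we package the tool and publish it to PyPI, install it locally again, and run:

pip install article-crawler
python3 -m article_crawler -u https://zhuanlan.zhihu.com/p/644525159 -o /Users/lty/Downloads/article_output -t zhihu

The command writes the HTML and Markdown files to the output folder.

Opening the generated Markdown file shows that, apart from some line-break issues, the conversion is quite good and essentially matches the original.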

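Summary

In this article I built the Article Crawler tool from scratch and walked through the process in detail: requirements analysis, technology selection, implementation, optimization, and the final result.

How to publish a PyPI package from scratch will be covered in detail in my next article.

That's all for today. If you found this useful, a like and a Star would be much appreciated. See you next time!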
