[Graduation Project Notes] Two Ways to Convert Markdown to Plain Text

临风而眠

Published 2024-04-27 10:30:34

Article link: https://blog.csdn.net/qq_52431436/article/details/138243813

Reference: https://segmentfault.com/q/1010000043277064

Method 1: md → html → text

First convert the Markdown text to HTML with markdown.markdown,
then use bs4 (BeautifulSoup) to extract the plain-text content.

import markdown
from bs4 import BeautifulSoup

def markdown_to_text(md):
    html = markdown.markdown(md)
    print("Generated HTML:\n", html)  # print the generated HTML for inspection
    soup = BeautifulSoup(html, features='html.parser')
    return soup.get_text()

# Example usage
md = """
**Bold Text** and _Italic Text_ with `code snippet` inclusion.
[Example Link](http://example.com)
![Example Image](http://example.com/image.png)
### Subheading: Lists
- Bullet list item 1
- Bullet list item 2
  - Nested bullet item
1. Numbered list item 1
2. Numbered list item 2
   1. Nested numbered item

#### Subheading: Blockquotes
> Blockquote text under a subheading.

##### Subheading: Code Blocks

###### Subheading: Tables
| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Cell 1   | Cell 2   | Cell 3   |
| Cell 4   | Cell 5   | Cell 6   |
| Cell 7   | Cell 8   | Cell 9   |
"""

text = markdown_to_text(md)
print("Extracted Text:\n", text)
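
The second step of Method 1 amounts to keeping the text nodes of the HTML and discarding the tags. As a rough illustration of that idea only, here is a minimal sketch that does the same thing with the standard library's html.parser instead of bs4. The names TextExtractor and html_to_text are hypothetical helpers introduced for this example, and the sample HTML is hand-written rather than taken from the run above.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags; tags themselves are ignored.
        self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hand-written HTML, similar in shape to what a Markdown renderer emits
sample_html = "<p><strong>Bold Text</strong> and <em>Italic Text</em></p>"
print(html_to_text(sample_html))  # Bold Text and Italic Text
```

Because only text nodes are kept, anything stored in attributes (for example an image's alt text) does not appear in the result.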

Method 2: md → text

Extract the text directly from Markdown's parse tree

from markdown import Markdown
from io import StringIO


def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# patching Markdown
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False


def unmark(text):
    return __md.convert(text)


# Print the extracted text (md is the sample string defined in Method 1)
print(unmark(md))
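
To see how unmark_element walks the tree, it can be run on a small element tree built by hand. This is a sketch under the assumption that the parse tree consists of xml.etree.ElementTree elements, where each element carries its own text (before the first child) and a tail (the text after its closing tag). The sample XML string is made up for illustration.

```python
import xml.etree.ElementTree as ET
from io import StringIO


def unmark_element(element, stream=None):
    # Same traversal as in Method 2: own text, then children, then tail
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()


# Hand-built tree standing in for a parsed paragraph
root = ET.fromstring("<p><strong>Bold</strong> and <em>italic</em> text</p>")
print(unmark_element(root))  # Bold and italic text
```

Here " and " is the tail of the strong element and " text" is the tail of the em element, which is why the function has to write element.tail after visiting the children.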