Method 1: md → html → text
First convert the Markdown text to HTML with markdown.markdown,
then extract the plain-text content with bs4 (BeautifulSoup).
import markdown
from bs4 import BeautifulSoup

def markdown_to_text(md):
    html = markdown.markdown(md)
    print("Generated HTML:\n", html)  # Print the generated HTML for inspection
    soup = BeautifulSoup(html, features='html.parser')
    return soup.get_text()

# Example usage
md = """
**Bold Text** and _Italic Text_ with `code snippet` inclusion.
[Example Link](http://example.com)
![Example Image](http://example.com/image.png)
### Subheading: Lists
- Bullet list item 1
- Bullet list item 2
    - Nested bullet item
1. Numbered list item 1
2. Numbered list item 2
    1. Nested numbered item
#### Subheading: Blockquotes
> Blockquote text under a subheading.
##### Subheading: Code Blocks
###### Subheading: Tables
| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Cell 1 | Cell 2 | Cell 3 |
| Cell 4 | Cell 5 | Cell 6 |
| Cell 7 | Cell 8 | Cell 9 |
"""
text = markdown_to_text(md)
print("Extracted Text:\n", text)
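One thing to note about the sample above: by default markdown.markdown does not parse pipe tables, so the table rows come through as literal text with `|` characters. Below is a minimal sketch, assuming the `tables` extension that ships with Python-Markdown is acceptable here. The function name `markdown_to_text_with_tables` and the sample `table_md` are illustrative and not part of the original code.

```python
import markdown
from bs4 import BeautifulSoup

def markdown_to_text_with_tables(md_source):
    # "tables" is an extension bundled with Python-Markdown; without it,
    # pipe tables are left as ordinary paragraph text.
    html = markdown.markdown(md_source, extensions=["tables"])
    soup = BeautifulSoup(html, features="html.parser")
    # separator=" " keeps adjacent cells from running together;
    # strip=True trims whitespace around each text fragment.
    return soup.get_text(separator=" ", strip=True)

# Hypothetical sample input, shortened from the table in the example above
table_md = """
| Header 1 | Header 2 |
|----------|----------|
| Cell 1   | Cell 2   |
"""
print(markdown_to_text_with_tables(table_md))
```

Joining with a single space flattens the table into one line. If row boundaries matter, passing separator="\n" instead is one option to try.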
Method 2: md → text
Extract the text directly from the element tree that Markdown builds while parsing, skipping the HTML step.
from markdown import Markdown
from io import StringIO

def unmark_element(element, stream=None):
    if stream is None:
        stream = StringIO()
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()

# patching Markdown
Markdown.output_formats["plain"] = unmark_element
__md = Markdown(output_format="plain")
__md.stripTopLevelTags = False

def unmark(text):
    return __md.convert(text)

# Print the extracted text
print(unmark(md))
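Both methods drop the alt text of images, because it lives in the `alt` attribute rather than in the element's text. Below is a minimal sketch of a variant of the serializer that also writes the alt text. The names `unmark_element_with_alt`, `unmark_with_alt`, and the format key "plain_alt" are made up for illustration and are not part of Python-Markdown or the original code.

```python
from io import StringIO
from markdown import Markdown

def unmark_element_with_alt(element, stream=None):
    """Serialize an ElementTree node to plain text, keeping image alt text."""
    if stream is None:
        stream = StringIO()
    # Images carry their description in the "alt" attribute, not in .text
    if element.tag == "img":
        stream.write(element.get("alt", ""))
    if element.text:
        stream.write(element.text)
    for sub in element:
        unmark_element_with_alt(sub, stream)
    if element.tail:
        stream.write(element.tail)
    return stream.getvalue()

# Registered under a separate, made-up key so it does not replace "plain"
Markdown.output_formats["plain_alt"] = unmark_element_with_alt
_md_alt = Markdown(output_format="plain_alt")
_md_alt.stripTopLevelTags = False

def unmark_with_alt(text):
    return _md_alt.convert(text)

# Hypothetical input: the image URL is dropped, the alt text "Logo" is kept
print(unmark_with_alt("See ![Logo](http://example.com/logo.png) here"))
```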
- Below is GPT's explanation of the code