Cleaning up reposted blog HTML with Python

Latest recommended article published on 2023-04-13 18:00:04

Bilibili: 阿里武


Category column: python笔记 (Python notes). Tags: python, BeautifulSoup

Article link: https://blog.csdn.net/qq874455953/article/details/83722211


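Background

When reposting someone else's blog post, we usually copy its HTML and paste it into an editor. That HTML tends to carry a lot of clutter, such as script and svg tags, which breaks the layout.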

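For example, a problem caused by the ul tag:

[Screenshot: layout problem caused by the ul tag]

And one caused by the svg tag:

[Screenshot: layout problem caused by the svg tag]

You could of course delete the unwanted parts by hand, but as a programmer, anything the computer can do is something I would rather not do myself, which leads to the next step.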

Implementation

The code is written in Python 3 because Python has BeautifulSoup, which handles HTML well, for example by removing specified tags.

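Analyzing the causes of the layout problems

Numbers appear below the code lines because of the <ul> tags.
The beginning of the post renders incorrectly because of the HTML comments and the <svg> tags.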
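Before removing anything, it can help to confirm that the suspected tags and comments are actually present in the copied HTML. The following is a minimal sketch of such a check. The sample markup, the `pre-numbering` class name, and the helper `count_suspects` are illustrative assumptions and do not come from the original post.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample standing in for copied blog HTML (not taken from a real page)
SAMPLE_HTML = """
<div>
  <!-- flowchart placeholder -->
  <svg><path d="M0 0"/></svg>
  <pre><code>print("hi")</code><ul class="pre-numbering"><li>1</li></ul></pre>
  <script>var x = 1;</script>
</div>
"""


def count_suspects(html, tag_names=("ul", "svg", "script")):
    """Count each suspect tag and the HTML comments in the given markup."""
    soup = BeautifulSoup(html, "html.parser")
    counts = {name: len(soup.find_all(name)) for name in tag_names}
    # Comment is a NavigableString subclass, so a string filter can pick it out
    counts["comment"] = len(soup.find_all(string=lambda s: isinstance(s, Comment)))
    return counts


if __name__ == "__main__":
    print(count_suspects(SAMPLE_HTML))
```

A non-zero count for a tag suggests it is worth removing before pasting the HTML into the editor.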

How to remove specific tags and comments

# Remove every <ul>, <svg> and <script> tag (these are tags, not attributes)
for s in soup(["ul", "svg", "script"]):
    s.extract()
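The snippet above assumes an existing `soup` object and only covers tags. As a self-contained sketch of the same idea that also strips comments, the example below runs on a small made-up HTML string. The helper name `strip_tags_and_comments` and the sample string are assumptions for illustration only.

```python
from bs4 import BeautifulSoup, Comment


def strip_tags_and_comments(html, tag_names=("ul", "svg", "script")):
    """Return html with the named tags (and their contents) and all comments removed."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and returns a plain list,
    # so extracting while iterating is safe
    for tag in soup.find_all(list(tag_names)):
        tag.extract()
    # Comments are special strings; select them with an isinstance filter
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


if __name__ == "__main__":
    # Hypothetical input, not taken from a real blog page
    sample = '<div><!-- note --><p>Hello</p><svg><path/></svg><ul><li>1</li></ul></div>'
    print(strip_tags_and_comments(sample))
```

Wrapping the cleanup in a function that takes and returns a string makes it possible to try it on small inputs without fetching a page.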

Python code

# Fetch a blog post by URL and print its cleaned-up HTML for pasting into a Markdown editor
import requests
from bs4 import BeautifulSoup, Comment


def get_page_source(url):
    """Return the page HTML as text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # Guess the encoding from the content so that Chinese text decodes correctly
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return None


if __name__ == '__main__':

    blogUrl = "https://blog.csdn.net/qq_36124194/article/details/83686823"

    # blogUrl = input("Enter the URL of the post to repost\n")

    blogText = get_page_source(blogUrl)
    if blogText is None:
        raise SystemExit("Failed to fetch " + blogUrl)

    soup = BeautifulSoup(blogText, 'html.parser')

    # Remove every <ul>, <svg> and <script> tag
    for s in soup(["ul", "svg", "script"]):
        s.extract()
    # Remove HTML comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Get the article body
    articleText = soup.find('div', attrs={'class': 'markdown_views prism-atom-one-dark'})
    if articleText is None:
        raise SystemExit("Article body not found")
    # Prepend a note giving the source URL
    finalStr = "## Reposted from   \n" + "## " + blogUrl + "  \n" + str(articleText)

    print(finalStr)
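The script looks up the article body by the exact class string 'markdown_views prism-atom-one-dark'. As far as I understand BeautifulSoup's class matching, that only succeeds when the div's class attribute is exactly that string, so a post using a different code theme might not be found. A possibly more tolerant variant matches on the single class `markdown_views`. The helper name `find_article_body` and the `prism-github` theme class in the sample are hypothetical.

```python
from bs4 import BeautifulSoup


def find_article_body(html, css_class="markdown_views"):
    """Return the first <div> that carries css_class among its classes, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches against each individual class of a multi-class attribute
    return soup.find("div", class_=css_class)


if __name__ == "__main__":
    # Hypothetical markup with a different theme class than the one in the script
    sample = '<div class="markdown_views prism-github"><p>Body</p></div>'
    body = find_article_body(sample)
    print(body if body is not None else "article body not found")
```

Checking for None before using the result avoids printing the literal text "None" when the page layout differs from what is expected.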