A way to extract the body text from HTML-formatted text

Latest recommended article published on 2024-04-11 19:00:12

数分大拿的Statham

Tags: html, front end, data analysis, pandas, big data

Article link: https://blog.csdn.net/weixin_44228413/article/details/136630736

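A database table holds text data in many different HTML formats. Extracting the content with regular expressions alone would mean writing a pattern for each specific format, and the formats are not fixed.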

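Here is one method. The raw data all looks like the sample below: messy and disorganized, yet the body text still has to be extracted from it.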

'<p><img class="rich_pages wxw-img __bg_gif js_darkmode__0" data-backh="40" data-backw="360" data-galleryid="" data-imgfileid="503238271" data-ratio="0.1111111111111111" src="/gts-server/specialManager/editorFile?bucketName=speciallibrary&fileName=dynamic/weixin/公众号-军工圈（中国）/2024/02/19/8db766385ee94ff6a65b550d01ea778c/图1.jpg" data-type="gif" "450" style="margin: 0px;padding: 0px;outline: 0px;max-width: 100%;vertical-align: bottom;color: rgba(0, 0, 0, 0.9);font-family: system-ui, -apple-system, BlinkMacSystemFont, &quot;Helvetica Neue&quot;, &quot;PingFang SC&quot;, &quot;Hiragino Sans GB&quot;, &quot;Microsoft YaHei UI&quot;, &quot;Microsoft YaHei&quot;, Arial, sans-serif;font-size: var(--articleFontsize);font-style: normal;font-variant-ligatures: normal;font-variant-caps: normal;font-weight: 400;orphans: 2;text-align: justify;text-indent: 0px;text-transform: none;widows: 2;word-spacing: 0px;-webkit-text-stroke-width: 0px;white-space: normal;text-decoration-thickness: initial;text-decoration-style: initial;text-decoration-color: initial;letter-spacing: 0.544px;background-color: rgb(255, 255, 255);box-sizing: border-box !important;overflow-wrap: break-word !important;height: auto !important;width: 360px !important;visibility: visible !important;" "360px"></p>
 <p><span style="font-size: var(--articleFontsize);letter-spacing: 0.034em;"><span class="wx_text_underline">在刚过去的春节假期中，最让中国军迷们欢欣鼓舞的事件就是在中国第一艘航母辽宁舰的飞行甲板上，出现了歼</span></span><span style="font-size: var(--articleFontsize);letter-spacing: 0.034em;"><span class="wx_text_underline">-35</span></span><span style="font-size: var(--articleFontsize);letter-spacing: 0.034em;"><span><span class="wx_text_underline">隐形舰载战斗机的身影——毫无疑问，这意味着中国海军现役三艘航母都将搭载隐形战斗机，极大提升远海作战能力和威慑效能。日本吹嘘许久的“</span></span><span class="wx_text_underline">出云<i class="wx_search_keyword"></i></span><span><span class="wx_text_underline">”级航母比中国滑跃式航母更有优势的说法，也彻底被打破了！</span></span></span><br></p>
 <p style="text-align: center;"><img class="rich_pages wxw-img" data-galleryid="" data-imgfileid="100013093" data-ratio="0.6971935007385525" data-s="300,640" src="/gts-server/specialManager/editorFile?bucketName=speciallibrary&fileName=dynamic/weixin/公众号-军工圈（中国）/2024/02/19/8db766385ee94ff6a65b550d01ea778c/图2.jpg" data-type="png" "677" style="height: auto !important;visibility: visible !important;width: 360px !important;" "360px"></p>'

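We know that HTML code is always built from tags, so we can use the angle brackets <> to match the content of every tag.

Then we replace each <> tag with an empty string, which leaves only the body text.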
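
The two steps above can be sketched on a single string as follows. This is only a minimal illustration, not the full pipeline: the function name `strip_tags` and the sample string are made up for the example, and it uses `re.sub` to do the matching and the replacement in one call.

```python
import re


# Hypothetical helper name, chosen for this example only.
def strip_tags(html_text: str) -> str:
    """Remove everything that looks like <...> and return what is left."""
    # <[^>]*> matches "<", then any run of characters other than ">", then ">".
    return re.sub(r'<[^>]*>', '', html_text)


# Made-up sample in the same spirit as the raw data shown earlier.
sample = '<p><span class="wx_text_underline">Hello</span><br>world</p>'
cleaned = strip_tags(sample)
print(cleaned)
```

Because `[^>]` also matches line breaks, a tag that is split across several lines is removed as well.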

The code is below. It uses pandas to work directly on the data in its original format.

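import re

import pandas as pd
from sqlalchemy import create_engine

# Database engine
engine = create_engine('mysql+pymysql://user_name:password@192.168.0.182:3306/name?charset=utf8mb4')

# Read the table into a DataFrame
df = pd.read_sql('resource', con=engine)


# The key part: this loop does the actual cleanup
for i in range(0, len(df)):
    text = df.loc[i, 'text']
    # Use a regex to capture whatever sits between < and >;
    # re.DOTALL lets a tag that spans several lines match as well
    ls = re.findall(r'<(.*?)>', text, flags=re.DOTALL)
    # ls is a list that may contain duplicates,
    # so deduplicate it with a set
    ls = set(ls)
    for j in ls:
        # Rebuild the full tag string
        replace_str = '<' + j + '>'
        # Remove every occurrence of that tag
        text = text.replace(replace_str, '')
    # Write the cleaned text back to its original position
    df.loc[i, 'text'] = text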
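
For larger tables, the same cleanup could also be written as one vectorized call instead of a row-by-row loop. The sketch below assumes a column named `text`, as in the code above, and builds a tiny in-memory DataFrame with made-up rows in place of the database table so that it can run on its own.

```python
import pandas as pd

# Made-up rows standing in for the table read from the database.
df = pd.DataFrame({
    'text': [
        '<p><span class="a">first</span> row</p>',
        '<p style="text-align: center;"><img src="x.jpg"></p>',
    ]
})

# Series.str.replace with regex=True applies the pattern to every row.
# <[^>]*> matches one whole tag, from "<" up to the next ">".
df['text'] = df['text'].str.replace(r'<[^>]*>', '', regex=True)

print(df['text'].tolist())
```

A row that contains only tags, such as the image paragraph here, ends up as an empty string, so it may be worth filtering those rows out afterwards.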

With the steps above, the content has been cleaned up.
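
One caveat: a `<...>` pattern treats every `>` as the end of a tag, so a `>` inside a quoted attribute value can leave debris behind, and character references such as `&amp;` in the body stay encoded. If that matters for your data, an HTML parser is a possible alternative to the regex. The sketch below swaps in Python's standard-library `html.parser.HTMLParser`; the names `TextExtractor` and `extract_text` and the sample string are hypothetical and are not part of the method described above.

```python
from html.parser import HTMLParser


# Hypothetical class name for this example.
class TextExtractor(HTMLParser):
    """Collect only the text nodes and ignore every tag."""

    def __init__(self):
        # convert_charrefs=True turns references like &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.parts.append(data)


# Hypothetical helper name for this example.
def extract_text(html_text: str) -> str:
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


# Made-up sample with a ">" inside an attribute value and an encoded "&".
sample = '<p title="a > b"><span>Tom &amp; Jerry</span><br></p>'
print(extract_text(sample))
```

The trade-off is a little more code in exchange for tag handling that follows HTML quoting rules rather than a simple bracket match.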
