基于文本内容的敏感信息识别
一、数据获取与处理
# Load the sensitive and non-sensitive training sets, then merge them
# into one labelled DataFrame with columns ['content', 'label'].
import pandas as pd

sensitive_df = pd.read_csv('../训练数据集/train_sensitiveness.csv', encoding='gb18030')
normal_df = pd.read_csv('../训练数据集/train_insensitiveness.csv', encoding='gb18030')

# Stack the two sets vertically and renumber the rows from 0.
data = pd.concat([sensitive_df, normal_df], axis=0)
data.columns = ['content', 'label']
data.reset_index(inplace=True, drop=True)

# Notebook-style inspection of the merged size and class distribution.
data.shape
data.label.value_counts()
结果:
二、违规和非违规数据占比情况分析
代码块:
# Pie chart of the violating vs. non-violating class split.
import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = 'Songti SC'  # CJK-capable font for the labels
label_counts = data['label'].value_counts()
plt.pie(label_counts, labels=['非违规', '违规'], autopct='%.2f%%')
plt.show()
运行程序生成饼图:
三、绘制词云图
# Tokenise the comments, remove stop words, count token frequencies and
# render a word cloud of the whole corpus.
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from tkinter import _flatten

# 1 Word-frequency statistics
# 1.1 Tokenise every comment with jieba (list of tokens per row).
data_cut = data['content'].apply(jieba.lcut)

# 1.2 Remove stop words.
# BUG FIX: the body of the `with` block was not indented in the original.
# Also use a set for O(1) membership tests instead of a list.
with open('./stoplist.txt', 'r', encoding='utf-8') as f:
    stop = set(f.read().split())
data_after = data_cut.apply(lambda x: [w for w in x if w not in stop])
# Drop bare-space tokens the tokeniser leaves behind.
data_after = data_after.apply(lambda x: [i for i in x if i != ' '])

# 1.3 Count how often each remaining token occurs over the whole corpus.
num = pd.Series(_flatten(list(data_after))).value_counts()

# 2 Word-cloud rendering
# 2.2 Configure the cloud (CJK font so Chinese tokens render correctly).
wc = WordCloud(font_path='./simhei.ttf',
               background_color='white')
wc.fit_words(num)
# 2.3 Display it.
plt.imshow(wc)
plt.axis('off')
plt.show()
运行结果:
# Per-document token counts, plotted over the corpus, then the tokenised
# comments split by label for per-class analysis.
# BUG FIX: the original read data['cut'] without ever creating that column;
# persist the cleaned token lists (data_after) as 'cut' first.
data['cut'] = data_after
num = data['cut'].apply(lambda x: len(x))
import matplotlib.pyplot as plt
plt.plot(range(len(num)), num)
plt.show()

# Token lists of the sensitive (label == 1) comments.
ind1 = data['label'] == 1
sens_comment = data.loc[ind1, 'cut']
# Token lists of the non-sensitive (label == 0) comments.
ind2 = data['label'] == 0
insens_comment = data.loc[ind2, 'cut']
def draw_wc(data):
    """Print the 10 most frequent tokens in *data* (an iterable of token
    lists) and display a word cloud of the full frequency table.

    BUG FIX: the function body was completely unindented in the original,
    which is a SyntaxError; indentation restored.
    """
    # 1.3 Frequency of every token across all documents.
    num = pd.Series(_flatten(list(data))).value_counts()
    print(num[:10])
    # 2 Word-cloud rendering (CJK font so Chinese tokens display).
    wc = WordCloud(font_path='../data/tmp/simhei.ttf',
                   background_color='white')
    wc.fit_words(num)
    # 2.3 Display it.
    plt.imshow(wc)
    plt.axis('off')
    plt.show()
四、主题模型设计
```python
# LDA topic model on the sensitive class, then TF-IDF features, SMOTE
# class balancing, and a train/test split for the classifiers below.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

neg_dict = Dictionary(sens_comment)                       # token -> id mapping
neg_corpus = [neg_dict.doc2bow(i) for i in sens_comment]  # bag-of-words corpus
neg_lda = LdaModel(neg_corpus, num_topics=3, id2word=neg_dict)  # train LDA

# BUG FIX: the for-loop body was not indented in the original.
print("\n敏感信息:")
for i in range(3):
    print("主题%d : " % i)
    print(neg_lda.print_topic(i, topn=15))  # top 15 words of each topic

# Join the token lists back into whitespace-separated strings for TF-IDF.
data['cut'] = data['cut'].apply(lambda x: ' '.join(x))
data_new = data[['cut', 'label']]

from sklearn.feature_extraction.text import TfidfVectorizer
# token_pattern keeps single-character tokens (the default drops them),
# which matters for Chinese text.
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
tfidf_data = vectorizer.fit_transform(data['cut'])
tfidf_data.toarray()

# Balance the classes with SMOTE oversampling.
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
model_smote = SMOTE()
model_smote_x, model_smote_y = model_smote.fit_resample(tfidf_data.toarray(),
                                                        data_new['label'].values)
model_smote_x = pd.DataFrame(model_smote_x)
len(model_smote_y)
model_smote_y.shape

# Hold out 20% of the balanced data for evaluation; fixed seed for
# reproducibility.
from sklearn.model_selection import train_test_split
test_ratio = 0.2
tfidf_train, tfidf_test, y_train, y_test = train_test_split(
    model_smote_x, model_smote_y, test_size=test_ratio, random_state=123)
模型一:构建决策树模型
# Model 1: decision tree classifier on the TF-IDF features.
from sklearn.metrics import accuracy_score, recall_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# class_weight='balanced' compensates for any residual class imbalance.
DT_clf1 = DecisionTreeClassifier(class_weight='balanced').fit(tfidf_train, y_train)
res1 = DT_clf1.predict(tfidf_test)  # predict on the held-out set

print('DT test accuracy %s' % accuracy_score(y_test, res1))
# BUG FIX: the original label said "F1_score" while the value printed was
# recall_score — the label now matches the metric actually computed.
print('DT test recall %s' % recall_score(y_test, res1))
print(classification_report(y_test, res1))  # full per-class report
相关数据集与代码包请到主页资源库,欢迎大家评论交流指导!