Text Classification: A First Look

vitacode

Tags: nlp

Article link: https://blog.csdn.net/weixin_45996005/article/details/107521730

1. Read the anonymized data and get an overview

# Tokens per text (split on whitespace), followed by summary statistics
df_train['text'].map(lambda x: len(x.split())).describe()

count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
75%        1131.000000
max       57921.000000
Name: text, dtype: float64

There are 200,000 texts. The average length is about 907 tokens, the shortest text has 2 tokens, and the longest has 57,921.

2. Sentence length distribution
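
The figure for this step is not reproduced here, so below is a minimal sketch of how such a histogram could be drawn. The helper names `token_lengths` and `plot_length_histogram` and the sample token ids are made up for illustration; they are not from the original post.

```python
def token_lengths(texts):
    """Return the number of whitespace-separated tokens in each text."""
    return [len(t.split()) for t in texts]


def plot_length_histogram(lengths, bins=200):
    """Draw a histogram of text lengths (needs matplotlib installed)."""
    # Imported lazily so token_lengths() can be used without matplotlib.
    import matplotlib.pyplot as plt

    plt.hist(lengths, bins=bins)
    plt.xlabel("Text length (tokens)")
    plt.ylabel("Number of texts")
    plt.title("Histogram of text length")
    plt.show()


# Toy stand-in for the 'text' column; these token ids are invented.
sample_texts = ["57 44 66 56", "2465 4893", "7346 4464 3134 55 900"]
lengths = token_lengths(sample_texts)
print(lengths)  # [4, 2, 5]
# plot_length_histogram(lengths)  # uncomment to draw the figure
```

With the real data, passing the `text` column of the training DataFrame to `token_lengths` should work the same way, since iterating a pandas Series yields its values.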

3. Class distribution
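
The class distribution plot is also missing from this copy. A rough sketch of one way to produce it is shown here; `class_counts`, `plot_class_counts`, and the sample labels are hypothetical names and data chosen for the example.

```python
from collections import Counter


def class_counts(labels):
    """Count how many samples carry each label, ordered by label."""
    return dict(sorted(Counter(labels).items()))


def plot_class_counts(counts):
    """Draw a bar chart of samples per class (needs matplotlib installed)."""
    # Imported lazily so class_counts() can be used without matplotlib.
    import matplotlib.pyplot as plt

    plt.bar([str(label) for label in counts], list(counts.values()))
    plt.xlabel("Class")
    plt.ylabel("Number of texts")
    plt.title("Class count")
    plt.show()


# Toy stand-in for the 'label' column; these labels are invented.
sample_labels = [0, 1, 0, 2, 1, 0]
counts = class_counts(sample_labels)
print(counts)  # {0: 3, 1: 2, 2: 1}
# plot_class_counts(counts)  # uncomment to draw the figure
```

If the labels are already in a pandas DataFrame, `train_df['label'].value_counts().plot(kind='bar')` is a shorter route to a similar chart.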

4. Character frequency statistics
Look at the 20 most frequent characters.

''' Find the most frequent character in each news class '''
import pandas as pd
from collections import Counter
train_df = pd.read_csv('data/train_set.csv', sep='\t')
# Concatenate all texts that belong to the same class
for i in range(0, 10):
    df = train_df[train_df['label'] == i]['text']
    all_lines = ' '.join(df)  # join every row's text with a single space
    word_count = Counter(all_lines.split(" "))  # count occurrences of each numeric token
    print(i, word_count.most_common(1)[0])  # (token, count) of the most frequent token
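
The loop above prints only the single most frequent character per class. For the top-20 view over the whole corpus mentioned at the start of this step, a minimal sketch might look like the following; `top_tokens` and the sample token ids are assumptions made for illustration.

```python
from collections import Counter


def top_tokens(texts, n=20):
    """Return the n most frequent whitespace-separated tokens across all texts."""
    counter = Counter()
    for text in texts:
        # Update incrementally instead of joining everything into one huge string.
        counter.update(text.split())
    return counter.most_common(n)


# Toy stand-in for the 'text' column; these token ids are invented.
sample_texts = ["3750 648 900", "3750 648 3750", "648 3750"]
top = top_tokens(sample_texts, n=2)
print(top)  # [('3750', 4), ('648', 3)]
```

Calling `top_tokens(train_df['text'])` on the real data would use the default `n=20`. Updating the counter text by text avoids building one very large concatenated string, which may matter with 200,000 rows.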