NLP Beginner Competition Task 2: Data Reading and Analysis

Article link: https://blog.csdn.net/weixin_42517469/article/details/107524368

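This article shows how to read the data for an NLP task with pandas and then analyzes it in detail, covering sentence length, the distribution of news categories, and character statistics. The findings: sentences average about 900 characters; technology news is the most common category and horoscope news the least common, so the class distribution is imbalanced; and each news item contains about 80 sentences on average.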

Data Reading

Reading the data with pandas

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

import pandas as pd
train_df = pd.read_csv(r'/content/drive/My Drive/天池比赛/train_set.csv', sep='\t')
train_df.head()

	label	text
0	2	2967 6758 339 2021 1854 3731 4109 3792 4149 15...
1	11	4464 486 6352 5619 2465 4802 1452 3137 5778 54...
2	3	7346 4068 5074 3747 5681 6093 1777 2226 7354 6...
3	2	7159 948 4866 2109 5520 2490 211 3956 5520 549...
4	3	3646 3055 3055 2490 4659 6065 3370 5814 2465 5...

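The read_csv call here involves three parts:

The path of the file to read
The separator sep, which is the delimiter between columns; setting it to \t is enough here
The number of rows to read, nrows (optional; it is not passed in the call above, so the whole file is read)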
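
As a minimal sketch of these three arguments, the snippet below reads a tiny in-memory tab-separated sample instead of the competition file. The sample rows and the names sample_tsv and sample_df are made up for illustration, and nrows=2 is an arbitrary choice.

```python
import io

import pandas as pd

# Hypothetical stand-in for train_set.csv: a header plus three tab-separated rows.
sample_tsv = "label\ttext\n2\t2967 6758 339\n11\t4464 486\n3\t7346 4068 5074 3747\n"

# Source (here a file-like object instead of a path), column separator,
# and the number of data rows to read.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", nrows=2)

print(sample_df.shape)
print(sample_df["label"].tolist())
```

On the full file, passing nrows can be a convenient way to load only a small slice while experimenting.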

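Data Analysis

Sentence Length Analysis

The length of each sentence can be obtained directly by counting its words, that is, the space-separated tokens in the text column.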
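
A toy illustration of this counting, using made-up texts rather than competition data: str.split() with no argument splits on any run of whitespace, so the token count equals the number of space-separated IDs. The vectorized .str.split().str.len() form is shown only as an equivalent alternative to apply; toy_df and text_len_vec are hypothetical names.

```python
import pandas as pd

# Made-up rows in the same shape as the competition data (space-separated token IDs).
toy_df = pd.DataFrame({"label": [2, 11], "text": ["2967 6758 339 2021", "4464 486"]})

# Count whitespace-separated tokens per row.
toy_df["text_len"] = toy_df["text"].apply(lambda x: len(x.split()))

# Equivalent vectorized form using the pandas string accessor.
toy_df["text_len_vec"] = toy_df["text"].str.split().str.len()

print(toy_df[["text_len", "text_len_vec"]])
```

The same apply expression is used on the full training set: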

train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split()))
train_df.head()

	label	text	text_len
0	2	2967 6758 339 2021 1854 3731 4109 3792 4149 15...	1057
1	11	4464 486 6352 5619 2465 4802 1452 3137 5778 54...	486
2	3	7346 4068 5074 3747 5681 6093 1777 2226 7354 6...	764
3	2