Data Preprocessing Methods for NLP Text Parsing

shu_qdHao

Categories: tensorflow; training, validation, and test sets
Tags: NLP, data preprocessing, text analysis

Article link: https://blog.csdn.net/weixin_37203756/article/details/80671393

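Suppose we have a multi-label text classification task. Each record in the dataset has the following format:

    w9410 w305 w1893 w307 w3259 w4480 w1718 w5700 w18973 w346 w11 w855 w1038 w12475 w146978 w11 w1076 w25 w7512 w45368 w1718 w4668 w6 w11061 w111 c866 c28 c423 c1869 c1331 c431 c17 c204 c4 c274 c56 c1841 c1770 c3266 c17 c350 c4 c370 c116 c406 c734 c28 c423 c768 c769 c485 c11 c506 c734 c184 __label__807273409165680991 __label__8175048003539471998

This record has two labels, each introduced by the __label__ prefix. Everything else is a sequence of opaque tokens (the text has been anonymized), so we need to build the word_vocabulary ourselves.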
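To make the record format concrete, here is a minimal sketch of how a single line could be split into its tokens and labels. The helper name `parse_line` and the shortened sample string are assumptions for illustration only; they do not appear in the original post.

```python
def parse_line(line, label_prefix="__label__"):
    """Split one raw record into (tokens, labels).

    Hypothetical helper: everything before the first label prefix is
    treated as whitespace-separated tokens, and each later chunk is
    treated as one label id.
    """
    parts = line.strip().split(label_prefix)
    tokens = parts[0].split()
    labels = [p.strip() for p in parts[1:] if p.strip()]
    return tokens, labels


# Shortened version of the sample record, for illustration only.
sample = "w9410 w305 c866 c28 __label__807273409165680991 __label__8175048003539471998"
tokens, labels = parse_line(sample)
print(tokens)
print(labels)
```

Using `split()` with no argument skips empty strings caused by repeated spaces, which is a slightly more forgiving choice than `split(" ")`.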

The vocabulary is built as follows:

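		import codecs
		from collections import Counter

		# 1. Load the raw data (training_data_path is assumed to be supplied by the caller).
		file_object = codecs.open(training_data_path, mode='r', encoding='utf-8')
		lines = file_object.readlines()
		# 2. Loop over each line and feed it into the counters.
		c_inputs = Counter()  # token frequencies
		c_labels = Counter()  # label frequencies
		for line in lines:
			# raw_list[0] holds the token text; raw_list[1:] hold the label ids.
			raw_list = line.strip().split("__label__")
			input_list = raw_list[0].strip().split(" ")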