Tokenization
def tokenizer(text):
    # split raw text into a list of token strings with spaCy
    return [tok.text for tok in spacy_en.tokenizer(text)]

# Note to self: the order here was wrong at first; the data should be imported and processed first, with loadCSV, and only then tokenized.
csvdata = loadCSV(r'F:\研一\NLP\数据集\ag_news_csv\train.csv')
csvdata1 = tokenizer(str(csvdata))  # quick first pass: tokenize the whole file dumped to one string
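The snippet above relies on a spacy_en object and a loadCSV helper that are not defined here. A minimal sketch of what they might look like; the en_core_web_sm model name and the csv-based reader are my assumptions, not part of the original notes:

import csv
import spacy

spacy_en = spacy.load('en_core_web_sm')  # assumed English spaCy model

def loadCSV(path):
    # hypothetical helper: read the AG News CSV and keep the text columns
    rows = []
    with open(path, encoding='utf-8') as f:
        for row in csv.reader(f):
            # each AG News row is (label, title, description); join the text fields
            rows.append(' '.join(row[1:]))
    return rows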
After this first pass at tokenization you can see that many punctuation marks are still in the output; they carry no useful information, so they need to be removed.
Removing punctuation
import string

table = str.maketrans('', '', string.punctuation)  # maps every punctuation character to None
tokens = [w.translate(table) for w in csvdata1]
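A quick check of what the translation table does to a few sample tokens (the sample list is made up for illustration):

import string

sample = ['Wall', 'St.', 'Bears', 'Claw', 'Back,', '-', '(Reuters)']
table = str.maketrans('', '', string.punctuation)
print([w.translate(table) for w in sample])  # ['Wall', 'St', 'Bears', 'Claw', 'Back', '', 'Reuters']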
There are still a lot of empty strings left over; they are of no help to the model, so they need to be removed as well.
Removing empty tokens
At first I was thinking about how to delete empty strings from a list directly, but this code handles it neatly by filtering on token length instead.
tokens = [word for word in tokens if len(word) > 1]  # keep only tokens longer than one character, which drops the empty strings
Converting special characters to ASCII
Problem: how can data like Marek Čech and Beniardá be converted into the corresponding ASCII form?
Solution:
import unicodedata
s = u"Marek Čech"  # the u prefix marks a unicode literal (required on Python 2, where omitting it raises an error)
line = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
print(line)  # b'Marek Cech'
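Applied to the whole token list from above, the same idea looks roughly like this (decoding back to str so the result stays a list of strings rather than bytes):

import unicodedata

def to_ascii(token):
    # decompose accented characters, drop the non-ASCII bytes, decode back to str
    return unicodedata.normalize('NFKD', token).encode('ascii', 'ignore').decode('ascii')

tokens = [to_ascii(word) for word in tokens]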
Building the vocabulary
To do: write this part out again myself.
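Since that part isn't written out here yet, below is a minimal sketch of one common way to build a word-to-index dictionary from the cleaned tokens; the special <pad>/<unk> entries and the frequency cutoff are my own choices, not from the original notes:

from collections import Counter

def build_vocab(tokens, min_freq=1):
    # count word frequencies and give every frequent word its own index
    counts = Counter(tokens)
    vocab = {'<pad>': 0, '<unk>': 1}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

word2idx = build_vocab(tokens)
ids = [word2idx.get(word, word2idx['<unk>']) for word in tokens]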
Using pretrained word vectors
When we trained our word2vec vectors, that training already did part of this work, including building a vocabulary. To use the vectors for our downstream task, we still need to copy the trained word-vector weights into the Embedding layer of the network.
# copy our trained word vectors into the network's Embedding layer
# (w2v_model is the gensim Word2Vec model, net is the PyTorch network; renamed here so the two models don't share one name)
pretrained_embeddings = torch.from_numpy(w2v_model.wv.vectors)
net.embedding.weight.data.copy_(pretrained_embeddings)
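Put together, the hand-off from gensim to PyTorch looks roughly like the sketch below; only the weight-copy line comes from the snippet above, and the layer/variable names, vector_size, and the freezing step are illustrative assumptions (vector_size is the gensim 4 keyword; older versions call it size):

import torch
import torch.nn as nn
from gensim.models import Word2Vec

# train (or load) a word2vec model on the tokenized corpus
w2v_model = Word2Vec(sentences=[tokens], vector_size=100, min_count=1)

# build an Embedding layer with the same shape as the word2vec weight matrix
num_words, dim = w2v_model.wv.vectors.shape
embedding = nn.Embedding(num_words, dim)

# copy the pretrained weights in; freeze them if they should not be fine-tuned
embedding.weight.data.copy_(torch.from_numpy(w2v_model.wv.vectors))
embedding.weight.requires_grad = False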
Preprocessing approach from a blog post
from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)