Tokenization
def tokenizer(text):
    # split raw text into a list of token strings with spaCy
    return [tok.text for tok in spacy_en.tokenizer(text)]

# Note to self: the order here was wrong at first; the data should be imported and processed first, with loadCSV, and only then tokenized.
csvdata = loadCSV(r'F:\研一\NLP\数据集\ag_news_csv\train.csv')
csvdata1 = tokenizer(str(csvdata))  # quick first pass: tokenize the whole file dumped to one string
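The snippet above relies on a spacy_en object and a loadCSV helper that are not defined here. A minimal sketch of what they might look like; the en_core_web_sm model name and the csv-based reader are my assumptions, not part of the original notes:

import csv
import spacy

spacy_en = spacy.load('en_core_web_sm')  # assumed English spaCy model

def loadCSV(path):
    # hypothetical helper: read the AG News CSV and keep the text columns
    rows = []
    with open(path, encoding='utf-8') as f:
        for row in csv.reader(f):
            # each AG News row is (label, title, description); join the text fields
            rows.append(' '.join(row[1:]))
    return rows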
After this first pass at tokenization you can see that many punctuation marks are still in the output; they carry no useful information, so they need to be removed.
Removing punctuation
import string

table = str.maketrans('', '', string.punctuation)  # maps every punctuation character to None
tokens = [w.translate(table) for w in csvdata1]
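A quick check of what the translation table does to a few sample tokens (the sample list is made up for illustration):

import string

sample = ['Wall', 'St.', 'Bears', 'Claw', 'Back,', '-', '(Reuters)']
table = str.maketrans('', '', string.punctuation)
print([w.translate(table) for w in sample])  # ['Wall', 'St', 'Bears', 'Claw', 'Back', '', 'Reuters']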
There are still a lot of empty strings left over; they are of no help to the model, so they need to be removed as well.
Removing empty tokens
At first I was thinking about how to delete empty strings from a list directly, but this code handles it neatly by filtering on token length instead.
tokens = [word for word in tokens if len(word) > 1]  # keep only tokens longer than one character, which drops the empty strings
Converting special characters to ASCII
Problem: how can data like Marek Čech and Beniardá be converted into the corresponding ASCII form?
Solution:
import unicodedata
s = u"Marek Čech"  # the u prefix marks a unicode literal (required on Python 2, where omitting it raises an error)
line = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
print(line)  # b'Marek Cech'
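Applied to the whole token list from above, the same idea looks roughly like this (decoding back to str so the result stays a list of strings rather than bytes):

import unicodedata

def to_ascii(token):
    # decompose accented characters, drop the non-ASCII bytes, decode back to str
    return unicodedata.normalize('NFKD', token).encode('ascii', 'ignore').decode('ascii')

tokens = [to_ascii(word) for word in tokens]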
Building the vocabulary
To do: write this part out again myself.
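Since that part isn't written out here yet, below is a minimal sketch of one common way to build a word-to-index dictionary from the cleaned tokens; the special <pad>/<unk> entries and the frequency cutoff are my own choices, not from the original notes:

from collections import Counter

def build_vocab(tokens, min_freq=1):
    # count word frequencies and give every frequent word its own index
    counts = Counter(tokens)
    vocab = {'<pad>': 0, '<unk>': 1}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

word2idx = build_vocab(tokens)
ids = [word2idx.get(word, word2idx['<unk>']) for word in tokens]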
Using pretrained word vectors
When we trained our word2vec vectors, that training already did part of this work, including building a vocabulary. To use the vectors for our downstream task, we still need to copy the trained word-vector weights into the Embedding layer of the network.
# copy our trained word vectors into the network's Embedding layer
# (w2v_model is the gensim Word2Vec model, net is the PyTorch network; renamed here so the two models don't share one name)
pretrained_embeddings = torch.from_numpy(w2v_model.wv.vectors)
net.embedding.weight.data.copy_(pretrained_embeddings)
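Put together, the hand-off from gensim to PyTorch looks roughly like the sketch below; only the weight-copy line comes from the snippet above, and the layer/variable names, vector_size, and the freezing step are illustrative assumptions (vector_size is the gensim 4 keyword; older versions call it size):

import torch
import torch.nn as nn
from gensim.models import Word2Vec

# train (or load) a word2vec model on the tokenized corpus
w2v_model = Word2Vec(sentences=[tokens], vector_size=100, min_count=1)

# build an Embedding layer with the same shape as the word2vec weight matrix
num_words, dim = w2v_model.wv.vectors.shape
embedding = nn.Embedding(num_words, dim)

# copy the pretrained weights in; freeze them if they should not be fine-tuned
embedding.weight.data.copy_(torch.from_numpy(w2v_model.wv.vectors))
embedding.weight.requires_grad = False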
Preprocessing approach from a blog post
from nltk.corpus import stopwords
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load the document
filename = 'txt_sentoken/pos/cv000_29590.txt'
text = load_doc(filename)
tokens = clean_doc(text)
print(tokens)