A Summary of NLP Preprocessing

Latest recommended article published on 2024-07-20 19:16:01

dreampai

Category: NLP Algorithms. Tags: nlp, natural language processing

Article link: https://blog.csdn.net/Jiassheng/article/details/119005420

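When working on NLP tasks (text classification, clustering, customer-service chatbots, and so on), the first job is to preprocess the text data. Based on my own practical experience, this post summarizes several preprocessing methods.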

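Removing useless symbols

Text may contain runs of consecutive symbols (for example "!!!") or other odd tokens. We split the text on these symbols and then join the pieces back together with single spaces.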

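import re

def tokenizer(ori_list):
    # Runs of whitespace and punctuation act as separators
    SYMBOLS = re.compile(r'[\s;",.!?/\[\]]+')
    new_list = []
    for q in ori_list:
        # Drop the empty pieces that appear when a string starts or ends with a symbol
        words = [w for w in SYMBOLS.split(q.lower().strip()) if w]
        new_list.append(' '.join(words))
    return new_list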
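
As a quick illustration of the split-and-rejoin idea, the same effect can be sketched with a single substitution. The name `normalize` and the sample strings are made up for this example, and the symbol set is an assumption that should be adjusted to the data at hand.

```python
import re

# Same symbol set as the tokenizer above (an assumption; extend it for your corpus)
SYMBOLS = re.compile(r'[\s;",.!?/\[\]]+')

def normalize(text):
    """Replace each run of symbols or whitespace with one space and trim the ends."""
    return SYMBOLS.sub(' ', text.lower()).strip()

samples = ["Hello!!!  World??", "What is NLP, exactly?"]
print([normalize(s) for s in samples])
# ['hello world', 'what is nlp exactly']
```

Replacing with a space and then stripping avoids the empty strings that `split` produces at the edges of a string.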

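Stop-word filtering

Many open-source stop-word lists are available online, and you can also build a domain-specific stop-word list for your own business. Alternatively, use the list that ships with NLTK directly.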

from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

def removeStopWord(ori_list):
    new_list = []
    # NLTK's stop words include "what" and similar words, but in QA tasks
    # these are key words, so they are not treated as stop words here
    restored = {'what', 'when', 'which', 'how', 'who', 'where'}
    english_stop_words = set(stopwords.words('english')) - restored
    for q in ori_list:
        # split() without arguments also handles repeated spaces
        sentence = ' '.join([w for w in q.split() if w not in english_stop_words])
        new_list.append(sentence)
    return new_list
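
For the domain-specific case, a minimal sketch might merge a base list with business-specific words and then subtract the words to keep. Everything here is illustrative: the three word sets are tiny hypothetical examples (a real base list would come from NLTK or an open-source collection), and `build_stop_words` and `remove_stop_words` are names invented for this sketch.

```python
# Tiny illustrative base list; in practice load one from NLTK or an open-source source
BASE_STOP_WORDS = {'the', 'a', 'an', 'is', 'of', 'to', 'what', 'how'}
# Hypothetical domain stop words for a customer-service corpus
DOMAIN_STOP_WORDS = {'please', 'hello', 'thanks'}
# Question words kept because they carry meaning in QA
KEEP = {'what', 'how'}

def build_stop_words(base, domain, keep):
    """Union of base and domain lists, minus the words we want to keep."""
    return (set(base) | set(domain)) - set(keep)

def remove_stop_words(sentences, stop_words):
    """Drop every whitespace-separated token that is in stop_words."""
    return [' '.join(w for w in s.split() if w not in stop_words) for s in sentences]

stop_words = build_stop_words(BASE_STOP_WORDS, DOMAIN_STOP_WORDS, KEEP)
print(remove_stop_words(['hello what is the price of shipping'], stop_words))
# ['what price shipping']
```

Keeping the stop words in a set makes each membership test constant time, which matters once the corpus grows.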

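Removing low-frequency words

To remove low-frequency words, we can set a threshold against a vocabulary of word counts, for example dropping words that occur fewer than 10 or 20 times.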

def removeLowFrequence(ori_list, vocabulary, thres=10):
    # vocabulary maps each word to its count; words seen fewer than thres times are dropped
    new_list = []
    for q in ori_list:
        sentence = ' '.join([w for w in q.split() if vocabulary.get(w, 0) >= thres])
        new_list.append(sentence)
    return new_list
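
The function above expects a `vocabulary` that maps words to counts, which the post does not show being built. One possible way, sketched here with `collections.Counter`, is to count whitespace-separated tokens over the whole corpus. The names `build_vocabulary` and `remove_low_frequency` and the three-sentence corpus are assumptions for illustration only.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Count how often each whitespace-separated token occurs in the corpus."""
    counts = Counter()
    for s in sentences:
        counts.update(s.split())
    return counts

def remove_low_frequency(sentences, vocabulary, thres=10):
    """Keep only tokens whose corpus count is at least thres."""
    return [' '.join(w for w in s.split() if vocabulary.get(w, 0) >= thres)
            for s in sentences]

corpus = ['reset my password', 'reset my router', 'my password expired']
vocab = build_vocabulary(corpus)
print(remove_low_frequency(corpus, vocab, thres=2))
# ['reset my password', 'reset my', 'my password']
```

The threshold of 2 is only sensible for this toy corpus; on real data a value such as 10 or 20, as mentioned above, is a more typical starting point.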

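Handling numbers

After tokenization, some tokens may be nothing but numbers, such as 44 or 415. We treat all of these numbers as one and the same word, which we can define as "#number".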

import re

def replaceDigits(ori_list, replace='#number'):
    # Replace every run of digits with `replace` (default: '#number')
    DIGITS = re.compile(r'\d+')
    new_list = []
    for q in ori_list:
        q = DIGITS.sub(replace, q)
        new_list.append(q)
    return new_list
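
Note that the pattern `\d+` matches digit runs anywhere, so a token such as "win10" becomes "win#number" and "3.14" becomes "#number.#number". If only tokens that are entirely numeric should be replaced, a variant along the following lines may be closer to the intent. This is a sketch under that assumption, not the original method; `replace_number_tokens` and the sample sentence are invented for the example.

```python
import re

# A token counts as a number if it is all digits, optionally with a decimal part
NUMBER_TOKEN = re.compile(r'\d+(?:\.\d+)?')

def replace_number_tokens(sentences, replace='#number'):
    """Replace whole numeric tokens only; mixed tokens like 'win10' are left alone."""
    result = []
    for s in sentences:
        tokens = [replace if NUMBER_TOKEN.fullmatch(t) else t for t in s.split()]
        result.append(' '.join(tokens))
    return result

print(replace_number_tokens(['order 415 costs 3.14 on win10']))
# ['order #number costs #number on win10']
```

Which behavior is preferable depends on the corpus: if product names or version strings matter, replacing whole tokens keeps them intact.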