I tried two ways of removing stop words, and ran into problems with both:
Method 1:
cachedStopWords = stopwords.words("english")
words_to_remove = """with some your just have from it's /via & that they your there this into providing would can't"""
remove = tu.removal_set(words_to_remove, query)
remove2 = tu.removal_set(cachedStopWords, query)
In this case only the first removal works; remove2 does not.
Method 2:
lines = tu.lines_cleanup([sentence for sentence in sentence_list], remove=remove)
words = '\n'.join(lines).split()
print(words)  # list of words
The output looks like this: ['Hello', 'Good', 'day']
I then tried to remove the stop words from words. Here is my code:
for word in words:
    if word in cachedStopWords:
        continue
    else:
        new_words = '\n'.join(word)
        print(new_words)
The output looks like this:
H
e
l
l
o
I can't figure out what is wrong with either approach. Please advise.
Solution:
Use this to extend the stop word list:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(len(stop_words))
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
print(len(stop_words))
Output:
179
184
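For the second approach, the one-character-per-line output comes from calling '\n'.join(word) on a single string: join iterates over the characters of that string, and new_words is overwritten on every pass through the loop. One possible fix, sketched below, is to collect the surviving words in a list and join once at the end. Since tu.removal_set is not shown in the question, the first approach is not reproduced; instead the custom string is split into tokens and merged with the stop word list into a single set. The short nltk_stop_words list is a hypothetical stand-in for stopwords.words("english") so the sketch runs without the NLTK corpus, and remove_stop_words is an illustrative helper name, not part of NLTK.

```python
# Hypothetical stand-in for stopwords.words("english"); a tiny subset so the
# sketch runs without downloading the NLTK corpus.
nltk_stop_words = ["i", "me", "my", "the", "a", "is", "this", "that", "from"]

# Custom words arrive as one whitespace-separated string, as in the question.
words_to_remove = """with some your just have from it's /via & that they your there this into providing would can't"""

# Split the string into tokens, then merge both sources into one set.
# A set also drops duplicates such as "from", which appears in both.
stop_set = set(words_to_remove.split()) | set(nltk_stop_words)


def remove_stop_words(words, stop_set):
    """Return the words not in stop_set (trimmed, compared in lower case)."""
    kept = []
    for word in words:
        token = word.strip()
        if token and token.lower() not in stop_set:
            kept.append(token)
    return kept


words = ["Hello", "this", "is", "a", "Good", "day"]
new_words = remove_stop_words(words, stop_set)

# Join the list of words once, rather than joining each word's characters.
print("\n".join(new_words))
```

Comparing in lower case is a design choice here, made because the NLTK English list is lower case; drop the lower() call if case-sensitive matching is wanted.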
Tags: nlp, nltk, stop-words, python
Source: https://codeday.me/bug/20191120/2044905.html