The Natural Language Toolkit (NLTK) is one of the most commonly used Python libraries in NLP.
1. Install nltk
pip install --upgrade nltk
2. Install nltk_data
import nltk
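nltk.download('punkt') # Punkt models used for English sentence splitting and word tokenization
nltk.download('stopwords') # English stop word list
I tried to download the data with the Python code above, but it kept failing with this error: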
[nltk_data] Error loading punkt: <urlopen error [Errno 8] nodename nor
[nltk_data] servname provided, or not known>
[nltk_data] Error loading stopwords: <urlopen error [Errno 8] nodename
[nltk_data] nor servname provided, or not known>
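On macOS, "[Errno 8] nodename nor servname provided, or not known" means the hostname could not be resolved, so the problem is DNS or network access rather than NLTK itself. A minimal sketch for checking this, using only the standard library: the helper name can_resolve is hypothetical, and the host below is an assumption about where the NLTK downloader fetches its index.

```python
import socket


def can_resolve(host: str) -> bool:
    """Return True if the hostname resolves to an address, False otherwise."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except socket.gaierror:
        # Same failure class that urlopen reports as "nodename nor servname ..."
        return False


if __name__ == "__main__":
    # Assumption: nltk.download() reads its package index from this host.
    print(can_resolve("raw.githubusercontent.com"))
```

If this prints False, nltk.download() will most likely keep failing until the network or proxy setup changes, and a manual download is the practical workaround.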
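In the end I downloaded the data manually from the nltk_data repository on GitHub, taking everything under packages.
After downloading, put the contents in a local folder. I used /Users/sunwenjun/anaconda3/envs/python310/nltk_data/.
Note that some of the archives need to be unzipped.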
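The unzip step can be scripted rather than done by hand. This is a rough sketch, not part of NLTK: unzip_all is a hypothetical helper that extracts every .zip file under a folder into the directory that contains it.

```python
import zipfile
from pathlib import Path


def unzip_all(root: str) -> list[str]:
    """Extract every .zip under root next to the archive; return the archives handled."""
    extracted = []
    for archive in sorted(Path(root).rglob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # e.g. tokenizers/punkt.zip is unpacked into tokenizers/
            zf.extractall(archive.parent)
        extracted.append(str(archive))
    return extracted


if __name__ == "__main__":
    # Assumption: the manually downloaded packages were copied to this folder.
    for name in unzip_all("/Users/sunwenjun/anaconda3/envs/python310/nltk_data"):
        print("unzipped", name)
```

The original archives are left in place, so running it a second time just overwrites the extracted files.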
# Check that NLTK can locate the manually installed data
from nltk.data import find
print(find('punkt')) # /Users/sunwenjun/anaconda3/envs/python310/nltk_data/punkt
print(find('tokenizers')) # /Users/sunwenjun/anaconda3/envs/python310/nltk_data/tokenizers
3. Using nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
input_string = 'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks'
# Tokenize
word_tokens = word_tokenize(input_string)
print(word_tokens) # ['Retrieval-Augmented', 'Generation', 'for', 'Knowledge-Intensive', 'NLP', 'Tasks']
# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_words = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_words) # ['Retrieval-Augmented', 'Generation', 'Knowledge-Intensive', 'NLP', 'Tasks']
# Stem each word
ps = PorterStemmer()
ps_words = [ps.stem(w) for w in filtered_words]
print(ps_words) # ['retrieval-aug', 'gener', 'knowledge-intens', 'nlp', 'task']
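4. Where nltk_data can be stored
When a resource cannot be found, NLTK raises a LookupError that lists every directory it searched. nltk_data can be placed in any of them: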
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load corpora/punkt
Searched in:
- '/Users/sunwenjun/nltk_data'
- '/Users/sunwenjun/anaconda3/envs/python310/nltk_data'
- '/Users/sunwenjun/anaconda3/envs/python310/share/nltk_data'
- '/Users/sunwenjun/anaconda3/envs/python310/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
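To see which of those directories actually holds a resource, a check like the one below can help. It is a simplified approximation, not NLTK's real lookup logic: locate_resource is a hypothetical helper that only tests whether the path exists on disk.

```python
import os
from typing import Optional


def locate_resource(relative_path: str, search_dirs: list[str]) -> Optional[str]:
    """Return the first directory in search_dirs that contains relative_path, else None."""
    for base in search_dirs:
        if os.path.exists(os.path.join(base, relative_path)):
            return base
    return None


if __name__ == "__main__":
    # A few of the directories from the error message above.
    dirs = [
        "/Users/sunwenjun/nltk_data",
        "/Users/sunwenjun/anaconda3/envs/python310/nltk_data",
        "/usr/share/nltk_data",
    ]
    print(locate_resource("tokenizers/punkt", dirs))
```

If the data has to live somewhere outside that list, NLTK also accepts extra directories through the nltk.data.path list (for example nltk.data.path.append(...)) or the NLTK_DATA environment variable.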