Fixing NLTK resource package installation problems

we19a0sen

Last modified 2024-08-10 03:01:33


Tags: python nlp

First published 2024-08-10 02:58:13

Article link: https://blog.csdn.net/m0_59145816/article/details/141072903


I. Problem background

The code below tokenizes an English sentence and tags each token with its part of speech. Running it as is:

# Sample code for English tokenization
import nltk
import re

content = "A roboticist at the Bristol Robotics Laboratory programmed a robot to save human proxies called 'H-bots' from danger."
content = re.sub(r'[^\w ]', '', content)  # strip punctuation, keep word characters and spaces
print(content)
print(nltk.word_tokenize(content))                # tokenize the English sentence
print(nltk.pos_tag(nltk.word_tokenize(content)))  # tag each token with its part of speech

produces this error:

LookupError: 
**********************************************************************
  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')
  
  For more information see: https://www.nltk.org/data.html

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - 'C:\\Users\\26943/nltk_data'
    - 'd:\\psw\\miniconda3\\envs\\py311\\nltk_data'
    - 'd:\\psw\\miniconda3\\envs\\py311\\share\\nltk_data'
    - 'd:\\psw\\miniconda3\\envs\\py311\\lib\\nltk_data'
    - 'C:\\Users\\26943\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************

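II. Solving the problem

1. Interpreting the error

This part says that the resource package averaged_perceptron_tagger could not be found and asks you to obtain it with the NLTK Downloader: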

  Resource averaged_perceptron_tagger not found.
  Please use the NLTK Downloader to obtain the resource:

This is how to obtain the resource with the NLTK Downloader:

  >>> import nltk
  >>> nltk.download('averaged_perceptron_tagger')

If that does not solve the problem, more information is available at the following address:

  For more information see: https://www.nltk.org/data.html

The rest of the message shows the file NLTK tried to load and the directories it searched:

  Attempted to load taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle

  Searched in:
    - 'C:\\Users\\26943/nltk_data'
    - 'd:\\psw\\miniconda3\\envs\\py311\\nltk_data'
    - 'd:\\psw\\miniconda3\\envs\\py311\\share\\nltk_data'
    - 'd:\\psw\\miniconda3\\envs\\py311\\lib\\nltk_data'
    - 'C:\\Users\\26943\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
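
The "Searched in" list can be checked by hand with a few lines of Python. The sketch below is a simplified illustration and not NLTK's actual lookup logic: it only tests whether a relative resource path exists under each candidate directory. The function name locate_resource and the shortened directory list are assumptions made for this example.

```python
from pathlib import Path
from typing import Iterable, Optional


def locate_resource(search_dirs: Iterable[str], relative_path: str) -> Optional[Path]:
    """Return the first existing copy of relative_path under search_dirs, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / relative_path
        if candidate.exists():
            return candidate
    return None


# Hypothetical directory list: two entries from the error message plus a project-local folder.
search_dirs = [r"C:\nltk_data", r"D:\nltk_data", "nltk_data"]
resource = "taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle"

# Prints the path of the first match, or None when the file is in none of the directories.
print(locate_resource(search_dirs, resource))
```

A result of None for every directory in the list corresponds to the LookupError shown above.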

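2. Steps to fix

The error message points to a page with more information: NLTK :: Installing NLTK Data (https://www.nltk.org/data.html)

That page describes three installation methods:

Interactive installer [failed for me because of network problems]
Command line installation [I did not try this]
Manual installation [the method used in this article]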

(1) Open the page and read the Manual installation section. I summarized the English text in the steps that follow it.

Manual installation

Create a folder nltk_data, e.g. C:\nltk_data, or /usr/local/share/nltk_data, and subfolders chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers.

Download individual packages from https://www.nltk.org/nltk_data/ (see the “download” links). Unzip them to the appropriate subfolder. For example, the Brown Corpus, found at: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip is to be unzipped to nltk_data/corpora/brown.

Set your NLTK_DATA environment variable to point to your top level nltk_data folder.

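① Create a folder named nltk_data to hold the resource packages. The location is up to you; I usually put it in the folder that contains my code.

② Inside nltk_data, create the subfolders chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, tokenizers to hold the individual packages. For example, the averaged_perceptron_tagger package belongs in the taggers subfolder.

③ Go to https://www.nltk.org/nltk_data/ to download the resource packages.

④ Download the packages you need, for example averaged_perceptron_tagger. Every package is distributed as a zip archive.

⑤ Unzip each package into its subfolder. For example, averaged_perceptron_tagger is unzipped into taggers.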
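
Steps ① and ② can also be scripted instead of creating the folders by hand. This is a minimal sketch using only the standard library; the subfolder names come from the Manual installation text quoted above, while the function name create_nltk_data_tree and the default location are assumptions for illustration.

```python
from pathlib import Path

# Subfolder names taken from the Manual installation instructions.
SUBFOLDERS = [
    "chunkers", "grammars", "misc", "sentiment", "taggers",
    "corpora", "help", "models", "stemmers", "tokenizers",
]


def create_nltk_data_tree(root: str) -> Path:
    """Create root and the standard subfolders; folders that already exist are left alone."""
    root_path = Path(root)
    for name in SUBFOLDERS:
        (root_path / name).mkdir(parents=True, exist_ok=True)
    return root_path


if __name__ == "__main__":
    # Creates nltk_data/ in the current working directory (an assumed location).
    create_nltk_data_tree("nltk_data")
```

Running it twice is harmless, because exist_ok=True skips folders that are already there.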

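(2) Open the package download page https://www.nltk.org/nltk_data/

(3) Press Ctrl + F to search the page (I use the Edge browser), for example for averaged_perceptron_tagger

(4) In the entry Averaged Perceptron Tagger [ download | source ], click download to get the zip archive

(5) Put the archive in nltk_data/taggers/ and unzip it there. Because I keep the data next to my code, my path is: <project path>/nltk_data/taggers/

(6) Tell the code explicitly where the resource packages are, as follows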
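
The unzip in step (5) can be done from Python as well. The sketch below assumes the downloaded archive contains a top-level folder named after the package, which is what the path in the error message (taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle) suggests. The function name unpack_package and the archive file name are hypothetical.

```python
import zipfile
from pathlib import Path


def unpack_package(zip_path: str, nltk_data_root: str, subfolder: str) -> Path:
    """Extract a downloaded package archive into nltk_data_root/subfolder."""
    target_dir = Path(nltk_data_root) / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return target_dir


if __name__ == "__main__":
    # Assumed file name of the archive downloaded in step (4).
    archive_name = "averaged_perceptron_tagger.zip"
    if Path(archive_name).exists():
        print(unpack_package(archive_name, "nltk_data", "taggers"))
```

After extraction it is worth confirming that the .pickle file sits at the path named in the error message, relative to the nltk_data folder.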

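# Sample code for English tokenization
import re

import nltk
from nltk import data  # gives access to the resource search path

# nltk.download()         # interactive installer
# nltk.download('punkt')  # download one named package
data.path.append('nltk_data/')  # add the local resource folder to the search path

content = "A roboticist at the Bristol Robotics Laboratory programmed a robot to save human proxies called 'H-bots' from danger."
content = re.sub(r'[^\w ]', '', content)  # strip punctuation, keep word characters and spaces
print(content)
print(nltk.word_tokenize(content))                # tokenize the English sentence
print(nltk.pos_tag(nltk.word_tokenize(content)))  # tag each token with its part of speech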
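
One caveat with the relative path 'nltk_data/': it is resolved against the current working directory, so the lookup can fail again when the script is started from a different folder. A possible workaround, sketched here with a hypothetical helper named project_data_dir, is to build an absolute path from the script's own location and append that instead.

```python
from pathlib import Path


def project_data_dir(script_file: str, folder_name: str = "nltk_data") -> str:
    """Return the absolute path of folder_name located next to script_file."""
    return str(Path(script_file).resolve().parent / folder_name)


# In a script the call would look like this (requires NLTK to be installed):
#   import nltk
#   nltk.data.path.append(project_data_dir(__file__))
```

This assumes the nltk_data folder sits in the same directory as the script, as in the setup described in this article.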