Installing Stanza (fixing the ConnectionError raised when Stanza cannot download language models)

Original article, last modified 2023-12-27 17:08:26


Tags: #language-models #artificial-intelligence #natural-language-processing

First published 2021-12-11 09:58:54


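Debug notes on installing Stanza

Fixing the ConnectionError raised when Stanza cannot download language models

Problem:

I followed the official documentation for the initial Stanza installation: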

pip install stanza
>>> import stanza
>>> stanza.download('en')
>>> nlp = stanza.Pipeline('en')

Running stanza.download('en') raised this error:

ConnectionError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/main/resources_1.3.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001801EE33610>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed'))

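Overall approach:

Download the missing files offline
Make sure the resource file and the models come from matching versions

Debug details:

The error is caused by a network problem: the host raw.githubusercontent.com cannot be resolved. If you are able to use a proxy or VPN, that should solve it directly.
My network environment does not allow that, so I downloaded the files offline and placed them in the expected folders.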
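
The `getaddrinfo failed` part of the traceback points at name resolution rather than at Stanza itself. Below is a minimal sketch for confirming that with the standard library only. The helper name `can_resolve` is hypothetical, and the host is the one from the error message.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the local resolver can look up `host`, else False."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror:
        # Name resolution failed, the kind of failure reported as
        # "getaddrinfo failed" in the traceback.
        return False
    return True


if __name__ == "__main__":
    # Host taken from the ConnectionError message.
    print(can_resolve("raw.githubusercontent.com"))
```

If this prints False, `stanza.download` will most likely keep failing on this network, and the offline route is the practical option.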

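The required resources file can be downloaded from GitHub; the repository is https://github.com/stanfordnlp/stanza-resources
I downloaded resources_1.3.0.json, renamed it to resources.json, and placed it under ~\stanza_resources\. If this file is missing, you may see an error like the following: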

ResourcesFileNotFoundError: Resources file not found at: C:\Users\gz927\stanza_resources\resources.json  Try to download the model again.
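
The rename-and-copy step can be sketched as follows. The helper names `resources_url` and `install_resources_file` are hypothetical; the URL pattern is copied from the ConnectionError message, and the target file name comes from the ResourcesFileNotFoundError message.

```python
import shutil
from pathlib import Path


def resources_url(version: str) -> str:
    """Build the raw GitHub URL, following the pattern in the ConnectionError."""
    return (
        "https://raw.githubusercontent.com/stanfordnlp/stanza-resources/"
        f"main/resources_{version}.json"
    )


def install_resources_file(downloaded: Path, resources_dir: Path) -> Path:
    """Copy a downloaded resources_<version>.json to <resources_dir>/resources.json."""
    resources_dir.mkdir(parents=True, exist_ok=True)
    target = resources_dir / "resources.json"
    shutil.copyfile(downloaded, target)
    return target
```

For example, after fetching the file from `resources_url("1.3.0")` on a machine with network access, calling `install_resources_file(Path("resources_1.3.0.json"), Path.home() / "stanza_resources")` would put it where the error message expects it. The download location here is an assumption; adjust it to wherever the file was saved.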

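With resources.json in place, you still need the matching language package. It is relatively large, so I downloaded it from Hugging Face: https://huggingface.co/stanfordnlp/stanza-en. After downloading, extract it and place it under ~\stanza_resources\en\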
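
Before loading the pipeline, it can help to check what actually ended up on disk. This is a minimal sketch, assuming the layout `en\<processor>\<package>.pt` that appears in the error paths in this post (for example `en\tokenize\combined.pt`). The helper name `list_models` is hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def list_models(lang_dir: Path) -> Dict[str, List[str]]:
    """Map each processor folder under `lang_dir` to the .pt packages it contains."""
    models: Dict[str, List[str]] = {}
    if not lang_dir.is_dir():
        return models
    for processor_dir in sorted(p for p in lang_dir.iterdir() if p.is_dir()):
        # The package name is the file name without the .pt suffix.
        packages = sorted(f.stem for f in processor_dir.glob("*.pt"))
        if packages:
            models[processor_dir.name] = packages
    return models


if __name__ == "__main__":
    print(list_models(Path.home() / "stanza_resources" / "en"))
```

Comparing the result with the Processor/Package table that Stanza prints when loading makes a missing model file easy to spot.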

============================

If, after installing resources.json and the language package, Stanza reports that a specific model is missing, with an error like the one below:

FileNotFoundError: Could not find model file C:\Users\gz927\stanza_resources\en\tokenize\combined.pt, although there are other models downloaded for language en.  Perhaps you need to download a specific model.  Try: stanza.download(lang="en",package=None,processors={"tokenize":"combined"})

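then the resources.json you downloaded and the language models do not come from matching versions. A blunt fix for the mismatch is to download the latest version of both. The language models on Hugging Face are the latest ones, so that is what I did: I replaced resources.json with the latest version, 1.3.1. After downloading, I tested again: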

>>> nlp = stanza.Pipeline('en')
2021-12-11 08:06:27 INFO: Loading these models for language: en (English):
============================
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |
============================

2021-12-11 08:06:27 INFO: Use device: cpu
2021-12-11 08:06:27 INFO: Loading: tokenize
2021-12-11 08:06:27 INFO: Loading: pos
2021-12-11 08:06:27 INFO: Loading: lemma
2021-12-11 08:06:27 INFO: Loading: depparse
2021-12-11 08:06:27 INFO: Loading: sentiment
2021-12-11 08:06:28 INFO: Loading: constituency
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-543633ac586b> in <module>
----> 1 nlp = stanza.Pipeline('en')
...
RuntimeError: Error(s) in loading state_dict for LSTMModel:
        size mismatch for tag_tensors: copying a param with shape torch.Size([50]) from checkpoint, the shape in current model is torch.Size([48]).
        size mismatch for tag_embedding.weight: copying a param with shape torch.Size([50, 20]) from checkpoint, the shape in current model is torch.Size([48, 20]).

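This output shows that the approach above is on the right track, but there is still an error, so I suspected that a version mismatch remained.
On the Files and versions tab of the Hugging Face repository, I clicked History on the right and found that the most recent change had updated the constituency model, which is exactly the model that fails to load here.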
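The page shows when the change was made, and that time does not match the time of resources.json on GitHub. My guess is that this update on Hugging Face had not been synchronized with the resources file.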

=====================

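Looking for other sources, I found a different download link in another blog post: http://nlp.stanford.edu/software/stanza/1.0.0/en/default.zip
I changed the part that looked like a version number and successfully downloaded the resource I needed:
http://nlp.stanford.edu/software/stanza/1.3.1/en/default.zip
After extracting this language model into ~\stanza_resources\en\ and testing again, the program finally ran normally: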

>>> nlp = stanza.Pipeline('en')
2021-12-11 08:26:49 INFO: Loading these models for language: en (English):
============================
| Processor    | Package   |
----------------------------
| tokenize     | combined  |
| pos          | combined  |
| lemma        | combined  |
| depparse     | combined  |
| sentiment    | sstplus   |
| constituency | wsj       |
| ner          | ontonotes |
============================

2021-12-11 08:26:49 INFO: Use device: cpu
2021-12-11 08:26:49 INFO: Loading: tokenize
2021-12-11 08:26:49 INFO: Loading: pos
2021-12-11 08:26:49 INFO: Loading: lemma
2021-12-11 08:26:49 INFO: Loading: depparse
2021-12-11 08:26:49 INFO: Loading: sentiment
2021-12-11 08:26:49 INFO: Loading: constituency
2021-12-11 08:26:50 INFO: Loading: ner
2021-12-11 08:26:50 INFO: Done loading processors!
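
The version-specific download and extraction used in the final step can be sketched as follows. The helper names `model_zip_url` and `extract_models` are hypothetical. The URL pattern is inferred from the two links in this post (1.0.0 and 1.3.1), so other versions may not be hosted at the same path, and the internal layout of the archive is not described in the post.

```python
import zipfile
from pathlib import Path
from typing import List


def model_zip_url(version: str, lang: str = "en") -> str:
    """Build the archive URL by substituting the version into the known link."""
    return f"http://nlp.stanford.edu/software/stanza/{version}/{lang}/default.zip"


def extract_models(zip_path: Path, lang_dir: Path) -> List[str]:
    """Extract a downloaded default.zip into `lang_dir` and return the member names."""
    lang_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(lang_dir)
        return archive.namelist()


if __name__ == "__main__":
    # Use the same version as the resources.json that is installed.
    print(model_zip_url("1.3.1"))
```

Check the returned member names after extraction: if the archive wraps everything in an extra top-level folder, the files need to be moved so that they sit directly under the language directory, as in `en\tokenize\combined.pt`.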