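Debug notes on installing Stanza
Handling the ConnectionError raised when Stanza cannot download a language model
Problem:
I did the initial Stanza installation by following the official documentation: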
pip install stanza
>>> import stanza
>>> stanza.download('en')
>>> nlp = stanza.Pipeline('en')
Running stanza.download('en') raised this error:
ConnectionError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/main/resources_1.3.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001801EE33610>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed'))
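Before changing anything, it can help to confirm that the failure really is name resolution, as the "getaddrinfo failed" part of the message suggests. The snippet below is a minimal sketch using only the standard library; the helper name can_resolve is made up for illustration and is not part of Stanza.

```python
import socket


def can_resolve(host: str, port: int = 443) -> bool:
    """Return True if the hostname can be resolved, False if lookup fails."""
    try:
        # The same kind of lookup that fails inside the ConnectionError above.
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    host = "raw.githubusercontent.com"
    print(host, "resolves:", can_resolve(host))
```

If this prints False on your machine, stanza.download() will keep failing in the same way, and the offline route described below is the practical option.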
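Overall approach:
- Download the missing files offline
- Make sure the resources file and the models are the same version
Debugging details:
The error is caused by a network problem ("getaddrinfo failed" means the hostname could not be resolved). If you are able to use a VPN or proxy, that alone should solve it.
My network environment does not allow that, so I chose to download the files offline and put them in the expected folder.
The required resources.json can be downloaded from GitHub. The repository is https://github.com/stanfordnlp/stanza-resources
The file I downloaded was resources_1.3.0.json. After downloading it, I renamed it to resources.json and placed it in the ~\stanza_resources\ directory. If this file is missing, you may see an error like the following: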
ResourcesFileNotFoundError: Resources file not found at: C:\Users\gz927\stanza_resources\resources.json Try to download the model again.
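The rename-and-place step can also be scripted. This is only a sketch: the function name install_resources_json is hypothetical, and the paths in the usage comment assume the file was saved to the current directory.

```python
import shutil
from pathlib import Path


def install_resources_json(downloaded: Path, resources_dir: Path) -> Path:
    """Copy a versioned resources file to <resources_dir>/resources.json."""
    resources_dir.mkdir(parents=True, exist_ok=True)
    target = resources_dir / "resources.json"
    shutil.copyfile(downloaded, target)
    return target


# Example usage (illustrative paths):
# install_resources_json(Path("resources_1.3.0.json"),
#                        Path.home() / "stanza_resources")
```

Copying rather than moving keeps the versioned file around, which makes it easier to tell later which version is installed.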
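With resources.json in place, you still need the matching language pack. The language pack is fairly large, so I downloaded it from Hugging Face: https://huggingface.co/stanfordnlp/stanza-en. After downloading, extract it and place it in the ~\stanza_resources\en\ directory.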
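A sketch of the extraction step, assuming the language pack is a zip archive whose top level contains one folder per processor (for example tokenize/combined.pt). The function name unpack_language_pack and the archive name in the usage comment are illustrative.

```python
import zipfile
from pathlib import Path


def unpack_language_pack(archive: Path, resources_dir: Path, lang: str = "en") -> Path:
    """Extract a language-pack zip into <resources_dir>/<lang>/."""
    target = resources_dir / lang
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # Assumption: the archive holds <processor>/<package>.pt entries.
        zf.extractall(target)
    return target


# Example usage (illustrative paths):
# unpack_language_pack(Path("default.zip"), Path.home() / "stanza_resources")
```

If the archive turns out to have an extra top-level folder, the files need to be moved up one level so that they sit directly under the language folder.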
============================
If, after installing resources.json and the language pack, you are told that a specific model is missing, with an error like this one:
FileNotFoundError: Could not find model file C:\Users\gz927\stanza_resources\en\tokenize\combined.pt, although there are other models downloaded for language en. Perhaps you need to download a specific model. Try: stanza.download(lang="en",package=None,processors={"tokenize":"combined"})
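then the resources.json you downloaded and the language models are not matching versions. A blunt way to fix the mismatch is to download the latest version of both. The language models on Hugging Face are the latest, so that is what I did: I replaced resources.json with the latest version, 1.3.1. After downloading, I tested again: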
>>> nlp = stanza.Pipeline('en')
2021-12-11 08:06:27 INFO: Loading these models for language: en (English):
============================
| Processor | Package |
----------------------------
| tokenize | combined |
| pos | combined |
| lemma | combined |
| depparse | combined |
| sentiment | sstplus |
| constituency | wsj |
| ner | ontonotes |
============================
2021-12-11 08:06:27 INFO: Use device: cpu
2021-12-11 08:06:27 INFO: Loading: tokenize
2021-12-11 08:06:27 INFO: Loading: pos
2021-12-11 08:06:27 INFO: Loading: lemma
2021-12-11 08:06:27 INFO: Loading: depparse
2021-12-11 08:06:27 INFO: Loading: sentiment
2021-12-11 08:06:28 INFO: Loading: constituency
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-3-543633ac586b> in <module>
----> 1 nlp = stanza.Pipeline('en')
...
RuntimeError: Error(s) in loading state_dict for LSTMModel:
size mismatch for tag_tensors: copying a param with shape torch.Size([50]) from checkpoint, the shape in current model is torch.Size([48]).
size mismatch for tag_embedding.weight: copying a param with shape torch.Size([50, 20]) from checkpoint, the shape in current model is torch.Size([48, 20]).
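This output shows that the approach above is on the right track, but there is still an error, so I suspected the versions still did not match.
On the Files and versions tab of the Hugging Face repository, I clicked History on the right and found that the most recent change updated the constituency model, which is exactly the model that failed to load.
The page shows when the change was made, and that time does not match the time of resources.json on GitHub. My guess is that this Hugging Face update had not been kept in sync with the resources file.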
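One way to test a suspicion like this without loading the pipeline is to compare checksums. The sketch below assumes that resources.json is laid out as language -> processor -> package -> {"md5": ...}; that layout is my assumption about the file and should be checked against your copy. The directory layout <lang>/<processor>/<package>.pt is taken from the FileNotFoundError shown earlier. The function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def md5_of(path: Path) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(resources_dir: Path, lang: str = "en") -> List[str]:
    """List local model files whose MD5 disagrees with resources.json."""
    text = (resources_dir / "resources.json").read_text(encoding="utf-8")
    resources = json.loads(text)
    problems = []
    for model in sorted((resources_dir / lang).glob("*/*.pt")):
        processor, package = model.parent.name, model.stem
        # Assumed layout: resources[lang][processor][package]["md5"]
        entry = resources.get(lang, {}).get(processor, {})
        info = entry.get(package) if isinstance(entry, dict) else None
        expected = info.get("md5") if isinstance(info, dict) else None
        if expected is None:
            problems.append(f"{processor}/{package}: not listed in resources.json")
        elif md5_of(model) != expected:
            problems.append(f"{processor}/{package}: md5 differs from resources.json")
    return problems


# Example usage (illustrative path):
# for line in find_mismatches(Path.home() / "stanza_resources"):
#     print(line)
```

An empty result would suggest the models and the resources file belong together. A line naming constituency/wsj would be consistent with the guess above.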
=====================
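I kept looking for other sources and found a different download link in other blog posts: http://nlp.stanford.edu/software/stanza/1.0.0/en/default.zip
I changed the part of the URL that looked like a version number and successfully downloaded the resource I needed: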
http://nlp.stanford.edu/software/stanza/1.3.1/en/default.zip
After extracting this language pack into the ~\stanza_resources\en\ directory and testing again, the program finally ran normally:
>>> nlp = stanza.Pipeline('en')
2021-12-11 08:26:49 INFO: Loading these models for language: en (English):
============================
| Processor | Package |
----------------------------
| tokenize | combined |
| pos | combined |
| lemma | combined |
| depparse | combined |
| sentiment | sstplus |
| constituency | wsj |
| ner | ontonotes |
============================
2021-12-11 08:26:49 INFO: Use device: cpu
2021-12-11 08:26:49 INFO: Loading: tokenize
2021-12-11 08:26:49 INFO: Loading: pos
2021-12-11 08:26:49 INFO: Loading: lemma
2021-12-11 08:26:49 INFO: Loading: depparse
2021-12-11 08:26:49 INFO: Loading: sentiment
2021-12-11 08:26:49 INFO: Loading: constituency
2021-12-11 08:26:50 INFO: Loading: ner
2021-12-11 08:26:50 INFO: Done loading processors!
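The two download links above differ only in the version segment, so the URL can be built from the version of resources.json you are using. This is a sketch of that observed pattern only: it is inferred from the two URLs in this post, not from any documented guarantee, and the function names are made up for illustration.

```python
import urllib.request
from pathlib import Path


def default_zip_url(version: str, lang: str = "en") -> str:
    """Build the download URL using the pattern seen in this post."""
    return f"http://nlp.stanford.edu/software/stanza/{version}/{lang}/default.zip"


def fetch_default_zip(version: str, dest: Path, lang: str = "en") -> Path:
    """Download the language pack to dest (needs access to nlp.stanford.edu)."""
    urllib.request.urlretrieve(default_zip_url(version, lang), str(dest))
    return dest


# Example usage (run on a machine that can reach the server):
# fetch_default_zip("1.3.1", Path("default.zip"))
```

Using the same version string for both resources.json and the language pack is what keeps the two in step.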