CpmTokenizer requires the SentencePiece library but it was not found in your environment.

苏云南雁

Published 2024-03-15 11:58:29


Tags: python, artificial intelligence

Article link: https://blog.csdn.net/qq_22059611/article/details/136735459


1. Analyzing the error message

The full error message:

ImportError:
CpmTokenizer requires the SentencePiece library but it was not found in your environment. Checkout the instructions on the installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones that match your environment. Please note that you may need to restart your runtime after installation.

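ImportError is the error the Python interpreter raises when importing a module fails. An error of this kind is rarely complicated; at most you need to sort out the dependencies between packages.

The key sentence is "CpmTokenizer requires the SentencePiece library but it was not found in your environment." A class depends on a library that is not installed. When a package is missing, either install it directly with pip install, or download it from its official site and install it yourself along with its dependencies.

"Checkout the instructions on the installation page of its repo: https://github.com/google/sentencepiece#installation and follow the ones that match your environment." means that the GitHub repository has installation instructions for different environments, and you should follow the ones that fit yours.

"Please note that you may need to restart your runtime after installation." is a reminder to restart the runtime once the library is installed.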
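Before installing anything, it can help to confirm which interpreter is running and whether it can actually see the library. The following is a minimal sketch using only the standard library; the helper name `is_module_available` is made up for this illustration and is not part of transformers or SentencePiece.

```python
import importlib.util
import sys


def is_module_available(name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module `name`."""
    # find_spec returns None when no importable module of that name is found.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Shows which Python is running; the library must be installed for this one.
    print("Interpreter:", sys.executable)
    for module in ("sentencepiece", "jieba"):
        status = "found" if is_module_available(module) else "missing"
        print(f"{module}: {status}")
```

If `sentencepiece` is reported as missing here, the ImportError above is expected for this interpreter.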

2. Installing the library

Installing SentencePiece solves the problem:

pip install SentencePiece

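In practice, PyCharm picks up newly installed packages automatically, so there is no need to restart the environment by hand there.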
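If the error persists after installation, one common cause is that pip installed the package into a different interpreter than the one running the code. A hedged sketch of one way around this is to invoke pip through the running interpreter; the function names below are hypothetical helpers written for this example.

```python
import subprocess
import sys
from typing import List


def pip_install_command(package: str) -> List[str]:
    """Build a pip command bound to the interpreter that runs this script."""
    # `python -m pip` guarantees pip targets the same environment as sys.executable.
    return [sys.executable, "-m", "pip", "install", package]


def install(package: str) -> None:
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package))
```

Calling `install("sentencepiece")` from the same environment that raised the error installs the package where that environment can find it. In a notebook runtime, restart the kernel afterwards, as the error message suggests.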

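3. Analyzing the dependency

SentencePiece is an unsupervised text tokenizer and detokenizer, mainly used in neural-network-based text generation systems where the vocabulary size is fixed before the model is trained. SentencePiece implements subword units (for example, byte-pair encoding (BPE) and the unigram language model) and can train a subword model directly from raw sentences. This makes it possible to build a purely end-to-end system that does not depend on language-specific pre-processing and post-processing.

Many tokenizer implementations call SentencePiece to process text, mainly for tokenization. The class docstring in the code says so directly: this pretrained tokenizer needs both jieba and SentencePiece.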
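To make "training a subword model directly from raw sentences" concrete, here is a toy byte-pair-encoding sketch in pure Python. It is not SentencePiece's implementation and is far simpler; the function names are invented for this example. Its one borrowed idea is that whitespace is treated as an ordinary symbol ("▁"), so no language-specific word splitting is needed first.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(seq: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` in `seq` with one symbol."""
    merged = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_bpe_merges(sentences: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from raw sentences (toy BPE)."""
    # Start from characters; spaces become the visible symbol "▁".
    sequences = [list(s.replace(" ", "▁")) for s in sentences]
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges
```

Each learned merge adds one new symbol to the vocabulary, which is why the vocabulary size can be fixed before the neural model is trained.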

Construct a CPM tokenizer. Based on [Jieba](https://pypi.org/project/jieba/) and
[SentencePiece](https://github.com/google/sentencepiece).

This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should
refer to this superclass for more information regarding those methods.
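The docstring explains why the error appears: the class depends on optional libraries and checks for them when it is used. The sketch below shows the general pattern of such a guard. It is a hypothetical illustration, not the actual transformers code; `require_backend` and `ToyCpmTokenizer` are names invented here.

```python
import importlib.util


def require_backend(owner: str, module: str) -> None:
    """Raise ImportError if `module` cannot be found (hypothetical helper)."""
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"{owner} requires the {module} library but it was not found "
            "in your environment."
        )


class ToyCpmTokenizer:
    """Stand-in class for illustration only; not the transformers implementation."""

    # Assumed dependency list, taken from the docstring quoted above.
    required_modules = ("jieba", "sentencepiece")

    def __init__(self) -> None:
        # Fail early with a readable message instead of a later, obscure error.
        for module in self.required_modules:
            require_backend(type(self).__name__, module)
```

With this pattern, constructing `ToyCpmTokenizer()` in an environment without `sentencepiece` fails immediately with a message that names the missing library.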

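4. Summary

Read the English message carefully, classify the problem, dig a little deeper, and work out how the pieces depend on each other.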
