Python 3: The Language Detection Tools langdetect and langid

nana-li

Category: Programming. Tags: language detection tools, langdetect, langid, python

Article link: https://blog.csdn.net/quiet_girl/article/details/79653037

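I. Preface

This post introduces two language detection tools, langdetect and langid, which identify the language a piece of text is written in. The material is collected from various online sources. Besides these two tools, some sources also suggest that an n-gram approach works well for this problem.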
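To make the n-gram idea a little more concrete, here is a minimal sketch of the general approach: build a character trigram profile per language and pick the profile most similar to the input. The two training sentences, the helper names (`char_ngrams`, `cosine`, `guess_language`) and the choice of cosine similarity are all assumptions made for illustration. This is a toy, not how langdetect or langid are implemented, and a real system would train on large corpora.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, with whitespace normalized and padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy training data: one sentence per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "a simple sentence in the english language",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "das ist ein einfacher satz in der deutschen sprache",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the language whose trigram profile is closest to the text."""
    query = char_ngrams(text)
    scores = {lang: cosine(query, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)


print(guess_language("this is the language of the dog"))
print(guess_language("das ist der hund und die sprache"))
```

With profiles this small the sketch only separates inputs that share vocabulary with the training sentences; it is meant to show the mechanism, not to be used as a detector.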

II. Environment

Python 3.6 (Anaconda)

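III. langdetect

Website: https://code.google.com/archive/p/language-detection (this is the original Java language-detection library; the Python langdetect package is a port of it)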

1. Installation

Install it with pip from a command prompt (if that does not work, look up how to install packages with pip):

pip install langdetect

2. Usage

Usage is straightforward: just call the functions directly, as in the code below:

from langdetect import detect, detect_langs

s1 = "本篇博客主要介绍两款语言探测工具，用于区分文本到底是什么语言，"
s2 = 'We are pleased to introduce today a new technology – Record Matching –that automatically finds relevant historical records for every family tree on MyHerit'
s3 = "Javigator：Java代码导读及分析管理工具的设计"

print(detect(s1))
print(detect(s2))
print(detect(s3))        # detect() returns the code of the most likely language
print(detect_langs(s3))  # detect_langs() returns the candidate languages with their probabilities

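The output is as follows:
Note: the language codes mainly follow the ISO 639-1 standard; see the ISO 639-1 entry on Baidu Baike for details.

zh-cn    # Chinese
en      # English
et     # Estonian
[et:0.7139002269697295, lt:0.1432406269337342, no:0.142858586700596]  # the candidate languages for the sentence and the probability estimated for each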

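3. Summary

As this simple example shows, s3 is actually the title of a Chinese paper, yet langdetect identified it as Estonian (et). For that reason I feel langdetect's accuracy is not very high, at least on short text that mixes Chinese and English.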
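For mixed strings like s3, one possible workaround is a simple script check before (or alongside) calling a detector. The sketch below is not part of langdetect; it is a hand-rolled heuristic that measures what fraction of the letters fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF). The function names and the 0.3 threshold are assumptions picked for illustration. Note that these ideographs are shared with Japanese kanji, so the check only says "substantially CJK", not "definitely Chinese".

```python
def cjk_ratio(text):
    """Fraction of alphabetic characters that are CJK Unified Ideographs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters)


def has_substantial_cjk(text, threshold=0.3):
    """True when at least `threshold` of the letters are CJK ideographs.

    The default threshold is an arbitrary illustrative value.
    """
    return cjk_ratio(text) >= threshold


s3 = "Javigator：Java代码导读及分析管理工具的设计"
print(round(cjk_ratio(s3), 2))
print(has_substantial_cjk(s3))
print(has_substantial_cjk("We are pleased to introduce a new technology"))
```

In s3, 14 of the 27 letters are ideographs, so the check flags the title even though its Latin prefix misleads the detector.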

IV. langid

Website: https://github.com/saffsd/langid.py