Using the Python language-detection modules langid and langdetect

Together_CZ

Last modified 2022-05-06 13:40:24


First published 2019-01-28 15:43:51

Article link: https://blog.csdn.net/Together_CZ/article/details/86678423


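Until now I have mostly used chardet, which detects the character encoding of data. The two modules covered here detect the natural language of a text instead, for example whether it is Chinese or English. Both are simple to use. I only tried them briefly because a project needed this, and I did not study them in depth. The repository link for each module is given in the docstrings below for anyone who wants to dig further: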

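import langid
from langid.langid import LanguageIdentifier, model
from langdetect import detect, detect_langs


def langidFunc():
    '''
    https://github.com/yishuihanhan/langid.py
    '''
    # classify() returns a (language code, score) tuple
    print(langid.classify("We Are Family"))
    print(langid.classify("Questa e una prova"))
    print(langid.classify("我们都有一个家"))
    # norm_probs=True rescales the score to a probability between 0 and 1
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    print(identifier.classify("We Are Family"))


def langdetectFunc():
    '''
    https://github.com/yishuihanhan/langdetect
    '''
    s1 = u"本篇博客主要介绍两款语言探测工具，用于区分文本到底是什么语言，"
    s2 = u'We are pleased to introduce today a new technology'
    print(detect(s1))
    print(detect(s2))
    # detect_langs() returns every detected language with its probability
    print(detect_langs(s2))
    print(detect_langs("Otec matka syn."))


if __name__ == "__main__":
    langidFunc()
    langdetectFunc()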

The results are as follows:


('en', 9.061840057373047)
('it', -35.41771221160889)
('zh', -85.79573845863342)
('en', 0.16946150595865334)
zh-cn
en
[en:0.999998109575]
[pl:0.571426592237, fi:0.428568772028]
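The first three langid scores above are not probabilities: as far as I understand, the default score is an unnormalized log-probability, which is why it can be negative or larger than 1, while `norm_probs=True` rescales it into the 0 to 1 range (the fourth line). A minimal sketch of that kind of normalization, assuming it is a softmax over the per-language log-scores; the function name `normalize_log_scores` and the numbers in `example_scores` are made up for illustration and are not langid's internal code or real output:

```python
import math


def normalize_log_scores(log_scores):
    """Turn per-language log-scores into probabilities that sum to 1 (softmax)."""
    # Subtract the largest score first so math.exp() cannot overflow.
    top = max(log_scores.values())
    weights = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Hypothetical log-scores for a single text; NOT real langid output.
example_scores = {"en": -10.0, "it": -11.0, "zh": -40.0}
probs = normalize_log_scores(example_scores)

# The most probable language and its normalized confidence.
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

The ranking of languages is the same before and after normalization, so the raw score is enough if you only need the most likely language. The normalized value is mainly useful when you want a confidence threshold.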