Text Language Detection: langid

1. langid

GitHub source: https://github.com/saffsd/langid.py

1.1 Features

(1) Fast
(2) Pre-trained over a large number of languages (currently 97)
(3) Not sensitive to domain-specific features (e.g. HTML/XML markup)
(4) Single .py file with minimal dependencies
(5) Deployable as a web service

1.2 Languages and Data

1.2.1 The 97 currently supported languages

af, am, an, ar, as, az, be, bg, bn, br, bs, ca, cs, cy, da, de, dz, el, en, eo, es, et, eu, fa, fi, fo, fr, ga, gl, gu, he, hi, hr, ht, hu, hy, id, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mk, ml, mn, mr, ms, mt, nb, ne, nl, nn, no, oc, or, pa, pl, ps, pt, qu, ro, ru, rw, se, si, sk, sl, sq, sr, sv, sw, ta, te, th, tl, tr, ug, uk, ur, vi, vo, wa, xh, zh, zu

1.2.2 Training data sources (five corpora)

  • JRC-Acquis
  • ClueWeb 09
  • Wikipedia
  • Reuters RCV2
  • Debian i18n

1.3 Usage

1.3.1 Running from the command line

langid.py supports a number of command-line options; the two used below are -n (normalize scores to probabilities) and -m (path to a custom model).


(1) With the langid library installed (pip install langid):
Run langid and type a test sentence at the prompt; each line you enter is classified interactively.

Note: in a result such as ('ru', -549.8846204280853), the negative value is not a probability but a raw log-probability score; run langid -n to have it normalized to a probability.

(2) Without the library installed:
Run python langid.py directly from a source checkout; the interactive interface is the same.
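
An interactive session then looks roughly like this (the scores shown here are illustrative; exact values depend on the model):

$ langid
>>> I do not speak english
('en', -54.4)

$ langid -n
>>> I do not speak english
('en', 0.999)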

1.3.2 Importing from Python

(1) Prediction

import langid
# Detect the language; returns a (language, confidence) tuple,
# where confidence is a raw log-probability
print("classify", "\t", langid.classify("I do not speak english"))
# Constrain the candidate languages; pass None to restore the default (all 97)
langid.set_languages(['it', 'ru'])
print("set_languages", "\t", langid.classify("I do not speak english"))
# Rank the candidate languages by confidence
print("rank", "\t", langid.rank("I do not speak english"))


(2) Prediction with probability normalization

On the command line this is just the -n flag, but from Python you need to instantiate your own LanguageIdentifier, as follows:

# Probability Normalization
from langid.langid import LanguageIdentifier, model
identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
print("Probability Normalization","\t",identifier.classify("I do not speak english"))

Note that normalization adds overhead: with all 97 languages enabled, classifying one short sentence went from 2.38 ms to 3.05 ms.
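
To measure the overhead on your own machine, a quick check with timeit works; the snippet below times 1000 classifications of a short sentence with and without normalization (absolute numbers will vary by hardware):

import timeit
import langid
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
sentence = "I do not speak english"

langid.classify(sentence)  # warm-up: forces the default model to load

raw = timeit.timeit(lambda: langid.classify(sentence), number=1000)
norm = timeit.timeit(lambda: identifier.classify(sentence), number=1000)
# total seconds for 1000 calls equals milliseconds per call
print("raw: %.2f ms/call, normalized: %.2f ms/call" % (raw, norm))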

1.4 Training a model

1.4.1 Overview

langid ships with training tools built around a Naive Bayes model. You can run train.py end to end, or follow the individual steps under the train directory, described below.
The information gain (IG) computation dominates training, accounting for over 90% of the total time.

1.4.2 Step by step

(1) Data preparation
1. Collect a monolingual corpus, one document per file.
2. Lay the files out as corpus/<domain>/<language>/<file>, for example:
			./corpus/domain1/en/File1.txt    or
			./corpus/domainX/en/001-file.xml
   (a toy setup is sketched just below)
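
As a toy illustration, this layout can be generated with a few lines of Python (the domains, languages, and file contents here are placeholders):

import os

samples = {
    ("domain1", "en"): "This is an English training document.",
    ("domain1", "fr"): "Ceci est un document en francais.",
}
for (domain, lang), text in samples.items():
    path = os.path.join("corpus", domain, lang)
    os.makedirs(path, exist_ok=True)  # corpus/<domain>/<language>/
    with open(os.path.join(path, "File1.txt"), "w", encoding="utf-8") as f:
        f.write(text)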
(2) Training
1. index.py

build a list of training documents

python index.py ./corpus

This will create a directory corpus.model and produce a list of paths to documents in the corpus, with their associated language and domain.

2. tokenize.py

tokenize the files using the default byte n-gram tokenizer

python tokenize.py corpus.model

This runs each file through the tokenizer, tabulating the frequency of each token according to language and domain. This information is distributed into buckets according to a hash of the token, such that all the counts for any given token will be in the same bucket.
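
Conceptually, byte n-gram tokenization and hash bucketing look like the sketch below (an illustration of the idea, not langid's actual tokenizer; max_order and num_buckets are illustrative parameters):

import zlib

def byte_ngrams(text, max_order=4):
    # Yield every byte n-gram (orders 1..max_order) of the UTF-8 encoded text
    data = text.encode("utf-8")
    for n in range(1, max_order + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]

def bucket_of(token, num_buckets=64):
    # A stable hash guarantees all counts for a given token land in the same bucket
    return zlib.crc32(token) % num_buckets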

3. DFfeatureselect.py

identify the most frequent tokens by document frequency

python DFfeatureselect.py corpus.model

This sums up the frequency counts per token in each bucket, and produces a list of the highest-df tokens for use in the IG calculation stage. Note that this implementation of DFfeatureselect assumes byte n-gram tokenization, and will thus select a fixed number of features per ngram order. If tokenization is replaced with a word-based tokenizer, this should be replaced accordingly.
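
In essence, DF selection counts how many documents each token occurs in and keeps the top-k per n-gram order; a toy version (function and parameter names are illustrative):

from collections import Counter

def top_df_features(doc_token_sets, k_per_order=100):
    # doc_token_sets: one set of byte n-gram tokens per training document
    df = Counter()
    for tokens in doc_token_sets:
        df.update(tokens)  # sets, so each token counts at most once per document
    by_order = {}
    for tok, count in df.items():
        by_order.setdefault(len(tok), []).append((count, tok))
    selected = []
    for order in sorted(by_order):
        best = sorted(by_order[order], reverse=True)[:k_per_order]
        selected.extend(tok for _, tok in best)
    return selected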

4. IGweight.py

compute the IG weights of each of the top features by DF. This is computed separately for domain and for language

python IGweight.py -d corpus.model # domain
python IGweight.py -lb corpus.model  # language
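
For intuition, the IG weight of a token is the drop in label entropy when the documents are split by whether they contain that token, computed once with language labels and once with domain labels. A toy formulation (langid's implementation is vectorized and differs in detail):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(doc_token_sets, labels, token):
    # labels: one language (or domain) label per document
    with_t = [lab for doc, lab in zip(doc_token_sets, labels) if token in doc]
    without_t = [lab for doc, lab in zip(doc_token_sets, labels) if token not in doc]
    n = len(labels)
    conditional = (len(with_t) / n) * entropy(with_t) \
                + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional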
5. LDfeatureselect.py

Based on the IG weights, we compute the LD score for each token:

python LDfeatureselect.py corpus.model

This produces the final list of LD features to use for building the NB model.
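
The LD score rewards tokens that are informative about language but uninformative about domain; per the cross-domain feature selection approach behind langid (Lui & Baldwin), it is essentially the difference between the two IG weights from the previous step, something like:

def ld_scores(ig_lang, ig_domain):
    # ig_lang / ig_domain: {token: IG weight} from IGweight.py (names illustrative)
    return {t: ig_lang[t] - ig_domain[t] for t in ig_lang}

The highest-scoring tokens then become the final LD feature set.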

6. scanner.py

assemble the scanner

python scanner.py corpus.model

The scanner is a compiled DFA over the set of features that can be used to count the number of times each of the features occurs in a document in a single pass over the document. This DFA is built using Aho-Corasick string matching.
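
To make that concrete, here is a compact Aho-Corasick matcher that counts every occurrence of a feature set in a single pass over the text (a conceptual sketch, not the DFA langid actually compiles):

from collections import deque, defaultdict

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [[]]  # trie transitions, failure links, outputs
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())  # breadth-first over the trie
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]  # inherit matches ending at the fail state
    return goto, fail, out

def count_features(text, patterns):
    goto, fail, out = build_automaton(patterns)
    counts, node = defaultdict(int), 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:  # every feature ending at this position
            counts[pat] += 1
    return counts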

7. NBtrain.py

learn the actual Naive Bayes parameters

python NBtrain.py corpus.model
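
The resulting parameters are just class priors and per-class token probabilities; at prediction time the scanner's feature counts are combined in log space, roughly like this toy decision rule (langid's real implementation is a vectorized matrix product):

import math

def nb_classify(feature_counts, priors, cond_prob):
    # feature_counts: {token: count} from a single scanner pass over the document
    # priors: {lang: P(lang)};  cond_prob: {lang: {token: P(token | lang)}}
    best_lang, best_score = None, float("-inf")
    for lang, prior in priors.items():
        score = math.log(prior)
        for tok, cnt in feature_counts.items():
            score += cnt * math.log(cond_prob[lang].get(tok, 1e-10))  # crude smoothing floor
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang, best_score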
(3) Using the new model
The newly trained model is written to:
	./corpus.model/model
To use it, either:
	1. run python langid.py -m ./corpus.model/model, or
	2. replace the embedded model string in langid.py with the new one.
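
From Python, the retrained model can also be loaded directly; langid.langid provides from_modelpath alongside the from_modelstring used earlier:

from langid.langid import LanguageIdentifier

identifier = LanguageIdentifier.from_modelpath("./corpus.model/model", norm_probs=True)
print(identifier.classify("I do not speak english"))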