Identifying Which Language a Text Is Written In

SAN_YUN

Category: nltk. Tags: artificial intelligence, python

Article link: https://blog.csdn.net/SAN_YUN/article/details/84477019

Original: http://blog.youxu.info/2007/11/08/guess-language-of-text/

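The ASPN Python Cookbook describes a program that uses the zlib library to identify which language a text is written in. Its core code is under 20 lines, and from what I have observed its accuracy is no lower than 95%. I modified it slightly and used the UN Universal Declaration of Human Rights as the corpus. Now even a Javanese article grabbed at random from Wikipedia is identified correctly almost every time.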

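import zlib


def _to_bytes(text):
    # zlib.compress() needs bytes in Python 3; encode str input as UTF-8.
    return text if isinstance(text, bytes) else str(text).encode("utf-8")


class Entropy:
    def __init__(self):
        # One (name, corpus bytes, compressed corpus length) tuple per corpus.
        self.entro = []

    def register(self, name, corpus):
        """
        Register a text as corpus for a language or author.
        <name> may also be a function or whatever you need
        to handle the result.
        """
        corpus = _to_bytes(corpus)
        ziplen = len(zlib.compress(corpus))
        print(name, ziplen)
        self.entro.append((name, corpus, ziplen))

    def guess(self, part):
        """
        <part> is a text that will be compared with the registered
        corpora and the function will return what you defined as
        <name> in the registration process.
        """
        what = None
        diff = None  # None means "no candidate yet"; 0 is a valid difference
        part = _to_bytes(part)

        for name, corpus, ziplen in self.entro:
            # Extra compressed bytes needed to encode <part> after <corpus>.
            nz = len(zlib.compress(corpus + part)) - ziplen
            if diff is None or nz < diff:
                what = name
                diff = nz
        return what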
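
The same idea can be tried out in a few lines. The following is a minimal, self-contained sketch rather than the Cookbook recipe itself: the helper names `extra_bytes` and `guess_language` and the `CORPORA` dictionary are made up for illustration, and the two toy corpora are just Article 1 of the declaration (wording quoted from memory), far smaller than the corpus files the post uses.

```python
import zlib


def extra_bytes(corpus: bytes, part: bytes) -> int:
    """Compressed size of corpus+part minus compressed size of corpus alone."""
    return len(zlib.compress(corpus + part)) - len(zlib.compress(corpus))


def guess_language(corpora: dict, text: str) -> str:
    """Return the name of the corpus after which `text` compresses best."""
    part = text.encode("utf-8")
    return min(corpora, key=lambda name: extra_bytes(corpora[name], part))


# Hypothetical toy corpora (assumption): one article per language.
CORPORA = {
    "english": (
        "All human beings are born free and equal in dignity and rights. "
        "They are endowed with reason and conscience and should act "
        "towards one another in a spirit of brotherhood."
    ).encode("utf-8"),
    "german": (
        "Alle Menschen sind frei und gleich an Würde und Rechten geboren. "
        "Sie sind mit Vernunft und Gewissen begabt und sollen einander "
        "im Geiste der Brüderlichkeit begegnen."
    ).encode("utf-8"),
}

if __name__ == "__main__":
    sample = "should act towards one another in a spirit of brotherhood"
    print(guess_language(CORPORA, sample))
```

With corpora this small the guess is only dependable for text that overlaps heavily with a corpus. Note also that zlib's DEFLATE window is 32 KiB, so for a long corpus only its tail can be referenced when the appended text is compressed.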

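I am posting the code first; when I have time I will go into the clever uses of language models and information theory in more detail. Simple, compact models that solve seemingly unsolvable problems are the essence of artificial intelligence.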
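
As a rough reading of what the code computes (the notation is introduced here and is not the author's): write $C(x)$ for the zlib-compressed length of $x$, $T_i$ for the registered corpus of language $i$, $s$ for the text to classify, and $\Vert$ for concatenation. The guess is then

```latex
\hat{\imath} = \arg\min_i \bigl[\, C(T_i \,\Vert\, s) - C(T_i) \,\bigr]
```

The bracketed difference is a crude estimate of how many bytes $s$ costs once the compressor has already seen $T_i$, so a smaller value suggests that $s$ resembles $T_i$ more closely.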

[Download all files as a single archive (including 10 MB of corpus source files). The code itself is really only about 50 lines.]
