Detecting Chinese (and multiple languages) (character set: Unicode; encoding: UTF-8)

qq_30620793

Category: python  Tags: python

Article link: https://blog.csdn.net/qq_30620793/article/details/119955202

References:

Detecting multiple languages in Python: https://www.jianshu.com/p/c3ad8934ec38

Unicode block ranges: https://unicode-table.com/cn/blocks/cjk-compatibility/

------[[[[[[[[[[------Detecting Chinese (the range actually covers CJK characters)----------------------------------------------------------------

langen.lua file:

langen_source={
AAA=[[DDDD]],
BBB=[[aa.a啊1]],
}

Python code:

# -*- coding: UTF-8 -*-
# PEP 8 asks for two blank lines above each top-level function definition

import sys

# Name of the file to scan
fileName = ''

if len(sys.argv) > 1:
    fileName = sys.argv[1]


# Read the entries of the Lua file and convert them to a dict
def readLuaConvertDict(url):
    dic = {}
    with open(url, 'r', encoding='utf-8') as file:
        for line in file:
            # Drop the trailing newline
            line = line.strip('\n')
            # Only lines containing '=' can be entries
            if '=' not in line:
                continue
            # Split on the first '=' into key and value
            key, value = line.split('=', 1)
            key = key.strip()
            # Skip the table header, e.g. "langen_source={" or "langcn_source={"
            if key.endswith('_source'):
                continue
            # The value keeps its Lua delimiters, e.g. "[[DDDD]],"
            dic[key] = value.strip()
    return dic


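# Check whether a string contains Chinese (CJK) characters
def __is_chinese(string):
    """
    Check whether the string contains at least one CJK character.
    [\u4e00-\u9fa5] common Chinese characters (Han ideographs)
    [\u2E80-\u9FFF] Han ideographs plus CJK radicals, CJK punctuation and Japanese kana
    (Korean Hangul syllables, U+AC00-U+D7AF, lie outside this range)
    :param string: the string to check
    :return: bool
    """
    for ch in string:
        if '\u2E80' <= ch <= '\u9FFF':
            return True
    return False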


if __name__ == '__main__':
    # With no argument, fall back to the default file [langen.lua]
    if fileName == '':
        fileName = 'langen.lua'
    print(fileName)

    # Entries that contain Chinese characters
    chineseRecode = {}
    data = readLuaConvertDict(fileName)
    for key, value in data.items():
        if __is_chinese(value):
            chineseRecode[key] = value
    print("Total entries checked:", len(data))
    print("File [", fileName, "] entries containing Chinese:", len(chineseRecode))
    print("Entries containing Chinese:", chineseRecode)
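
The script above splits each line on the first `=`, so the values it stores still carry the Lua delimiters (`[[`, `]]` and the trailing comma). As a possible alternative, here is a minimal sketch that extracts only the text between the brackets with a regular expression. The names `ENTRY_PATTERN` and `parse_lua_entries` are hypothetical, and the pattern assumes every entry sits on one line in the `KEY=[[value]],` shape of the sample file.

```python
import re

# Assumed entry shape, taken from the sample file: KEY=[[value]],
ENTRY_PATTERN = re.compile(r'^\s*(\w+)\s*=\s*\[\[(.*?)\]\]\s*,?\s*$')


def parse_lua_entries(lines):
    """Return a dict of key -> value for every line matching ENTRY_PATTERN."""
    entries = {}
    for line in lines:
        match = ENTRY_PATTERN.match(line)
        if match:
            # group(1) is the key, group(2) the text between [[ and ]]
            entries[match.group(1)] = match.group(2)
    return entries


if __name__ == '__main__':
    sample = [
        'langen_source={',
        'AAA=[[DDDD]],',
        'BBB=[[aa.a啊1]],',
        '}',
    ]
    print(parse_lua_entries(sample))
```

Because the table header `langen_source={` has no `[[...]]` part, it is skipped without a separate name check. Multi-line Lua strings would not be matched by this sketch.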

--------------------------------------------------------]]]]]]------------------
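
The check in the script uses the range U+2E80 to U+9FFF, which is wider than Chinese characters alone: it also contains Japanese kana and CJK punctuation, while Korean Hangul syllables (U+AC00 to U+D7AF) fall outside it. The sketch below compares that wide range with a narrower one limited to the CJK Unified Ideographs block (U+4E00 to U+9FFF). The function names are hypothetical and chosen only for this illustration.

```python
def contains_in_range(text, low, high):
    """Return True if any character of text lies in [low, high]."""
    return any(low <= ch <= high for ch in text)


def contains_cjk_wide(text):
    # Same range as the script: U+2E80..U+9FFF
    return contains_in_range(text, '\u2e80', '\u9fff')


def contains_han(text):
    # Narrower range: CJK Unified Ideographs block, U+4E00..U+9FFF
    return contains_in_range(text, '\u4e00', '\u9fff')


if __name__ == '__main__':
    # Latin + Han, Latin only, hiragana, CJK punctuation, Hangul
    samples = ['aa.a啊1', 'DDDD', 'ひらがな', '、', '한국어']
    for s in samples:
        print(repr(s), contains_cjk_wide(s), contains_han(s))
```

If only Han ideographs should count as a hit, the narrower check avoids flagging an entry that merely contains kana or a full-width comma.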

-----------------------[[[[[[[[[---------Detecting multiple languages----------------------
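
This section is left empty in the original post. One way to extend the range check to several languages, following the Unicode block reference listed at the top, is to map each character to a script by its code point. The sketch below is only an illustration under that assumption: the table `SCRIPT_RANGES` and the functions `script_of` and `detect_scripts` are hypothetical, and the table covers just a few blocks. Note that it identifies scripts, not languages, since Han ideographs are shared by Chinese, Japanese and Korean text.

```python
# Hypothetical table: (script name, first code point, last code point)
SCRIPT_RANGES = [
    ('Latin', 0x0041, 0x005A),     # A-Z
    ('Latin', 0x0061, 0x007A),     # a-z
    ('Cyrillic', 0x0400, 0x04FF),
    ('Arabic', 0x0600, 0x06FF),
    ('Thai', 0x0E00, 0x0E7F),
    ('Hiragana', 0x3040, 0x309F),
    ('Katakana', 0x30A0, 0x30FF),
    ('Han', 0x4E00, 0x9FFF),       # CJK Unified Ideographs
    ('Hangul', 0xAC00, 0xD7AF),    # Hangul Syllables
]


def script_of(ch):
    """Return the script name for one character, or None if it is not in the table."""
    code = ord(ch)
    for name, first, last in SCRIPT_RANGES:
        if first <= code <= last:
            return name
    return None


def detect_scripts(text):
    """Return the set of script names found in text."""
    found = set()
    for ch in text:
        name = script_of(ch)
        if name is not None:
            found.add(name)
    return found


if __name__ == '__main__':
    for s in ['aa.a啊1', 'こんにちは', '한국어', 'Привет']:
        print(repr(s), detect_scripts(s))
```

Digits and punctuation are not in the table, so they contribute nothing to the result.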

-----------------------]]]]]]]]]]]]]-----------------------------------------------