NLP From Scratch: Classifying Surnames with a char-RNN

Article link: https://blog.csdn.net/qq_29678299/article/details/103046625

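This article shows how to build a natural language processing model from scratch, using a character-level recurrent neural network (char-RNN) to classify surnames by language. Working through it explains how an RNN operates and how to apply one to text classification tasks.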


from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os

# Return every path that matches the given glob pattern
def findFiles(path):
    return glob.glob(path)

print(findFiles('data/data/names/*.txt'))
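The glob pattern, and the step that turns a file name into a category name, can be tried without the dataset on disk. The sketch below is only an illustration: it creates a temporary directory containing hypothetical empty files (Chinese.txt, English.txt, notes.md) and shows that only the .txt files match. The file names are assumptions, not the real dataset.

```python
import glob
import os
import tempfile

def findFiles(path):
    # Same helper as in the article: list every path matching the pattern
    return glob.glob(path)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the per-language name files
    for name in ('Chinese.txt', 'English.txt', 'notes.md'):
        open(os.path.join(tmp, name), 'w').close()

    paths = findFiles(os.path.join(tmp, '*.txt'))
    # Strip the directory and the extension to get the category name;
    # glob does not guarantee an order, so sort for a stable result
    categories = sorted(os.path.splitext(os.path.basename(p))[0] for p in paths)

print(categories)  # ['Chinese', 'English']
```

Because glob returns paths in arbitrary order, the order of categories in a real run may differ between systems unless the result is sorted.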

import unicodedata
import string

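all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
# 57 characters in total (52 ASCII letters plus " .,;'").
# Without an embedding layer, each character is a 57-dimensional one-hot vector.
print("n_letters:", n_letters)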
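To make the 57-dimensional one-hot encoding concrete, here is a minimal plain-Python sketch. The helper names letter_to_index, letter_to_one_hot and name_to_one_hot are hypothetical, and plain lists are used here instead of tensors purely to keep the example dependency-free.

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)  # 57

def letter_to_index(letter):
    # Position of the letter in all_letters; reject anything outside the alphabet
    idx = all_letters.find(letter)
    if idx < 0:
        raise ValueError("character not in all_letters: %r" % letter)
    return idx

def letter_to_one_hot(letter):
    # A vector of n_letters zeros with a single 1 at the letter's index
    vec = [0] * n_letters
    vec[letter_to_index(letter)] = 1
    return vec

def name_to_one_hot(name):
    # One one-hot vector per character, in order
    return [letter_to_one_hot(ch) for ch in name]

encoded = name_to_one_hot("Bao")
print(len(encoded), len(encoded[0]))  # 3 57
print(encoded[0].index(1))            # 27, the index of 'B'
```

string.ascii_letters lists the lowercase letters first, so 'B' lands at index 26 + 1 = 27.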

# Convert a Unicode string to plain ASCII: decompose accented characters (NFD),
# drop the combining marks (category 'Mn'), and keep only characters in all_letters
def unicodeToassii(s):
    # Join without a separator so each name stays a single string
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )


"""
The data is organized into the following format.
18 categories:
all_categories: ['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German', 'Greek',
'Irish', 'Italian', 'Japanese', 'Korean', 'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish', 'Vietnamese']
category_lines: {'Arabic': ['Khoury', 'Nahas', 'Daher', 'Gerges', 'Nazari', 'Maalouf', 'Gerges',
'Naifeh', 'Guirguis', 'Baba', 'Sabbagh', 'Attia', 'Tahan', 'Haddad', 'Aswad', 'Najjar', ....],
 ....: [ ]}
"""
# Create the dictionary (category -> list of names) and the list of categories
category_lines = {}
all_categories = []

# Read a file and split it into lines.
# Each line holds one surname; the surnames are returned as a list of ASCII strings.
def readLines(filename):
    lines=open(filename,encoding='utf-8').read().strip().split('\n')
    return [unicodeToassii(line) for line in lines]
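The normalization applied to each line can be checked in isolation. The following is a small stand-alone sketch: to_ascii is a hypothetical helper that joins the kept characters without a separator (an assumption about the intended behavior), and sample_text is a made-up in-memory stand-in for the contents of a names file.

```python
import string
import unicodedata

# Same alphabet as in the article: 52 ASCII letters plus " .,;'"
all_letters = string.ascii_letters + " .,;'"

def to_ascii(s):
    # NFD splits an accented letter into a base letter plus combining marks;
    # dropping category 'Mn' removes the marks and keeps the base letter
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

# Hypothetical file contents, one surname per line
sample_text = "Ślusàrski\nO'Néàl\n"
lines = [to_ascii(line) for line in sample_text.strip().split('\n')]
print(lines)  # ['Slusarski', "O'Neal"]
```

The apostrophe survives because it is part of all_letters, while any character outside that alphabet is silently dropped.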

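# Read every file
for filename in findFiles('data/data/names/*.txt'):
    # Derive the language name from the file name, e.g. 'Chinese' from 'Chinese.txt'
    category=os.path.splitext(os.path.basename(filename))[0]
    # Append it to the list of language names, e.g. ['Chinese', ...]
    all_categories.append(category)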
    lines=rea