Python不掉包初探自然语言处理One-Hot编码与解码

最新推荐文章于 2024-01-04 21:51:17 发布

GarveyPython

最新推荐文章于 2024-01-04 21:51:17 发布

阅读量1.4k

点赞数 1

本文链接：https://blog.csdn.net/m0_48946024/article/details/123408171

版权

python 自然语言处理开发语言

背景导入：

在⾃然语⾔处理中，算法⽆法直接处理字符⽂本。通常将每个词表示为⼀个 One-hot 向量，句⼦便可以表示为⼀个矩阵，然后就可以对⽂本进⾏计算。

机器学习数据预处理1：独热编码（One-Hot）及其代码_梦Dancing的博客-CSDN博客_onehot编码1. 为什么使用 one-hot 编码？问题：在机器学习算法中，我们经常会遇到分类特征，例如：人的性别有男女，祖国有中国，美国，法国等。这些特征值并不是连续的，而是离散的，无序的。目的：如果要作为机器学习算法的输入，通常我们需要对其进行特征数字化。什么是特征数字化呢？例如：性别特征：["男"，"女"] ...

https://blog.csdn.net/qq_15192373/article/details/89552498?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522164690792716780271965340%2522%252C%2522scm%2522%253A%252220140713.130102334..%2522%257D&request_id=164690792716780271965340&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2~all~top_positive~default-1-89552498.article_score_rank&utm_term=onehot&spm=1018.2226.3001.4187

实现步骤：

⽂本预处理：

全部转换为⼩写

去除特殊符号

连续多个空⽩符号处理为 1 个

标点符号与词汇分开

缩写的处理

it's 处理为 it 's

i've 处理为 i 've

don't 处理为 do n't

i'll 处理为 i 'll

i'd 处理为 i 'd

构建词典：根据⽂本数据统计出⼀个词典，为每个词编号；

编码：根据词典，将⽂本切分为句⼦，每个句⼦利⽤ One-hot 表示为⼀个矩阵；

解码：根据词典，将 One-hot 序列转换为⽂本句⼦。

实现代码：

txt1='''Hi! This is Wang.
Hello! It's Sun speaking.
Do you have free time this evening?
Uh... Let's me see. What's the matter?
I hope you can see a new movie Super Hero with me. I've been wanting to see it for a long time.
Sorry, I didn't catch that. Could you say the name again?
Super Hero.
Oh! Super Hero. I'm very interested in it. I haven't seen it. I can go with you. When the movie start?
It begin at seven o'clock. Let's gather at the cinema at ten to seven.
Ok, I'll arrive there on time. Goodbye!
Bye.'''
txt2="Hi! Wang. I've arrive the cinema. Where are you?"

#自定义文本预处理函数
def txtpre():
    global txt#txt为全局变量
    txt=txt.lower()
    #变特殊字符为空格
    for ch in "!,.?":    
        txt=txt.replace(ch," ")
    #对于缩写，通过空格按要求分隔
    txt=txt.replace("it's","it 's")
    txt=txt.replace("i've","i 've")
    txt=txt.replace("don't","do n't")
    txt=txt.replace("i'll","i 'll")
    txt=txt.replace("i'd","i 'd")

#将txt1赋给txt，作预处理
txt=txt1
txtpre()
list1=list(txt.split())#将字符串按空格分隔转换成列表
dict1=dict.fromkeys(list1,0)#创建一个新字典，默认键对应的值为0

#记录词频到字典的值当中，避免重复词(键)
for x in list1:
    dict1[x]+=1

#将词(键)组合成列表，并添加"UNKNOWN"为列表的最后一个元素
key_lst=[]
for k in range(len(dict1)):
    key_lst=list(dict1.keys())
key_lst.append("UNKNOWN")

char_to_int = dict((c, i) for i, c in enumerate(key_lst))#词转化为编号需要用到的数据类型
int_to_char = dict((i, c) for i, c in enumerate(key_lst))#编号转化为词需要用到的数据类型
print(int_to_char)#以“编号 词”的方式输出词典

#将txt2赋给txt，作预处理
txt=txt2
txtpre()
list2=list(txt.split())

integer_encoded = []#编号组合成的整数矩阵
#依次检索表2元素，如果在key_lst中，加编号到整数矩阵中，否则加"UNKONWN"的编号到矩阵中
for char in list2:
    if (char in key_lst):
        integer_encoded.append(char_to_int[char])
    else:
        integer_encoded.append(key_lst.index("UNKNOWN"))       

#构成onehot形式并输出，编码的过程
onehot_encoded =[]
for value in integer_encoded:
       letter = [0 for _ in range(len(key_lst))]
       letter[value] = 1
       onehot_encoded.append(letter)
print(onehot_encoded)

#解码，并输出由词组合成的列表
list_decode=[]
for i in range(len(onehot_encoded)):
    decode = int_to_char[integer_encoded[onehot_encoded.index(onehot_encoded[i])]]#逐步往回推
    list_decode.append(decode)
print(list_decode)

运行结果：

{0: 'hi', 1: 'this', 2: 'is', 3: 'wang', 4: 'hello', 5: 'it', 6: "'s", 7: 'sun', 8: 'speaking', 9: 'do', 10: 'you', 11: 'have', 12: 'free', 13: 'time', 14: 'evening', 15: 'uh', 16: "let's", 17: 'me', 18: 'see', 19: "what's", 20: 'the', 21: 'matter', 22: 'i', 23: 'hope', 24: 'can', 25: 'a', 26: 'new', 27: 'movie', 28: 'super', 29: 'hero', 30: 'with', 31: "'ve", 32: 'been', 33: 'wanting', 34: 'to', 35: 'for', 36: 'long', 37: 'sorry', 38: "didn't", 39: 'catch', 40: 'that', 41: 'could', 42: 'say', 43: 'name', 44: 'again', 45: 'oh', 46: "i'm", 47: 'very', 48: 'interested', 49: 'in', 50: "haven't", 51: 'seen', 52: 'go', 53: 'when', 54: 'start', 55: 'begin', 56: 'at', 57: 'seven', 58: "o'clock", 59: 'gather', 60: 'cinema', 61: 'ten', 62: 'ok', 63: "'ll", 64: 'arrive', 65: 'there', 66: 'on', 67: 'goodbye', 68: 'bye', 69: 'UNKNOWN'}
[[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
['hi', 'wang', 'i', "'ve", 'arrive', 'the', 'cinema', 'UNKNOWN', 'UNKNOWN', 'you']

GarveyPython

关注

1
点赞
踩
2

收藏

觉得还不错? 一键收藏
打赏
0
评论
Python不掉包初探自然语言处理One-Hot编码与解码

背景导入：在⾃然语⾔处理中，算法⽆法直接处理字符⽂本。通常将每个词表示为⼀个One-hot向量，句⼦便可以表示为⼀个矩阵，然后就可以对⽂本进⾏计算。机器学习数据预处理1：独热编码（One-Hot）及其代码_梦Dancing的博客-CSDN博客_onehot编码1. 为什么使用 one-hot 编码？问题：在机器学习算法中，我们经常会遇到分类特征，例如：人的性别有男女，祖国有中国，美国，法国等。这些特征值并不是连续的，而是离散的，无序的。目的：如果要作为机器学习算法的输入，通常我们需要对其进行特
复制链接

扫一扫