NLP Self-Study Notes 01: Classical Methods for Word Representation (Based on TensorFlow)

布比与迈克大炮

Article link: https://blog.csdn.net/bubid/article/details/108428247

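This article introduces several classical methods for representing words in natural language processing: WordNet, one-hot encoding, TF-IDF, and the co-occurrence matrix. WordNet is a database of lexical information that represents words through synonym relations. One-hot encoding is a simple word representation, TF-IDF scores the importance of a word from its term frequency and inverse document frequency, and the co-occurrence matrix encodes the context in which words appear.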

WordNet

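WordNet is one of the most popular classical resources for representing words. It is an external lexical database that encodes, for a given word, its definitions, synonyms, ancestors (more general terms), derived forms, and similar information, so many facts about the word can be looked up from it.
WordNet organizes words by part of speech (noun, verb, adjective, adverb) and encodes the relationships among them. It was created in the Department of Psychology at Princeton University. WordNet uses synonymy between words to evaluate how they relate to one another. Wordnets are available for many languages: http://globalwordnet.org/resources/wordnets-in-the-world/.
WordNet uses a synset to denote a group of synonyms. Each synset has a definition that explains what the synset represents. The synonyms contained in a synset are called lemmas.
A given synset can have hypernyms (more general synsets) and hyponyms (more specific synsets). For example, a hypernym of "apple" is "fruit", and a hyponym is "red apple".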

import nltk
from nltk.corpus import wordnet as wn

nltk.download('wordnet')

# shows all the available synsets
word = 'car'
car_syns = wn.synsets(word)
print('All the available Synsets for ', word)
print('\t', car_syns,'\n')

# The definitions of the first three synsets
syns_defs = [syn.definition() for syn in car_syns]
print('Example definitions of available Synsets ...')
for i in range(3):
    print('\t',car_syns[i].name(),': ',syns_defs[i])
print('\n')

# Get the lemmas for the first Synset
# (index 0 is used explicitly; the leftover loop variable i pointed at car.n.03)
print('Example lemmas for the Synset ', car_syns[0].name())
car_lemmas = car_syns[0].lemmas()[:3]
print('\t', [lemma.name() for lemma in car_lemmas], '\n')

# Let us get hypernyms for a Synset (general superclass)
syn = car_syns[0]
print('Hypernyms of the Synset ', syn.name())
print('\t', syn.hypernyms()[0].name(),'\n')

# Let us get hyponyms for a Synset (specific subclass)
syn = car_syns[0]
print('Hyponyms of the Synset ', syn.name())
print('\t', [hypo.name() for hypo in syn.hyponyms()[:3]], '\n')

# Let us get part-holonyms for a Synset (the whole that this synset is a part of)
# also there is another holonym category called "substance-holonyms"
syn = car_syns[2]
print('Holonyms (Part) of the Synset ',syn.name())
print('\t', [holo.name() for holo in syn.part_holonyms()], '\n')

# Let us get part-meronyms for a Synset (the parts that make up this synset)
# also there is another meronym category called "substance-meronyms"
syn = car_syns[0]
print('Meronyms (Part) of the Synset ', syn.name())
print('\t', [mero.name() for mero in syn.part_meronyms()[:3]], '\n')

# Wu-Palmer similarity between the first synset of each word
word1, word2, word3 = 'car', 'lorry', 'tree'
w1_syns, w2_syns, w3_syns = wn.synsets(word1), wn.synsets(word2), wn.synsets(word3)

print('Word Similarity (%s)<->(%s): ' % (word1, word2), wn.wup_similarity(w1_syns[0], w2_syns[0]))
print('Word Similarity (%s)<->(%s): ' % (word1, word3), wn.wup_similarity(w1_syns[0], w3_syns[0]))

Output:

All the available Synsets for  car
	 [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')] 

Example definitions of available Synsets ...
	 car.n.01 :  a motor vehicle with four wheels; usually propelled by an internal combustion engine
	 car.n.02 :  a wheeled vehicle adapted to the rails of railroad
	 car.n.03 :  the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant


Example lemmas for the Synset  car.n.01
	 ['car', 'auto', 'automobile'] 

Hypernyms of the Synset  car.n.01
	 motor_vehicle.n.01 

Hyponyms of the Synset  car.n.01
	 ['ambulance.n.01', 'beach_wagon.n.01', 'bus.n.04'] 

Holonyms (Part) of the Synset  car.n.03
	 ['airship.n.01'] 

Meronyms (Part) of the Synset  car.n.01
	 ['accelerator.n.01', 'air_bag.n.01', 'auto_accessory.n.01'] 

Word Similarity (car)<->(lorry):  0.6956521739130435
Word Similarity (car)<->(tree):  0.38095238095238093
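
The similarity scores above come from `wup_similarity`, the Wu-Palmer measure: twice the depth of the deepest common ancestor of two synsets, divided by the sum of their depths. The sketch below works through that formula on a tiny hand-made taxonomy. The `PARENT` table and the helper names are made up for illustration. They are not WordNet data and not NLTK's implementation, which also has to deal with synsets that have several hypernym paths.

```python
# Toy taxonomy (hypothetical): each node maps to its single parent.
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "plant": "entity",
    "motor_vehicle": "vehicle",
    "car": "motor_vehicle",
    "truck": "motor_vehicle",
    "tree": "plant",
}


def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    # In this sketch the root has depth 1.
    return len(path_to_root(node))


def wup_similarity(a, b):
    ancestors_a = set(path_to_root(a))
    # Walking upward from b, the first node shared with a is the
    # deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup_similarity("car", "truck"))  # share "motor_vehicle", deep in the tree
print(wup_similarity("car", "tree"))   # share only the root "entity"
```

In this toy tree, "car" and "truck" meet at a deep node and score higher than "car" and "tree", which meet only at the root. This is the same pattern as the car/lorry and car/tree scores above.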

One-Hot Encoding

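A simpler way to represent words is one-hot encoding. Given a vocabulary of size $V$, the $i$-th word $w_i$ is represented by a vector of length $V$, $[0, 0, 0, \ldots, 0, 1, 0, \ldots, 0, 0]$, in which the $i$-th element is 1 and all other elements are 0. For example, take the sentence 我们是朋友 ("We are friends") and treat each character as a token. The one-hot representations are:
我: $[1, 0, 0, 0, 0]$
们: $[0, 1, 0, 0, 0]$
是: $[0, 0, 1, 0, 0]$
朋: $[0, 0, 0, 1, 0]$
友: $[0, 0, 0, 0, 1]$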
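
The encoding can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming each character is one token and that indices are assigned in order of first appearance. The function names `build_vocab` and `one_hot` are chosen here for illustration.

```python
def build_vocab(tokens):
    """Map each distinct token to an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a list of length V with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


tokens = list("我们是朋友")  # character-level tokens, as in the example
vocab = build_vocab(tokens)
vectors = {tok: one_hot(tok, vocab) for tok in tokens}

for tok, vec in vectors.items():
    print(tok, vec)
```

The dot product of any two different one-hot vectors is 0, so this representation carries no notion of similarity between words, and the vector length grows with the vocabulary size.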

TF-IDF

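TF-IDF is a frequency-based method that takes into account how often a word occurs in the corpus. The more often a word appears in a document, the more important it is to that document. At the same time, very common words such as 的 and 了 need to be discounted, with their scores pushed to 0. The IDF term does this for any word that appears in every document.
TF stands for term frequency:
$TF(w_i) = \text{number of times } w_i \text{ appears in the document} / \text{total number of words in the document}$
IDF stands for inverse document frequency:
$IDF(w_i) = \log(\text{total number of documents} / \text{number of documents containing } w_i)$
$TF\text{-}IDF(w_i) = TF(w_i) \times IDF(w_i)$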
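For example:

Document 1: 这是一只小狗。 ("This is a puppy.")
Document 2: 这是一只小猫。 ("This is a kitten.")
Counting each of the six characters as a word (punctuation excluded) and using a base-10 logarithm:
$TF\text{-}IDF(猫, \text{Document 2}) = (1/6) \times \log_{10}(2/1) \approx 0.05$
$TF\text{-}IDF(这, \text{Document 2}) = (1/6) \times \log_{10}(2/2) = 0$
The word 猫 ("cat") is therefore more informative than the word 这 ("this").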
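
The two values in the example can be checked with a short script. This is a minimal sketch, assuming character-level tokens with the punctuation dropped and a base-10 logarithm, which is the base that yields 0.05. The names `tf`, `idf`, and `tf_idf` are chosen here for illustration.

```python
import math

# The two example documents, split into characters (punctuation omitted).
docs = {
    "doc1": list("这是一只小狗"),
    "doc2": list("这是一只小猫"),
}


def tf(word, tokens):
    """Term frequency: occurrences of the word / total tokens in the document."""
    return tokens.count(word) / len(tokens)


def idf(word, docs):
    """Inverse document frequency with a base-10 log.

    Assumes the word occurs in at least one document.
    """
    n_containing = sum(1 for tokens in docs.values() if word in tokens)
    return math.log10(len(docs) / n_containing)


def tf_idf(word, doc_id, docs):
    return tf(word, docs[doc_id]) * idf(word, docs)


print(round(tf_idf("猫", "doc2", docs), 2))  # 猫 appears only in doc2
print(tf_idf("这", "doc2", docs))            # 这 appears in both documents
```

A word that occurs in every document gets an IDF of $\log_{10}(1) = 0$, so its TF-IDF score is 0 no matter how often it appears.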