WordNet
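WordNet is one of the most popular classical approaches to representing words. It is an external lexical database that encodes, for a given word, information such as its definition, synonyms, ancestors, and derived words, so this information can be looked up for any word it covers.
WordNet was created at the Department of Psychology of Princeton University in the United States. It organizes words by part of speech (nouns, verbs, adjectives, adverbs) and models the relationships between words in terms of synonymy. Wordnets are available for many languages: http://globalwordnet.org/resources/wordnets-in-the-world/.
WordNet uses a synset to represent a group of synonyms. Every synset has a definition that explains what the synset means. The synonyms contained in a synset are called lemmas.
A given synset has hypernyms (more general concepts) and hyponyms (more specific concepts). For example, a hypernym of "apple" is "fruit", and a hyponym is "red apple".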
import nltk
from nltk.corpus import wordnet as wn

# Download the WordNet corpus (only needed once)
nltk.download('wordnet')

# Show all the available synsets for a word
word = 'car'
car_syns = wn.synsets(word)
print('All the available Synsets for', word)
print('\t', car_syns, '\n')

# Print the definitions of the first three synsets
syns_defs = [s.definition() for s in car_syns]
print('Example definitions of available Synsets ...')
for i in range(3):
    print('\t', car_syns[i].name(), ':', syns_defs[i])
print('\n')

# Get the first three lemmas of the first synset
syn = car_syns[0]
print('Example lemmas for the Synset', syn.name())
car_lemmas = syn.lemmas()[:3]
print('\t', [lemma.name() for lemma in car_lemmas], '\n')

# Get the hypernyms of a synset (its more general superclass)
print('Hypernyms of the Synset', syn.name())
print('\t', syn.hypernyms()[0].name(), '\n')

# Get the hyponyms of a synset (its more specific subclasses)
print('Hyponyms of the Synset', syn.name())
print('\t', [hypo.name() for hypo in syn.hyponyms()[:3]], '\n')

# Get the part-holonyms of a synset (the wholes it is a part of);
# there is another holonym category called "substance-holonyms"
syn = car_syns[2]
print('Holonyms (Part) of the Synset', syn.name())
print('\t', [holo.name() for holo in syn.part_holonyms()], '\n')

# Get the part-meronyms of a synset (the parts it is made of);
# there is another meronym category called "substance-meronyms"
syn = car_syns[0]
print('Meronyms (Part) of the Synset', syn.name())
print('\t', [mero.name() for mero in syn.part_meronyms()[:3]], '\n')

# Compare words using the Wu-Palmer similarity of their first synsets
word1, word2, word3 = 'car', 'lorry', 'tree'
w1_syns, w2_syns, w3_syns = wn.synsets(word1), wn.synsets(word2), wn.synsets(word3)
print('Word Similarity (%s)<->(%s): ' % (word1, word2), wn.wup_similarity(w1_syns[0], w2_syns[0]))
print('Word Similarity (%s)<->(%s): ' % (word1, word3), wn.wup_similarity(w1_syns[0], w3_syns[0]))

Output:

All the available Synsets for car
    [Synset('car.n.01'), Synset('car.n.02'), Synset('car.n.03'), Synset('car.n.04'), Synset('cable_car.n.01')]
Example definitions of available Synsets ...
    car.n.01 : a motor vehicle with four wheels; usually propelled by an internal combustion engine
    car.n.02 : a wheeled vehicle adapted to the rails of railroad
    car.n.03 : the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
Example lemmas for the Synset car.n.01
    ['car', 'auto', 'automobile']
Hypernyms of the Synset car.n.01
    motor_vehicle.n.01
Hyponyms of the Synset car.n.01
    ['ambulance.n.01', 'beach_wagon.n.01', 'bus.n.04']
Holonyms (Part) of the Synset car.n.03
    ['airship.n.01']
Meronyms (Part) of the Synset car.n.01
    ['accelerator.n.01', 'air_bag.n.01', 'auto_accessory.n.01']
Word Similarity (car)<->(lorry): 0.6956521739130435
Word Similarity (car)<->(tree): 0.38095238095238093
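
The two similarity scores come from `wn.wup_similarity`, the Wu-Palmer measure, which scores two synsets by how deep their lowest common ancestor sits in the hypernym hierarchy: twice the depth of that ancestor divided by the sum of the depths of the two synsets. The sketch below illustrates the idea on a tiny hand-written taxonomy. The `PARENT` table and all helper names are hypothetical, each node has a single parent, and the root is assumed to have depth 1, so the numbers will not match those WordNet produces.

```python
# Hypothetical toy taxonomy: each node maps to its single parent (None for the root).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "motor_vehicle": "vehicle",
    "car": "motor_vehicle",
    "truck": "motor_vehicle",
    "plant": "entity",
    "tree": "plant",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1 (an assumption of this sketch)."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the deepest one.
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("car", "truck"))  # shares the deep ancestor "motor_vehicle"
print(wup("car", "tree"))   # shares only the root "entity"
```

As in the WordNet output, the pair whose common ancestor lies deeper in the hierarchy (car and truck here) scores higher than the pair that only meets at the root.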
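One-hot encoding
A simpler way to represent words is one-hot encoding. Given a vocabulary of size $V$, the $i$-th word $w_i$ is represented by a vector of length $V$, $[0,0,0,\ldots,0,1,0,\ldots,0,0]$, in which the $i$-th element is 1 and all other elements are 0. For example, take the sentence 我们是朋友 ("We are friends") and treat each character as a word. The one-hot representation of each word is: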
- 我: $[1,0,0,0,0]$
- 们: $[0,1,0,0,0]$
- 是: $[0,0,1,0,0]$
- 朋: $[0,0,0,1,0]$
- 友: $[0,0,0,0,1]$
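
The encoding can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the function name `one_hot_vectors` is made up for this example, and the vocabulary is assumed to be the distinct tokens in order of first appearance.

```python
def one_hot_vectors(tokens):
    """Map each distinct token to a one-hot list of length V (the vocabulary size)."""
    # dict.fromkeys keeps the first occurrence of each token, in order.
    vocab = list(dict.fromkeys(tokens))
    vectors = {}
    for i, token in enumerate(vocab):
        vec = [0] * len(vocab)
        vec[i] = 1  # only the token's own position is set to 1
        vectors[token] = vec
    return vectors


# Treat every character of the sentence as one word.
vectors = one_hot_vectors(list("我们是朋友"))
for token, vec in vectors.items():
    print(token, vec)
```

Any two distinct one-hot vectors have a dot product of 0, so this encoding says nothing about how similar two words are.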
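TF-IDF
TF-IDF is a frequency-based method that takes into account how often words occur in the corpus. The more often a word occurs in a document, the more important it is to that document. At the same time, very common words such as 的 and 了, which appear in almost every document, must be discounted, so that their weight is pushed to 0.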
TF stands for term frequency:

$$\mathrm{TF}(w_i) = \frac{\text{number of times } w_i \text{ appears in the document}}{\text{total number of words in the document}}$$

IDF stands for inverse document frequency:

$$\mathrm{IDF}(w_i) = \log\left(\frac{\text{total number of documents}}{\text{number of documents containing } w_i}\right)$$

The TF-IDF score is the product of the two:

$$\text{TF-IDF}(w_i) = \mathrm{TF}(w_i) \times \mathrm{IDF}(w_i)$$
For example:
- Document 1: 这是一只小狗。 ("This is a puppy.")
- Document 2: 这是一只小猫。 ("This is a kitten.")

Treating each character as a word (six per document, ignoring punctuation) and using a base-10 logarithm:

$$\text{TF-IDF}(猫, \text{Document 2}) = \frac{1}{6} \times \log_{10}\frac{2}{1} \approx 0.05$$

$$\text{TF-IDF}(这, \text{Document 2}) = \frac{1}{6} \times \log_{10}\frac{2}{2} = 0$$

The word 猫 therefore carries more information than the word 这.
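
The same calculation can be reproduced with a short script. This is only a sketch under a few assumptions: every character counts as one word, punctuation is dropped, the logarithm is base 10 (which is what yields roughly 0.05 for 猫), and the names `docs`, `tf`, `idf`, and `tf_idf` are chosen for this example.

```python
import math

# The two example documents, split into characters (punctuation dropped).
docs = {
    "doc1": list("这是一只小狗"),
    "doc2": list("这是一只小猫"),
}


def tf(term, tokens):
    """Term frequency: occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def idf(term, documents):
    """Inverse document frequency with a base-10 logarithm."""
    containing = sum(1 for tokens in documents.values() if term in tokens)
    if containing == 0:
        raise ValueError("term does not occur in any document")
    return math.log10(len(documents) / containing)


def tf_idf(term, doc_name, documents):
    """TF-IDF score of a term within one named document."""
    return tf(term, documents[doc_name]) * idf(term, documents)


print(round(tf_idf("猫", "doc2", docs), 2))  # 0.05
print(tf_idf("这", "doc2", docs))            # 0.0
```

A word that occurs in every document, such as 这, always has an IDF of 0, so its score is 0 no matter how often it appears.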
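Co-occurrence matrix
A co-occurrence matrix encodes the context in which each word appears.
For example:
- Jerry and Mary are friends.
- Jerry buys flowers for Mary.
Counting how often two words appear directly next to each other gives the following matrix.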
| | Jerry | and | Mary | are | friends | buys | flowers | for |
|---|---|---|---|---|---|---|---|---|
| Jerry | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| and | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Mary | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
| are | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| friends | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| buys | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| flowers | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| for | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
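
A matrix of this kind can be built by sliding a small window over each sentence and counting the neighbours of every word. The sketch below assumes a window of one word on each side and pre-tokenized sentences. The function name `cooccurrence_matrix` and its `window` parameter are illustrative choices, not part of any library.

```python
def cooccurrence_matrix(sentences, window=1):
    """Count how often each pair of words appears within `window` positions."""
    # Vocabulary in order of first appearance.
    vocab = []
    for tokens in sentences:
        for token in tokens:
            if token not in vocab:
                vocab.append(token)
    index = {token: i for i, token in enumerate(vocab)}

    matrix = [[0] * len(vocab) for _ in vocab]
    for tokens in sentences:
        for i, token in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own context
                    matrix[index[token]][index[tokens[j]]] += 1
    return vocab, matrix


sentences = [
    ["Jerry", "and", "Mary", "are", "friends"],
    ["Jerry", "buys", "flowers", "for", "Mary"],
]
vocab, matrix = cooccurrence_matrix(sentences)
for word, row in zip(vocab, matrix):
    print(f"{word:8}", row)
```

Because every pair is counted from both sides, the resulting matrix is symmetric. A larger `window` would count more distant words as context as well.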