How can I get the domain of a word using WordNet in Python?

How can I find the domain of words using the nltk Python module and WordNet?

Suppose I have words like (transaction, demand draft, cheque, passbook), and the domain for all of these words is "BANK". How can I get this using nltk and WordNet in Python?

I have been trying the hypernym and hyponym relationships:

For example:

    from nltk.corpus import wordnet as wn

    sports = wn.synset('sport.n.01')
    sports.hyponyms()
    # [Synset('judo.n.01'), Synset('athletic_game.n.01'), Synset('spectator_sport.n.01'),
    #  Synset('contact_sport.n.01'), Synset('cycling.n.01'), Synset('funambulism.n.01'),
    #  Synset('water_sport.n.01'), Synset('riding.n.01'), Synset('gymnastics.n.01'),
    #  Synset('sledding.n.01'), Synset('skating.n.01'), Synset('skiing.n.01'),
    #  Synset('outdoor_sport.n.01'), Synset('rowing.n.01'), Synset('track_and_field.n.01'),
    #  Synset('archery.n.01'), Synset('team_sport.n.01'), Synset('rock_climbing.n.01'),
    #  Synset('racing.n.01'), Synset('blood_sport.n.01')]

and

    bark = wn.synset('bark.n.02')
    bark.hypernyms()
    # [Synset('noise.n.01')]

Solution

There is no explicit domain information in the Princeton WordNet or in NLTK's WordNet API.

I would recommend that you get a copy of the WordNet Domains resource and then link your synsets through its domain labels; see http://wndomains.fbk.eu/

After you have registered and completed the download, you will see a wn-domains-3.2-20070223 text file, which is tab-delimited: the first column holds the offset-PartOfSpeech identifier and the second column holds the domain tags, separated by spaces, e.g.

    00584282-v	military pedagogy
    00584395-v	military school university
    00584526-v	animals pedagogy
    00584634-v	pedagogy
    00584743-v	school university
    00585097-v	school university
    00585271-v	pedagogy
    00585495-v	pedagogy
    00585683-v	psychological_features
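Parsing that file format needs nothing beyond plain Python. Here is a minimal sketch that builds lookup tables in both directions from a few sample lines; the data is inlined here instead of being read from the downloaded file:

```python
from collections import defaultdict

# A few lines in the wn-domains-3.2-20070223 format:
# "<offset>-<pos>\t<domain> [<domain> ...]"
sample_lines = [
    "00584282-v\tmilitary pedagogy",
    "00584395-v\tmilitary school university",
    "00584526-v\tanimals pedagogy",
]

synset2domains = {}
domain2synsets = defaultdict(list)
for line in sample_lines:
    ssid, doms = line.strip().split('\t')  # identifier, then space-separated tags
    doms = doms.split()
    synset2domains[ssid] = doms
    for d in doms:
        domain2synsets[d].append(ssid)

print(synset2domains['00584395-v'])  # ['military', 'school', 'university']
print(domain2synsets['pedagogy'])    # ['00584282-v', '00584526-v']
```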

Then use the following script to access the synsets' domain(s):

    from collections import defaultdict
    from nltk.corpus import wordnet as wn

    # Load the WordNet Domains mappings.
    domain2synsets = defaultdict(list)
    synset2domains = defaultdict(list)
    for i in open('wn-domains-3.2-20070223', 'r'):
        ssid, doms = i.strip().split('\t')
        doms = doms.split()
        synset2domains[ssid] = doms
        for d in doms:
            domain2synsets[d].append(ssid)

    # Get the domains of a given synset.
    for ss in wn.all_synsets():
        ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
        if synset2domains[ssid]:  # not all synsets are in WordNet Domains.
            print(ss, ssid, synset2domains[ssid])

    # Get the synsets of a given domain.
    for dom in sorted(domain2synsets):
        print(dom, domain2synsets[dom][:3])
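Once synset2domains is built, the original question (transaction, cheque, passbook → "BANK") reduces to intersecting the domain sets of the candidate synsets. A minimal sketch, using a hypothetical toy mapping rather than the real WND file; the IDs and domain labels below are made up for illustration:

```python
def common_domains(ssids, synset2domains):
    """Return the domain tags shared by all of the given synset IDs."""
    domain_sets = [set(synset2domains.get(ssid, [])) for ssid in ssids]
    return set.intersection(*domain_sets) if domain_sets else set()

# Hypothetical mapping for illustration only; real IDs come from the WND file.
toy = {
    '00000001-n': ['banking', 'economy'],  # e.g. a synset of "cheque"
    '00000002-n': ['banking'],             # e.g. a synset of "passbook"
    '00000003-n': ['banking', 'law'],      # e.g. a synset of "transaction"
}

print(common_domains(toy.keys(), toy))  # {'banking'}
```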

Also look at wn-affect, which is very useful for disambiguating words by sentiment within the WordNet Domains resource.

NLTK v3.0 ships with the Open Multilingual WordNet (http://compling.hss.ntu.edu.sg/omw/), and since the French synsets share the same offset IDs, you can simply use WordNet Domains as a cross-lingual resource. The French lemma names can be accessed like this:

    # Get the domains of a given synset, together with the French lemmas.
    for ss in wn.all_synsets():
        ssid = str(ss.offset()).zfill(8) + "-" + ss.pos()
        if synset2domains[ssid]:  # not all synsets are in WordNet Domains.
            print(ss, ss.lemma_names('fre'), ssid, synset2domains[ssid])

Note that the most recent versions of NLTK change synset properties into "get" functions: Synset.offset becomes Synset.offset().
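If your code has to run on both old and new NLTK versions, you can hedge over that attribute-versus-method difference. A small compatibility sketch; the two stand-in classes below only mimic the two API shapes and are not real Synset objects:

```python
def get_offset(ss):
    """Work with both the old NLTK API (ss.offset is an int attribute)
    and the new one (ss.offset() is a method)."""
    off = ss.offset
    return off() if callable(off) else off

# Stand-ins for real Synset objects, for illustration only.
class OldSynset:
    offset = 584282          # old API: plain attribute

class NewSynset:
    def offset(self):        # new API: method
        return 584282

print(get_offset(OldSynset()))  # 584282
print(get_offset(NewSynset()))  # 584282
```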
