Stanford CoreNLP: Basic Usage

Author: 咪咕班克斯

Article link: https://blog.csdn.net/u012211422/article/details/116176180

I just stumbled on something new that looks very impressive, though I have not tried it yet:

https://ltp.readthedocs.io/zh_CN/latest/appendix.html

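1. Installation. Reference: https://blog.csdn.net/l919898756/article/details/81670228?spm=1001.2014.3001.5506

If you get a "system cannot find the file" error, JDK 11 is not installed correctly or its environment variables are not set. To verify the JDK installation, restart cmd (note: the restart is required) and run java -version; if it prints the version without an error, the JDK is installed.

Download: https://www.oracle.com/java/technologies/javase-jdk11-downloads.html

Environment configuration: https://jingyan.baidu.com/article/77b8dc7fa2a7c66175eab661.html

Installing Java on Linux: https://blog.csdn.net/u010993514/article/details/82926514 Newer releases such as Java 16 no longer ship jre, dt.jar, tools.jar, etc., so those entries do not need to be configured.

The two lines below are the key part. Run them in the shell when you need Java and they take effect immediately; they have to be entered again in every new session.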

export JAVA_HOME=/usr/local/java

export PATH=$PATH:$JAVA_HOME/bin

[Note] After configuring, restart PyCharm, otherwise the error will not go away.

If you prefer Java 8, see: https://www.cnblogs.com/yuanqt/p/10551801.html

#Java Env
export JAVA_HOME=/usr/jdk1.8.0_121
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
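
Since most of the trouble above comes from java not being visible to the process that starts CoreNLP, it can help to check this from Python before creating the wrapper. The following is a minimal sketch using only the standard library; the helper names find_executable and java_version_text are hypothetical and are not part of stanfordcorenlp.

```python
import shutil
import subprocess


def find_executable(name="java"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


def java_version_text(java_path):
    """Run `java -version` and return whatever it printed.

    The JVM writes its version banner to stderr, so both streams are captured.
    """
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        check=False,
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_executable("java")
    if java_path is None:
        print("java not found on PATH; check JAVA_HOME and PATH, then restart the shell/IDE")
    else:
        print("java found at:", java_path)
        print(java_version_text(java_path))
```

If this reports that java is missing when run inside PyCharm but works in a fresh terminal, the IDE is most likely still using the old environment and needs a restart.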

2. Tree visualization: install the third-party nltk package

import nltk
from nltk import Tree
from nltk.draw import TreeWidget
from nltk.draw.util import CanvasFrame
from stanfordcorenlp import StanfordCoreNLP

# Change this to your stanford-corenlp directory, e.g. '/home/byd/stanford-corenlp-4.2.0' on Linux
nlp = StanfordCoreNLP(r'C:/Users/bydlearner/PycharmProjects/pythonProject3/stanford-corenlp-4.2.0')
sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))
print('Dependency Parsing:', nlp.dependency_parse(sentence))

# Build a few small trees by hand
tree1 = nltk.Tree('NP', ['Alick'])
print(tree1)
tree2 = nltk.Tree('N', ['Alick', 'Rabbit'])
print(tree2)
tree3 = nltk.Tree('S', [tree1, tree2])
print(tree3.label())  # label of the root node
tree3.draw()  # opens a window; close it to continue

# Render the constituency parse of the sentence to a PostScript file
cf = CanvasFrame()
t = Tree.fromstring(nlp.parse(sentence))
tc = TreeWidget(cf.canvas(), t)
cf.add_widget(tc, 10, 10)  # (10, 10) offsets
cf.print_to_file('tree.ps')
cf.destroy()

nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.

Output:

C:\Users\bydlearner\anaconda3\python.exe C:/Users/bydlearner/PycharmProjects/pythonProject3/nlptry.py
Tokenize: ['Guangdong', 'University', 'of', 'Foreign', 'Studies', 'is', 'located', 'in', 'Guangzhou', '.']
Part of Speech: [('Guangdong', 'NNP'), ('University', 'NNP'), ('of', 'IN'), ('Foreign', 'NNP'), ('Studies', 'NNPS'), ('is', 'VBZ'), ('located', 'VBN'), ('in', 'IN'), ('Guangzhou', 'NNP'), ('.', '.')]
Named Entities: [('Guangdong', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('of', 'ORGANIZATION'), ('Foreign', 'ORGANIZATION'), ('Studies', 'ORGANIZATION'), ('is', 'O'), ('located', 'O'), ('in', 'O'), ('Guangzhou', 'CITY'), ('.', 'O')]
Constituency Parsing: (ROOT
(S
(NP
(NP (NNP Guangdong) (NNP University))
(PP (IN of)
(NP (NNP Foreign) (NNPS Studies))))
(VP (VBZ is)
(VP (VBN located)
(PP (IN in)
(NP (NNP Guangzhou)))))
(. .)))
Dependency Parsing: [('ROOT', 0, 7), ('compound', 2, 1), ('nsubj:pass', 7, 2), ('case', 5, 3), ('compound', 5, 4), ('nmod', 2, 5), ('aux:pass', 7, 6), ('case', 9, 8), ('obl', 7, 9), ('punct', 7, 10)]
(NP Alick)
(N Alick Rabbit)
S

You can also visualize the parse tree of the earlier sentence directly (Guangdong University of Foreign Studies is located in Guangzhou.):

Tree.fromstring(nlp.parse(sentence)).draw()

Saving the drawing as a PNG is also possible. It requires installing Ghostscript; see: https://stackoverflow.com/questions/44587376/oserror-unable-to-locate-ghostscript-on-paths

Download: https://www.ghostscript.com/download/gsdnld.html

Set EpsImagePlugin.gs_windows_binary to the path of the Ghostscript executable (gswin64c, gswin32c, or gs) if you do not want to change the system PATH:
from PIL import EpsImagePlugin, Image
EpsImagePlugin.gs_windows_binary = r'X:\...\gs\gs9.52\bin\gswin64c'
im = Image.open('myimage.eps')
im.save('myimage.png')

from nltk import Tree
from nltk.draw import TreeWidget
from nltk.draw.util import CanvasFrame
from PIL import EpsImagePlugin, Image
from stanfordcorenlp import StanfordCoreNLP

# Point Pillow at the Ghostscript executable so it can read PostScript files
EpsImagePlugin.gs_windows_binary = r'C:\Program Files\gs\gs9.54.0\bin\gswin64c'

# Change this to your stanford-corenlp directory, e.g. '/home/byd/stanford-corenlp-4.2.0' on Linux
nlp = StanfordCoreNLP(r'C:/Users/bydlearner/PycharmProjects/pythonProject3/stanford-corenlp-4.2.0')
sentence = 'Guangdong University of Foreign Studies is located in Guangzhou.'
print('Tokenize:', nlp.word_tokenize(sentence))
print('Part of Speech:', nlp.pos_tag(sentence))
print('Named Entities:', nlp.ner(sentence))
print('Constituency Parsing:', nlp.parse(sentence))
print('Dependency Parsing:', nlp.dependency_parse(sentence))

# Tree.fromstring(nlp.parse(sentence)).draw()
# Render the parse tree to a PostScript file
cf = CanvasFrame()
t = Tree.fromstring(nlp.parse(sentence))
tc = TreeWidget(cf.canvas(), t)
cf.add_widget(tc, 10, 10)  # (10, 10) offsets
cf.print_to_file('tree.ps')
cf.destroy()

# Convert the PostScript file to PNG (this step needs Ghostscript)
psimage = Image.open('tree.ps')
psimage.save('tree.png')
nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.

This saves a PNG file.

The rendered font does not look great, though 🐖

3. A simple usage example

References:

【1】https://blog.csdn.net/chansonzhang/article/details/84326415

【2】 https://blog.csdn.net/weixin_32115639/article/details/111899147

【3】http://www.voidcn.com/article/p-nwdxcntd-btk.html

Traversing all subtrees

import nltk
s = '(ROOT (S (NP (NNP Europe)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends)))) (. .)))'
tree = nltk.tree.Tree.fromstring(s)

def traverse_tree(tree):
    print("tree:", tree)  # visit the current subtree
    for subtree in tree:
        # Leaves are plain strings; only recurse into nested trees
        if isinstance(subtree, nltk.tree.Tree):
            traverse_tree(subtree)

traverse_tree(tree)
# This walks the tree depth-first.

4. Meaning of the labels in the tree other than the words of the sentence

References:

https://blog.csdn.net/lihaitao000/article/details/51812618

https://www.cnblogs.com/elpsycongroo/p/9369111.html

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner (e.g. this, that, these, those, such; indefinite determiners: no, some, any, each, every, enough, either, neither, all, both, half, several, many, much, (a) few, (a) little, other, another)
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective (including ordinal numbers)
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to (as a preposition or as the infinitive marker)
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner (relative determiners: whose, which; interrogative determiners: what, which, whose)
34. WP Wh-pronoun (e.g. who, which)
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb (how, where, when)

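ROOT: the sentence being processed; IP: simple clause; NP: noun phrase; VP: verb phrase; PU: sentence-breaking punctuation, usually a period, question mark, exclamation mark, etc.; LCP: localizer phrase; PP: prepositional phrase; CP: phrase formed with '的' that expresses a modifying relation; DNP: phrase formed with '的' that expresses possession; ADVP: adverb phrase; ADJP: adjective phrase; DP: determiner phrase; QP: quantifier phrase; NN: common noun; NR: proper noun; NT:

What I ultimately want is to output the phrases: NP: noun phrase; VP: verb phrase; LCP: localizer phrase; PP: prepositional phrase; CP: phrase formed with '的' that expresses a modifying relation; DNP: phrase formed with '的' that expresses possession; ADVP: adverb phrase; ADJP: adjective phrase; DP: determiner phrase; QP: quantifier phrase.

Solution: while traversing the tree, add every subtree to a list, then iterate over the list and pick out the phrases whose label() is one of the types above.

This can be combined with the code at: http://www.cocoachina.com/articles/259677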
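
The approach described above (collect the subtrees, then keep those whose label is a phrase type) can be sketched with nltk's own Tree.subtrees() method. This is only an illustration: the function name extract_phrases and the PHRASE_LABELS set are my own choices, and the example reuses the English parse string from section 3 rather than a Chinese sentence.

```python
from nltk import Tree

# Phrase labels to keep; adjust this set to the tag set your parser emits.
PHRASE_LABELS = {"NP", "VP", "PP", "ADVP", "ADJP", "QP", "DP", "LCP", "CP", "DNP"}


def extract_phrases(tree, labels=PHRASE_LABELS):
    """Return (label, phrase text) pairs for every subtree whose label is in `labels`."""
    phrases = []
    # Tree.subtrees() walks the tree depth-first, starting with the tree itself
    for subtree in tree.subtrees(filter=lambda t: t.label() in labels):
        phrases.append((subtree.label(), " ".join(subtree.leaves())))
    return phrases


s = '(ROOT (S (NP (NNP Europe)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends)))) (. .)))'
tree = Tree.fromstring(s)
for label, text in extract_phrases(tree):
    print(label, "->", text)
```

For Chinese text you would probably join the leaves with an empty string instead of a space.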

5. Converting the dependency tree into an adjacency matrix

Image source: https://blog.csdn.net/l919898756/article/details/81670228

print('Dependency Parsing:', nlp.dependency_parse(sentence))

Dependency Parsing: [('ROOT', 0, 4), ('nsubj:pass', 4, 1), ('aux', 4, 2), ('aux:pass', 4, 3), ('case', 7, 5), ('compound', 7, 6), ('obl', 4, 7), ('punct', 4, 8)]

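Apart from the ROOT entry, every triple describes an edge between two tokens. Note that the indices here start at 1, whereas Python indexing starts at 0, so subtract 1. The parser is also sensitive to % and #: it does not recognize % at all, and # increases the token count compared with a plain split(). Remember to filter them out! It also breaks tokens at the - symbol, and it handles commas as separate tokens too.

In the GloVe tokens, commas and similar characters are removed; for hyphenated words, if the dictionary does not contain them, they are set to

Many networks like to use GloVe, but it is best to preprocess the sentences first, otherwise much of the text cannot be used. For example, remove the # symbol in front of numbers, otherwise the number is treated as an unrecognized token. By default 19901 is used as the padding index and unrecognized tokens are replaced with 19900. Times of day are not recognized, and neither are decimals. Doesn't this get in the way of my work? Numbers matter a lot to me. Here ? and commas are treated only as separators (there are many examples; see below).

In the end it turned out that all of this was caused by the custom preprocessing applied before GloVe, which I had not read beforehand.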
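
The kind of filtering mentioned above could look roughly like this. It is a sketch of one possible cleanup step, not the preprocessing actually used in the project: the function name clean_question and the three rules are assumptions chosen to match the symptoms described (a # in front of digits, % signs, and commas glued to the next word).

```python
import re


def clean_question(text):
    """Light cleanup before tokenization (illustrative rules only)."""
    # Drop a '#' that directly precedes a digit, e.g. '#12' -> '12'
    text = re.sub(r"#(?=\d)", "", text)
    # Remove '%' signs, which the parser reportedly does not handle
    text = text.replace("%", "")
    # Add a space after a comma glued to a following letter, e.g. 'round,red' -> 'round, red'
    text = re.sub(r",(?=[A-Za-z])", ", ", text)
    # Collapse any repeated whitespace
    return re.sub(r"\s+", " ", text).strip()


print(clean_question("What are the round,red things?"))
print(clean_question("Is bus #12 50% full?"))
```

Commas inside numbers such as 1,000 are left alone on purpose, since the comma rule only fires before a letter.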

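By default the Stanford NLP parser splits wanna, gotta, gonna, doesn't, etc. into two parts. Watch out for this!

gotta, gonna, 3d, wanna, cannot, double quotes, /, ( ), : and ; are all split automatically during tokenization, and so are units such as mm. Correction: it is not about units; the parser splits any token that starts with digits and continues with letters!

Note that not every number can be mapped to a vector: here, for example, 234 has no corresponding entry, but 23 does.

Below are some sentences from VQA 2.0. Their token counts are based on splitting on spaces, but the NLP parser tokenizes more finely, so the ids no longer line up. What follows is the embarrassing evidence: in most cases a single space-delimited chunk contains a comma, or else it starts with digits followed by letters, which makes the parser split it further!

# Embarrassing examples: sentence, token count, token ids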
Is it 4am or 4pm in the picture?
8
[1, 122, 14302, 31, 14303, 23, 11, 48]

What types of fruit are in the bowl>1?
8
[0, 712, 95, 239, 68, 23, 11, 15181]

What appliance,isn't white?
3
[0, 16217, 38]

What kind of hat is the girl wearing,who does it usually represent?
12
[0, 210, 95, 84, 1, 11, 298, 14643, 49, 122, 2392, 3231]

Are the nails,phone, flower, and lipstick all from the same color family?
12
[68, 11, 14864, 218, 226, 7093, 219, 61, 11, 110, 10, 1786]



Are they looking for more guys to hang out with,or ladies?
11
[68, 183, 5, 99, 578, 2503, 141, 2366, 682, 9566, 544]

What color is the gas tank,on the far left motor bike?
11
[0, 10, 1, 11, 2169, 9927, 11, 988, 146, 4062, 883]

Does the coffee,have cream?
4
[49, 11, 12480, 246]

What are the round,red things?
5
[0, 68, 11, 10360, 454]


Three rocks are in the same shape,what shape is it?
10
[164, 158, 68, 23, 11, 110, 4212, 276, 1, 122]

Lid open,or closes?
3
[505, 8822, 8021]

What else is in the sky,besides kites?
7
[0, 2049, 1, 23, 11, 9449, 254]

Why is the person,s body in that position?
8
[196, 1, 11, 21, 83, 23, 207, 7]

What color,besides white,are the other planes?
6
[0, 11230, 11231, 11, 811, 281]

What is next to the mouse,and headphone?
7
[0, 1, 185, 141, 11, 13887, 13888]


What type of jacket is the man,on the bike wearing?
10
[0, 205, 95, 90, 1, 11, 12827, 11, 883, 58]

Which primary color is the man *not* wearing?
8
[97, 499, 10, 1, 11, 8, 4672, 58]



Is she making food...with an iron?
6
[1, 44, 836, 6749, 250, 4914]


What is it the street,that shouldn't be?
7
[0, 1, 122, 11, 8862, 3977, 81]
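
One way to live with the mismatch in the examples above is to record, for every parser token, which space-delimited chunk it was cut from, so that the existing per-chunk ids can be reused. The sketch below assumes each parser token appears verbatim inside its chunk (it raises an error otherwise, for instance if the tokenizer rewrites quotes or brackets). The function name align_tokens is hypothetical, and the fine-grained token list is written by hand to imitate the splitting described above rather than produced by CoreNLP.

```python
def align_tokens(sentence, fine_tokens):
    """Return owner, where owner[i] is the index of the space-delimited
    chunk of `sentence` that contains fine_tokens[i]."""
    chunks = sentence.split()
    owner = []
    chunk_index, offset = 0, 0
    for token in fine_tokens:
        # Move on once the current chunk has been fully consumed
        while chunk_index < len(chunks) and offset >= len(chunks[chunk_index]):
            chunk_index += 1
            offset = 0
        if chunk_index >= len(chunks):
            raise ValueError(f"ran out of chunks before token {token!r}")
        found = chunks[chunk_index].find(token, offset)
        if found == -1:
            raise ValueError(f"token {token!r} not found in chunk {chunks[chunk_index]!r}")
        owner.append(chunk_index)
        offset = found + len(token)
    return owner


sentence = "Is it 4am or 4pm in the picture?"
chunk_ids = [1, 122, 14302, 31, 14303, 23, 11, 48]  # ids from the first example above
fine_tokens = ["Is", "it", "4", "am", "or", "4", "pm", "in", "the", "picture", "?"]  # hand-written

owner = align_tokens(sentence, fine_tokens)
fine_ids = [chunk_ids[i] for i in owner]  # every sub-token inherits its chunk's id
print(owner)
print(fine_ids)
```

Whether sub-tokens should simply inherit the chunk id or be pooled in some other way depends on the model; the owner list only provides the mapping.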

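import numpy as np

# Each entry is (relation, head index, dependent index); indices are 1-based
cc = nlp.dependency_parse(sentence)
ll = len(cc)  # for a single sentence there is one entry per token, including ROOT
cc_matrix = np.zeros((ll, ll))
for rel, head, dep in cc:
    if rel == 'ROOT':
        continue  # ROOT has head 0 and is not an edge between two tokens
    cc_matrix[head - 1][dep - 1] = 1  # shift to 0-based indices
print(cc_matrix)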
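
Graph-based models often expect the matrix to be symmetric and to include self-loops. Whether that is needed here is an assumption on my part, so both are optional switches in the sketch below. It uses the triples printed earlier in this section, so it runs without a CoreNLP server; the function name dependency_adjacency is hypothetical.

```python
import numpy as np


def dependency_adjacency(triples, symmetric=True, self_loops=True):
    """Build an adjacency matrix from (relation, head, dependent) triples
    that use 1-based token indices, as returned by dependency_parse."""
    n = len(triples)  # one triple per token for a single sentence
    matrix = np.zeros((n, n), dtype=int)
    for relation, head, dependent in triples:
        if head == 0:
            continue  # the ROOT entry does not connect two real tokens
        matrix[head - 1, dependent - 1] = 1
        if symmetric:
            matrix[dependent - 1, head - 1] = 1
    if self_loops:
        np.fill_diagonal(matrix, 1)
    return matrix


triples = [('ROOT', 0, 4), ('nsubj:pass', 4, 1), ('aux', 4, 2), ('aux:pass', 4, 3),
           ('case', 7, 5), ('compound', 7, 6), ('obl', 4, 7), ('punct', 4, 8)]
adj = dependency_adjacency(triples)
print(adj)
```

Calling dependency_adjacency(triples, symmetric=False, self_loops=False) gives the same directed head-to-dependent matrix as the loop above.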

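Illustration of the final result:

Meaning of the dependency relation labels. Note that the list below uses the older Stanford Dependencies names (e.g. nsubjpass, nn), while the output shown earlier uses the newer style of names (e.g. nsubj:pass, compound):

Reference: https://zhuanlan.zhihu.com/p/52923442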

abbrev: abbreviation modifier
acomp: adjectival complement
advcl: adverbial clause modifier
advmod: adverbial modifier
agent: agent, usually appears together with "by"
amod: adjectival modifier
appos: appositional modifier
attr: attributive
aux: auxiliary, i.e. non-main verbs and auxiliaries such as be, have, should/could
auxpass: passive auxiliary
cc: coordination, usually attached to the first conjunct
ccomp: clausal complement
complm: complementizer, links the word that introduces a clause to the main verb of that clause
conj: conjunct, connects two coordinated words
cop: copula, the linking verb (be, seem, appear, etc.) between the subject and the predicate
csubj: clausal subject
csubjpass: clausal passive subject
dep: dependent, an unspecified dependency
det: determiner, e.g. articles
dobj: direct object
expl: expletive, mainly captures existential "there"
infmod: infinitival modifier
iobj: indirect object
mark: marker, mainly appears with "that", "whether", "because", "when"
mwe: multi-word expression
neg: negation modifier
nn: noun compound modifier
npadvmod: noun phrase as adverbial modifier
nsubj: nominal subject
nsubjpass: passive nominal subject
num: numeric modifier
number: element of compound number
parataxis: parataxis, a side-by-side (juxtaposed) relation
partmod: participial modifier
pcomp: prepositional complement
pobj: object of a preposition
poss: possession modifier
possessive: possessive modifier, the relation between the possessor and the 's
preconj: preconjunct, usually appears with "either", "both", "neither"
predet: predeterminer, usually expresses "all"
prep: prepositional modifier
prepc: prepositional clausal modifier
prt: phrasal verb particle
punct: punctuation; according to the reference it is kept in the scheme but normally left out of the results, although it does appear in the dependency_parse output shown earlier
purpcl: purpose clause modifier
quantmod: quantifier phrase modifier
rcmod: relative clause modifier
ref: referent
rel: relative
root: root, the most important word and the node the tree starts from
tmod: temporal modifier
xcomp: open clausal complement
xsubj: controlling subject
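
To read these relation labels against an actual sentence, it helps to replace the numeric indices in the dependency_parse output with the words they point to. The following is a small sketch using the tokens and triples printed in section 2; the function name describe_dependencies is hypothetical.

```python
def describe_dependencies(tokens, triples):
    """Turn (relation, head, dependent) triples with 1-based indices into
    strings such as 'nsubj:pass(located-7, University-2)'."""
    lines = []
    for relation, head, dependent in triples:
        # Index 0 is the artificial ROOT node, not a real token
        head_word = "ROOT" if head == 0 else tokens[head - 1]
        lines.append(f"{relation}({head_word}-{head}, {tokens[dependent - 1]}-{dependent})")
    return lines


tokens = ['Guangdong', 'University', 'of', 'Foreign', 'Studies',
          'is', 'located', 'in', 'Guangzhou', '.']
triples = [('ROOT', 0, 7), ('compound', 2, 1), ('nsubj:pass', 7, 2), ('case', 5, 3),
           ('compound', 5, 4), ('nmod', 2, 5), ('aux:pass', 7, 6), ('case', 9, 8),
           ('obl', 7, 9), ('punct', 7, 10)]
for line in describe_dependencies(tokens, triples):
    print(line)
```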
