[Solved] Setting up Stanford Parser through NLTK on Windows 10 for Chinese dependency parsing

Latest recommended article published 2021-07-30 11:36:27

Victor-H


Category: NLP. Tags: NLP, StanfordNLP


Original post: http://blog.csdn.net/lr231654111/article/details/74129620

Basic environment setup

I mainly followed this article for the setup:

Using Stanford parser in NLTK: http://blog.csdn.net/sherrylml/article/details/45197187

However, I ran into the following error:

OSError: Java command failed :

The following two threads helped me solve it:

https://stackoverflow.com/questions/35624245/java-command-failed-when-running-nltk-stanfordparser

https://github.com/nltk/nltk/issues/1239

I used the approach from the second thread, "hack the stanford parser classpath". The key code is:

from nltk.internals import find_jars_within_path
# parser: the Stanford parser object created beforehand
# stanford_dir: assumed to be the directory that contains the Stanford jars
parser._classpath = tuple(find_jars_within_path(stanford_dir))
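For reference, my understanding is that find_jars_within_path simply walks the given directory and returns every .jar file under it, so the hack puts all the bundled jars on the Java classpath rather than only the two passed to the constructor. A rough standard-library equivalent is sketched below; list_jars is a hypothetical name used only for illustration and is not part of NLTK.

```python
import fnmatch
import os


def list_jars(root_dir):
    """Return the paths of all *.jar files under root_dir, searched recursively."""
    jars = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        # keep only the file names that end in .jar
        for name in fnmatch.filter(filenames, "*.jar"):
            jars.append(os.path.join(dirpath, name))
    return sorted(jars)
```

Under that assumption, tuple(list_jars(stanford_dir)) should give roughly the same classpath as the NLTK helper.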

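Chinese parsing

My goal was to see how consistent the dependency parses from Stanford parser and LTP are.

Two things matter when processing Chinese with Stanford parser: use a Chinese model, and mind the encoding (pass unicode input).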
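The encoding point can be sketched as follows: decode any byte strings to text before handing them to the parser. to_unicode is a hypothetical helper, and UTF-8 is only an assumed source encoding (files on Windows are often GBK instead).

```python
def to_unicode(text, encoding="utf-8"):
    """Return text as a unicode string, decoding it first if it is bytes."""
    if isinstance(text, bytes):
        return text.decode(encoding)
    return text


# bytes as they might come from a file opened in binary mode
raw = "我 爱 自然 语言 处理".encode("utf-8")
sentence = to_unicode(raw)

# for models that expect pre-segmented input, split on whitespace
tokens = sentence.split()
```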

The Chinese model that Stanford provides is set through the following arguments:

from nltk.parse import stanford

parser = stanford.StanfordDependencyParser('D:/jars/stanford-parser.jar',
    'D:/jars/stanford-parser-3.8.0-models.jar',
    'D:/LearnNLTK/stanfordNLTK/stanford-parser-3.8.0-models/edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz') # the model to use

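The main models are: chinesePCFG (the simplest and fastest); chineseFactored (reportedly more accurate, but memory-hungry); xinhuaPCFG and xinhuaFactored (trained on a corpus from mainland China). All of the models above require the input to be word-segmented first. xinhuaFactoredSegmenting has segmentation built in.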
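The model choice can be kept in a small lookup table. This sketch assumes every model sits next to chinesePCFG.ser.gz in the models jar and uses the same .ser.gz suffix, which I have not verified for each file. CHINESE_MODELS, model_path and needs_segmentation are hypothetical names.

```python
MODEL_DIR = "edu/stanford/nlp/models/lexparser/"

# model name -> True if the input must already be word-segmented
CHINESE_MODELS = {
    "chinesePCFG": True,
    "chineseFactored": True,
    "xinhuaPCFG": True,
    "xinhuaFactored": True,
    "xinhuaFactoredSegmenting": False,  # segments the input itself
}


def model_path(name):
    """Build the assumed path of a model inside the models jar."""
    if name not in CHINESE_MODELS:
        raise KeyError("unknown model: " + name)
    return MODEL_DIR + name + ".ser.gz"


def needs_segmentation(name):
    """Tell whether the caller has to segment the text before parsing."""
    return CHINESE_MODELS[name]
```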

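In the end, though, I could not complete the comparison between Stanford parser and LTP. Stanford parser is fairly strict about sentence completeness and reports an error on some incomplete sentences:

the graph doesn't contain a node that depends on the root element.

I suspect this is because my corpus contains many subjectless and otherwise incomplete sentences.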
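One possible way around this, which I did not try, is to skip the sentences that fail and compare only the rest. The sketch below assumes the message may surface either as an exception or as a Python warning depending on the NLTK version, so it treats both as a failure. parse_all and parse_fn are hypothetical names.

```python
import warnings


def parse_all(sentences, parse_fn):
    """Apply parse_fn to each sentence; return (parsed, failed) lists."""
    parsed, failed = [], []
    for sent in sentences:
        try:
            with warnings.catch_warnings():
                # promote warnings to exceptions so they count as failures
                warnings.simplefilter("error")
                parsed.append((sent, parse_fn(sent)))
        except Exception as exc:
            # remember the sentence and the message for later inspection
            failed.append((sent, str(exc)))
    return parsed, failed
```

With the parser from above, parse_fn could be something like lambda s: list(parser.raw_parse(s)), and the failed list then shows how many sentences were dropped from the comparison.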