Stanford Parser Usage Notes

Preface: for my current project I wanted to try Stanford's parser: parse sentences into syntax trees, analyze the subtrees, combine them with tree kernels, and train on that. Downloading the Stanford parser was the easy part; actually using it was a pain. There is a pile of documentation, but no quick, convenient overall introduction.

1. First, sharpen your tools

Stanford Parser homepage: http://nlp.stanford.edu/software/lex-parser.shtml

Stanford Parser download: http://nlp.stanford.edu/software/lex-parser.shtml#Download

Other tooling (Java, Python, and so on) depends on your project; more on that as it comes up.

2. Usage (Stanford Parser)

After downloading and unpacking, follow README.txt. I am on Ubuntu 15.04, where Java 7 is not enough; as in my previous post, four lines install Java 8:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
$ java -version

With Java 8 ready, you can run the Stanford parser on Ubuntu. Per the README, run the lexparser.sh script with a filename argument. data/testsent.txt contains five English sentences.

On a Unix system you should be able to parse the English test file with the
following command:	

    ./lexparser.sh data/testsent.txt

This uses the PCFG parser, which is quick to load and run, and quite accurate.

[Notes: it takes a few seconds to load the parser data before parsing
begins; continued parsing is quicker. To use the lexicalized parser, replace
englishPCFG.ser.gz with englishFactored.ser.gz in the lexparser.sh script
and use the flag -mx600m to give more memory to java.]
Running ./lexparser.sh data/testsent.txt in a terminal from the directory containing lexparser.sh gives the following (excerpt):
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ...  done [0.5 sec].
Parsing file: data/testsent.txt
Parsing [sent. 1 len. 21]: Scores of properties are under extreme fire threat as a huge blaze continues to advance through Sydney 's north-western suburbs .
(ROOT
  (S
    (NP
      (NP (NNS Scores))
      (PP (IN of)
        (NP (NNS properties))))
    (VP (VBP are)
      (PP (IN under)
        (NP (JJ extreme) (NN fire) (NN threat)))
      (SBAR (IN as)
        (S
          (NP (DT a) (JJ huge) (NN blaze))
          (VP (VBZ continues)
            (S
              (VP (TO to)
                (VP (VB advance)
                  (PP (IN through)
                    (NP
                      (NP (NNP Sydney) (POS 's))
                      (JJ north-western) (NNS suburbs))))))))))
    (. .)))

nsubj(threat-8, Scores-1)
case(properties-3, of-2)
nmod:of(Scores-1, properties-3)
cop(threat-8, are-4)
case(threat-8, under-5)
amod(threat-8, extreme-6)
compound(threat-8, fire-7)
root(ROOT-0, threat-8)
mark(continues-13, as-9)
det(blaze-12, a-10)
amod(blaze-12, huge-11)
nsubj(continues-13, blaze-12)
nsubj(advance-15, blaze-12)
advcl(threat-8, continues-13)
mark(advance-15, to-14)
xcomp(continues-13, advance-15)
case(suburbs-20, through-16)
nmod:poss(suburbs-20, Sydney-17)
case(Sydney-17, 's-18)
amod(suburbs-20, north-western-19)
nmod:through(advance-15, suburbs-20)
As you can see, the Stanford parser handles English well, and produces two kinds of analysis: a phrase-structure tree and typed dependencies. Other English data parses just as nicely. If you think the story ends here: too young, too simple.
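The typed-dependency lines in the second half of the output have a regular relation(head-index, dependent-index) shape, so they are easy to post-process. A minimal sketch; the regex and the `parse_dep_line` name are my own, not part of the toolkit:

```python
import re

# One output line looks like: nsubj(threat-8, Scores-1)
DEP_RE = re.compile(r'^(\S+?)\((.+)-(\d+), (.+)-(\d+)\)$')

def parse_dep_line(line):
    """Split one typed-dependency line into (relation, head, head_idx, dep, dep_idx)."""
    m = DEP_RE.match(line.strip())
    if not m:
        return None
    rel, head, head_idx, dep, dep_idx = m.groups()
    return (rel, head, int(head_idx), dep, int(dep_idx))

print(parse_dep_line("nsubj(threat-8, Scores-1)"))
# → ('nsubj', 'threat', 8, 'Scores', 1)
```

Collecting these tuples per sentence gives a dependency graph you can feed into downstream features.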

My data, however, is Chinese. Same approach: I changed "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz" in lexparser.sh to "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz" and switched to Chinese data, expecting it to just work. It was painfully, painfully slow, and no matter what I tried, the whole input was parsed as a single sentence. The cause is missing word segmentation (the Chinese models expect pre-segmented input), though it could also be an option I haven't set correctly. Other blog posts didn't offer a fix either.
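For the record, the Chinese grammars expect input that is already word-segmented: one sentence per line, tokens separated by spaces (Stanford also distributes its own Chinese segmenter for this step). A minimal sketch of preparing such a file, assuming you already have token lists from some segmenter; `write_segmented` is my own helper name:

```python
# Sketch: the Chinese models want whitespace-delimited, pre-segmented text,
# one sentence per line. The segmentation itself must come from a separate
# tool (e.g. the Stanford Chinese segmenter); here we just format its output.
def write_segmented(sentences, path):
    """sentences: list of token lists, e.g. [["我", "的", "名字", "叫", "小明", "。"]]"""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            f.write(" ".join(tokens) + "\n")

write_segmented([["我", "的", "名字", "叫", "小明", "。"]], "chinese-segmented.txt")
# then: ./lexparser.sh chinese-segmented.txt
# (with chineseFactored.ser.gz substituted into the script)
```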

To be continued...

3. Usage, take 2 (NLTK + stanford-parser.jar)

A colleague saw me wrestling with the Stanford parser, mentioned that NLTK wraps it, and demoed it on the spot. The tool was right under my nose and I had no idea NLTK could do this. Note that the result comes back only in list form:

In [8]: from nltk.parse import stanford

In [9]: stanford.StanfordParser?
Type:            type
String form:     <class 'nltk.parse.stanford.StanfordParser'>
File:            /home/shifeng/anaconda/lib/python2.7/site-packages/nltk/parse/stanford.py
Init definition: stanford.StanfordParser(self, path_to_jar=None, path_to_models_jar=None, model_path=u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz', encoding=u'UTF-8', verbose=False, java_options=u'-mx1000m')
Docstring:
Interface to the Stanford Parser

>>> parser=StanfordParser(
...     model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz"
... )
>>> parser.raw_parse_sents((
...     "the quick brown fox jumps over the lazy dog",
...     "the quick grey wolf jumps over the lazy fox"
... ))
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']),
Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']),
Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]), Tree('ROOT', [Tree('NP',
[Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP',
[Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']),
Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]
I tried the same thing and it kept failing. My colleague said the jar hadn't been downloaded and tried to fetch it via nltk.download, which didn't work either; I had in fact already downloaded the jars from the website. Following other blog posts, NLTK plus stanford-parser.jar parses sentences like this:

In [12]: import os

In [13]: os.environ["STANFORD_PARSER"] = "stanford-parser.jar"
In [14]: os.environ["STANFORD_MODELS"] = "stanford-parser-3.5.2-models.jar"
In [15]: parser = stanford.StanfordParser(model_path=u'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')

In [16]: sentences = parser.raw_parse_sents(("the quick brown fox jumps over the lazy dog","the quick grey wolf jumps over the lazy fox"))

In [17]: sentences
Out[17]: 
[Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['brown']), Tree('NN', ['fox'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['dog'])])])])])]),
 Tree('ROOT', [Tree('NP', [Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['quick']), Tree('JJ', ['grey']), Tree('NN', ['wolf'])]), Tree('NP', [Tree('NP', [Tree('NNS', ['jumps'])]), Tree('PP', [Tree('IN', ['over']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['lazy']), Tree('NN', ['fox'])])])])])])]

In [18]: sentences = parser.raw_parse_sents(("Hello, My name is Melroy.", "What is your name?"))

In [19]: sentences
Out[19]: 
[Tree('ROOT', [Tree('S', [Tree('INTJ', [Tree('UH', ['Hello'])]), Tree(',', [',']), Tree('NP', [Tree('PRP$', ['My']), Tree('NN', ['name'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['Melroy'])])]), Tree('.', ['.'])])]),
 Tree('ROOT', [Tree('SBARQ', [Tree('WHNP', [Tree('WP', ['What'])]), Tree('SQ', [Tree('VBZ', ['is']), Tree('NP', [Tree('PRP$', ['your']), Tree('NN', ['name'])])]), Tree('.', ['?'])])])]
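These list-of-Tree results are exactly the raw material for the tree-kernel plan from the preface: many tree kernels count matching production rules between two parses. A stdlib-only sketch of pulling productions out of a bracketed parse string; with NLTK installed, `Tree.fromstring` and `tree.productions()` do the same job:

```python
# Sketch: extract "LABEL -> children" production rules from a Penn-style
# bracketed parse, the unit a simple tree kernel counts. Stdlib only.
def parse_sexpr(s):
    """Parse '(NP (DT the) (NN dog))' into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def walk(i):
        label = tokens[i + 1]          # token after '(' is the node label
        i += 2
        children = []
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = walk(i)     # recurse into a subtree
                children.append(child)
            else:
                children.append(tokens[i])  # leaf word
                i += 1
        return (label, children), i + 1
    tree, _ = walk(0)
    return tree

def productions(tree):
    """Yield one 'LABEL -> child labels' string per internal node, top-down."""
    label, children = tree
    kids = [c[0] if isinstance(c, tuple) else c for c in children]
    yield "%s -> %s" % (label, " ".join(kids))
    for c in children:
        if isinstance(c, tuple):
            yield from productions(c)

t = parse_sexpr("(NP (DT the) (JJ lazy) (NN dog))")
print(list(productions(t)))
# → ['NP -> DT JJ NN', 'DT -> the', 'JJ -> lazy', 'NN -> dog']
```

Comparing two sentences' production multisets is the simplest subset-tree kernel; fancier kernels also count larger tree fragments.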

4. Usage, take 3 (Eclipse + Java)

I didn't really want to use Java, or Eclipse on Ubuntu, but after watching a senior labmate run the syntactic analysis in Eclipse, I gave it a try. It works, though I only got tree-structured output; presumably the object being built is a tree, and list-style output should be obtainable too.

import java.io.IOException;

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class Parser {

	public static void main(String[] args) throws IOException {
//		String grammar = "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz";
		String grammar = "edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz";
		String[] options = {};
		LexicalizedParser lp = LexicalizedParser.loadModel(grammar, options);
		// Input must already be word-segmented (whitespace-separated tokens).
		String line = "我 的 名字 叫 小明 ?";
		Tree parse = lp.parse(line);
		parse.pennPrint();

		// Second run: invoke the command-line entry point on a file, printing
		// both the phrase-structure tree and collapsed typed dependencies.
		String[] arg2 = {"-encoding", "utf-8",
				"-outputFormat", "penn,typedDependenciesCollapsed",
				"edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz",
				"/home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt"};
		LexicalizedParser.main(arg2);
	}
}
Output:

Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar 
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz ...  done [0.8 sec].
(ROOT
Loading parser from serialized file edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz ...   (IP
    (NP
      (DNP
        (NP (PN 我))
        (DEG 的))
      (NP (NN 名字)))
    (VP (VV 叫)
      (NP (NN 小明)))
    (PU ?)))
 done [4.1 sec].
Parsing file: /home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt
Parsing [sent. 1 len. 8]: 俄国 希望 伊朗 没有 制造 核武器 计划 。
(ROOT
  (IP
    (NP (NR 俄国))
    (VP (VV 希望)
      (IP
        (NP (NR 伊朗))
        (VP
          (ADVP (AD 没有))
          (VP (VV 制造)
            (NP (NN 核武器) (NN 计划))))))
    (PU 。)))

nsubj(希望-2, 俄国-1)
root(ROOT-0, 希望-2)
nsubj(制造-5, 伊朗-3)
neg(制造-5, 没有-4)
ccomp(希望-2, 制造-5)
nn(计划-7, 核武器-6)
dobj(制造-5, 计划-7)

Parsed file: /home/shifeng/shifengworld/study/tool/stanford_parser/stanford-parser-full-2015-04-20/data/chinese-onesent-utf8.txt [1 sentences].
Parsed 8 words in 1 sentences (30.42 wds/sec; 3.80 sents/sec).


Java has never been my strong suit, so the search for other routes continues...


5. Lessons learned

Look things up widely, and push through the English documentation even when it's slow going.

References:

1. Stack Overflow: http://stackoverflow.com/questions/13883277/stanford-parser-and-nltk

2. NLTK API docs: http://www.nltk.org/api/nltk.parse.html

3. NLTK source: http://www.nltk.org/_modules/nltk/parse/stanford.html

4. Stanford Parser FAQ: http://nlp.stanford.edu/software/parser-faq.shtml

5. Stanford Parser download: http://nlp.stanford.edu/software/lex-parser.shtml#Download

6. Blog post: http://blog.sina.com.cn/s/blog_8e037f440101eg93.html

7. Blog post: http://www.cnblogs.com/stGeekpower/p/3457746.html

8. Baidu Wenku: http://wenku.baidu.com/link?url=KZDYgJDnme7yIDOCoPpNClV1Z95yiyf5n2YiT4BD-6eNTVcPM8sPTYmx5qxajsX6snGTgpaHUcsB0oI2W2jQOAC2nwdzUkdVkmwnEHQp0jG


