Stanford Parser: Getting Started

Link: https://www.cnblogs.com/stGeekpower/p/3457746.html

1. What is the Stanford parser?
The Stanford parser is one of the tools provided by the Stanford NLP Group; it performs syntactic parsing and supports English, Chinese, German, French, Arabic, and several other languages.

The precompiled jars, source code, javadoc, and so on can be downloaded here (http://nlp.stanford.edu/software/lex-parser.shtml#Download).

The FAQ is at http://nlp.stanford.edu/software/parser-faq.shtml; reading through it clears up most of the basics. Of course, you do need to read English, right? Haha.

2. How do you use the Stanford parser (for Chinese)?
This section only covers calling the parser from a Java project. Download the archive from the address above, unpack it, and add the following two jars to your Java build path:

stanford-parser.jar
stanford-parser-xxx-models.jar
Inside stanford-parser.jar, the package edu.stanford.nlp.parser.lexparser.demo contains two minimal examples that can be run directly.

Those examples show English usage, though, and handling Chinese is somewhat different. The FAQ page mentioned in section 1 does give a brief recipe for Chinese (question 24), but it invokes the parser directly with a java command.

The command given in the FAQ is:

$ java -server -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -encoding utf-8 /u/nlp/data/lexparser/chineseFactored.ser.gz chinese-onesent-utf8.txt
Clearly, everything starts from the class edu.stanford.nlp.parser.lexparser.LexicalizedParser. Once we understand this class's main function, we will roughly know how to handle Chinese as well.
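The same thing can be done through the Java API instead of the shell. Below is a minimal sketch, assuming the two jars above are on the classpath; the model resource path and the example sentence are my own illustrations (you can verify the actual path inside the models jar with jar -tf):

```java
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

import java.util.List;

public class ChineseParseDemo {
    public static void main(String[] args) {
        // Load the factored Chinese grammar from inside the models jar.
        LexicalizedParser lp = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz");
        // chineseFactored expects input that is already word-segmented,
        // so we hand it a list of words rather than a raw string.
        List<HasWord> sentence =
                Sentence.toWordList("他", "喜欢", "自然", "语言", "处理", "。");
        Tree tree = lp.parse(sentence);
        tree.pennPrint();  // print the phrase-structure tree
    }
}
```

loadModel also accepts a plain file path, so a grammar unpacked next to the jar (as in the FAQ command above) works the same way.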

Before that, let's look at the pre-trained Chinese grammars:

The parser is supplied with 5 Chinese grammars (and, with access to suitable training data, you could train other versions). You can find them inside the supplied stanford-parser-YYYY-MM-DD-models.jar file (in the GUI, select this file and then navigate inside it; at the command line, use jar -tf to see its contents). All of these grammars are trained on data from the Penn Chinese Treebank, and you should consult their site for details of the syntactic representation of Chinese which they use. They are:

[Table of the five supplied Chinese grammars: chinesePCFG.ser.gz, chineseFactored.ser.gz, xinhuaPCFG.ser.gz, xinhuaFactored.ser.gz, and xinhuaFactoredSegmenting.ser.gz]

The PCFG parsers are smaller and faster. But the Factored parser is significantly better for Chinese, and we would generally recommend its use. The xinhua grammars are trained solely on Xinhua newspaper text from mainland China. We would recommend their use for parsing material from mainland China. The chinese grammars also include some training material from Hong Kong SAR and Taiwan. We’d recommend their use if parsing material from these areas or a mixture of text types. Note, though, that all the training material uses simplified characters; traditional characters were converted to simplified characters (usually correctly). Four of the parsers assume input that has already been word segmented, while the fifth does word segmentation internal to the parser. This is discussed further below. The parser also comes with 3 Chinese example sentences, in files whose names all begin with chinese.
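Since the fifth grammar segments internally, it can be fed raw, unsegmented text. A sketch under the same assumptions as before (the resource path is mine to verify with jar -tf, and I use the parse(String) convenience method found in newer releases):

```java
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class SegmentingParseDemo {
    public static void main(String[] args) {
        // This grammar runs word segmentation inside the parser itself,
        // so no external segmenter is needed.
        LexicalizedParser lp = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz");
        // Raw, unsegmented Chinese input.
        Tree tree = lp.parse("他喜欢自然语言处理。");
        tree.pennPrint();
    }
}
```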

3. Analysis of the LexicalizedParser main function
First, the javadoc gives the basic usage: the main function supports several options. It can build and serialize a parser from treebank data, and it can parse sentences from files or from the contents of a URL.

In short, there are two main capabilities: training a parser, and using a parser to parse sentences. The following is taken from the main function's javadoc:

Sample Usages:

Train a parser (saved to serializedGrammarFilename) from a directory of trees (trainFilesPath, with an optional fileRange, e.g., 0-1000):
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] -saveToSerializedFile serializedGrammarFilename

Train a parser (not saved) from a directory of trees, and test it (reporting scores) on a directory of trees:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] -testTreebank testFilePath [fileRange]

Parse one or more files, given a serialized grammar and a list of files:
java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] serializedGrammarPath filename [filename] ...

Test and report scores for a serialized grammar on trees in an output directory:
java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -loadFromSerializedFile serializedGrammarPath -testTreebank testFilePath [fileRange]
If serializedGrammarPath ends in .gz, the grammar is read and written in gzip format.

If serializedGrammarPath is a URL starting with http://, the parser is read from that URL.

The fileRange argument specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000, or just as 1 (if all your trees are in a single file, just give a dummy argument such as 0 or 1).

The parser can write the grammar as a serialized Java object file, or to a text file, or both at once, using the following command:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] [-saveToSerializedFile grammarPath] [-saveToTextFile grammarPath]
If no files to parse are supplied, a default sentence is parsed.
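The training commands above can also be launched programmatically by calling the same main entry point in-process. A sketch, where the treebank path, file range, and output name are placeholders; -tLPP is included because we are training on Chinese (the option is described below):

```java
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

public class TrainChineseGrammar {
    public static void main(String[] args) {
        // Same arguments as the documented command line, passed in-process.
        LexicalizedParser.main(new String[] {
                "-tLPP", "edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams",
                "-encoding", "utf-8",
                "-train", "/path/to/ctb/trees", "1-1000",       // placeholder treebank path
                "-saveToSerializedFile", "myChineseGrammar.ser.gz"
        });
    }
}
```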

Parameters: In the same position as -v, many other options can appear; the most commonly used are listed below:

-tLPP class   When using a language other than English, or a treebank other than the English Penn Treebank, you must specify a TreebankLangParserParams class. This option must precede any other language-specific options. (It is also advisable to supply it when loading a serialized grammar; it is necessary if the language pack specifies a needed character encoding or you wish to specify language-specific options on the command line.)

-encoding charset   Specifies the character encoding of the input and output files. When this option appears after the -tLPP option, it overrides the value set in the TreebankLangParserParams.

-tokenized   Indicates that the input has already been tokenized (words separated by whitespace). When this option is present, tokenization is skipped and the input is split on whitespace only. Unless you specify a custom escaper with -escaper, you must make sure the tokens follow the conventions of the treebank in use. (For example, with the Penn English Treebank, "(" must become "-LRB-", "3/4" must become "3\/4", and so on.)

-escaper class   Specifies a Function

About

A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out our parser online.

This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, k-best parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Anna Rafferty, Spence Green, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel.

The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Or the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either of these yields a good performance statistical parsing system. A GUI is provided for viewing the phrase structure tree output of the parser.

As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser based on the Chinese Treebank, a German parser based on the Negra corpus, and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian, Bulgarian, and Portuguese. The parser provides Stanford Dependencies output as well as phrase structure trees. Typed dependencies are otherwise known as grammatical relations. This style of output is available only for English and Chinese. For more details, please refer to the Stanford Dependencies webpage.

The current version of the parser requires Java 6 (JDK 1.6) or later. (You can also download an old version of the parser, version 1.4, which runs under JDK 1.4, or version 2.0, which runs under JDK 1.5, but those distributions are no longer supported.) The parser also requires a reasonable amount of memory (at least 100MB to run as a PCFG parser on sentences up to 40 words in length; typically around 500MB of memory to be able to parse similarly long typical-of-newswire sentences using the factored model).

The parser is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation, a Java parsing GUI, and a Java API. The parser code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing with a ready-to-sign agreement is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding.

The download is a 54 MB zipped file (mainly consisting of included grammar data files). If you unpack the zip file, you should have everything needed. Simple scripts are included to invoke the parser on a Unix or Windows system. For another system, you merely need to similarly configure the classpath.
