使用Stanford Parser的PDFG算法进行句法分析

本文介绍了Stanford Parser,一个由斯坦福大学开发的句法分析器,它支持多种语言,包括中文,并在nltk库中提供了Python接口。使用时需要JDK环境,并安装nltk库。通过PCFG的中文句法分析案例,以句子“我爱北京天安门”演示了如何进行分词和句法分析。
摘要由CSDN通过智能技术生成

1 Stanford Parser简介与安装

Stanford Parser顾名思义是由斯坦福大学自然语言小组开发的开源句法分析器,是基于概率统计句法分析的一个Java实现。该句法分析器目前提供了5个中文文法的实现。他的优点在于:

  • 既是一个高度优化的概率上下文无关文法和词汇化依存分析器,又是一个词汇化上下文无关文法分析器;

  • 以权威的并州树库作为分析器的训练数据,支持多语言。目前已经支持英文,中文,德文,意大利文,阿拉伯文等;

  • 提供了多样化的分析输出形式,出句法分析树外,还支持分词和词性标注、短语结构、依存关系等输出;

  • 内置了分词,词性标注,基于自定义树库的分析器训练等辅助工作。

  • 支持多平台,并封装了多种常用语言的接口,例如:java,python,php,ruby等。

本介绍是基于Stanford Parser的Python接口。由于该句法分析器底层是由java实现,因此使用时需要确保安装JDK。当前,最新的Stanford Parser版本为3.9.1,对JDK的要求是1.8以上。网上JDK的安装教程有很多,可以搜索查看,需要注意的是要配置JAVA_

About A natural language parser is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These statistical parsers still make some mistakes, but commonly work rather well. Their development was one of the biggest breakthroughs in natural language processing in the 1990s. You can try out our parser online. This package is a Java implementation of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. The original version of this parser was mainly written by Dan Klein, with support code and linguistic grammar development by Christopher Manning. Extensive additional work (internationalization and language-specific modeling, flexible input/output, grammar compaction, lattice parsing, k-best parsing, typed dependencies output, user support, etc.) has been done by Roger Levy, Christopher Manning, Teg Grenager, Galen Andrew, Marie-Catherine de Marneffe, Bill MacCartney, Anna Rafferty, Spence Green, Huihsin Tseng, Pi-Chuan Chang, Wolfgang Maier, and Jenny Finkel. The lexicalized probabilistic parser implements a factored product model, with separate PCFG phrase structure and lexical dependency experts, whose preferences are combined by efficient exact inference, using an A* algorithm. Or the software can be used simply as an accurate unlexicalized stochastic context-free grammar parser. Either of these yields a good performance statistical parsing system. A GUI is provided for viewing the phrase structure tree output of the parser. As well as providing an English parser, the parser can be and has been adapted to work with other languages. A Chinese parser based on the Chinese Treebank, a German parser based on the Negra corpus and Arabic parsers based on the Penn Arabic Treebank are also included. The parser has also been used for other languages, such as Italian, Bulgarian, and Portuguese. The parser provides Stanford Dependencies output as well as phrase structure trees. Typed dependencies are otherwise known grammatical relations. This style of output is available only for English and Chinese. For more details, please refer to the Stanford Dependencies webpage. The current version of the parser requires Java 6 (JDK1.6) or later. (You can also download an old version of the parser, version 1.4, which runs under JDK 1.4, or version 2.0 which runs under JDK 1.5, but those distributions are no longer supported.) The parser also requires a reasonable amount of memory (at least 100MB to run as a PCFG parser on sentences up to 40 words in length; typically around 500MB of memory to be able to parse similarly long typical-of-newswire sentences using the factored model). The parser is available for download, licensed under the GNU General Public License (v2 or later). Source is included. The package includes components for command-line invocation, a Java parsing GUI, and a Java API. The parser code is dual licensed (in a similar manner to MySQL, etc.). Open source licensing is under the full GPL, which allows many free uses. For distributors of proprietary software, commercial licensing with a ready-to-sign agreement is available. If you don't need a commercial license, but would like to support maintenance of these tools, we welcome gift funding. The download is a 54 MB zipped file (mainly consisting of included grammar data files). If you unpack the zip file, you should have everything needed. Simple scripts are included to invoke the parser on a Unix or Windows system. For another system, you merely need to similarly configure the classpath.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

智绘山河

你的鼓励可能解决你下一个问题

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值