Basic usage of the Stanford Segmenter for Chinese word segmentation

First go to http://nlp.stanford.edu/software/segmenter.shtml#Download and download stanford-segmenter-2016-10-31.zip. Unzip it, enter the directory, and read the README-Chinese.txt file.

Command-line usage

Change into the unzipped directory and run the following command:

./segment.sh pku test.simp.utf8 UTF-8 0

Here `pku` selects the Peking University segmentation model (use `ctb` for the Penn Chinese Treebank model), `test.simp.utf8` is the input file, `UTF-8` is the file encoding, and `0` is the n-best size (0 outputs only the best segmentation). You should see results like the following:
(Figure: Stanford segmenter output)

Using it in Eclipse

Create a new Java project. Copy the arabic and data folders and the file test.simp.utf8 from the unzipped package into the project, and add these three jars to the build path: stanford-segmenter-3.7.0.jar, stanford-segmenter-3.7.0-javadoc.jar, and stanford-segmenter-3.7.0-sources.jar. Note that Eclipse must use a Java version of 1.8 or later. Then copy SegDemo.java from the unzipped package into the project. When you are done, the project should look like the left side of the image below.

(Figure: running the Stanford segmenter in Eclipse)

To pass the argument "test.simp.utf8" to SegDemo.java, right-click the project, choose Run As -> Run Configurations, and enter "test.simp.utf8" under Program arguments.
The output is the same as in the first screenshot.
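For reference, the bundled SegDemo essentially fills a java.util.Properties object and hands it to edu.stanford.nlp.ie.crf.CRFClassifier. Below is a minimal sketch of that configuration, assuming the data folder was copied into the project root as described above; the CRFClassifier calls are shown as comments because they need stanford-segmenter-3.7.0.jar and the model files on the classpath.

```java
import java.util.Properties;

public class SegSketch {
    public static void main(String[] args) {
        // Configuration in the style of the bundled SegDemo
        // (paths assume "data" sits in the working directory).
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");

        // With the segmenter jar on the classpath:
        // CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
        // segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        // List<String> words = segmenter.segmentString("面对新世纪");

        System.out.println(props.getProperty("serDictionary"));
    }
}
```
Swapping "data/ctb.gz" for "data/pku.gz" switches between the two bundled segmentation standards.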

If the program needs more memory than the default JVM heap provides, do the following: in the same Run As -> Run Configurations dialog, enter "-Xms512m -Xmx1024m" under VM arguments. The Stanford segmenter models are fairly large, so without this setting you may run out of memory. [Reference page 1]
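You can check from inside the program whether the -Xmx setting took effect; a small stdlib-only snippet:

```java
public class HeapCheck {
    public static void main(String[] args) {
        // maxMemory() reflects (approximately) the -Xmx heap limit.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
        if (maxMb < 900) {
            System.out.println("Heap may be too small for the segmenter models.");
        }
    }
}
```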
If you hit an encoding error, set the project encoding to UTF-8. Better still, set Eclipse's default encoding to UTF-8.
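Encoding problems usually come from reading files with the platform default charset. When handling files like test.simp.utf8 in your own code, specify UTF-8 explicitly; a stdlib-only sketch (the file name here is illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class Utf8Read {
    public static void main(String[] args) throws IOException {
        // Write and read back a small Chinese sample with an explicit
        // charset, so the result does not depend on the JVM default.
        Path p = Paths.get("sample.utf8");
        Files.write(p, "面对新世纪".getBytes(StandardCharsets.UTF_8));
        List<String> lines = Files.readAllLines(p, StandardCharsets.UTF_8);
        System.out.println(lines.get(0));  // the original text, intact
        Files.deleteIfExists(p);
    }
}
```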

Notes

  1. If you use this tool in a paper, add the following citation: Pi-Chuan Chang, Michel Galley and Chris Manning. 2008. Optimizing Chinese Word Segmentation for Machine Translation Performance. In WMT.
  2. http://nlp.stanford.edu/pubs/sighan2005.pdf is the SIGHAN 2005 paper (Tseng et al., A Conditional Random Field Word Segmenter) that the Chinese segmenter implements.
  3. The tool also segments Arabic, and Stanford Word Segmenter packages are available for other languages such as Python and F#/C#/.NET; see http://nlp.stanford.edu/software/segmenter.shtml for documentation and downloads.

Reference pages
1. http://www.cfanz.cn/index.php?c=article&a=read&id=272910
2. http://blog.csdn.net/shijiebei2009/article/details/42525091
This page covers both NER and segmentation, and shows how to use them together.

Appendix: the official description from http://nlp.stanford.edu/software/segmenter.shtml

Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation. The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications. The system requires Java 1.6+ to be installed. We recommend at least 1G of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), decrease the memory requirement by changing the option java -mx1g in the run scripts.

Arabic

Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis. The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is a stand-alone implementation of the segmenter described in: Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL.

Chinese

Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in: Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing. Two models with two different segmentation standards are included: Chinese Penn Treebank standard and Peking University standard. On May 21, 2008, we released a version that makes use of lexicon features.