Using the Stanford Word Segmenter

1. Download the Stanford Word Segmenter package:

Download Stanford Word Segmenter version 2014-06-16

2. Create a project named StanfordSegmenter in Eclipse. Unpack the Stanford Word Segmenter package and copy its data and arabic folders and the test.simp.utf8 file into the project.

3. Add the required jars: seg.jar, stanford-segmenter-3.4-javadoc.jar, and stanford-segmenter-3.4-sources.jar.

  Steps: Project -> Properties -> Java Build Path -> Libraries -> Add External JARs

 

4. In the project, create a package com.Seg and a class SegDemo.java inside it, then copy in the contents of the SegDemo shipped in the unpacked package.
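The core of that SegDemo looks roughly like the following (sketched from the 3.4 distribution; the property names, model path, and sample sentence here should be double-checked against your copy, since they can differ between versions):

```java
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class SegDemo {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", "data");
    // Optional lexicon: improves consistency of the segmentation
    props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
    // Load the Chinese Penn Treebank model shipped in data/
    segmenter.loadClassifierNoExceptions("data/ctb.gz", props);

    // args[0] is the input file passed as a program argument
    // (e.g. test.simp.utf8); here we just segment one sample string.
    String sample = "面对新世纪,世界各国人民的共同愿望是继续发展。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }
}
```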

5. Set up the run configuration.

To run SegDemo, open Run As -> Run Configurations. The program needs one argument: test.simp.utf8.

Because the Stanford Segmenter uses a lot of memory, you must also set VM arguments, or the run will fail with an out-of-memory error.

On a 64-bit machine you can set -mx2g. The segment.sh script in the unpacked package shows the same setting in its JAVACMD line.
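For reference, the Eclipse run configuration corresponds roughly to this command line (the classpath entries and the bin output folder are assumptions about your project layout; on Windows, use ; instead of : as the classpath separator):

```shell
# -mx2g caps the Java heap at 2 GB (64-bit JVM assumed);
# test.simp.utf8 is the program argument SegDemo expects.
java -mx2g -cp "seg.jar:bin" com.Seg.SegDemo test.simp.utf8
```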

6. Run it; the output shows the segmentation results.

7. Attach the source code to dig into the details of the segmentation model, and single-step through it to observe what each function does.

  7.1 Ctrl+click loadClassifierNoExceptions (any other library function works too) to jump to its definition. Eclipse reports "Source not found".

  7.2 Attach the source: click Attach Source -> External File and add the stanford-segmenter-3.4-sources.jar from the package you unpacked at the start.

  7.3 Ctrl+click again and the source is now visible.

  

8. If you are running the Chinese-language build of Eclipse, switch it to English; the Chinese UI does not show the Attach Source prompt. To switch:

  8.1 Find eclipse.ini in the Eclipse installation directory, open it in a text editor, and append -Duser.language=en to it; Eclipse will then start in English.
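After the edit, the end of eclipse.ini might look like this (the surrounding lines vary by Eclipse version; only the last line is the addition):

```
-vmargs
-Dosgi.requiredJavaVersion=1.6
-Xms40m
-Xmx512m
-Duser.language=en
```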

      

  

9. Stanford NLP website:

http://nlp.stanford.edu/

Reposted from: https://www.cnblogs.com/qianwen/p/3854809.html

Tokenization of raw text is a standard pre-processing step for many NLP tasks. For English, tokenization usually involves punctuation splitting and separation of some affixes like possessives. Other languages require more extensive token pre-processing, which is usually called segmentation.

The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications. The system requires Java 1.6+ to be installed. We recommend at least 1G of memory for documents that contain long sentences. For files with shorter sentences (e.g., 20 tokens), you can decrease the memory requirement by changing the option java -mx1g in the run scripts.

Arabic

Arabic is a root-and-template language with abundant bound morphemes. These morphemes include possessives, pronouns, and discourse connectives. Segmenting bound morphemes reduces lexical sparsity and simplifies syntactic analysis. The Arabic segmenter model processes raw text according to the Penn Arabic Treebank 3 (ATB) standard. It is a stand-alone implementation of the segmenter described in:

Spence Green and John DeNero. 2012. A Class-Based Agreement Model for Generating Accurately Inflected Translations. In ACL.

Chinese

Chinese is standardly written without spaces between words (as are some other languages). This software will split Chinese text into a sequence of words, defined according to some word segmentation standard. It is a Java implementation of the CRF-based Chinese Word Segmenter described in:

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky and Christopher Manning. 2005. A Conditional Random Field Word Segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing.

Two models with two different segmentation standards are included: Chinese Penn Treebank standard and Peking University standard. On May 21, 2008, we released a version that makes use of lexicon features. With external lexicon features, the segmenter segments more consistently and also achieves higher F measure when we train and test on the bakeoff data.
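To make the idea of "splitting Chinese text into a sequence of words" concrete, here is a toy segmenter using greedy forward maximum matching against a hand-made dictionary. This is deliberately not the CRF approach the Stanford segmenter uses; the dictionary and sentence are invented for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxMatchDemo {
    // Greedy forward maximum matching: at each position, take the longest
    // dictionary word that matches; fall back to a single character.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String word = text.substring(i, i + 1); // fallback: one char
            for (int j = end; j > i + 1; j--) {
                String cand = text.substring(i, j);
                if (dict.contains(cand)) { word = cand; break; }
            }
            out.add(word);
            i += word.length();
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("面对", "新世纪", "世纪"));
        System.out.println(segment("面对新世纪", dict, 3)); // prints [面对, 新世纪]
    }
}
```

Real segmenters replace the dictionary lookup with a learned model precisely because greedy matching fails on ambiguous strings; the CRF model scores whole labelings of the sentence instead.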
