A First Look at Stanford-segmenter

I have recently started studying the principles of Chinese word segmentation, the goal being to begin my study of natural language processing from the most fundamental level. After more than a decade of research, it is now hard to break new ground in Chinese word segmentation, but I personally think it is the best place to build NLP fundamentals. From HMMs and MaxEnt models to CRFs, the research history of Chinese word segmentation condenses the development of natural language processing itself.

The idea behind CRF-based segmentation is not hard to grasp: treat segmentation as another form of named entity recognition, build a probabilistic graphical model over the features, and then use the Viterbi algorithm to find the optimal path. The principle sounds simple, but the implementation details have always been fuzzy to me. Once I learned that the Stanford segmenter is based on CRFs, I downloaded Stanford-Segmenter to study it.
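To make the decoding step concrete, here is a minimal Viterbi sketch for character-level tagging, assuming the common BMES tag set (B/M/E/S = begin/middle/end/single-character word). The emit and trans score matrices are hypothetical stand-ins for the potentials a trained CRF would supply; only the dynamic-programming decode is illustrated, not Stanford's actual implementation.

    // Minimal Viterbi decoder for BMES character tagging.
    // emit[t][j]: score of tag j at character t (would come from CRF features);
    // trans[i][j]: score of moving from tag i to tag j. Both are stand-ins.
    public class ViterbiSketch {
        static int[] decode(double[][] emit, double[][] trans) {
            int n = emit.length, k = emit[0].length;
            double[][] score = new double[n][k];
            int[][] back = new int[n][k];
            System.arraycopy(emit[0], 0, score[0], 0, k);   // initial scores
            for (int t = 1; t < n; t++) {
                for (int j = 0; j < k; j++) {
                    double best = Double.NEGATIVE_INFINITY;
                    int arg = 0;
                    for (int i = 0; i < k; i++) {           // best predecessor tag
                        double s = score[t - 1][i] + trans[i][j];
                        if (s > best) { best = s; arg = i; }
                    }
                    score[t][j] = best + emit[t][j];
                    back[t][j] = arg;
                }
            }
            int[] path = new int[n];                        // trace back best path
            for (int j = 1; j < k; j++)
                if (score[n - 1][j] > score[n - 1][path[n - 1]]) path[n - 1] = j;
            for (int t = n - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
            return path;
        }
    }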

Using it is quite simple:

Step 1: Create a project; the name is up to you (I named mine StanfordSegmenter). Unzip the downloaded stanford-segmenter-2013-06-20 package, copy the arabic and data folders plus seg.jar and test.simp.utf8 into the project root, and copy SegDemo.java into the src source folder, as sketched below.
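Based purely on the files listed above, the project should end up looking roughly like this (the original post showed a screenshot here):

    StanfordSegmenter/
    ├── arabic/
    ├── data/
    ├── seg.jar
    ├── test.simp.utf8
    └── src/
        └── SegDemo.java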

Step 2: Run SegDemo via Run As -> Run Configurations. The run needs a program argument: test.simp.utf8.

(Screenshot: the Run Configurations dialog with the program argument set.)

Because Stanford-Segmenter uses quite a lot of memory, you also need to set the VM arguments, or the run will exceed the heap limit.
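For reference, my Run Configuration was roughly the following; the Stanford documentation recommends at least 1 GB of heap, and the exact -Xmx value below is my own choice:

    Program arguments: test.simp.utf8
    VM arguments:      -Xmx1g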

All right, now comes the moment of truth:

testFile=test.simp.utf8
serDictionary=data/dict-chris6.ser.gz
sighanCorporaDict=data
inputEncoding=UTF-8
sighanPostProcessing=true
Loading classifier from D:\workspace_vancl\StanfordSegmenter\data\ctb.gz ... Loading Chinese dictionaries from 1 files:
  data/dict-chris6.ser.gz
loading dictionaries from data/dict-chris6.ser.gz...Done. Unique words in ChineseDictionary is: 423200
done [26.8 sec].
INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
INFO: TagAffixDetector: building TagAffixDetector from data/dict/character_list and data/dict/in.ctb
Loading character dictionary file from data/dict/character_list
Loading affix dictionary from data/dict/in.ctb
面对 新 世纪 , 世界 各 国 人民 的 共同 愿望 是 : 继续 发展 人类 以往 创造 的 一切 文明 成果 , 克服 20 世纪 困扰 着 人类 的 战争 和 贫困 问题 , 推进 和平 与 发展 的 崇高 事业 , 创造 一 个 美好 的 世界 。
CRFClassifier tagged 80 words in 1 documents at 134.45 words per second.

Seeing this output, it is easy to guess what happened: the source text to be segmented is the file passed in as the argument, test.simp.utf8.

With a working result in hand, we can follow it back into the source code and examine the details of how the segmentation model is built. It's like riding a bicycle: ride it first to get an intuitive feel; once you're interested, everything that follows becomes much easier!
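As a starting point for reading the code, here is a sketch of roughly what SegDemo.java does, reconstructed from the properties shown in the run log above. The class and method names (CRFClassifier, loadClassifierNoExceptions, classifyAndWriteAnswers) are from the Stanford NLP API as I understand it; the bundled SegDemo.java is the authoritative version.

    import java.util.Properties;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class SegDemoSketch {
        public static void main(String[] args) throws Exception {
            // Properties taken from the run log above.
            Properties props = new Properties();
            props.setProperty("sighanCorporaDict", "data");
            props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
            props.setProperty("inputEncoding", "UTF-8");
            props.setProperty("sighanPostProcessing", "true");

            // Load the CTB (Chinese Treebank) model and segment the input file
            // passed as the first program argument (test.simp.utf8 here).
            CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
            segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
            segmenter.classifyAndWriteAnswers(args[0]);
        }
    }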

Incidentally, in The Beauty of Mathematics (《数学之美》), what CRFs are used for is syntactic parsing, which is another foundation of NLP; yet the famous Stanford-Parser is not based on CRFs but on probabilistic context-free grammars (PCFG).

