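Stanford Segmenter is an open-source word segmentation tool from Stanford University. It currently supports Chinese and Arabic. It uses a lot of memory, but it seems to be faster than the Chinese Academy of Sciences segmenter (I have not benchmarked this).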

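    Stanford Segmenter is based on CRF (Conditional Random Field), a machine learning model. The idea is that characters make up words, so segmentation can be treated as classifying each character by its position within a word. I have not dug into the details and have not had time to study them. First, here is the small demo that ships with Stanford Segmenter, so you can see the segmentation results. (The tool recognizes entity names such as person and place names fairly accurately, so my company wants to try it out, but it is very slow to start, and I don't know whether there is a way to improve that. I have only just started with word segmentation, so experts, please go easy on me.)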

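import java.io.PrintStream;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class SegDemo {

  // System.getProperty(key, def) returns the system property named by key,
  // or def when that property is not set. Here the data directory defaults to "data".
  private static final String basedir = System.getProperty("SegDemo", "data");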

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    if (args.length > 0) {
      props.setProperty("testFile", args[0]);
    }
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
    for (String filename : args) {
      segmenter.classifyAndWriteAnswers(filename);
    }

    String sample = "我叫李涂,你叫李涂胡说吗。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }

}

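To run this demo, copy the data folder from the Stanford Segmenter distribution into your project. The data folder holds several files, including the segmentation dictionary and the segmentation standard (or so it seems). These files are loaded at run time, and loading them takes a while.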
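
Since loading is the slow part, one common way to avoid paying for it more than once is to load the model a single time and reuse it for every call. The sketch below shows only that pattern: `LoadOnceSketch` and `memoize` are hypothetical names, and the loader is a stand-in string rather than a real CRFClassifier. It does not make the first load any faster.

```java
import java.util.function.Supplier;

final class LoadOnceSketch {

  /** Wraps an expensive loader so that it runs at most once, even with several threads. */
  static <T> Supplier<T> memoize(Supplier<T> loader) {
    return new Supplier<T>() {
      private T value;
      private boolean loaded;

      @Override
      public synchronized T get() {
        if (!loaded) {
          value = loader.get();
          loaded = true;
        }
        return value;
      }
    };
  }

  public static void main(String[] args) {
    // Stand-in for the slow step. In the demo this would be constructing the
    // CRFClassifier and calling loadClassifierNoExceptions (an assumption about
    // how you would wire it up, not something shown here).
    Supplier<String> model = memoize(() -> {
      System.out.println("loading model (slow, happens once)");
      return "model";
    });
    model.get();
    model.get(); // reuses the value loaded by the first call
  }
}
```

In a long-running service this means the start-up cost is paid once per process instead of once per request.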

    The demo is simple and easy to get running, but what exactly are the files it loads? For example

dict-chris6.ser.gz,ctb.gz

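and so on. One small demo raises a whole series of questions, and the way to settle them is to go looking for answers. The process is tedious, though: I found nothing of value in Chinese, so I had to read the original documentation on the official site. Below is my understanding from what I read, recorded here: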

    The official website has this sentence:

Two models with two different segmentation standards are included:
    Chinese Penn Treebank standard and Peking University standard.

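The Chinese Penn Treebank standard refers to ctb, a Chinese treebank from the University of Pennsylvania; the Peking University standard is a segmentation standard from Peking University. Is this the essence of how Stanford Segmenter segments? (A small aside: Stanford Segmenter works by training a segmentation model on some data and then using the trained model to segment text.)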

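    I then came across the question "How can I retrain the Chinese Segmenter?", which is answered in detail, but I still don't fully understand it. The answer is reproduced below; readers with good English are welcome to help me analyze it: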

In general you need four things in order to retrain the Chinese Segmenter.  You will need a data set with segmented text, a dictionary with words that the segmenter should know about, and various small data files for other feature generators.

The most important thing you need is a data file with text segmented according to the standard you want to use.  For example, for the CTB model we distribute, which follows the Penn Chinese Treebank segmentation standard, we use the Chinese Treebank 7.0 data set.

You will need to convert your data set to text in the following format:

中国 进出口 银行 与 中国 银行 加强 合作
新华社 北京 十二月 二十六日 电 ( 记者 周根良 )
...

Each individual sentence is on its own line, and spaces are used to denote word breaks.
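
(An illustration of my own, not part of the quoted answer.) This format ties back to the idea of classifying each character by its position in a word: every segmented line implies one label per character. The sketch below derives such labels with the common B/M/E/S scheme (begin, middle, end, single). That scheme is an assumption for illustration and is not necessarily the label set Stanford Segmenter uses internally; `CharTagSketch` and `tagLine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class CharTagSketch {

  /**
   * Turns one line of space-segmented text into "character/label" pairs.
   * B = first character of a multi-character word, M = middle, E = last,
   * S = single-character word.
   */
  static List<String> tagLine(String segmentedLine) {
    List<String> labels = new ArrayList<>();
    for (String word : segmentedLine.trim().split("\\s+")) {
      if (word.isEmpty()) {
        continue; // blank input line
      }
      int length = word.codePointCount(0, word.length());
      int index = 0;
      for (int offset = 0; offset < word.length(); ) {
        int codePoint = word.codePointAt(offset);
        String tag;
        if (length == 1) {
          tag = "S";
        } else if (index == 0) {
          tag = "B";
        } else if (index == length - 1) {
          tag = "E";
        } else {
          tag = "M";
        }
        labels.add(new String(Character.toChars(codePoint)) + "/" + tag);
        offset += Character.charCount(codePoint);
        index++;
      }
    }
    return labels;
  }

  public static void main(String[] args) {
    // First training line from the example above.
    System.out.println(tagLine("中国 进出口 银行 与 中国 银行 加强 合作"));
  }
}
```

Seen this way, the segmented data set is the labeled training data for the classifier, with the spaces supplying the labels.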

Some data sets will come in the format of Penn trees.  There are various ways to convert this to segmented text; one way which uses our tool suite is to use the Treebanks tool:

java edu.stanford.nlp.trees.Treebanks -words ctb7.mrg

The Treebanks tool is not included in the segmenter download, but it is available in the corenlp download.
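
(An illustration of my own, not part of the quoted answer.) To make the conversion concrete, the sketch below pulls the words out of a bracketed Penn-style tree with a regular expression. It is not how the Treebanks tool is implemented: it assumes every word sits in a `(TAG word)` pair, it skips empty elements tagged `-NONE-`, and the sample tree is made up for the example. `TreeLeavesSketch` and `wordsOf` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TreeLeavesSketch {

  // Matches a preterminal such as "(NR 中国)": a tag, whitespace, then a word
  // that contains no whitespace or parentheses.
  private static final Pattern PRETERMINAL =
      Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

  /** Returns the words of one bracketed tree, joined by single spaces. */
  static String wordsOf(String bracketedTree) {
    List<String> words = new ArrayList<>();
    Matcher matcher = PRETERMINAL.matcher(bracketedTree);
    while (matcher.find()) {
      // Empty elements (traces) carry the tag -NONE- and are not real words.
      if (!"-NONE-".equals(matcher.group(1))) {
        words.add(matcher.group(2));
      }
    }
    return String.join(" ", words);
  }

  public static void main(String[] args) {
    // Made-up tree for illustration only.
    String tree = "(IP (NP (NR 中国) (NN 银行)) (VP (VV 加强) (NP (NN 合作))))";
    System.out.println(wordsOf(tree));
  }
}
```

The result is one space-separated line per tree, which is the same shape as the training format described earlier.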

Another useful tool is a dictionary of known words.  This should include named entities such as people, places, companies, etc. which the model might segment as a single word.  This is not actually required, but it will help identify named entities which the segmenter has not seen before.  For example, our file of named entities includes names such as

吳毅成
吳浩康
吳淑珍
...

To build a dictionary usable by our model, you want to collect lists of words and then use the ChineseDictionary tool to combine them into one serialized dictionary.

java edu.stanford.nlp.wordseg.ChineseDictionary 
                        -inputDicts ,... -output dict.ser.gz

If you want to use our existing dictionary as a starting point, you can include it as one of the filenames.  Words have a maximum lexicon length (probably 6, see the ChineseDictionary source) and words longer than that will be truncated.  There is also handling of words with a "mid dot" character; this occasionally shows up in machine translation, and if a word with a mid dot shows up in the dictionary, we accept the word either with or without the dot.
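
(An illustration of my own, not part of the quoted answer.) The mid-dot rule can be pictured as storing both spellings of a word when the dictionary is built. The sketch below does only that. It assumes the mid dot is U+00B7 (MIDDLE DOT), it does not model the maximum-length truncation mentioned above, and it is not the ChineseDictionary source. `MidDotDictionarySketch` and `build` are hypothetical names, and the second sample entry is invented for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MidDotDictionarySketch {

  // Assumption: the "mid dot" is U+00B7; real data may use other dot characters.
  private static final String MID_DOT = "\u00B7";

  /** Builds a word set in which a word containing a mid dot is also stored without it. */
  static Set<String> build(List<String> words) {
    Set<String> dictionary = new LinkedHashSet<>();
    for (String raw : words) {
      String word = raw.trim();
      if (word.isEmpty()) {
        continue; // ignore blank lines in a word list
      }
      dictionary.add(word);
      if (word.contains(MID_DOT)) {
        dictionary.add(word.replace(MID_DOT, ""));
      }
    }
    return dictionary;
  }

  public static void main(String[] args) {
    // The first name comes from the example list above; the second is a made-up entry.
    List<String> names = Arrays.asList("吳毅成", "乔治" + MID_DOT + "布什");
    System.out.println(build(names));
  }
}
```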

You will also need a properties file which tells the classifier which features to use.  An example properties file is included in the data directory of the segmenter download.

Finally, some of the features used by the existing models require additional data files, which are included in the data/dict directory of the segmenter download.  To figure out which files correspond to which features, please search in the source code for the appropriate filename.  You can probably just reuse the existing data files.

Once you have all of these components, you can then train a new model with a command line such as

  java -mx15g edu.stanford.nlp.ie.crf.CRFClassifier 
          -prop ctb.prop -serDictionary dict-chris6.ser.gz -sighanCorporaDict data 
          -trainFile train.txt -serializeTo newmodel.ser.gz > newmodel.log 2> newmodel.err

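    The first paragraph lists what you need to prepare to train a model: a data set, which must already be segmented; a dictionary; and some small data files. The fourth item, introduced later in the answer, is a properties file. The most important of these is the segmented data file, which can also be thought of as embodying the segmentation standard.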

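    The English is not hard to follow; here I will just lay out a few questions of my own:

1. The data set: what is the segmented data set used for? Is it for generating the dictionary? How should the data set be chosen?

2. What exactly is the dictionary mentioned above? Is it only a dictionary of entities such as names? How is it obtained? What kind of dictionary does the java edu.stanford.nlp.wordseg.ChineseDictionary command produce, and what format are its input files in: segmented sentences or plain sentences? Is words.txt the segmented data set produced in the first step?

3. In the final training command, what kind of text is train.txt? Plain text, or text that has already been segmented?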