Natural Language Processing: Statistical Models (Generative + Discriminative)

Article link: https://blog.csdn.net/seabiscuityj/article/details/107955778

Traditional statistical learning methods

Generative methods

Representative methods

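n-gram models / language models: bigram (2-gram) and trigram (3-gram) models. These require data smoothing to handle n-grams that never occur in the training data.
Hidden Markov models (HMM)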
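
To make the smoothing problem concrete, here is a minimal sketch of a bigram model with add-one (Laplace) smoothing. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names `train_bigram` and `bigram_prob` are assumptions made for illustration; add-one is only one of many smoothing schemes and is not necessarily what the toolkits listed later use by default.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        # Pad each sentence with boundary markers (an illustrative convention).
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus, invented for this example.
corpus = [["i", "like", "nlp"], ["i", "like", "cats"]]
unigrams, bigrams = train_bigram(corpus)

p_seen = bigram_prob("i", "like", unigrams, bigrams)    # bigram seen twice
p_unseen = bigram_prob("like", "i", unigrams, bigrams)  # bigram never seen
print(p_seen, p_unseen)
```

Without the `+ 1` in the numerator, the unseen bigram would get probability zero, and so would every sentence that contains it. Smoothing trades a little probability mass from seen events to avoid that.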

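Discriminative methods

Let o be the observation and q the model. Discriminative methods model p(q|o) directly. The basic idea is to build a discriminant function from a finite set of samples: instead of modeling how the samples were generated, these methods study the prediction model itself and search for the best decision boundary between classes. What they capture is the difference between the data of different classes.

Representative methods: the various classifier models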
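
As a rough illustration of "building a discriminant function without modeling how the data was generated", here is a minimal perceptron sketch. The data points, the labels in {+1, -1}, and the names `train_perceptron` and `predict` are made up for this example. Note that a perceptron only finds some separating hyperplane on linearly separable data; it does not guarantee the optimal (maximum-margin) one, which is what an SVM aims for.

```python
def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Learn weights w and bias b for the discriminant function w.x + b."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Update only on misclassified (or boundary) points.
            if y * score <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Toy, linearly separable data invented for this example.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]

w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
print(w, b, predictions)
```

Nothing in this code estimates how the points in each class are distributed; it only adjusts the boundary until the two classes fall on opposite sides.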

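Common statistical models and open-source tools

Statistical models (generative + discriminative)

Language model
Hidden Markov model (HMM)
k-nearest neighbors (k-NN): multi-class classification
Naive Bayes: multi-class classification
Decision tree: multi-class classification
Maximum entropy: multi-class classification
Perceptron: binary classification
Support vector machine (SVM): binary classification
Conditional random field (CRF): sequence labeling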
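
For the sequence models in this list, the typical task is decoding: given the observations, find the most likely hidden state sequence. Below is a minimal sketch of Viterbi decoding for an HMM. The two-state healthy/fever setup and all probabilities are toy numbers chosen for illustration, and the function name `viterbi` and its argument layout are assumptions, not the interface of any toolkit mentioned in this post.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state path and its joint probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Toy HMM; every number here is invented for the example.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
emit_p = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

For long sequences a real implementation would work with log-probabilities to avoid numerical underflow; plain products are kept here only for readability.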

Open-source tools

Language models:
- SRI language modeling toolkit (SRILM): http://www.speech.sri.com/projects/srilm/
- CMU-Cambridge language modeling toolkit: http://mi.eng.cam.ac.uk/~prc14/toolkit.html
Hidden Markov models (HTK): http://htk.eng.cam.ac.uk/
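Conditional random fields:
- CRF++ (C++): http://crfpp.googlecode.com/svn/trunk/doc/index.html
- CRFSuite (C): http://www.chokkan.org/software/crfsuite/
- MALLET (Java; a general-purpose NLP toolkit that includes machine learning algorithms for classification, sequence labeling, and more): http://mallet.cs.umass.edu/
- NLTK (Python; a general-purpose NLP toolkit, many of whose tools are Python interfaces wrapped around MALLET): http://nltk.org/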
Maximum entropy:
- OpenNLP: http://incubator.apache.org/opennlp/
- Malouf: http://tadm.sourceforge.net/
- Tsujii: http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/maxent/
- Zhang Le: http://homepages.inf.ed.ac.uk/lzhang10/maxent.html
- Dekang Lin: http://webdocs.cs.ualberta.ca/~lindek/downloads.htm
Bayesian classifiers: http://www.openpr.org.cn
Support vector machines (LibSVM): http://www.csie.ntu.edu.tw/~cjlin/libsvm
NNlm: feed-forward n-gram neural language model: http://nlg.isi.edu/software/nplm/
RNNlm: recurrent neural language model: http://rnnlm.org/
LSTMlm: recurrent neural language model with LSTM units: https://www-i6.informatik.rwth-aachen.de/web/Software/rwthlm.php
LSTM backpropagation algorithm: http://arunmallya.github.io/writeups/nn/lstm/index.html#/
Google Word2Vec: http://code.google.com/p/word2vec/
ELMo, a pretrained model based on recurrent neural networks: https://github.com/allenai/bilm-tf
BERT, a language model based on bidirectional self-attention (Bidirectional Encoder Representations from Transformers): https://github.com/google-research/bert
GPT, a pretrained language model based on unidirectional self-attention (language model with generative pre-training): https://github.com/openai/gpt-2
……