Chinese NLP (11) -- Stanford NLP Generative Grammar: the PCFG Model

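Within phrase-structure grammar, that is, transformational-generative grammar, PCFG is currently the most mature and most accurate parsing algorithm.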

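The other family of parsing methods is based on dependency grammar; there, the most efficient approach is a deep learning algorithm (Transition-Based LSTM).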

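PCFG stands for Probabilistic CFG, i.e. probability-based phrase-structure parsing. It extends the grammar G = (X,V,S,R) with a component P that assigns a probability to every rule, giving G = (X,V,S,R,P), subject to the following constraint:

$\sum_{\alpha} P(A \to \alpha) = 1$ for every non-terminal A

In other words, the probabilities of all productions that rewrite a given non-terminal A sum to 1.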

For example, consider a grammar with the following rule probabilities:

S  -> NP VP, 1.00        NP -> astronomers, 0.10
NP -> NP PP, 0.40        NP -> saw, 0.04
VP -> VP PP, 0.30        V  -> saw, 1.00
PP -> P NP,  1.00        NP -> telescopes, 0.10
VP -> V NP,  0.70        P  -> with, 1.00
                         NP -> ears, 0.18
                         NP -> stars, 0.18
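
As a quick sanity check of the normalization constraint, the sketch below stores the rules listed above in a plain dictionary and verifies that the probabilities for each left-hand side sum to 1. The names `RULES`, `lhs_totals` and `check_normalized` are illustrative only and are not part of Stanford Parser.

```python
from collections import defaultdict

# Rules copied from the grammar above: (lhs, rhs) -> probability.
RULES = {
    ("S", ("NP", "VP")): 1.00,
    ("PP", ("P", "NP")): 1.00,
    ("VP", ("V", "NP")): 0.70,
    ("VP", ("VP", "PP")): 0.30,
    ("P", ("with",)): 1.00,
    ("V", ("saw",)): 1.00,
    ("NP", ("NP", "PP")): 0.40,
    ("NP", ("astronomers",)): 0.10,
    ("NP", ("ears",)): 0.18,
    ("NP", ("saw",)): 0.04,
    ("NP", ("stars",)): 0.18,
    ("NP", ("telescopes",)): 0.10,
}

def lhs_totals(rules):
    """Sum the rule probabilities for each left-hand-side symbol."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in rules.items():
        totals[lhs] += prob
    return dict(totals)

def check_normalized(rules, tol=1e-9):
    """Return True if every non-terminal's probabilities sum to 1."""
    return all(abs(total - 1.0) <= tol for total in lhs_totals(rules).values())

print(check_normalized(RULES))
```

A tolerance is used because the probabilities are floating-point values, so an exact comparison with 1.0 could fail on rounding alone.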

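Deriving the sentence with this CFG yields two parse trees, as shown in the figure:

The probabilities of the two trees are computed as follows: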

P(t1) = S × NP × VP × V × NP × NP × PP × P × NP = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072

P(t2) = S × NP × VP × VP × V × NP × PP × P × NP = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804

With these values the choice is easy: pick the tree with the higher probability, here t1.
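
The two products can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the figure is not reproduced here, so the trees are assumed to be the two readings of "astronomers saw stars with ears" that match the rule sequences in P(t1) and P(t2); the nested-tuple tree encoding and the names `RULE_PROB` and `tree_prob` are illustrative.

```python
# Only the rules used by the two trees are listed: (lhs, rhs) -> probability.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("PP", ("P", "NP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.4,
    ("NP", ("astronomers",)): 0.1,
    ("NP", ("stars",)): 0.18,
    ("NP", ("ears",)): 0.18,
    ("V", ("saw",)): 1.0,
    ("P", ("with",)): 1.0,
}

def tree_prob(tree):
    """Multiply the probabilities of every rule used in a tree.

    A tree is a tuple (label, child, ...); a leaf word is a plain string.
    """
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            prob *= tree_prob(child)
    return prob

np_subj = ("NP", "astronomers")
pp = ("PP", ("P", "with"), ("NP", "ears"))
# t1 (assumed): the PP attaches to the noun phrase "stars".
t1 = ("S", np_subj, ("VP", ("V", "saw"), ("NP", ("NP", "stars"), pp)))
# t2 (assumed): the PP attaches to the verb phrase "saw stars".
t2 = ("S", np_subj, ("VP", ("VP", ("V", "saw"), ("NP", "stars")), pp))

p1, p2 = tree_prob(t1), tree_prob(t2)
print("P(t1) = %.7f" % p1)
print("P(t2) = %.7f" % p2)
print("best: %s" % ("t1" if p1 > p2 else "t2"))
```

A real parser does not enumerate trees like this; it searches for the most probable tree with dynamic programming. The sketch only shows how a single tree is scored.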

Training Stanford's PCFG parser

Add the following methods to the StanfordParser class defined earlier:

class StanfordParser(StanfordCoreNLP):
    def __init__(self,jarpath,modelpath = "",opttype = "penn"):
        # ...
        # (code shown earlier is omitted)

    # Build the command line that trains a serialized model
    def __buildtrain(self,trainpath,parsemodel):
        self.trainline = 'java -mx2g -cp "' + self.jarpath + '" ' + self.classfier + \
           ' -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams -train "' + trainpath + \
                    '" -saveToSerializedFile "' + parsemodel + '"'

    # Build the command line that trains a serialized model and also saves a text model
    def __buildtraintxt(self,trainpath,parsemodel,txtmodel):
        self.trainline = 'java -mx2g -cp "' + self.jarpath + '" ' + self.classfier + \
                ' -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams -train "' + trainpath + \
                    '" -saveToSerializedFile "' + parsemodel + '" -saveToTextFile "' + txtmodel + '"'

    # Train the model by running the command line built above
    def trainmodel(self,trainpath,parsemodel,txtmodel = None):
        if txtmodel is None:
            self.__buildtrain(trainpath,parsemodel)
        else:
            self.__buildtraintxt(trainpath,parsemodel,txtmodel)

        os.system(self.trainline)
        print "save model to",parsemodel
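
The methods above build one shell string and run it with `os.system`, which relies on the manual quoting around each path. As an alternative sketch (not the author's code), the same command can be built as an argument list and run with `subprocess`, so paths containing spaces or quotes need no escaping. The function names `build_train_command` and `run_training`, the jar path, and the main class name passed in the example are assumptions; the main class stands in for whatever `self.classfier` holds in the omitted code.

```python
import subprocess

def build_train_command(jarpath, main_class, trainpath, parsemodel, txtmodel=None):
    """Build the training command as an argument list (no shell quoting needed)."""
    cmd = [
        "java", "-mx2g", "-cp", jarpath, main_class,
        "-tLPP", "edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams",
        "-train", trainpath,
        "-saveToSerializedFile", parsemodel,
    ]
    if txtmodel is not None:
        # Optionally also write the human-readable text model.
        cmd += ["-saveToTextFile", txtmodel]
    return cmd

def run_training(cmd):
    """Run the command and return the Java process's exit code."""
    return subprocess.call(cmd)

cmd = build_train_command(
    "stanford-parser.jar",  # hypothetical jar path
    "edu.stanford.nlp.parser.lexparser.LexicalizedParser",  # assumed value of self.classfier
    "trainfile/", "trainmodel.ser.gz", "trainmodel.ser")
print(" ".join(cmd))
```

Calling `run_training(cmd)` would start the actual training; checking its return value makes it possible to report "model saved" only when Java exits with code 0.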

Create the file trainfile/chtb_0001.mrg with the following content (for training, only the directory trainfile needs to be specified):

( (IP-HLN (NP-SBJ (NP-PN (NR 上海) 
			 (NR 浦东)) 
		  (NP (NN 开发) 
		      (CC 与) 
		      (NN 法制) 
		      (NN 建设))) 
	  (VP (VV 同步))) ) 

( (FRAG  (NN 新华社) 
	 (NR 上海) 
	 (NT 二月) 
	 (NT 十日) 
	 (NN 电) 
	 (PU （) 
	 (NN 记者) 
	 (NR 谢金虎) 
	 (PU 、) 
	 (NR 张持坚) 
	 (PU ）) ))
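
Before training, it can help to confirm that a bracketed tree is well formed and to see which words it covers. The following is a minimal sketch using only the standard library; the helper names `tagged_words` and `is_balanced` are illustrative, and the regular expression assumes every leaf has the simple form "(TAG word)" as in the sample above.

```python
# -*- coding: utf-8 -*-
import re

# A leaf in Penn Treebank bracket notation looks like "(TAG word)".
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def tagged_words(bracketed):
    """Return the (word, tag) pairs of a bracketed tree, left to right."""
    return [(word, tag) for tag, word in LEAF.findall(bracketed)]

def is_balanced(bracketed):
    """Check that ASCII parentheses open and close consistently."""
    depth = 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

sample = """( (IP-HLN (NP-SBJ (NP-PN (NR 上海)
                         (NR 浦东))
                  (NP (NN 开发)
                      (CC 与)
                      (NN 法制)
                      (NN 建设)))
          (VP (VV 同步))) )"""

print(is_balanced(sample))
print(" ".join(word for word, _tag in tagged_words(sample)))
```

The full-width parentheses （ and ） in the second tree are ordinary tokens tagged PU, so they do not affect the balance check, which counts only ASCII parentheses.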

The training code is as follows:

# -*- coding:utf-8 -*-
import sys,os
reload(sys)
sys.setdefaultencoding("utf-8")
curdir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(curdir)
from nltk import Tree

from stanford_parser.stanford import StanfordParser
root = os.path.join(curdir,"stanford_corenlp/")
st = StanfordParser(root)

trainpath = "trainfile/" # directory holding the Penn Treebank style sample file chtb_0001.mrg
modelpath = "trainmodel.ser.gz" # serialized model file
txtmodelpath = "trainmodel.ser" # text model file
st.trainmodel(trainpath,modelpath,txtmodelpath) # returns nothing; prints the model path when done

# Use the trained model
# st = StanfordParser(root,"trainmodel.ser.gz")
# result = st.parse("上海 浦东 开发 与 法制 建设 同步")
# print result

Part of the text model file produced by training is shown below: