title: "NLP_1: Grammar Trees and the N-gram Model"
date: 2019-10-22 15:25:11
mathjax: true
categories:
- nlp-自然语言处理
tags:
- nlp-自然语言处理
Grammar Trees
Simple matching
```python
import random

def name():
    return random.choice('Jhon | Mike | 老梁'.split('|'))

def hello():
    return random.choice('你好 | 您来啦 | 快请进'.split('|'))

def say_hello():
    return name() + ' ' + hello()

say_hello()
# sample output: ' 老梁 你好 '
```
Anyone familiar with Python will know that `random.choice` picks an element at random.
Defining the grammar

```python
hello_rules = '''
say_hello = name hello tail
names = name names | name
name = Jhon | Mike | 老梁 | 老刘
hello = 你好 | 您来啦 | 快请进
tail = 呀 | !
'''

def get_generation_by_gram(grammar_str: str, target, stmt_split='=', or_split='|'):
    rules = dict()  # key is the statement, value is the list of alternatives
    for line in grammar_str.split('\n'):
        if not line:
            continue  # skip empty lines
        stmt, expr = line.split(stmt_split)
        rules[stmt.strip()] = expr.split(or_split)
    return generate(rules, target=target)

def generate(grammar_rule, target):
    if target in grammar_rule:
        candidates = grammar_rule[target]
        candidate = random.choice(candidates)
        return ''.join(generate(grammar_rule, target=c.strip()) for c in candidate.split())
    else:
        return target
```
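As a quick sanity check on the parsing step, here is a standalone sketch (with the rule string inlined) of the dict that the `=` / `|` splitting produces:

```python
hello_rules = '''
say_hello = name hello tail
names = name names | name
name = Jhon | Mike | 老梁 | 老刘
hello = 你好 | 您来啦 | 快请进
tail = 呀 | !
'''

# replicate the splitting done inside get_generation_by_gram
rules = {}
for line in hello_rules.split('\n'):
    if not line:
        continue
    stmt, expr = line.split('=')
    rules[stmt.strip()] = expr.split('|')

print(rules['say_hello'])                  # [' name hello tail']
print([c.strip() for c in rules['name']])  # ['Jhon', 'Mike', '老梁', '老刘']
```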
Example 1
The grammar

```python
simple_grammar = """
sentence => noun_phrase verb_phrase
noun_phrase => Article Adj* noun
Adj* => Adj | Adj Adj*
verb_phrase => verb noun_phrase
Article => 一个 | 这个
noun => 女人 | 篮球 | 桌子 | 小猫
verb => 看着 | 坐在 | 听着 | 看见
Adj => 蓝色的 | 好看的 | 小小的
"""
```
Building the rules

```python
def get_generation_by_gram(grammar_str: str, target, stmt_split='=', or_split='|'):
    rules = dict()  # key is the @statement, value is the @expression
    for line in grammar_str.split('\n'):
        if not line:
            continue  # skip empty lines
        stmt, expr = line.split(stmt_split)
        rules[stmt.strip()] = expr.split(or_split)
    return generate(rules, target=target)
```
Generating the sentence

```python
def generate(grammar_rule, target):
    if target in grammar_rule:                 # e.g. 'names'
        candidates = grammar_rule[target]      # ['name names', ' name']
        candidate = random.choice(candidates)  # 'name names' or 'name'
        return ''.join(generate(grammar_rule, target=c.strip()) for c in candidate.split())
    else:
        return target
```
Run it

```python
get_generation_by_gram(simple_grammar, target='sentence', stmt_split='=>')
# '一个好看的蓝色的小猫听着这个蓝色的蓝色的蓝色的小小的小猫'
```
Example 2
The grammar

```python
simpel_programming = '''
if_stmt => if ( cond ) { stmt }
cond => var op var
op => | == | < | >= | <=
stmt => assign | if_stmt
assign => var = var
var => char var | char
char => a | b | c | d | 0 | 1 | 2 | 3
'''
```

(Note the leading `|` in the `op` rule: it adds an empty alternative, which is why some of the generated conditions below contain no operator at all.)
Generating sentences

```python
for i in range(20):
    print(get_generation_by_gram(simpel_programming, target='if_stmt', stmt_split='=>'))
```
```
if(b2>=a){3a3c=2}
if(1<2d){if(3>=dd){bc=a}}
if(a<=d){0=03}
if(3012<aa){0=33}
if(c22){d=2}
if(a1d3){3=3}
if(3>=b){0dc=2c}
if(1ab<3a33){if(d001<d){2b=0a}}
if(a10c){1300c=2}
if(b00cb){if(c<add33021){if(a<adc){3bd=a}}}
if(a<=0){b=c}
if(33dc<=d10b3){2=a}
if(1b==2){if(3c<=0){if(2c==a){a1a0=d}}}
if(0b12<=1){if(03c0){0=1a2}}
if(1==b1){if(a>=12){if(02==dcd){if(b<22){3=a}}}}
if(a==d){a3=1d}
if(bcaa1b){c=b}
if(dd>=a){a=1}
if(3==bb){322=d}
if(10>=0){if(bab){c=13b0}}
```
The N-gram Model
Suppose we have a sentence of $n$ words, $S = (w_1, w_2, \ldots, w_n)$. How do we measure its probability? Let's first assume that every word $w_i$ depends on all the words before it, from the first word $w_1$ down to the immediately preceding word $w_{i-1}$:
$$p(S) = p(w_1 w_2 \ldots w_n) = p(w_1)\, p(w_2|w_1) \cdots p(w_n|w_{n-1} \ldots w_2 w_1)$$

Simple, isn't it? Unfortunately this measure has two drawbacks:
- The parameter space is far too large: the number of parameters needed for $p(w_1)\,p(w_2|w_1)\cdots p(w_n|w_{n-1}\ldots w_2 w_1)$ grows exponentially with the sentence length.
- Data sparsity is severe: most word combinations never occur together in the corpus, and this gets worse as the order of the combination increases.
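To make the first drawback concrete: with a vocabulary of $V$ distinct words, the table for $p(w_n|w_{n-1}\ldots w_1)$ needs one entry per possible length-$n$ word sequence, i.e. on the order of $V^n$ parameters (the numbers below are purely illustrative):

```python
# purely illustrative: parameter count of the full chain-rule model
V = 1000  # hypothetical vocabulary size
for n in (1, 2, 3, 5):
    print(n, V ** n)  # one parameter per length-n word sequence
```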
To solve the first problem, we introduce the Markov Assumption: a word's occurrence depends only on a limited number of preceding words.

$$p(w_1 \ldots w_n) = \prod_i p(w_i|w_{i-1} \ldots w_1) \approx \prod_i p(w_i|w_{i-1} \ldots w_{i-N+1})$$
If a word's occurrence depends only on the one word before it, we call the model a Bi-gram:

$$p(S) = p(w_1 w_2 \ldots w_n) = p(w_1)\, p(w_2|w_1) \cdots p(w_n|w_{n-1})$$
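A tiny worked example of this factorization, with made-up probabilities just to show the bookkeeping:

```python
# made-up toy probabilities, only to illustrate the Bi-gram factorization
p_first = {'我': 0.5}                               # p(w_1)
p_next = {('我', '喜欢'): 0.4, ('喜欢', '猫'): 0.2}  # p(w_i | w_{i-1})

def bigram_prob(tokens):
    prob = p_first[tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p_next[(prev, cur)]
    return prob

print(bigram_prob(['我', '喜欢', '猫']))  # 0.5 * 0.4 * 0.2 ≈ 0.04
```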
If a word's occurrence depends only on the two words before it, we call the model a Tri-gram:

$$p(S) = p(w_1 w_2 \ldots w_n) = p(w_1)\, p(w_2|w_1) \cdots p(w_n|w_{n-1} w_{n-2})$$
How do we compute each of these conditional probabilities? The answer is maximum likelihood estimation, which reduces to counting occurrences:

$$p(w_n|w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

$$p(w_n|w_{n-1} w_{n-2}) = \frac{C(w_{n-2} w_{n-1} w_n)}{C(w_{n-2} w_{n-1})}$$

$$p(w_n|w_{n-1} \ldots w_2 w_1) = \frac{C(w_1 w_2 \ldots w_n)}{C(w_1 w_2 \ldots w_{n-1})}$$
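These counting estimates can be sketched with `collections.Counter` on a toy token list (the tokens are made up for illustration):

```python
from collections import Counter

tokens = ['我', '喜欢', '猫', '我', '喜欢', '狗', '我', '讨厌', '雨']
unigrams = Counter(tokens)
bigrams = Counter(a + b for a, b in zip(tokens, tokens[1:]))

# MLE: p(喜欢 | 我) = C(我喜欢) / C(我) = 2 / 3
p = bigrams['我喜欢'] / unigrams['我']
print(p)
```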
Loading the corpus file

```python
corpus = '/Users/admin/Desktop/2.txt'
FILE = open(corpus).read()
```
Sampling random characters for a quick look

```python
import random

def generate_by_pro(text_corpus, length=20):
    return ''.join(random.sample(text_corpus, length))

generate_by_pro(FILE)
```
Tokenizing with jieba

```python
import jieba

# sub_file is the cleaned corpus text (its preparation is not shown in this excerpt)
token = list(jieba.cut(sub_file))

# wrapped for reuse:
def cut(string):
    return list(jieba.cut(string))

TOKENS = cut(sub_file)
```
Building the 2-gram counts

```python
from collections import Counter

_2_gram_words = [TOKENS[i] + TOKENS[i + 1] for i in range(len(TOKENS) - 1)]

# the counters the lookups below rely on
words_count = Counter(TOKENS)
_2_gram_word_counts = Counter(_2_gram_words)

def get_1_gram_count(word):
    if word in words_count:
        return words_count[word]
    else:
        return words_count.most_common()[-1][-1]

def get_2_gram_count(word):
    if word in _2_gram_word_counts:
        return _2_gram_word_counts[word]
    else:
        return _2_gram_word_counts.most_common()[-1][-1]

# one helper covering both the 1-gram and 2-gram counters
def get_gram_count(word, wc):
    if word in wc:
        return wc[word]
    else:
        return wc.most_common()[-1][-1]
```
Putting it together

```python
def two_gram_model(sentence):
    tokens = cut(sentence)
    probability = 1
    for i in range(len(tokens) - 1):
        word = tokens[i]
        next_word = tokens[i + 1]
        _two_gram_c = get_gram_count(word + next_word, _2_gram_word_counts)
        _one_gram_c = get_gram_count(next_word, words_count)
        probability *= _two_gram_c / _one_gram_c
    return probability
```
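Since the corpus file is not included here, the following standalone sketch runs the same pipeline on a made-up token list: whitespace-split tokens stand in for jieba output, and `toy_two_gram_model` takes a token list directly instead of calling `cut`:

```python
from collections import Counter

# made-up toy corpus; whitespace tokens stand in for jieba output
toy_tokens = '我 喜欢 猫 我 喜欢 狗 我 讨厌 雨'.split()

uni_counts = Counter(toy_tokens)
bi_counts = Counter(toy_tokens[i] + toy_tokens[i + 1] for i in range(len(toy_tokens) - 1))

def get_count(word, wc):
    # unseen grams fall back to the smallest observed count
    return wc[word] if word in wc else wc.most_common()[-1][-1]

def toy_two_gram_model(tokens):
    probability = 1
    for i in range(len(tokens) - 1):
        word, next_word = tokens[i], tokens[i + 1]
        probability *= get_count(word + next_word, bi_counts) / get_count(next_word, uni_counts)
    return probability

# a word order seen in the toy corpus scores higher than an unseen one
print(toy_two_gram_model(['我', '喜欢', '猫']))
print(toy_two_gram_model(['猫', '喜欢', '我']))
```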
The full code is available on my GitHub.