NLP_1: Grammar Trees and the N-gram Model


title: NLP_1: Grammar Trees and the N-gram Model
date: 2019-10-22 15:25:11
mathjax: true
categories:

  • nlp-自然语言处理

tags:

  • nlp-自然语言处理

Grammar Trees

Simple matching

  • import random
    def name():
        return random.choice('Jhon | Mike | 老梁'.split('|'))
    def hello():
        return random.choice('你好 | 您来啦 | 快请进'.split('|'))
    def say_hello():
        return name() + ' ' + hello()
    say_hello()
    Output:
    ' 老梁 你好 '
    

    Anyone familiar with Python probably knows that random.choice picks one element at random.

  • Define the grammar rules (a sample run follows after the code)

    hello_rules = '''
    say_hello = name hello tail
    names = name names | name
    name = Jhon | Mike | 老梁 | 老刘
    hello = 你好 | 您来啦 | 快请进
    tail = 呀 | !
    '''
    def get_generation_by_gram(grammar_str: str, target, stmt_split='=', or_split='|'):
        rules = dict()  # key is the statement, value is the list of alternative expansions
        for line in grammar_str.split('\n'):
            if not line.strip(): continue  # skip empty lines
            stmt, expr = line.split(stmt_split)
            rules[stmt.strip()] = expr.split(or_split)
        generated = generate(rules, target=target)
        return generated

    def generate(grammar_rule, target):
        if target in grammar_rule:
            candidates = grammar_rule[target]
            candidate = random.choice(candidates)
            # expand each symbol of the chosen alternative recursively
            return ''.join(generate(grammar_rule, target=c.strip()) for c in candidate.split())
        else:
            return target
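  • Run (a quick sanity check; this call is an assumed sketch, not shown in the original post)

    for i in range(3):
        # 'say_hello' is the start symbol; stmt_split defaults to '=', matching hello_rules
        print(get_generation_by_gram(hello_rules, target='say_hello'))
    # each line prints a random name, a greeting, and a tail particle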
    

Example 1

  • Grammar

    simple_grammar = """
    sentence => noun_phrase verb_phrase
    noun_phrase => Article Adj* noun
    Adj* => Adj | Adj Adj*
    verb_phrase => verb noun_phrase
    Article =>  一个 | 这个
    noun =>   女人 |  篮球 | 桌子 | 小猫
    verb => 看着   |  坐在 |  听着 | 看见
    Adj =>   蓝色的 |  好看的 | 小小的"""
    
  • Build the rules

    def get_generation_by_gram(grammar_str: str, target, stmt_split='=', or_split='|'):

        rules = dict()  # key is the @statement, value is @expression
        for line in grammar_str.split('\n'):
            if not line: continue
            # skip the empty line
            stmt, expr = line.split(stmt_split)

            rules[stmt.strip()] = expr.split(or_split)

        generated = generate(rules, target=target)

        return generated
    
  • Build the sentence

    def generate(grammar_rule, target):
        if target in grammar_rule: # names 
            candidates = grammar_rule[target]  # ['name names', 'name']
            candidate = random.choice(candidates) #'name names', 'name'
            return ''.join(generate(grammar_rule, target=c.strip()) for c in candidate.split())
        else:
            return target
    
  • Run

    get_generation_by_gram(simple_grammar, target='sentence', stmt_split='=>')
    '一个好看的蓝色的小猫听着这个蓝色的蓝色的蓝色的小小的小猫'
    

Example 2

  • Grammar

    simpel_programming = '''
    if_stmt => if ( cond ) { stmt }
    cond => var op var
    op => | == | < | >= | <=
    stmt => assign | if_stmt
    assign => var = var
    var =>  char var | char
    char => a | b |  c | d | 0 | 1 | 2 | 3
    '''
    # note: the leading '|' in the op rule adds an empty alternative,
    # which is why some generated conditions below have no operator, e.g. if(c22)
    
  • Generate sentences

    for i in range(20):
        print(get_generation_by_gram(simpel_programming, target='if_stmt', stmt_split='=>'))
    
  • if(b2>=a){3a3c=2}
    if(1<2d){if(3>=dd){bc=a}}
    if(a<=d){0=03}
    if(3012<aa){0=33}
    if(c22){d=2}
    if(a1d3){3=3}
    if(3>=b){0dc=2c}
    if(1ab<3a33){if(d001<d){2b=0a}}
    if(a10c){1300c=2}
    if(b00cb){if(c<add33021){if(a<adc){3bd=a}}}
    if(a<=0){b=c}
    if(33dc<=d10b3){2=a}
    if(1b==2){if(3c<=0){if(2c==a){a1a0=d}}}
    if(0b12<=1){if(03c0){0=1a2}}
    if(1==b1){if(a>=12){if(02==dcd){if(b<22){3=a}}}}
    if(a==d){a3=1d}
    if(bcaa1b){c=b}
    if(dd>=a){a=1}
    if(3==bb){322=d}
    if(10>=0){if(bab){c=13b0}}
    

N-gram Model

  • Suppose we have a sentence made up of $n$ words, $S = (w_1, w_2, w_3, \dots, w_n)$. How can we measure its probability? Assume that every word $w_i$ depends on all the words before it, from the first word $w_1$ up to the previous word $w_{i-1}$:

    $$p(S) = p(w_1 w_2 \dots w_n) = p(w_1)\,p(w_2 \mid w_1) \cdots p(w_n \mid w_{n-1} \dots w_2 w_1)$$

    Simple, isn't it? Yes, but this way of measuring probability has two drawbacks:

    1. The parameter space is far too large: the probabilities $p(w_1)\,p(w_2 \mid w_1) \cdots p(w_n \mid w_{n-1} \dots w_2 w_1)$ involve too many parameters to estimate reliably.
    2. The data are severely sparse: many word combinations never occur together in the corpus at all, and the problem gets worse as the order of the combination grows.
  • To address the first problem we introduce the Markov assumption: a word's occurrence depends only on a limited number of the words immediately before it.

    $$p(w_1 \dots w_n) = \prod_i p(w_i \mid w_{i-1} \dots w_1) \approx \prod_i p(w_i \mid w_{i-1} \dots w_{i-N+1})$$

  • If a word's occurrence depends only on the single word before it, we call the model a bi-gram:

    $$p(S) = p(w_1 w_2 \dots w_n) = p(w_1)\,p(w_2 \mid w_1) \cdots p(w_n \mid w_{n-1})$$

  • If a word's occurrence depends only on the two words before it, we call the model a tri-gram:

    $$p(S) = p(w_1 w_2 \dots w_n) = p(w_1)\,p(w_2 \mid w_1) \cdots p(w_n \mid w_{n-1} w_{n-2})$$
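    As a toy illustration (added here, not from the original post), for the three-word sentence "I love NLP" the bi-gram model gives $p(S) = p(\text{I})\,p(\text{love} \mid \text{I})\,p(\text{NLP} \mid \text{love})$, while the tri-gram model conditions the last factor on both preceding words: $p(\text{NLP} \mid \text{I}, \text{love})$.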


  • How do we estimate each of these conditional probabilities? The answer is maximum likelihood estimation, which here simply means counting occurrences in the corpus (a small Python sketch follows below):

    $$p(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$

    $$p(w_n \mid w_{n-1} w_{n-2}) = \frac{C(w_{n-2} w_{n-1} w_n)}{C(w_{n-2} w_{n-1})}$$

    $$p(w_n \mid w_{n-1} \dots w_2 w_1) = \frac{C(w_1 w_2 \dots w_n)}{C(w_1 w_2 \dots w_{n-1})}$$

    where $C(\cdot)$ denotes how many times the word sequence occurs in the corpus.
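  • A minimal sketch of these counts in Python (a toy example added for illustration; the tiny corpus and variable names here are assumptions, not part of the original post)

    from collections import Counter

    # toy corpus, already tokenized (hypothetical data)
    tokens = '我 喜欢 小猫 我 喜欢 篮球 我 看着 小猫'.split()

    unigram_counts = Counter(tokens)                  # C(w)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # C(w_{n-1} w_n)

    def bigram_prob(prev_word, word):
        # MLE estimate: p(word | prev_word) = C(prev_word word) / C(prev_word)
        return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

    print(bigram_prob('我', '喜欢'))  # 2 / 3 ≈ 0.667 on this toy corpus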


  • Open the corpus file

    corpus = '/Users/admin/Desktop/2.txt'
    FILE = open(corpus).read()
    
  • Take a quick look at some random characters

    import random
    # draw `length` random characters from the corpus, just to eyeball the text
    def generate_by_pro(text_corpus, length=20):
        return ''.join(random.sample(text_corpus, length))
    generate_by_pro(FILE)
    
  • Tokenize with jieba

    import jieba

    # sub_file is the cleaned corpus text derived from FILE; its preprocessing is not shown here
    token = list(jieba.cut(sub_file))
    # wrapped as a helper for reuse
    def cut(string):
        return list(jieba.cut(string))
    TOKENS = cut(sub_file)
    
  • Building the 2-gram counts

    _2_gram_words = [
        TOKENS[i] + TOKENS[i+1] for i in range(len(TOKENS)-1)
    ]

    # The count tables below are collections.Counter objects; their construction
    # is not shown in the original post, so this is an assumed definition.
    from collections import Counter
    words_count = Counter(TOKENS)
    _2_gram_words_counts = Counter(_2_gram_words)

    def get_1_gram_count(word):
        if word in words_count: return words_count[word]
        else:
            # unseen word: fall back to the smallest observed count
            return words_count.most_common()[-1][-1]
    def get_2_gram_count(word):
        if word in _2_gram_words_counts: return _2_gram_words_counts[word]
        else:
            return _2_gram_words_counts.most_common()[-1][-1]
    # a single lookup that covers both the 1-gram and 2-gram counters
    def get_gram_count(word, wc):
        if word in wc: return wc[word]
        else:
            return wc.most_common()[-1][-1]
    
  • Putting it together (an example call follows below)

    def two_gram_model(sentence):
        tokens = cut(sentence)
        probability = 1
        for i in range(len(tokens)-1):
            word = tokens[i]
            next_word = tokens[i+1]
            # MLE as above: p(next_word | word) = C(word next_word) / C(word)
            _two_gram_c = get_gram_count(word + next_word, _2_gram_words_counts)
            _one_gram_c = get_gram_count(word, words_count)
            pro = _two_gram_c / _one_gram_c

            probability *= pro
        return probability
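  • Example usage (a hypothetical check added here, not shown in the original post): a fluent sentence should get a higher 2-gram probability than a scrambled one; the actual numbers depend entirely on the corpus in 2.txt

    sentence_a = '今天天气很好'
    sentence_b = '好很天气今天'
    print(two_gram_model(sentence_a))  # expected to be larger on a typical Chinese corpus
    print(two_gram_model(sentence_b))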
            
    

    For the full code, see my GitHub.
