Part-of-Speech Tagging (POS Tagging)

Reposted from: http://blog.csdn.net/u014568921/article/details/51791495


What is part-of-speech tagging?


For example, here is a passage of POS-tagged text. After splitting on spaces, the part of each token before the '/' is the English word, and the part after it is that word's POS tag.

Confidence/NN in/IN the/DT pound/NN is/VBZ widely/RB expected/VBN to/TO take/VB another/DT sharp/JJ dive/NN if/IN trade/NN figures/NNS for/IN September/NNP ,/, due/JJ for/IN release/NN tomorrow/NN ,/, fail/VB to/TO show/VB a/DT substantial/JJ improvement/NN from/IN July/NNP and/CC August/NNP 's/POS near-record/JJ deficits/NNS ./.
Chancellor/NNP of/IN the/DT Exchequer/NNP Nigel/NNP Lawson/NNP 's/POS restated/VBN commitment/NN to/TO a/DT firm/NN monetary/JJ policy/NN has/VBZ helped/VBN to/TO prevent/VB a/DT freefall/NN in/IN sterling/NN over/IN the/DT past/JJ week/NN ./.


In the tags above, NN is a noun, IN is a preposition or subordinating conjunction, and DT is a determiner; other tags seen here include JJ (adjective), RB (adverb), VBZ (verb, 3rd-person singular present), VBN (past participle), NNP (proper noun), CC (coordinating conjunction), and POS (possessive ending).


The task, then, is to assign a POS tag to each word of a passage whose words are not yet tagged.


HMMs, maximum entropy models, and CRFs can all perform this task.


HMM

Using an HMM for POS tagging is much like using an HMM for Chinese word segmentation: both can be treated as sequence labeling problems, as the formulas below make concrete.
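Concretely, the tags are the hidden states and the words are the observations. A first-order (bigram) HMM scores a tag sequence $t_1,\dots,t_n$ for a sentence $w_1,\dots,w_n$ by

$$P(t_{1:n}, w_{1:n}) = \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i),$$

and Viterbi decoding finds the highest-scoring sequence with the recurrence

$$v_1(t) = P(t \mid \langle\mathrm{START}\rangle)\, P(w_1 \mid t), \qquad v_j(t) = \Bigl[\max_{t'} v_{j-1}(t')\, P(t \mid t')\Bigr]\, P(w_j \mid t).$$

The implementation below estimates the transition probabilities $P(t \mid t')$ and the emission probabilities $P(w \mid t)$ from counts with add-one smoothing, then fills exactly this table.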


Related articles:

基于隐马尔可夫模型的有监督词性标注 (Supervised POS tagging based on hidden Markov models)

HMM在自然语言处理中的应用一:词性标注 (Applications of HMMs in NLP, part 1: POS tagging)





The implementation below trains a bigram HMM from tagged WSJ sentences and tags new text with Viterbi decoding:
# coding: utf-8
import re
from collections import defaultdict

# Split "word/TAG" on the last unescaped slash (the Penn Treebank escapes
# slashes inside tokens as "\/").
SPLIT_RE = re.compile(r'(?<!\\)/')

def hmm(training_sentences, reducedtagset):
    """Given a list of pre-tagged sentences, return an HMM tuple containing
    the transition (1) and emission (2) probabilities."""
    transitions = defaultdict(lambda: defaultdict(int))
    emissions = defaultdict(lambda: defaultdict(int))
    wordcounts = defaultdict(int)
    tagcounts = defaultdict(int)

    for line in training_sentences:
        prevtag = '<START>'   # before each sentence, begin in the START state
        tagcounts['<START>'] += 1
        for taggedword in line.split():
            (word, tag) = SPLIT_RE.split(taggedword)

            if reducedtagset:
                # Collapse fine-grained Penn Treebank tags (VBZ, NNS, ...)
                # into their coarse classes.
                if re.match('VB', tag) is not None: tag = 'VB'
                elif re.match('NN', tag) is not None: tag = 'NN'
                elif re.match('JJ', tag) is not None: tag = 'JJ'
                elif re.match('RB', tag) is not None: tag = 'RB'

            transitions[prevtag][tag] += 1
            emissions[tag][word] += 1
            wordcounts[word] += 1
            tagcounts[tag] += 1
            prevtag = tag

    # print(list(emissions.keys()))  # debug: the observed tag set

    return hmmtuple(transitions, emissions, wordcounts, tagcounts)

def hmmtuple(transitions, emissions, wordcounts, tagcounts):
    # At test time we will need estimates for "unknown words" -- the words
    # that never occurred in the training data.  One recommended way to do
    # this is to turn all training words occurring just once into
    # '<UNKNOWN>' and use this as the stand-in for all unknown words at
    # test time.  Below we make all the necessary transformations to
    # '<UNKNOWN>'.
    for tag, worddict in emissions.items():
        for word, count in list(worddict.items()):
            if wordcounts[word] == 1:
                del worddict[word]
                worddict['<UNKNOWN>'] += 1

    # Calculate add-one-smoothed conditional probabilities.  Note that
    # '<UNKNOWN>' keeps its raw count; it is scaled down at decode time.
    tags = list(emissions.keys())
    words = list(wordcounts.keys())

    for prevtag in list(transitions.keys()):
        for tag in tags:
            transitions[prevtag][tag] = \
                (transitions[prevtag][tag] + 1.) / (tagcounts[prevtag] + len(tags))

    for tag in tags:
        for word in words:
            emissions[tag][word] = \
                (emissions[tag][word] + 1.) / (tagcounts[tag] + len(wordcounts))

    return (transitions, emissions, tags)

def strip_tags(tagged_sentences):
    """Given a list of tagged sentences, return a list of untagged sentences."""
    untagged_sentences = []
    for taggedsent in tagged_sentences:
        untaggedsent = ''
        for taggedword in taggedsent.split():
            word = SPLIT_RE.split(taggedword)[0]
            untaggedsent += word + ' '
        untagged_sentences.append(untaggedsent)
    return untagged_sentences

def maxsequence(probtable, tags):
    """Given a filled Viterbi probability table, return the most likely
    sequence of POS tags."""
    r = len(probtable)
    c = len(probtable[0])

    # Find the most probable tag for the last word.
    maxfinalprob = 0
    maxfinaltag = 0   # safe fallback if every final probability is 0
    for i in range(r):
        if probtable[i][c-1][0] > maxfinalprob:
            maxfinalprob = probtable[i][c-1][0]
            maxfinaltag = i

    # Follow the backpointers from right to left.
    maxseq = []
    prevmaxtag = maxfinaltag
    for j in range(c-1, -1, -1):
        maxseq.insert(0, tags[prevmaxtag])
        prevmaxtag = probtable[prevmaxtag][j][1]

    return maxseq

def viterbi_tags(untagged_sentences, h):
    """Given a list of untagged sentences, return the most likely sequence of
    POS tags."""
    (transitions, emissions, tags) = h
    maxtags = []

    for untaggedsent in untagged_sentences:
        words = untaggedsent.split()
        if not words:   # skip blank lines
            continue

        # Create an empty r x c table of [probability, backpointer] cells.
        r = len(tags)
        c = len(words)
        probtable = [[[None, None] for _ in range(c)] for _ in range(r)]

        # Initialize the zeroth column from the START state.
        prevtag = '<START>'
        word = words[0]
        for i in range(r):
            tag = tags[i]
            transition = transitions[prevtag][tag]
            if word in emissions[tag]:
                emission = emissions[tag][word]
            else:
                # Unseen word: back off to a scaled '<UNKNOWN>' score.
                emission = .0001 * emissions[tag]['<UNKNOWN>']
            probtable[i][0][0] = transition * emission

        # Fill in the rest of the table column by column.  Probabilities are
        # multiplied directly, so very long sentences may underflow;
        # log-probabilities would be safer.
        for j in range(1, c):
            word = words[j]
            for i in range(r):
                tag = tags[i]
                maxprob = 0
                maxtag = 0   # safe fallback if every path has probability 0

                if word in emissions[tag]:
                    emission = emissions[tag][word]
                else:
                    emission = .0001 * emissions[tag]['<UNKNOWN>']

                for k in range(r):
                    prevtag = tags[k]
                    transition = transitions[prevtag][tag]
                    prob = probtable[k][j-1][0] * transition * emission
                    if prob > maxprob:
                        maxprob = prob
                        maxtag = k

                probtable[i][j][0] = maxprob
                probtable[i][j][1] = maxtag

        # Find the most likely sequence of POS tags for this sentence.
        sentmaxtags = maxsequence(probtable, tags)
        maxtags.extend(sentmaxtags)

    # Return the most likely sequence of POS tags over all sentences.
    return maxtags

def true_tags(tagged_sentences):
    """Given a list of tagged sentences, return the tag sequence."""
    tags = []
    for sent in tagged_sentences:
        tags.extend([SPLIT_RE.split(word)[1] for word in sent.split()])
    return tags

def compare(mytags, truetags, reducedtagset):
    """Return tagging accuracy: the fraction of positions where the
    predicted tag matches the true tag."""
    score = 0
    length = len(mytags)
    for i in range(length):
        truetag = truetags[i]
        if reducedtagset:
            if re.match('VB', truetag) is not None: truetag = 'VB'
            elif re.match('NN', truetag) is not None: truetag = 'NN'
            elif re.match('JJ', truetag) is not None: truetag = 'JJ'
            elif re.match('RB', truetag) is not None: truetag = 'RB'

        if mytags[i] == truetag: score += 1

    return 1. * score / length

if __name__ == '__main__':
    f = open('wsj15-18.pos').readlines()

    # 90% of the data is used for training, the rest for testing.
    print('90% of data is used for training')
    print('--------------------------------')
    i = int(len(f) * .9)
    h = hmm(f[:i], False)

    test1 = f[i:]
    v1 = viterbi_tags(strip_tags(test1), h)
    t1 = true_tags(test1)
    c1 = compare(v1, t1, False)
    print(c1)

    test2 = open('wsj_0159.pos').readlines()
    v2 = viterbi_tags(strip_tags(test2), h)
    t2 = true_tags(test2)
    c2 = compare(v2, t2, False)
    print(c2)
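If the WSJ files (wsj15-18.pos and wsj_0159.pos) are not at hand, the pipeline can still be smoke-tested on a tiny in-memory corpus. The two sentences below are made up purely to illustrate the call sequence:

# Hypothetical toy corpus -- just a smoke test of the
# hmm / strip_tags / viterbi_tags / compare pipeline, not a real evaluation.
toy = [
    "the/DT dog/NN barks/VBZ ./.",
    "the/DT cat/NN sleeps/VBZ ./.",
]
h_toy = hmm(toy, False)
predicted = viterbi_tags(strip_tags(toy), h_toy)
print(predicted)                                    # one tag per word
print(compare(predicted, true_tags(toy), False))    # accuracy on the toy data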


Related: a perceptron part-of-speech tagger in 200 lines of Python (200行Python代码实现感知机词性标注器); a sketch of the core idea follows.
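That implementation is not reproduced here, but the heart of any such tagger is the perceptron update over tagging features. A minimal sketch, assuming greedy left-to-right decoding, a deliberately tiny hypothetical feature set, and input as (words, tags) pairs rather than the word/TAG lines used above:

from collections import defaultdict

def features(words, i, prevtag):
    # A tiny, hypothetical feature set: the current word, its
    # two-character suffix, and the previous tag.
    w = words[i]
    return ['word=' + w, 'suffix=' + w[-2:], 'prevtag=' + prevtag]

class PerceptronTagger:
    def __init__(self, tags):
        self.tags = tags                     # the tagset to score over
        self.weights = defaultdict(float)    # one weight per (feature, tag)

    def predict(self, feats):
        # Score every tag against the active features; return the argmax.
        scores = {t: sum(self.weights[(f, t)] for f in feats)
                  for t in self.tags}
        return max(scores, key=scores.get)

    def train(self, tagged_sentences, epochs=5):
        # tagged_sentences: iterable of (words, tags) pairs.
        for _ in range(epochs):
            for words, tags in tagged_sentences:
                prevtag = '<START>'
                for i, gold in enumerate(tags):
                    feats = features(words, i, prevtag)
                    guess = self.predict(feats)
                    if guess != gold:
                        # The perceptron update: reward the gold tag's
                        # features, penalize the mistaken guess's.
                        for f in feats:
                            self.weights[(f, gold)] += 1.0
                            self.weights[(f, guess)] -= 1.0
                    prevtag = gold           # condition on the gold history

A practical version would additionally average the weights over all updates and use a much richer feature set (prefixes, capitalization, neighboring words); that is roughly what fills the 200 lines the title refers to.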




