Implementing uni-grams and bi-grams for English text in Python 3
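Following the approach to English spell checking outlined in the previous post, this post begins the implementation. First we need a reasonably large English corpus to analyze. I use training-monolingual, but any other corpus will work as well.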
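To make the goal concrete before the full class, here is a minimal sketch of what counting uni-grams and bi-grams means. It is not the implementation used in this post: the function name `count_ngrams`, the use of `collections.Counter`, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count uni-grams and bi-grams over an iterable of sentence strings.

    Hypothetical helper for illustration only; it tokenizes on whitespace.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Each token is one uni-gram.
        unigrams.update(tokens)
        # Pair each token with its successor to form the bi-grams.
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


unigrams, bigrams = count_ngrams(["the cat sat", "the cat ran"])
print(unigrams["the"])           # 2
print(bigrams[("the", "cat")])   # 2
```

A `Counter` returns 0 for keys it has not seen, which saves the explicit "is this key already in the dict" check that a plain `dict` needs.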
1. Generating uni-grams and bi-grams from the corpus
The code is as follows:
import sys
class NGram(object):
    def __init__(self, n):
        # n is the order of the n-gram language model
        self.n = n
        self.unigram = {}
        self.bigram = {}
    # Scan a list of sentences, extract the n-grams and update their
    # frequencies.
    #
    # @param sentence list{str}
    # @return none
    def scan(self, sentence):
        fip = ""
        # fill in your code here
        for line in sentence:
            self.ngram(line.split())
        # unigram
        if self.n == 1:
            try:
                fip = open("data.uni", "w", encoding='utf-8')
            except:
                print(sys.stderr, "fail