Markov Model
Notes on how this model works
1. Finding the words that most often follow each word in the text
Original code
def buildWordDict(text):
    # Remove newlines and double quotes
    text = text.replace("\n", " ")
    text = text.replace("\"", "")
    # Surround each punctuation mark with spaces so that it is treated
    # as its own "word" and kept in the Markov chain instead of being dropped
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, " " + symbol + " ")
    words = text.split(" ")
    # Filter out empty words
    words = [word for word in words if word != ""]
Understanding (first half of the function): this part cleans the text. The key point is that the punctuation marks that commonly follow words are kept as tokens, which makes the generated text read more sensibly and fluently.
    wordDict = {}
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            # Create a new inner dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict
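Understanding (second half of the function):
1. Check whether the target word, i.e. the previous word words[i-1], is already a key in wordDict (which starts out empty).
2. If it is not, create a new inner dictionary for the target word.
3. Check whether the following word words[i] has already been recorded in the target word's inner dictionary.
4. If this is the first time it follows the target word, record it there with a count of 0.
5. In every case, including the first occurrence, add 1 to the count, so a word seen once after the target word ends up with a count of 1.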
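To make the steps above concrete, here is a minimal sketch that runs the same cleaning and counting logic on a one-line sample. The sample sentence and the helper name build_word_dict_sketch are made up for illustration and do not come from the book. The sketch uses collections.defaultdict in place of the explicit membership checks, which should produce the same counts.

```python
from collections import defaultdict


def build_word_dict_sketch(text):
    """Hypothetical helper: count how often each word is followed by each other word."""
    # Same cleaning as in the notes: drop newlines and double quotes,
    # then put spaces around punctuation so each mark becomes its own token.
    text = text.replace("\n", " ").replace('"', "")
    for symbol in [",", ".", ";", ":"]:
        text = text.replace(symbol, " " + symbol + " ")
    words = [word for word in text.split(" ") if word != ""]

    # defaultdict(int) starts every missing count at 0, so "+= 1" covers
    # both the first occurrence and every later one.
    word_dict = defaultdict(lambda: defaultdict(int))
    for previous, following in zip(words, words[1:]):
        word_dict[previous][following] += 1

    # Convert to plain dicts so the result prints cleanly.
    return {word: dict(followers) for word, followers in word_dict.items()}


sample = "I hope, I want. I hope"  # made-up sample text
result = build_word_dict_sketch(sample)
print(result)
```

In this sample, "I" is followed by "hope" twice and by "want" once, and the comma and the period show up as keys of their own because they were split off as separate tokens.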
2. Computing the total frequency for a target word
Original code
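def wordListSum(wordList):
    # Add up the counts of all words that follow the target word
    # (named total to avoid shadowing the built-in sum)
    total = 0
    for value in wordList.values():
        total += value
    return total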
Example: suppose the target word in the text is "I" and the words that follow it are "hope", "want" and "think",
i.e. {"I": {"hope": 2, "want": 2, "think": 4}}.
The total frequency is then 2 + 2 + 4 = 8.
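The arithmetic in this example can be checked directly. The short sketch below uses the example counts from the text together with the built-in sum; the variable names are illustrative. Dividing each count by the total gives the probability with which each following word should be chosen.

```python
# Example counts for the words that follow "I", taken from the text above
followers = {"hope": 2, "want": 2, "think": 4}

# Total frequency: 2 + 2 + 4
total = sum(followers.values())

# Share of the total held by each following word
probabilities = {word: count / total for word, count in followers.items()}

print(total)
print(probabilities)
```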
3. Picking a following word at random, weighted by frequency
Original code
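from random import randint

def retrieveRandomWord(wordList):
    # Draw a number between 1 and the total count (both ends inclusive)
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        # Subtract each word's count; the word that brings the number to 0 or below is chosen
        randIndex -= value
        if randIndex <= 0:
            return word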
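Example (using the counts for "I" from the previous example):
1. randint returns a random number in the range 1 to 8, inclusive.
2. Suppose the number is 4.
3. hope: 4 - 2 = 2 > 0, so the loop continues.
4. want: 2 - 2 = 0, which satisfies randIndex <= 0, so the loop ends.
5. The word want is returned.
6. The resulting phrase is "I want".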
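To see how the subtraction loop maps every possible random number to a word, the sketch below replaces randint with an explicitly supplied number. pick_word is a hypothetical helper written only for this illustration; the counts are the example from the text, and the walk-through assumes the dictionary is iterated in insertion order (guaranteed from Python 3.7 on).

```python
def pick_word(word_list, rand_index):
    """Hypothetical helper: same loop as retrieveRandomWord, but the number is passed in."""
    for word, value in word_list.items():
        rand_index -= value
        if rand_index <= 0:
            return word


followers = {"hope": 2, "want": 2, "think": 4}

# Try every number that randint(1, 8) could return
mapping = {n: pick_word(followers, n) for n in range(1, 9)}
print(mapping)
```

Numbers 1 and 2 select hope, 3 and 4 select want, and 5 to 8 select think, so each word is chosen in proportion to its count. For comparison, the standard library call random.choices(list(followers), weights=list(followers.values())) performs a similar weighted draw and returns a one-element list.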
4. Setting the Markov chain length and printing the result
Original code
from urllib.request import urlopen

text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
wordDict = buildWordDict(text)
# Generate a Markov chain of length 100
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
    chain += currentWord + " "
    currentWord = retrieveRandomWord(wordDict[currentWord])
print(chain)
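One case the loop above does not handle: if currentWord only ever occurs as the very last word of the text, it has no entry in wordDict and the lookup raises KeyError. The sketch below shows one possible guard. It runs offline on a made-up sample sentence with a simplified tokenizer; generate_chain, the seeded random.Random, and the policy of restarting at the start word are all assumptions for illustration and are not part of the book's code.

```python
import random


def build_word_dict(text):
    # Simplified tokenizer for the sketch: the sample already has spaces around punctuation
    words = text.split()
    word_dict = {}
    for previous, following in zip(words, words[1:]):
        followers = word_dict.setdefault(previous, {})
        followers[following] = followers.get(following, 0) + 1
    return word_dict


def retrieve_random_word(word_list, rng):
    # Weighted draw using the same subtraction loop as in the notes
    rand_index = rng.randint(1, sum(word_list.values()))
    for word, value in word_list.items():
        rand_index -= value
        if rand_index <= 0:
            return word


def generate_chain(word_dict, start_word, length, rng):
    """Hypothetical helper: build a chain, restarting at start_word on a dead end."""
    chain = []
    current_word = start_word
    for _ in range(length):
        chain.append(current_word)
        followers = word_dict.get(current_word)
        if not followers:
            # Dead end: this word was never followed by anything in the text.
            # Assumed policy for this sketch: start again from start_word.
            current_word = start_word
        else:
            current_word = retrieve_random_word(followers, rng)
    return " ".join(chain)


# Made-up sample; "end" appears only as the last word, so it has no followers
sample_text = "I hope we act . I want peace . I think we can end"
word_dict = build_word_dict(sample_text)
chain = generate_chain(word_dict, "I", 20, random.Random(0))
print(chain)
```

Restarting at the start word is only one option; stopping the chain early or picking a random key from the dictionary would work as well.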
Reference:
Ryan Mitchell (2016). Python 网络数据采集 (Chinese edition of Web Scraping with Python).