Python Web Scraping Study Notes: Markov Model (2019/7/2)

Yuanzhss

Category: Study Notes. Tags: web scraping, Markov model, understanding

Article link: https://blog.csdn.net/Yuanzhs/article/details/94462324

Markov Model

My understanding of how this model works

1. Finding the words that commonly follow each word in a text

Original code

def buildWordDict(text):
	# Remove newlines and double quotes
	text = text.replace("\n", " ")
	text = text.replace("\"", "")
	
	# Surround each punctuation mark with spaces so that it becomes
	# its own token and is kept in the Markov chain
	punctuation = [',', '.', ';', ':']
	for symbol in punctuation:
		text = text.replace(symbol, " " + symbol + " ")
			
	words = text.split(" ")
	# Filter out empty words
	words = [word for word in words if word != ""]

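Understanding: this part cleans the text. The key point is that the punctuation marks that frequently follow words are kept as tokens of their own rather than discarded, which makes the generated text read more plausibly and fluently.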

	wordDict = {}
	for i in range(1, len(words)):
		if words[i-1] not in wordDict:
			# Create a new dictionary for this word
			wordDict[words[i-1]] = {}
		if words[i] not in wordDict[words[i-1]]:
			wordDict[words[i-1]][words[i]] = 0
		# Count this occurrence of words[i] right after words[i-1]
		wordDict[words[i-1]][words[i]] = wordDict[words[i-1]][words[i]] + 1
		
	return wordDict

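Understanding: 1. If the target word (the previous word, words[i-1]) is not yet in the newly created dictionary wordDict,
2. create a new inner dictionary for the target word.
3. If the associated word (the next word, words[i]) has not yet been recorded in the target word's dictionary,
4. record it there with an initial frequency of 0.
5. On every occurrence of the associated word after the target word, including the first, add 1 to its frequency, so the stored value is the number of times words[i] directly follows words[i-1].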
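
The steps above can be checked on a tiny input. This is a minimal sketch, assuming the same cleaning rules as buildWordDict; the name build_word_dict and the sample sentence are made up for illustration and do not come from the book.

```python
def build_word_dict(text):
    """Map each word to {next word: count}. Hypothetical restatement of buildWordDict."""
    text = text.replace("\n", " ").replace('"', "")
    # Put spaces around punctuation so each mark becomes its own token
    for symbol in [",", ".", ";", ":"]:
        text = text.replace(symbol, " " + symbol + " ")
    words = [word for word in text.split(" ") if word != ""]

    word_dict = {}
    # Walk over consecutive (previous word, current word) pairs
    for previous, current in zip(words, words[1:]):
        followers = word_dict.setdefault(previous, {})
        followers[current] = followers.get(current, 0) + 1
    return word_dict


sample = "I hope. I think, I think."  # made-up sample text
result = build_word_dict(sample)
print(result["I"])      # {'hope': 1, 'think': 2}
print(result["think"])  # {',': 1, '.': 1}
```

Note that the punctuation marks appear as keys and followers in their own right, which is how they survive into the generated chain.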

Computing the total frequency for a target word

Original code

def wordListSum(wordList):
	# Add up the counts of all words that follow the target word
	total = 0
	for count in wordList.values():
		total += count
	return total

Example: suppose the target word in the text is "I" and its associated words are "hope", "want" and "think",
i.e. {"I": {"hope": 2, "want": 2, "think": 4}}
The total frequency computed here is 2+2+4=8
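
As a small sketch of what the total is used for, the counts from the example can be turned into probabilities by dividing each one by the total. The variable names followers, total and probabilities are illustrative only.

```python
# Counts of the words that follow "I" in the example above
followers = {"hope": 2, "want": 2, "think": 4}

# Same result as wordListSum(followers)
total = sum(followers.values())

# Each follower's chance of being chosen is its count over the total
probabilities = {word: count / total for word, count in followers.items()}

print(total)          # 8
print(probabilities)  # {'hope': 0.25, 'want': 0.25, 'think': 0.5}
```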

Picking an associated word at random, weighted by frequency

Original code

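from random import randint

def retrieveRandomWord(wordList):
	# Draw a number between 1 and the total count, inclusive
	randIndex = randint(1, wordListSum(wordList))
	for word, value in wordList.items():
		# Subtract each word's count; the word that brings the
		# running value to 0 or below is the one selected
		randIndex -= value
		if randIndex <= 0:
			return word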

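Example: 1. Draw a random integer in the randint range (1 to 8)
2. Suppose the number drawn is 4
3. hope: 4-2 = 2 > 0 (keep looping)
4. want: 2-2 = 0, so the loop ends
5. The word want is returned
6. The resulting phrase is I want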
Setting the Markov chain length and printing the result

Original code

from urllib.request import urlopen

text = str(urlopen("http://pythonscraping.com/files/inaugurationSpeech.txt").read(), 'utf-8')
wordDict = buildWordDict(text)

# Generate a Markov chain of length 100
length = 100
chain = ""
currentWord = "I"
for i in range(0, length):
	chain += currentWord + " "
	currentWord = retrieveRandomWord(wordDict[currentWord])
	
print(chain)
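
One edge case in the loop above: a word that only ever appears as the very last token of the text has no entry in wordDict, so wordDict[currentWord] would raise a KeyError if the chain reaches it. The following is a minimal sketch of a generation loop that stops early in that case. The function generate_chain and the toy dictionary are hypothetical and are not part of the book's code.

```python
from random import randint


def generate_chain(word_dict, start_word, length):
    """Build a chain of up to `length` words, stopping at a word with no followers."""
    words = []
    current = start_word
    for _ in range(length):
        words.append(current)
        followers = word_dict.get(current)
        if not followers:
            # Dead end: this word never had a successor in the text
            break
        # Weighted random pick, same idea as retrieveRandomWord
        draw = randint(1, sum(followers.values()))
        for word, count in followers.items():
            draw -= count
            if draw <= 0:
                current = word
                break
    return " ".join(words)


# Toy dictionary in which every word has a single follower
toy = {"I": {"think": 1}, "think": {"so": 1}}
print(generate_chain(toy, "I", 100))  # I think so
```

With the real speech text, a call such as generate_chain(wordDict, "I", 100) would play the same role as the original loop.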

References:
Ryan Mitchell, Web Scraping with Python (Chinese edition: Python 网络数据采集), 2016