A Text Generator Based on a Markov Model

Latest recommended article published on 2022-08-18 23:38:35

Freyua_xx

Category: project. Tags: natural language processing, machine learning, algorithms

Article link: https://blog.csdn.net/Freyua_xx/article/details/121747591

Markov Processes

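All you need to know:
A Markov process is one in which the future depends only on the present, not on the past.
A Markov chain is the simplest kind of Markov process: both its time parameter and its state space are discrete.
For example:
I know the probability distribution of an event on the first day;
I also know a Markov transition matrix;
then a (time-homogeneous) Markov chain is one in which the transition probability matrix from day 1 to day 2 is the same as the one from day 2 to day 3, and the same for every day after that.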
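The example above can be sketched in a few lines of Python. This is only an illustration: the two weather states, the day-1 distribution, the matrix values, and the helper name `step` are all made up here and do not come from the article. The point is that the same transition matrix is applied on every day.

```python
# Hypothetical two-state chain; all numbers are invented for illustration.
states = ["sunny", "rainy"]

# Probability distribution on day 1: [P(sunny), P(rainy)].
day1 = [0.9, 0.1]

# transition[i][j] = P(tomorrow is states[j] | today is states[i]).
# Each row sums to 1.
transition = [
    [0.8, 0.2],
    [0.4, 0.6],
]

def step(dist, matrix):
    """Advance a distribution by one day: new[j] = sum_i dist[i] * matrix[i][j]."""
    return [
        sum(dist[i] * matrix[i][j] for i in range(len(dist)))
        for j in range(len(matrix[0]))
    ]

day2 = step(day1, transition)
day3 = step(day2, transition)  # the same matrix is reused for every day

for name, dist in (("day 2", day2), ("day 3", day3)):
    print(name, dict(zip(states, (round(p, 3) for p in dist))))
```

Because tomorrow's distribution is computed from today's alone, nothing about earlier days has to be stored, which is exactly the "future depends only on the present" property.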

In a Markov model of text, each word is a state, and the next word is chosen based only on the current word.

Building a Text Generator

Get a .txt document (in English)

from urllib.request import urlopen
from random import randint

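def wordListSum(wordList):
    # Total number of occurrences of all words that follow a given word
    total = 0
    for word, value in wordList.items():
        total += value
    return total

def retrieveRandomWord(wordList):
    # Weighted random choice: a word that occurs more often is more likely
    randIndex = randint(1, wordListSum(wordList))
    for word, value in wordList.items():
        randIndex -= value
        if randIndex <= 0:
            return word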

def buildWordDict(text):
    # Remove newlines and double quotes
    text = text.replace('\n', ' ')
    text = text.replace('"', '')

    # Surround punctuation marks with spaces so that they are split off
    # and treated as separate "words"
    punctuation = [',', '.', ';', ':']
    for symbol in punctuation:
        text = text.replace(symbol, ' {} '.format(symbol))

    words = text.split(' ')
    # Filter out empty words
    words = [word for word in words if word != '']

    wordDict = {}
    # Build a two-level (nested) dictionary (analyzed in detail later):
    # wordDict[previous word][next word] = number of times the pair occurs
    for i in range(1, len(words)):
        if words[i-1] not in wordDict:
            # Create a new dictionary for this word
            wordDict[words[i-1]] = {}
        if words[i] not in wordDict[words[i-1]]:
            wordDict[words[i-1]][words[i]] = 0
        wordDict[words[i-1]][words[i]] += 1
    return wordDict

text = str(urlopen('http://pythonscraping.com/files/inaugurationSpeech.txt')
          .read(), 'utf-8')
wordDict = buildWordDict(text)

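# Starting from "I", generate a Markov chain of 100 more words
length = 100
chain = ['I']
for i in range(0, length):
    # Stop early if the current word was never followed by another word
    if chain[-1] not in wordDict:
        break
    newWord = retrieveRandomWord(wordDict[chain[-1]])
    chain.append(newWord)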

print(' '.join(chain))
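To make the nested dictionary easier to see, here is a small self-contained sketch that runs on a one-line sample sentence instead of the downloaded speech. It is not the article's code: the names `build_word_dict`, `pick_next`, and `generate` and the sample sentence are invented for illustration, and the weighted pick uses the standard-library `random.choices` in place of the manual countdown loop in `retrieveRandomWord`, which should give the same distribution.

```python
import random

def build_word_dict(words):
    """Map each word to a dict counting the words that directly follow it."""
    word_dict = {}
    for prev, nxt in zip(words, words[1:]):
        followers = word_dict.setdefault(prev, {})
        followers[nxt] = followers.get(nxt, 0) + 1
    return word_dict

def pick_next(followers, rng=random):
    """Pick a follower with probability proportional to its count."""
    candidates = list(followers)
    weights = [followers[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate(word_dict, start, length, rng=random):
    """Walk the chain from `start`, appending up to `length` more words."""
    chain = [start]
    for _ in range(length):
        followers = word_dict.get(chain[-1])
        if not followers:  # dead end: this word was never followed by anything
            break
        chain.append(pick_next(followers, rng))
    return chain

# Hypothetical sample text, small enough to inspect by hand.
sample = "the cat sat on the mat and the cat ran".split()
word_dict = build_word_dict(sample)

print(word_dict["the"])  # counts of the words seen right after "the"
print(" ".join(generate(word_dict, "the", 10)))
```

In this sample, "the" is followed by "cat" twice and by "mat" once, so "cat" is picked about two times out of three. The word "ran" appears only at the very end and therefore has no entry of its own, which is why `generate` checks for a dead end before looking up the next word.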