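N-gram models
An N-gram model predicts the next consecutive words a user is about to type.
When two consecutive words are extracted at a time, the model is called a bigram model.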
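As a small illustration of this definition, the sketch below builds bigrams from a token list in plain Python. The helper name `make_ngrams` and the sample sentence are hypothetical and chosen only for this example.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    # Slide a window of width n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "the quick brown fox".split()
bigrams = make_ngrams(sample, 2)
print(bigrams)
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Setting `n` to 3 would give trigrams in the same way.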
Preparation
The dataset is Alice in Wonderland.
Extract N-grams from the raw text:
import nltk
import string
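# Read the whole book into a single string
with open('alice_in_wonderland.txt', 'r') as content_file:
    content = content_file.read()
# Replace punctuation with spaces, then collapse runs of whitespace
content2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in content]).split())
# Tokenize (nltk.word_tokenize needs the NLTK 'punkt' tokenizer data)
tokens = nltk.word_tokenize(content2)
# Lowercase and drop single-character tokens
tokens = [word.lower() for word in tokens if len(word)>=2]
N = 3  # size of each N-gram (3 = trigram)
quads = list(nltk.ngrams(tokens,N))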
"""
Return the ngrams generated from a sequence of items, as an iterator.
For example:
>>> from nltk.util import ngrams
>>> list(ngrams([1,2,3,4,5], 3))
[(1, 2, 3), (2, 3, 4), (3, 4, 5)]
"""
# Join each N-gram tuple back into a space-separated string
newl_app = []
for ln in quads:
    new1 = ' '.join(ln)
    newl_app.append(new1)
print(newl_app[:3])
Output:
['alice adventures in', 'adventures in wonderland', 'in wonderland alice']
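One way to see how such N-gram strings support prediction is a simple frequency lookup: split each trigram into a two-word context and the word that follows it, then return the most frequent follower. This is a count-based baseline shown only for intuition, not the model these notes go on to build; `build_followers`, `predict_next`, and the three-item sample list are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def build_followers(ngram_strings):
    """Map each context (all words but the last) to a Counter of next words."""
    followers = defaultdict(Counter)
    for gram in ngram_strings:
        words = gram.split()
        context, nxt = ' '.join(words[:-1]), words[-1]
        followers[context][nxt] += 1
    return followers

def predict_next(followers, context):
    """Return the most frequent word seen after `context`, or None if unseen."""
    counts = followers.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Sample input: the first three trigram strings printed above
sample_grams = ['alice adventures in', 'adventures in wonderland', 'in wonderland alice']
followers = build_followers(sample_grams)
print(predict_next(followers, 'adventures in'))
# wonderland
```

On the full book, the whole `newl_app` list could be passed in instead of the three-item sample.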
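How to do it
1. Preprocessing: convert words into word vectors
2. Model creation and validation: a convergent-divergent model that maps inputs to outputs
3. Prediction: predict the best (most likely) word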
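Steps 1 and 3 can be sketched in plain Python, assuming a one-hot encoding as the word-vector representation (the notes do not say which encoding is used). The names `build_vocab`, `one_hot`, and `best_word` are hypothetical, and a hand-made score vector stands in for the output of the model from step 2.

```python
def build_vocab(tokens):
    """Assign each distinct word an integer index (sorted for reproducibility)."""
    return {word: i for i, word in enumerate(sorted(set(tokens)))}

def one_hot(word, vocab):
    """Encode a word as a list with a single 1.0 at its vocabulary index."""
    vec = [0.0] * len(vocab)
    vec[vocab[word]] = 1.0
    return vec

def best_word(scores, vocab):
    """Return the word whose score is highest (an argmax over the vocabulary)."""
    index_to_word = {i: w for w, i in vocab.items()}
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return index_to_word[best_index]

tokens = ['alice', 'adventures', 'in', 'wonderland', 'alice']
vocab = build_vocab(tokens)   # {'adventures': 0, 'alice': 1, 'in': 2, 'wonderland': 3}
x = one_hot('in', vocab)      # [0.0, 0.0, 1.0, 0.0]

# Assumed stand-in for a model's output scores over the vocabulary
scores = [0.1, 0.2, 0.1, 0.6]
print(best_word(scores, vocab))
# wonderland
```

A trained model would produce `scores` from the encoded context words; only the encoding and the final argmax are shown here.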
Code
from __future__ import print_function
from sklearn.model_selection import train_test_split
import nltk
import numpy as np
import string