Text mining
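This tutorial shows how to use scikit-network for text mining, taking Victor Hugo's novel Les Misérables (translation by Isabel F. Hapgood) as an example. By considering the graph between words and paragraphs, both words and paragraphs can be embedded in the same vector space, and cosine similarities can be computed between them. This is used below to find similar words and similar paragraphs.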
Each word is taken as it appears in the original text; a more advanced tokenization could be used instead.
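As a minimal sketch of what a slightly more advanced tokenization could look like, the helper below uses a regular expression that keeps internal apostrophes and hyphens, so that "o'clock" stays a single token. The name tokenize and the pattern are assumptions made for illustration; they are not part of scikit-network and are not used in the rest of this tutorial.

```python
import re

# Hypothetical pattern: a token is a run of letters (accents included),
# optionally joined by single apostrophes or hyphens.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def tokenize(paragraph):
    """Return the lowercase word tokens of a paragraph."""
    return TOKEN_PATTERN.findall(paragraph.lower())


print(tokenize("About three o'clock, the four couples were sliding down."))
```

Such a function could replace the split on spaces used below, at the cost of a vocabulary that differs from the one shown in the outputs.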
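Other graphs can be considered, such as the graph of co-occurrence of words within a window of 5 words, or the graph between chapters and words. These graphs can be combined to obtain richer information and better embeddings.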
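The co-occurrence graph mentioned above is not built in this tutorial. The following is only a sketch of one way to count co-occurrences, assuming that two words co-occur when their positions are fewer than window tokens apart. The function name cooccurrence_counts is hypothetical.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count unordered pairs of distinct words fewer than `window` positions apart."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Pair the word at position i with the next window - 1 tokens.
        for other in tokens[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


tokens = "the four couples were sliding down the russian mountains".split()
counts = cooccurrence_counts(tokens, window=5)
print(counts[("couples", "four")])
print(counts[("four", "mountains")])
```

The resulting pairs and their counts can be read as a weighted edge list between words, which is one possible input for building a word graph.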
from re import sub
import numpy as np
from sknetwork.data import from_adjacency_list
from sknetwork.embedding import Spectral
from sknetwork.linalg import normalize
Load data
filename = 'miserables-en.txt'
with open(filename, 'r') as f:
    text = f.read()
len(text)
3254528
print(text[:494])
The Project Gutenberg EBook of Les Miserables, by Victor Hugo
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: Les Miserables
Complete in Five Volumes
Author: Victor Hugo
Translator: Isabel F. Hapgood
Release Date: June 22, 2008 [EBook #135]
Last Updated: October 30, 2009
Preprocessing
# extract main text
main = text.split('LES MIS')[-2].lower()
len(main)
3215617
# remove punctuation
main = sub(r"[,.;:()@#?!&$'_*]", " ", main)
main = sub(r'["-]', ' ', main)
# extract paragraphs
sep = '|||'
main = sub(r'\n\n+', sep, main)
main = sub('\n', ' ', main)
paragraphs = main.split(sep)
len(paragraphs)
13505
paragraphs[1]
'volume i fantine '
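To make the effect of these substitutions easier to follow, here is a sketch that applies the same regular expressions to a short made-up string. The string and the variable names toy and toy_paragraphs are assumptions for illustration only.

```python
from re import sub

# Made-up input: blank lines separate paragraphs, single newlines are line wraps.
toy = "VOLUME I.\n\nFANTINE\n\n\nAn upright man,\nso it was said."
toy = toy.lower()

# Replace punctuation with spaces, as in the cells above.
toy = sub(r"[,.;:()@#?!&$'_*]", " ", toy)
toy = sub(r'["-]', ' ', toy)

# Mark paragraph breaks, then turn the remaining newlines into spaces.
sep = '|||'
toy = sub(r'\n\n+', sep, toy)
toy = sub('\n', ' ', toy)

toy_paragraphs = toy.split(sep)
print(toy_paragraphs)
```

Because punctuation is replaced by spaces rather than removed, paragraphs can contain consecutive or trailing spaces, which is why empty strings are filtered out in the next step.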
Build the graph
def ignoreSpace(ch):
    # keep only non-empty strings
    return ch != ''
paragraph_words = [paragraph.split(' ') for paragraph in paragraphs]
# drop the empty strings produced by consecutive spaces
new_para = []
for paragraph_word in paragraph_words:
    new_para.append(list(filter(ignoreSpace, paragraph_word)))
paragraph_words = new_para
paragraph_words[1]
['volume', 'i', 'fantine']
graph = from_adjacency_list(paragraph_words, bipartite=True)
biadjacency = graph.biadjacency
words = graph.names_col
biadjacency.shape
(13504, 23134)
len(words)
23134
Statistics
n_row, n_col = biadjacency.shape
paragraph_lengths = biadjacency.dot(np.ones(n_col))
paragraph_lengths
array([ 1., 3., 184., ..., 13., 61., 18.])
len(paragraph_lengths)
13504
word_counts = biadjacency.T.dot(np.ones(n_row))
# compute quantiles
np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])
array([ 1. , 2. , 23. , 281.67])
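The two statistics above are the row sums and column sums of the biadjacency matrix. The sketch below rebuilds a tiny dense version of such a matrix by hand, assuming that a word repeated within a paragraph adds to the corresponding entry; this is an assumption about the weights and is worth checking against the scikit-network documentation. All names prefixed with toy_ are hypothetical.

```python
import numpy as np

toy_paragraph_words = [['volume', 'i', 'fantine'], ['i', 'love', 'i']]

# Rows are paragraphs, columns are the distinct words.
vocabulary = sorted({word for tokens in toy_paragraph_words for word in tokens})
column = {word: j for j, word in enumerate(vocabulary)}

# Entry (i, j) counts the occurrences of word j in paragraph i.
toy_biadjacency = np.zeros((len(toy_paragraph_words), len(vocabulary)))
for i, tokens in enumerate(toy_paragraph_words):
    for word in tokens:
        toy_biadjacency[i, column[word]] += 1

n_row, n_col = toy_biadjacency.shape
toy_lengths = toy_biadjacency.dot(np.ones(n_col))    # words per paragraph
toy_counts = toy_biadjacency.T.dot(np.ones(n_row))   # occurrences per word
print(vocabulary)
print(toy_lengths)
print(toy_counts)
```

On the real data the matrix is sparse, but the same products with vectors of ones give the paragraph lengths and word counts.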
Embedding
dimension = 50
spectral = Spectral(dimension, regularization=100)
spectral.fit(biadjacency)
Spectral(n_components=50, decomposition='rw', regularization=100, normalized=True)
embedding_paragraph = spectral.embedding_row_
embedding_word = spectral.embedding_col_
# some word
i = int(np.flatnonzero(words == 'love')[0])
i
12716
# most similar words
cosines_word = embedding_word.dot(embedding_word[i])
words[np.argsort(-cosines_word)[:20]]
array(['love', 'kiss', 'ye', 'celestial', 'loved', 'hearts', 'roses',
'joys', 'voluptuousness', 'sweet', 'pearl', 'blindly', 'charming',
'youth', 'angelic', 'adore', 'sweetly', 'chaste', 'marriage',
'beautiful'], dtype='<U21')
np.quantile(cosines_word, [0.01, 0.1, 0.5, 0.9, 0.99])
array([-0.25546819, -0.14986081, -0.02675247, 0.15277004, 0.43360642])
# some paragraph
i = 1000
print(paragraphs[i])
about three o clock the four couples frightened at their happiness were sliding down the russian mountains a singular edifice which then occupied the heights of beaujon and whose undulating line was visible above the trees of the champs elysees
# most similar paragraphs
cosines_paragraph = embedding_paragraph.dot(embedding_paragraph[i])
for j in np.argsort(-cosines_paragraph)[:3]:
    print(paragraphs[j])
    print()
about three o clock the four couples frightened at their happiness were sliding down the russian mountains a singular edifice which then occupied the heights of beaujon and whose undulating line was visible above the trees of the champs elysees
a man of ripe age and a young girl made their appearance on the threshold of the attic
when the man had disappeared in the thicket fauchelevent listened until he heard his footsteps die away in the distance then he leaned over the grave and said in a low tone
np.quantile(cosines_paragraph, [0.01, 0.1, 0.5, 0.9, 0.99])
array([-0.30736959, -0.18061334, -0.01655039, 0.19731086, 0.39456743])
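The dot products above can be read as cosine similarities because the embedding was fitted with normalized=True, which should place every embedding vector on the unit sphere. For an embedding that is not normalized, the vectors would need to be rescaled first. The following sketch shows this on made-up vectors; the function name top_k_cosine and the toy data are assumptions for illustration.

```python
import numpy as np


def top_k_cosine(embedding, index, k=3):
    """Return the indices and cosines of the k rows most similar to row `index`."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    # Divide each row by its norm; rows of norm zero are left unchanged.
    unit = embedding / np.where(norms > 0, norms, 1)
    cosines = unit.dot(unit[index])
    order = np.argsort(-cosines)[:k]
    return order, cosines[order]


toy_embedding = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [1.0, 1.0]])
order, values = top_k_cosine(toy_embedding, index=0, k=3)
print(order)
print(values)
```

As in the cells above, the query row itself comes first, with a cosine of 1.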