Scikit-network-13: Text mining

Text mining

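This notebook shows how to use scikit-network for text mining. We consider the novel Les Misérables by Victor Hugo (translation by Isabel F. Hapgood). By building the graph between words and paragraphs, we can embed words and paragraphs in the same vector space and compute cosine similarities between them. We then look for similar words and similar paragraphs.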

Each word is taken as it appears in the original text; a more advanced tokenization could be used instead.
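As one possible alternative to splitting on spaces, here is a minimal sketch of a regex tokenizer. The function name `tokenize` and the pattern are illustrative assumptions, not part of scikit-network or of the original notebook. The pattern only covers unaccented Latin letters, so a word such as "misérables" would be cut at the accent.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, keeping inner apostrophes."""
    # [a-z]+ matches a run of letters; (?:'[a-z]+)* keeps forms like "didn't" whole
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())

tokens = tokenize("Jean Valjean didn't speak; the bishop's candlesticks shone.")
print(tokens)
```

Unlike the punctuation replacement used later in this post, this keeps contractions and possessives as single tokens.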

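Other graphs could be considered, such as the graph of word co-occurrence within a 5-word window, or the graph between chapters and words. These graphs can be combined to obtain richer information and better embeddings.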
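To make the co-occurrence idea concrete, here is a small sketch that counts word pairs appearing within a 5-word window. The helper `cooccurrence_counts` is hypothetical and uses only the standard library; it is one way to define the window (each word is paired with the next `window - 1` words), not necessarily the definition the notebook authors had in mind.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count unordered word pairs that appear together in a sliding window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # pair the current word with the next window - 1 words
        for other in tokens[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts

tokens = "the bishop gave the silver candlesticks to the man".split()
counts = cooccurrence_counts(tokens, window=5)
print(counts[('bishop', 'the')])
```

The resulting pair counts could then serve as weighted edges of a word-word graph.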

from re import sub
import numpy as np

from sknetwork.data import from_adjacency_list
from sknetwork.embedding import Spectral
from sknetwork.linalg import normalize

Loading the data

filename = 'miserables-en.txt'

with open(filename, 'r') as f:
    text = f.read()

len(text)
3254528
print(text[:494])
The Project Gutenberg EBook of Les Miserables, by Victor Hugo

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

Title: Les Miserables
       Complete in Five Volumes

Author: Victor Hugo

Translator: Isabel F. Hapgood

Release Date: June 22, 2008 [EBook #135]
Last Updated: October 30, 2009

Preprocessing

# extract main text
main = text.split('LES MIS')[-2].lower()
len(main)
3215617
# remove punctuation
main = sub(r"[,.;:()@#?!&$'_*]", " ", main)
main = sub(r'["-]', ' ', main)
# extract paragraphs
sep = '|||'
main = sub(r'\n\n+', sep, main)
main = sub('\n', ' ', main)
paragraphs = main.split(sep)
len(paragraphs)
13505
paragraphs[1]
'volume i   fantine '
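The preprocessing steps above can be wrapped in one function and tried on a short string. This is a sketch: the name `split_paragraphs` and the sample text are made up for illustration, while the regular expressions are the ones used above.

```python
from re import sub

def split_paragraphs(raw, sep='|||'):
    """Lowercase, strip punctuation and split on blank lines."""
    cleaned = raw.lower()
    cleaned = sub(r"[,.;:()@#?!&$'_*]", " ", cleaned)
    cleaned = sub(r'["-]', ' ', cleaned)
    cleaned = sub(r'\n\n+', sep, cleaned)   # blank lines mark paragraph breaks
    cleaned = sub('\n', ' ', cleaned)       # single newlines are just line wraps
    return cleaned.split(sep)

# made-up sample with a heading, a blank line and a wrapped paragraph
sample = "VOLUME I.--FANTINE.\n\nIn 1815, M. Myriel\nwas Bishop of D----"
paragraphs_sample = split_paragraphs(sample)
print(paragraphs_sample)
```

Each punctuation character becomes a space, which is why the heading comes out with several consecutive spaces, as in `paragraphs[1]` above.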

Building the graph

# split each paragraph on spaces and drop the empty strings left by repeated spaces
paragraph_words = [[word for word in paragraph.split(' ') if word] for paragraph in paragraphs]
paragraph_words[1]
['volume', 'i', 'fantine']
graph = from_adjacency_list(paragraph_words, bipartite=True)
biadjacency = graph.biadjacency
words = graph.names_col
biadjacency.shape
(13504, 23134)
len(words)
23134
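To see what a paragraph-word biadjacency matrix contains, here is a dense, pure-Python stand-in built by hand. It is only a sketch: `count_matrix` is a hypothetical helper, and it assumes that repeated words in a paragraph add up as counts and that the vocabulary is sorted alphabetically. Neither assumption is checked against the library here, and the real matrix is sparse.

```python
from collections import Counter

def count_matrix(token_lists):
    """Return (vocabulary, rows) where rows[i][j] counts word j in list i."""
    vocabulary = sorted({word for tokens in token_lists for word in tokens})
    index = {word: j for j, word in enumerate(vocabulary)}
    rows = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for word, count in Counter(tokens).items():
            row[index[word]] = count
        rows.append(row)
    return vocabulary, rows

# two toy "paragraphs"; the second one repeats the word 'the'
toy = [['volume', 'i', 'fantine'], ['the', 'bishop', 'the', 'man']]
vocabulary, rows = count_matrix(toy)
print(vocabulary)
print(rows)
```

Rows correspond to paragraphs and columns to words, the same layout as `biadjacency` with `words` as the column names.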

Statistics

n_row, n_col = biadjacency.shape
paragraph_lengths = biadjacency.dot(np.ones(n_col))
paragraph_lengths
array([  1.,   3., 184., ...,  13.,  61.,  18.])
len(paragraph_lengths)
13504
word_counts = biadjacency.T.dot(np.ones(n_row))
# compute quantiles
np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])
array([  1.  ,   2.  ,  23.  , 281.67])
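Multiplying the matrix by a vector of ones sums each row, and doing the same with the transpose sums each column. The sketch below repeats the computation on a tiny dense matrix whose values are made up, so the sums can be checked by eye.

```python
import numpy as np

# Toy paragraph-word count matrix (2 paragraphs, 4 words); values are made up.
toy_biadjacency = np.array([[1, 1, 1, 0],
                            [0, 2, 0, 3]])
n_row, n_col = toy_biadjacency.shape

paragraph_lengths = toy_biadjacency.dot(np.ones(n_col))   # row sums: words per paragraph
word_counts = toy_biadjacency.T.dot(np.ones(n_row))       # column sums: occurrences per word

print(paragraph_lengths)
print(word_counts)
print(np.quantile(word_counts, 0.5))
```

On the novel, the quantiles printed above show a heavy tail: half of the words occur at most twice, while the top 1% occur more than about 280 times.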

Embedding

dimension = 50
spectral = Spectral(dimension, regularization=100)
spectral.fit(biadjacency)
Spectral(n_components=50, decomposition='rw', regularization=100, normalized=True)
embedding_paragraph = spectral.embedding_row_
embedding_word = spectral.embedding_col_
# some word
i = int(np.flatnonzero(words == 'love')[0])
i
12716
# most similar words
cosines_word = embedding_word.dot(embedding_word[i])
words[np.argsort(-cosines_word)[:20]]
array(['love', 'kiss', 'ye', 'celestial', 'loved', 'hearts', 'roses',
       'joys', 'voluptuousness', 'sweet', 'pearl', 'blindly', 'charming',
       'youth', 'angelic', 'adore', 'sweetly', 'chaste', 'marriage',
       'beautiful'], dtype='<U21')
np.quantile(cosines_word, [0.01, 0.1, 0.5, 0.9, 0.99])
array([-0.25546819, -0.14986081, -0.02675247,  0.15277004,  0.43360642])
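A dot product equals the cosine similarity only when the vectors have unit norm. The estimator printed above shows `normalized=True`, which suggests this holds for the embedding here, but that is an assumption. The sketch below normalizes explicitly, so it works either way. The helper `most_similar` and the toy values are made up for illustration.

```python
import numpy as np

def most_similar(embedding, i, k=3):
    """Return indices of the k rows with highest cosine similarity to row i."""
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    unit = embedding / np.where(norms > 0, norms, 1.0)   # zero rows stay at zero
    cosines = unit.dot(unit[i])
    return np.argsort(-cosines)[:k], cosines

# toy 2-dimensional embedding of 4 items
toy_embedding = np.array([[1.0, 0.0],
                          [2.0, 0.1],
                          [0.0, 3.0],
                          [-1.0, 0.0]])
top, cosines = most_similar(toy_embedding, i=0, k=2)
print(top)
```

As in the word list above, the query item itself always comes first, with cosine 1.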
# some paragraph
i = 1000
print(paragraphs[i])
about three o clock the four couples  frightened at their happiness  were sliding down the russian mountains  a singular edifice which then occupied the heights of beaujon  and whose undulating line was visible above the trees of the champs elysees 
# most similar paragraphs
cosines_paragraph = embedding_paragraph.dot(embedding_paragraph[i])
for j in np.argsort(-cosines_paragraph)[:3]:
    print(paragraphs[j])
    print()
about three o clock the four couples  frightened at their happiness  were sliding down the russian mountains  a singular edifice which then occupied the heights of beaujon  and whose undulating line was visible above the trees of the champs elysees 

a man of ripe age and a young girl made their appearance on the threshold of the attic 

when the man had disappeared in the thicket  fauchelevent listened until he heard his footsteps die away in the distance  then he leaned over the grave  and said in a low tone   
np.quantile(cosines_paragraph, [0.01, 0.1, 0.5, 0.9, 0.99])
array([-0.30736959, -0.18061334, -0.01655039,  0.19731086,  0.39456743])
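Since words and paragraphs are embedded in the same vector space, a word can also be compared with paragraphs directly. The sketch below uses small made-up arrays in place of `embedding_word` and `embedding_paragraph`. The helper `nearest_rows` is hypothetical and assumes that no vector is all zeros.

```python
import numpy as np

def nearest_rows(query, embedding, k=3):
    """Indices of the k rows of `embedding` closest to `query` by cosine similarity."""
    # assumes neither the query nor any row is the zero vector
    query = query / np.linalg.norm(query)
    unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    return np.argsort(-unit.dot(query))[:k]

# Toy stand-ins for embedding_word and embedding_paragraph (values are made up).
toy_word = np.array([[1.0, 0.0], [0.0, 1.0]])
toy_paragraph = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])

# paragraphs closest to word 0
closest = nearest_rows(toy_word[0], toy_paragraph, k=2)
print(closest)
```

With the real arrays, the call would look like `nearest_rows(embedding_word[i], embedding_paragraph)`, where `i` is the index of a word in `words`.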
