Scikit-network-13: Text mining

uncle_ll

Category: Scikit-network. Tags: scikit-network, Text mining

Article link: https://blog.csdn.net/uncle_ll/article/details/131252173

Text mining

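This notebook shows how to use scikit-network for text mining, taking Victor Hugo's novel Les Misérables (translated by Isabel F. Hapgood) as the example. By considering the bipartite graph between words and paragraphs, words and paragraphs can be embedded in the same vector space, and cosine similarities can then be computed between them. The embedding is used below to find similar words and similar paragraphs.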

Each word is taken as it appears in the original text; a more advanced tokenization could be used instead.
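As one possible refinement of plain whitespace splitting, here is a minimal regex tokenizer sketch. The `tokenize` helper and its pattern are hypothetical and not part of scikit-network; unlike the preprocessing in this post, it keeps inner apostrophes, and it only recognizes the letters a to z, so accented words would be split.

```python
import re

# Hypothetical helper for illustration, not part of scikit-network.
# A token is a run of lowercase letters, optionally joined by inner apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(paragraph):
    """Lowercase the text and return its alphabetic tokens."""
    return TOKEN_PATTERN.findall(paragraph.lower())


print(tokenize("About three o'clock the four couples, frightened at their happiness..."))
```

The list returned for each paragraph could take the place of the space-split word lists when building the graph.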

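Other graphs can be considered, such as the graph of co-occurrence of words within a window of 5 words, or the graph between chapters and words. These graphs can be combined to obtain richer information and better embeddings.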
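The word co-occurrence graph mentioned above could be sketched as follows. The helper `cooccurrence_counts` is hypothetical, and the window convention (a pair counts when both words fall within 5 consecutive tokens) is an assumption, since the post does not define it precisely.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count unordered pairs of distinct words lying within `window` consecutive tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Pair the current token with the next window - 1 tokens.
        for other in tokens[i + 1:i + window]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


tokens = "the four couples were sliding down the russian mountains".split()
counts = cooccurrence_counts(tokens, window=5)
print(counts[("sliding", "the")])
```

Each key is an edge between two words and each value is its weight, so the result can be read as a weighted edge list.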

from re import sub
import numpy as np

from sknetwork.data import from_adjacency_list
from sknetwork.embedding import Spectral
from sknetwork.linalg import normalize

Loading the data

filename = 'miserables-en.txt'

with open(filename, 'r') as f:
    text = f.read()

len(text)

print(text[:494])

The Project Gutenberg EBook of Les Miserables, by Victor Hugo

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

Title: Les Miserables
       Complete in Five Volumes

Author: Victor Hugo

Translator: Isabel F. Hapgood

Release Date: June 22, 2008 [EBook #135]
Last Updated: October 30, 2009

Preprocessing

# extract main text
main = text.split('LES MIS')[-2].lower()
len(main)

# remove punctuation
main = sub(r"[,.;:()@#?!&$'_*]", " ", main)
main = sub(r'["-]', ' ', main)

# extract paragraphs
sep = '|||'
main = sub(r'\n\n+', sep, main)
main = sub('\n', ' ', main)
paragraphs = main.split(sep)

len(paragraphs)

paragraphs[1]

'volume i   fantine '

Building the graph

# split on single spaces and drop the empty strings left by repeated spaces
paragraph_words = [[word for word in paragraph.split(' ') if word]
                   for paragraph in paragraphs]
paragraph_words[1]

['volume', 'i', 'fantine']

graph = from_adjacency_list(paragraph_words, bipartite=True)

biadjacency = graph.biadjacency
words = graph.names_col

biadjacency.shape

(13504, 23134)

len(words)

Statistics

n_row, n_col = biadjacency.shape

paragraph_lengths = biadjacency.dot(np.ones(n_col))
paragraph_lengths

array([  1.,   3., 184., ...,  13.,  61.,  18.])

len(paragraph_lengths)

word_counts = biadjacency.T.dot(np.ones(n_row))

# compute quantiles
np.quantile(word_counts, [0.1, 0.5, 0.9, 0.99])

array([  1.  ,   2.  ,  23.  , 281.67])
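To make these quantities concrete, here is a dependency-free toy version of the paragraph-word matrix: rows are paragraphs, columns are distinct words, row sums give paragraph lengths and column sums give word counts. The helper `build_biadjacency` is hypothetical. It counts repeated words with multiplicity and sorts the column names, both of which are assumptions about the library's behavior rather than documented facts.

```python
from collections import Counter


def build_biadjacency(paragraph_words):
    """Return (column names, rows), where rows[i] maps a column index to a word count."""
    names = sorted({word for words in paragraph_words for word in words})
    index = {word: j for j, word in enumerate(names)}
    rows = [dict(Counter(index[word] for word in words)) for words in paragraph_words]
    return names, rows


toy = [["volume", "i", "fantine"], ["i", "love", "love"]]
names, rows = build_biadjacency(toy)

# Row sums: number of words in each paragraph.
toy_paragraph_lengths = [sum(row.values()) for row in rows]
# Column sums: number of occurrences of each word over all paragraphs.
toy_word_counts = [sum(row.get(j, 0) for row in rows) for j in range(len(names))]

print(names)
print(toy_paragraph_lengths)
print(toy_word_counts)
```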

Embedding

dimension = 50
spectral = Spectral(dimension, regularization=100)

spectral.fit(biadjacency)

Spectral(n_components=50, decomposition='rw', regularization=100, normalized=True)

embedding_paragraph = spectral.embedding_row_
embedding_word = spectral.embedding_col_

# some word
i = int(np.flatnonzero(words == 'love')[0])
i

# most similar words
cosines_word = embedding_word.dot(embedding_word[i])
words[np.argsort(-cosines_word)[:20]]

array(['love', 'kiss', 'ye', 'celestial', 'loved', 'hearts', 'roses',
       'joys', 'voluptuousness', 'sweet', 'pearl', 'blindly', 'charming',
       'youth', 'angelic', 'adore', 'sweetly', 'chaste', 'marriage',
       'beautiful'], dtype='<U21')

np.quantile(cosines_word, [0.01, 0.1, 0.5, 0.9, 0.99])

array([-0.25546819, -0.14986081, -0.02675247,  0.15277004,  0.43360642])
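The dot products above are called cosines because the fitted estimator reports normalized=True, which is taken here to mean that every embedding vector has unit norm. Under that assumption a plain dot product equals the cosine similarity, as this small dependency-free check illustrates (the helpers `cosine` and `unit` are hypothetical):

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unit(u):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]


u, v = [3.0, 4.0], [4.0, 3.0]
# After normalization, the dot product alone gives the cosine.
plain_dot = sum(a * b for a, b in zip(unit(u), unit(v)))
print(cosine(u, v), plain_dot)
```

If the embedding were not normalized, the vectors would need to be rescaled first for the same ranking to hold.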

# some paragraph
i = 1000
print(paragraphs[i])

about three o clock the four couples  frightened at their happiness  were sliding down the russian mountains  a singular edifice which then occupied the heights of beaujon  and whose undulating line was visible above the trees of the champs elysees

# most similar paragraphs
cosines_paragraph = embedding_paragraph.dot(embedding_paragraph[i])
for j in np.argsort(-cosines_paragraph)[:3]:
    print(paragraphs[j])
    print()

about three o clock the four couples  frightened at their happiness  were sliding down the russian mountains  a singular edifice which then occupied the heights of beaujon  and whose undulating line was visible above the trees of the champs elysees 

a man of ripe age and a young girl made their appearance on the threshold of the attic 

when the man had disappeared in the thicket  fauchelevent listened until he heard his footsteps die away in the distance  then he leaned over the grave  and said in a low tone

np.quantile(cosines_paragraph, [0.01, 0.1, 0.5, 0.9, 0.99])

array([-0.30736959, -0.18061334, -0.01655039,  0.19731086,  0.39456743])
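Note that the first result returned for both the word and the paragraph query is the query itself, since its cosine with itself is the largest. A minimal sketch of a ranking helper that skips the query is given below; `top_k_similar` is a hypothetical name, and it works on any sequence of similarity scores.

```python
import heapq


def top_k_similar(similarities, query, k=3):
    """Return the indices of the k largest similarities, skipping the query itself."""
    candidates = (j for j in range(len(similarities)) if j != query)
    return heapq.nlargest(k, candidates, key=lambda j: similarities[j])


similarities = [0.10, 1.00, 0.35, -0.20, 0.80]
print(top_k_similar(similarities, query=1, k=2))
```

Applied to the scores computed above, it would be called with the query index `i` in place of `query`.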
