pyspark SparseVector word vectors

luoganttcc


from collections import Counter

from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector

if __name__ == "__main__":

    sc = SparkContext('local', 'term_doc')
    corpus = sc.parallelize([
        "It is the east, and Juliet is the sun.",
        "A dish fit for the gods.",
        "Brevity is the soul of wit."])

    # Split each document on whitespace; cached because it is used twice below.
    tokens = corpus.map(lambda raw_text: raw_text.split()).cache()
    # Give every distinct token an integer index and collect the mapping to the driver.
    local_vocab_map = tokens.flatMap(lambda doc_tokens: doc_tokens).distinct().zipWithIndex().collectAsMap()

    # Broadcast the vocabulary and its size so the executors can read them.
    vocab_map = sc.broadcast(local_vocab_map)
    vocab_size = sc.broadcast(len(local_vocab_map))

    # Per document: count tokens, map token -> index, build a SparseVector of counts.
    term_document_matrix = tokens \
        .map(Counter) \
        .map(lambda counts: {vocab_map.value[token]: float(counts[token]) for token in counts}) \
        .map(lambda index_counts: SparseVector(vocab_size.value, index_counts))

    for doc in term_document_matrix.collect():
        print(doc)

    sc.stop()

Output (one line per document, in the form (vocabulary size, [token indices], [counts])):

(16,[0,1,2,3,4,5,6],[1.0,2.0,2.0,1.0,1.0,1.0,1.0])
(16,[2,7,8,9,10,11],[1.0,1.0,1.0,1.0,1.0,1.0])
(16,[1,2,12,13,14,15],[1.0,1.0,1.0,1.0,1.0,1.0])
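
To make the printed vectors easier to read, here is a minimal plain-Python sketch of the same idea without Spark. It assumes tokens are indexed in order of first appearance, which happens to match the output shown here; Spark itself does not promise any particular order from distinct(), so indices may differ on another run or cluster. The helper names build_vocab and to_sparse are hypothetical and exist only for this illustration.

```python
from collections import Counter

corpus = [
    "It is the east, and Juliet is the sun.",
    "A dish fit for the gods.",
    "Brevity is the soul of wit.",
]


def build_vocab(docs):
    """Map each distinct token to an index, in order of first appearance (assumption)."""
    vocab = {}
    for doc in docs:
        for token in doc.split():
            # setdefault only assigns a new index the first time a token is seen
            vocab.setdefault(token, len(vocab))
    return vocab


def to_sparse(doc, vocab):
    """Return (vocabulary size, sorted token indices, counts) for one document."""
    counts = Counter(doc.split())
    pairs = sorted((vocab[token], float(n)) for token, n in counts.items())
    indices = [index for index, _ in pairs]
    values = [value for _, value in pairs]
    return len(vocab), indices, values


vocab = build_vocab(corpus)
rows = [to_sparse(doc, vocab) for doc in corpus]
for row in rows:
    print(row)
```

Each tuple mirrors one SparseVector: only the tokens that occur in a document are stored, so "is" and "the" (indices 1 and 2) carry a count of 2.0 in the first document while the other 9 vocabulary entries are simply absent.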
