Spark MLlib, Part 1: Data Types

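MLlib supports local vectors and local matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs.
In MLlib, a labeled point represents a training example for supervised learning.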

local vector

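MLlib supports two types of local vector: dense and sparse.
A dense vector is the simpler of the two: [1.0, 0.0, 3.0] represents the vector (1.0, 0.0, 3.0).
In sparse format the same vector is written (3, [0, 2], [1.0, 3.0]), where 3 is the size of the vector, [0, 2] are the indices of its nonzero entries, and [1.0, 3.0] are the corresponding values.
The base class of local vectors is Vector, with two implementations: DenseVector and SparseVector. The factory methods in Vectors are the recommended way to create local vectors.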

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Create a dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its indices and values corresponding to nonzero entries.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// Create a sparse vector (1.0, 0.0, 3.0) by specifying its nonzero entries.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
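
To make the relationship between the two formats concrete, here is a minimal sketch in plain Scala that converts between a dense array and the (size, indices, values) triple. It does not call MLlib; the object name SparseVectorSketch and its methods are hypothetical and only illustrate the layout described above.

```scala
object SparseVectorSketch {
  // Expand a (size, indices, values) triple into a dense array.
  // Positions not listed in `indices` are implicitly zero.
  def toDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
    require(indices.length == values.length, "indices and values must have the same length")
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }

  // Compress a dense array into (size, indices, values), keeping only nonzero entries.
  def toSparse(dense: Array[Double]): (Int, Array[Int], Array[Double]) = {
    val nonzero = dense.zipWithIndex.filter { case (v, _) => v != 0.0 }
    (dense.length, nonzero.map(_._2), nonzero.map(_._1))
  }
}
```

The sparse form pays off only when most entries are zero, since it stores one index and one value per nonzero entry.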

labeled point

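A labeled point is a local vector, either dense or sparse, paired with a label. Because the label is stored as a double, labeled points can be used for both regression and classification. For classification, labels should be class indices starting from zero: 0, 1, 2, ...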

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a labeled point with label 1.0 and a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// Create a labeled point with label 0.0 and a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

Sparse data

It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:

label index1:value1 index2:value2 ...

where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
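
The index shift described above can be sketched for a single line in plain Scala. This is not how MLUtils is implemented; LibSvmLineSketch and parse are hypothetical names, and the sketch assumes a well-formed, non-empty line.

```scala
object LibSvmLineSketch {
  // Parse one LIBSVM line into (label, zero-based indices, values).
  def parse(line: String): (Double, Array[Int], Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val pairs = tokens.tail.map { token =>
      val parts = token.split(":")
      // LIBSVM indices are one-based; shift them to zero-based.
      (parts(0).toInt - 1, parts(1).toDouble)
    }
    (label, pairs.map(_._1), pairs.map(_._2))
  }
}
```

For example, the line "1 1:1.0 3:3.0" yields label 1.0 with indices (0, 2) and values (1.0, 3.0), which is the sparse vector used earlier.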

local matrix

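The base class of local matrices is Matrix, with two implementations: DenseMatrix and SparseMatrix. The factory methods in Matrices are the recommended way to create local matrices. Remember that local matrices are stored in column-major order.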

import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))

// Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9, 6, 8))
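
The two argument layouts above are easy to misread, so here is a rough sketch in plain Scala of how they can be decoded. It assumes the dense array is column-major and that the three sparse arrays are column pointers, row indices, and values in compressed sparse column (CSC) order. LocalMatrixLayoutSketch and its methods are hypothetical names, not MLlib calls.

```scala
object LocalMatrixLayoutSketch {
  // Column-major lookup: entry (i, j) of a dense matrix with numRows rows.
  def denseAt(numRows: Int, values: Array[Double], i: Int, j: Int): Double =
    values(j * numRows + i)

  // Expand CSC arrays (colPtrs, rowIndices, values) into a row-by-row 2D array.
  // Column j owns the entries at positions colPtrs(j) until colPtrs(j + 1).
  def cscToRows(numRows: Int, numCols: Int, colPtrs: Array[Int],
                rowIndices: Array[Int], values: Array[Double]): Array[Array[Double]] = {
    val rows = Array.fill(numRows, numCols)(0.0)
    for (j <- 0 until numCols; k <- colPtrs(j) until colPtrs(j + 1)) {
      rows(rowIndices(k))(j) = values(k)
    }
    rows
  }
}
```

With the sparse arguments from the example, column 0 holds one entry (row 0, value 9) and column 1 holds two (row 2, value 6 and row 1, value 8), which gives the matrix in the comment.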

Distributed matrix

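A distributed matrix is stored in one or more RDDs. Choosing the right storage format matters for large distributed matrices: converting a distributed matrix to a different format requires a global shuffle, which is very expensive. Four types of distributed matrix are described here.

  • RowMatrix is the basic type: a row-oriented distributed matrix, for example a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector.
  • IndexedRowMatrix is similar to RowMatrix but has row indices, which can be used to identify rows and to perform joins.
  • CoordinateMatrix is stored in coordinate list (COO) format, backed by an RDD of its entries.
  • BlockMatrix is backed by an RDD of sub-matrix blocks.
    Note: the RDDs backing a distributed matrix must be deterministic.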

RowMatrix

A RowMatrix can be created from an RDD[Vector]. It can compute column summary statistics and decompositions such as QR, SVD, and PCA.

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// QR decomposition 
val qrResult = mat.tallSkinnyQR(true)
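
Column statistics on a row-oriented matrix amount to per-column reductions over the rows. The sketch below shows the idea for column means on an in-memory collection in plain Scala. ColumnStatsSketch is a hypothetical name and this is not the MLlib implementation, which works on an RDD.

```scala
object ColumnStatsSketch {
  // Per-column mean over a collection of equally sized rows.
  def columnMeans(rows: Seq[Array[Double]]): Array[Double] = {
    require(rows.nonEmpty, "need at least one row")
    val numCols = rows.head.length
    val sums = Array.fill(numCols)(0.0)
    // Accumulate each row into the running column sums.
    for (row <- rows; j <- 0 until numCols) sums(j) += row(j)
    sums.map(_ / rows.length)
  }
}
```

Because each row contributes independently to the sums, the same reduction can be done partition by partition, which is why a row-oriented layout suits this kind of statistic.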

IndexedRowMatrix

IndexedRowMatrix can be converted to a RowMatrix by dropping its row indices.

import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
import org.apache.spark.rdd.RDD

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()

CoordinateMatrix

A CoordinateMatrix is a distributed matrix backed by an RDD of its entries. Each entry is a tuple of (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry value. A CoordinateMatrix should be used only when both dimensions of the matrix are huge and the matrix is very sparse.
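
As a rough illustration of the entry layout, the plain Scala sketch below works on an in-memory list of (i, j, value) tuples. It infers a matrix size as the largest index plus one in each dimension (an assumption about how the size is derived when it is not given) and groups entries into sparse rows. CoordinateEntriesSketch is a hypothetical name and none of this calls MLlib.

```scala
object CoordinateEntriesSketch {
  // Matrix size inferred from the entries: largest index plus one in each dimension.
  def size(entries: Seq[(Long, Long, Double)]): (Long, Long) =
    (entries.map(_._1).max + 1, entries.map(_._2).max + 1)

  // Group (i, j, value) entries by row index. Each row becomes
  // (column indices, values), sorted by column index.
  def toSparseRows(entries: Seq[(Long, Long, Double)]): Map[Long, (Seq[Long], Seq[Double])] =
    entries.groupBy(_._1).map { case (i, rowEntries) =>
      val sorted = rowEntries.sortBy(_._2)
      (i, (sorted.map(_._2), sorted.map(_._3)))
    }
}
```

Grouping by row is the in-memory analogue of converting coordinate entries into indexed rows; on an RDD the same grouping requires a shuffle.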

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()

BlockMatrix

A BlockMatrix is a distributed matrix backed by an RDD of MatrixBlocks, where a MatrixBlock is a tuple of ((Int, Int), Matrix), where the (Int, Int) is the index of the block, and Matrix is the sub-matrix at the given index with size rowsPerBlock x colsPerBlock. BlockMatrix supports methods such as add and multiply with another BlockMatrix. BlockMatrix also has a helper function validate which can be used to check whether the BlockMatrix is set up properly.

import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of (i, j, v) matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val coordMat: CoordinateMatrix = new CoordinateMatrix(entries)
// Transform the CoordinateMatrix to a BlockMatrix
val matA: BlockMatrix = coordMat.toBlockMatrix().cache()

// Validate whether the BlockMatrix is set up properly. Throws an Exception when it is not valid.
// Nothing happens if it is valid.
matA.validate()

// Calculate A^T A.
val ata = matA.transpose.multiply(matA)
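
The block index in the description above can be understood as integer division of the global indices by the block size. The plain Scala sketch below shows that mapping under this assumption. BlockIndexSketch and its methods are hypothetical names, not part of MLlib.

```scala
object BlockIndexSketch {
  // For a global entry (i, j), return ((blockRow, blockCol), (rowInBlock, colInBlock)).
  def locate(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): ((Int, Int), (Int, Int)) = {
    val blockIndex = ((i / rowsPerBlock).toInt, (j / colsPerBlock).toInt)
    val offset = ((i % rowsPerBlock).toInt, (j % colsPerBlock).toInt)
    (blockIndex, offset)
  }

  // Number of blocks needed to cover n rows (or columns), rounding up.
  def numBlocks(n: Long, perBlock: Int): Int =
    ((n + perBlock - 1) / perBlock).toInt
}
```

With 2 x 2 blocks, the entry at (5, 2) falls in block (2, 1) at local position (1, 0). Blocks on the last block row or column may be smaller than rowsPerBlock x colsPerBlock when the matrix size is not a multiple of the block size.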