3 Local matrix
Currently only DenseMatrix is supported; SparseMatrix is expected to appear in a later version.
scala> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

scala> // Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
scala> val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
dm: org.apache.spark.mllib.linalg.Matrix =
1.0 2.0
3.0 4.0
5.0 6.0
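One point worth noting (not shown in the original transcript): Matrices.dense(numRows, numCols, values) reads the values array in column-major order, which is why Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0) above produces the matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)). A minimal sketch with a 2x3 shape to illustrate:

import org.apache.spark.mllib.linalg.Matrices

// With 2 rows and 3 columns the array is consumed column by column,
// giving the matrix ((1.0, 3.0, 5.0), (2.0, 4.0, 6.0)).
val dm2 = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
println(dm2)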
4 Distributed matrix
A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. Choosing the right format is very important when storing large distributed matrices: converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Three types of distributed matrices are covered here.
Note:
The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.
4.1 RowMatrix
A RowMatrix is a row-oriented distributed matrix: it is backed by an RDD of its rows, where each row is a local vector, and the rows are stored distributively. This is similar to a data matrix in multivariate statistics. Because each row is represented by a local vector, the number of columns is limited by the integer range, but in practice it should be much smaller.
Let's walk through this carefully:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
Question: how should the RDD[Vector] be generated, in particular when the data comes from a file, e.g. a text file?
Answer:
Suppose there is a file data.txt on HDFS at hdfs://node001:9000/spark/input/data.txt holding a 3x3 matrix, with the elements separated by the tab character \t:
1 1 2
2 3 4
5 6 7
* Read the file into an RDD; each line of the text file becomes one String ---------------------------> RDD[String]
scala> val textfile=sc.textFile("hdfs://node001:9000/spark/input/data.txt")
14/07/11 00:45:33 INFO MemoryStore: ensureFreeSpace(82268) called with curMem=249772, maxMem=309225062
14/07/11 00:45:33 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 80.3 KB, free 294.6 MB)
textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[13] at textFile at <console>:21
* Split each line on the tab character; every number is still a String, and each line becomes an Array[String] -------------------------------> RDD[Array[String]]
scala> val middle=textfile.map((arg) =>arg.split("\\t"))
middle: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[14] at map at <console>:23
* Convert every number in each row from String to Double -------------------------------> RDD[Array[Double]]
scala> val mid=middle.map((arg)=>arg.map((args)=>args.toDouble))
mid: org.apache.spark.rdd.RDD[Array[Double]] = MappedRDD[17] at map at <console>:25
* Define a function change that turns an Array[Double] into a Vector (this assumes Vector and Vectors have already been imported in the shell, e.g. import org.apache.spark.mllib.linalg.{Vector, Vectors}).
scala> def change(t:Array[Double]):Vector={
| val x=Vectors.dense(t)
| x
| }
change: (t: Array[Double])org.apache.spark.mllib.linalg.Vector
* Map with change so that each line of the text file yields a Vector ---------------------------------------------> RDD[Vector]
scala> val ha=mid.map(change)
ha: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[18] at map at <console>:29
* Create a RowMatrix from the RDD[Vector].
scala> val mat:RowMatrix=new RowMatrix(ha)
mat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@2ec01ae5
* Next, compute the number of rows and columns of the matrix.
scala> val m=mat.numRows()
14/07/11 01:05:50 INFO SparkContext: Starting job: count at RowMatrix.scala:194
14/07/11 01:05:50 INFO DAGScheduler: Got job 2 (count at RowMatrix.scala:194) with 2 output partitions (allowLocal=false)
14/07/11 01:05:50 INFO DAGScheduler: Final stage: Stage 2(count at RowMatrix.scala:194)
14/07/11 01:05:50 INFO DAGScheduler: Parents of final stage: List()
14/07/11 01:05:50 INFO DAGScheduler: Missing parents: List()
14/07/11 01:05:50 INFO DAGScheduler: Submitting Stage 2 (MappedRDD[18] at map at <console>:29), which has no missing parents
14/07/11 01:05:50 INFO DAGScheduler: Submitting 2 missing tasks from Stage 2 (MappedRDD[18] at map at <console>:29)
14/07/11 01:05:50 INFO TaskSchedulerImpl: Adding task set 2.0 with 2 tasks
14/07/11 01:05:50 INFO TaskSetManager: Starting task 2.0:0 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 01:05:50 INFO TaskSetManager: Serialized task 2.0:0 as 2827 bytes in 0 ms
14/07/11 01:05:50 INFO TaskSetManager: Starting task 2.0:1 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 01:05:50 INFO TaskSetManager: Serialized task 2.0:1 as 2827 bytes in 0 ms
14/07/11 01:05:50 INFO Executor: Running task ID 4
14/07/11 01:05:50 INFO Executor: Running task ID 5
14/07/11 01:05:50 INFO BlockManager: Found block broadcast_2 locally
14/07/11 01:05:50 INFO BlockManager: Found block broadcast_2 locally
14/07/11 01:05:50 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 01:05:50 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 01:05:50 INFO Executor: Serialized size of result for 5 is 597
14/07/11 01:05:50 INFO Executor: Serialized size of result for 4 is 597
14/07/11 01:05:50 INFO Executor: Sending result for 5 directly to driver
14/07/11 01:05:50 INFO Executor: Sending result for 4 directly to driver
14/07/11 01:05:50 INFO Executor: Finished task ID 5
14/07/11 01:05:50 INFO Executor: Finished task ID 4
14/07/11 01:05:50 INFO DAGScheduler: Completed ResultTask(2, 1)
14/07/11 01:05:50 INFO TaskSetManager: Finished TID 5 in 26 ms on localhost (progress: 1/2)
14/07/11 01:05:50 INFO DAGScheduler: Completed ResultTask(2, 0)
14/07/11 01:05:50 INFO TaskSetManager: Finished TID 4 in 28 ms on localhost (progress: 2/2)
14/07/11 01:05:50 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/07/11 01:05:50 INFO DAGScheduler: Stage 2 (count at RowMatrix.scala:194) finished in 0.028 s
14/07/11 01:05:50 INFO SparkContext: Job finished: count at RowMatrix.scala:194, took 0.050264151 s
m: Long = 3
scala> val n=mat.numCols()
14/07/11 01:06:02 INFO SparkContext: Starting job: first at RowMatrix.scala:186
14/07/11 01:06:02 INFO DAGScheduler: Got job 3 (first at RowMatrix.scala:186) with 1 output partitions (allowLocal=true)
14/07/11 01:06:02 INFO DAGScheduler: Final stage: Stage 3(first at RowMatrix.scala:186)
14/07/11 01:06:02 INFO DAGScheduler: Parents of final stage: List()
14/07/11 01:06:02 INFO DAGScheduler: Missing parents: List()
14/07/11 01:06:02 INFO DAGScheduler: Computing the requested partition locally
14/07/11 01:06:02 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 01:06:02 INFO SparkContext: Job finished: first at RowMatrix.scala:186, took 0.011817442 s
n: Long = 3
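For reference, the step-by-step pipeline above can be written as a few chained transformations. The following is a minimal sketch (not part of the original shell session), assuming the same tab-separated file at hdfs://node001:9000/spark/input/data.txt and the sc SparkContext provided by spark-shell:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// textFile -> split on tabs -> parse doubles -> dense vectors -> RowMatrix
val rows = sc.textFile("hdfs://node001:9000/spark/input/data.txt")
             .map(line => line.split("\\t").map(_.toDouble))
             .map(values => Vectors.dense(values))
val mat = new RowMatrix(rows)
println(mat.numRows() + " x " + mat.numCols())   // prints 3 x 3 for data.txt above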
4.1.1 Multivariate summary statistics
We provide column summary statistics for RowMatrix. If the number of columns is not very large, say smaller than 3000, you can also compute the covariance matrix as a local matrix, which requires O(n^2) storage, where n is the number of columns. The total CPU time is O(m n^2), where m is the number of rows, and it is faster for sparse rows.
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary
val mat: RowMatrix = ... // a RowMatrix

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean)        // a dense vector containing the mean value for each column
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column

// Compute the covariance matrix.
val Cov: Matrix = mat.computeCovariance()
Experiment:
scala> val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
14/07/11 04:23:50 INFO SparkContext: Starting job: aggregate at RowMatrix.scala:374
14/07/11 04:23:50 INFO DAGScheduler: Got job 7 (aggregate at RowMatrix.scala:374) with 2 output partitions (allowLocal=false)
14/07/11 04:23:50 INFO DAGScheduler: Final stage: Stage 7(aggregate at RowMatrix.scala:374)
14/07/11 04:23:50 INFO DAGScheduler: Parents of final stage: List()
14/07/11 04:23:50 INFO DAGScheduler: Missing parents: List()
14/07/11 04:23:50 INFO DAGScheduler: Submitting Stage 7 (MappedRDD[21] at map at RowMatrix.scala:374), which has no missing parents
14/07/11 04:23:50 INFO DAGScheduler: Submitting 2 missing tasks from Stage 7 (MappedRDD[21] at map at RowMatrix.scala:374)
14/07/11 04:23:50 INFO TaskSchedulerImpl: Adding task set 7.0 with 2 tasks
14/07/11 04:23:50 INFO TaskSetManager: Starting task 7.0:0 as TID 12 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 04:23:50 INFO TaskSetManager: Serialized task 7.0:0 as 3252 bytes in 1 ms
14/07/11 04:23:50 INFO TaskSetManager: Starting task 7.0:1 as TID 13 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 04:23:50 INFO TaskSetManager: Serialized task 7.0:1 as 3252 bytes in 1 ms
14/07/11 04:23:50 INFO Executor: Running task ID 12
14/07/11 04:23:50 INFO Executor: Running task ID 13
14/07/11 04:23:50 INFO BlockManager: Found block broadcast_2 locally
14/07/11 04:23:50 INFO BlockManager: Found block broadcast_2 locally
14/07/11 04:23:50 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 04:23:50 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 04:23:50 INFO Executor: Serialized size of result for 12 is 1898
14/07/11 04:23:50 INFO Executor: Sending result for 12 directly to driver
14/07/11 04:23:50 INFO Executor: Finished task ID 12
14/07/11 04:23:50 INFO DAGScheduler: Completed ResultTask(7, 0)
14/07/11 04:23:50 INFO Executor: Serialized size of result for 13 is 1898
14/07/11 04:23:50 INFO TaskSetManager: Finished TID 12 in 16 ms on localhost (progress: 1/2)
14/07/11 04:23:50 INFO Executor: Sending result for 13 directly to driver
14/07/11 04:23:50 INFO Executor: Finished task ID 13
14/07/11 04:23:50 INFO DAGScheduler: Completed ResultTask(7, 1)
14/07/11 04:23:50 INFO TaskSetManager: Finished TID 13 in 17 ms on localhost (progress: 2/2)
14/07/11 04:23:50 INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, from pool
14/07/11 04:23:50 INFO DAGScheduler: Stage 7 (aggregate at RowMatrix.scala:374) finished in 0.018 s
14/07/11 04:23:50 INFO SparkContext: Job finished: aggregate at RowMatrix.scala:374, took 0.029782399 s
summary: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.linalg.distributed.ColumnStatisticsAggregator@5a30f467
scala> println(summary.mean) // per column
[2.6666666666666665,3.3333333333333335,4.333333333333333]
scala> println(summary.variance) // per column
[4.333333333333333,6.333333333333333,6.333333333333333]
scala> println(summary.numNonzeros)
[3.0,3.0,3.0]
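A quick sanity check on these numbers: the first column of data.txt is (1, 2, 5), so its mean is (1 + 2 + 5) / 3 = 8/3 ≈ 2.667, matching the first entry of summary.mean. The variance uses the unbiased (n - 1) denominator: ((1 - 8/3)^2 + (2 - 8/3)^2 + (5 - 8/3)^2) / 2 = (25/9 + 4/9 + 49/9) / 2 = 13/3 ≈ 4.333, matching the first entry of summary.variance. summary.numNonzeros is [3.0, 3.0, 3.0] because no column contains a zero.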
scala> val Cov: Matrix = mat.computeCovariance()
14/07/11 04:25:25 INFO SparkContext: Starting job: aggregate at RowMatrix.scala:312
14/07/11 04:25:25 INFO DAGScheduler: Got job 8 (aggregate at RowMatrix.scala:312) with 2 output partitions (allowLocal=false)
14/07/11 04:25:25 INFO DAGScheduler: Final stage: Stage 8(aggregate at RowMatrix.scala:312)
14/07/11 04:25:25 INFO DAGScheduler: Parents of final stage: List()
14/07/11 04:25:25 INFO DAGScheduler: Missing parents: List()
14/07/11 04:25:25 INFO DAGScheduler: Submitting Stage 8 (MappedRDD[18] at map at <console>:29), which has no missing parents
14/07/11 04:25:25 INFO DAGScheduler: Submitting 2 missing tasks from Stage 8 (MappedRDD[18] at map at <console>:29)
14/07/11 04:25:25 INFO TaskSchedulerImpl: Adding task set 8.0 with 2 tasks
14/07/11 04:25:25 INFO TaskSetManager: Starting task 8.0:0 as TID 14 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 04:25:25 INFO TaskSetManager: Serialized task 8.0:0 as 3105 bytes in 0 ms
14/07/11 04:25:25 INFO TaskSetManager: Starting task 8.0:1 as TID 15 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 04:25:25 INFO TaskSetManager: Serialized task 8.0:1 as 3105 bytes in 0 ms
14/07/11 04:25:25 INFO Executor: Running task ID 14
14/07/11 04:25:25 INFO Executor: Running task ID 15
14/07/11 04:25:25 INFO BlockManager: Found block broadcast_2 locally
14/07/11 04:25:25 INFO BlockManager: Found block broadcast_2 locally
14/07/11 04:25:25 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 04:25:25 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 04:25:25 INFO Executor: Serialized size of result for 15 is 1140
14/07/11 04:25:25 INFO Executor: Sending result for 15 directly to driver
14/07/11 04:25:25 INFO Executor: Finished task ID 15
14/07/11 04:25:25 INFO Executor: Serialized size of result for 14 is 1140
14/07/11 04:25:25 INFO Executor: Sending result for 14 directly to driver
14/07/11 04:25:25 INFO Executor: Finished task ID 14
14/07/11 04:25:25 INFO DAGScheduler: Completed ResultTask(8, 1)
14/07/11 04:25:25 INFO TaskSetManager: Finished TID 15 in 16 ms on localhost (progress: 1/2)
14/07/11 04:25:25 INFO TaskSetManager: Finished TID 14 in 18 ms on localhost (progress: 2/2)
14/07/11 04:25:25 INFO TaskSchedulerImpl: Removed TaskSet 8.0, whose tasks have all completed, from pool
14/07/11 04:25:25 INFO DAGScheduler: Completed ResultTask(8, 0)
14/07/11 04:25:25 INFO DAGScheduler: Stage 8 (aggregate at RowMatrix.scala:312) finished in 0.019 s
14/07/11 04:25:25 INFO SparkContext: Job finished: aggregate at RowMatrix.scala:312, took 0.026187565 s
14/07/11 04:25:25 INFO SparkContext: Starting job: aggregate at RowMatrix.scala:211
14/07/11 04:25:25 INFO DAGScheduler: Got job 9 (aggregate at RowMatrix.scala:211) with 2 output partitions (allowLocal=false)
14/07/11 04:25:25 INFO DAGScheduler: Final stage: Stage 9(aggregate at RowMatrix.scala:211)
14/07/11 04:25:25 INFO DAGScheduler: Parents of final stage: List()
14/07/11 04:25:25 INFO DAGScheduler: Missing parents: List()
14/07/11 04:25:25 INFO DAGScheduler: Submitting Stage 9 (MappedRDD[18] at map at <console>:29), which has no missing parents
14/07/11 04:25:25 INFO DAGScheduler: Submitting 2 missing tasks from Stage 9 (MappedRDD[18] at map at <console>:29)
14/07/11 04:25:25 INFO TaskSchedulerImpl: Adding task set 9.0 with 2 tasks
14/07/11 04:25:25 INFO TaskSetManager: Starting task 9.0:0 as TID 16 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 04:25:25 INFO TaskSetManager: Serialized task 9.0:0 as 3057 bytes in 1 ms
14/07/11 04:25:25 INFO TaskSetManager: Starting task 9.0:1 as TID 17 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 04:25:25 INFO TaskSetManager: Serialized task 9.0:1 as 3057 bytes in 0 ms
14/07/11 04:25:25 INFO Executor: Running task ID 16
14/07/11 04:25:25 INFO Executor: Running task ID 17
14/07/11 04:25:25 INFO BlockManager: Found block broadcast_2 locally
14/07/11 04:25:25 INFO BlockManager: Found block broadcast_2 locally
14/07/11 04:25:25 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 04:25:25 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 04:25:25 INFO Executor: Serialized size of result for 17 is 1037
14/07/11 04:25:25 INFO Executor: Sending result for 17 directly to driver
14/07/11 04:25:25 INFO Executor: Finished task ID 17
14/07/11 04:25:25 INFO DAGScheduler: Completed ResultTask(9, 1)
14/07/11 04:25:25 INFO TaskSetManager: Finished TID 17 in 11 ms on localhost (progress: 1/2)
14/07/11 04:25:25 INFO Executor: Serialized size of result for 16 is 1037
14/07/11 04:25:25 INFO Executor: Sending result for 16 directly to driver
14/07/11 04:25:25 INFO Executor: Finished task ID 16
14/07/11 04:25:25 INFO DAGScheduler: Completed ResultTask(9, 0)
14/07/11 04:25:25 INFO TaskSetManager: Finished TID 16 in 14 ms on localhost (progress: 2/2)
14/07/11 04:25:25 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
14/07/11 04:25:25 INFO DAGScheduler: Stage 9 (aggregate at RowMatrix.scala:211) finished in 0.015 s
14/07/11 04:25:25 INFO SparkContext: Job finished: aggregate at RowMatrix.scala:211, took 0.019312805 s
Cov: org.apache.spark.mllib.linalg.Matrix =
4.333333333333334 5.166666666666666 5.166666666666668
5.166666666666668 6.333333333333332 6.333333333333336
5.166666666666668 6.333333333333332 6.333333333333336
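The entries of Cov can also be checked by hand. The diagonal entries are just the column variances from summary.variance. For an off-diagonal entry such as Cov(1, 2): column 1 = (1, 2, 5) and column 2 = (1, 3, 6) have deviations (-5/3, -2/3, 7/3) and (-7/3, -1/3, 8/3) from their means, so Cov(1, 2) = (35/9 + 2/9 + 56/9) / (3 - 1) = 31/6 ≈ 5.167, matching the value in the matrix above.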
4.2 IndexedRowMatrix
This is the second type of distributed matrix. It is very similar to a RowMatrix, except that it carries meaningful row indices. It is backed by an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector. Take a moment to digest that sentence.
An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector). It can be converted back to a RowMatrix by dropping its row indices.
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
val rows: RDD[IndexedRow] = ... // an RDD of indexed rows
// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()
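Continuing the data.txt example from section 4.1, one way to obtain an RDD[IndexedRow] is to attach a row index with zipWithIndex. This is a minimal sketch (not from the original session) that assumes the mid RDD of type RDD[Array[Double]] built earlier is still in scope:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Pair each row with its position in the file and wrap it as an IndexedRow.
val indexedRows = mid.zipWithIndex.map { case (values, idx) =>
  IndexedRow(idx, Vectors.dense(values))
}
val indexedMat = new IndexedRowMatrix(indexedRows)
val rowMat2 = indexedMat.toRowMatrix()   // back to a RowMatrix by dropping the indices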