MLlib-Basics (II)

3 Local matrix

At present only DenseMatrix is supported; SparseMatrix is expected in a future release.

scala> import org.apache.spark.mllib.linalg.{Matrix, Matrices}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}

// Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
scala> val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
dm: org.apache.spark.mllib.linalg.Matrix =
1.0  2.0  
3.0  4.0  
5.0  6.0
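
Note that Matrices.dense reads its entries in column-major order: the first numRows values fill the first column. A quick way to see this (a minimal sketch reusing the dm defined above) is to dump the matrix back to a flat array:

// toArray returns the entries in column-major order:
// column 0 is (1.0, 3.0, 5.0), column 1 is (2.0, 4.0, 6.0).
println(dm.toArray.mkString(", ")) // 1.0, 3.0, 5.0, 2.0, 4.0, 6.0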

4 Distributed matrix

A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Three types of distributed matrices are covered here.

Note:

The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size.

4.1 RowMatrix

A RowMatrix stores each row of the matrix as an element of an RDD, so the rows are distributed and each row is a local vector. This is similar to the data matrix in multivariate statistics. Since each row is represented by a local vector, the number of columns is limited by the integer range, but in practice it should be much smaller.

Let's walk through this carefully:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows: RDD[Vector] = ... // an RDD of local vectors
// Create a RowMatrix from an RDD[Vector].
val mat: RowMatrix = new RowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()
    
Question: how should the RDD[Vector] be generated, especially when the data source is a file, say a text file?

Answer:

Assumption: there is a file data.txt on HDFS at hdfs://node001:9000/spark/input/data.txt containing a matrix with three rows and three columns, its elements separated by the tab character \t:

                        1    1    2
                        2    3    4
                        5    6    7

*        Read the file: each line of the text becomes one String                                --------------------------->RDD[String]

scala> val textfile=sc.textFile("hdfs://node001:9000/spark/input/data.txt")
14/07/11 00:45:33 INFO MemoryStore: ensureFreeSpace(82268) called with curMem=249772, maxMem=309225062
14/07/11 00:45:33 INFO MemoryStore: Block broadcast_2 stored as values to memory (estimated size 80.3 KB, free 294.6 MB)
textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[13] at textFile at <console>:21

*        Split each line on the tab character: every number is a String, and each line becomes an Array[String]         ------------------------------->RDD[Array[String]]

           scala> val middle=textfile.map((arg) =>arg.split("\\t"))
           middle: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[14] at map at <console>:23

*        Convert each number in every row from String to Double                                 ------------------------------->RDD[Array[Double]]

           scala> val mid=middle.map((arg)=>arg.map((args)=>args.toDouble))
           mid: org.apache.spark.rdd.RDD[Array[Double]] = MappedRDD[17] at map at <console>:25
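
Note: the next step uses the Vector and Vectors types; the transcript assumes they were already imported into the shell:

           scala> import org.apache.spark.mllib.linalg.{Vector, Vectors}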

*        Define a function change that turns an Array[Double] into a Vector.

          scala> def change(t:Array[Double]):Vector={
                   | val x=Vectors.dense(t)
                   | x
                   | }
           change: (t: Array[Double])org.apache.spark.mllib.linalg.Vector


*        Mapping change over the rows turns each Array[Double] into a Vector     --------------------------------------------->RDD[Vector]

 scala> val ha=mid.map(change)
ha: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[18] at map at <console>:29

*        Create a RowMatrix from the RDD[Vector].

           scala> val mat:RowMatrix=new RowMatrix(ha)
           mat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@2ec01ae5
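
Putting the steps together, the whole pipeline from text file to RowMatrix can also be written as a single expression (a sketch equivalent to the walkthrough above):

// Read the file, split each line on tabs, parse to Double, wrap each row as a vector.
val mat: RowMatrix = new RowMatrix(
  sc.textFile("hdfs://node001:9000/spark/input/data.txt")
    .map(line => Vectors.dense(line.split("\\t").map(_.toDouble))))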
                  

*        Finally, compute the matrix's row and column counts.

scala> val m=mat.numRows()
...(INFO log output omitted)...
m: Long = 3

scala> val n=mat.numCols()
...(INFO log output omitted)...
n: Long = 3

4.1.1 Multivariate summary statistics

Column summary statistics are provided for RowMatrix. If the number of columns is not very large, say smaller than 3000, you can also compute the covariance matrix as a local matrix, which requires O(n²) storage, where n is the number of columns. The total CPU time is O(mn²), where m is the number of rows, and it is faster if the rows are sparse.

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.stat.MultivariateStatisticalSummary

val mat: RowMatrix = ... // a RowMatrix

// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean)        // a dense vector containing the mean value for each column
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column

// Compute the covariance matrix.
val Cov: Matrix = mat.computeCovariance()

Experiment:

scala> val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()

...(INFO log output omitted)...

summary: org.apache.spark.mllib.stat.MultivariateStatisticalSummary = org.apache.spark.mllib.linalg.distributed.ColumnStatisticsAggregator@5a30f467


scala> println(summary.mean)                        // per column

[2.6666666666666665,3.3333333333333335,4.333333333333333]


scala> println(summary.variance)                        // per column

[4.333333333333333,6.333333333333333,6.333333333333333]


scala> println(summary.numNonzeros)

[3.0,3.0,3.0]
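
These numbers are easy to verify by hand, which also shows that variance is the unbiased sample variance (n - 1 denominator). A minimal check for column 0, whose values are 1, 2, 5:

// mean = 8/3 ≈ 2.667; variance = ((1-8/3)² + (2-8/3)² + (5-8/3)²) / 2 = 13/3 ≈ 4.333
val col = Array(1.0, 2.0, 5.0)
val mean = col.sum / col.length
val variance = col.map(x => (x - mean) * (x - mean)).sum / (col.length - 1)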


scala> val Cov: Matrix = mat.computeCovariance()

...(INFO log output omitted)...

Cov: org.apache.spark.mllib.linalg.Matrix =
4.333333333333334  5.166666666666666  5.166666666666668
5.166666666666668  6.333333333333332  6.333333333333336
5.166666666666668  6.333333333333332  6.333333333333336
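
The diagonal of Cov reproduces summary.variance, and the off-diagonal entries can be checked the same way. For columns 0 and 1, with values (1, 2, 5) and (1, 3, 6), the sample covariance is 31/6 ≈ 5.1667, matching Cov(0, 1). A sketch of the check:

// covariance = sum((x - mean_x) * (y - mean_y)) / (n - 1)
val c0 = Array(1.0, 2.0, 5.0)
val c1 = Array(1.0, 3.0, 6.0)
val (m0, m1) = (c0.sum / 3, c1.sum / 3)
val cov01 = c0.zip(c1).map { case (x, y) => (x - m0) * (y - m1) }.sum / 2 // ≈ 5.1667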


4.2 IndexedRowMatrix      

This is the second kind of distributed matrix. It is very similar to a RowMatrix, except that it has meaningful row indices: it is backed by an RDD of indexed rows, where each row is represented by its index (long-typed) and a local vector. Think about what that sentence implies.

An IndexedRowMatrix can be created from an RDD[IndexedRow] instance, where IndexedRow is a wrapper over (Long, Vector). An IndexedRowMatrix can be converted to a RowMatrix by dropping its row indices.


import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

val rows: RDD[IndexedRow] = ... // an RDD of indexed rows

// Create an IndexedRowMatrix from an RDD[IndexedRow].
val mat: IndexedRowMatrix = new IndexedRowMatrix(rows)

// Get its size.
val m = mat.numRows()
val n = mat.numCols()

// Drop its row indices.
val rowMat: RowMatrix = mat.toRowMatrix()
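
Continuing the earlier walkthrough, one way to obtain an RDD[IndexedRow] is to attach indices to the RDD[Vector] built above (a sketch reusing ha; zipWithIndex numbers the elements by their position in the RDD):

// Pair each vector with its (long-typed) position, then wrap it as an IndexedRow.
val indexedRows = ha.zipWithIndex.map { case (v, i) => IndexedRow(i, v) }
val idxMat = new IndexedRowMatrix(indexedRows)
val backToRows: RowMatrix = idxMat.toRowMatrix() // drops the indices again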
