4.3 CoordinateMatrix
A CoordinateMatrix is another distributed matrix backed by an RDD. As the name suggests, each entry is a tuple (i: Long, j: Long, value: Double), where i is the row index, j is the column index, and value is the entry's value. When the matrix is very large and sparse, a CoordinateMatrix is usually the best choice.
A CoordinateMatrix is created from an RDD[MatrixEntry] instance, where a MatrixEntry wraps a (Long, Long, Double) triple. A CoordinateMatrix can be converted to an IndexedRowMatrix.
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

val entries: RDD[MatrixEntry] = ... // an RDD of matrix entries
// Create a CoordinateMatrix from an RDD[MatrixEntry].
val mat: CoordinateMatrix = new CoordinateMatrix(entries)
// Get its size.
val m = mat.numRows()
val n = mat.numCols()
// Convert it to an IndexedRowMatrix whose rows are sparse vectors.
val indexedRowMatrix = mat.toIndexedRowMatrix()
We again use the following tab-separated data set for the experiment.
1 1 2
2 3 4
5 6 7
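The spark-shell session below builds the matrix from this file step by step. As a compact sketch of the same pipeline (assuming the same tab-separated file at hdfs://node001:9000/spark/input/data.txt and the spark-shell SparkContext sc), the steps can also be chained directly:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Read the tab-separated file, parse each line "row<TAB>col<TAB>value" into a
// MatrixEntry, and wrap the resulting RDD in a CoordinateMatrix.
val entries = sc.textFile("hdfs://node001:9000/spark/input/data.txt")
  .map(_.split("\\t"))
  .map(arr => MatrixEntry(arr(0).toLong, arr(1).toLong, arr(2).toDouble))
val mat = new CoordinateMatrix(entries)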
* First, load the data from HDFS. Each line of the text file is treated as a String, so the result is an RDD[String].
scala> val textfile=sc.textFile("hdfs://node001:9000/spark/input/data.txt")
14/07/11 05:01:46 INFO MemoryStore: ensureFreeSpace(167504) called with curMem=0, maxMem=309225062
14/07/11 05:01:46 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 163.6 KB, free 294.7 MB)
textfile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
* Split each line into an array of strings, turning it into an RDD[Array[String]].
scala> val middle=textfile.map((arg)=>arg.split("\\t"))
middle: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[2] at map at <console>:14
* Extract a (Long, Long, Double) tuple from each array.
scala> val mid=middle.map((arg)=>(arg(0).toLong,arg(1).toLong,arg(2).toDouble))
mid: org.apache.spark.rdd.RDD[(Long, Long, Double)] = MappedRDD[3] at map at <console>:16
* Create the MatrixEntry objects.
scala> val entries=mid.map((arg)=>MatrixEntry(arg._1,arg._2,arg._3))
entries: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.distributed.MatrixEntry] = MappedRDD[4] at map at <console>:19
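To sanity-check the parsed entries before building the matrix, they can be printed. This is an optional step that was not part of the original session:

// Each element should print roughly as MatrixEntry(1,1,2.0) for the first data line.
entries.take(3).foreach(println)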
* Build the CoordinateMatrix.
scala> val mat: CoordinateMatrix = new CoordinateMatrix(entries)
mat: org.apache.spark.mllib.linalg.distributed.CoordinateMatrix = org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@be5b71c
* Compute its number of rows. Note that numRows() triggers a Spark job over the entries, which produces the log output below.
scala> val m = mat.numRows()
14/07/11 05:15:32 INFO FileInputFormat: Total input paths to process : 1
14/07/11 05:15:32 INFO SparkContext: Starting job: reduce at CoordinateMatrix.scala:99
14/07/11 05:15:32 INFO DAGScheduler: Got job 0 (reduce at CoordinateMatrix.scala:99) with 2 output partitions (allowLocal=false)
14/07/11 05:15:32 INFO DAGScheduler: Final stage: Stage 0(reduce at CoordinateMatrix.scala:99)
14/07/11 05:15:32 INFO DAGScheduler: Parents of final stage: List()
14/07/11 05:15:32 INFO DAGScheduler: Missing parents: List()
14/07/11 05:15:32 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[5] at map at CoordinateMatrix.scala:99), which has no missing parents
14/07/11 05:15:33 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[5] at map at CoordinateMatrix.scala:99)
14/07/11 05:15:33 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/07/11 05:15:33 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 05:15:33 INFO TaskSetManager: Serialized task 0.0:0 as 1936 bytes in 1 ms
14/07/11 05:15:33 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
14/07/11 05:15:33 INFO TaskSetManager: Serialized task 0.0:1 as 1936 bytes in 0 ms
14/07/11 05:15:33 INFO Executor: Running task ID 1
14/07/11 05:15:33 INFO Executor: Running task ID 0
14/07/11 05:15:33 INFO BlockManager: Found block broadcast_0 locally
14/07/11 05:15:33 INFO BlockManager: Found block broadcast_0 locally
14/07/11 05:15:33 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:9+9
14/07/11 05:15:33 INFO HadoopRDD: Input split: hdfs://node001:9000/spark/input/data.txt:0+9
14/07/11 05:15:33 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/07/11 05:15:33 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/07/11 05:15:33 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/07/11 05:15:33 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/07/11 05:15:33 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/07/11 05:15:33 INFO Executor: Serialized size of result for 1 is 724
14/07/11 05:15:33 INFO Executor: Serialized size of result for 0 is 724
14/07/11 05:15:33 INFO Executor: Sending result for 0 directly to driver
14/07/11 05:15:33 INFO Executor: Sending result for 1 directly to driver
14/07/11 05:15:33 INFO Executor: Finished task ID 1
14/07/11 05:15:33 INFO Executor: Finished task ID 0
14/07/11 05:15:33 INFO TaskSetManager: Finished TID 1 in 82 ms on localhost (progress: 1/2)
14/07/11 05:15:33 INFO DAGScheduler: Completed ResultTask(0, 1)
14/07/11 05:15:33 INFO DAGScheduler: Completed ResultTask(0, 0)
14/07/11 05:15:33 INFO TaskSetManager: Finished TID 0 in 91 ms on localhost (progress: 2/2)
14/07/11 05:15:33 INFO DAGScheduler: Stage 0 (reduce at CoordinateMatrix.scala:99) finished in 0.097 s
14/07/11 05:15:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/07/11 05:15:33 INFO SparkContext: Job finished: reduce at CoordinateMatrix.scala:99, took 0.170012294 s
m: Long = 6
* And its number of columns. The size comes out as 6 x 7 because numRows() and numCols() are derived from the largest row index (5) and column index (6) plus one.
scala> val n = mat.numCols()
n: Long = 7
* Convert the matrix to an IndexedRowMatrix.
scala> val indexedRowMatrix = mat.toIndexedRowMatrix()
indexedRowMatrix: org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix = org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix@65e3fe67
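The conversion groups the entries by row index, so each row of the resulting IndexedRowMatrix is a sparse vector. As a small follow-up sketch (not part of the original session), the rows can be inspected like this:

// Each element is an IndexedRow(index, vector); only rows 1, 2 and 5 appear,
// since those are the only row indices present in the entries.
indexedRowMatrix.rows.collect().foreach(println)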