RDD[Vector]

Original post, 2016-08-28 23:17:10

This post walks through turning a plain text file of space-separated doubles into an RDD[Vector], the input type many Spark MLlib routines expect. The sample input file (data.txt) looks like this:

1.629502 1.66991
1.871226 1.898365
1.46171 1.91306
1.58579 1.537943
2.018275 1.836801
1.98899 2.006619
1.599317 1.991072
1.991236 1.235661
1.057009 1.601767
1.889463 1.86318
1.368395 1.213885
1.251551 1.821578
1.904642 1.523114
1.383058 1.641584
1.182018 1.286603
1.030947 1.093305
2.050907 1.327946
1.74832 2.008842
2.02456 1.23564
1.02345 1.25648
In spark-shell, point at the file, then read it, split each line on spaces, and parse the fields to Double:
scala> val data_path="/home/sc/Desktop/data.txt"
data_path: String = /home/sc/Desktop/data.txt


scala> val data = sc.textFile(data_path).map(_.split(" ")).map(f => f.map(f => f.toDouble))
16/08/12 06:03:54 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 38.8 KB, free 135.9 KB)
16/08/12 06:03:54 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.2 KB, free 140.1 KB)
16/08/12 06:03:54 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:50455 (size: 4.2 KB, free: 517.4 MB)
16/08/12 06:03:54 INFO SparkContext: Created broadcast 4 from textFile at <console>:35
data: org.apache.spark.rdd.RDD[Array[Double]] = MapPartitionsRDD[13] at map at <console>:35
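
The split above assumes exactly one space between fields and no blank lines. A slightly more defensive variant, my own sketch continuing the same spark-shell session rather than part of the original post, trims each line, drops empty ones, and splits on runs of whitespace:

// Hypothetical hardened version of the same parsing step (shadows the earlier val).
val data = sc.textFile(data_path)
  .map(_.trim)                           // strip stray leading/trailing whitespace
  .filter(_.nonEmpty)                    // skip blank lines
  .map(_.split("\\s+").map(_.toDouble))  // split on any run of whitespace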


scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors


scala> val datal = data.map(f => Vectors.dense(f))
datal: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[14] at map at <console>:39
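
Vectors.dense simply wraps each Array[Double] in MLlib's local Vector type. For reference, the same factory object also builds sparse vectors; the values below are illustrative, not taken from the post:

import org.apache.spark.mllib.linalg.Vectors

// Dense vector: every component stored explicitly.
val dv = Vectors.dense(1.629502, 1.66991)
// Sparse vector: size plus (index, value) pairs, equivalent to [0.0, 3.0, 0.0, 4.5].
val sv = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.5))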


scala> datal.collect
16/08/12 06:04:14 INFO FileInputFormat: Total input paths to process : 1
16/08/12 06:04:14 INFO SparkContext: Starting job: collect at <console>:42
16/08/12 06:04:14 INFO DAGScheduler: Got job 2 (collect at <console>:42) with 1 output partitions
16/08/12 06:04:14 INFO DAGScheduler: Final stage: ResultStage 2 (collect at <console>:42)
16/08/12 06:04:14 INFO DAGScheduler: Parents of final stage: List()
16/08/12 06:04:14 INFO DAGScheduler: Missing parents: List()
16/08/12 06:04:14 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[14] at map at <console>:39), which has no missing parents
16/08/12 06:04:14 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.6 KB, free 143.7 KB)
16/08/12 06:04:14 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2027.0 B, free 145.7 KB)
16/08/12 06:04:14 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:50455 (size: 2027.0 B, free: 517.4 MB)
16/08/12 06:04:14 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
16/08/12 06:04:14 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[14] at map at <console>:39)
16/08/12 06:04:14 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/08/12 06:04:14 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2133 bytes)
16/08/12 06:04:14 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
16/08/12 06:04:14 INFO HadoopRDD: Input split: file:/home/sc/Desktop/data.txt:0+351
16/08/12 06:04:14 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2786 bytes result sent to driver
16/08/12 06:04:14 INFO DAGScheduler: ResultStage 2 (collect at <console>:42) finished in 0.166 s
16/08/12 06:04:14 INFO DAGScheduler: Job 2 finished: collect at <console>:42, took 0.257591 s
16/08/12 06:04:14 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 163 ms on localhost (1/1)
16/08/12 06:04:14 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
res3: Array[org.apache.spark.mllib.linalg.Vector] = Array([1.629502,1.66991], [1.871226,1.898365], [1.46171,1.91306], [1.58579,1.537943], [2.018275,1.836801], [1.98899,2.006619], [1.599317,1.991072], [1.991236,1.235661], [1.057009,1.601767], [1.889463,1.86318], [1.368395,1.213885], [1.251551,1.821578], [1.904642,1.523114], [1.383058,1.641584], [1.182018,1.286603], [1.030947,1.093305], [2.050907,1.327946], [1.74832,2.008842], [2.02456,1.23564], [1.02345,1.25648])


scala> 
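
Once the data is an RDD[Vector], it plugs directly into MLlib routines that take that type. As a minimal sketch of what comes next, my own example continuing the same session and not part of the original post (k = 2 and the iteration count are illustrative values):

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.clustering.KMeans

// Column-wise summary statistics over the RDD[Vector].
val summary = Statistics.colStats(datal)
println(s"mean: ${summary.mean}, variance: ${summary.variance}")

// k-means clustering also consumes RDD[Vector] directly; cache the RDD
// since the algorithm iterates over it.
val model = KMeans.train(datal.cache(), 2, 20)
model.clusterCenters.foreach(println)
println(s"within-set sum of squared errors: ${model.computeCost(datal)}")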