
RDD[Vector]

Original post · 2016-08-28 23:17:10

Contents of the sample file data.txt (two space-separated columns of doubles), which the spark-shell session below reads:

1.629502 1.66991
1.871226 1.898365
1.46171 1.91306
1.58579 1.537943
2.018275 1.836801
1.98899 2.006619
1.599317 1.991072
1.991236 1.235661
1.057009 1.601767
1.889463 1.86318
1.368395 1.213885
1.251551 1.821578
1.904642 1.523114
1.383058 1.641584
1.182018 1.286603
1.030947 1.093305
2.050907 1.327946
1.74832 2.008842
2.02456 1.23564
1.02345 1.25648
First, record the file path and load the data, splitting each line on spaces and parsing the fields into an Array[Double]:

scala> val data_path="/home/sc/Desktop/data.txt"
data_path: String = /home/sc/Desktop/data.txt


scala> val data = sc.textFile(data_path).map(_.split(" ")).map(f => f.map(f => f.toDouble))
16/08/12 06:03:54 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 38.8 KB, free 135.9 KB)
16/08/12 06:03:54 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.2 KB, free 140.1 KB)
16/08/12 06:03:54 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:50455 (size: 4.2 KB, free: 517.4 MB)
16/08/12 06:03:54 INFO SparkContext: Created broadcast 4 from textFile at <console>:35
data: org.apache.spark.rdd.RDD[Array[Double]] = MapPartitionsRDD[13] at map at <console>:35
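A note on the parsing step: split(" ") assumes the columns are separated by exactly one space, and toDouble will throw a NumberFormatException on blank or malformed lines. A slightly more defensive variant (my own sketch, not part of the original session) trims each line, skips empty ones, and splits on any run of whitespace:

// Hypothetical hardened version of the load step above
val data = sc.textFile(data_path)
  .map(_.trim)                            // drop leading/trailing whitespace
  .filter(_.nonEmpty)                     // skip blank lines
  .map(_.split("\\s+").map(_.toDouble))   // split on any whitespace run

For the clean 20-row file shown above, both versions produce the same RDD[Array[Double]].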


Next, import the MLlib linear-algebra classes and wrap each Array[Double] in a dense Vector, yielding an RDD[Vector]:

scala> import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vectors


scala> val datal = data.map(f => Vectors.dense(f))
datal: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[14] at map at <console>:39
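An aside on vector types: Vectors.dense stores every element explicitly, which is fine for these two-column rows. For high-dimensional, mostly-zero features, MLlib also provides Vectors.sparse, which takes the vector size plus parallel arrays of non-zero indices and values (illustrative only; not part of the original session):

import org.apache.spark.mllib.linalg.Vectors

// The first row as a dense vector: both elements stored
val dense = Vectors.dense(1.629502, 1.66991)

// A size-2 vector with a single non-zero entry at index 0
val sparse = Vectors.sparse(2, Array(0), Array(1.629502))

Both types implement org.apache.spark.mllib.linalg.Vector, so an RDD of either works with the same MLlib APIs.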


Collecting the RDD confirms that each row is now an MLlib Vector:

scala> datal.collect
16/08/12 06:04:14 INFO FileInputFormat: Total input paths to process : 1
16/08/12 06:04:14 INFO SparkContext: Starting job: collect at <console>:42
16/08/12 06:04:14 INFO DAGScheduler: Got job 2 (collect at <console>:42) with 1 output partitions
16/08/12 06:04:14 INFO DAGScheduler: Final stage: ResultStage 2 (collect at <console>:42)
16/08/12 06:04:14 INFO DAGScheduler: Parents of final stage: List()
16/08/12 06:04:14 INFO DAGScheduler: Missing parents: List()
16/08/12 06:04:14 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[14] at map at <console>:39), which has no missing parents
16/08/12 06:04:14 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 3.6 KB, free 143.7 KB)
16/08/12 06:04:14 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 2027.0 B, free 145.7 KB)
16/08/12 06:04:14 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:50455 (size: 2027.0 B, free: 517.4 MB)
16/08/12 06:04:14 INFO SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:1006
16/08/12 06:04:14 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[14] at map at <console>:39)
16/08/12 06:04:14 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
16/08/12 06:04:14 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, partition 0,PROCESS_LOCAL, 2133 bytes)
16/08/12 06:04:14 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
16/08/12 06:04:14 INFO HadoopRDD: Input split: file:/home/sc/Desktop/data.txt:0+351
16/08/12 06:04:14 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 2786 bytes result sent to driver
16/08/12 06:04:14 INFO DAGScheduler: ResultStage 2 (collect at <console>:42) finished in 0.166 s
16/08/12 06:04:14 INFO DAGScheduler: Job 2 finished: collect at <console>:42, took 0.257591 s
16/08/12 06:04:14 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 163 ms on localhost (1/1)
16/08/12 06:04:14 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool 
res3: Array[org.apache.spark.mllib.linalg.Vector] = Array([1.629502,1.66991], [1.871226,1.898365], [1.46171,1.91306], [1.58579,1.537943], [2.018275,1.836801], [1.98899,2.006619], [1.599317,1.991072], [1.991236,1.235661], [1.057009,1.601767], [1.889463,1.86318], [1.368395,1.213885], [1.251551,1.821578], [1.904642,1.523114], [1.383058,1.641584], [1.182018,1.286603], [1.030947,1.093305], [2.050907,1.327946], [1.74832,2.008842], [2.02456,1.23564], [1.02345,1.25648])


scala> 
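Once the data is an RDD[Vector], it plugs directly into MLlib's RDD-based APIs. Here is a minimal sketch of two typical next steps, column summary statistics and k-means clustering, assuming the datal value built above (the choice of k and the iteration count are illustrative):

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.clustering.KMeans

// Per-column mean and variance over all 20 rows
val summary = Statistics.colStats(datal)
println(summary.mean)       // per-column means
println(summary.variance)   // per-column variances

// Partition the points into k = 2 clusters
val model = KMeans.train(datal, 2, 20)   // k = 2, maxIterations = 20
model.clusterCenters.foreach(println)

Both calls consume RDD[Vector] directly, which is why the dense-vector conversion shown in this post is the standard first step before using MLlib's classic (pre-DataFrame) algorithms.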
Copyright notice: This is an original article by the author and may not be reproduced without the author's permission.
