Converting Between DataFrames and RDDs

Spark SQL: Converting Between DataFrames and RDDs

Spark SQL supports two ways of converting RDDs into DataFrames:

1. Using reflection to infer the RDD's schema. When the schema of the class is already known, this reflection-based approach keeps the code concise and works well.
2. Specifying the schema through a programmatic interface, building the RDD's schema via the Spark SQL API. This approach is more verbose, but its advantage is that the schema can be generated dynamically when the columns and their types are only known at runtime.

1. A few simple DataFrame operations first

Read a JSON file from HDFS and then operate on the resulting DataFrame directly.
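The original post does not show the contents of people.json; judging from the output below, it is presumably the standard Spark example file, with one JSON object per line:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}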

scala> val df=sqlContext.read.json("hdfs://192.168.18.140:9000/spark/people.json")
17/03/27 23:40:18 INFO json.JSONRelation: Listing hdfs://192.168.18.140:9000/spark/people.json on driver
17/03/27 23:40:19 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 230.3 KB, free 511.3 MB)
17/03/27 23:40:19 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 19.6 KB, free 511.3 MB)
17/03/27 23:40:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/03/27 23:40:19 INFO scheduler.DAGScheduler: ResultStage 0 (json at <console>:25) finished in 0.355 s
17/03/27 23:40:19 INFO scheduler.DAGScheduler: Job 0 finished: json at <console>:25, took 0.455183 s
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show();
17/03/27 23:40:38 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 86.5 KB, free 511.2 MB)
17/03/27 23:40:39 INFO scheduler.DAGScheduler: Job 2 finished: show at <console>:28, took 0.066229 s
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
scala> df.printSchema();
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)


scala> df.select("name").show();
17/03/27 23:42:06 INFO storage.MemoryStore: Block broadcast_6 stored as values in memory (estimated size 229.6 KB, free 510.7 MB)
17/03/27 23:42:06 INFO storage.MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 19.5 KB, free 510.6 MB)
17/03/27 23:42:06 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:53243 (size: 19.5 KB, free: 511.4 MB)
17/03/27 23:42:06 INFO scheduler.DAGScheduler: Job 4 finished: show at <console>:28, took 0.033443 s
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+


scala> df.select(df("name"),df("age")+1).show();
17/03/27 23:42:51 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 229.6 KB, free 511.3 MB)
17/03/27 23:42:51 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 19.5 KB, free 511.3 MB)


17/03/27 23:42:52 INFO scheduler.DAGScheduler: Job 6 finished: show at <console>:28, took 0.034821 s
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+

scala> df.filter(df("age")>21).show();
17/03/27 23:45:16 INFO storage.MemoryStore: Block broadcast_23 stored as values in memory (estimated size 229.6 KB, free 509.7 MB)
17/03/27 23:45:17 INFO storage.MemoryStore: Block broadcast_23_piece0 stored as bytes in memory (estimated size 19.5 KB, free 509.7 MB)
17/03/27 23:45:17 INFO scheduler.DAGScheduler: Job 12 finished: show at <console>:28, took 0.028776 s
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

scala> df.groupBy("age").count().show();
17/03/27 23:45:34 INFO executor.Executor: Finished task 196.0 in stage 18.0 (TID 413). 1609 bytes result sent to driver
17/03/27 23:45:34 INFO executor.Executor: Running task 198.0 in stage 18.0 (TID 415)
+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+
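The same queries can also be expressed in SQL once the DataFrame is registered as a temporary table, which is the mechanism the next two sections rely on. A minimal sketch, assuming the df read above and the spark-shell's sqlContext:

// Register the DataFrame so it can be queried with SQL (Spark 1.x API)
df.registerTempTable("people")

// SQL equivalents of the select / filter / groupBy calls above
sqlContext.sql("SELECT name, age + 1 FROM people").show()
sqlContext.sql("SELECT * FROM people WHERE age > 21").show()
sqlContext.sql("SELECT age, COUNT(*) AS count FROM people GROUP BY age").show()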

2. Using reflection to infer the schema

# Create a case class
scala> case class people(name:String,age:Int);
defined class people
# Create an RDD from the given path, associate each line with the case class, and implicitly convert the RDD into a DataFrame (people.txt is assumed to hold one comma-separated name,age pair per line)
scala> val ppl=sc.textFile("hdfs://192.168.18.140:9000/spark/people.txt").map(_.split(",")).map(p=>people(p(0),p(1).trim.toInt)).toDF();
17/03/27 23:53:09 INFO storage.MemoryStore: Block broadcast_32 stored as values in memory (estimated size 229.9 KB, free 509.4 MB)
17/03/27 23:53:09 INFO storage.MemoryStore: Block broadcast_32_piece0 stored as bytes in memory (estimated size 19.5 KB, free 509.4 MB)
17/03/27 23:53:09 INFO storage.BlockManagerInfo: Added broadcast_32_piece0 in memory on localhost:53243 (size: 19.5 KB, free: 511.3 MB)
17/03/27 23:53:09 INFO spark.SparkContext: Created broadcast 32 from textFile at <console>:29
ppl: org.apache.spark.sql.DataFrame = [name: string, age: int]
# Register the DataFrame as a temporary table
scala> ppl.registerTempTable("ppl");
# Write a SQL query; the result is a new DataFrame
scala> val p=sqlContext.sql("select name,age from ppl where age<=8");
17/03/27 23:54:54 INFO parse.ParseDriver: Parsing command: select name,age from ppl where age<=8
17/03/27 23:54:54 INFO parse.ParseDriver: Parse Completed
p: org.apache.spark.sql.DataFrame = [name: string, age: int]
# Display the data with RDD-style operations
scala> p.map(t=>"name: "+t(0)).collect().foreach(println);
17/03/27 23:55:54 INFO mapred.FileInputFormat: Total input paths to process : 1
17/03/27 23:55:54 INFO spark.SparkContext: Starting job: collect at <console>:28
17/03/27 23:55:54 INFO scheduler.DAGScheduler: Got job 15 (collect at <console>:28) with 2 output partitions
17/03/27 23:55:54 INFO scheduler.DAGScheduler: ResultStage 19 (collect at <console>:28) finished in 0.041 s
17/03/27 23:55:54 INFO scheduler.DAGScheduler: Job 15 finished: collect at <console>:28, took 0.046329 s
name: spark
name: flume
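Outside the spark-shell, the implicit conversion that provides toDF() has to be imported explicitly (the shell imports it automatically, which is why the transcript above never shows it). Below is a minimal self-contained sketch of the same reflection-based flow, assuming an existing SparkContext sc and the same HDFS path:

import org.apache.spark.sql.SQLContext

// The case class describes the schema; its field names become the column names
case class People(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._                       // enables rdd.toDF()

val ppl = sc.textFile("hdfs://192.168.18.140:9000/spark/people.txt")
  .map(_.split(","))                                // each line is "name,age"
  .map(p => People(p(0), p(1).trim.toInt))
  .toDF()

ppl.registerTempTable("ppl")
sqlContext.sql("select name, age from ppl where age <= 8")
  .map(t => "name: " + t(0))
  .collect()
  .foreach(println)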

3. Specifying the schema through a programmatic interface
Creating a DataFrame programmatically takes three steps:
1. Create an RDD of Rows from the original RDD.
2. Create a StructType that matches the structure of the Rows in the RDD; this StructType describes the RDD's schema.
3. Create the DataFrame with the createDataFrame method provided by SQLContext.

# Create an RDD from the given path
scala> val peo=sc.textFile("hdfs://192.168.18.140:9000/spark/people.txt");
17/03/27 23:59:18 INFO storage.MemoryStore: Block broadcast_36 stored as values in memory (estimated size 230.0 KB, free 509.2 MB)
17/03/27 23:59:18 INFO storage.MemoryStore: Block broadcast_36_piece0 stored as bytes in memory (estimated size 19.5 KB, free 509.2 MB)
17/03/27 23:59:18 INFO storage.BlockManagerInfo: Added broadcast_36_piece0 in memory on localhost:53243 (size: 19.5 KB, free: 511.3 MB)
17/03/27 23:59:18 INFO spark.SparkContext: Created broadcast 36 from textFile at <console>:27
peo: org.apache.spark.rdd.RDD[String] = hdfs://192.168.18.140:9000/spark/people.txt MapPartitionsRDD[75] at textFile at <console>:27
# Specify the schema fields as a string
scala> val schemaString ="name age";
17/03/27 23:59:36 INFO storage.BlockManagerInfo: Removed broadcast_35_piece0 on localhost:53243 in memory (size: 4.6 KB, free: 511.3 MB)
17/03/27 23:59:36 INFO spark.ContextCleaner: Cleaned accumulator 54
17/03/27 23:59:36 INFO storage.BlockManagerInfo: Removed broadcast_34_piece0 on localhost:53243 in memory (size: 6.3 KB, free: 511.3 MB)
17/03/27 23:59:36 INFO spark.ContextCleaner: Cleaned accumulator 53
schemaString: String = name age
# Import Row
scala> import org.apache.spark.sql.Row;
import org.apache.spark.sql.Row
# Import the Spark SQL data types
scala> import org.apache.spark.sql.types.{StructType,StructField,StringType};
import org.apache.spark.sql.types.{StructType, StructField, StringType}
# Build a schema in which every field is a StringType
scala> val schema=StructType(schemaString.split(" ").map(fieldName=>StructField(fieldName,StringType,true)));
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,StringType,true))
# Map the RDD to an RDD of Rows
scala> val rowRDD=peo.map(_.split(",")).map(p=>Row(p(0),p(1).trim));
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[77] at map at <console>:33
# Apply the schema information to the rowRDD
scala> val peoDataFrame=sqlContext.createDataFrame(rowRDD,schema);
peoDataFrame: org.apache.spark.sql.DataFrame = [name: string, age: string]
# Register the temporary table
scala> peoDataFrame.registerTempTable("p");
scala> val result=sqlContext.sql("select name from p");
17/03/28 00:06:07 INFO parse.ParseDriver: Parsing command: select name from p
17/03/28 00:06:07 INFO parse.ParseDriver: Parse Completed
result: org.apache.spark.sql.DataFrame = [name: string]
# Print the result with RDD-style operations
scala> result.map(t=>"name :"+t(0)).collect().foreach(println);
17/03/28 00:06:53 INFO mapred.FileInputFormat: Total input paths to process : 1
17/03/28 00:06:53 INFO spark.SparkContext: Starting job: collect at <console>:30
17/03/28 00:06:53 INFO scheduler.DAGScheduler: Got job 17 (collect at <console>:30) with 2 output partitions
17/03/28 00:06:53 INFO scheduler.DAGScheduler: Final stage: ResultStage 22 (collect at <console>:30)
17/03/28 00:06:53 INFO scheduler.DAGScheduler: Parents of final stage: List()
name :spark
name :hadoop
name :flume
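Note that with the StringType schema above, the age column stays a string. A variant sketch (not part of the original transcript) that declares age as IntegerType instead, keeping the same field names and HDFS path:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

// Same three steps, but the age field is typed as an integer
val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val rowRDD = sc.textFile("hdfs://192.168.18.140:9000/spark/people.txt")
  .map(_.split(","))
  .map(p => Row(p(0), p(1).trim.toInt))             // toInt so the value matches IntegerType

val peoDF = sqlContext.createDataFrame(rowRDD, schema)
peoDF.registerTempTable("p2")
sqlContext.sql("select name from p2 where age <= 8").show()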