Lesson 60: Hands-On Dynamic RDD-to-DataFrame Conversion in an IDE with Java and Scala — Study Notes
Topics of this lesson:
1 Hands-on RDD-to-DataFrame conversion in Java
2 Hands-on RDD-to-DataFrame conversion in Scala
What is non-dynamic (static) conversion?
=> The metadata of the RDD's data is known in advance, so the DataFrame can be created up front via a JavaBean or a case class, with the metadata obtained through reflection.
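To make the reflection idea concrete, here is a minimal plain-Java sketch (no Spark dependency; the Person bean is a hypothetical example matching the persons data used later in these notes) of how column names and types can be read off a JavaBean class, which is essentially what a reflection-based converter does internally:

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

public class BeanMetadata {
    // Hypothetical JavaBean describing one record of the persons data
    public static class Person {
        public int id;
        public String name;
        public int age;
    }

    // Extract column name -> type from the bean class via reflection,
    // mimicking what a static RDD-to-DataFrame converter derives up front
    public static Map<String, String> schemaOf(Class<?> beanClass) {
        Map<String, String> schema = new LinkedHashMap<String, String>();
        for (Field f : beanClass.getDeclaredFields()) {
            schema.put(f.getName(), f.getType().getSimpleName());
        }
        return schema;
    }

    public static void main(String[] args) {
        System.out.println(schemaOf(Person.class));
    }
}
```

The point is that with the static approach this schema is fixed at compile time by the bean class, which is exactly what dynamic conversion avoids.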
What is dynamic conversion?
=> The number of columns in each RDD record and the type of each column cannot be known in advance; they are only known at runtime.
This situation is more common in production, where it is unlikely that the metadata is known ahead of time. Moreover, business requirements change, and when they do, columns are added or removed and column types change. With the static approach, every business change forces substantial code modification; with the dynamic approach, the metadata comes from a database, a file, or some other source, so the business logic in the code changes little when requirements change.
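As a minimal plain-Java sketch of that idea (no Spark; the "name:type" schema-string format is an assumption invented for illustration), the column metadata can be parsed at runtime from an external description instead of being hard-coded:

```java
import java.util.ArrayList;
import java.util.List;

public class DynamicSchema {
    // One column's metadata: name plus type, both known only at runtime
    public static class Column {
        public final String name;
        public final String type;
        public Column(String name, String type) { this.name = name; this.type = type; }
    }

    // Parse a schema description such as "id:int,name:string,age:int".
    // In production the string would come from a database or a config/JSON file,
    // so a business change means editing the description, not the code.
    public static List<Column> parse(String schema) {
        List<Column> columns = new ArrayList<Column>();
        for (String field : schema.split(",")) {
            String[] parts = field.split(":");
            columns.add(new Column(parts[0].trim(), parts[1].trim()));
        }
        return columns;
    }

    public static void main(String[] args) {
        for (Column c : parse("id:int,name:string,age:int")) {
            System.out.println(c.name + " -> " + c.type);
        }
    }
}
```

In the Spark code later in these notes, each parsed Column would be mapped to a StructField; here the hard-coded loop over a string stands in for that external source.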
Two ways to implement it in Scala:
1 Inside a main method.
2 By extending App and writing the code in the object body.
In 2015 Spark established itself worldwide as the foundation of a big data analytics OS.
What is Spark? => Spark is a platform for big data analytics, on top of which many sub-frameworks can be built and extended.
In 2015 the industry's demand for big data streaming grew: credit card fraud detection, weather queries, traffic monitoring, healthcare, and more all call for real-time processing. In a sense, everything is stream processing. Spark dominates stream processing because incoming data can be handled directly with Spark SQL and then passed on to machine learning, graph computation, and other workloads.
Spark SQL will be covered in three steps:
1 Programming API
2 Internals
3 Performance tuning
The code below shows how to implement dynamic RDD-to-DataFrame conversion:
package com.dt.spark.SparkApps.sql;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class RDD2DataFrameByProgramatically {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local").setAppName("RDD2DataFrameByProgramatically");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);
        JavaRDD<String> lines = sc.textFile("D://DT-IMF//testdata//persons.txt");
        // Step 1: build an RDD of type Row on top of the raw RDD
        JavaRDD<Row> personsRDD = lines.map(new Function<String, Row>() {
            @Override
            public Row call(String line) throws Exception {
                String[] splited = line.split(",");
                return RowFactory.create(Integer.valueOf(splited[0]), splited[1], Integer.valueOf(splited[2]));
            }
        });
        // Step 2: construct the DataFrame's metadata dynamically. In general, the number of
        // columns and each column's concrete type may come from a JSON file or from a database.
        // JSON is very lightweight, naturally key-value, and concise; a database is more secure.
        List<StructField> structFields = new ArrayList<StructField>();
        structFields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
        structFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
        structFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
        // Build the StructType that will describe the DataFrame's metadata
        StructType structType = DataTypes.createStructType(structFields);
        // Step 3: construct the DataFrame from the metadata above plus the RDD<Row>.
        // Will DataFrame replace RDD now that Spark has it? No! RDD is the underlying core.
        // Going forward, programming rests on three pillars: RDD, Dataset, and DataFrame.
        // Dataset is still an experimental API; the intent is for all sub-frameworks to compute
        // on Dataset. Dataset is backed by Project Tungsten, so every framework can then benefit
        // from Tungsten's native performance advantages.
        // HiveContext is normally recommended -- and not only when the data source is Hive.
        // HiveContext is more powerful than SQLContext: it includes all of SQLContext's
        // functionality and improves on it.
        DataFrame personsDF = sqlContext.createDataFrame(personsRDD, structType);
        // Step 4: register a temporary table for subsequent SQL queries
        personsDF.registerTempTable("persons");
        // Step 5: perform multidimensional analysis on the data
        DataFrame result = sqlContext.sql("select * from persons where age > 8");
        // Step 6: process the result, including converting the DataFrame back to RDD<Row>
        // and persisting the result
        List<Row> listRow = result.javaRDD().collect();
        for (Row row : listRow) {
            System.out.println(row);
        }
        // Only group/join queries reveal the speed difference between Hive and Spark SQL.
        // With Hadoop and chained operations, only one reducer runs at a time, and join/group
        // queries spawn many jobs, whereas Spark SQL may need only a single job.
        // In Hadoop every task is its own JVM (JVMs are not reused), while Spark reuses threads.
        // By the time a JVM has finished starting up, Spark has already finished the computation.
    }
}
Contents of the file D://DT-IMF//testdata//persons.txt:
1,Spark,7
2,Hadoop,11
3,Flink,5
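As a sanity check on the query, this plain-Java sketch (no Spark; FilterCheck and filterByAge are hypothetical names) parses the three sample lines above and applies the same age > 8 predicate; only the Hadoop record should survive, matching the run output below:

```java
import java.util.ArrayList;
import java.util.List;

public class FilterCheck {
    // Apply the query's predicate (age > minAgeExclusive) to raw "id,name,age" lines,
    // mirroring what "select * from persons where age > 8" keeps
    public static List<String> filterByAge(List<String> lines, int minAgeExclusive) {
        List<String> kept = new ArrayList<String>();
        for (String line : lines) {
            String[] splited = line.split(",");
            if (Integer.valueOf(splited[2]) > minAgeExclusive) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<String>();
        lines.add("1,Spark,7");
        lines.add("2,Hadoop,11");
        lines.add("3,Flink,5");
        // Only "2,Hadoop,11" satisfies age > 8
        System.out.println(filterByAge(lines, 8));
    }
}
```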
Output when run in Eclipse:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/31 00:24:51 INFO SparkContext: Running Spark version 1.6.0
16/03/31 00:24:57 INFO SecurityManager: Changing view acls to: think
16/03/31 00:24:57 INFO SecurityManager: Changing modify acls to: think
16/03/31 00:24:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(think); users with modify permissions: Set(think)
16/03/31 00:25:03 INFO Utils: Successfully started service 'sparkDriver' on port 61072.
16/03/31 00:25:05 INFO Slf4jLogger: Slf4jLogger started
16/03/31 00:25:06 INFO Remoting: Starting remoting
16/03/31 00:25:07 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.56.1:61085]
16/03/31 00:25:07 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 61085.
16/03/31 00:25:07 INFO SparkEnv: Registering MapOutputTracker
16/03/31 00:25:07 INFO SparkEnv: Registering BlockManagerMaster
16/03/31 00:25:08 INFO DiskBlockManager: Created local directory at C:\Users\think\AppData\Local\Temp\blockmgr-8b91069e-9c87-4557-9231-3e7d7dcd6bee
16/03/31 00:25:08 INFO MemoryStore: MemoryStore started with capacity 1773.8 MB
16/03/31 00:25:09 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/31 00:25:11 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/03/31 00:25:11 INFO SparkUI: Started SparkUI at http://192.168.56.1:4040
16/03/31 00:25:13 INFO Executor: Starting executor ID driver on host localhost
16/03/31 00:25:13 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61092.
16/03/31 00:25:13 INFO NettyBlockTransferService: Server created on 61092
16/03/31 00:25:13 INFO BlockManagerMaster: Trying to register BlockManager
16/03/31 00:25:13 INFO BlockManagerMasterEndpoint: Registering block manager localhost:61092 with 1773.8 MB RAM, BlockManagerId(driver, localhost, 61092)
16/03/31 00:25:13 INFO BlockManagerMaster: Registered BlockManager
16/03/31 00:25:20 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 127.4 KB, free 127.4 KB)
16/03/31 00:25:20 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 141.3 KB)
16/03/31 00:25:20 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:61092 (size: 13.9 KB, free: 1773.7 MB)
16/03/31 00:25:20 INFO SparkContext: Created broadcast 0 from textFile at RDD2DataFrameByProgramatically.java:25
16/03/31 00:25:31 WARN : Your hostname, think-PC resolves to a loopback/non-reachable address: fe80:0:0:0:d401:a5b5:2103:6d13%eth8, but we couldn't find any external IP address!
16/03/31 00:25:33 INFO FileInputFormat: Total input paths to process : 1
16/03/31 00:25:34 INFO SparkContext: Starting job: collect at RDD2DataFrameByProgramatically.java:57
16/03/31 00:25:34 INFO DAGScheduler: Got job 0 (collect at RDD2DataFrameByProgramatically.java:57) with 1 output partitions
16/03/31 00:25:34 INFO DAGScheduler: Final stage: ResultStage 0 (collect at RDD2DataFrameByProgramatically.java:57)
16/03/31 00:25:34 INFO DAGScheduler: Parents of final stage: List()
16/03/31 00:25:34 INFO DAGScheduler: Missing parents: List()
16/03/31 00:25:34 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at javaRDD at RDD2DataFrameByProgramatically.java:57), which has no missing parents
16/03/31 00:25:34 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 9.4 KB, free 150.7 KB)
16/03/31 00:25:34 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.7 KB, free 155.4 KB)
16/03/31 00:25:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:61092 (size: 4.7 KB, free: 1773.7 MB)
16/03/31 00:25:34 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/03/31 00:25:34 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at javaRDD at RDD2DataFrameByProgramatically.java:57)
16/03/31 00:25:34 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/03/31 00:25:35 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2138 bytes)
16/03/31 00:25:35 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/03/31 00:25:35 INFO HadoopRDD: Input split: file:/D:/DT-IMF/testdata/persons.txt:0+33
16/03/31 00:25:35 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/03/31 00:25:35 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/03/31 00:25:35 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/03/31 00:25:35 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/03/31 00:25:35 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/03/31 00:25:36 INFO GeneratePredicate: Code generated in 644.70534 ms
16/03/31 00:25:36 INFO GenerateUnsafeProjection: Code generated in 428.879976 ms
16/03/31 00:25:36 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 3494 bytes result sent to driver
16/03/31 00:25:36 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1993 ms on localhost (1/1)
16/03/31 00:25:37 INFO DAGScheduler: ResultStage 0 (collect at RDD2DataFrameByProgramatically.java:57) finished in 2.106 s
16/03/31 00:25:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/03/31 00:25:37 INFO DAGScheduler: Job 0 finished: collect at RDD2DataFrameByProgramatically.java:57, took 2.850865 s
[2,Hadoop,11]
16/03/31 00:25:37 INFO SparkContext: Invoking stop() from shutdown hook
16/03/31 00:25:37 INFO SparkUI: Stopped Spark web UI at http://192.168.56.1:4040
16/03/31 00:25:37 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/03/31 00:25:37 INFO MemoryStore: MemoryStore cleared
16/03/31 00:25:37 INFO BlockManager: BlockManager stopped
16/03/31 00:25:37 INFO BlockManagerMaster: BlockManagerMaster stopped
16/03/31 00:25:37 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/03/31 00:25:37 INFO SparkContext: Successfully stopped SparkContext
16/03/31 00:25:37 INFO ShutdownHookManager: Shutdown hook called
16/03/31 00:25:37 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/03/31 00:25:37 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/03/31 00:25:37 INFO ShutdownHookManager: Deleting directory C:\Users\think\AppData\Local\Temp\spark-35a31d69-0074-474c-bde2-8b4e70e15e9d
The notes above are from Lesson 60 of Wang Jialin's DT Big Data Dream Factory course "IMF Legendary Action".
Wang Jialin is the China evangelist for Spark, Flink, Docker, and Android technologies; president and chief expert of the Spark Asia-Pacific Research Institute; founder of DT Big Data Dream Factory; a source-code-level expert in Android software/hardware integration; an English pronunciation magician; and a fitness enthusiast.
WeChat public account: DT_Spark
Phone: 18610086859
QQ: 1740415547
WeChat: 18610086859
Sina Weibo: ilovepains