Lesson 58: Hands-On DataFrame Development in the IDE with Java and Scala — Study Notes
Topics covered:
1. Hands-on DataFrame development in Java
2. Hands-on DataFrame development in Scala
Since Spark 1.3, most Spark SQL programming has been based on DataFrames, because the DataFrame API is both efficient and powerful.
Spark SQL can also serve as a distributed query engine, and in practice it is usually used together with Hive.
Reasons to develop Spark applications in Java:
1. Most enterprise production environments are Java-centric.
2. Java code is more straightforward and easier to understand.
HiveContext is a subclass of SQLContext.
The Spark documentation recommends using HiveContext whenever possible; in general you can use HiveContext directly instead of SQLContext.
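As a minimal sketch of that recommendation (Spark 1.x API; the application name is illustrative, and running it requires a Spark cluster, so treat this as an outline rather than a tested program):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HiveContextSketch")
    val sc = new SparkContext(conf)
    // HiveContext extends SQLContext: everything SQLContext can do,
    // plus HiveQL, Hive UDFs, and reading from Hive tables.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SHOW TABLES").show()
  }
}
```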
./spark-submit --files lets you ship a specific hive-site.xml with the job, which suggests a general technique: any special configuration file can be supplied this way. Note that a file passed like this overrides the default one (Spark's conf directory also contains a hive-site.xml).
spark-submit --class com.dt.spark.sql.DataFrameOps --master spark://slq1:7077 /home/richard/spark-1.6.0/SparkApps/wordCount.jar
Configure Hive's data sources as needed; if you configure nothing, Spark automatically falls back to the Hive configuration shipped with Spark.
Hive must be installed on whichever machine you submit the job from.
The Java code is as follows:
package com.dt.spark.SparkApps;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameOps {
    public static void main(String[] args) {
        // Create a SparkConf to read system configuration and set the application name.
        SparkConf conf = new SparkConf().setAppName("DataFrameOps");
        // Create the JavaSparkContext, the cornerstone of the entire Driver.
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Create the SQLContext used for SQL analysis.
        SQLContext sqlContext = new SQLContext(sc);
        // Create a DataFrame; a DataFrame can be thought of as a table.
        // A DataFrame can be built from many formats; for example, a JSON
        // file can be read directly into a DataFrame.
        // SQLContext supports only the SQL dialect; HiveContext supports
        // multiple dialects (HiveQL by default, configurable).
        DataFrame df = sqlContext.read().json("hdfs://slq1:9000/user/people.json");
        // select * from table
        df.show();
        // describe table
        df.printSchema();
        df.select("name").show();
        // select name, age + 10 from table
        df.select(df.col("name"), df.col("age").plus(10)).show();
        // select * from table where age > 20
        df.filter(df.col("age").gt(20)).show();
        // select age, count(*) from table group by age
        df.groupBy(df.col("age")).count().show();
    }
}
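The examples assume a people.json file on HDFS whose contents the notes do not show. A plausible file, matching the sample shipped with Spark, would have one JSON object per line (the line-delimited format Spark's JSON reader expects):

```json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```

With this data, records missing an age show up as null in the DataFrame.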
Now the Scala version:
package com.dt.spark.SparkApps.sql

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext

// Hands-on DataFrame operations, this time in Scala.
object DataFrameOps {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("DataFrameOps")
    conf.setMaster("spark://slq1:7077")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("hdfs://slq1:9000/user/data/SparkResources/people.json")
    df.show()
    df.printSchema()
    df.select("name").show()
    df.select(df("name"), df("age") + 10).show()
    df.filter(df("age") > 10).show()
    df.groupBy("age").count().show()
  }
}
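Besides the DSL calls above, the same queries can be written in plain SQL by registering the DataFrame as a temporary table (Spark 1.x API; the table name "people" is illustrative, and this too needs a running cluster, so it is a sketch rather than a tested program):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DataFrameSqlSketch"))
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("hdfs://slq1:9000/user/data/SparkResources/people.json")
    // Register the DataFrame under a table name so it can be queried with SQL.
    df.registerTempTable("people")
    // Equivalent to df.filter(df("age") > 10).select("name").show()
    sqlContext.sql("SELECT name FROM people WHERE age > 10").show()
  }
}
```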
Export the project as a JAR:
In Eclipse choose File -> Export, select JAR file as the destination, and click Next:
Select the SparkAppsScala project, choose an export path, and click Finish.
Copy the generated SparkAppsScala.jar to the virtual machine.
Write a shell script to submit it:
/home/richard/spark-1.6.0/bin/spark-submit --class com.dt.spark.SparkApps.sql.DataFrameOps --master spark://slq1:7077 /home/richard/slq/spark/SparkAppsScala.jar
In a production environment, the submit command typically looks like this:
/home/richard/spark-1.6.0/bin/spark-submit --class com.dt.spark.SparkApps.sql.DataFrameOps --files /home/richard/spark-1.6.0/conf/hive-site.xml --driver-class-path /home/richard/hive-1.2.1/lib/mysql-connector-java-5.1.32-bin.jar --master spark://slq1:7077 /home/richard/slq/spark/SparkAppsScala.jar
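For convenience, the production command can be wrapped in a script like the SparkAppsScala.sh invoked below (paths taken from the notes; adjust them to your own environment). A sketch that generates such a wrapper:

```shell
#!/bin/bash
# Write a wrapper script around the production spark-submit command
# (all paths are the ones used in these notes).
cat > SparkAppsScala.sh <<'EOF'
#!/bin/bash
/home/richard/spark-1.6.0/bin/spark-submit \
  --class com.dt.spark.SparkApps.sql.DataFrameOps \
  --files /home/richard/spark-1.6.0/conf/hive-site.xml \
  --driver-class-path /home/richard/hive-1.2.1/lib/mysql-connector-java-5.1.32-bin.jar \
  --master spark://slq1:7077 \
  /home/richard/slq/spark/SparkAppsScala.jar
EOF
chmod +x SparkAppsScala.sh
```

Keeping the command in a script avoids retyping the long option list and makes the configuration (--files, --driver-class-path) easy to review.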
Running the script produces output like the following:
[richard@slq1 spark]$ ./SparkAppsScala.sh
16/03/27 07:36:38 INFO spark.SparkContext: Running Spark version 1.6.0
16/03/27 07:36:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/27 07:36:49 INFO spark.SecurityManager: Changing view acls to: richard
16/03/27 07:36:49 INFO spark.SecurityManager: Changing modify acls to: richard
16/03/27 07:36:50 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(richard); users with modify permissions: Set(richard)
16/03/27 07:37:00 INFO util.Utils: Successfully started service 'sparkDriver' on port 34946.
16/03/27 07:37:05 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/03/27 07:37:06 INFO Remoting: Starting remoting
16/03/27 07:37:09 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.121:46548]
16/03/27 07:37:10 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 46548.
16/03/27 07:37:10 INFO spark.SparkEnv: Registering MapOutputTracker
16/03/27 07:37:11 INFO spark.SparkEnv: Registering BlockManagerMaster
16/03/27 07:37:11 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-967b95a2-650b-4485-8052-cd9158865d38
16/03/27 07:37:11 INFO storage.MemoryStore: MemoryStore started with capacity 517.4 MB
16/03/27 07:37:13 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/03/27 07:37:17 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/03/27 07:37:18 WARN component.AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:437)
at sun.nio.ch.Net.bind(Net.java:429)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.spark-project.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at org.spark-project.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at org.spark-project.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.spark-project.jetty.server.Server.doStart(Server.java:293)
at org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:123)
at com.dt.spark.SparkApps.sql.DaraFrameOps$.main(DaraFrameOps.scala:13)
at com.dt.spark.SparkApps.sql.DaraFrameOps.main(DaraFrameOps.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/03/27 07:37:18 WARN component.AbstractLifeCycle: FAILED org.spark-project.jetty.server.Server@49bf29c6: java.net.BindException: Address already in use
java.net.BindException: Address already in use
	(stack trace identical to the one above)
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/03/27 07:37:18 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/03/27 07:37:18 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/03/27 07:37:18 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/03/27 07:37:19 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041
16/03/27 07:37:19 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
16/03/27 07:37:19 INFO ui.SparkUI: Started SparkUI at http://192.168.1.121:4041
16/03/27 07:37:19 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-1f459855-2e46-4caa-9b7f-897e3926749c/httpd-6a314baa-db64-40c8-8a36-52541d75c8f7
16/03/27 07:37:19 INFO spark.HttpServer: Starting HTTP Server
16/03/27 07:37:20 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/03/27 07:37:20 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:48259
16/03/27 07:37:20 INFO util.Utils: Successfully started service 'HTTP file server' on port 48259.
16/03/27 07:37:20 INFO spark.SparkContext: Added JAR file:/home/richard/slq/spark/SparkAppsScala.jar at http://192.168.1.121:48259/jars/SparkAppsScala.jar with timestamp 1459035440616
16/03/27 07:37:23 INFO client.AppClient$ClientEndpoint: Connecting to master spark://slq1:7077...
16/03/27 07:37:30 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160327073729-0000
16/03/27 07:37:30 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44773.
16/03/27 07:37:30 INFO netty.NettyBlockTransferService: Server created on 44773
16/03/27 07:37:30 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/03/27 07:37:31 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.121:44773 with 517.4 MB RAM, BlockManagerId(driver, 192.168.1.121, 44773)
16/03/27 07:37:31 INFO storage.BlockManagerMaster: Registered BlockManager
16/03/27 07:37:32 INFO client.AppClient$ClientEndpoint: Executor added: app-20160327073729-0000/0 on worker-20160327012804-192.168.1.121-47541 (192.168.1.121:47541) with 1 cores
16/03/27 07:37:32 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160327073729-0000/0 on hostPort 192.168.1.121:47541 with 1 cores, 1024.0 MB RAM
16/03/27 07:37:32 INFO client.AppClient$ClientEndpoint: Executor added: app-20160327073729-0000/1 on worker-20160327012200-192.168.1.123-59271 (192.168.1.123:59271) with 1 cores
16/03/27 07:37:32 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160327073729-0000/1 on hostPort 192.168.1.123:59271 with 1 cores, 1024.0 MB RAM
16/03/27 07:37:32 INFO client.AppClient$ClientEndpoint: Executor added: app-20160327073729-0000/2 on worker-20160327012159-192.168.1.122-49405 (192.168.1.122:49405) with 1 cores
16/03/27 07:37:32 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160327073729-0000/2 on hostPort 192.168.1.122:49405 with 1 cores, 1024.0 MB RAM
16/03/27 07:37:35 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160327073729-0000/2 is now RUNNING
16/03/27 07:37:35 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160327073729-0000/1 is now RUNNING
16/03/27 07:37:44 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160327073729-0000/0 is now RUNNING
16/03/27 07:37:45 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0