Lesson 69: Study Notes on SparkSQL with a Hive Data Source
Topics covered:
1. How SparkSQL operates on Hive
2. Hands-on SparkSQL operations on Hive
Data sources: the two files /home/richard/slq/spark/people.txt and /home/richard/slq/spark/peoplescores.txt.
Contents of people.txt:
Michael 29
Andy 30
Justin 19
Contents of peoplescores.txt:
Michael 99
Andy 97
Justin 68
Note: both people.txt and peoplescores.txt use the Tab character as the field delimiter.
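Since both files are tab-delimited, a quick sanity check before loading them into Hive is to read them as plain text and split each line on the Tab character. Below is a minimal standalone sketch of such a check; the local[*] master and the object name CheckTabDelimited are illustrative assumptions and not part of the lesson code.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper: verify that each line of people.txt splits into exactly two Tab-separated fields.
object CheckTabDelimited {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckTabDelimited").setMaster("local[*]"))
    sc.textFile("/home/richard/slq/spark/people.txt")
      .map(_.split("\t"))
      .collect()
      .foreach {
        case Array(name, age) => println(s"name=$name, age=$age")
        case other => println("unexpected line: " + other.mkString("|"))
      }
    sc.stop()
  }
}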
Code:
package com.dt.spark.SparkApps.sql
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
// Hands-on DataFrame operations against the Hive data source
object SparkSQL2Hive {
def main(args: Array[String]): Unit = {
val conf = new SparkConf() // create the SparkConf object
conf.setAppName("SparkSQL2Hive") // set the application name
conf.setMaster("spark://slq1:7077") // set the cluster master
val sc = new SparkContext(conf) // create the SparkContext object with the above configuration
// In enterprise-level Spark development today, Hive is used as the data warehouse in the vast majority of cases.
// Spark ships with Hive support: through HiveContext, Spark can operate directly on the data stored in Hive.
// First: on top of HiveContext we can write SQL/HQL statements to operate on Hive,
// including creating tables, dropping tables, loading data into tables, and using SQL to perform CRUD on table data.
// Second: a DataFrame can be saved straight into the Hive warehouse via saveAsTable.
// Third: HiveContext.table can load a Hive table directly and produce a DataFrame.
val hiveContext = new HiveContext(sc)
hiveContext.sql("use hive")
hiveContext.sql("DROP TABLE IF EXISTS people")
hiveContext.sql("CREATE TABLE IF NOT EXISTS people(name STRING,age INT)")
hiveContext.sql("LOAD DATA LOCAL INPATH '/home/richard/slq/spark/people.txt' INTO TABLE people")
// Load the local file into Hive (under the hood the data is actually copied).
// Alternatively, LOAD DATA INPATH can pull data from HDFS (or similar) into Hive (in that case the data is moved, not copied).
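// For example, a hypothetical variant that loads from HDFS instead (the path below is illustrative only;
// the file must already exist on HDFS, and it is moved into Hive's warehouse rather than copied):
// hiveContext.sql("LOAD DATA INPATH '/user/richard/spark/people.txt' INTO TABLE people")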
hiveContext.sql("DROP TABLE IF EXISTS peoplescores")
hiveContext.sql("CREATE TABLE IF NOT EXISTS peoplescores(name STRING,score INT)")
hiveContext.sql("LOAD DATA LOCAL INPATH '/home/richard/slq/spark/peoplescores.txt' INTO TABLE peoplescores")
// Join the two Hive tables directly through HiveContext to get the name, age, and score of everyone scoring above 90.
val resultDF = hiveContext.sql("SELECT pi.name, pi.age, ps.score "
+ "FROM people pi JOIN peoplescores ps ON pi.name = ps.name WHERE ps.score > 90")
// saveAsTable creates a Hive managed table: both the data location and the metadata are managed by Hive.
// When such a table is dropped, its data is deleted along with it (the files no longer exist on disk).
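// (For contrast: a hypothetical EXTERNAL table, e.g.
// CREATE EXTERNAL TABLE people_ext(name STRING, age INT) LOCATION '/some/hdfs/path',
// would keep its underlying files when dropped; only the metadata would be removed.)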
hiveContext.sql("DROP TABLE IF EXISTS peopleinformationresult")
resultDF.write.saveAsTable("peopleinformationresult")
// HiveContext's table method reads a table from Hive directly and produces a DataFrame.
// The resulting data can then be used for machine learning, graph computation, various complex ETL jobs, and so on.
val dataFrameHive = hiveContext.table("peopleinformationresult")
dataFrameHive.show()
}
}
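For comparison, the same query can also be written with the DataFrame API instead of an SQL string, starting from the HiveContext.table method mentioned in the comments above. The following is a minimal sketch that assumes the hiveContext and the Hive tables created by the code above; it is not part of the original lesson code.

// Hypothetical DataFrame-API version of the join (reuses hiveContext and the tables created above).
val people = hiveContext.table("people")
val scores = hiveContext.table("peoplescores")
val result = people.join(scores, people("name") === scores("name"))
  .where(scores("score") > 90)
  .select(people("name"), people("age"), scores("score"))
result.show()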
Create a shell script:
/home/richard/spark-1.6.0/bin/spark-submit --class com.dt.spark.SparkApps.sql.SparkSQL2Hive --master spark://slq1:7077 /home/richard/slq/spark/160327/SparkSQL2Hive.jar
Run it with ./SparkAppsScala.sh. The log from the run is shown below:
[richard@slq1 160327]$ ./SparkAppsScala.sh
16/04/09 23:44:11 INFO spark.SparkContext: Running Spark version 1.6.0
16/04/09 23:44:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/09 23:44:17 INFO spark.SecurityManager: Changing view acls to: richard
16/04/09 23:44:17 INFO spark.SecurityManager: Changing modify acls to: richard
16/04/09 23:44:17 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(richard); users with modify permissions: Set(richard)
16/04/09 23:44:25 INFO util.Utils: Successfully started service 'sparkDriver' on port 35849.
16/04/09 23:44:30 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/04/09 23:44:31 INFO Remoting: Starting remoting
16/04/09 23:44:34 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.121:48550]
16/04/09 23:44:34 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 48550.
16/04/09 23:44:35 INFO spark.SparkEnv: Registering MapOutputTracker
16/04/09 23:44:35 INFO spark.SparkEnv: Registering BlockManagerMaster
16/04/09 23:44:35 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-a5fed17f-9ca9-4356-8433-94184add459c
16/04/09 23:44:35 INFO storage.MemoryStore: MemoryStore started with capacity 517.4 MB
16/04/09 23:44:37 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/04/09 23:44:40 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/04/09 23:44:40 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/04/09 23:44:40 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/04/09 23:44:40 INFO ui.SparkUI: Started SparkUI at http://192.168.1.121:4040
16/04/09 23:44:41 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-03bc88ca-b3b3-410e-b8f7-10f85be695e3/httpd-ea4628a0-346c-4cf9-a951-092ab9f72f2a
16/04/09 23:44:41 INFO spark.HttpServer: Starting HTTP Server
16/04/09 23:44:41 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/04/09 23:44:41 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:60023
16/04/09 23:44:41 INFO util.Utils: Successfully started service 'HTTP file server' on port 60023.
16/04/09 23:44:41 INFO spark.SparkContext: Added JAR file:/home/richard/slq/spark/160327/SparkSQL2Hive.jar at http://192.168.1.121:60023/jars/SparkSQL2Hive.jar with timestamp 1460216681926
16/04/09 23:44:42 INFO client.AppClient$ClientEndpoint: Connecting to master spark://slq1:7077...
16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160409234445-0002
16/04/09 23:44:45 INFO client.AppClient$ClientEndpoint: Executor added: app-20160409234445-0002/0 on worker-20160409171905-192.168.1.122-40769 (192.168.1.122:40769) with 1 cores
16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160409234445-0002/0 on hostPort 192.168.1.122:40769 with 1 cores, 1024.0 MB RAM
16/04/09 23:44:45 INFO client.AppClient$ClientEndpoint: Executor added: app-20160409234445-0002/1 on worker-20160409171920-192.168.1.121-58385 (192.168.1.121:58385) with 1 cores
16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160409234445-0002/1 on hostPort 192.168.1.121:58385 with 1 cores, 1024.0 MB RAM
16/04/09 23:44:45 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40241.
16/04/09 23:44:45 INFO netty.NettyBlockTransferService: Server created on 40241
16/04/09 23:44:45 INFO client.AppClient$ClientEndpoint: Executor added: app-20160409234445-0002/2 on worker-20160409171905-192.168.1.123-55110 (192.168.1.123:55110) with 1 cores
16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160409234445-0002/2 on hostPort 192.168.1.123:55110 with 1 cores, 1024.0 MB RAM
16/04/09 23:44:46 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/04/09 23:44:46 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.121:40241 with 517.4 MB RAM, BlockManagerId(driver, 192.168.1.121, 40241)
16/04/09 23:44:46 INFO storage.BlockManagerMaster: Registered BlockManager
16/04/09 23:44:47 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160409234445-0002/2 is now RUNNING
16/04/09 23:44:47 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160409234445-0002/0 is now RUNNING
16/04/09 23:44:47 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160409234445-0002/1 is now RUNNING
16/04/09 23:44:52 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/04/09 23:45:16 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slq3:44752) with ID 2
16/04/09 23:45:16 INFO storage.BlockManagerMasterEndpoint: Registering block manager slq3:43259 with 517.4 MB RAM, BlockManagerId(2, slq3, 43259)
16/04/09 23:45:19 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slq2:34148) with ID 0
16/04/09 23:45:19 INFO storage.BlockManagerMasterEndpoint: Registering block manager slq2:42107 with 517.4 MB RAM, BlockManagerId(0, slq2, 42107)
16/04/09 23:45:24 INFO hive.HiveContext: Initializing execution hive, version 1.2.1
16/04/09 23:45:25 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0
16/04/09 23:45:25 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0
16/04/09 23:45:33 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/04/09 23:45:34 INFO metastore.ObjectStore: ObjectStore, initialize called
16/04/09 23:45:38 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/04/09 23:45:38 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/04/09 23:45:41 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/09 23:45:49 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/04/09 23:45:57 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slq1:42766) with ID 1
16/04/09 23:45:58 INFO storage.BlockManagerMasterEndpoint: Registering block manager slq1:55585 with 517.4 MB RAM, BlockManagerId(1, slq1, 55585)
16/04/09 23:46:15 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/04/09 23:46:29 INFO DataNucleus.Datastore: The class "org.apache.ha