第69课:SparkSQL通过Hive数据源实战学习笔记

本篇博客是关于SparkSQL通过Hive数据源进行实战学习的笔记,内容包括使用SparkSQL创建和操作Hive表、加载数据、执行join查询,并将结果保存为新的Hive表。示例代码展示了如何处理people.txt和peoplescores.txt文件数据,以及如何通过shell脚本提交Spark应用。文末的执行日志记录了Spark Driver和Executor的启动、任务调度和执行过程。


本期内容:

1 SparkSQL操作Hive解析

2 SparkSQL操作Hive实战

 

数据源:/home/richard/slq/spark/people.txt和/home/richard/slq/spark/peoplescores.txt两个文件。

people.txt的文件内容:

Michael 29
Andy 30
Justin 19

peoplescores.txt的文件内容:

Michael 99
Andy 97
Justin 68

注:people.txt和peoplescores.txt的文件内容以Tab键为分隔符。
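
注意:Hive文本表默认的字段分隔符是\001(Ctrl-A)而不是Tab。如果按下文代码建表并加载这两个Tab分隔的文件后,查询出的age/score字段为NULL,可以尝试在建表时显式指定分隔符。下面是一个示意写法(假设沿用下文代码中的hiveContext和表结构,并非原文代码):

// 示意:建表时显式指定Tab为字段分隔符(假设性写法,仅供参考)
hiveContext.sql("CREATE TABLE IF NOT EXISTS people(name STRING, age INT) "
  + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
hiveContext.sql("CREATE TABLE IF NOT EXISTS peoplescores(name STRING, score INT) "
  + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")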

 

代码:

package com.dt.spark.SparkApps.sql

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

// 使用Scala的方式开发实战对DataFrame的操作
object SparkSQL2Hive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()  //创建SparkConf对象
    conf.setAppName("SparkSQL2Hive")  //设置应用程序名
    conf.setMaster("spark://slq1:7077")  //设置集群的Master
    val sc = new SparkContext(conf)  //创建SparkContext对象

    //在目前企业级大数据Spark开发的时候,绝大多数情况下是采用Hive作为数据仓库
    //Spark提供了对Hive的支持,通过HiveContext可以直接操作Hive中的数据
    //第一:基于HiveContext可以使用sql/hql两种方式来编写SQL语句对Hive进行操作,
    //包括创建表、删除表、往表里导入数据,以及用SQL语法构造各种SQL语句对表中的数据进行CRUD操作
    //第二:也可以直接通过saveAsTable的方式把DataFrame中的数据保存到Hive数据仓库中
    //第三:可以直接通过HiveContext.table方法加载Hive中的表而生成DataFrame
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("use hive")
    hiveContext.sql("DROP TABLE IF EXISTS people")
    hiveContext.sql("CREATE TABLE IF NOT EXISTS people(name STRING, age INT)")
    hiveContext.sql("LOAD DATA LOCAL INPATH '/home/richard/slq/spark/people.txt' INTO TABLE people")
    //把本地数据加载到Hive中(背后实际上发生了数据的拷贝)
    //当然也可以通过LOAD DATA INPATH去获得HDFS等上面的数据到Hive(此时发生的是数据的移动)

    hiveContext.sql("DROP TABLE IF EXISTS peoplescores")
    hiveContext.sql("CREATE TABLE IF NOT EXISTS peoplescores(name STRING, score INT)")
    hiveContext.sql("LOAD DATA LOCAL INPATH '/home/richard/slq/spark/peoplescores.txt' INTO TABLE peoplescores")

    //通过HiveContext使用join直接基于Hive中的两张表进行操作,获得分数大于90的人的name、age、score
    val resultDF = hiveContext.sql("SELECT pi.name, pi.age, ps.score "
      + "FROM people pi JOIN peoplescores ps ON pi.name = ps.name WHERE ps.score > 90")

    //通过saveAsTable创建一张Hive Managed Table,数据存放的位置和元数据都由Hive管理
    //当删除该表时,数据也会一起被删除(磁盘上的数据不再存在)
    hiveContext.sql("DROP TABLE IF EXISTS peopleinformationresult")
    resultDF.saveAsTable("peopleinformationresult")

    //使用HiveContext的table方法可以直接读取Hive中的表并生成DataFrame
    //读取的数据就可以进行机器学习、图计算、各种复杂ETL等操作
    val dataFrameHive = hiveContext.table("peopleinformationresult")
    dataFrameHive.show()
  }
}
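
补充:resultDF.saveAsTable(...)在Spark 1.6中已被标记为过时(deprecated),等价且更推荐的写法是通过DataFrameWriter保存。下面是一个示意写法(非原文代码,仅供参考):

import org.apache.spark.sql.SaveMode

// 示意:用DataFrameWriter以覆盖方式把join结果保存为Hive管理表
resultDF.write.mode(SaveMode.Overwrite).saveAsTable("peopleinformationresult")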

 

创建一个shell脚本SparkAppsScala.sh,内容如下(注意--class要与代码中的包名和对象名一致):

/home/richard/spark-1.6.0/bin/spark-submit --class com.dt.spark.SparkApps.sql.SparkSQL2Hive --master spark://slq1:7077 /home/richard/slq/spark/160327/SparkSQL2Hive.jar

通过./SparkAppsScala.sh执行,执行时的log如下:

[richard@slq1 160327]$ ./SparkAppsScala.sh

16/04/09 23:44:11 INFO spark.SparkContext: Running Spark version 1.6.0

16/04/09 23:44:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

16/04/09 23:44:17 INFO spark.SecurityManager: Changing view acls to: richard

16/04/09 23:44:17 INFO spark.SecurityManager: Changing modify acls to: richard

16/04/09 23:44:17 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(richard); users with modify permissions: Set(richard)

16/04/09 23:44:25 INFO util.Utils: Successfully started service 'sparkDriver' on port 35849.

16/04/09 23:44:30 INFO slf4j.Slf4jLogger: Slf4jLogger started

16/04/09 23:44:31 INFO Remoting: Starting remoting

16/04/09 23:44:34 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.121:48550]

16/04/09 23:44:34 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 48550.

16/04/09 23:44:35 INFO spark.SparkEnv: Registering MapOutputTracker

16/04/09 23:44:35 INFO spark.SparkEnv: Registering BlockManagerMaster

16/04/09 23:44:35 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-a5fed17f-9ca9-4356-8433-94184add459c

16/04/09 23:44:35 INFO storage.MemoryStore: MemoryStore started with capacity 517.4 MB

16/04/09 23:44:37 INFO spark.SparkEnv: Registering OutputCommitCoordinator

16/04/09 23:44:40 INFO server.Server: jetty-8.y.z-SNAPSHOT

16/04/09 23:44:40 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040

16/04/09 23:44:40 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.

16/04/09 23:44:40 INFO ui.SparkUI: Started SparkUI at http://192.168.1.121:4040

16/04/09 23:44:41 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-03bc88ca-b3b3-410e-b8f7-10f85be695e3/httpd-ea4628a0-346c-4cf9-a951-092ab9f72f2a

16/04/09 23:44:41 INFO spark.HttpServer: Starting HTTP Server

16/04/09 23:44:41 INFO server.Server: jetty-8.y.z-SNAPSHOT

16/04/09 23:44:41 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:60023

16/04/09 23:44:41 INFO util.Utils: Successfully started service 'HTTP file server' on port 60023.

16/04/09 23:44:41 INFO spark.SparkContext: Added JAR file:/home/richard/slq/spark/160327/SparkSQL2Hive.jar at http://192.168.1.121:60023/jars/SparkSQL2Hive.jar with timestamp 1460216681926

16/04/09 23:44:42 INFO client.AppClient$ClientEndpoint: Connecting to master spark://slq1:7077...

16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160409234445-0002

16/04/09 23:44:45 INFO client.AppClient$ClientEndpoint: Executor added: app-20160409234445-0002/0 on worker-20160409171905-192.168.1.122-40769 (192.168.1.122:40769) with 1 cores

16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160409234445-0002/0 on hostPort 192.168.1.122:40769 with 1 cores, 1024.0 MB RAM

16/04/09 23:44:45 INFO client.AppClient$ClientEndpoint: Executor added: app-20160409234445-0002/1 on worker-20160409171920-192.168.1.121-58385 (192.168.1.121:58385) with 1 cores

16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160409234445-0002/1 on hostPort 192.168.1.121:58385 with 1 cores, 1024.0 MB RAM

16/04/09 23:44:45 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 40241.

16/04/09 23:44:45 INFO netty.NettyBlockTransferService: Server created on 40241

16/04/09 23:44:45 INFO client.AppClient$ClientEndpoint: Executor added: app-20160409234445-0002/2 on worker-20160409171905-192.168.1.123-55110 (192.168.1.123:55110) with 1 cores

16/04/09 23:44:45 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160409234445-0002/2 on hostPort 192.168.1.123:55110 with 1 cores, 1024.0 MB RAM

16/04/09 23:44:46 INFO storage.BlockManagerMaster: Trying to register BlockManager

16/04/09 23:44:46 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.121:40241 with 517.4 MB RAM, BlockManagerId(driver, 192.168.1.121, 40241)

16/04/09 23:44:46 INFO storage.BlockManagerMaster: Registered BlockManager

16/04/09 23:44:47 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160409234445-0002/2 is now RUNNING

16/04/09 23:44:47 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160409234445-0002/0 is now RUNNING

16/04/09 23:44:47 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160409234445-0002/1 is now RUNNING

16/04/09 23:44:52 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0

16/04/09 23:45:16 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slq3:44752) with ID 2

16/04/09 23:45:16 INFO storage.BlockManagerMasterEndpoint: Registering block manager slq3:43259 with 517.4 MB RAM, BlockManagerId(2, slq3, 43259)

16/04/09 23:45:19 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slq2:34148) with ID 0

16/04/09 23:45:19 INFO storage.BlockManagerMasterEndpoint: Registering block manager slq2:42107 with 517.4 MB RAM, BlockManagerId(0, slq2, 42107)

16/04/09 23:45:24 INFO hive.HiveContext: Initializing execution hive, version 1.2.1

16/04/09 23:45:25 INFO client.ClientWrapper: Inspected Hadoop version: 2.6.0

16/04/09 23:45:25 INFO client.ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop23Shims for Hadoop version 2.6.0

16/04/09 23:45:33 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore

16/04/09 23:45:34 INFO metastore.ObjectStore: ObjectStore, initialize called

16/04/09 23:45:38 INFO DataNucleus.Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored

16/04/09 23:45:38 INFO DataNucleus.Persistence: Property datanucleus.cache.level2 unknown - will be ignored

16/04/09 23:45:41 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)

16/04/09 23:45:49 WARN DataNucleus.Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)

16/04/09 23:45:57 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (slq1:42766) with ID 1

16/04/09 23:45:58 INFO storage.BlockManagerMasterEndpoint: Registering block manager slq1:55585 with 517.4 MB RAM, BlockManagerId(1, slq1, 55585)

16/04/09 23:46:15 INFO metastore.ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"

16/04/09 23:46:29 INFO DataNucleus.Datastore: The class "org.apache.ha
