Rewriting HQL with the non-SQL API in spark-shell

When writing a Scala program in spark-shell, how can we query Hive without embedding SQL? Here is an example.

hql

First, let's look at the SQL we want to execute. Here we use query3 from TPC-DS:

     select
           /*+MAPJOIN(dt, item)*/
             dt.d_year
            ,item.i_brand_id brand_id
            ,item.i_brand brand
            ,sum(ss_ext_sales_price) sum_agg
     from
          store_sales
          ,item
          ,date_dim dt
     where dt.d_date_sk = store_sales.ss_sold_date_sk
       and store_sales.ss_item_sk = item.i_item_sk
       and item.i_manufact_id = 436
       and dt.d_moy=12
     group by dt.d_year
          ,item.i_brand
          ,item.i_brand_id
     order by dt.d_year
             ,sum_agg desc
             ,brand_id
     limit 10

Let's first run it in the Hive shell (use the 2 GB database "tpcds_text_2") and look at the result:

Query ID = root_20160829113535_c357e9ac-8f93-4d79-b32c-efee18603f9c
Total jobs = 2
Execution log at: /tmp/root/root_20160829113535_c357e9ac-8f93-4d79-b32c-efee18603f9c.log
2016-08-29 11:35:41 Starting to launch local task to process map join;  maximum memory = 2024800256
2016-08-29 11:35:43 Dump the side-table for tag: 1 with group count: 6200 into file: file:/tmp/root/14cf9218-fe98-4d6d-940c-ec994c2a1552/hive_2016-08-29_11-35-35_363_2500629239737861774-1/-local-10006/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
2016-08-29 11:35:44 Uploaded 1 File to: file:/tmp/root/14cf9218-fe98-4d6d-940c-ec994c2a1552/hive_2016-08-29_11-35-35_363_2500629239737861774-1/-local-10006/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (156018 bytes)
2016-08-29 11:35:44 Dump the side-table for tag: 1 with group count: 32 into file: file:/tmp/root/14cf9218-fe98-4d6d-940c-ec994c2a1552/hive_2016-08-29_11-35-35_363_2500629239737861774-1/-local-10006/HashTable-Stage-3/MapJoin-mapfile31--.hashtable
2016-08-29 11:35:44 Uploaded 1 File to: file:/tmp/root/14cf9218-fe98-4d6d-940c-ec994c2a1552/hive_2016-08-29_11-35-35_363_2500629239737861774-1/-local-10006/HashTable-Stage-3/MapJoin-mapfile31--.hashtable (1644 bytes)
2016-08-29 11:35:44 End of local task; Time Taken: 2.481 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 12
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1471250735061_0008, Tracking URL = http://holodesk01:8088/proxy/application_1471250735061_0008/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1471250735061_0008
Hadoop job information for Stage-3: number of mappers: 3; number of reducers: 12
2016-08-29 11:35:57,022 Stage-3 map = 0%,  reduce = 0%
2016-08-29 11:36:14,828 Stage-3 map = 22%,  reduce = 0%, Cumulative CPU 16.4 sec
2016-08-29 11:36:15,901 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 26.4 sec
2016-08-29 11:36:24,264 Stage-3 map = 100%,  reduce = 83%, Cumulative CPU 49.53 sec
2016-08-29 11:36:25,311 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 57.53 sec
MapReduce Total cumulative CPU time: 57 seconds 530 msec
Ended Job = job_1471250735061_0008
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1471250735061_0009, Tracking URL = http://holodesk01:8088/proxy/application_1471250735061_0009/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1471250735061_0009
Hadoop job information for Stage-4: number of mappers: 2; number of reducers: 1
2016-08-29 11:36:36,161 Stage-4 map = 0%,  reduce = 0%
2016-08-29 11:36:50,781 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 2.96 sec
2016-08-29 11:36:59,168 Stage-4 map = 100%,  reduce = 100%, Cumulative CPU 5.4 sec
MapReduce Total cumulative CPU time: 5 seconds 400 msec
Ended Job = job_1471250735061_0009
MapReduce Jobs Launched: 
Stage-Stage-3: Map: 3  Reduce: 12   Cumulative CPU: 57.53 sec   HDFS Read: 784204960 HDFS Write: 4354 SUCCESS
Stage-Stage-4: Map: 2  Reduce: 1   Cumulative CPU: 5.4 sec   HDFS Read: 14313 HDFS Write: 502 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 2 seconds 930 msec
OK
1998    10003008    exportiunivamalg #8 36243.50026702881
1998    4003001 exportiedu pack #1  34620.07957458496
1998    10014017    edu packamalgamalg #17  31645.54008293152
1998    3003001 exportiexporti #1   30119.83995437622
1998    5003001 exportischolar #1   28706.51041984558
1998    9016005 corpunivamalg #5    27940.650374412537
1998    2001001 amalgimporto #1 27937.100036621094
1998    8004005 edu packnameless #5 22627.530197143555
1998    4002001 importoedu pack #1  21856.61989593506
1998    1002001 importoamalg #1 20252.259941101074
Time taken: 86.116 seconds, Fetched: 10 row(s)

Spark API

Below is the corresponding Scala program:

import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

// Hive warehouse location on HDFS
val warehouseLocation = "hdfs://holodesk01:9000/user/hive/warehouse"

val spark = SparkSession.builder().
  appName("Spark Hive Example").
  config("spark.sql.warehouse.dir", warehouseLocation).
  enableHiveSupport().
  getOrCreate()

import spark.implicits._
import spark.sql

/*
Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.

spark.catalog.refreshTable("tpcds_text_2.item")
spark.catalog.refreshTable("tpcds_text_2.date_dim")
spark.catalog.refreshTable("tpcds_text_2.store_sales")
*/

val item = spark.table("tpcds_text_2.item")
val date_dim = spark.table("tpcds_text_2.date_dim")
val store_sales = spark.table("tpcds_text_2.store_sales")
// Filter the tables first and only then join; this greatly improves performance
val itemFiltered = item.filter($"i_manufact_id" === 436)
// groupBy returns a RelationalGroupedDataset; calling agg on it returns a DataFrame
// containing the grouping columns plus the aggregate columns
date_dim.filter($"d_moy" === 12).
  join(store_sales, $"d_date_sk" === $"ss_sold_date_sk", "inner").
  join(itemFiltered, $"ss_item_sk" === $"i_item_sk", "inner").
  groupBy($"d_year", $"i_brand", $"i_brand_id").
  agg(Map("ss_ext_sales_price" -> "sum")).
  sort($"d_year", $"sum(ss_ext_sales_price)".desc, $"i_brand_id").
  select($"d_year", $"i_brand", $"i_brand_id", $"sum(ss_ext_sales_price)").
  limit(10).
  show()

/***

+------+--------------------+----------+-----------------------+                
|d_year|             i_brand|i_brand_id|sum(ss_ext_sales_price)|
+------+--------------------+----------+-----------------------+
|  1998| exportiunivamalg #8|  10003008|      36243.50026702881|
|  1998|  exportiedu pack #1|   4003001|      34620.07957458496|
|  1998|edu packamalgamal...|  10014017|      31645.54008293152|
|  1998|   exportiexporti #1|   3003001|      30119.83995437622|
|  1998|   exportischolar #1|   5003001|      28706.51041984558|
|  1998|    corpunivamalg #5|   9016005|     27940.650374412537|
|  1998|     amalgimporto #1|   2001001|     27937.100036621094|
|  1998| edu packnameless #5|   8004005|     22627.530197143555|
|  1998|  importoedu pack #1|   4002001|      21856.61989593506|
|  1998|     importoamalg #1|   1002001|     20252.259941101074|
+------+--------------------+----------+-----------------------+

***/
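
The original HQL carries a /*+MAPJOIN(dt, item)*/ hint so that the two small tables are map-joined. A rough DataFrame-side analogue is sketched below for illustration only; the broadcast() hint, the sum(...).as("sum_agg") aggregation and the result name are assumptions of this sketch, not part of the original program:

import org.apache.spark.sql.functions.{broadcast, sum}

// Broadcast the two small, already-filtered dimension tables, mirroring the
// /*+MAPJOIN(dt, item)*/ hint, and give the aggregate an explicit alias
// instead of the generated "sum(ss_ext_sales_price)" column name.
val result = store_sales.
  join(broadcast(date_dim.filter($"d_moy" === 12)), $"ss_sold_date_sk" === $"d_date_sk").
  join(broadcast(itemFiltered), $"ss_item_sk" === $"i_item_sk").
  groupBy($"d_year", $"i_brand", $"i_brand_id").
  agg(sum($"ss_ext_sales_price").as("sum_agg")).
  orderBy($"d_year", $"sum_agg".desc, $"i_brand_id").
  limit(10)

result.show()

With broadcast() the optimizer should choose BroadcastHashJoin for the two dimension tables, which plays a role similar to the map-join local task in the Hive log above.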

Summary

Without embedded SQL, the code is a bit more cumbersome to write. On the other hand, not writing SQL lets us learn how the SQL is parsed and executed, and where it can be optimized.
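
One concrete way to study that parse-and-execute process is Dataset.explain(true), which prints the parsed, analyzed and optimized logical plans as well as the physical plan. A minimal sketch, reusing the result value from the sketch above (again an addition for illustration, not part of the original post):

// Print the parsed, analyzed and optimized logical plans plus the physical
// plan, e.g. to check that the filters run before the joins and that the
// small tables are broadcast.
result.explain(true)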
