Analysis of Hive Query Statements (Conditional Queries and Joins)

Connect to the service

beeline -u jdbc:hive2://hadoop02:10000 -n parallels
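Once connected, a quick sanity check confirms the session is usable (a minimal sketch; it only assumes the default database exists):

-- list databases and confirm which one the session is using
SHOW DATABASES;
USE default;
SELECT current_database();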

Data preparation

hive_user

1   xzm   21   12   '2016-08-12 14:12:12'
2   xzy   20   13   '2016-08-12 14:12:14'
3   lvy   19   9    '2016-08-12 14:12:13'

hive_city

12   hangzhou    '2016-08-12 14:12:13'
13   wenzhou     '2016-08-12 14:12:19'
19   shengyang   '2016-08-12 14:12:20'

Create the fact table

-------------------------

CREATE EXTERNAL TABLE `hive_user`(
  userid INT,
  name STRING,
  age INT,
  city_id INT,
  unixtime STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
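To verify that the table was registered with the intended SerDe, delimiters, and storage formats, Hive's built-in metadata commands can be used (a small check, not part of the original walkthrough):

-- columns, SerDe properties, input/output formats and HDFS location
DESCRIBE FORMATTED hive_user;
-- or reproduce the full DDL as Hive stored it
SHOW CREATE TABLE hive_user;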

Create the partitioned fact table

-------------------------

CREATE EXTERNAL TABLE `hive_user_p`(
  userid INT,
  name STRING,
  age INT,
  city_id INT,
  unixtime STRING)
PARTITIONED BY (
  `dt` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Create the city dimension table

-------------------------

CREATE EXTERNAL TABLE `hive_city`(
  id INT,
  name STRING,
  unixtime STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Create the partitioned city dimension table

-------------------------

CREATE EXTERNAL TABLE `hive_city_p`(
  id INT,
  name STRING,
  unixtime STRING)
PARTITIONED BY (
  `dt` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Load the raw data

load data local inpath 'test_data/hive_user_data' into table hive_user;

load data local inpath 'test_data/hive_city_data' into table hive_city;
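After loading, a quick spot check helps confirm the rows parsed correctly; this assumes the sample files above are tab-delimited, matching field.delim='\t':

-- columns that come back all NULL usually indicate a delimiter mismatch
SELECT * FROM hive_user LIMIT 10;
SELECT * FROM hive_city LIMIT 10;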

Load data into the partitioned tables

insert overwrite table hive_user_p partition (dt)
select *, date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 1) as dt
from hive_user;

insert overwrite table hive_city_p partition (dt)
select *, date_sub(from_unixtime(unix_timestamp(), 'yyyy-MM-dd'), 1) as dt
from hive_city;
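These INSERT statements rely on fully dynamic partitioning (no static value for dt), which Hive rejects in its default strict mode; a sketch of the settings usually required, plus a follow-up check of the partitions that were written:

-- allow fully dynamic partitions before running the INSERT ... PARTITION (dt) statements
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- afterwards, confirm the partitions created by the dynamic insert
SHOW PARTITIONS hive_user_p;
SHOW PARTITIONS hive_city_p;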

join query

SELECT up.name, cp.name
FROM hive_user_p up
JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';

left join

SELECT up.name, cp.name
FROM hive_user_p up
left JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';

right join

SELECT up.name, cp.name
FROM hive_user_p up
right JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';

full join

SELECT up.name, cp.name
FROM hive_user_p up
FULL JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';

0: jdbc:hive2://hadoop02:10000> SELECT up.name, cp.name

0: jdbc:hive2://hadoop02:10000> FROM hive_user_p up

0: jdbc:hive2://hadoop02:10000> FULL JOIN hive_city_p cp

0: jdbc:hive2://hadoop02:10000> ON up.city_id = cp.id

0: jdbc:hive2://hadoop02:10000> where up.dt= '2016-09-20';

Query ID = parallels_20160922010116_a8d05b9e-7006-49a3-ac24-0ffa7264bd42

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks not specified. Estimated from input data size: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

INFO : Number of reduce tasks not specified. Estimated from input data size: 1

INFO : In order to change the average load for a reducer (in bytes):

INFO : set hive.exec.reducers.bytes.per.reducer=<number>

INFO : In order to limit the maximum number of reducers:

INFO : set hive.exec.reducers.max=<number>

INFO : In order to set a constant number of reducers:

INFO : set mapreduce.job.reduces=<number>

WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

Starting Job = job_1474468001307_0007, Tracking URL = http://hadoop02:8088/proxy/application_1474468001307_0007/

Kill Command = /home/parallels/bigdata/hadoop/current/bin/hadoop job -kill job_1474468001307_0007

INFO : number of splits:2

INFO : Submitting tokens for job: job_1474468001307_0007

INFO : The url to track the job: http://hadoop02:8088/proxy/application_1474468001307_0007/

INFO : Starting Job = job_1474468001307_0007, Tracking URL = http://hadoop02:8088/proxy/application_1474468001307_0007/

INFO : Kill Command = /home/parallels/bigdata/hadoop/current/bin/hadoop job -kill job_1474468001307_0007

Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1

2016-09-22 01:01:45,237 Stage-1 map = 0%, reduce = 0%

INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1

INFO : 2016-09-22 01:01:45,237 Stage-1 map = 0%, reduce = 0%

2016-09-22 01:01:56,320 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.28 sec

INFO : 2016-09-22 01:01:56,320 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.28 sec

2016-09-22 01:02:09,449 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.88 sec

2016-09-22 01:02:12,440 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.88 sec

INFO : 2016-09-22 01:02:09,449 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.88 sec

INFO : 2016-09-22 01:02:12,440 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.88 sec

MapReduce Total cumulative CPU time: 4 seconds 880 msec

Ended Job = job_1474468001307_0007

MapReduce Jobs Launched:

Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 10.03 sec HDFS Read: 14523 HDFS Write: 32 SUCCESS

Total MapReduce CPU Time Spent: 10 seconds 30 msec

OK

INFO : MapReduce Total cumulative CPU time: 4 seconds 880 msec

INFO : Ended Job = job_1474468001307_0007

+----------+-----------+--+
| up.name  | cp.name   |
+----------+-----------+--+
| lvy      | NULL      |
| xzm      | hangzhou  |
| xzy      | wenzhou   |
+----------+-----------+--+
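Note that the unmatched city row (19, shengyang) is missing even though this is a FULL JOIN: the WHERE up.dt = '2016-09-20' predicate is applied after the join and discards rows where up.dt is NULL. To keep unmatched rows on both sides, filter each table before joining, for example (a sketch reusing the tables above):

SELECT up.name, cp.name
FROM (SELECT * FROM hive_user_p WHERE dt = '2016-09-20') up
FULL JOIN (SELECT * FROM hive_city_p WHERE dt = '2016-09-20') cp
ON up.city_id = cp.id;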

View the execution plan

explain SELECT up.name, cp.name
FROM hive_user_p up
right JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';
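For more detail, for instance to confirm that the dt predicate actually prunes partitions, Hive also offers a more verbose plan (assuming it is available in this Hive version):

-- EXTENDED adds physical details such as file paths and partition descriptors to the plan
explain extended SELECT up.name, cp.name
FROM hive_user_p up
right JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';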

How to check whether the data has problems (data skew)

-------------------------

On the ResourceManager page, find the corresponding job and click into it, which leads to a page like http://hadoop03:19888/jobhistory/jobcounters/job_1474468001307_0007

Drill down one more level to see the counters.

For mappers, look at HDFS bytes read.

For reducers, look at shuffle bytes or reduce input records.

Then go back to the task list for the job, sort the tasks by elapsed time, and pick the longest-running task plus one of normal length.

Open the counter page for each and compare them.

If the amount of data processed is proportional to the run time, the data is skewed.

For mappers, the likely cause is uneven file sizes; check whether the file format is splittable.

For reducers, there is a switch that performs a two-stage (partial then final) aggregation:

hive.groupby.skewindata = true

Alternatively, if all tasks take roughly the same time, simply run more reducers.

With an unsplittable format, even a 100 GB file must be processed by a single task.

Parquet, with its block-level indexes, can be split at block boundaries; with a splittable storage format, an oversized file is divided by block size across different tasks.

Parquet or ORC is recommended because both formats are splittable. The settings mentioned above are sketched below.
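The numeric values and the hive_user_orc table name below are placeholders to adapt to the cluster:

-- two-stage aggregation for skewed GROUP BY keys
set hive.groupby.skewindata=true;

-- raise reducer parallelism when all tasks are uniformly slow (placeholder value)
set mapreduce.job.reduces=20;
-- or lower the data volume each reducer handles (placeholder: 256 MB)
set hive.exec.reducers.bytes.per.reducer=268435456;

-- rewrite a large text table into a splittable columnar format (hypothetical table name)
CREATE TABLE hive_user_orc STORED AS ORC AS SELECT * FROM hive_user;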
