Connect to the service
beeline -u jdbc:hive2://hadoop02:10000 -n parallels
Prepare the data
hive_user (columns: userid, name, age, city_id, unixtime)
1 xzm 21 12 '2016-08-12 14:12:12'
2 xzy 20 13 '2016-08-12 14:12:14'
3 lvy 19 9 '2016-08-12 14:12:13'
hive_city (columns: id, name, unixtime)
12 hangzhou '2016-08-12 14:12:13'
13 wenzhou '2016-08-12 14:12:19'
19 shengyang '2016-08-12 14:12:20'
Create the fact table
-------------------------
CREATE EXTERNAL TABLE `hive_user`(
  userid INT,
  name STRING,
  age INT,
  city_id INT,
  unixtime STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
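To double-check that the SerDe and delimiter settings were picked up, standard Hive inspection commands (not part of the original notes) can be run right after the DDL:
-- Show the SerDe, delimiters, and HDFS location of the external table
DESCRIBE FORMATTED hive_user;
-- Print the DDL exactly as Hive stored it
SHOW CREATE TABLE hive_user;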
Create the partitioned fact table
-------------------------
CREATE EXTERNAL TABLE `hive_user_p`(
  userid INT,
  name STRING,
  age INT,
  city_id INT,
  unixtime STRING)
PARTITIONED BY (
  `dt` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Create the city dimension table
-------------------------
CREATE EXTERNAL TABLE `hive_city`(
  id INT,
  name STRING,
  unixtime STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Create the partitioned city dimension table
-------------------------
CREATE EXTERNAL TABLE `hive_city_p`(
  id INT,
  name STRING,
  unixtime STRING)
PARTITIONED BY (
  `dt` STRING)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='\t',
  'serialization.format'='\t')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
Load the raw data
load data local inpath 'test_data/hive_user_data' into table hive_user;
load data local inpath 'test_data/hive_city_data' into table hive_city;
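LOAD DATA LOCAL INPATH copies the local file into the table's storage directory and appends to whatever is already there; adding OVERWRITE before INTO TABLE would replace it instead. A quick sanity check (not in the original notes):
SELECT * FROM hive_user LIMIT 10;
SELECT * FROM hive_city LIMIT 10;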
Load data into the partitioned tables
insert overwrite table hive_user_p partition (dt)
select *, date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1) as dt
from hive_user;
insert overwrite table hive_city_p partition (dt)
select *, date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'),1) as dt
from hive_city;
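Both inserts determine the dt partition dynamically from the SELECT rather than a static PARTITION (dt='...') clause. In Hive's default strict mode a fully dynamic partition spec is rejected, so the usual switches below may be needed first (standard Hive settings, not shown in the original notes), followed by a check of the partitions that were written:
-- Allow a fully dynamic partition spec such as PARTITION (dt)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Verify which dt partitions the inserts created
SHOW PARTITIONS hive_user_p;
SHOW PARTITIONS hive_city_p;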
Join query
SELECT up.name, cp.name
FROM hive_user_p up
JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';
left join
SELECT up.name, cp.name
FROM hive_user_p up
left JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';
right join
SELECT up.name, cp.name
FROM hive_user_p up
right JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';
Full outer join
SELECT up.name, cp.name
FROM hive_user_p up
FULL JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';
0: jdbc:hive2://hadoop02:10000> SELECT up.name, cp.name
0: jdbc:hive2://hadoop02:10000> FROM hive_user_p up
0: jdbc:hive2://hadoop02:10000> FULL JOIN hive_city_p cp
0: jdbc:hive2://hadoop02:10000> ON up.city_id = cp.id
0: jdbc:hive2://hadoop02:10000> where up.dt= '2016-09-20';
Query ID = parallels_20160922010116_a8d05b9e-7006-49a3-ac24-0ffa7264bd42
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
WARN : Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
Starting Job = job_1474468001307_0007, Tracking URL = http://hadoop02:8088/proxy/application_1474468001307_0007/
Kill Command = /home/parallels/bigdata/hadoop/current/bin/hadoop job -kill job_1474468001307_0007
INFO : number of splits:2
INFO : Submitting tokens for job: job_1474468001307_0007
INFO : The url to track the job: http://hadoop02:8088/proxy/application_1474468001307_0007/
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2016-09-22 01:01:45,237 Stage-1 map = 0%, reduce = 0%
2016-09-22 01:01:56,320 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 2.28 sec
2016-09-22 01:02:09,449 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.88 sec
2016-09-22 01:02:12,440 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.88 sec
MapReduce Total cumulative CPU time: 4 seconds 880 msec
Ended Job = job_1474468001307_0007
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 10.03 sec HDFS Read: 14523 HDFS Write: 32 SUCCESS
Total MapReduce CPU Time Spent: 10 seconds 30 msec
OK
+----------+-----------+--+
| up.name  | cp.name   |
+----------+-----------+--+
| lvy      | NULL      |
| xzm      | hangzhou  |
| xzy      | wenzhou   |
+----------+-----------+--+
View the execution plan
explain SELECT up.name, cp.name
FROM hive_user_p up
right JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';
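Besides the plain EXPLAIN output, one more check worth running is EXPLAIN DEPENDENCY, which lists the tables and partitions the query would read and so confirms whether the dt filter actually prunes partitions (standard Hive syntax, not used in the original notes):
explain dependency
SELECT up.name, cp.name
FROM hive_user_p up
right JOIN hive_city_p cp
ON up.city_id = cp.id
where up.dt = '2016-09-20';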
How to tell whether the data itself is the problem (data skew)
-------------------------
On the ResourceManager (RM) page, find the job in question and click into it; for a finished job you end up at a counters page such as http://hadoop03:19888/jobhistory/jobcounters/job_1474468001307_0007
Drill in one more level to see the per-task counters:
for mappers, look at HDFS bytes read;
for reducers, look at shuffle bytes or reduce input records.
Then check the individual tasks on the RM page, sort them by duration, and pick the longest one plus one of normal length.
Open the counter page of each and compare: if the amount of data processed is proportional to the runtime, the job is suffering from data skew.
On the map side the usual cause is uneven file sizes, so check whether the file format is splittable.
On the reduce side there is a switch that adds a two-stage aggregation (see the settings sketch at the end of this section):
hive.groupby.skewindata = true
Alternatively, if all tasks take roughly the same (long) time, simply run more reducers.
With an unsplittable format even a 100 GB file has to be handled by a single task;
Parquet, together with its indexes, can be split at block boundaries,
and with any splittable storage format an oversized file is divided by block size across different tasks.
Recommended: use a splittable format such as Parquet or ORC.
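A minimal sketch of the knobs mentioned above, using the settings the job log itself prints (the concrete numbers are illustrative only, not from the original notes):
-- Two-stage aggregation for skewed GROUP BY keys: the first job spreads the
-- keys over random reducers, a second job merges the partial aggregates
set hive.groupby.skewindata=true;

-- If tasks are merely overloaded rather than skewed, raise the reducer count:
-- lower the bytes handled per reducer, cap the maximum, or fix the number
set hive.exec.reducers.bytes.per.reducer=67108864;  -- 64 MB per reducer (illustrative)
set hive.exec.reducers.max=64;                      -- illustrative cap
set mapreduce.job.reduces=20;                       -- illustrative fixed count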