impala操作hive数据实例

最新推荐文章于 2024-09-16 02:56:09 发布

汀桦坞

最新推荐文章于 2024-09-16 02:56:09 发布

阅读量7.4k

点赞数 1

分类专栏：大数据文章标签： Hive Impala

本文链接：https://blog.csdn.net/wiborgite/article/details/78813342

版权

大数据专栏收录该内容

76 篇文章 8 订阅

订阅专栏

背景说明：

基于CHD quick VM环境，在一个VM中同时包含了HDFS、YARN、HBase、Hive、Impala等组件。

本文将一个文本数据从HDFS加载到Hive，同步元数据后，在Impala中进行数据操作。

-----------------------------------------------------------------------------------------Linux Shell的操作-----------------------------------------------------------

1、将PC本地的数据文件上传到VM中/home/data目录下

[root@quickstart data]# pwd
/home/data
[root@quickstart data]# ls
p10pco2a.dat  stock_data2.csv
[root@quickstart data]# head p10pco2a.dat 
WOCE_P10,1993,279.479,-16.442,172.219,24.9544,34.8887,1.0035,363.551,2
WOCE_P10,1993,279.480,-16.440,172.214,24.9554,34.8873,1.0035,363.736,2
WOCE_P10,1993,279.480,-16.439,172.213,24.9564,34.8868,1.0033,363.585,2
WOCE_P10,1993,279.481,-16.438,172.209,24.9583,34.8859,1.0035,363.459,2
WOCE_P10,1993,279.481,-16.437,172.207,24.9594,34.8859,1.0033,363.543,2
WOCE_P10,1993,279.481,-16.436,172.205,24.9604,34.8858,1.0035,363.432,2
WOCE_P10,1993,279.489,-16.417,172.164,24.9743,34.8867,1.0036,362.967,2
WOCE_P10,1993,279.490,-16.414,172.158,24.9742,34.8859,1.0035,362.960,2
WOCE_P10,1993,279.491,-16.412,172.153,24.9747,34.8864,1.0033,362.998,2
WOCE_P10,1993,279.492,-16.411,172.148,24.9734,34.8868,1.0031,363.022,2

2、将/home/data/p10pco2a.dat文件上传到HDFS

[root@quickstart data]# hdfs dfs -put p10pco2a.dat /tmp/
[root@quickstart data]# hdfs dfs -ls /tmp
-rw-r--r--   1 root   supergroup     281014 2017-12-14 18:47 /tmp/p10pco2a.dat

-----------------------------------------------------------------------Hive的操作----------------------------------------------------------------------------

1、启动Hive CLI

# hive

2、Hive中创建数据库

CREATE DATABASE  weather;

3、Hive中创建表

create table weather.weather_everydate_detail
(
section string,
year bigint,
date double,
latim double,
longit double,
sur_tmp double,
sur_sal double,
atm_per double,
xco2a double,
qf bigint
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

4、将HDFS中的数据加载到已创建的Hive表中

LOAD DATA INPATH '/tmp/p10pco2a.dat' INTO TABLE weather.weather_everydate_detail；

hive> LOAD DATA INPATH '/tmp/p10pco2a.dat' INTO TABLE weather.weather_everydate_detail;
Loading data to table weather.weather_everydate_detail
Table weather.weather_everydate_detail stats: [numFiles=1, totalSize=281014]
OK
Time taken: 1.983 seconds

5、查看Hive表确保数据已加载

use weather;
select * from weather.weather_everydate_detail limit 10;
select count(*) from weather.weather_everydate_detail;

hive> select * from weather.weather_everydate_detail limit 10;
OK
WOCE_P10	1993	279.479	-16.442	172.219	24.9544	34.8887	1.0035	363.551	2
WOCE_P10	1993	279.48	-16.44	172.214	24.9554	34.8873	1.0035	363.736	2
WOCE_P10	1993	279.48	-16.439	172.213	24.9564	34.8868	1.0033	363.585	2
WOCE_P10	1993	279.481	-16.438	172.209	24.9583	34.8859	1.0035	363.459	2
WOCE_P10	1993	279.481	-16.437	172.207	24.9594	34.8859	1.0033	363.543	2
WOCE_P10	1993	279.481	-16.436	172.205	24.9604	34.8858	1.0035	363.432	2
WOCE_P10	1993	279.489	-16.417	172.164	24.9743	34.8867	1.0036	362.967	2
WOCE_P10	1993	279.49	-16.414	172.158	24.9742	34.8859	1.0035	362.96	2
WOCE_P10	1993	279.491	-16.412	172.153	24.9747	34.8864	1.0033	362.998	2
WOCE_P10	1993	279.492	-16.411	172.148	24.9734	34.8868	1.0031	363.022	2
Time taken: 0.815 seconds, Fetched: 10 row(s)
hive> select count(*) from weather.weather_everydate_detail;
Query ID = root_20171214185454_c783708d-ad4b-46cc-9341-885c16a286fe
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1512525269046_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1512525269046_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1512525269046_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-12-14 18:55:27,386 Stage-1 map = 0%,  reduce = 0%
2017-12-14 18:56:11,337 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 39.36 sec
2017-12-14 18:56:18,711 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 41.88 sec
MapReduce Total cumulative CPU time: 41 seconds 880 msec
Ended Job = job_1512525269046_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 41.88 sec   HDFS Read: 288541 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 41 seconds 880 msec
OK
4018
Time taken: 101.82 seconds, Fetched: 1 row(s)

6、执行一个普通查询：

hive> select * from weather_everydate_detail where sur_sal=34.8105;
OK
WOCE_P10	1993	312.148	34.602	141.951	24.0804	34.8105	1.0081	361.29	2
WOCE_P10	1993	312.155	34.602	141.954	24.0638	34.8105	1.0079	360.386	2
Time taken: 0.138 seconds, Fetched: 2 row(s)

hive> select * from weather_everydate_detail where sur_sal=34.8105;
OK
WOCE_P10	1993	312.148	34.602	141.951	24.0804	34.8105	1.0081	361.29	2
WOCE_P10	1993	312.155	34.602	141.954	24.0638	34.8105	1.0079	360.386	2
Time taken: 1.449 seconds, Fetched: 2 row(s)

-----------------------------------------------------------------------------------------------------Impala的操作-----------------------------------------------------------
1、启动Impala CLI

# impala-shell

2、在Impala中同步元数据

[quickstart.cloudera:21000] > INVALIDATE METADATA;
Query: invalidate METADATA
Query submitted at: 2017-12-14 19:01:12 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=43460ace5d3a9971:9a50f46600000000

Fetched 0 row(s) in 3.25s

3、在Impala中查看Hive中表的结构

[quickstart.cloudera:21000] > use weather;
Query: use weather
[quickstart.cloudera:21000] > desc weather.weather_everydate_detail;
Query: describe weather.weather_everydate_detail
+---------+--------+---------+
| name    | type   | comment |
+---------+--------+---------+
| section | string |         |
| year    | bigint |         |
| date    | double |         |
| latim   | double |         |
| longit  | double |         |
| sur_tmp | double |         |
| sur_sal | double |         |
| atm_per | double |         |
| xco2a   | double |         |
| qf      | bigint |         |
+---------+--------+---------+
Fetched 10 row(s) in 3.70s

4、查询记录数量

[quickstart.cloudera:21000] > select count(*) from weather.weather_everydate_detail;
Query: select count(*) from weather.weather_everydate_detail
Query submitted at: 2017-12-14 19:03:11 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=5542894eeb80e509:1f9ce37f00000000
+----------+
| count(*) |
+----------+
| 4018     |
+----------+
Fetched 1 row(s) in 2.51s

说明：对比Impala与Hive中的count查询，2.15 VS 101.82，Impala的优势还是相当明显的

5、执行一个普通查询

[quickstart.cloudera:21000] > select * from weather_everydate_detail where sur_sal=34.8105;
Query: select * from weather_everydate_detail where sur_sal=34.8105
Query submitted at: 2017-12-14 19:20:27 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=c14660ed0bda471f:d92fcf0e00000000
+----------+------+---------+--------+---------+---------+---------+---------+---------+----+
| section  | year | date    | latim  | longit  | sur_tmp | sur_sal | atm_per | xco2a   | qf |
+----------+------+---------+--------+---------+---------+---------+---------+---------+----+
| WOCE_P10 | 1993 | 312.148 | 34.602 | 141.951 | 24.0804 | 34.8105 | 1.0081  | 361.29  | 2  |
| WOCE_P10 | 1993 | 312.155 | 34.602 | 141.954 | 24.0638 | 34.8105 | 1.0079  | 360.386 | 2  |
+----------+------+---------+--------+---------+---------+---------+---------+---------+----+
Fetched 2 row(s) in 0.25s

[quickstart.cloudera:21000] > select * from weather_everydate_detail where sur_tmp=24.0804;
Query: select * from weather_everydate_detail where sur_tmp=24.0804
Query submitted at: 2017-12-14 23:15:32 (Coordinator: http://quickstart.cloudera:25000)
Query progress can be monitored at: http://quickstart.cloudera:25000/query_plan?query_id=774e2b3b81f4eed7:8952b5b400000000
+----------+------+---------+--------+---------+---------+---------+---------+--------+----+
| section  | year | date    | latim  | longit  | sur_tmp | sur_sal | atm_per | xco2a  | qf |
+----------+------+---------+--------+---------+---------+---------+---------+--------+----+
| WOCE_P10 | 1993 | 312.148 | 34.602 | 141.951 | 24.0804 | 34.8105 | 1.0081  | 361.29 | 2  |
+----------+------+---------+--------+---------+---------+---------+---------+--------+----+
Fetched 1 row(s) in 3.86s

6.结论

对于Hive中需要编译为mapreduce执行的SQL，在Impala中执行是有明显的速度优势的，但是Hive也不是所有的查询都要编译为mapreduce，此类型的查询，impala相比于Hive就没啥优势了。