Impala architecture:
Impala itself comprises three modules: Impalad, Statestore, and Catalog. Beyond these, it also depends on the Hive Metastore and HDFS.
Impala's external and internal shell operations:
Impala operations split into the external shell (command-line flags) and the internal shell (commands run inside impala-shell):
External shell:
impala-shell -i hadoop106                               connect to the host running the impalad daemon
impala-shell -q 'select * from stu;' -o output.txt      run a query from the external shell and write the results to a local file
impala-shell -f sel.sql -o output.txt                   run a query script
impala-shell -r                                         refresh Impala's metadata after connecting
Log into Hive, create a table in this database, then check how many tables exist:
hive> create table student(id int,name string)
> row format delimited
> fields terminated by '\t';
hive> show tables;
stu
student
Now check in Impala how many tables the current database has:
[hadoop106:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| stu |
+------+
Then quit and log in again:
[root@hadoop105 ~]# impala-shell -i hadoop106
[hadoop106:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| stu |
+------+
The new table is still not visible:
[hadoop106:21000] > quit;
Next, log in with a metadata refresh:
[root@hadoop105 ~]# impala-shell -i hadoop106 -r
[hadoop106:21000] > show tables;
Query: show tables
+---------+
| name |
+---------+
| stu |
| student |
+---------+
Now the table is visible.
impala-shell -q 'select * from stu' -B --output_delimiter="\t" -o output.txt
This is Impala's unformatted output: there are no table borders, and fields are separated by the tab delimiter.
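The tab-delimited output produced by -B is convenient for post-processing. A minimal Python sketch; the sample string below is made up to mirror the stu table, since no cluster is available here:

```python
# Parse impala-shell -B output: one result row per line,
# columns separated by the '\t' delimiter.
sample = "1001\tzhangfei\n1002\tguanyu\n1003\tliubei\n"

rows = []
for line in sample.splitlines():
    # Each line carries the columns of one row.
    id_, name = line.split("\t")
    rows.append((int(id_), name))

print(rows)
```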
---------------------------------------------------------------------------------------
Internal shell:
shell hadoop fs -ls /      list an HDFS directory from inside impala-shell
shell ls                   list the local Linux directory from inside impala-shell
Refreshing a specific table's metadata:
First create the table student in Hive (DDL omitted), then load data into it, then view the data:
hive> load data local inpath '/opt/module/datas/stu.txt' into table student;
hive> select * from student;
1001 zhangfei
1002 guanyu
1003 liubei
Next, check in Impala whether the table has the data:
[hadoop106:21000] > select * from student;
Query: select * from student
No rows come back. Refresh the table's metadata and query again:
[hadoop106:21000] > refresh student;
[hadoop106:21000] > select * from student;
+------+----------+
| id | name |
+------+----------+
| 1001 | zhangfei |
| 1002 | guanyu |
| 1003 | liubei |
+------+----------+
Differences between Impala and Hive:
1. Impala does not support with dbproperties when creating a database:
[hadoop106:21000] > create database db_test with dbproperties('name'='iseayou');
Query: create database db_test with dbproperties('name'='iseayou')
ERROR: AnalysisException: Syntax error in line 1:
create database db_test with dbproperties('name'='iseayou')
^
Encountered: WITH
Expected: COMMENT, LOCATION
CAUSED BY: Exception: Syntax error
2. Impala does not support the alter database syntax.
3. The database currently in use cannot be dropped; switch to another database first, then drop it:
[hadoop106:21000] > create database db_test;
Query: create database db_test
[hadoop106:21000] > use db_test;
Query: use db_test
[hadoop106:21000] > create table stu(id int,name string);
Query: create table stu(id int,name string)
[hadoop106:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| stu |
+------+
[hadoop106:21000] > drop database db_test;
Query: drop database db_test
ERROR: AnalysisException: Cannot drop current default database: db_test
[hadoop106:21000] > drop database db_test cascade;
Query: drop database db_test cascade
ERROR: AnalysisException: Cannot drop current default database: db_test
[hadoop106:21000] > use default;
Query: use default
[hadoop106:21000] > drop database db_test cascade;
Query: drop database db_test cascade
[hadoop106:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name | comment |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| db_impala | |
| default | Default Hive database |
+------------------+----------------------------------------------+
Fetched 3 row(s) in 0.01s
4. Impala supports partitioned tables, but load data cannot create a missing partition automatically; alter the table to add the partition first, then load the data. Demonstration:
First upload the data to HDFS, then add the partition to the table, then load the data:
create table stu_par(id int,name string)
partitioned by(month string)
row format delimited
fields terminated by '\t';
shell hadoop fs -put /opt/module/datas/stu.txt
alter table stu_par add partition(month='12');
load data inpath '/stu' into table stu_par partition(month='12');
[hadoop106:21000] > select * from stu_par where month = '12';
+------+----------+-------+
| id | name | month |
+------+----------+-------+
| 1001 | zhangfei | 12 |
| 1002 | guanyu | 12 |
| 1003 | liubei | 12 |
+------+----------+-------+
Impala also supports adding multiple partitions in one statement (separated by spaces), but it does not support dropping multiple partitions at once.
The syntax for listing partitions is:
show partitions stu_par;
5. Impala does not support local loads, i.e. load data local is unavailable;
insert ... select and load data inpath are supported instead.
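The insert ... select pattern works the same way in any SQL engine. A minimal runnable sketch using Python's built-in sqlite3 as a stand-in, since Impala itself needs a live cluster; table names and rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A source table with some rows, and an empty destination table.
cur.execute("create table stu(id int, name text)")
cur.executemany("insert into stu values(?, ?)",
                [(1001, "zhangfei"), (1002, "guanyu"), (1003, "liubei")])
cur.execute("create table stu_copy(id int, name text)")

# The insert ... select pattern Impala supports for loading tables.
cur.execute("insert into stu_copy select * from stu")

copied = cur.execute("select * from stu_copy order by id").fetchall()
print(copied)
```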
6. To export data, Impala typically uses impala-shell with -o; insert overwrite directory is not supported:
impala-shell -i hadoop106 -q 'select * from st;' -B --output_delimiter=',' -o output.txt;
[root@hadoop105 ~]# cat output.txt
1001,zhangfei
1004,xiaoqiao
1002,guanyu
1005,daqiao
1003,liubei
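The comma-delimited export can be read back with Python's csv module. A small sketch; the string below copies the rows shown above instead of reading output.txt, so it runs standalone:

```python
import csv
import io

# Contents of the exported file, as printed by cat above.
exported = "1001,zhangfei\n1004,xiaoqiao\n1002,guanyu\n1005,daqiao\n1003,liubei\n"

# csv.reader splits each line on the ',' delimiter chosen via --output_delimiter.
reader = csv.reader(io.StringIO(exported))
records = sorted((int(id_), name) for id_, name in reader)
print(records)
```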
Impala queries:
1. Impala does not support CLUSTER BY, DISTRIBUTE BY, or SORT BY.
2. Impala does not support bucketed tables.
The underlying reason for both is that Impala does not run on MapReduce.
3. Impala does not support the COLLECT_SET(col) and explode(col) functions.
The underlying reason is that Impala does not fully support array types.
4. Impala supports window (analytic) functions. The ranking subquery t1:
select *,row_number() over(partition by subject order by score desc) row_number
from score;
select * from
(
select *,row_number() over(partition by subject order by score desc) row_number
from score)t1
where t1.row_number <= 3;
+----------+-------------+-----------+----------------+--+
| t1.name | t1.subject | t1.score | t1.row_number |
+----------+-------------+-----------+----------------+--+
| 悟空 | 数学 | 95 | 1 |
| 张飞 | 数学 | 86 | 2 |
| 小乔 | 数学 | 85 | 3 |
| 张飞 | 英语 | 84 | 1 |
| 八戒 | 英语 | 84 | 2 |
| 小乔 | 英语 | 78 | 3 |
| 八戒 | 语文 | 94 | 1 |
| 悟空 | 语文 | 87 | 2 |
| 小乔 | 语文 | 65 | 3 |
+----------+-------------+-----------+----------------+--+
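The top-3-per-group pattern above is standard SQL. A runnable sketch with Python's sqlite3, which also supports row_number() over(...) (SQLite 3.25+); the names are transliterated, only two subjects are kept for brevity, and a name tiebreaker is added to the ordering so tied scores rank deterministically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table score(name text, subject text, score int)")
cur.executemany("insert into score values(?, ?, ?)", [
    ("wukong", "math", 95), ("zhangfei", "math", 86), ("xiaoqiao", "math", 85),
    ("bajie", "english", 84), ("zhangfei", "english", 84), ("xiaoqiao", "english", 78),
])

# Rank rows within each subject, then keep the top 3 of each group.
top3 = cur.execute("""
    select * from (
        select *, row_number() over(partition by subject
                                    order by score desc, name) rn
        from score
    ) t1
    where t1.rn <= 3
    order by subject, rn
""").fetchall()
print(top3)
```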
Impala compression and storage:
1. Parquet tables support create, insert, and query:
[hadoop106:21000] > create table stu_parquet(id int,name string)
row format delimited
fields terminated by '\t'
stored as parquet;
[hadoop106:21000] > show tables;
Query: show tables
+-------------+
| name |
+-------------+
| stu |
| stu_par |
| stu_parquet |
| student |
+-------------+
Fetched 4 row(s) in 0.01s
[hadoop106:21000] > insert into table stu_parquet select * from stu;
[hadoop106:21000] > select * from stu_parquet;
+------+----------+
| id | name |
+------+----------+
| 1005 | daqiao |
| 1004 | xiaoqiao |
| 1001 | zhangfei |
| 1002 | guanyu |
| 1003 | liubei |
+------+----------+
2. SequenceFile tables can be created, but inserts are rejected by default; a query option must be set before insert works:
[hadoop106:21000] > create table stu_sq(id int, name string)
> row format delimited
> fields terminated by '\t'
> stored as sequenceFile;
Query: create table stu_sq(id int, name string)
row format delimited
fields terminated by '\t'
stored as sequenceFile
Fetched 0 row(s) in 0.14s
[hadoop106:21000] > show tables;
Query: show tables
+-------------+
| name |
+-------------+
| stu |
| stu_par |
| stu_parquet |
| stu_sq |
| student |
+-------------+
[hadoop106:21000] > insert into table stu_sq values(1001,'zhangsan');
WARNINGS: Writing to table format SEQUENCE_FILE is not supported. Use query option ALLOW_UNSUPPORTED_FORMATS to override.
[hadoop106:21000] > set ALLOW_UNSUPPORTED_FORMATS=true;
ALLOW_UNSUPPORTED_FORMATS set to true
[hadoop106:21000] > insert into table stu_sq values(1001,'zhangsan');
[hadoop106:21000] > select * from stu_sq;
+------+----------+
| id | name |
+------+----------+
| 1001 | zhangsan |
+------+----------+