Impala architecture:
Impala itself comprises three modules: Impalad, Statestore, and Catalog. Beyond these, it also depends on the Hive Metastore and HDFS.
Impala's external and internal shell operations:
Impala operations split into the external shell (command-line flags) and the internal shell (commands run inside impala-shell):
External shell:
impala-shell -i hadoop106                               connect to the host running the impalad daemon
impala-shell -q 'select * from stu;' -o output.txt      run a query from the external shell and write the results to a local file
impala-shell -f sel.sql -o output.txt                   run a query script
impala-shell -r                                         refresh Impala's metadata after connecting
Log into Hive, create a table in this database, then check how many tables exist:
hive> create table student(id int,name string)
> row format delimited
> fields terminated by '\t';
hive> show tables;
stu
student
Now check in Impala how many tables the current database has:
[hadoop106:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| stu |
+------+
Then quit and log in again:
[root@hadoop105 ~]# impala-shell -i hadoop106
[hadoop106:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| stu |
+------+
The new table is still not visible:
[hadoop106:21000] > quit;
Next, log in with a metadata refresh:
[root@hadoop105 ~]# impala-shell -i hadoop106 -r
[hadoop106:21000] > show tables;
Query: show tables
+---------+
| name |
+---------+
| stu |
| student |
+---------+
Now the table is visible.
impala-shell -q 'select * from stu' -B --output_delimiter="\t" -o output.txt
This is Impala's unformatted output: there are no table borders, and fields are separated by the tab delimiter.
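The tab-delimited output produced by -B is convenient for post-processing. A minimal Python sketch; the sample string below is made up to mirror the stu table, since no cluster is available here:

```python
# Parse impala-shell -B output: one result row per line,
# columns separated by the '\t' delimiter.
sample = "1001\tzhangfei\n1002\tguanyu\n1003\tliubei\n"

rows = []
for line in sample.splitlines():
    # Each line carries the columns of one row.
    id_, name = line.split("\t")
    rows.append((int(id_), name))

print(rows)
```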
---------------------------------------------------------------------------------------
Internal shell:
shell hadoop fs -ls /      list an HDFS directory from inside impala-shell
shell ls                   list the local Linux directory from inside impala-shell
Refreshing a specific table's metadata:
First create the table student in Hive (DDL omitted), then load data into it, then view the data:
hive> load data local inpath '/opt/module/datas/stu.txt' into table student;
hive> select * from student;
1001 zhangfei
1002 guanyu
1003 liubei
Next, check in Impala whether the table has the data:
[hadoop106:21000] > select * from student;
Query: select * from student
No rows come back. Refresh the table's metadata and query again:
[hadoop106:21000] > refresh student;
[hadoop106:21000] > select * from student;
+------+----------+
| id | name |
+------+----------+
| 1001 | zhangfei |
| 1002 | guanyu |
| 1003 | liubei |
+------+----------+
Differences between Impala and Hive:
1. Impala does not support with dbproperties when creating a database:
[hadoop106:21000] > create database db_test with dbproperties('name'='iseayou');
Query: create database db_test with dbproperties('name'='iseayou')
ERROR: AnalysisException: Syntax error in line 1:
create database db_test with dbproperties('name'='iseayou')
^
Encountered: WITH
Expected: COMMENT, LOCATION
CAUSED BY: Exception: Syntax error
2. Impala does not support the alter database syntax.
3. The database currently in use cannot be dropped; switch to another database first, then drop it:
[hadoop106:21000] > create database db_test;
Query: create database db_test
[hadoop106:21000] > use db_test;
Query: use db_test
[hadoop106:21000] > create table stu(id int,name string);
Query: create table stu(id int,name string)
[hadoop106:21000] > show tables;
Query: show tables
+------+
| name |
+------+
| stu |
+------+
[hadoop106:21000] > drop database db_test;
Query: drop database db_test
ERROR: AnalysisException: Cannot drop current default database: db_test
[hadoop106:21000] > drop database db_test cascade;
Query: drop database db_test cascade
ERROR: AnalysisException: Cannot drop current default database: db_test
[hadoop106:21000] > use default;
Query: use default
[hadoop106:21000] > drop database db_test cascade;
Query: drop database db_test cascade
[hadoop106:21000] > show databases;
Query: show databases
+------------------+----------------------------------------------+
| name | comment |
+------------------+----------------------------------------------+
| _impala_builtins | System database for Impala builtin functions |
| db_impala | |
| default | Default Hive database |
+------------------+----------------------------------------------+
Fetched 3 row(s) in 0.01s
4. Impala supports partitioned tables, but load data cannot create a missing partition automatically; alter the table to add the partition first, then load the data. Demonstration:
First upload the data to HDFS, then add the partition to the table, then load the data:
create table stu_par(id int,name string)
partitioned by(month string)
row format delimited
fields terminated by '\t';
shell hadoop fs -put /opt/module/datas/stu.txt
alter table stu_par add partition(month='12');
load data inpath '/stu' into table stu_par partition(month='12');
[hadoop106:21000] > select * from stu_par where month = '12';
+------+----------+-------+
| id | name | month |
+------+----------+-------+
| 1001 | zhangfei | 12 |
| 1002 | guanyu | 12 |
| 1003 | liubei | 12 |
+------+----------+-------+
Impala also supports adding multiple partitions in one statement (separated by spaces), but it does not support dropping multiple partitions at once.
The syntax for listing partitions is:
show partitions stu_par;
5. Impala does not support local loads, i.e. load data local is unavailable;
insert ... select and load data inpath are supported instead.
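The insert ... select pattern works the same way in any SQL engine. A minimal runnable sketch using Python's built-in sqlite3 as a stand-in, since Impala itself needs a live cluster; table names and rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A source table with some rows, and an empty destination table.
cur.execute("create table stu(id int, name text)")
cur.executemany("insert into stu values(?, ?)",
                [(1001, "zhangfei"), (1002, "guanyu"), (1003, "liubei")])
cur.execute("create table stu_copy(id int, name text)")

# The insert ... select pattern Impala supports for loading tables.
cur.execute("insert into stu_copy select * from stu")

copied = cur.execute("select * from stu_copy order by id").fetchall()
print(copied)
```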
6. To export data, Impala typically uses impala-shell with -o; insert overwrite directory is not supported:
impala-shell -i hadoop106 -q 'select * from st;' -B --output_delimiter=',' -o output.txt;
[root@hadoop105 ~]# cat output.txt
1001,zhangfei
1004,xiaoqiao
1002,guanyu
1005,daqiao
1003,liubei
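The comma-delimited export can be read back with Python's csv module. A small sketch; the string below copies the rows shown above instead of reading output.txt, so it runs standalone:

```python
import csv
import io

# Contents of the exported file, as printed by cat above.
exported = "1001,zhangfei\n1004,xiaoqiao\n1002,guanyu\n1005,daqiao\n1003,liubei\n"

# csv.reader splits each line on the ',' delimiter chosen via --output_delimiter.
reader = csv.reader(io.StringIO(exported))
records = sorted((int(id_), name) for id_, name in reader)
print(records)
```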
Impala queries:
1. Impala does not support CLUSTER BY, DISTRIBUTE BY, or SORT BY.
2. Impala does not support bucketed tables.
The underlying reason for both is that Impala does not run on MapReduce.
3. Impala does not support the COLLECT_SET(col) and explode(col) functions.
The underlying reason is that Impala does not fully support array types.
4. Impala supports window (analytic) functions. The ranking subquery t1:
select *,row_number() over(partition by subject order by score desc) row_number
from score;
select * from
(
select *,row_number() over(partition by subject order by score desc) row_number
from score)t1
where t1.row_number <= 3;
+----------+-------------+-----------+----------------+--+
| t1.name | t1.subject | t1.score | t1.row_number |
+----------+-------------+-----------+----------------+--+
| 悟空 | 数学 | 95 | 1 |
| 张飞 | 数学 | 86 | 2 |
| 小乔 | 数学 | 85 | 3 |
| 张飞 | 英语 | 84 | 1 |
| 八戒 | 英语 | 84 | 2 |
| 小乔 | 英语 | 78 | 3 |
| 八戒 | 语文 | 94 | 1 |
| 悟空 | 语文 | 87 | 2 |
| 小乔 | 语文 | 65 | 3 |
+----------+-------------+-----------+----------------+--+
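The top-3-per-group pattern above is standard SQL. A runnable sketch with Python's sqlite3, which also supports row_number() over(...) (SQLite 3.25+); the names are transliterated, only two subjects are kept for brevity, and a name tiebreaker is added to the ordering so tied scores rank deterministically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table score(name text, subject text, score int)")
cur.executemany("insert into score values(?, ?, ?)", [
    ("wukong", "math", 95), ("zhangfei", "math", 86), ("xiaoqiao", "math", 85),
    ("bajie", "english", 84), ("zhangfei", "english", 84), ("xiaoqiao", "english", 78),
])

# Rank rows within each subject, then keep the top 3 of each group.
top3 = cur.execute("""
    select * from (
        select *, row_number() over(partition by subject
                                    order by score desc, name) rn
        from score
    ) t1
    where t1.rn <= 3
    order by subject, rn
""").fetchall()
print(top3)
```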
Impala compression and storage:
1. Parquet tables support create, insert, and query:
[hadoop106:21000] > create table stu_parquet(id int,name string)
row format delimited
fields terminated by '\t'
stored as parquet;
[hadoop106:21000] > show tables;
Query: show tables
+-------------+
| name |
+-------------+
| stu |
| stu_par |
| stu_parquet |
| student |
+-------------+
Fetched 4 row(s) in 0.01s
[hadoop106:21000] > insert into table stu_parquet select * from stu;
[hadoop106:21000] > select * from stu_parquet;
+------+----------+
| id | name |
+------+----------+
| 1005 | daqiao |
| 1004 | xiaoqiao |
| 1001 | zhangfei |
| 1002 | guanyu |
| 1003 | liubei |
+------+----------+
2. SequenceFile tables can be created, but inserts are rejected by default; a query option must be set before insert works:
[hadoop106:21000] > create table stu_sq(id int, name string)
> row format delimited
> fields terminated by '\t'
> stored as sequenceFile;
Query: create table stu_sq(id int, name string)
row format delimited
fields terminated by '\t'
stored as sequenceFile
Fetched 0 row(s) in 0.14s
[hadoop106:21000] > show tables;
Query: show tables
+-------------+
| name |
+-------------+
| stu |
| stu_par |
| stu_parquet |
| stu_sq |
| student |
+-------------+
[hadoop106:21000] > insert into table stu_sq values(1001,'zhangsan');
WARNINGS: Writing to table format SEQUENCE_FILE is not supported. Use query option ALLOW_UNSUPPORTED_FORMATS to override.
[hadoop106:21000] > set ALLOW_UNSUPPORTED_FORMATS=true;
ALLOW_UNSUPPORTED_FORMATS set to true
[hadoop106:21000] > insert into table stu_sq values(1001,'zhangsan');
[hadoop106:21000] > select * from stu_sq;
+------+----------+
| id | name |
+------+----------+
| 1001 | zhangsan |
+------+----------+