2. Cloudera Impala SQL syntax and examples, four ways to import data into Impala, operating Impala through the Java API, and a comprehensive example comparing Hive and Impala query speed

Cloudera Impala series

1. Introduction to Cloudera Impala, differences from Hive, two deployment modes, and internal/external commands
2. Cloudera Impala SQL syntax and examples, four ways to import data into Impala, operating Impala through the Java API, and a comprehensive example comparing Hive and Impala query speed



This article briefly introduces Impala's SQL syntax with usage examples, covers several ways to import data into Impala and how to operate Impala through the Java API, and closes with a comprehensive example comparing the query speed of Impala and Hive.
It assumes working Impala, Sqoop, Hive, and Hadoop environments.
The article has four parts: basic Impala SQL syntax with examples, four ways to insert data into Impala, accessing Impala through the Java API, and a comprehensive example.

I. Impala SQL syntax

  • If data is inserted, or a database or table is created, from the Hive shell, it cannot be queried directly in Impala; you must run invalidate metadata to notify Impala of the metadata change
  • Data inserted through impala-shell can be queried in Impala directly, without refreshing. This is handled by the catalog service, a module added in Impala 1.2 whose main job is to synchronize metadata between Impala daemons
  • Update operations notify the Catalog, which broadcasts the change to the other Impalad processes. By default the Catalog loads metadata asynchronously, so a query may have to wait until metadata loading completes (on first load)
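The metadata statements mentioned above can be sketched as follows (the database and table names here are illustrative):

-- Reload metadata after a table was created or changed from Hive:
invalidate metadata;            -- all tables (expensive on a large catalog)
invalidate metadata test.t1;    -- or only the affected table
-- Pick up new data files in a table Impala already knows about:
refresh test.t1;

REFRESH is much cheaper than a full INVALIDATE METADATA, so prefer it when only the data files of a known table have changed.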

The examples in this article were run on an Impala installation in a CDH environment, so the host names differ from those in the previous article; this does not affect the examples or their verification.

1. Database operation examples

1) Creating a database

The CREATE DATABASE statement creates a new database in Impala.

CREATE DATABASE IF NOT EXISTS database_name;

Here IF NOT EXISTS is an optional clause. With it, a database with the given name is created only if no database with the same name already exists.

Impala executes operations as the impala user by default, which can fail with an insufficient-permissions error. Two fixes:

  • 1. Grant permissions on the target HDFS directory:
hadoop fs -chmod -R 777 hdfs://server-1:9000/user/hive
  • 2. Disable the permission check in the Hadoop hdfs-site.xml configuration file (set dfs.permissions.enabled to false).
    Either approach works.

By default the new database is created as a folder under the Hive warehouse path:

/user/hive/warehouse/ittest.db

You can also specify an HDFS path at creation time; mind the permissions on that path.

hadoop fs -mkdir -p /input/impala
hadoop fs -chmod -R 777 /input/impala 

create external table t3 (id int, name string, age int)
row format delimited fields terminated by '\t'
location '/input/impala/external';
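A database itself can likewise be created at an explicit HDFS location; a minimal sketch using the path prepared above (my_db is an illustrative name):

create database if not exists my_db location '/input/impala/my_db';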

[server7:21000] test> show databases;
Query: show databases
+-----------------------------------------------------------------------------------------------+-------------------------------------------------+
| name                                                                                          | comment                                         |
+-----------------------------------------------------------------------------------------------+-------------------------------------------------+
| _impala_builtins                                                                              | System database for Impala builtin functions    |
| cloudera_manager_metastore_canary_test_db_hive_hivemetastore_9d6beb1369433f8a2d12dd14fd802ce8 | Cloudera Manager Metastore Canary Test Database |
| default                                                                                       | Default Hive database                           |
| test                                                                                          |                                                 |
+-----------------------------------------------------------------------------------------------+-------------------------------------------------+
Fetched 4 row(s) in 0.04s

2) Dropping a database

Impala's DROP DATABASE statement removes a database from Impala. It is recommended to drop all tables in a database before dropping the database itself.
With CASCADE, Impala drops any tables in the specified database before dropping the database.

DROP DATABASE sample CASCADE;

2. Table-specific statements

1) The CREATE TABLE statement

The CREATE TABLE statement creates a new table in the chosen database in Impala.
You specify the table name and define its columns and each column's data type.
Impala supports data types similar to Hive's; besides the SQL type names it also accepts Java-style type names.

create table IF NOT EXISTS database_name.table_name (
    column1 data_type,
    column2 data_type,
    column3 data_type,
    ...
    columnN data_type
);

CREATE TABLE IF NOT EXISTS my_db.student(name STRING, age INT, contact INT );

By default the table's data is stored under the same path as Hive's. You can also specify a path with LOCATION when creating the table; mind the HDFS permissions on that path.

2) The INSERT statement

Impala's INSERT statement has two clauses: INTO and OVERWRITE.
INTO appends new records; OVERWRITE replaces the records already in the table.

insert into table_name (column1, column2, column3, ...columnN)
values (value1, value2, value3, ...valueN);

insert into table_name values (value1, value2, ...valueN);

Here column1, column2, ... columnN are the names of the columns in the table being inserted into. You can also supply values without naming the columns, but then the values must appear in the same order as the table's columns.
Example:

create table employee (Id INT, name STRING, age INT,address STRING, salary BIGINT);
insert into employee VALUES (1, 'Ramesh', 32, 'Ahmedabad', 20000 );
insert into employee values (2, 'Khilan', 25, 'Delhi', 15000 );
Insert into employee values (3, 'kaushik', 23, 'Kota', 30000 );
Insert into employee values (4, 'Chaitali', 25, 'Mumbai', 35000 );
Insert into employee values (5, 'Hardik', 27, 'Bhopal', 40000 );
Insert into employee values (6, 'Komal', 22, 'MP', 32000 );


[server7:21000] test> select * from employee;
Query: select * from employee
Query submitted at: 2022-11-08 13:23:18 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=f84adae242c9f6af:a43bb90900000000
+----+----------+-----+-----------+--------+
| id | name     | age | address   | salary |
+----+----------+-----+-----------+--------+
| 4  | Chaitali | 25  | Mumbai    | 35000  |
| 6  | Komal    | 22  | MP        | 32000  |
| 1  | Ramesh   | 32  | Ahmedabad | 20000  |
| 5  | Hardik   | 27  | Bhopal    | 40000  |
| 3  | kaushik  | 23  | Kota      | 30000  |
| 2  | Khilan   | 25  | Delhi     | 15000  |
+----+----------+-----+-----------+--------+
Fetched 6 row(s) in 0.85s

The OVERWRITE clause replaces all records in the table; the overwritten records are permanently deleted.

Insert overwrite employee values (1, 'Ram', 26, 'Vishakhapatnam', 37000 );

[server7:21000] test> Insert overwrite employee values (1, 'Ram', 26, 'Vishakhapatnam', 37000 );
Query: Insert overwrite employee values (1, 'Ram', 26, 'Vishakhapatnam', 37000 )
Query submitted at: 2022-11-08 13:24:04 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=f44bb549f202b8ed:7c0a193800000000
Modified 1 row(s) in 0.24s
[server7:21000] test> select * from employee;
Query: select * from employee
Query submitted at: 2022-11-08 13:25:08 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=7c423d231e0a32a9:f9650b7600000000
+----+------+-----+----------------+--------+
| id | name | age | address        | salary |
+----+------+-----+----------------+--------+
| 1  | Ram  | 26  | Vishakhapatnam | 37000  |
+----+------+-----+----------------+--------+

3) The SELECT statement

The Impala SELECT statement retrieves data from one or more tables in a database and returns it in tabular form.


create table emp_add(Id INT,address STRING);
insert into emp_add VALUES (1, 'shanghai');
insert into emp_add values (2, 'beijing');
Insert into emp_add values (3, 'new york');

-------------- Impala query: 0.96s
[server7:21000] test> select e.*,a.address from employee e join emp_add a on e.Id = a.Id;
Query: select e.*,a.address from employee e join emp_add a on e.Id = a.Id
Query submitted at: 2022-11-08 13:35:31 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=9448e84ede2fd9d6:e4e1fc5b00000000
+----+---------+-----+-----------+--------+----------+
| id | name    | age | address   | salary | address  |
+----+---------+-----+-----------+--------+----------+
| 3  | kaushik | 23  | Kota      | 30000  | new york |
| 2  | Khilan  | 25  | Delhi     | 15000  | beijing  |
| 1  | Ramesh  | 32  | Ahmedabad | 20000  | shanghai |
+----+---------+-----+-----------+--------+----------+
Fetched 3 row(s) in 0.96s


--------------------- Hive query: 32.5s
0: jdbc:hive2://server8:10000> select e.*,a.address from employee e join emp_add a on e.Id = a.Id; 
INFO  : Compiling command(queryId=hive_20221108133409_d068b90e-76c7-48ea-8776-7739d5a60ae7): select e.*,a.address from employee e join emp_add a on e.Id = a.Id
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:e.id, type:int, comment:null), FieldSchema(name:e.name, type:string, comment:null), FieldSchema(name:e.age, type:int, comment:null), FieldSchema(name:e.address, type:string, comment:null), FieldSchema(name:e.salary, type:bigint, comment:null), FieldSchema(name:a.address, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20221108133409_d068b90e-76c7-48ea-8776-7739d5a60ae7); Time taken: 0.228 seconds
INFO  : Executing command(queryId=hive_20221108133409_d068b90e-76c7-48ea-8776-7739d5a60ae7): select e.*,a.address from employee e join emp_add a on e.Id = a.Id
WARN  : 
INFO  : Query ID = hive_20221108133409_d068b90e-76c7-48ea-8776-7739d5a60ae7
INFO  : Total jobs = 1
INFO  : Starting task [Stage-4:MAPREDLOCAL] in serial mode
INFO  : Execution completed successfully
INFO  : MapredLocal task succeeded
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-3:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1667806921128_0006
INFO  : Executing with tokens: []
INFO  : The url to track the job: http://server8:8088/proxy/application_1667806921128_0006/
INFO  : Starting Job = job_1667806921128_0006, Tracking URL = http://server8:8088/proxy/application_1667806921128_0006/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job  -kill job_1667806921128_0006
INFO  : Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
INFO  : 2022-11-08 13:34:31,437 Stage-3 map = 0%,  reduce = 0%
INFO  : 2022-11-08 13:34:38,800 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.56 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 560 msec
INFO  : Ended Job = job_1667806921128_0006
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-3: Map: 1   Cumulative CPU: 3.56 sec   HDFS Read: 7354 HDFS Write: 225 HDFS EC Read: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 560 msec
INFO  : Completed executing command(queryId=hive_20221108133409_d068b90e-76c7-48ea-8776-7739d5a60ae7); Time taken: 31.918 seconds
INFO  : OK
+-------+----------+--------+------------+-----------+------------+
| e.id  |  e.name  | e.age  | e.address  | e.salary  | a.address  |
+-------+----------+--------+------------+-----------+------------+
| 1     | Ramesh   | 32     | Ahmedabad  | 20000     | shanghai   |
| 3     | kaushik  | 23     | Kota       | 30000     | new york   |
| 2     | Khilan   | 25     | Delhi      | 15000     | beijing    |
+-------+----------+--------+------------+-----------+------------+
3 rows selected (32.497 seconds)

-------------------- Hue Hive query: 29.4s
INFO  : Compiling command(queryId=hive_20221108133429_a446dfa0-5801-447a-bdfc-cc775c76d014): select e.*,a.address from employee e join emp_add a on e.Id = a.Id
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:e.id, type:int, comment:null), FieldSchema(name:e.name, type:string, comment:null), FieldSchema(name:e.age, type:int, comment:null), FieldSchema(name:e.address, type:string, comment:null), FieldSchema(name:e.salary, type:bigint, comment:null), FieldSchema(name:a.address, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20221108133429_a446dfa0-5801-447a-bdfc-cc775c76d014); Time taken: 0.097 seconds
INFO  : Executing command(queryId=hive_20221108133429_a446dfa0-5801-447a-bdfc-cc775c76d014): select e.*,a.address from employee e join emp_add a on e.Id = a.Id
WARN  : 
INFO  : Query ID = hive_20221108133429_a446dfa0-5801-447a-bdfc-cc775c76d014
INFO  : Total jobs = 1
INFO  : Starting task [Stage-4:MAPREDLOCAL] in serial mode
INFO  : Execution completed successfully
INFO  : MapredLocal task succeeded
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-3:MAPRED] in serial mode
INFO  : Number of reduce tasks is set to 0 since there's no reduce operator
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1667806921128_0007
INFO  : Executing with tokens: []
INFO  : The url to track the job: http://server8:8088/proxy/application_1667806921128_0007/
INFO  : Starting Job = job_1667806921128_0007, Tracking URL = http://server8:8088/proxy/application_1667806921128_0007/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job  -kill job_1667806921128_0007
INFO  : Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
INFO  : 2022-11-08 13:34:49,739 Stage-3 map = 0%,  reduce = 0%
INFO  : 2022-11-08 13:34:57,962 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 3.32 sec
INFO  : MapReduce Total cumulative CPU time: 3 seconds 320 msec
INFO  : Ended Job = job_1667806921128_0007
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-3: Map: 1   Cumulative CPU: 3.32 sec   HDFS Read: 7403 HDFS Write: 225 HDFS EC Read: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 3 seconds 320 msec
INFO  : Completed executing command(queryId=hive_20221108133429_a446dfa0-5801-447a-bdfc-cc775c76d014); Time taken: 29.409 seconds
INFO  : OK

4) The DESCRIBE statement

The DESCRIBE statement in Impala provides a description of a table. Its result contains information about the table, such as the column names and their data types.

Describe table_name;
Describe employee;
[server7:21000] test> Describe employee;
Query: describe employee
+---------+--------+---------+
| name    | type   | comment |
+---------+--------+---------+
| id      | int    |         |
| name    | string |         |
| age     | int    |         |
| address | string |         |
| salary  | bigint |         |
+---------+--------+---------+
Fetched 5 row(s) in 0.06s

-- In addition, Hive's statement for querying table metadata also works:
desc formatted table_name;
Describe formatted employee;

[server7:21000] test> Describe formatted employee;
Query: describe formatted employee
+------------------------------+------------------------------------------------------------+----------------------+
| name                         | type                                                       | comment              |
+------------------------------+------------------------------------------------------------+----------------------+
| # col_name                   | data_type                                                  | comment              |
|                              | NULL                                                       | NULL                 |
| id                           | int                                                        | NULL                 |
| name                         | string                                                     | NULL                 |
| age                          | int                                                        | NULL                 |
| address                      | string                                                     | NULL                 |
| salary                       | bigint                                                     | NULL                 |
|                              | NULL                                                       | NULL                 |
| # Detailed Table Information | NULL                                                       | NULL                 |
| Database:                    | test                                                       | NULL                 |
| OwnerType:                   | USER                                                       | NULL                 |
| Owner:                       | root                                                       | NULL                 |
| CreateTime:                  | Tue Nov 08 13:29:56 CST 2022                               | NULL                 |
| LastAccessTime:              | UNKNOWN                                                    | NULL                 |
| Retention:                   | 0                                                          | NULL                 |
| Location:                    | hdfs://hadoopha/user/hive/warehouse/test.db/employee       | NULL                 |
| Table Type:                  | MANAGED_TABLE                                              | NULL                 |
| Table Parameters:            | NULL                                                       | NULL                 |
|                              | transient_lastDdlTime                                      | 1667885396           |
|                              | NULL                                                       | NULL                 |
| # Storage Information        | NULL                                                       | NULL                 |
| SerDe Library:               | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL                 |
| InputFormat:                 | org.apache.hadoop.mapred.TextInputFormat                   | NULL                 |
| OutputFormat:                | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL                 |
| Compressed:                  | No                                                         | NULL                 |
| Num Buckets:                 | 0                                                          | NULL                 |
| Bucket Columns:              | []                                                         | NULL                 |
| Sort Columns:                | []                                                         | NULL                 |
+------------------------------+------------------------------------------------------------+----------------------+
Fetched 28 row(s) in 0.06s

5) ALTER TABLE

Impala's ALTER TABLE statement makes changes to an existing table. With it you can add, drop, or modify columns in an existing table, and you can rename tables.

-- Rename a table:
ALTER TABLE [old_db_name.]old_table_name RENAME TO [new_db_name.]new_table_name
-- Add columns to a table:
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
-- Drop a column from a table:
ALTER TABLE name DROP [COLUMN] column_name
-- Change a column's name and type:
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE employee RENAME TO  user;
[server7:21000] test> ALTER TABLE employee RENAME TO  user;
Query: ALTER TABLE employee RENAME TO  user
+--------------------------+
| summary                  |
+--------------------------+
| Renaming was successful. |
+--------------------------+
Fetched 1 row(s) in 0.15s

ALTER TABLE user ADD COLUMNS (address2 string )
[server7:21000] test> ALTER TABLE user ADD COLUMNS (address2 string );
Query: ALTER TABLE user ADD COLUMNS (address2 string )
+---------------------------------------------+
| summary                                     |
+---------------------------------------------+
| New column(s) have been added to the table. |
+---------------------------------------------+
Fetched 1 row(s) in 0.12s
[server7:21000] test> select * from user ;
Query: select * from user
Query submitted at: 2022-11-08 13:53:46 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=b14301e64d05ad0e:50ab69200000000
+----+----------+-----+-----------+--------+----------+
| id | name     | age | address   | salary | address2 |
+----+----------+-----+-----------+--------+----------+
| 5  | Hardik   | 27  | Bhopal    | 40000  | NULL     |
| 6  | Komal    | 22  | MP        | 32000  | NULL     |
| 1  | Ramesh   | 32  | Ahmedabad | 20000  | NULL     |
| 2  | Khilan   | 25  | Delhi     | 15000  | NULL     |
| 3  | kaushik  | 23  | Kota      | 30000  | NULL     |
| 4  | Chaitali | 25  | Mumbai    | 35000  | NULL     |
+----+----------+-----+-----------+--------+----------+
Fetched 6 row(s) in 0.41s

ALTER TABLE user DROP   address2;
[server7:21000] test> ALTER TABLE user DROP   address2;
Query: ALTER TABLE user DROP   address2
+--------------------------+
| summary                  |
+--------------------------+
| Column has been dropped. |
+--------------------------+
Fetched 1 row(s) in 0.12s
[server7:21000] test> select * from user ;
Query: select * from user
Query submitted at: 2022-11-08 13:58:24 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=304a7cca4fc50e6f:e48d13e500000000
+----+----------+-----+-----------+--------+
| id | name     | age | address   | salary |
+----+----------+-----+-----------+--------+
| 5  | Hardik   | 27  | Bhopal    | 40000  |
| 6  | Komal    | 22  | MP        | 32000  |
| 2  | Khilan   | 25  | Delhi     | 15000  |
| 1  | Ramesh   | 32  | Ahmedabad | 20000  |
| 4  | Chaitali | 25  | Mumbai    | 35000  |
| 3  | kaushik  | 23  | Kota      | 30000  |
+----+----------+-----+-----------+--------+
Fetched 6 row(s) in 0.21s

ALTER TABLE user CHANGE salary salary double;
[server7:21000] test> ALTER TABLE user CHANGE salary salary double;
Query: ALTER TABLE user CHANGE salary salary double
+--------------------------+
| summary                  |
+--------------------------+
| Column has been altered. |
+--------------------------+
Fetched 1 row(s) in 0.11s
[server7:21000] test> select * from user order by id ;
Query: select * from user order by id
Query submitted at: 2022-11-08 14:00:04 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=d9495c09b7e64452:66710b7d00000000
+----+----------+-----+-----------+--------+
| id | name     | age | address   | salary |
+----+----------+-----+-----------+--------+
| 1  | Ramesh   | 32  | Ahmedabad | 20000  |
| 2  | Khilan   | 25  | Delhi     | 15000  |
| 3  | kaushik  | 23  | Kota      | 30000  |
| 4  | Chaitali | 25  | Mumbai    | 35000  |
| 5  | Hardik   | 27  | Bhopal    | 40000  |
| 6  | Komal    | 22  | MP        | 32000  |
+----+----------+-----+-----------+--------+
Fetched 6 row(s) in 0.53s

6) DROP TABLE and TRUNCATE TABLE

Impala's DROP TABLE statement removes an existing table from Impala. For internal (managed) tables it also deletes the underlying HDFS files.
Note: use this command with care; once a table is dropped, all of the information in it is lost permanently.

DROP TABLE database_name.table_name;

Impala's TRUNCATE TABLE statement removes all records from an existing table while keeping the table structure.
A whole table can also be removed with DROP TABLE, but that deletes the table definition as well, so if you want to keep storing data you would need to recreate the table.

TRUNCATE TABLE table_name;

7) Views

A view is simply an Impala query language statement stored in the database under an associated name: a table-like composition defined by a predefined SQL query.
A view can expose all rows of a table or only selected rows.

CREATE VIEW IF NOT EXISTS view_name AS select_statement;

Create View IF NOT EXISTS user_salary as Select id,name salary from user;
-- Note the missing comma: "name salary" aliases the name column as salary, which is why the output below shows names under a salary header; the ALTER VIEW further down corrects this.
[server7:21000] test> Create View IF NOT EXISTS user_salary as Select id,name salary from user;
Query: Create View IF NOT EXISTS user_salary as Select id,name salary from user
+------------------------+
| summary                |
+------------------------+
| View has been created. |
+------------------------+
Fetched 1 row(s) in 0.06s
[server7:21000] test> select * from user_salary;
Query: select * from user_salary
Query submitted at: 2022-11-08 14:03:08 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=ad41be5d26690069:5aa01eb200000000
+----+----------+
| id | salary   |
+----+----------+
| 6  | Komal    |
| 5  | Hardik   |
| 4  | Chaitali |
| 3  | kaushik  |
| 2  | Khilan   |
| 1  | Ramesh   |
+----+----------+
Fetched 6 row(s) in 5.39s
[server7:21000] test> select * from user_salary;
Query: select * from user_salary
Query submitted at: 2022-11-08 14:03:16 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=694fabc4e8331823:5dc2d71100000000
+----+----------+
| id | salary   |
+----+----------+
| 5  | Hardik   |
| 6  | Komal    |
| 3  | kaushik  |
| 4  | Chaitali |
| 2  | Khilan   |
| 1  | Ramesh   |
+----+----------+
Fetched 6 row(s) in 0.23s

-- Modify a view
ALTER VIEW database_name.view_name AS select_statement

ALTER VIEW user_salary as select id ,name ,salary from user order by id ;
[server7:21000] test> ALTER VIEW user_salary as select id ,name ,salary from user order by id ;
Query: ALTER VIEW user_salary as select id ,name ,salary from user order by id
+------------------------+
| summary                |
+------------------------+
| View has been altered. |
+------------------------+
Fetched 1 row(s) in 0.08s
[server7:21000] test> select * from user_salary;
Query: select * from user_salary
Query submitted at: 2022-11-08 14:05:58 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=7245cdd5ba036b69:fe228af00000000
+----+----------+--------+
| id | name     | salary |
+----+----------+--------+
| 6  | Komal    | 32000  |
| 5  | Hardik   | 40000  |
| 3  | kaushik  | 30000  |
| 4  | Chaitali | 35000  |
| 2  | Khilan   | 15000  |
| 1  | Ramesh   | 20000  |
+----+----------+--------+
WARNINGS: Ignoring ORDER BY clause without LIMIT or OFFSET: ORDER BY id ASC.
An ORDER BY appearing in a view, subquery, union operand, or an insert/ctas statement has no effect on the query result unless a LIMIT and/or OFFSET is used in conjunction with the ORDER BY.

Fetched 6 row(s) in 0.34s
[server7:21000] test> select * from user_salary limit 10 ;
Query: select * from user_salary limit 10
Query submitted at: 2022-11-08 14:07:02 (Coordinator: http://server7:25000)
Query progress can be monitored at: http://server7:25000/query_plan?query_id=a445fd6d6329af5b:361cbd7100000000
+----+----------+--------+
| id | name     | salary |
+----+----------+--------+
| 1  | Ramesh   | 20000  |
| 5  | Hardik   | 40000  |
| 4  | Chaitali | 35000  |
| 6  | Komal    | 32000  |
| 3  | kaushik  | 30000  |
| 2  | Khilan   | 15000  |
+----+----------+--------+
WARNINGS: Ignoring ORDER BY clause without LIMIT or OFFSET: ORDER BY id ASC.
An ORDER BY appearing in a view, subquery, union operand, or an insert/ctas statement has no effect on the query result unless a LIMIT and/or OFFSET is used in conjunction with the ORDER BY.

Fetched 6 row(s) in 0.14s

-- Drop a view
DROP VIEW database_name.view_name;
drop view user_salary;
[server7:21000] test> drop view user_salary;
Query: drop view user_salary
+------------------------+
| summary                |
+------------------------+
| View has been dropped. |
+------------------------+
Fetched 1 row(s) in 0.06s

8) The ORDER BY clause

The Impala ORDER BY clause sorts data by one or more columns in ascending or descending order.
By default the result is sorted in ascending order.

select * from table_name ORDER BY col_name  [ASC|DESC] [NULLS FIRST|NULLS LAST]

The keywords ASC and DESC sort the data in ascending or descending order respectively.
With NULLS FIRST, all null values are placed in the top rows; with NULLS LAST, rows containing nulls are placed last.

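For example, against the employee table created earlier, sorting by salary in descending order with null salaries placed last:

select * from employee order by salary desc nulls last;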

9) The GROUP BY clause

The Impala GROUP BY clause works together with the SELECT statement to arrange identical data into groups.

select data from table_name GROUP BY col_name;
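For example, against the employee table created earlier, the total salary per age group might be computed as:

select age, sum(salary) as total_salary
from employee
group by age;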

10) The HAVING clause

The HAVING clause in Impala lets you specify conditions that filter which group results appear in the final output.
It is generally used together with GROUP BY, placing conditions on the groups that the GROUP BY clause creates.
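A sketch combining GROUP BY and HAVING against the employee table created earlier (the threshold is illustrative):

select age, max(salary) as max_salary
from employee
group by age
having max(salary) > 20000;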

11) LIMIT and OFFSET

The LIMIT clause in Impala restricts the result set to the desired number of rows; the result set never contains more records than the specified limit.
In general the rows of a SELECT result set are numbered starting from 0; with the OFFSET clause you decide where the output starts.

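For example, paging through the employee table created earlier; note that in Impala OFFSET is only meaningful together with ORDER BY:

select * from employee order by id limit 2 offset 2;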

12) The WITH clause

If a query is too complex, we can define aliases for its complex parts and include them in the query with Impala's WITH clause.

with x as (select 1), y as (select 2) (select * from x union select * from y);
-- Use the WITH clause to show the records of employees and customers older than 25.
with t1 as (select * from customers where age>25),
   t2 as (select * from employee where age>25)
   (select * from t1 union select * from t2);

13) DISTINCT

The DISTINCT operator in Impala returns unique values by removing duplicates.

select distinct columns from table_name;
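For example, listing the distinct ages in the employee table created earlier:

select distinct age from employee;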

II. Impala data import methods

1. LOAD DATA

  • Create a table:
create table user(id int, name string, age int)
row format delimited fields terminated by '\t';

Prepare a data file user.txt (tab-separated id, name, and age values) and upload it to the HDFS path /user/impala.

  • Load the data
load data inpath '/user/impala/' into table user;
  • Query the loaded data
select * from user;
-- If no data is returned, refresh the table:
refresh user;

2. INSERT INTO ... VALUES

This is very similar to inserting data in an RDBMS.

create table t_test2(id int, name string);
insert into table t_test2 values(1, 'zhangsan');

3. INSERT INTO ... SELECT

The data inserted into the table comes from the result of the SELECT query that follows.
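A minimal sketch using the t_test2 table from the previous step (t_test3 is an illustrative target table):

create table t_test3(id int, name string);
insert into t_test3 select id, name from t_test2;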

4. CREATE TABLE ... AS SELECT

The number of columns, the column types, and the data of the new table all come from the SELECT query that follows.
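A minimal sketch, again based on the t_test2 table above (t_test4 is an illustrative name):

create table t_test4 as select id, name from t_test2;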

III. Impala Java development

When Impala is used for database queries, the queries can be driven from Java code.

1. Download the Impala JDBC driver

Download from: https://www.cloudera.com/downloads/connectors/impala/jdbc/2-5-28.html
The driver is not available from a Maven repository, so the downloaded lib files must be added to the project's lib directory manually.

2. Java API example

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ImpalaJdbcDemo {
    static final String JDBC_DRIVER = "com.cloudera.impala.jdbc41.Driver";
    static final String CONNECTION_URL = "jdbc:impala://server6:21050";

    public static void test() {
        try {
            Class.forName(JDBC_DRIVER);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            return;
        }
        // try-with-resources closes the ResultSet, PreparedStatement and
        // Connection in reverse order even if the query throws, avoiding
        // the null checks a manual finally block would need.
        try (Connection con = DriverManager.getConnection(CONNECTION_URL);
             PreparedStatement ps = con.prepareStatement("select * from test.employee");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
                System.out.println(rs.getString(2));
                System.out.println(rs.getString(3));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        test();
    }
}

IV. Comprehensive example

The example uses Sqoop, Hive, and Impala (and, underneath them, Hadoop).

Two tables at the ten-million-row scale (12.6 million rows each) are joined on id, and the run times are compared.
Summary:
1. Importing the 12.6 million rows from MySQL into Hive took about 1 minute.
2. Copying the 12.6 million rows into a second table took about 1 minute.
3. Joining the two tables on id in Impala and showing 10 rows took 11.3 seconds.
4. Joining the two tables on id in Hive and showing 10 rows took 97.6 seconds.
5. In Impala, filtering on user_name = 'alan816662' took 1.3s: select * from dx_user where user_name = 'alan816662';
6. In Hive, the same query took 25.2s: select * from dx_user where user_name = 'alan816662';
7. In MySQL, the same query took 6s: select * from dx_user where user_name = 'alan816662';

1. Import the dx_user table directly into Hive with Sqoop

-- The import is fairly fast, taking about one minute
sqoop import --connect jdbc:mysql://192.168.10.44:3306/test --username root --password rootroot \
--table dx_user \
--hive-import \
--m 1 \
--hive-database test;

[root@server7 lib]# sqoop import --connect jdbc:mysql://192.168.10.44:3306/test --username root --password rootroot \
> --table dx_user \
> --hive-import \
> --m 1 \
> --hive-database test;

2. In Impala, create the users table with CREATE TABLE ... AS SELECT and populate it
create table users as select * from dx_user;

[server6:21000] test> create table users as select * from dx_user;
Query: create table users as select * from dx_user
Query submitted at: 2022-11-08 17:27:41 (Coordinator: http://server6:25000)
Query progress can be monitored at: http://server6:25000/query_plan?query_id=b645f9005742c571:856523dd00000000
+--------------------------+
| summary                  |
+--------------------------+
| Inserted 12606948 row(s) |
+--------------------------+
Fetched 1 row(s) in 73.71s

3. Join dx_user and users in Impala: 11.3s

select u.* from dx_user u join users us on u.id = us.id limit 10;

[server6:21000] test> select u.* from dx_user u join users us on u.id = us.id limit 10;
Query: select u.* from dx_user u join users us on u.id = us.id limit 10
Query submitted at: 2022-11-08 17:33:36 (Coordinator: http://server6:25000)
Query progress can be monitored at: http://server6:25000/query_plan?query_id=834879fce9994e47:f370cb1900000000
+-----------+------------+-----------+-------------+-----------------------+-----------------------+
| id        | user_name  | pass_word | phone       | email                 | create_day            |
+-----------+------------+-----------+-------------+-----------------------+-----------------------+
| 194771083 | alan816662 | 678425    | 13977776789 | alan.chan.chn@163.com | 2021-12-25 00:00:00.0 |
| 194771168 | alan821590 | 350443    | 13977776789 | alan.chan.chn@163.com | 2021-12-28 00:00:00.0 |
| 194771522 | alan421972 | 405125    | 13977776789 | alan.chan.chn@163.com | 2021-12-25 00:00:00.0 |
| 194772683 | alan340515 | 355354    | 13977776789 | alan.chan.chn@163.com | 2021-12-27 00:00:00.0 |
| 194772716 | alan227440 | 434742    | 13977776789 | alan.chan.chn@163.com | 2021-12-27 00:00:00.0 |
| 194773516 | alan568275 | 407620    | 13977776789 | alan.chan.chn@163.com | 2021-12-25 00:00:00.0 |
| 194774174 | alan359571 | 624075    | 13977776789 | alan.chan.chn@163.com | 2021-12-27 00:00:00.0 |
| 194775609 | alan752851 | 670004    | 13977776789 | alan.chan.chn@163.com | 2021-12-28 00:00:00.0 |
| 194778050 | alan29298  | 525850    | 13977776789 | alan.chan.chn@163.com | 2021-12-27 00:00:00.0 |
| 194778465 | alan351513 | 148031    | 13977776789 | alan.chan.chn@163.com | 2021-12-27 00:00:00.0 |
+-----------+------------+-----------+-------------+-----------------------+-----------------------+
Fetched 10 row(s) in 11.34s

4、Joining dx_user and users in Hive: 97.66s

0: jdbc:hive2://server8:10000> select u.* from dx_user u join users us on u.id = us.id limit 10;
INFO  : Compiling command(queryId=hive_20221108173507_9a00773e-44b6-41a5-8183-20fb26cac837): select u.* from dx_user u join users us on u.id = us.id limit 10
INFO  : Semantic Analysis Completed
INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:u.id, type:int, comment:null), FieldSchema(name:u.user_name, type:string, comment:null), FieldSchema(name:u.pass_word, type:string, comment:null), FieldSchema(name:u.phone, type:string, comment:null), FieldSchema(name:u.email, type:string, comment:null), FieldSchema(name:u.create_day, type:string, comment:null)], properties:null)
INFO  : Completed compiling command(queryId=hive_20221108173507_9a00773e-44b6-41a5-8183-20fb26cac837); Time taken: 0.167 seconds
INFO  : Executing command(queryId=hive_20221108173507_9a00773e-44b6-41a5-8183-20fb26cac837): select u.* from dx_user u join users us on u.id = us.id limit 10
WARN  : 
INFO  : Query ID = hive_20221108173507_9a00773e-44b6-41a5-8183-20fb26cac837
INFO  : Total jobs = 1
INFO  : Starting task [Stage-5:CONDITIONAL] in serial mode
INFO  : Stage-1 is selected by condition resolver.
INFO  : Launching Job 1 out of 1
INFO  : Starting task [Stage-1:MAPRED] in serial mode
INFO  : Number of reduce tasks not specified. Estimated from input data size: 32
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:8
INFO  : Submitting tokens for job: job_1667806921128_0029
INFO  : Executing with tokens: []
INFO  : The url to track the job: http://server8:8088/proxy/application_1667806921128_0029/
INFO  : Starting Job = job_1667806921128_0029, Tracking URL = http://server8:8088/proxy/application_1667806921128_0029/
INFO  : Kill Command = /opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop/bin/hadoop job  -kill job_1667806921128_0029
INFO  : Hadoop job information for Stage-1: number of mappers: 8; number of reducers: 32
INFO  : 2022-11-08 17:35:21,629 Stage-1 map = 0%,  reduce = 0%
INFO  : 2022-11-08 17:35:39,044 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 16.6 sec
INFO  : 2022-11-08 17:35:40,065 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 84.91 sec
INFO  : 2022-11-08 17:35:42,112 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU 88.26 sec
INFO  : 2022-11-08 17:35:43,133 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU 92.41 sec
INFO  : 2022-11-08 17:35:44,153 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU 97.14 sec
INFO  : 2022-11-08 17:35:45,181 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 108.92 sec
INFO  : 2022-11-08 17:35:53,351 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 118.54 sec
INFO  : 2022-11-08 17:35:55,387 Stage-1 map = 88%,  reduce = 0%, Cumulative CPU 132.31 sec
INFO  : 2022-11-08 17:35:59,464 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 147.49 sec
INFO  : 2022-11-08 17:36:03,669 Stage-1 map = 100%,  reduce = 6%, Cumulative CPU 160.92 sec
INFO  : 2022-11-08 17:36:04,693 Stage-1 map = 100%,  reduce = 13%, Cumulative CPU 177.2 sec
INFO  : 2022-11-08 17:36:07,845 Stage-1 map = 100%,  reduce = 16%, Cumulative CPU 184.74 sec
INFO  : 2022-11-08 17:36:08,883 Stage-1 map = 100%,  reduce = 19%, Cumulative CPU 190.1 sec
INFO  : 2022-11-08 17:36:12,015 Stage-1 map = 100%,  reduce = 28%, Cumulative CPU 215.15 sec
INFO  : 2022-11-08 17:36:15,091 Stage-1 map = 100%,  reduce = 34%, Cumulative CPU 227.9 sec
INFO  : 2022-11-08 17:36:19,314 Stage-1 map = 100%,  reduce = 44%, Cumulative CPU 249.71 sec
INFO  : 2022-11-08 17:36:21,366 Stage-1 map = 100%,  reduce = 47%, Cumulative CPU 255.4 sec
INFO  : 2022-11-08 17:36:22,385 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 262.19 sec
INFO  : 2022-11-08 17:36:26,588 Stage-1 map = 100%,  reduce = 63%, Cumulative CPU 289.58 sec
INFO  : 2022-11-08 17:36:28,631 Stage-1 map = 100%,  reduce = 66%, Cumulative CPU 296.3 sec
INFO  : 2022-11-08 17:36:32,860 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 315.67 sec
INFO  : 2022-11-08 17:36:33,885 Stage-1 map = 100%,  reduce = 78%, Cumulative CPU 322.59 sec
INFO  : 2022-11-08 17:36:35,945 Stage-1 map = 100%,  reduce = 81%, Cumulative CPU 329.1 sec
INFO  : 2022-11-08 17:36:38,049 Stage-1 map = 100%,  reduce = 84%, Cumulative CPU 334.52 sec
INFO  : 2022-11-08 17:36:39,072 Stage-1 map = 100%,  reduce = 88%, Cumulative CPU 340.77 sec
INFO  : 2022-11-08 17:36:40,090 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 354.81 sec
INFO  : 2022-11-08 17:36:41,138 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 361.19 sec
INFO  : 2022-11-08 17:36:44,205 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 366.51 sec
INFO  : MapReduce Total cumulative CPU time: 6 minutes 6 seconds 510 msec
INFO  : Ended Job = job_1667806921128_0029
INFO  : MapReduce Jobs Launched: 
INFO  : Stage-Stage-1: Map: 8  Reduce: 32   Cumulative CPU: 366.51 sec   HDFS Read: 2112219819 HDFS Write: 33035 HDFS EC Read: 0 SUCCESS
INFO  : Total MapReduce CPU Time Spent: 6 minutes 6 seconds 510 msec
INFO  : Completed executing command(queryId=hive_20221108173507_9a00773e-44b6-41a5-8183-20fb26cac837); Time taken: 97.43 seconds
INFO  : OK
+-----------+--------------+--------------+--------------+------------------------+------------------------+
|   u.id    | u.user_name  | u.pass_word  |   u.phone    |        u.email         |      u.create_day      |
+-----------+--------------+--------------+--------------+------------------------+------------------------+
| 90836928  | alan67100    | 471807       | 13977776789  | alan.chan.chn@163.com  | 2021-12-27 00:00:00.0  |
| 90837408  | alan348308   | 553402       | 13977776789  | alan.chan.chn@163.com  | 2021-12-27 00:00:00.0  |
| 90838848  | alan167429   | 473683       | 13977776789  | alan.chan.chn@163.com  | 2021-12-27 00:00:00.0  |
| 90843456  | alan567446   | 200171       | 13977776789  | alan.chan.chn@163.com  | 2021-12-28 00:00:00.0  |
| 90843552  | alan876331   | 267453       | 13977776789  | alan.chan.chn@163.com  | 2021-12-25 00:00:00.0  |
| 90843872  | alan387008   | 184127       | 13977776789  | alan.chan.chn@163.com  | 2021-12-27 00:00:00.0  |
| 90846048  | alan712188   | 214632       | 13977776789  | alan.chan.chn@163.com  | 2021-12-25 00:00:00.0  |
| 90846112  | alan371470   | 265403       | 13977776789  | alan.chan.chn@163.com  | 2021-12-25 00:00:00.0  |
| 90846240  | alan48741    | 161574       | 13977776789  | alan.chan.chn@163.com  | 2021-12-28 00:00:00.0  |
| 90848576  | alan167204   | 517801       | 13977776789  | alan.chan.chn@163.com  | 2021-12-25 00:00:00.0  |
+-----------+--------------+--------------+--------------+------------------------+------------------------+
10 rows selected (97.66 seconds)
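From the two runs above, the same ten-row join took about 11.34s in Impala and 97.66s in Hive. As a rough arithmetic sketch (the two constants are the single measurements reported above, not averaged benchmarks), the speedup works out to roughly 8.6x:

```java
// A minimal sketch comparing the wall-clock times measured above.
// The two constants are single measurements, not averaged benchmarks.
public class JoinSpeedup {
    public static void main(String[] args) {
        double impalaSeconds = 11.34; // Impala join: "Fetched 10 row(s) in 11.34s"
        double hiveSeconds = 97.66;   // Hive join: "10 rows selected (97.66 seconds)"
        double speedup = hiveSeconds / impalaSeconds;
        System.out.printf("Impala is about %.1fx faster on this join%n", speedup);
        // prints: Impala is about 8.6x faster on this join
    }
}
```

The gap mainly reflects execution models: Impala runs the join in memory across its daemons, while Hive compiles the same query into a MapReduce job (8 mappers and 32 reducers in the log above), paying job-launch and shuffle overhead on every query.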

The above briefly introduced Impala's SQL syntax with usage examples, covered several ways of importing data into Impala and operating Impala through the Java API, and closed with a simple end-to-end example comparing the query speed of Impala and Hive.
