INSERT Statement.
Impala supports inserting into tables and partitions created with the Impala CREATE TABLE
statement, as well as into pre-existing tables and partitions created through Hive.
Syntax
[with_clause]
INSERT [hint_clause] { INTO | OVERWRITE } [TABLE] table_name
[(column_list)]
[ PARTITION (partition_clause)]
{
[hint_clause] select_statement
| VALUES (value [, value ...]) [, (value [, value ...]) ...]
}
partition_clause ::= col_name [= constant] [, col_name [= constant] ...]
hint_clause ::=
hint_with_dashes |
hint_with_cstyle_comments |
hint_with_brackets
hint_with_dashes ::= -- +SHUFFLE | -- +NOSHUFFLE | -- +CLUSTERED
hint_with_cstyle_comments ::= /* +SHUFFLE */ | /* +NOSHUFFLE */ | /* +CLUSTERED */
hint_with_brackets ::= [SHUFFLE] | [NOSHUFFLE]
(With this hint format, the square brackets are part of the syntax.)
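For instance, a shuffle hint can be combined with a static or dynamic partitioned insert. All table and column names below are illustrative, not taken from the examples in this document:

```sql
-- /* +SHUFFLE */ redistributes rows by partition key before writing,
-- so each partition is written by fewer concurrent writers.
-- Static partition: the year value is fixed in the PARTITION clause.
INSERT INTO sales PARTITION (year=2023) /* +SHUFFLE */
  SELECT id, amount FROM staging_2023;

-- Dynamic partition: the year value comes from the last column of the select list.
INSERT INTO sales PARTITION (year) /* +SHUFFLE */
  SELECT id, amount, year FROM staging_all;
```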
INSERT INTO
: appends data to the table. Existing data is left as-is, and the inserted rows go into one or more new data files.
INSERT OVERWRITE
: replaces the data already in the table. The overwritten data files are deleted immediately; they do not go through the HDFS trash mechanism.
The INSERT
statement currently cannot write data files containing complex types (ARRAY, STRUCT, and MAP). At present, such tables must use the Parquet file format: generate the Parquet data files outside Impala, then use LOAD DATA
or CREATE EXTERNAL TABLE
to associate those data files with the table.
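A sketch of that workaround (the HDFS path and table definition are hypothetical):

```sql
-- The Parquet files with complex-typed columns are produced outside Impala
-- (for example by Hive or Spark) and uploaded to HDFS ahead of time;
-- the external table then points at that existing directory.
CREATE EXTERNAL TABLE complex_demo (
  id INT,
  tags ARRAY<STRING>,
  attrs MAP<STRING, STRING>
)
STORED AS PARQUET
LOCATION '/user/doc_demo/complex_parquet';
```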
Note: Impala does not automatically up-convert column types on insert; convert manually with CAST(column AS column_type).
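A minimal sketch of such a manual cast (table and column names are hypothetical):

```sql
-- The source column is INT but the target column is BIGINT, so the value
-- is cast explicitly instead of relying on an implicit conversion.
INSERT INTO target_table (big_col)
  SELECT CAST(small_col AS BIGINT) FROM source_table;
```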
For example, create a text-format table and a Parquet-format table to insert into:
CREATE DATABASE IF NOT EXISTS file_formats;
USE file_formats;
DROP TABLE IF EXISTS text_table;
CREATE TABLE text_table
( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP )
STORED AS TEXTFILE;
DROP TABLE IF EXISTS parquet_table;
CREATE TABLE parquet_table
( id INT, col_1 BOOLEAN, col_2 DOUBLE, col_3 TIMESTAMP )
STORED AS PARQUET;
An INSERT INTO statement keeps appending data to the table; if the same statement runs twice, the added rows are doubled, as in the following example:
[localhost:21000] > insert into table text_table select * from default.tab1;
Inserted 5 rows in 0.41s
[localhost:21000] > insert into table text_table select * from default.tab1;
Inserted 5 rows in 0.46s
[localhost:21000] > select count(*) from text_table;
+----------+
| count(*) |
+----------+
| 10 |
+----------+
Returned 1 row(s) in 0.26s
With an INSERT OVERWRITE statement, the previous data is deleted first, as in the following example:
[localhost:21000] > insert into table parquet_table select * from default.tab1;
Inserted 5 rows in 0.35s
[localhost:21000] > insert overwrite table parquet_table select * from default.tab1 limit 3;
Inserted 3 rows in 0.43s
[localhost:21000] > select count(*) from parquet_table;
+----------+
| count(*) |
+----------+
| 3 |
+----------+
Returned 1 row(s) in 0.43s
Note: any sorting syntax used in an INSERT statement is ignored.
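For instance, the ORDER BY below is accepted but does not guarantee any row order in the resulting data files:

```sql
-- The sort clause has no effect on the insert; rows land in the data
-- files in whatever order the scan produces them.
INSERT INTO text_table SELECT * FROM default.tab1 ORDER BY id;
```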
Example using a VALUES clause:
[localhost:21000] > describe val_example;
Query: describe val_example
Query finished, fetching results ...
+-------+---------+---------+
| name | type | comment |
+-------+---------+---------+
| id | int | |
| col_1 | boolean | |
| col_2 | double | |
+-------+---------+---------+
[localhost:21000] > insert into val_example values (1,true,100.0);
Inserted 1 rows in 0.30s
[localhost:21000] > select * from val_example;
+----+-------+-------+
| id | col_1 | col_2 |
+----+-------+-------+
| 1 | true | 100 |
+----+-------+-------+
[localhost:21000] > insert overwrite val_example values (10,false,pow(2,5)), (50,true,10/3);
Inserted 2 rows in 0.16s
[localhost:21000] > select * from val_example;
+----+-------+-------------------+
| id | col_1 | col_2 |
+----+-------+-------------------+
| 10 | false | 32 |
| 50 | true | 3.333333333333333 |
+----+-------+-------------------+
LOAD DATA Statement.
Syntax
LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]
Notes:
- The loaded data files are moved into the Impala table directory, not copied.
- When the specified path is a directory, Impala loads every file in it. Make sure the directory contains no nested subdirectories; if it does, the statement fails.
- Wildcards cannot be used to specify the files to load.
- When the specified path is a directory, hidden files in it are not loaded.
- LOAD DATA loads files only from HDFS, not from the local filesystem, so the file path does not need the
hdfs://
prefix.
- Only very limited checks are performed while loading; if the file format is wrong, errors may not surface until query time.
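The PARTITION clause from the syntax above can be sketched as follows (the table name, partition column, and path are illustrative):

```sql
-- Moves the file into the year=2023 partition directory of the table;
-- as with any LOAD DATA, the source file is moved rather than copied.
LOAD DATA INPATH '/user/doc_demo/sales_2023.txt'
INTO TABLE sales PARTITION (year=2023);
```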
Example
random_strings.py 1000 | hdfs dfs -put - /user/doc_demo/thousand_strings.txt
random_strings.py 100 | hdfs dfs -put - /user/doc_demo/hundred_strings.txt
random_strings.py 10 | hdfs dfs -put - /user/doc_demo/ten_strings.txt
[localhost:21000] > create table t1 (s string);
[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.61s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0 |
+------+
| 1000 |
+------+
Returned 1 row(s) in 0.67s
-- Loading the same path again fails, because the source data file no longer exists
[localhost:21000] > load data inpath '/user/doc_demo/thousand_strings.txt' into table t1;
ERROR: AnalysisException: INPATH location '/user/doc_demo/thousand_strings.txt' does not exist.
-- Listing the table's files shows the file kept its original name
$ hdfs dfs -ls /user/hive/warehouse/load_data_testing.db/t1
Found 1 items
-rw-r--r-- 1 doc_demo doc_demo 13926 2013-06-26 15:40 /user/hive/warehouse/load_data_testing.db/t1/thousand_strings.txt
-- Loading the 100-row data file now adds 100 rows to the table
[localhost:21000] > load data inpath '/user/doc_demo/hundred_strings.txt' into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 2 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.24s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+------+
| _c0 |
+------+
| 1100 |
+------+
Returned 1 row(s) in 0.55s
-- With the OVERWRITE keyword, the data previously in the table is replaced
[localhost:21000] > load data inpath '/user/doc_demo/ten_strings.txt' overwrite into table t1;
Query finished, fetching results ...
+----------------------------------------------------------+
| summary |
+----------------------------------------------------------+
| Loaded 1 file(s). Total files in destination location: 1 |
+----------------------------------------------------------+
Returned 1 row(s) in 0.26s
[localhost:21000] > select count(*) from t1;
Query finished, fetching results ...
+-----+
| _c0 |
+-----+
| 10 |
+-----+
Returned 1 row(s) in 0.62s