Hive Object Basic Operations

In a previous post we covered setting up the Hive environment:

Hive 3.1.1 Environment Setup Guide

https://www.cndba.cn/dave/article/3335

In this post we walk through the basic operations on Hive objects.

1 Creating a Database

hive> create database cndba;

OK

Time taken: 1.103 seconds

hive>

Creating a new database generates a cndba.db directory under /user/hive/warehouse/ in HDFS. If you do not create a database and do not switch with the use command, Hive works in the default database. You can also switch to it explicitly:

hive> use default;

By default, tables are created under /user/hive/warehouse/:

[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /user

drwxr-xr-x - cndba supergroup 0 2019-01-23 23:55 /user/cndba

drwxr-xr-x - cndba supergroup 0 2019-01-23 23:55 /user/cndba/output

-rw-r--r-- 2 cndba supergroup 0 2019-01-23 23:55 /user/cndba/output/_SUCCESS

-rw-r--r-- 2 cndba supergroup 34795 2019-01-23 23:55 /user/cndba/output/part-r-00000

drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive

drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive/warehouse

drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive/warehouse/cndba.db

[dave@www.cndba.cn ~]$

2 Creating a Table

Syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name

[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment]

[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

[CLUSTERED BY (col_name, col_name, ...)

[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

[ROW FORMAT row_format]

[STORED AS file_format]

[LOCATION hdfs_path]

Example:

hive> create table employees(id int,name string,salary string) row format delimited fields terminated by ',';

2019-03-09 01:10:08,488 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] DataNucleus.Persistence: Request to load fields "comment,name,type" of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored

2019-03-09 01:10:08,488 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] DataNucleus.Persistence: Request to load fields "comment,name,type" of class org.apache.hadoop.hive.metastore.model.MFieldSchema but object is embedded, so ignored

OK

Time taken: 0.071 seconds

hive>

The statement above creates the employees table; the corresponding table and column information is added to the TBLS metadata table in MySQL. At the same time, an employees directory is created in HDFS under the current database's directory (here the default database, so directly under /user/hive/warehouse/). All table data (except for External Tables) is stored under this directory.

[dave@www.cndba.cn ~]$ hdfs dfs -ls -R /user/hive

drwxr-xr-x - cndba supergroup 0 2019-03-09 01:10 /user/hive/warehouse

drwxr-xr-x - cndba supergroup 0 2019-03-07 23:48 /user/hive/warehouse/cndba.db

drwxr-xr-x - cndba supergroup 0 2019-03-08 21:55 /user/hive/warehouse/dave

drwxr-xr-x - cndba supergroup 0 2019-03-09 01:10 /user/hive/warehouse/employees

drwxr-xr-x - cndba supergroup 0 2019-03-07 23:53 /user/hive/warehouse/t_cndba

drwxr-xr-x - cndba supergroup 0 2019-03-08 21:27 /user/hive/warehouse/tpm_cndba

[dave@www.cndba.cn ~]$

3 Creating a Temporary (Intermediate) Table

A table created with CTAS can be used to store intermediate results:

hive> create table tpm_cndba as select * from t_cndba;

Query ID = cndba_20190308212650_1110d65a-bdbc-43d2-8386-38e12391eb95

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

2019-03-08 21:26:54,059 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032

2019-03-08 21:26:54,475 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032

Starting Job = job_1551186730130_0001, Tracking URL = http://hadoopmaster:8088/proxy/application_1551186730130_0001/

Kill Command = /home/cndba/hadoop/bin/mapred job -kill job_1551186730130_0001

Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0

2019-03-08 21:27:07,893 Stage-1 map = 0%, reduce = 0%

Ended Job = job_1551186730130_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/.hive-staging_hive_2019-03-08_21-26-50_554_9030156807573657821-1/-ext-10002

Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/tpm_cndba

MapReduce Jobs Launched:

Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 19.886 seconds

hive> show tables;

OK

t_cndba

tpm_cndba

Time taken: 0.021 seconds, Fetched: 2 row(s)

hive>
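The table above is an ordinary CTAS table. As a side note, Hive 0.14 and later also support session-scoped temporary tables, which are dropped automatically when the session ends; a minimal sketch (the name tmp_session_cndba is made up for illustration):

-- hypothetical table name; the table disappears when the Hive session ends
create temporary table tmp_session_cndba as select * from t_cndba;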

4 Loading Data with LOAD

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

With LOCAL, the file is loaded from the local filesystem (it is copied into HDFS).

Without LOCAL, the data is loaded from HDFS (note: the file is moved, not copied, and the file name is unchanged).

OVERWRITE determines whether the existing data in the table (or in the specified partition) is overwritten (without OVERWRITE the data is simply appended, with no deduplication!).

Example:

Load data from the local filesystem into the table, appending to the existing data:

hive> load data local inpath '/home/cndba/dave.txt' into table employees;

Loading data to table default.employees

OK

Time taken: 0.265 seconds

hive>

hive> select * from employees;

OK

100 Steven 24000.00

101 Neena 17000.00

102 Lex 17000.00

103 Alexander 9000.00

104 Bruce 6000.00

105 David 4800.00

106 Valli 4800.00

107 Diana 4200.00

Load data from HDFS into the table, appending records:

hive> load data inpath '/dave/dave.txt' into table employees;

Loading data to table default.employees

OK

Time taken: 0.207 seconds

Load data from HDFS into the table, overwriting the existing data:

hive> load data inpath '/dave/dave.txt' overwrite into table employees;

Loading data to table default.employees

OK

Time taken: 0.258 seconds

hive>

One thing to note: the field delimiter of the text data is specified at table creation time. To load a file with a custom delimiter into a table, declare that delimiter in the CREATE TABLE statement first, and then LOAD DATA into the table.

5 Inserting Data

5.1 INSERT Syntax

The syntax is as follows:

Standard syntax:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement

Hive extension (multiple inserts):

FROM from_statement

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1

[INSERT OVERWRITE TABLE tablename2 [PARTITION ...] select_statement2] ...

Hive extension (dynamic partition inserts):

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement

In an INSERT, the FROM clause can be placed after the SELECT or before the INSERT; the two statements below are equivalent forms (the second simply adds a WHERE filter):

hive> insert overwrite table cn select * from employees;

hive> from employees insert overwrite table cn select * where id > 1;

Query ID = cndba_20190309020852_5bc199f6-0431-43c0-adad-4e91b8b4ae1a

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

2019-03-09 02:08:53,370 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032

2019-03-09 02:08:53,403 INFO [89328b4b-0873-45ca-b2a5-fb2986043911 main] client.RMProxy: Connecting to ResourceManager at hadoopmaster/192.168.20.80:8032

Starting Job = job_1551186730130_0005, Tracking URL = http://hadoopmaster:8088/proxy/application_1551186730130_0005/

Kill Command = /home/cndba/hadoop/bin/mapred job -kill job_1551186730130_0005

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-03-09 02:09:02,268 Stage-1 map = 0%, reduce = 0%

2019-03-09 02:09:10,589 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.85 sec

2019-03-09 02:09:19,875 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.99 sec

MapReduce Total cumulative CPU time: 7 seconds 990 msec

Ended Job = job_1551186730130_0005

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to directory hdfs://hadoopmaster:9000/user/hive/warehouse/cn/.hive-staging_hive_2019-03-09_02-08-52_882_2438057570131793940-1/-ext-10000

Loading data to table default.cn

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.99 sec HDFS Read: 17771 HDFS Write: 3611 SUCCESS

Total MapReduce CPU Time Spent: 7 seconds 990 msec

OK

Time taken: 28.319 seconds

hive>

In its classic usage, Hive does not insert rows one at a time with INSERT statements and does not support UPDATE; data is loaded into an existing table with LOAD, and once imported it is not modified. You either drop the whole table, or create a new table and import new data. (Newer Hive versions do support INSERT INTO ... VALUES and, on transactional/ACID tables, UPDATE and DELETE.)

If you want to insert a single row, there is a workaround: assuming a table B with at least one row, a single row can be inserted into table A(int, string) like this:

from B insert into table A select 1, 'abc' limit 1;
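On Hive 0.14 and later you can also insert a single row directly with INSERT ... VALUES; a minimal sketch, assuming the same table A(int, string) as above:

-- single-row insert, supported in Hive 0.14+
insert into table A values (1, 'abc');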

5.2 Write Syntax (Exporting Query Results to a Directory)

The syntax is as follows:

Standard syntax:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

Hive extension (multiple inserts):

FROM from_statement

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1

[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

Export to a local directory:

hive> insert overwrite local directory '/tmp/dave' select * from employees;

Note that the result is written into the /tmp/dave directory, and a data file is generated inside it:

[hadoop@hadoopMaster tmp]$ ll /tmp/dave

total 8

-rw-r--r--. 1 hadoop hadoop 8056 Mar 9 21:30 000000_0

[hadoop@hadoopMaster tmp]$ ll /tmp/dave/000000_0

-rw-r--r--. 1 hadoop hadoop 8056 Mar 9 21:30 /tmp/dave/000000_0

[hadoop@hadoopMaster tmp]$

hive> insert overwrite directory '/dave' select * from employees;

[hadoop@Slave1 hadoop]$ hdfs dfs -ls /dave

Found 2 items

-rw-r--r-- 2 hadoop supergroup 8056 2019-03-09 21:31 /dave/000000_0

-rw-r--r-- 2 hadoop supergroup 0 2019-03-03 01:11 /dave/_SUCCESS

[hadoop@Slave1 hadoop]$

A single source can be inserted into multiple target tables or files at once; a multi-target insert can be written as one statement:

FROM src

INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100

INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200

INSERT OVERWRITE TABLE dest3 PARTITION(ds='2013-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;

6 Showing Tables

List table names:

hive> show tables;

OK

employees

page_view

Time taken: 0.112 seconds, Fetched: 2 row(s)

hive>

List table names with partial matching; note that the pattern is a regular expression, so there is a dot before the star:

hive> show tables 'cn.*';

OK

cndba

Time taken: 0.117 seconds, Fetched: 1 row(s)

hive>

Show all partitions of a table (an error is raised if the table has none):

SHOW PARTITIONS page_view;

Describe a table's structure:

hive> desc cndba;

OK

id int

name string

salary string

Time taken: 0.138 seconds, Fetched: 3 row(s)

hive>

View a limited number of rows with the LIMIT keyword, just as in Greenplum:

hive> select * from employees limit 3;

OK

100 Steven 24000.00

101 Neena 17000.00

102 Lex 17000.00

Time taken: 0.561 seconds, Fetched: 3 row(s)

hive>

7 External Tables

When Hive creates an internal (managed) table, it moves the data into the path the data warehouse points to; when it creates an external table, it only records the path where the data lives and does not move the data at all, so the data source can be anywhere. When an external table is dropped, only the metadata is deleted; the data stored in HDFS is kept.

[hadoop@hadoopMaster hadoop]$ hdfs dfs -mkdir -p /ext/ahdba

[hadoop@hadoopMaster ~]$ hdfs dfs -put ~/ahdba.txt /ext/ahdba

[hadoop@hadoopMaster hadoop]$ hdfs dfs -ls -R /ext

drwxr-xr-x - hadoop supergroup 0 2019-03-09 21:52 /ext/ahdba

-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 21:52 /ext/ahdba/ahdba.txt

hive> create external table ahdba(id int,name string,salary string) row format delimited fields terminated by ',' stored as textfile location '/ext/ahdba';

OK

Time taken: 0.266 seconds

hive>

hive> select * from ahdba limit 3;

OK

100 Steven 24000.00

101 Neena 17000.00

102 Lex 17000.00

Time taken: 0.243 seconds, Fetched: 3 row(s)

hive>

8 Partitioned Tables

When files are large, a partitioned table lets you quickly filter out the data by the partition column. Suppose the table anqing has two partitions, part1 and part2: in HDFS this simply means two subdirectories named part1 and part2 under the anqing directory. A partition in Hive is just an extra directory level; the benefits are easier statistics, better efficiency, and a smaller dataset to scan.

Example:

hive> create table anqing(id int,name string,salary string) partitioned by (part_flag string) row format delimited fields terminated by ',';

Load data into the partitions:

hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='1');

Loading data to table default.anqing partition (part_flag=1)

OK

Time taken: 1.003 seconds

hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='2');

Loading data to table default.anqing partition (part_flag=2)

OK

Time taken: 0.752 seconds

hive> load data local inpath '/home/hadoop/dave.txt' overwrite into table anqing partition(part_flag='3');

Loading data to table default.anqing partition (part_flag=3)

OK

Time taken: 1.013 seconds

hive>

The directory structure looks like this:

[hadoop@Slave1 hadoop]$ hdfs dfs -ls -R /user/hive/warehouse/anqing

drwxr-xr-x - hadoop supergroup 0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=1

-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=1/dave.txt

drwxr-xr-x - hadoop supergroup 0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=2

-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=2/dave.txt

drwxr-xr-x - hadoop supergroup 0 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=3

-rw-r--r-- 2 hadoop supergroup 2120 2019-03-09 22:05 /user/hive/warehouse/anqing/part_flag=3/dave.txt

[hadoop@Slave1 hadoop]$

hive> select count(1) from anqing where part_flag='1';

Query ID = hadoop_20190309220709_376b5416-1e10-4633-9ebf-aa5ed30eadd0

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

2019-03-09 22:07:10,274 INFO [11ef99e0-fe20-400e-87b6-1006d834fa31 main] client.RMProxy: Connecting to ResourceManager at hadoopMaster/192.168.56.100:8032

2019-03-09 22:07:10,301 INFO [11ef99e0-fe20-400e-87b6-1006d834fa31 main] client.RMProxy: Connecting to ResourceManager at hadoopMaster/192.168.56.100:8032

Starting Job = job_1552137405153_0007, Tracking URL = http://hadoopMaster:8088/proxy/application_1552137405153_0007/

Kill Command = /home/hadoop/hadoop/bin/mapred job -kill job_1552137405153_0007

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2019-03-09 22:07:17,411 Stage-1 map = 0%, reduce = 0%

2019-03-09 22:07:24,641 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.68 sec

2019-03-09 22:07:29,805 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.54 sec

MapReduce Total cumulative CPU time: 3 seconds 540 msec

Ended Job = job_1552137405153_0007

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.54 sec HDFS Read: 14697 HDFS Write: 103 SUCCESS

Total MapReduce CPU Time Spent: 3 seconds 540 msec

OK

106

Time taken: 22.651 seconds, Fetched: 1 row(s)

hive>

Related commands:

SHOW TABLES; # list all tables

SHOW TABLES 'TMP'; # pattern matching is supported

SHOW PARTITIONS TMP_TABLE; # list the partitions of a table

DESCRIBE TMP_TABLE; # show the table structure

hive> show partitions anqing;

OK

part_flag=1

part_flag=2

part_flag=3

Time taken: 0.171 seconds, Fetched: 3 row(s)

hive> desc anqing;

OK

id int

name string

salary string

part_flag string

# Partition Information

# col_name data_type comment

part_flag string

Time taken: 0.175 seconds, Fetched: 8 row(s)

hive>
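Besides loading files into static partitions as shown above, partitions can also be populated from a query with dynamic partition inserts; a minimal sketch, assuming the anqing table above and a hypothetical source table src_emp(id, name, salary, part_flag):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- src_emp is a hypothetical source table; the last selected column supplies the partition value
insert overwrite table anqing partition(part_flag)
select id, name, salary, part_flag from src_emp;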

9 Bucketed Tables

Bucketing in Hive corresponds to partitioning in MapReduce, whereas a Hive partition merely splits the data into different directories.

9.1 Creating a Bucketed Table

hive> create table emp_buck(id int,name string,salary string) clustered by(id) sorted by(id DESC) into 4 buckets row format delimited fields terminated by ',';

OK

Time taken: 0.192 seconds

hive>

Meaning: the data is bucketed by the id column into 4 buckets, and within each bucket it is locally ordered by id. Creating the bucketed table does not process any data; it only requires that the data inserted into it already be bucketed.

9.2 Inserting Data into a Bucketed Table

LOAD is generally not used to put data into a bucketed table, because loaded data is not automatically split into the four bucket files according to the bucketing rule. Instead, data is usually inserted into a bucketed table from a SELECT query.

Set the variables: enable bucketing and set the number of reducers to the number of buckets:

set hive.enforce.bucketing = true;

set mapreduce.job.reduces=4;

hive> insert overwrite table emp_buck select * from ahdba cluster by(id);

Or:

insert overwrite table emp_buck select * from ahdba distribute by(id) sort by(id asc);

Here distribute by(id) sort by(id asc) is equivalent to cluster by(id); cluster by(id) = bucketing + sorting.

The data is first distributed and then locally sorted. The difference is that DISTRIBUTE BY is more flexible: you can distribute by one column and sort by another.

The subquery produces 4 output files, which serve as the input of the outer insert.

9.3 How Bucketed Tables Work and What They Are For

Principle:

Hive organizes buckets around a particular column. It hashes the column value and takes the remainder modulo the number of buckets to decide which bucket a record goes into (the same principle as MapReduce's getPartition method).

Uses:

(1) The biggest benefit is improving the efficiency of join operations.

The prerequisite is that both tables are bucketed and their bucket counts are equal or multiples of each other.

For example: select a.id,a.name,b.addr from a join b on a.id = b.id;

Suppose tables a and b are both bucketed, and the bucketing column is id.

When the two tables being joined share a column and both are bucketed on it, the join only needs to match up buckets that hold the same column values, which can greatly reduce the amount of data involved in the JOIN.
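To actually take advantage of this, Hive's bucket map join optimization can be switched on; a minimal sketch, assuming a and b are both bucketed on id with compatible bucket counts:

set hive.optimize.bucketmapjoin = true;
select a.id, a.name, b.addr from a join b on a.id = b.id;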

(2) Sampling becomes more efficient. When working with very large datasets, it is convenient to be able to try a query on a small slice of the data while developing or modifying it.
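For example, the emp_buck table created above can be sampled by bucket with TABLESAMPLE; a quick sketch that reads only one of the four buckets:

select * from emp_buck tablesample(bucket 1 out of 4 on id);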

10 Creating Views

CREATE VIEW [IF NOT EXISTS] view_name [ (column_name [COMMENT column_comment], ...) ]

[COMMENT view_comment]

[TBLPROPERTIES (property_name = property_value, ...)]

AS SELECT ...

Note: views are read-only; LOAD/INSERT/ALTER cannot be used on them.

hive> create view v_ahdba as select * from ahdba;

OK

Time taken: 0.43 seconds

hive> select * from v_ahdba limit 3;

OK

100 Steven 24000.00

101 Neena 17000.00

102 Lex 17000.00

Time taken: 0.21 seconds, Fetched: 3 row(s)

hive>

11 Alter Table

11.1 Adding Partitions

ALTER TABLE table_name ADD [IF NOT EXISTS] partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...

partition_spec:

: PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)

Example:

hive> show partitions anqing;

OK

part_flag=1

part_flag=2

part_flag=3

Time taken: 0.175 seconds, Fetched: 3 row(s)

hive>

hive> ALTER TABLE anqing ADD

> PARTITION (part_flag='4') location '/user/hive/warehouse/anqing/part_flag=4'

> PARTITION (part_flag='5') location '/user/hive/warehouse/anqing/part_flag=5' ;

OK

Time taken: 0.422 seconds

hive> show partitions anqing;

OK

part_flag=1

part_flag=2

part_flag=3

part_flag=4

part_flag=5

Time taken: 0.317 seconds, Fetched: 5 row(s)

hive>

Dropping partitions:

ALTER TABLE table_name DROP partition_spec, partition_spec,...

Example:

hive> ALTER TABLE anqing DROP PARTITION (part_flag='4');

Dropped the partition part_flag=4

OK

Time taken: 0.451 seconds

hive> show partitions anqing;

OK

part_flag=1

part_flag=2

part_flag=3

part_flag=5

Time taken: 0.144 seconds, Fetched: 4 row(s)

hive>

11.2 Renaming a Table

hive> show tables;

OK

ahdba

anqing

cndba

emp_buck

employees

page_view

v_ahdba

Time taken: 0.046 seconds, Fetched: 7 row(s)

hive> alter table ahdba rename to huaining;

OK

Time taken: 0.187 seconds

hive> show tables;

OK

anqing

cndba

emp_buck

employees

huaining

page_view

v_ahdba

Time taken: 0.055 seconds, Fetched: 7 row(s)

hive>

This command lets you rename a table. The location of the data and the partition names do not change; in other words, the old table name is not "released", and changes made against the old table's location affect the new table's data.

11.3 Changing Columns/Properties

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

This command allows changing a column's name, data type, comment, position, or any combination of these.
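A minimal sketch against the employees table created earlier (the new column name emp_salary and the comment are made up for illustration):

-- emp_salary is a hypothetical new name for the existing salary column
ALTER TABLE employees CHANGE COLUMN salary emp_salary string COMMENT 'salary stored as a string';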

11.4 Adding/Replacing Columns

ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], …)

ADD appends new columns after all existing columns (but before the partition columns); REPLACE replaces all the columns of the table.
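A small sketch, again using the employees table (the dept column is hypothetical):

-- dept is a hypothetical column added for illustration
ALTER TABLE employees ADD COLUMNS (dept string COMMENT 'department name');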

12 Ways to Save SELECT Query Results

1. Save the query result into a new Hive table:

create table t_tmp

as

select * from t_p;

2. Save the query result into an existing Hive table:

insert into table t_tmp

select * from t_p;

3. Save the query result to a file directory (either local or HDFS):

insert overwrite local directory '/home/hadoop/test'

select * from t_p;

Write to HDFS:

insert overwrite directory '/aaa/test'

select * from t_p;

13 Querying and Deleting Data

13.1 Queries

13.1.1 Syntax

SELECT [ALL | DISTINCT] select_expr, select_expr, ...

FROM table_reference

[WHERE where_condition]

[GROUP BY col_list]

[ CLUSTER BY col_list

| [DISTRIBUTE BY col_list] [SORT BY col_list]

]

[LIMIT number]

CLUSTER BY column: the data is partitioned across reducers by this column; remember to set the number of reducers.

ORDER BY performs a global sort of the input, so there is only one reducer, which leads to long run times when the input is large.

SORT BY is not a global sort; it sorts the data before it enters each reducer. So if you sort with SORT BY and set mapred.reduce.tasks>1, SORT BY only guarantees that each reducer's output is ordered, not that the overall output is ordered.

DISTRIBUTE BY(column) routes data to different reducers by the given column, using hash distribution.

CLUSTER BY(column) has the effect of DISTRIBUTE BY plus sorting on that column.

When the distribution column and the sort column are the same, cluster by = distribute by + sort by, as the sketch below shows.
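A minimal sketch of that equivalence, reusing the employees table (4 reducers assumed):

set mapreduce.job.reduces=4;
-- the two queries below distribute and order the rows in the same way
select id, name from employees cluster by id;
select id, name from employees distribute by id sort by id asc;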

select * from inner_table

select count(*) from inner_table


Dropping an internal table deletes both the table's metadata and its data. Dropping an external table deletes only the metadata and keeps the data.

Syntax:

DROP TABLE tbl_name

13.1.2 GROUP BY

groupByClause: GROUP BY groupByExpression (, groupByExpression)*

groupByExpression: expression

groupByQuery: SELECT expression (, expression)* FROM src groupByClause?
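For example, a simple aggregation over the employees table used earlier:

select salary, count(*) from employees group by salary;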

13.1.3 Order/Sort By

ORDER BY syntax:

colOrder: ( ASC | DESC )

orderBy: ORDER BY colName colOrder? (',' colName colOrder?)*

query: SELECT expression (',' expression)* FROM src orderBy

SORT BY syntax: the sort order depends on the column type. Numeric columns are sorted numerically; string columns are sorted lexicographically.

colOrder: ( ASC | DESC )

sortBy: SORT BY colName colOrder? (',' colName colOrder?)*

query: SELECT expression (',' expression)* FROM src sortBy
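A quick sketch on the employees table showing both forms:

-- ORDER BY: a single reducer, globally ordered output
select id, name, salary from employees order by salary desc;
-- SORT BY: with several reducers each reducer's output is ordered, but the overall result is not
set mapreduce.job.reduces=2;
select id, name, salary from employees sort by salary desc;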

13.2 Limit/Top/REGEX

LIMIT restricts the number of records returned; without an ORDER BY the rows returned are essentially arbitrary. The following query returns 5 arbitrary records from table t1:

SELECT * FROM t1 LIMIT 5


The following query returns the 5 sales representatives with the largest sales amounts.

SET mapred.reduce.tasks = 1

SELECT * FROM sales SORT BY amount DESC LIMIT 5

A SELECT statement can use regular expressions for column selection; the following query returns all columns except ds and hr (in recent Hive versions this requires set hive.support.quoted.identifiers=none;):

SELECT `(ds|hr)?+.+` FROM sales

14 Joins in Hive

14.1 Syntax:

join_table:

table_reference JOIN table_factor [join_condition]

| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition

| table_reference LEFT SEMI JOIN table_reference join_condition

table_reference:

table_factor

| join_table

table_factor:

tbl_name [alias]

| table_subquery alias

| ( table_references )

join_condition:

ON equality_expression ( AND equality_expression )*

equality_expression:

expression = expression

Hive supports only equality joins, outer joins, and left/right joins. Hive does not support non-equality join conditions, because they are very hard to express as map/reduce jobs. Hive does support joining more than two tables.
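The grammar above also lists LEFT SEMI JOIN, Hive's efficient substitute for IN/EXISTS subqueries; a minimal sketch with the hypothetical tables a and b:

-- returns the rows of a whose id appears in b; columns of b may not be selected
select a.id, a.name from a left semi join b on (a.id = b.id);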

14.2 Only Equality Joins Are Supported

For example:

SELECT a.* FROM a JOIN b ON (a.id = b.id)

SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department)

14.3 More Than Two Tables Can Be Joined

For example:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

If all the tables in the join use the same join key, the join is compiled into a single map/reduce job, for example:

SELECT a.val, b.val, c.val FROM a JOIN b

ON (a.key = b.key1) JOIN c

ON (c.key = b.key1)

This is compiled into a single map/reduce job because only b.key1 is used as the join key.


SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)

JOIN c ON (c.key = b.key2)

This join, on the other hand, is compiled into two map/reduce jobs, because b.key1 is used in the first join condition and b.key2 in the second.

14.4 The Logic of Each Map/Reduce Job in a Join

The reducer caches the records of every table in the join sequence except the last one, and then streams the result to the file system using the last table. This implementation helps reduce memory usage on the reduce side. In practice, the largest table should be placed last (otherwise a lot of memory is wasted on caching). For example:

SELECT a.val, b.val, c.val FROM a

JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

All tables use the same join key (computed in a single map/reduce job). The reduce side caches the records of tables a and b, and then computes a join result each time it receives a record from table c. Similarly:

SELECT a.val, b.val, c.val FROM a

JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

This uses two map/reduce jobs. The first caches table a and streams with table b; the second caches the result of the first job and streams with table c.

15 Hive User-Defined Functions

When Hive's built-in functions cannot meet your business requirements, you can consider user-defined functions (UDF: user-defined function).

15.1 UDF

Basic commands for working with functions:

SHOW FUNCTIONS;

DESCRIBE FUNCTION <function_name>;
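For example, a built-in function can be inspected like this:

hive> describe function extended upper;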

15.2 UDTF

UDTFs are Built-in Table-Generating Functions. There are some restrictions on using them:

1. No other columns may appear in the SELECT; for example, the following is not supported:

SELECT pageid, explode(adid_list) AS myCol…

2. They cannot be nested, e.g.:

SELECT explode(explode(adid_list)) AS myCol... # not supported

3. GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported with them, e.g.:

SELECT explode(adid_list) AS myCol ... GROUP BY myCol
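The usual way around these restrictions is LATERAL VIEW, which lets a UDTF be combined with other columns; a minimal sketch assuming a hypothetical pageAds table with columns pageid and adid_list:

-- pageAds is a hypothetical table; each array element of adid_list becomes a row paired with its pageid
select pageid, adid
from pageAds lateral view explode(adid_list) adTable as adid;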

15.3 EXPLODE

Scenario: turning array elements into rows, for example:

create table test2(mycol array<int>);

insert OVERWRITE table test2 select * from (select array(1,2,3) from a union all select array(7,8,9) from d)c;

hive> select * from test2;

OK

[1,2,3]

[7,8,9]

hive> SELECT explode(myCol) AS myNewCol FROM test2;

OK
