Hive Partitioned Tables & Built-in Functions

1. SELECT

[hadoop@hadoop05 data]$ hadoop fs -ls hdfs://192.168.199.105:9000/user/hive/warehouse/hive2_gordon.db/gordon_emp;

Found 1 items

-rwxr-xr-x   1 hadoop supergroup        700 2018-06-14 21:36 hdfs://192.168.199.105:9000/user/hive/warehouse/hive2_gordon.db/gordon_emp/emp.txt

 

hive> select * from  gordon_emp;

hive> desc extended gordon_emp;

OK

hive> desc formatted gordon_emp;

 

#### Simple statements do not launch MapReduce jobs

hive> use hive2_gordon;

OK

Time taken: 0.355 seconds

hive>  select * from gordon_emp where ename ='SCOTT';

OK

7788        SCOTT        ANALYST        7566        1987-4-19        3000.0        NULL        20

Time taken: 2.326 seconds, Fetched: 1 row(s)

hive>  select * from gordon_emp where empno >'7500';

OK

7521        WARD        SALESMAN        7698        1981-2-22        1250.0        500.0        30

7566        JONES        MANAGER        7839        1981-4-2        2975.0        NULL        20

7654        MARTIN        SALESMAN        7698        1981-9-28        1250.0        1400.0        30

7698        BLAKE        MANAGER        7839        1981-5-1        2850.0        NULL        30

7782        CLARK        MANAGER        7839        1981-6-9        2450.0        NULL        10

7788        SCOTT        ANALYST        7566        1987-4-19        3000.0        NULL        20

7839        KING        PRESIDENT        NULL        1981-11-17        5000.0        NULL        10

7844        TURNER        SALESMAN        7698        1981-9-8        1500.0        0.0        30

7876        ADAMS        CLERK        7788        1987-5-23        1100.0        NULL        20

7900        JAMES        CLERK        7698        1981-12-3        950.0        NULL        30

7902        FORD        ANALYST        7566        1981-12-3        3000.0        NULL        20

7934        MILLER        CLERK        7782        1982-1-23        1300.0        NULL        10

8888        HIVE        PROGRAM        7839        1988-1-23        10300.0        NULL        NULL

Time taken: 0.454 seconds, Fetched: 13 row(s)

hive>  select * from gordon_emp where salary between  800 and  1500;

OK

7369        SMITH        CLERK        7902        1980-12-17        800.0        NULL        20

7521        WARD        SALESMAN        7698        1981-2-22        1250.0        500.0        30

7654        MARTIN        SALESMAN        7698        1981-9-28        1250.0        1400.0        30

7844        TURNER        SALESMAN        7698        1981-9-8        1500.0        0.0        30

7876        ADAMS        CLERK        7788        1987-5-23        1100.0        NULL        20

7900        JAMES        CLERK        7698        1981-12-3        950.0        NULL        30

7934        MILLER        CLERK        7782        1982-1-23        1300.0        NULL        10

Time taken: 0.394 seconds, Fetched: 7 row(s)

hive>  select * from gordon_emp limit 5;

OK

7369        SMITH        CLERK        7902        1980-12-17        800.0        NULL        20

7499        ALLEN        SALESMAN        7698        1981-2-20        1600.0        300.0        30

7521        WARD        SALESMAN        7698        1981-2-22        1250.0        500.0        30

7566        JONES        MANAGER        7839        1981-4-2        2975.0        NULL        20

7654        MARTIN        SALESMAN        7698        1981-9-28        1250.0        1400.0        30

Time taken: 0.242 seconds, Fetched: 5 row(s)

hive> select * from gordon_emp where ename in ('SCOTT');

OK

7788        SCOTT        ANALYST        7566        1987-4-19        3000.0        NULL        20

Time taken: 0.22 seconds, Fetched: 1 row(s)

hive>  select * from gordon_emp where comm is null;

OK

7369        SMITH        CLERK        7902        1980-12-17        800.0        NULL        20

7566        JONES        MANAGER        7839        1981-4-2        2975.0        NULL        20

7698        BLAKE        MANAGER        7839        1981-5-1        2850.0        NULL        30

7782        CLARK        MANAGER        7839        1981-6-9        2450.0        NULL        10

7788        SCOTT        ANALYST        7566        1987-4-19        3000.0        NULL        20

7839        KING        PRESIDENT        NULL        1981-11-17        5000.0        NULL        10

7876        ADAMS        CLERK        7788        1987-5-23        1100.0        NULL        20

7900        JAMES        CLERK        7698        1981-12-3        950.0        NULL        30

7902        FORD        ANALYST        7566        1981-12-3        3000.0        NULL        20

7934        MILLER        CLERK        7782        1982-1-23        1300.0        NULL        10

8888        HIVE        PROGRAM        7839        1988-1-23        10300.0        NULL        NULL

Time taken: 0.413 seconds, Fetched: 11 row(s)

hive>

 

2. Aggregate Functions

An aggregate function computes over a set of values and returns a single value: "many in, one out".

max / min / count / sum / avg

Aggregate queries do launch MapReduce jobs.

hive> select avg(salary), max(salary), min(salary), sum(salary) from gordon_emp;

 

Query ID = hadoop_20180615224545_4f9c5c70-7aed-4745-a503-ff5e4a3dfe90

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Job running in-process (local Hadoop)

2018-06-15 22:48:31,340 Stage-1 map = 0%,  reduce = 0%

2018-06-15 22:48:33,612 Stage-1 map = 100%,  reduce = 0%

2018-06-15 22:48:34,656 Stage-1 map = 100%,  reduce = 100%

Ended Job = job_local187248363_0001

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 9800 HDFS Write: 0 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

2621.6666666666665        10300.0        800.0        39325.0

Time taken: 8.633 seconds, Fetched: 1 row(s)
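count works the same way; a minimal sketch (the full table is shown later in this section to hold 15 rows, so the expected result is 15):

hive> select count(*) from gordon_emp;
...
OK
15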

 

 

#### Grouping: GROUP BY

Average salary per department:

hive>select deptno, avg(salary) from gordon_emp group by deptno;

NULL        10300.0

10        2916.6666666666665

20        2175.0

30        1566.6666666666667

Time taken: 2.968 seconds, Fetched: 4 row(s)

 

hive>select deptno,ename, avg(salary) from gordon_emp group by deptno;

FAILED: SemanticException [Error 10025]: Line 1:14 Expression not in GROUP BY key 'ename'

 

Note: any column in the SELECT list that is not wrapped in an aggregate function must appear in the GROUP BY clause, as the sketch below shows.
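A sketch of the legal form of the failing query above: adding ename to the GROUP BY makes it valid (each group then holds a single employee, so avg(salary) is just that employee's salary):

hive> select deptno, ename, avg(salary) from gordon_emp group by deptno, ename;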

 

Average salary per department (deptno) and job:

hive>select deptno,job, avg(salary) from gordon_emp group by deptno,job;

NULL        PROGRAM        10300.0

10        CLERK        1300.0

10        MANAGER        2450.0

10        PRESIDENT        5000.0

20        ANALYST        3000.0

20        CLERK        950.0

20        MANAGER        2975.0

30        CLERK        950.0

30        MANAGER        2850.0

30        SALESMAN        1400.0

Time taken: 7.369 seconds, Fetched: 10 row(s)

 

 

Departments whose average salary exceeds 2000:

hive>select deptno, avg(salary)  from gordon_emp group by deptno where avg(salary)>2000;

FAILED: ParseException line 1:60 missing EOF at 'where' near 'deptno'

hive>select deptno, avg(salary)  from gordon_emp group by deptno having avg(salary)>2000;

NULL        10300.0

10        2916.6666666666665

20        2175.0

Time taken: 2.434 seconds, Fetched: 3 row(s)

 

Note: WHERE must be written before GROUP BY. WHERE filters rows before aggregation; HAVING filters groups after it, so aggregated results cannot be filtered with WHERE. A combined sketch follows.
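A sketch using both clauses together (the filter conditions here are illustrative):

hive> select deptno, avg(salary) from gordon_emp
    > where comm is null               -- row-level filter, applied before grouping
    > group by deptno
    > having avg(salary) > 2000;       -- group-level filter, applied after aggregation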

 

3. CASE WHEN

Classifying salaries into bands:

select ename,salary,

case

when salary>1 and salary<=1000 then 'lower'

when salary>1000 and salary<=2000 then 'middle'

when salary>2000 and salary<=4000 then 'high'

else 'highest'

end

from gordon_emp ;

SMITH        800.0        lower

ALLEN        1600.0        middle

WARD        1250.0        middle

JONES        2975.0        high

MARTIN        1250.0        middle

BLAKE        2850.0        high

CLARK        2450.0        high

SCOTT        3000.0        high

KING        5000.0        highest

TURNER        1500.0        middle

ADAMS        1100.0        middle

JAMES        950.0        lower

FORD        3000.0        high

MILLER        1300.0        middle

HIVE        10300.0        highest

Time taken: 0.329 seconds, Fetched: 15 row(s)

 

 

4. JOIN

 

join_a.txt

1       gordon

2       j

3       k

join_b.txt

1       30

2       29

4       18

 

create table a(

id int, name string

) row format delimited fields terminated by '\t';

 

create table b(

id int, age int

) row format delimited fields terminated by '\t';

 

load data local inpath '/home/hadoop/data/join_a.txt' overwrite into table a;

load data local inpath '/home/hadoop/data/join_b.txt' overwrite into table b;

 

#### Cartesian product

hive> select * from a join b;

1        gordon        1        30

1        gordon        2        29

1        gordon        4        18

2        j        1        30

2        j        2        29

2        j        4        18

3        k        1        30

3        k        2        29

3        k        4        18

Time taken: 28.875 seconds, Fetched: 9 row(s)

 

INNER JOIN is equivalent to plain JOIN; with an ON condition it returns only matching rows, as sketched below.
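A sketch of the inner join on the sample tables; only ids 1 and 2 exist in both a and b, so the expected output is:

hive> select a.id, a.name, b.age from a join b on a.id = b.id;
1        gordon        30
2        j        29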

 

hive>select a.id ,a.name,b.age from a left join b on a.id = b.id;

1        gordon        30

2        j        29

3        k        NULL

Time taken: 34.857 seconds, Fetched: 3 row(s)

 

hive>select a.id ,a.name,b.age from a right join b on a.id = b.id;

1        gordon        30

2        j        29

NULL        NULL        18

 

hive>select a.id ,a.name,b.age from a full join b on a.id = b.id;

1        gordon        30

2        j        29

3        k        NULL

NULL        NULL        18

Time taken: 9.929 seconds, Fetched: 4 row(s)

 

LEFT SEMI JOIN is a more efficient way of expressing an IN/EXISTS subquery.

Hive (as of this version) does not implement IN/EXISTS subqueries, so such subqueries can be rewritten as LEFT SEMI JOINs.

The restriction of LEFT SEMI JOIN is that the right-hand table may only be referenced in the ON clause; it cannot be filtered or selected in the WHERE clause, the SELECT list, or anywhere else.
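For illustration, the semi join below answers the same question as this IN subquery (which, per the note above, this Hive version cannot run):

select * from a where a.id in (select b.id from b);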

hive>select * from a left semi join b  on a.id =b.id;

1        gordon

2        j

Time taken: 32.552 seconds, Fetched: 2 row(s)  

 

5. Partitioned Tables (PARTITION)

Log data (who / when / what) is typically large; consider a click log:

click_log
click_log_yyyyMMdd  (one physical table per day)

 

Why partitioned tables exist: in Hive a partition maps to an HDFS directory, and naming the partition in a WHERE clause restricts the scan to that directory, reducing IO. A sketch follows.
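A sketch of the idea using the click log above (table and column names are illustrative): rather than one physical table per day, keep a single table partitioned by day and name the day in WHERE, so only that directory is scanned:

create table click_log (
`who` string, `when` string, `what` string
) partitioned by (day string)
row format delimited fields terminated by '\t';

select * from click_log where day='20180616';   -- reads only .../click_log/day=20180616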

 

 

Static partitioning: single-level partition

create table order_partition(

ordernumber string,

eventtime string

)

partitioned by (event_month string)

row format delimited fields terminated by '\t';

 

To clear the Hive CLI screen: !clear;

Load data into the partition:

LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt'

OVERWRITE INTO TABLE order_partition

PARTITION(event_month='2014-05');

 

Check the Hive log:

[screenshot: Hive log showing the metastore error raised during the partition load]

 

Fix: run the following in MySQL against the metastore database:

Use gordon;

alter table PARTITIONS convert to character set latin1;

alter table PARTITION_KEYS convert to character set latin1;

 

Reload (re-run the LOAD DATA statement above):

 

Then check the data on HDFS:

 

[hadoop@hadoop05 sbin]$ hadoop fs -ls /user/hive/warehouse/hive2_gordon.db/order_partition;

Found 1 items

drwxr-xr-x   - hadoop supergroup          0 2018-06-15 23:28 /user/hive/warehouse/hive2_gordon.db/order_partition/event_month=2014-05

 

hive> select * from order_partition where event_month='2014-05';

OK

10703007267488  2014-05-01 06:01:12.334+01          NULL        2014-05

10101043505096  2014-05-01 07:28:12.342+01          NULL        2014-05

10103043509747  2014-05-01 07:50:12.33+01          NULL        2014-05

10103043501575  2014-05-01 09:27:12.33+01         NULL        2014-05

10104043514061  2014-05-01 09:03:12.324+01        NULL        2014-05

Time taken: 0.165 seconds, Fetched: 5 row(s)

hive>

 

#### Manually creating a partition directory on HDFS: Hive sees no data.

[hadoop@hadoop05 sbin]$ hadoop  fs -mkdir /user/hive/warehouse/hive2_gordon.db/order_partition/event_month=2014-06;

[hadoop@hadoop05 sbin]$ hadoop  fs -put order.txt /user/hive/warehouse/hive2_gordon.db/order_partition/event_month=2014-06;

put: `order.txt': No such file or directory

[hadoop@hadoop05 sbin]$ cd /home/hadoop/data

[hadoop@hadoop05 data]$ hadoop  fs -put order.txt /user/hive/warehouse/hive2_gordon.db/order_partition/event_month=2014-06;

[hadoop@hadoop05 data]$ hadoop  fs -ls  /user/hive/warehouse/hive2_gordon.db/order_partition/event_month=2014-06;

Found 1 items

-rw-r--r--   1 hadoop supergroup        242 2018-06-15 23:36 /user/hive/warehouse/hive2_gordon.db/order_partition/event_month=2014-06/order.txt

 

hive>select * from order_partition where event_month='2014-06';

OK

Time taken: 0.237 seconds

 

#### Because the MySQL metastore has no entry for the new partition:

mysql> select * from partitions;

+---------+-------------+------------------+---------------------+-------+--------+

| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID | TBL_ID |

+---------+-------------+------------------+---------------------+-------+--------+

|       1 |  1529076506 |                0 | event_month=2014-05 |    34 |     33 |

+---------+-------------+------------------+---------------------+-------+--------+

1 row in set (0.03 sec)

 

mysql> select * from partition_keys;

+--------+--------------+-------------+-----------+-------------+

| TBL_ID | PKEY_COMMENT | PKEY_NAME   | PKEY_TYPE | INTEGER_IDX |

+--------+--------------+-------------+-----------+-------------+

|     33 | NULL         | event_month | string    |           0 |

+--------+--------------+-------------+-----------+-------------+

1 row in set (0.00 sec)

 

mysql> select * from partition_key_vals;

+---------+--------------+-------------+

| PART_ID | PART_KEY_VAL | INTEGER_IDX |

+---------+--------------+-------------+

|       1 | 2014-05      |           0 |

+---------+--------------+-------------+

1 row in set (0.00 sec)

 

Fix 1:

hive> MSCK REPAIR TABLE ORDER_PARTITION;

OK

Partitions not in metastore:        order_partition:event_month=2014-06

Repair: Added partition to metastore ORDER_PARTITION:event_month=2014-06

Time taken: 0.56 seconds, Fetched: 2 row(s)

 

Check MySQL again; the partition is now there:

mysql> select * from partitions;

+---------+-------------+------------------+---------------------+-------+--------+

| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID | TBL_ID |

+---------+-------------+------------------+---------------------+-------+--------+

|       1 |  1529076506 |                0 | event_month=2014-05 |    34 |     33 |

|       2 |  1529077282 |                0 | event_month=2014-06 |    35 |     33 |

+---------+-------------+------------------+---------------------+-------+--------+

2 rows in set (0.00 sec)

 

Note: this is a rather brute-force operation; on a table with a long history it can take a very long time to run.

 

Fix 2 (the approach recommended in practice): create a 2014-07 partition directory on HDFS, then register it explicitly:

ALTER TABLE order_partition ADD IF NOT EXISTS

PARTITION (event_month='2014-07') ;

mysql> select * from partitions;

+---------+-------------+------------------+---------------------+-------+--------+

| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID | TBL_ID |

+---------+-------------+------------------+---------------------+-------+--------+

|       1 |  1529076506 |                0 | event_month=2014-05 |    34 |     33 |

|       2 |  1529077282 |                0 | event_month=2014-06 |    35 |     33 |

|       3 |  1529077487 |                0 | event_month=2014-07 |    36 |     33 |

+---------+-------------+------------------+---------------------+-------+--------+

3 rows in set (0.00 sec)

 

 

Via INSERT:

create table order_4_partition(

ordernumber string,

eventtime string

)

row format delimited fields terminated by '\t';

load data local inpath '/home/hadoop/data/order.txt' overwrite into table order_4_partition;

 

insert overwrite table order_partition

partition(event_month='2014-08')

select * from order_4_partition;

List the table's partitions:

hive> show partitions  order_partition;

OK

event_month=2014-05

event_month=2014-06

event_month=2014-07

event_month=2014-08

Time taken: 0.197 seconds, Fetched: 4 row(s)

Note: a partition column must not also appear in the table's own column list.

 

 

 

Multi-level partitioning

create table order_mulit_partition(

ordernumber string,

eventtime string

)

partitioned by (event_month string,event_day string)

row format delimited fields terminated by '\t';

 

LOAD DATA LOCAL INPATH '/home/hadoop/data/order.txt'

OVERWRITE INTO TABLE order_mulit_partition

PARTITION(event_month='2014-05', event_day='01');

Querying with a partition condition (partitions listed first; a query sketch follows the listing):

hive>  show partitions  order_mulit_partition;

OK

event_month=2014-05/event_day=01

Time taken: 0.151 seconds, Fetched: 1 row(s)
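A sketch of the partition-condition query itself; naming both partition columns prunes the scan to a single directory:

hive> select * from order_mulit_partition
    > where event_month='2014-05' and event_day='01';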

 

 

Dynamic partitioning

First, an experiment confirming the rule above: a partition column cannot repeat a table column.

   hive> create table gordon_static_emp

    > (empno int, ename string, job string, mgr int, hiredate string, salary double, comm double,deptno string )

    > PARTITIONED by(deptno string)

    > ROW FORMAT DELIMITED

    > FIELDS TERMINATED BY '\t' ;

FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns

 

 

hive>create table gordon_static_emp

(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)

PARTITIONED by(deptno string)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' ;

 

Inserting partition by partition like this means writing the partition value by hand every time, which is tedious; this motivates dynamic partitioning.

hive>insert into table gordon_static_emp partition(deptno='20')

select empno,ename,job,mgr,hiredate,salary,comm from gordon_emp

where deptno=20;

Query ID = hadoop_20180615235353_00afe0ef-4679-4699-b907-c8bd25dc6361

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Job running in-process (local Hadoop)

2018-06-15 23:55:27,821 Stage-1 map = 0%,  reduce = 0%

2018-06-15 23:55:28,854 Stage-1 map = 100%,  reduce = 0%

Ended Job = job_local966661548_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://192.168.199.105:9000/user/hive/warehouse/hive2_gordon.db/gordon_static_emp/deptno=20/.hive-staging_hive_2018-06-15_23-55-19_985_12010454832864485-1/-ext-10000

Loading data to table hive2_gordon.gordon_static_emp partition (deptno=20)

Partition hive2_gordon.gordon_static_emp{deptno=20} stats: [numFiles=1, numRows=5, totalSize=214, rawDataSize=209]

MapReduce Jobs Launched:

Stage-Stage-1:  HDFS Read: 700 HDFS Write: 311 SUCCESS

Total MapReduce CPU Time Spent: 0 msec

OK

Time taken: 10.982 seconds

 

 

Create the dynamic-partition table:

create table gordon_dynamic_emp

(empno int, ename string, job string, mgr int, hiredate string, salary double, comm double)

PARTITIONED by(deptno string)

ROW FORMAT DELIMITED

FIELDS TERMINATED BY '\t' ;

 

Dynamic partitioning requires the partition column to come last in the SELECT list. The value is bound by position, not by name, so the select column's name does not need to match the partition column's.

insert into table gordon_dynamic_emp partition(deptno)

select empno,ename,job,mgr,hiredate,salary,comm,deptno from gordon_emp ;

 

Under the default strict mode, this insert fails: Hive refuses an all-dynamic partition insert until nonstrict mode is enabled. After applying the fix below, the insert succeeds:

Moving data to: hdfs://192.168.199.105:9000/user/hive/warehouse/hive2_gordon.db/gordon_dynamic_emp/.hive-staging_hive_2018-06-16_00-02-44_572_3957465789389434554-1/-ext-10000

 

Fix: dynamic partitioning requires nonstrict mode:

set hive.exec.dynamic.partition.mode=nonstrict;

This is the usual key=value way of changing settings in Hive. The syntax:

set key=value;    assigns a value
set key;          reads the current value

From the official documentation:

[screenshots: the relevant configuration properties, including hive.exec.dynamic.partition.mode and the strict-mode setting discussed below]

Hive provides a "strict mode" setting to block queries that could have unexpectedly expensive or harmful effects.

Setting hive.mapred.mode to strict blocks three kinds of queries:

1. Queries on a partitioned table without a partition filter in the WHERE clause. Partitioned tables usually hold a lot of data, and without a partition filter the query scans every partition, consuming enormous resources.

Table: logs(...) partitioned by (day int);
Not allowed: select * from logs;
Allowed:     select * from logs where day=20151212;

2. Queries with ORDER BY but no LIMIT. ORDER BY sends all results to a single reducer for sorting, which is very slow.

3. Cartesian products: put simply, a JOIN with no ON clause, filtered only in WHERE.

Partitioning is Hive's standard technique for large tables. Physically, a partition is a directory under the table's directory on HDFS, holding that partition's data; the benefit is a smaller scan range and therefore faster queries. Partitions come in two kinds, static and dynamic. The difference lies in how data is loaded: static partitions take a hand-typed partition value, while dynamic partitions derive the value from the data itself. For bulk imports, dynamic partitioning is clearly the simpler option.

In other words, dynamic partitioning loads everything in one statement and, for each distinct value of the partition column in the source data, creates the corresponding directory and writes the matching rows into it.
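To see what the dynamic insert created, list the table's partitions (a sketch; given the deptno values in gordon_emp, the expected partitions are 10, 20 and 30, plus __HIVE_DEFAULT_PARTITION__ for the one row whose deptno is NULL):

hive> show partitions gordon_dynamic_emp;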

 

 

6. Functions

Built-in functions. List them with show functions. (First, confirming the earlier setting with the set key form:)

hive> set hive.exec.dynamic.partition.mode;

hive.exec.dynamic.partition.mode=nonstrict

hive> show functions;

OK

!

!=

%

 

hive> desc function upper;

OK

upper(str) - Returns str with all characters changed to uppercase

Time taken: 0.119 seconds, Fetched: 1 row(s)

hive>

    >

    > desc function extended upper;

OK

upper(str) - Returns str with all characters changed to uppercase

Synonyms: ucase

Example:

  > SELECT upper('Facebook') FROM src LIMIT 1;

  'FACEBOOK'

Time taken: 0.091 seconds, Fetched: 5 row(s)

hive>

 

hive> create table dual(x string);

OK

Time taken: 0.301 seconds

insert into table dual values('');

  

 

Test split (the second argument is a regex, so the dot must be escaped):

 hive>  select split("192.168.199.151","\\.") from dual;

OK

["192","168","199","151"]

Time taken: 0.189 seconds, Fetched: 1 row(s)

 

Test the timestamp and date functions:

hive> select unix_timestamp() from dual;

OK

1529079431

Time taken: 0.335 seconds, Fetched: 1 row(s)

hive> select current_date from dual;

OK

2018-06-16

Time taken: 1.089 seconds, Fetched: 1 row(s)

hive> select current_timestamp from dual;

OK

2018-06-16 00:19:24.063

Time taken: 0.397 seconds, Fetched: 1 row(s)

hive> select year("2018-08-08 20:08:08") from dual;

OK

2018

Time taken: 0.426 seconds, Fetched: 1 row(s)

hive> 

hive> select date_add("2018-08-08",10) from dual;

OK

2018-08-18

 

 

Note: these date/time built-ins are often used to derive partition values, as sketched below.
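For instance, assuming nonstrict mode is set as above, the partition values for order_mulit_partition could be derived from eventtime itself (a sketch using substr; eventtime values look like '2014-05-01 06:01:12.334+01'):

insert into table order_mulit_partition partition(event_month, event_day)
select ordernumber,
       eventtime,
       substr(eventtime, 1, 7) as event_month,   -- '2014-05'
       substr(eventtime, 9, 2) as event_day      -- '01'
from order_4_partition;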

 

CAST: type conversion

 

hive> select cast("5" as int) from dual;

OK

5

Time taken: 1.685 seconds, Fetched: 1 row(s)

hive> select cast("5" as date) from dual;

OK

NULL

Time taken: 0.232 seconds, Fetched: 1 row(s)

hive> select cast(current_timestamp as date) from dual;

OK

2018-06-16

Time taken: 0.442 seconds, Fetched: 1 row(s)

 

String functions:

hive>

    > desc function extended substr;

OK

substr(str, pos[, len]) - returns the substring of str that starts at pos and is of length len, or substr(bin, pos[, len]) - returns the slice of byte array that starts at pos and is of length len

Synonyms: substring

pos is a 1-based index. If pos<0 the starting position is determined by counting backwards from the end of str.

Example:

   > SELECT substr('Facebook', 5) FROM src LIMIT 1;

  'book'

  > SELECT substr('Facebook', -5) FROM src LIMIT 1;

  'ebook'

  > SELECT substr('Facebook', 5, 1) FROM src LIMIT 1;

  'b'

Time taken: 0.072 seconds, Fetched: 10 row(s)

hive> desc function extended concat_ws;

OK

concat_ws(separator, [string | array(string)]+) - returns the concatenation of the strings separated by the separator.

Example:

  > SELECT concat_ws('.', 'www', array('facebook', 'com')) FROM src LIMIT 1;

  'www.facebook.com'

Time taken: 0.047 seconds, Fetched: 4 row(s)

hive> desc function extended split;

OK

split(str, regex) - Splits str around occurances that match regex

Example:

  > SELECT split('oneAtwoBthreeC', '[ABC]') FROM src LIMIT 1;

  ["one", "two", "three"]

Time taken: 0.031 seconds, Fetched: 4 row(s)
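The two functions compose naturally; for example, reassembling the IP address from the split test earlier (a sketch, result shown as expected):

hive> select concat_ws('.', split('192.168.199.151', '\\.')) from dual;
192.168.199.151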

 

 

 

 

 

Reposted from: https://my.oschina.net/gordonnemo/blog/1831206
