Big Data Development Tutorial: Apache Hive in Practice

Advanced Hive Table Creation: CTAS and CTE (Key Topics)

  • CTAS – Create Table As Select: creates a table from the result of a SELECT query.

  • CREATE TABLE ctas_employee AS SELECT * FROM employee;

  • CTAS CANNOT create a partitioned, external, or bucketed table.

  • CREATE TABLE ... LIKE copies only the table structure, without the data (fast):

  • CREATE TABLE employee_like LIKE employee;

-- Common Table Expression (CTE)
CREATE TABLE cte_employee AS
WITH
r1 AS (SELECT name FROM r2 WHERE name = 'Michael'),
r2 AS (SELECT name FROM employee WHERE sex_age.sex = 'Male'),
r3 AS (SELECT name FROM employee WHERE sex_age.sex = 'Female')
SELECT * FROM r1 UNION ALL SELECT * FROM r3;
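Note that r1 selects from r2 even though r2 is defined later in the WITH clause. A CTE can also be used directly in a plain SELECT; a minimal sketch against the same employee table:

-- Count male employees via a CTE
WITH males AS (SELECT name FROM employee WHERE sex_age.sex = 'Male')
SELECT count(*) FROM males;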

Hive Temporary Tables

  • A convenient way for an application to automatically manage intermediate data generated during a complex query (similar to a CTE, but a CTE lives only within a single statement)
  • Session-scoped: automatically deleted at the end of the session, and the same name can be reused in different sessions
  • Table space is at /tmp/hive-<user_name> in HDFS
  • A temporary table shadows a permanent table of the same name for the duration of the session
CREATE TEMPORARY TABLE tmp_table_name1 (c1 string);
-- CTAS and LIKE syntax are also supported:
CREATE TEMPORARY TABLE tmp_table_name2 AS ...
CREATE TEMPORARY TABLE tmp_table_name3 LIKE ...
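A minimal sketch of the shadowing behavior, assuming a permanent table named employee already exists:

-- The temporary table hides the permanent employee table for this session only
CREATE TEMPORARY TABLE employee (name string);
-- For the rest of this session, "employee" resolves to the temporary table
SELECT * FROM employee;
-- Dropping the name removes the temporary table; the permanent table is untouched
DROP TABLE employee;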

Hive Table – Drop/Truncate/Alter Table

  • DROP TABLE IF EXISTS employee removes the metadata completely and moves the data to the .Trash folder in the user home directory in HDFS, if Trash is configured. With the PURGE option at the end, the data is removed completely. When dropping an external table, the data is not removed.
  • TRUNCATE TABLE employee removes all rows of data from an internal table (it FAILS on an external table).
  • ALTER TABLE employee RENAME TO new_employee renames the table.
  • ALTER TABLE c_employee SET TBLPROPERTIES ('comment'='New name, comments') sets a table property.
  • ALTER TABLE employee_internal SET SERDEPROPERTIES ('field.delim'='$') sets SerDe properties.
  • ALTER TABLE c_employee SET FILEFORMAT RCFILE sets the file format.
-- Drop a database
drop database if exists db_name;
-- Force-drop a database together with its tables
drop database if exists db_name cascade;
-- Drop a table
drop table if exists employee;
-- Empty a table
truncate table employee;
-- Empty a table, alternative method (more efficient)
insert overwrite table employee select * from employee where 1=0;
-- Drop a partition
alter table employee_table drop partition (stat_year_month>='2018-01');
-- Delete rows by condition: overwrite the table with only the rows to keep
insert overwrite table employee_table select * from employee_table where id>'180203a15f';
By default, Hive does not support deleting or updating data; it only supports adding and querying data.
Hive tables with transactions (ACID tables) do support UPDATE and DELETE.
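A minimal sketch of a transactional table, assuming a deployment where Hive ACID support is configured (ACID tables must be bucketed, stored as ORC, and marked transactional; the table name employee_acid is illustrative):

-- Session settings typically required for ACID operations
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
create table employee_acid (id int, name string)
clustered by (id) into 2 buckets
stored as orc
tblproperties ('transactional'='true');
update employee_acid set name = 'Michael' where id = 1;
delete from employee_acid where id = 2;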

Hive Table – Alter Table Columns

  • ALTER TABLE employee_internal CHANGE old_name new_name STRING [BEFORE|AFTER] sex_age changes a column's name, position, or type.
  • ALTER TABLE c_employee ADD COLUMNS (work string) adds another column (with its type) at the end of the table.
  • ALTER TABLE c_employee REPLACE COLUMNS (name string) replaces all columns in the table with the specified columns and types; after this ALTER, there is only one column in the table.
  • Note: the ALTER TABLE statement only modifies Hive's metadata, not the actual data. The user should make sure the actual data conforms to the metadata definition.
alter table employee_hr set LOCATION 'hdfs:///user/data_employee_hr';
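A short sketch combining the column statements above; the column names here are illustrative:

-- Append a new column at the end
ALTER TABLE c_employee ADD COLUMNS (work string);
-- Rename the new column, keeping the STRING type
ALTER TABLE c_employee CHANGE work work_type STRING;
-- Inspect the resulting schema
DESCRIBE c_employee;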

Hive Partitions - Overview

  • To increase performance, Hive can partition data.

  • The values of the partition columns divide a table into segments (folders).

  • Entire partitions can be ignored at query time.

  • Partitions have to be created explicitly by users, and a partition must be specified when inserting data.

  • In queries there is no schema difference between "partition" columns and regular columns.

  • At query time, Hive automatically filters out unused partitions for better performance.

Definition and Operations

Static partitioning: create the empty partitioned table first, then load partition data manually.

Single-level partition:
create table dept_partition(
deptno int,
dename string,
loc string
)
partitioned by (year string)
row format delimited fields terminated by ',';

Loading data into the partition:
-- 1. Add the partition manually
alter table dept_partition add partition(year='2021');
-- 2. Load data into the partition
load data local inpath '/root/hdp/hive_stage/dept.txt' into table dept_partition partition(year='2021');
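Queries that filter on the partition column scan only the matching folder, for example:

-- Only the year='2021' directory is read
select * from dept_partition where year = '2021';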
Two-level partition:
CREATE TABLE emp_partition_2 (
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
) 
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- Add a partition
alter table emp_partition_2 add partition (year=2020, month=11);
-- Show all partitions of the table
show partitions emp_partition_2;
-- Drop a partition
alter table emp_partition_2 drop if exists partition (year=2020, month=11);
-- Data can also be moved directly into the partition's directory on HDFS

-- Static partitions are NOT managed automatically: partitions must be added/dropped with ALTER TABLE statements.
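A sketch of loading a partition by moving files in HDFS directly, run from the Hive CLI. The warehouse path and data file below are assumptions (adjust them to your configuration); MSCK REPAIR TABLE then registers partition directories that exist on disk but are missing from the metastore:

-- Hypothetical warehouse path: create the partition directory and copy data into it
dfs -mkdir -p /user/hive/warehouse/emp_partition_2/year=2020/month=12;
dfs -put /root/hdp/hive_stage/emp.txt /user/hive/warehouse/emp_partition_2/year=2020/month=12/;
-- Register partitions found on disk but missing from the metastore
msck repair table emp_partition_2;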

Hive Strict Mode

Hive provides a strict mode that prevents users from running queries with potentially harmful, unintended effects; in other words, certain queries simply cannot be executed in strict mode. Setting hive.mapred.mode to strict disables three types of queries:

  1. Queries on a partitioned table without a partition filter. SQL against a partitioned table is not allowed to run unless the WHERE clause contains a partition-column predicate that limits the data range; that is, users are not allowed to scan all partitions. The reason for this restriction is that partitioned tables usually hold very large, fast-growing datasets, and a query with no partition restriction could consume unacceptably large resources:

     hive> SELECT DISTINCT(planner_id) FROM fracture_ins WHERE planner_id=5;
     FAILED: Error in semantic analysis: No Partition Predicate Found for Alias "fracture_ins" Table "fracture_ins"

     The following statement adds a partition filter to the WHERE clause (that is, it restricts the table partitions) and therefore runs:

     hive> SELECT DISTINCT(planner_id) FROM fracture_ins WHERE planner_id=5 AND hit_date=20120101;
     ... normal results ...

  2. Queries with ORDER BY but no LIMIT. A query that uses ORDER BY must include a LIMIT clause, because ORDER BY sends all results to a single reducer for sorting; requiring the LIMIT clause prevents that reducer from running for an excessively long time:

     hive> SELECT * FROM fracture_ins WHERE hit_date>2012 ORDER BY planner_id;
     FAILED: Error in semantic analysis: line 1:56 In strict mode, limit must be specified if ORDER BY is present planner_id

     Simply adding a LIMIT clause solves the problem:

     hive> SELECT * FROM fracture_ins WHERE hit_date>2012 ORDER BY planner_id LIMIT 100000;
     ... normal results ...

  3. Cartesian-product queries. Users who know relational databases well may expect to join tables with a WHERE clause instead of an ON clause, relying on the optimizer to efficiently convert the WHERE predicate into the join condition. Unfortunately, Hive does not perform this optimization, so if the tables are large enough the query can get out of control:

     hive> SELECT * FROM fracture_act JOIN fracture_ads;
     FAILED: Error in semantic analysis: In strict mode, cartesian product is not allowed. If you really want to perform the operation, set hive.mapred.mode=nonstrict

     The correct form uses JOIN with an ON clause:

     hive> SELECT * FROM fracture_act JOIN fracture_ads ON (fracture_act.planner_id = fracture_ads.planner_id);

Hive Dynamic Partitions

Hive also supports providing partition values dynamically. This is useful when the data volume is large and we do not know in advance what the partition values will be.
By default, the user must specify at least one static partition column; this is to avoid accidentally overwriting partitions. To remove this restriction, change the partition mode from the default strict to nonstrict (see the settings below).

Sample data table:
create table employee_hr (
    name string,
    employee_id string,
    sin_number string,
    start_date string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/hive_stage/employee_hr';

Dynamic partition table:
CREATE TABLE emp_partition_dynamic (
name string,
work_place ARRAY<string>,
sex_age STRUCT<sex:string,age:int>,
skills_score MAP<string,int>,
depart_title MAP<STRING,ARRAY<STRING>>
) 
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

Load the data from the sample table into the dynamic partition table.

-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Dynamically partitioned data must be inserted with an INSERT statement:
insert into table emp_partition_dynamic partition(year, month)
select
name,
array('Toronto') as work_place,
named_struct("sex","Male","age",30) as sex_age,
map("Python",90) as skills_score,
map("R&D",array('Developer')) as depart_title,
year(start_date) as year,
month(start_date) as month
from employee_hr eh;
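To check which partitions the dynamic insert created:

show partitions emp_partition_dynamic;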

Hive Buckets

  • A bucket corresponds to a segment of files in HDFS
  • A mechanism to randomly sample data or to speed up JOINs
  • Breaks data into a set of buckets based on a hash function of the "bucket columns"
  • Hive does not enforce bucketing automatically; enforced bucketing has to be enabled:
    SET hive.enforce.bucketing = true;
  • The choice of bucket column depends closely on the business logic
  • When choosing the number of buckets, avoid having too much or too little data in each bucket; a good choice is somewhere near two blocks of data per bucket, and use 2^N as the number of buckets
  • How big are the two blocks? Twice the default block size of Hadoop HDFS
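For example, with the default HDFS block size of 128 MB, two blocks per bucket is about 256 MB; a table of roughly 2 GB would then suggest 2 GB / 256 MB = 8 = 2^3 buckets.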

Hive Buckets – CREATE TABLE Statement

  • Use the CLUSTERED BY clause to define buckets. Unlike partition columns, bucket columns appear in the column definitions; more than one column is supported
  • To populate the buckets, use an INSERT statement rather than LOAD (as in the INSERT OVERWRITE example below), since LOAD does not verify the data against the metadata definition (it only copies files into the folders)
  • Buckets are lists of file segments
Sample data table:
create table if not exists employee_id (
name string, 
employee_id int,
work_place array<string>,
sex_age struct<sex:string,age:int>,
skills_score map<string,int>, 
depart_title map<string,array<string>>
)
row format delimited fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';

load data inpath '/hive_stage/employee_id/employee_id.txt' overwrite into table employee_id;

Create the bucketed table:
set hive.enforce.bucketing = true;

create table if not exists employee_id_buckets (
name string, employee_id int,
work_place array<string>, 
sex_age struct<sex:string,age:int>,
skills_score map<string,int>,
depart_title map<string,array<string>>
)
clustered by (employee_id) into 2 buckets
row format delimited fields terminated by '|'
collection items terminated by ','
map keys terminated by ':';

INSERT OVERWRITE TABLE employee_id_buckets
SELECT * FROM employee_id;

-- Verify the bucket files in HDFS
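From the Hive CLI, the bucket files can be listed directly; the warehouse path below is an assumption (adjust it to your configuration). Bucketing also enables efficient sampling with TABLESAMPLE:

-- Hypothetical warehouse path; expect two bucket files
dfs -ls /user/hive/warehouse/employee_id_buckets;
-- Sample one bucket out of two, hashed on the bucket column
SELECT name FROM employee_id_buckets TABLESAMPLE(BUCKET 1 OUT OF 2 ON employee_id);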
