Big Data Hadoop Series: Common Hive HQL Operations

兰波万

Article link: https://blog.csdn.net/qq_26766821/article/details/100806974

1. Hive HQL table operations:

1.1 Creating a database

hive> create database zhanzhy;
OK
Time taken: 0.073 seconds
hive> show databases;
OK
default
zhanzhy
Time taken: 0.012 seconds, Fetched: 2 row(s)
hive>

-- Show database details
hive> desc database zhanzhy;
OK
zhanzhy         hdfs://master:9000/usr/local/src/apache-hive-1.2.2-bin/warehouse/zhanzhy.db     root    USER
Time taken: 0.011 seconds, Fetched: 1 row(s)

Dropping a database:

-- cascade forces the drop: the database is removed even if it still contains tables
hive> drop database zhanzhy cascade;
OK
Time taken: 0.046 seconds
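
As a sketch of a more defensive variant of the statements above: `if not exists` and `if exists` avoid errors when the database is already there or already gone, and `restrict` (the default when `cascade` is omitted) refuses to drop a database that still contains tables. Only the database name comes from the session above.

```sql
-- Create the database only if it is missing, then switch to it
create database if not exists zhanzhy;
use zhanzhy;

-- Switch away before dropping
use default;
-- restrict is the default: the drop fails if zhanzhy still contains tables
drop database if exists zhanzhy restrict;
```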

Creating a table

create table article(sentence string)
row format delimited fields terminated by '\n';

Loading data

-- Load a file from the local filesystem into the Hive table
load data local inpath '/home/zhanzhy/Documents/data/The_man_of_property.txt' 
into table article;
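
For comparison, a sketch of loading from HDFS instead of the local filesystem. The HDFS path here is an assumption, not something from this setup. Without `local`, Hive moves the file into the table directory rather than copying it, and `overwrite` replaces whatever the table held before.

```sql
-- Assumed HDFS path; the file is moved into the table directory
load data inpath '/data/input/The_man_of_property.txt'
overwrite into table article;
```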

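Inspecting the data

-- If no database was selected before the table was created, the table lives in default, and queries do not need a db. prefix on the table name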
hive> select * from article limit 1;
OK
Preface
Time taken: 0.022 seconds, Fetched: 1 row(s)

Truncating a table

truncate table article;

Renaming a table

alter table article rename to article_new;

Copying data from another table

create table t2 as select * from t1 where <conditions>;
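
A concrete version of the template above might look like the following, assuming the article table still exists under that name. The target table names article_nonempty and article_like are made up for illustration.

```sql
-- Copy only the non-empty lines into a new table (schema and data in one step)
create table article_nonempty as
select sentence from article where length(sentence) > 0;

-- Alternative: copy just the schema, then fill the table separately
create table article_like like article;
insert into table article_like
select sentence from article limit 10;
```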

Practice: word count

select
  regexp_extract(word, '[0-9a-zA-Z]+', 0) as word,
  count(1) as cnt
from
(
  select
    explode(split(sentence, ' ')) as word
  from article
) t
group by regexp_extract(word, '[0-9a-zA-Z]+', 0)
order by cnt desc
limit 100;
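
The query above counts "The" and "the" separately and also keeps a row for tokens that contain no letters or digits. One possible refinement, sketched here with `lateral view` and hypothetical aliases (w, cleaned), lower-cases each word and filters out the empty results:

```sql
select word, count(1) as cnt
from (
  select lower(regexp_extract(w, '[0-9a-zA-Z]+', 0)) as word
  from article
  lateral view explode(split(sentence, ' ')) t as w
) cleaned
where word != ''
group by word
order by cnt desc
limit 100;
```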


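2. Hive table types:

Hive managed (internal) tables and external tables

A table created without the external keyword is a managed table (also called an internal table); a table created with external is an external table.
The kind of table is chosen in the create statement: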

create table: managed table
create external table ... location 'hdfs_path': external table (the location must be an HDFS directory, not a single file)

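Characteristics:
Loading data into an external table does not move it into Hive's warehouse directory; in other words, Hive does not manage an external table's data. A managed table is different: its data lives under the warehouse location set by hive.metastore.warehouse.dir in hive-site.xml, where each database is a directory, each table is a subdirectory, and loaded data becomes files inside it.

Dropping a managed table deletes both its metadata and its data. Dropping an external table deletes only the metadata; the data files stay where they are.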

Creating an external table

-- An external table needs no load step; just point it at the data with location 'hdfs_path'
create external table art_ext(sentence string)
row format delimited fields terminated by '\n'
--stored as textfile
location '/data/ext';
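
To see the managed/external difference in practice, something like the following can be run in the Hive CLI. This is a sketch: it assumes /data/ext already holds files, and art_ext has to be re-created with the same DDL afterwards if it is still needed.

```sql
-- The "Table Type" field in the output should read EXTERNAL_TABLE
describe formatted art_ext;

-- Dropping the external table removes only the metastore entry
drop table art_ext;

-- The files under the location are still there (Hive CLI dfs command)
dfs -ls /data/ext;
```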

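Hive partitioned tables

Typical use cases:

Time-based incremental data
Faster queries (the main reason)
One-level or two-level partitions, for example partitioned by (dt string, hr string)
The partitioned by clause has to be given when the table is created; the usual choice is a date stored as a string.

Partitions support queries by narrowing the range of data scanned, which speeds up retrieval, and they let you organize data by fixed rules and conditions.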

-- 1. Create the partitioned table
create table art_dt(sentence string)
partitioned by (dt string)
row format delimited fields terminated by '\n';
-- 2. Insert data (the select here stands in for the real ETL or aggregation logic)
insert overwrite table art_dt partition(dt='20190913')
select * from art_ext limit 100;
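
Two follow-ups, as a sketch. The first query shows what the partition buys: a filter on dt lets Hive read only that partition's directory. The second part illustrates two-level partitioning; the table art_dt_hr and the hr column are hypothetical names, not part of the setup above.

```sql
-- Only the dt=20190913 partition is scanned
select count(1) from art_dt where dt = '20190913';

-- Two-level partitioning: one directory level per partition column
create table art_dt_hr(sentence string)
partitioned by (dt string, hr string)
row format delimited fields terminated by '\n';

insert overwrite table art_dt_hr partition(dt='20190913', hr='08')
select sentence from art_ext limit 100;
```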

Listing partitions

hive> show partitions art_dt;
OK
dt=20190420
dt=20190421
dt=20190913
Time taken: 0.041 seconds, Fetched: 3 row(s)

Adding a partition:

hive> alter table art_dt add partition(dt='20190914');
OK
Time taken: 0.155 seconds
hive> show partitions art_dt;
OK
dt=20190420
dt=20190421
dt=20190913
dt=20190914
Time taken: 0.059 seconds, Fetched: 4 row(s)

Dropping a partition

hive> ALTER TABLE art_dt DROP PARTITION (dt='20190914');
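
Both statements accept guards that make them safe to re-run, and `add partition` can point at a directory that already holds data. A sketch, where the partition value and the HDFS path are assumptions:

```sql
-- Register an existing HDFS directory as a partition; no error if it is already registered
alter table art_dt add if not exists partition (dt='20190915')
location '/data/art_dt/20190915';

-- No error if the partition is already gone.
-- art_dt is a managed table, so this also deletes the partition's data.
alter table art_dt drop if exists partition (dt='20190915');
```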

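Dynamic partitions

Static Partition (SP) columns: partition columns whose value is given in the statement.
Dynamic Partition (DP) columns: partition columns whose value is taken from the data at run time.
1. DP columns are specified the same way as SP columns, in the partition clause (after the partition keyword). The only difference is that an SP column has a value while a DP column does not (only the key appears, with no value).
2. In an INSERT ... SELECT ... query, the dynamic partition columns must come last in the SELECT list, in the same order as they appear in the PARTITION() clause.
3. Making every partition column dynamic is only allowed in nonstrict mode; in strict mode Hive raises an error.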
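4. When dynamic and static partitions are combined, the static partition columns must come first and the dynamic partition columns after them.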

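Hive uses static partitioning unless told otherwise. To use dynamic partitions, set the parameters below, either temporarily for the current session or permanently in the configuration file (hive-site.xml). The session-level settings are as follows.
Enable dynamic partitions (the default is false before Hive 0.9.0 and true from 0.9.0 on):

-- Enable dynamic partitions
set hive.exec.dynamic.partition=true;
-- Dynamic partition mode. The default, strict, requires at least one static partition column; nonstrict allows every partition column to be dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;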
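
The rules above can be sketched as follows. The target tables art_dp and art_dp_hr and the hr column are hypothetical; art_dt is the partitioned table created earlier.

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Fully dynamic: dt has no value in the partition clause,
-- so it is taken from the last column of the select
create table art_dp(sentence string)
partitioned by (dt string);

insert overwrite table art_dp partition(dt)
select sentence, dt from art_dt;

-- Mixed: the static column (dt) comes first, the dynamic column (hr) last
create table art_dp_hr(sentence string)
partitioned by (dt string, hr string);

insert overwrite table art_dp_hr partition(dt='20190913', hr)
select sentence, '00' as hr from art_dt where dt = '20190913';
```

The mixed form keeps one static partition column, so it is also accepted in strict mode.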

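Hive bucketed tables

A Hive table can be split into partitions, and a table or partition can be divided further into buckets with 'CLUSTERED BY'; the rows inside each bucket can be kept in order with 'SORTED BY'.

Main uses of buckets:

Data sampling
Faster execution of certain queries, for example map-side joins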

Creating the tables

-- 1. Build a helper table; bucket_test.txt contains the numbers 1-32
create table bucket_num(num int);
load data local inpath '/home/zhanzhy/Documents/data/hive/bucket_test.txt' 
into table bucket_num;

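-- 2. Put each number into its own bucket
-- 2.1 Create the table (this only sets up the table metadata)
-- 'set hive.enforce.bucketing = true' makes Hive set the number of reducers of the final stage to match the number of buckets
-- Alternatively, set mapred.reduce.tasks yourself to match the bucket count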
set hive.enforce.bucketing = true;
create table bucket_test(num int)
clustered by(num)
into 32 buckets;

-- 2.2 Query the data and load it into the bucketed table
-- number of mappers: 1; number of reducers: 32
insert overwrite table bucket_test
select cast(num as int) as num from bucket_num;
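
The 'SORTED BY' clause mentioned earlier can be added to the same kind of table. A sketch for the Hive 1.x line used in this article, with a hypothetical table name bucket_sorted; it assumes hive.enforce.sorting is available, as it is in Hive 1.x alongside hive.enforce.bucketing.

```sql
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;

-- 4 buckets, rows kept in ascending order of num inside each bucket
create table bucket_sorted(num int)
clustered by(num) sorted by(num asc)
into 4 buckets;

insert overwrite table bucket_sorted
select num from bucket_num;
```

Each bucket is written as one file in the table directory, so listing that directory with `dfs -ls` is a quick way to check that the bucket count took effect.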

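Viewing sampled data

-- tablesample is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y), where x is the bucket to take and y must be a multiple or a factor of the table's total bucket count. Hive derives the sampling ratio from y (roughly 1/y of the data).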
hive> select * from bucket_test tablesample(bucket 1 out of 32 on num);
OK
32
Time taken: 0.121 seconds, Fetched: 1 row(s)
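
Changing y changes how much is read. A sketch against the same 32-bucket table; note that with only 32 rows the second query may well return nothing.

```sql
-- y = 16 is a factor of 32: two of the 32 buckets are read, about 1/16 of the rows
select * from bucket_test tablesample(bucket 1 out of 16 on num);

-- y = 64 is a multiple of 32: about half of one bucket, roughly 1/64 of the rows
select * from bucket_test tablesample(bucket 1 out of 64 on num);
```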

How do you sample when the data is not bucketed?

select * from bucket_test2 where num%10 > 0;
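
The filter above keeps every row whose num is not a multiple of 10, which is about nine tenths of the table. If a smaller sample is wanted, a few alternatives are sketched below; bucket_test2 is the unbucketed table from the query above, and s is just an alias.

```sql
-- Random sample: roughly one tenth of the rows, different on each run
select * from bucket_test2 tablesample(bucket 1 out of 10 on rand()) s;

-- Block sampling: reads roughly 10 percent of the input data by size
select * from bucket_test2 tablesample(10 percent) s;

-- Deterministic filter: keeps only the rows whose num is a multiple of 10
select * from bucket_test2 where num % 10 = 0;
```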
