Hive

Latest recommended article published on 2024-11-06 17:41:23

青衫背剑远游客

Category: Hadoop  Tags: hive

Article link: https://blog.csdn.net/qq_24818607/article/details/78846039

1. Installing Hive 2.3.0 on CentOS 7

2. Load balancing multiple Hive instances with HAProxy

3. Hive data types

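Primitive data types:

Integer: int, smallint, bigint, tinyint

Floating point: double, float

Boolean: boolean (false, true)

String: string

Complex types:

array: an ordered set of fields, all of the same type

map: an unordered set of key/value pairs

struct: a set of named fields, whose types may differ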

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

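Internal (managed) table: the data is copied into the warehouse, and dropping the table also deletes the data in the warehouse.

External table: the data is not copied into the warehouse, and dropping the table does not delete the data. (If you only have read permission on the data, create an external table; otherwise you get a permission-denied error.)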

show partitions partition_table; -- list the partitions of a partitioned table

Ways to import data:

1. load data local inpath '/bigdata/data/20171221/pati.log' overwrite into table partition_table partition(dt='2017-12-21',dep='devo');

2. alter table partition_table add partition (dt='2017-12-21',dep='devo_1') location '/bigdata/data/20171221'; (location should point to the directory that holds the data files, not to a single file)

Nested query: from (select n, salary from partition_table) e select e.n, e.salary where e.salary > 10000;

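select user_name, sum(score) from group_test group by user_name; The author saw an error when the column name and the comma were not separated by a space. HiveQL itself does not require that space, so such an error is more likely caused by a stray character such as a full-width (Chinese) comma.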

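hive.map.aggr controls how aggregation is done. When it is true, Hive performs a first-level aggregation on the map side. This usually performs better, but needs more memory to run successfully. The default is true since Hive 0.3 (it was false in Hive 0.2). (optimization)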
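To make the effect of map-side aggregation concrete, here is a minimal Python sketch. It is not Hive code: the two input splits and the helper names are invented for illustration, and it only mimics the idea of each mapper summing scores per user before anything is sent to the reducers.

```python
from collections import defaultdict

def map_side_aggregate(split):
    """Pre-aggregate one mapper's input: sum scores per user inside the mapper."""
    partial = defaultdict(int)
    for user_name, score in split:
        partial[user_name] += score
    return dict(partial)

def reduce_aggregate(partials):
    """Merge the partial sums that arrive from all mappers."""
    total = defaultdict(int)
    for partial in partials:
        for user_name, score in partial.items():
            total[user_name] += score
    return dict(total)

# Two hypothetical input splits of group_test as (user_name, score) rows.
splits = [
    [("tom", 80), ("amy", 90), ("tom", 70)],
    [("amy", 60), ("tom", 50)],
]

partials = [map_side_aggregate(s) for s in splits]
shuffled_rows = sum(len(p) for p in partials)  # rows shuffled with map-side aggregation
raw_rows = sum(len(s) for s in splits)         # rows shuffled without it
result = reduce_aggregate(partials)

print(result)                   # per-user totals
print(shuffled_rows, raw_rows)  # fewer rows cross the network, at the cost of mapper memory
```

The per-mapper dictionary is what costs the extra memory mentioned above: it has to hold one entry per distinct key seen in the split.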

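Hive traditionally supports only equi-joins (non-equality conditions in the ON clause are accepted only from Hive 2.2.0 onward).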

Reduce-side join: select b.class, a.score from group_test a join join_test b on (a.user_name = b.name);

Map-side join (when only one of the tables is large; b is the small table): select /*+ MAPJOIN(b) */ a.user_name, b.class, a.score from group_test a join join_test b on (a.user_name = b.name);

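Left semi join (the right-hand table may only be referenced in the ON clause, not in the select list): select a.user_name, a.score from group_test a left semi join join_test b on (a.user_name = b.name);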
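The map-side join and the semi join above can be sketched in plain Python. This is only an illustration of the idea, not how Hive implements it: the sample rows are invented, the small table is loaded into an in-memory hash table, and the large table is streamed past it so that no reduce phase is needed.

```python
def map_join(big_rows, small_rows):
    """Join by hashing the small table in memory and streaming the big one."""
    lookup = {}
    for name, cls in small_rows:
        lookup.setdefault(name, []).append(cls)
    for user_name, score in big_rows:
        for cls in lookup.get(user_name, []):
            yield (user_name, cls, score)

def left_semi_join(big_rows, small_rows):
    """Keep left rows that have at least one match; no right-hand columns are returned."""
    names = {name for name, _ in small_rows}
    return [row for row in big_rows if row[0] in names]

# Hypothetical contents of group_test (large) and join_test (small).
group_test = [("tom", 80), ("amy", 90), ("bob", 75)]
join_test = [("tom", "class1"), ("amy", "class2")]

joined = list(map_join(group_test, join_test))
semi = left_semi_join(group_test, join_test)
print(joined)
print(semi)
```

The sketch also shows why the hinted table must be the small one: the whole lookup table has to fit in each mapper's memory.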

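set hive.mapred.mode=nonstrict; (the default in Hive 0.x and 1.x; Hive 2.x defaults to strict)

set hive.mapred.mode=strict; In strict mode an order by query must specify a limit, otherwise it fails (sort by is not affected).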

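With sort by you can set the number of reducers that run (set mapred.reduce.tasks=<number>), so more data can be output.

Difference between order by and sort by: order by performs one global sort over the whole result set, which means all the data is sent to a single reducer. For large data sets this is slow. sort by only sorts within each reducer, so it guarantees that each reducer's output is sorted (not that the result is globally sorted). This can make a later global sort more efficient.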
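The difference can be simulated in a few lines of Python. This is a sketch under stated assumptions: rows are handed to reducers round-robin (Hive's real distribution differs), and the data is a made-up list of numbers. It shows that per-reducer sorting does not give a global order, but that the already sorted runs can be merged cheaply afterwards.

```python
import heapq

def sort_by(rows, num_reducers):
    """Sort within each reducer only; rows are assigned round-robin (an assumption)."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        reducers[i % num_reducers].append(row)
    return [sorted(r) for r in reducers]

def order_by(rows):
    """Global sort: everything goes through a single reducer."""
    return sorted(rows)

rows = [5, 1, 9, 3, 7, 2, 8]

runs = sort_by(rows, num_reducers=2)
concatenated = [x for run in runs for x in run]  # what you get by reading the output files in order
merged = list(heapq.merge(*runs))                # merging sorted runs yields the global order

print(runs)
print(concatenated)
print(merged == order_by(rows))
```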

Creating an index (a table does not have to be partitioned to be indexed; Hive 2.3.0):

create index index1_index_test on table index_test(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

alter index index1_index_test on index_test rebuild;

Show indexes: show index on index_test;

Show partitions: show partitions index_test;

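Bucket: the values of a chosen column of a table or partition are used as the hash key, and each row is hashed into one of a fixed number of buckets. This supports efficient sampling.

Sampling can be run over the whole data set, but that is naturally inefficient because all of the data still has to be read. If the table is already bucketed on a column, a sample can instead take the bucket with a given number out of all the buckets, which reduces the amount of data accessed.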
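A small Python sketch of the idea follows. The hash used here (the integer key modulo the bucket count) is a stand-in chosen for illustration and should not be read as Hive's exact hash function; the ids are invented. Bucket numbers are 1-based in the sampling helper, in the style of Hive's TABLESAMPLE(BUCKET x OUT OF y) clause.

```python
def build_buckets(ids, num_buckets):
    """Assign each row to a bucket by hashing the bucketing column (stand-in hash)."""
    buckets = [[] for _ in range(num_buckets)]
    for row_id in ids:
        buckets[row_id % num_buckets].append(row_id)
    return buckets

def sample_bucket(buckets, x):
    """Return bucket number x (1-based); only that bucket's rows are read."""
    return buckets[x - 1]

ids = list(range(1, 11))          # hypothetical values of the bucketing column
buckets = build_buckets(ids, 4)

sample = sample_bucket(buckets, 1)
rows_read_with_buckets = len(sample)
rows_read_full_scan = len(ids)

print(buckets)
print(sample, rows_read_with_buckets, rows_read_full_scan)
```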

hadoop fsck / (check the health of the HDFS file system, starting from the root)

hadoop fsck / -delete (delete corrupted files)

Hive storage file formats:

textfile

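sequencefile is a binary file format provided by the Hadoop API. It is easy to use, splittable, and compressible. SequenceFile supports three compression options: none, record, and block. Record compression has a low compression ratio, so block compression is generally recommended.

rcfile combines row and column storage. It gives a higher compression ratio and faster column reads. The data is first split into row groups, which guarantees that a record sits in a single block and avoids reading several blocks to read one record. To load data into an rcfile table, first create a temporary textfile table, load the data into it, and then insert the data into the rcfile table.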

Custom input formats in Hive

A SerDe (serialize/deserialize) formats the data as it is serialized and deserialized.

create external table hive_ser(
times string,
product_name string,
sale_num string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties
( 'input.regex' = '([^^]*)\\^\\^([^^]*)\\^\\^([^^]*)',
'output.format.string' = '%1$s %2$s %3$s')
stored as textfile;
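To see what the input.regex in the table definition above does to one line of text, the same pattern can be tried in Python. This is only a sketch: the sample line is invented, Python's re module stands in for the Java regex engine the SerDe uses, and a non-matching line is mapped to all-None columns on the assumption that unmatched rows come back as NULLs.

```python
import re

# Same pattern as 'input.regex', with the SQL-level escaping (\\^) reduced to \^.
INPUT_REGEX = r"([^^]*)\^\^([^^]*)\^\^([^^]*)"
COLUMNS = ["times", "product_name", "sale_num"]

def deserialize(line):
    """Map the three capture groups onto the three columns of hive_ser."""
    match = re.fullmatch(INPUT_REGEX, line)
    if match is None:
        return dict.fromkeys(COLUMNS)  # assumption: unmatched rows yield NULL columns
    return dict(zip(COLUMNS, match.groups()))

def serialize(row):
    """Python counterpart of the '%1$s %2$s %3$s' output format string."""
    return "{0} {1} {2}".format(*(row[c] for c in COLUMNS))

row = deserialize("2017-12-21^^phone^^3")  # hypothetical log line using ^^ as delimiter
bad = deserialize("no delimiters here")

print(row)
print(serialize(row))
print(bad)
```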

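Complex data types
Type	Description	Example
ARRAY	An ordered set of fields. All fields must have the same type.	array(1,2)
MAP	An unordered set of key/value pairs. Keys must be of a primitive type; values can be of any type. Within one map, all keys must have the same type and all values must have the same type.	map('a',1,'b',2)
STRUCT	A set of named fields. The field types may differ.	struct('a',1,1.0)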

create table login_array (
ip string,
uid array<bigint>
)
partitioned by (dt string)
row format delimited
fields terminated by ','
collection items terminated by '|'
stored as textfile;

select uid[2] from login_array; -- access an element by position (indexes start at 0)

select size(uid) from login_array; -- length of the array

select * from login_array where array_contains(uid,1090510226111); -- rows whose array contains a value
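The following Python sketch mirrors how one text line of login_array would be split by the delimiters in the table definition, and what the three queries above compute on it. The sample line is invented and the parser is a simplification for illustration, not Hive's own deserializer.

```python
def parse_login_array(line):
    """Split on ',' for fields and '|' for array items, as declared for login_array."""
    ip, uid_field = line.split(",", 1)
    uid = [int(item) for item in uid_field.split("|")]
    return {"ip": ip, "uid": uid}

# Hypothetical data line: ip, then three uid values separated by '|'.
row = parse_login_array("192.168.1.1,3105007010|3105007014|1090510226111")

third_uid = row["uid"][2]                     # like: select uid[2]
uid_count = len(row["uid"])                   # like: select size(uid)
has_uid = 1090510226111 in row["uid"]         # like: array_contains(uid, 1090510226111)

print(third_uid, uid_count, has_uid)
```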

create external table map_demo(
ts string,
ip string,
type string,
logtype string,
request map<string,string>,
response map<string,string>
)
partitioned by (dt string)
row format delimited
fields terminated by '#'
collection items terminated by '&'
map keys terminated by '='
stored as textfile;

create table login_struct(
ip string,
usr struct<name:string,age:int>
)
row format delimited
fields terminated by ','
collection items terminated by ':'
stored as textfile;
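As a rough illustration of how the delimiters declared for map_demo and login_struct carve up a text line, here is a Python sketch. Both sample lines are invented, and the parsing is deliberately simple (no escaping, no NULL handling), so it should be read as a picture of the layout rather than as Hive's behaviour in every case.

```python
def parse_map(field, item_sep="&", key_sep="="):
    """Turn 'k1=v1&k2=v2' into a dict, like a map<string,string> column."""
    result = {}
    for item in field.split(item_sep):
        key, value = item.split(key_sep, 1)
        result[key] = value
    return result

def parse_map_demo(line):
    """Fields are separated by '#', map items by '&', keys from values by '='."""
    ts, ip, type_, logtype, request, response = line.split("#")
    return {"ts": ts, "ip": ip, "type": type_, "logtype": logtype,
            "request": parse_map(request), "response": parse_map(response)}

def parse_login_struct(line):
    """Fields are separated by ',', struct members by ':'."""
    ip, usr = line.split(",", 1)
    name, age = usr.split(":")
    return {"ip": ip, "usr": {"name": name, "age": int(age)}}

# Hypothetical data lines for the two tables.
log = parse_map_demo("1513843200#10.0.0.1#GET#access#uid=42&page=home#code=200&size=512")
login = parse_login_struct("192.168.1.1,tom:25")

print(log["request"]["uid"])   # in HiveQL: request['uid']
print(login["usr"]["name"])    # in HiveQL: usr.name
```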

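Hive built-in functions:

Numeric functions: ceil (round up), ceiling (round up), floor (round down), round (round to a given precision), rand (random number), ...

Date functions:
from_unixtime: UNIX timestamp to date
unix_timestamp: current UNIX timestamp; date to UNIX timestamp; date in a given format to UNIX timestamp
to_date: datetime to date
year: year of a date
month: month of a date
day: day of a date
hour: hour of a date
minute: minute of a date
second: second of a date
weekofyear: week number of a date
datediff: difference between two dates
date_add: add days to a date
date_sub: subtract days from a date

Null handling: coalesce (returns the first non-null value)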
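For readers who want to check what a few of the date functions above return, here is a Python sketch of to_date, datediff, date_add, and date_sub built on the standard datetime module. The function bodies are stand-ins written for this note and only handle 'yyyy-MM-dd' style input; they are meant to show the semantics, not to reproduce every edge case of Hive.

```python
from datetime import date, datetime, timedelta

def to_date(ts):
    """Date part of a 'yyyy-MM-dd HH:mm:ss' string."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date().isoformat()

def datediff(end_date, start_date):
    """Number of days from start_date to end_date."""
    return (date.fromisoformat(end_date) - date.fromisoformat(start_date)).days

def date_add(start_date, days):
    """Date that lies the given number of days after start_date."""
    return (date.fromisoformat(start_date) + timedelta(days=days)).isoformat()

def date_sub(start_date, days):
    """Date that lies the given number of days before start_date."""
    return (date.fromisoformat(start_date) - timedelta(days=days)).isoformat()

print(to_date("1970-01-01 00:00:00"))
print(datediff("2009-03-01", "2009-02-27"))
print(date_add("2008-12-31", 1))
print(date_sub("2008-12-31", 1))
```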

String functions: