Hive Notes

weixin_30914981

Tags: big data, database, java

Original link: http://www.cnblogs.com/Diyo/p/11459270.html

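* On the Hadoop master node, as the user that starts the HDFS cluster
1 Unpack the Hive archive
2 Configure the environment variables
3 Initialize the metastore database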

Hive commands
Initialize the metastore database (run the line that matches the backing database)
schematool -dbType derby -initSchema
schematool -dbType mysql -initSchema

Re-initializing the metadata (embedded Derby): move the old metastore_db directory aside first
* mv metastore_db metastore_db.tmp

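Using MySQL for the metastore:
1 Edit hive-site.xml (vi hive-site.xml)
2 Copy the MySQL JDBC driver jar into hive/lib
3 Create a MySQL user named hive
4 In /etc/mysql/my.cnf, comment out the bind-address line so that MySQL accepts remote connections:
#bind-address = 127.0.0.1
5 sudo service mysql restart
6 schematool -dbType mysql -initSchema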

grant all privileges on hive.* to 'hive'@'%' identified by 'hive';

flush privileges;

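To remove the hive user again (run inside the mysql database):
Delete FROM user Where User='hive' and Host='%';

useSSL=false (parameter appended to the MySQL JDBC connection URL)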

------------Creating tables---------------
Show all tables
show tables;
Show all databases
show databases;

Print column headers in CLI query results
set hive.cli.print.header=true;

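1 Data types
tinyint/smallint/int/bigint
literal suffixes: Y (tinyint), S (smallint), L (bigint)

A bare integer literal is an int, as in Java:
long l = 100L;
long l = 99999999999; (error: the literal is outside the int range; write 99999999999L)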

float/double/decimal
decimal(x,y): x is the precision (total number of digits), y is the scale (digits after the decimal point)

boolean

string/varchar/char

timestamp/date/interval

binary

Array: ordered list; Map: key-value pairs; Struct: object with named fields
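To make the integer ranges concrete, here is a small Python sketch (not Hive code) that checks whether a value fits a given Hive integer type. The helper name fits is made up for illustration; the ranges are the usual signed 1-, 2-, 4- and 8-byte ranges.

```python
# Signed ranges of Hive's integer types (1, 2, 4 and 8 bytes).
RANGES = {
    "tinyint": (-2**7, 2**7 - 1),
    "smallint": (-2**15, 2**15 - 1),
    "int": (-2**31, 2**31 - 1),
    "bigint": (-2**63, 2**63 - 1),
}


def fits(type_name, value):
    """Return True if value is representable in the given integer type."""
    low, high = RANGES[type_name]
    return low <= value <= high


print(fits("int", 99999999))         # an 8-digit value is still an int
print(fits("int", 99999999999))      # an 11-digit value is not
print(fits("bigint", 99999999999))   # it needs bigint (literal suffix L)
```

This is why a bare literal beyond roughly 2.1 billion needs the L suffix, while smaller ones do not.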

2 Table types

Internal table, external table, partitioned table, bucketed table

create table t_car(xx xxx,xx xxx)

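An internal (managed) table is the ordinary kind of table; the tables shown so far are internal tables.
Conceptually it is similar to a table in a relational database.
Every table has a corresponding directory in Hive that stores its data,
and all data of an internal table is kept in that directory.
When the table definition is dropped, the table's data and metadata are deleted along with it.

Table (internal table)
HiveQL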

create table t_phone(id int,name string,price double)
row format delimited
fields terminated by ','
stored as textfile;

Load from a local file
load data local inpath '/home/hdfs/phone_data' into table t_phone;

Load from a file on HDFS
load data inpath '/data2' into table t_student;

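External Table

For an external table, the data and the table definition do not constrain each other: the table is only a reference to files on HDFS, and the data is still there after the table definition is dropped.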

create external table t_phone_ext(id int,name string,price double)
row format delimited
fields terminated by ','
stored as textfile
location '/ext_table_data';
An external table treats the data under a given directory as a table.

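Ways to import data
1 insert
2 load
3 subquery

Hive supports subqueries
create table t_phone_back
as
select * from t_phone;

insert overwrite/into table t_order select * from t_user;
(use either overwrite, which replaces the existing data, or into, which appends)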

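Partition (partitioned table)
Create partitions along dimensions such as business code, date or other categories.
For example, give each of nine districts of Chongqing its own partition;
a query for one district then reads only that partition's data
instead of filtering the full data set.
How partitioning is implemented:
under the table's directory, each partition is one subdirectory
When to use it:
when a single table is very large and queries usually filter on one category,
partition the table by that category
to speed up queries and reduce resource usage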

create table t_order(id int,name string,cost double)
partitioned by (month string)
row format delimited
fields terminated by ',';

load data local inpath '/home/briup/bd1902_hive/data_order' into table t_order
partition(month='8');

show partitions t_order;
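The directory-per-partition layout described above can be sketched in Python. This is an illustration, not Hive code: the warehouse path is the common default and is an assumption (the real value comes from hive.metastore.warehouse.dir), and the helper names are invented.

```python
import posixpath

# Assumption: default warehouse location and the default database.
WAREHOUSE = "/user/hive/warehouse"


def partition_dir(table, **partition_values):
    """Build the directory that holds one partition of a table."""
    parts = [f"{col}={val}" for col, val in partition_values.items()]
    return posixpath.join(WAREHOUSE, table, *parts)


def dirs_to_scan(table, column, all_values, wanted):
    """Partition pruning: only directories matching the filter are read."""
    return [partition_dir(table, **{column: v}) for v in all_values if v == wanted]


print(partition_dir("t_order", month="8"))
print(dirs_to_scan("t_order", "month", ["7", "8", "9"], "8"))
```

A query with where month='8' therefore touches one directory out of three in this example, which is where the speed-up comes from.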

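Bucket Table (bucketed table)
The rows of a large table are hashed on a column and stored in a fixed number of buckets, which makes it easy to sample the data when validating data and code.
For example, split a table into 10 buckets and work with only one tenth of the data at a time to get results quickly.

How bucketing is implemented:
under the table's directory, the data is split into N smaller files, one per bucket

When to use it:
from a very large data set we often need to take a small part as a sample,
then verify queries and tune programs on that sample;
a bucketed table makes this kind of sampling possible

Data in a bucketed table can only be inserted from a query on another table (a plain file load would not distribute the rows into buckets).

The notes do not show the definition of t_phone_bucket; presumably it looks something like this:
create table t_phone_bucket(id int,name string,price double)
clustered by (id) into 3 buckets
row format delimited
fields terminated by ',';

set hive.enforce.bucketing=true;
(needed on Hive 1.x; Hive 2.x always enforces bucketing)

insert into table t_phone_bucket select * from t_phone;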

select * from t_phone_bucket tablesample(bucket 3 out of 3 on id);
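To show what the tablesample query above selects, here is a Python sketch of bucket assignment. It assumes that the hash of an int column is the value itself and that the bucket index is the non-negative hash modulo the bucket count; the rows are hypothetical, not the real contents of phone_data.

```python
INT_MAX = 0x7FFFFFFF  # same value as Java's Integer.MAX_VALUE


def bucket_of(id_value, num_buckets):
    """0-based bucket index for an int key.

    Assumption: the hash of an int column value is the value itself.
    """
    return (id_value & INT_MAX) % num_buckets


def tablesample(rows, x, y):
    """Rows picked by tablesample(bucket x out of y on id); x is 1-based."""
    return [row for row in rows if bucket_of(row[0], y) == x - 1]


# Hypothetical (id, name) rows used only for illustration.
rows = [(i, f"phone{i}") for i in range(1, 10)]
print([row[0] for row in tablesample(rows, 3, 3)])
```

Under these assumptions, bucket 3 out of 3 holds the ids whose remainder modulo 3 is 2, so each bucket gets roughly one third of the rows.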

//array
create table tab_array (a array<int>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

load data local inpath '/home/briup/data_array' into table tab_array;

Sample row of data_array (the two fields are separated by a tab in the file):
1,2,3 hello,world,briup

select a[2],b[1] from tab_array;

//map
create table tab_map (name string,info map<string,string>)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

Sample row of data_map (name and info are separated by a tab in the file):
zhangsan name:zhangsan,age:18,gender:male

load data local inpath '/home/briup/bd1902_hive/data_map' into table tab_map;

select info['name'] from tab_map;

Another sample row, with a different set of keys:
zhangsan age:18,addr:beijing,height:180,weight:180

//struct (similar to an instance of a class)
create table tab_struct(name string,info struct<age:int,tel:string,salary:double>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

Sample row of data_struct (name and info are separated by a tab in the file):
zhangsan 18,18979131234,22.3

load data local inpath '/home/briup/bd1902_hive/data_struct' into table tab_struct;

select info.age,info.tel from tab_struct;
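The three delimiters (fields, collection items, map keys) can be hard to picture, so here is a Python sketch that parses the sample rows the same way the table definitions above describe. It assumes the fields are tab-separated, as the DDL says; the function names are invented for illustration.

```python
def parse_array_row(line):
    """tab_array: fields split on tab, array items split on ','."""
    a, b = line.split("\t")
    return [int(x) for x in a.split(",")], b.split(",")


def parse_map_row(line):
    """tab_map: map entries split on ',', key and value split on ':'."""
    name, info = line.split("\t")
    return name, dict(item.split(":", 1) for item in info.split(","))


def parse_struct_row(line):
    """tab_struct: struct members are positional, split on ','."""
    name, info = line.split("\t")
    age, tel, salary = info.split(",")
    return name, {"age": int(age), "tel": tel, "salary": float(salary)}


a, b = parse_array_row("1,2,3\thello,world,briup")
print(a[2], b[1])                      # corresponds to: select a[2],b[1]

_, info = parse_map_row("zhangsan\tname:zhangsan,age:18,gender:male")
print(info["name"])                    # corresponds to: select info['name']

_, person = parse_struct_row("zhangsan\t18,18979131234,22.3")
print(person["age"], person["tel"])    # corresponds to: select info.age,info.tel
```

Array indexes start at 0, map values are looked up by key, and struct members are addressed by the names given in the table definition.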

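------------Querying data-------------
select
from
where
order by sort by
group by distinct
distribute by cluster by
limit

join
Inner join (equi-join)
inner join
Outer joins
left outer, right outer, full outer
left outer join right outer join full outer join
Semi join

Cartesian (cross) join

Distributing rows across reducers
distribute by cluster by

The where clause needs no further explanation.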

create table test1(id int)
row format delimited
fields terminated by '\t'
stored as textfile;

Contents of data_test1:
5
3
6
4
2
9
7
8
1

load data local inpath '/home/briup/bd1902_hive/data_test1' into table test1;

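order by id asc: global sort. All rows pass through a single reducer, whatever mapred.reduce.tasks is set to.
ex:
hive> set mapred.reduce.tasks=2;
hive> select * from test1 order by id;

sort by id desc: local sort. Each reducer sorts only the rows it receives, so the combined output is not globally ordered.
ex:
hive> set mapred.reduce.tasks=2;
hive> select * from test1 sort by id;

distribute by: splits the rows by the given column or expression and sends each group to the corresponding reducer (and so to the corresponding output file)
ex:
hive> set mapred.reduce.tasks=2;
hive> INSERT overwrite LOCAL directory '/home/briup/bd1902_hive/res1'
SELECT id FROM test1
distribute BY id;

cluster by
Does what distribute by does and also sorts like sort by (cluster by id is equivalent to distribute by id sort by id, ascending)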
ex:
hive> set mapred.reduce.tasks=2;
hive>INSERT overwrite LOCAL directory '/home/briup/bd1902_hive/res2'
SELECT id FROM test1
cluster by id;
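The difference between the four clauses is easier to see in a simulation. The Python sketch below uses the nine ids of data_test1 and two reducers. It is an illustration only: it assumes an int key is assigned to reducer id % 2, and it uses round-robin for sort by as a stand-in, because Hive does not split rows by key there.

```python
NUM_REDUCERS = 2
ids = [5, 3, 6, 4, 2, 9, 7, 8, 1]  # contents of data_test1


def order_by(rows):
    """Global sort: a single reducer sees every row."""
    return sorted(rows)


def distribute_by(rows, n=NUM_REDUCERS):
    """Rows with the same key go to the same reducer (int hash assumed = value)."""
    reducers = [[] for _ in range(n)]
    for r in rows:
        reducers[r % n].append(r)
    return reducers


def sort_by(rows, n=NUM_REDUCERS):
    """Each reducer sorts only its own share (round-robin split is a stand-in)."""
    shares = [rows[i::n] for i in range(n)]
    return [sorted(share) for share in shares]


def cluster_by(rows, n=NUM_REDUCERS):
    """distribute by plus sort by on the same column."""
    return [sorted(share) for share in distribute_by(rows, n)]


print(order_by(ids))
print(sort_by(ids))
print(distribute_by(ids))
print(cluster_by(ids))
```

Each inner list plays the role of one reducer's output file: order by yields one fully sorted list, while the other three yield one list per reducer.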

group by and distinct
create table test2(name String,age int,num String)
row format delimited
fields terminated by '\t'
stored as textfile;

Contents of data_test2 (fields are tab-separated in the file):
zhao 15 20170807
zhao 14 20170809
zhao 15 20170809
zhao 16 20170809

load data local inpath '/home/briup/bd1902_hive/data_test2' into table test2;

select name from test2 group by name;

select distinct name from test2;

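With large data volumes distinct tends to be less efficient, so group by is generally recommended. The difference shows mainly with count(distinct ...), which has to funnel all values through a single reducer.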

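Pseudo columns (cf. Oracle's rownum)
select rownum from

(useful for building an inverted index table?)
Hive queries have two virtual columns:
INPUT__FILE__NAME: the name of the HDFS file that the row comes from;
BLOCK__OFFSET__INSIDE__FILE:
the offset of the row within that file;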

ex:
hive> select id,INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE from test1;

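Joins in Hive fall into two kinds: Common Join (the join is completed in the reduce phase) and Map Join (the join is completed in the map phase). If Map Join is not requested or its conditions are not met (hive.auto.convert.join=true enables map join for small tables; hive.mapjoin.smalltable.filesize=25M sets the small-table threshold), the Hive planner turns the join into a Common Join, i.e. the join is completed in the reduce phase.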
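The two strategies can be contrasted with a short Python sketch. This is a conceptual model, not Hive's implementation: map_join keeps the small table in memory and looks up each row of the big table, while common_join first groups both inputs by key (the shuffle) and pairs rows per key afterwards. The sample rows are illustrative.

```python
from collections import defaultdict


def map_join(big_rows, small_rows):
    """Map-side hash join: the small table is held in memory, no shuffle."""
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    return [(key, v, sv) for key, v in big_rows for sv in lookup.get(key, [])]


def common_join(left_rows, right_rows):
    """Reduce-side join: group both inputs by key, then pair rows per key."""
    groups = defaultdict(lambda: ([], []))
    for key, v in left_rows:
        groups[key][0].append(v)
    for key, v in right_rows:
        groups[key][1].append(v)
    return [(key, l, r)
            for key in sorted(groups)
            for l in groups[key][0]
            for r in groups[key][1]]


big = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]
small = [(1, 30), (2, 29), (4, 21)]
print(map_join(big, small))
print(common_join(big, small))
```

Both produce the same inner-join result; the map join simply avoids moving the big table across the network, which is why it only pays off when one side is small enough to fit in memory.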

Besides the inner, left, right and full joins found in traditional databases,
Hive also supports LEFT SEMI JOIN and CROSS JOIN,
although both can be expressed with the join types above.

Sample data for user_info (fields are tab-separated in the file)
1 zhangsan
2 lisi
3 wangwu

create table user_info(id int,name string)
row format delimited
fields terminated by '\t'
stored as textfile;

load data local inpath '/home/briup/bd1902_hive/data_user_info' into table user_info;

Sample data for user_age (fields are tab-separated in the file)
1 30
2 29
4 21
create table user_age(id int,age int)
row format delimited
fields terminated by '\t'
stored as textfile;

load data local inpath '/home/briup/bd1902_hive/data_user_age' into table user_age;

Inner join

SELECT a.id,
a.name,
b.age
FROM user_info a
inner join user_age b
ON (a.id = b.id);

Left outer join
SELECT a.id,
a.name,
b.age
FROM user_info a
left join user_age b
ON (a.id = b.id);

Right outer join
SELECT a.id,
a.name,
b.age
FROM user_info a
RIGHT OUTER JOIN user_age b
ON (a.id = b.id);

Full outer join
SELECT a.id,
a.name,
b.age
FROM user_info a
FULL OUTER JOIN user_age b
ON (a.id = b.id);

LEFT SEMI JOIN
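The table before the LEFT SEMI JOIN keyword is the main table;
the query returns the rows of the main table whose key also appears in the secondary table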

SELECT a.id,
a.name
FROM user_info a
LEFT SEMI JOIN user_age b
ON (a.id = b.id);

-- equivalent to:
SELECT a.id,
a.name
FROM user_info a
WHERE a.id
IN (SELECT id FROM user_age);

Cartesian product (CROSS JOIN)
SELECT a.id,
a.name,
b.age
FROM user_info a
CROSS JOIN user_age b;
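To see what each of the join queries above returns for the sample data, the following Python sketch reproduces their results. It relies on the ids being unique in both tables, which holds for this sample but is an assumption in general; None stands for NULL.

```python
user_info = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu")]
user_age = [(1, 30), (2, 29), (4, 21)]

names = dict(user_info)  # id -> name
ages = dict(user_age)    # id -> age

inner = [(i, n, ages[i]) for i, n in user_info if i in ages]
left = [(i, n, ages.get(i)) for i, n in user_info]
# The queries select a.id and a.name, so a user_age row without a match
# shows NULL in both of those columns.
right = [(i if i in names else None, names.get(i), a) for i, a in user_age]
full = left + [(None, None, a) for i, a in user_age if i not in names]
semi = [(i, n) for i, n in user_info if i in ages]
cross = [(i, n, a) for i, n in user_info for _, a in user_age]

for label, rows in [("inner", inner), ("left", left), ("right", right),
                    ("full", full), ("semi", semi), ("cross", cross)]:
    print(label, rows)
```

Note that the semi join returns only columns of user_info, and the cross join pairs every user_info row with every user_age row.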

create table t_student(id int,name string)
row format delimited
fields terminated by ','
stored as textfile;

load data local inpath '/home/hdfs/data_student' into table t_student;

Contents of data_student:
1,terry
2,kiven
3,larry
4,renee

-----------------Built-in functions (UDFs) and built-in operators--------------------------
#random number: rand()
select rand() from t_student;
#factorial of a
factorial(INT a)
select factorial(10) from t_student;
#largest of the arguments (works across arguments in one row, unlike the aggregate max())
greatest(T v1, T v2, ...)
select greatest(10,123,53,34,1,23,502,120) from t_student;
#smallest of the arguments
least(T v1, T v2, ...)
select least(10,123,53,34,1,23,502,120) from t_student;
#mathematical constant e
select e() from t_student;
#mathematical constant pi
select pi() from t_student;

#current date
select current_date from t_student;

#return a default value when the value is NULL
nvl(T value, T default_value)
select id,nvl(name, 'anonymous') from t_student;

#pick a result depending on the value
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Returns c if a=b, e if a=d, otherwise f.
For example, CASE 4 WHEN 5 THEN 5 WHEN 4 THEN 4 ELSE 3 END returns 4.
select id,name,CASE id WHEN 3 THEN 'boss' ELSE 'employee' END,name from t_student;

#check whether a file contains the string as an entire line
in_file(string str, string filename)
select in_file('2,vivo,4000.0','/home/hdfs/phone_data') from t_student;

#split a string on a separator (the separator is a regular expression)
split(string str, string pat)
select split('hello,world,briup', ',') from t_student;

#take a substring; positions are 1-based (a start of 0 is treated like 1)
substr(string|binary A, int start, int len)
select substr('ceo-larry', 1, 3) from t_student;

#position of the first occurrence of a substring, counting from 1; 0 if it is absent (cf. Java indexOf)
instr(string str, string substr)
select instr('ceo-larry', 'la') from t_student;

#edit distance between two strings: the minimum number of single-character insertions, deletions and substitutions
levenshtein(string A, string B)
select levenshtein('testptest', 'briup') from t_student;

#join the strings of an array with a separator
(cf. Java: stream.collect(Collectors.joining(sep)))
concat_ws(string SEP, array<string>)
select concat_ws('#', split('hello,world,briup', ',')) from t_student;
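The index conventions of the string functions above are easy to get wrong, so here is a Python sketch that mimics substr, instr, concat_ws and levenshtein. These are simplified re-implementations for illustration, not Hive's code: negative start positions and NULL handling are left out.

```python
def substr(s, start, length):
    """1-based start; 0 behaves like 1 (negative starts are not handled here)."""
    begin = max(start, 1) - 1
    return s[begin:begin + length]


def instr(s, sub):
    """1-based position of the first occurrence, 0 if absent."""
    return s.find(sub) + 1


def concat_ws(sep, items):
    """Join the strings of a list with a separator."""
    return sep.join(items)


def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


print(substr("ceo-larry", 1, 3))
print(instr("ceo-larry", "la"))
print(concat_ws("#", "hello,world,briup".split(",")))
print(levenshtein("kitten", "sitting"))
```

The levenshtein result counts edits, not merely differing characters, so two strings of different lengths always have a distance of at least the length difference.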

-----------------User-defined functions (UDF)--------------------------
create table t_student_udf(id int,name string,tel string)
row format delimited
fields terminated by ','
stored as textfile;

1,张三,13834112233
2,李四,13994200987
3,王五,13302019922
4,jack,13211223344
未知 ("unknown")
load data local inpath '/home/briup/bd1902_hive/data_getarea' into table t_student_udf;

showArea(gender)

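1 Write a Java class that extends UDF and put the logic in a method named
evaluate

2 Package it as a jar and upload it to the node where Hive runs

3 In Hive, create a function and bind it to the custom class in the jar

add jar /home/hdfs/udf.jar;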

create temporary function GetArea
as 'com.briup.udf.GetArea';

show functions;

select id,name,tel,GetArea(tel) from t_student_udf;

------------------Hive with HBase----------------------
//hive+hbase
//create the table in HBase (hbase shell)

create 'person',{NAME => 'f1',VERSIONS => 1},{NAME => 'f2',VERSIONS => 1},{NAME => 'f3',VERSIONS => 1}
put 'person','1001','f1:name','jack'
put 'person','1001','f2:age','18'
put 'person','1002','f1:name','jack'
put 'person','1003','f3:position','ceo'

//hive

SET hbase.zookeeper.quorum=hadoop:2181;
SET zookeeper.znode.parent=/hbase;

ADD jar /home/briup/software/hive/lib/hive-hbase-handler-2.3.5.jar;

CREATE EXTERNAL TABLE person (
rowkey string,
f1 map<STRING,STRING>,
f2 map<STRING,STRING>,
f3 map<STRING,STRING>
) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,f1:,f2:,f3:")
TBLPROPERTIES ("hbase.table.name" = "person");

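hbase.zookeeper.quorum:

The ZooKeeper ensemble that HBase uses. The default port is 2181 and may be left out; if it is given, the format is zkNode1:2222,zkNode2:2222,zkNode3:2222

zookeeper.znode.parent

The root znode that HBase uses in ZooKeeper

hbase.columns.mapping

The mapping between Hive columns and HBase columns: the first Hive column maps to :key (the row key), the second to column family f1, the third to column family f2, and the fourth to column family f3

hbase.table.name

The name of the table in HBase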
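The positional nature of hbase.columns.mapping can be sketched in Python. The code pairs the Hive columns with the mapping entries and then shows roughly how the cells written by the put commands above would appear as Hive rows, with one map per column family. It is a model of the mapping, not the storage handler's code, and the function names are invented.

```python
hive_columns = ["rowkey", "f1", "f2", "f3"]
mapping = ":key,f1:,f2:,f3:"


def pair_columns(columns, mapping_spec):
    """Pair Hive columns with HBase mapping entries by position."""
    entries = mapping_spec.split(",")
    if len(entries) != len(columns):
        raise ValueError("one mapping entry is needed per Hive column")
    return dict(zip(columns, entries))


# Cells written by the put commands: (row key, "family:qualifier", value).
cells = [
    ("1001", "f1:name", "jack"),
    ("1001", "f2:age", "18"),
    ("1002", "f1:name", "jack"),
    ("1003", "f3:position", "ceo"),
]


def as_hive_rows(cells, families=("f1", "f2", "f3")):
    """Each row key becomes one Hive row; each family becomes a map column."""
    rows = {}
    for key, column, value in cells:
        family, qualifier = column.split(":", 1)
        row = rows.setdefault(key, {f: {} for f in families})
        row[family][qualifier] = value
    return [(key, *(row[f] for f in families)) for key, row in sorted(rows.items())]


print(pair_columns(hive_columns, mapping))
for row in as_hive_rows(cells):
    print(row)
```

Because an entry such as f1: names a whole column family, the matching Hive column has to be a map from qualifier to value, which is why the table above declares f1, f2 and f3 as map<STRING,STRING>.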

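Writing to HBase through Hive (test is assumed here to be a Hive table mapped to HBase in the same way as person above; dual is the one-row helper table created at the end of these notes):
INSERT INTO TABLE test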
SELECT 'row1' AS rowkey,
map('c3','name3') AS f1,
map('c3','age3') AS f2,
map('c4','job3') AS f3
FROM dual
limit 1;

INSERT INTO TABLE test
SELECT 'row1' AS rowkey,
map('c3','name3') AS f1,
map('c3','age3') AS f2,
map('c4','job3') AS f3
FROM person;

-----------------Connecting to Hive over JDBC------------------------
Prerequisites
hdfs-site.xml
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

core-site.xml
<property>
<name>hadoop.proxyuser.briup.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.briup.groups</name>
<value>*</value>
</property>
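The third segment of the two property names above (briup) is the user name on the machine where Hive runs

1 Start the Hive service and keep it running

2 hiveserver2
hive --service hiveserver2 &

It listens on port 10000 by default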

Driver class: org.apache.hive.jdbc.HiveDriver

URL: jdbc:hive2://192.168.89.128:10000/bd1902

User name: briup

Password: briup

dual: Oracle's dummy table, handy for tests

select 'hello' from dual;
(in Oracle, dual has a single column, dummy, and a single row, X)
dummy
X

//hive
create table dual (dummy string);
//cmd
echo 'X' > dual.txt;
//hive
load data local inpath '/home/hadoop/dual.txt' overwrite into table dual;

Reposted from: https://www.cnblogs.com/Diyo/p/11459270.html
