Contents
1. Data Types
tinyint     1-byte signed integer
smallint    2-byte signed integer
int         4-byte signed integer
bigint      8-byte signed integer
boolean     true/false
float       single-precision floating point
double      double-precision floating point
string      character string
timestamp   integer, floating point, or string
binary      byte array
--------------------------------
struct struct<street:string,city:string> {"street":"mozijie","city":"cd"}
map map<string,float> {"s1":110.4,"s2":50.1}
array array<string> ["emp1","emp2","emp3"]
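A quick way to exercise all of these types in one DDL; types_demo is a hypothetical table name, not from the notes above:
create table types_demo (
  id      bigint,
  score   float,
  active  boolean,
  tags    array<string>,
  grades  map<string,float>,
  addr    struct<street:string,city:string>
)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';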
2. Common Commands
- Show column names in CLI query results:
set hive.cli.print.header=true;
- Run hadoop filesystem commands inside the CLI:
dfs ...;  e.g.: dfs -ls -R /;
- Copy data between clusters:
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
- Show a table's structure:
describe formatted db.score;
- Show a table's name column:
describe formatted db.score.name;
- Create a table with the same schema as an existing table:
create table db.stu_score like default.score;
- Set variables:
hive -hiveconf enter_school_date="20130902" -hiveconf min_ag="26" -f testvar.sql;
Contents of testvar.sql:
use test;
select * from student
where
pdate='${hiveconf:enter_school_date}'
and
sage > '${hiveconf:min_ag}'
limit 8;
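A related sketch: --hivevar works the same way but keeps user variables in their own namespace (the one-liner below is illustrative, reusing the student table above):
hive --hivevar min_ag=26 -e 'use test; select * from student where sage > "${hivevar:min_ag}" limit 8;'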
- Create an external partitioned table:
create external table emp(id int,name string) partitioned by(month string) row format delimited fields terminated by ' ' stored as textfile;
- Load data into the partitioned table:
load data local inpath '/usr/data/hive/emp' into table emp partition(month='10');
load data local inpath '/usr/data/hive/emp' into table emp partition(month='12');
- Show a table's partitions: show partitions emp;
To keep queries on partitioned tables from launching huge MR jobs, you can set hive.mapred.mode=strict; a query on a partitioned table whose where clause has no partition filter is then rejected:
FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "emp" Table "emp"
Remove the restriction with: set hive.mapred.mode=nonstrict
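With strict mode on, adding a partition filter makes the query legal again, e.g.:
select * from emp where month='10';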
- Quickly clone a partitioned table:
Source partitioned table: emp
1. create table employee like emp;
2. dfs -cp /user/hive/warehouse/school.db/emp/* /user/hive/warehouse/school.db/employee/;
3. msck repair table employee;
4. show partitions employee;
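An alternative sketch using Hive's built-in export/import, which copies data and metadata (partitions included) in one step; the path and the target name employee2 are illustrative:
export table emp to '/tmp/emp_export';
import table employee2 from '/tmp/emp_export';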
- Drop a table:
drop table fjs; -- no error if the table does not exist
For an external table, only the metadata is removed; the data on HDFS is left in place.
- Alter a table:
Rename: alter table emp rename to employee;
Add a partition at a given data location: alter table emp add partition(month='9') location '/data/hive/emp/month=9/';
Drop a partition: alter table emp drop partition(month='9'); for an external table, msck repair table emp; can restore the partition metadata
Point an existing partition at a different data location: alter table emp partition(month='10') set location '/data/hive/emp/month=10/emp10';
Change a column: alter table emp change column name mingzi string;
Add columns: alter table temp add columns (xing string comment 'gender', salary double);
Change a partition's storage format: alter table employee partition(month='12') set fileformat sequencefile;
3. Data Operations
- Insert query results into a table
(insert overwrite replaces the target's existing data, insert into appends; the syntax is otherwise the same)
insert overwrite table employee partition(month=9) select id,name from emp where month=9;
Exception example: FAILED: SemanticException [Error 10044]: Line 1:23 Cannot insert into target table because column number/types are different '10': Table insclause-0 has 2 columns, but query has 3 columns.
Note: the select list must not include the partition column; Hive fills in the month value given in the partition clause, so that column has to be dropped from the query.
Dynamic partitioning driven by the data: insert overwrite table emp partition(age) select id,name,age from emp_n;
The partition value is taken from the last select column (age); the column name does not need to match. See the settings sketch below.
Mix a static and a dynamic partition: insert overwrite table stu partition(classid=1,teacherid) select id,name,teacherid from stu_tmp where classid=1; a column whose partition value is given statically must not appear in the select list, and static partition columns must come before dynamic ones.
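Dynamic-partition inserts like the ones above usually need dynamic partitioning enabled first; a minimal sketch of the session settings:
set hive.exec.dynamic.partition=true;            -- allow dynamic partitions at all
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow every partition column to be dynamic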
- Save query results locally:
1. hive> insert overwrite local directory '/usr/data/hive' select * from mydb.stu; -- Hive-native export (drawback: columns are separated only by the non-printing ^A character)
2. $> hive -e "select * from mydb.stu" > a.txt
3. $> hive -f /usr/data/hive/hql/save.hql > a.txt
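The delimiter drawback of option 1 can be avoided on Hive 0.11+ by giving insert ... directory its own row format (the output path is illustrative):
insert overwrite local directory '/usr/data/hive/stu_out'
row format delimited fields terminated by ','
select * from mydb.stu;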
4. HiveQL Queries (Examples)
- Create the demo tables and load data:
create table employees(name string,salary float,subordinates array<string>,deduction map<string,float>,address struct<street:string,city:string>) partitioned by(country string,gender string) row format delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':';
create table e(name string,salary float,subordinates array<string>,deduction map<string,float>,address struct<street:string,city:string>,country string,gender string) row format delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY ',' MAP KEYS TERMINATED BY ':';
Sample data (employees):
zhangsan 8010.50 emp1,emp2,emp3 s1:110.4,s2:50.1 mozijie,cd china man
lisi 11000.4 emp5,emp6 s1:150,s2:90 changfengjie,cd china women
limei 6000 s1:80.1,s2:10.2 glodstreet,ch england women
wangwu 9000 s1:120.4,s2:80.1 ste,ch england man
load data local inpath '/usr/data/hive/employees' into table e;
insert overwrite table employees partition(country,gender) select name,salary,subordinates,deduction,address,country,gender from e;
Columns:
name salary subordinates deduction address country gender
- Query the array-typed subordinates column: index into it to fetch a single element
select subordinates from employees;
OK
subordinates
["emp1","emp2","emp3"]
["emp5","emp6 "]
[]
[]
----------------------------------------
select subordinates[0] from employees;
OK
c0
emp1
emp5
NULL
NULL
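Two standard array helpers that often go with indexing (nothing here is specific to this table):
select size(subordinates)                   sub_count,  -- number of elements, 0 for []
       array_contains(subordinates,'emp1')  has_emp1    -- membership test
from employees;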
- Query the map-typed deduction column: look up the value for a given key
select deduction from employees;
OK
deduction
{"s1":110.4,"s2":50.1}
{"s1":150.0,"s2":90.0}
{"s1":120.4,"s2":80.1}
{"s1":80.1,"s2":10.2}
----------------------------------------
select t.deduction['s1'] from employees t;
OK
c0
110.4
150.0
120.4
80.1
----------------------------------------
select t.deduction['s11'] from employees t; -- a key that is not in the map returns NULL
OK
c0
NULL
NULL
NULL
NULL
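Likewise for maps, the standard helpers size, map_keys, and map_values:
select size(deduction)        entry_count,  -- number of key/value pairs
       map_keys(deduction)    ks,           -- array of keys
       map_values(deduction)  vs            -- array of values
from employees;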
- Query the struct-typed address column: use dot notation to reach the fields
select t.address from employees t;
OK
t.address
{"street":"mozijie","city":"cd"}
{"street":"changfengjie","city":"cd "}
{"street":"ste","city":"ch"}
{"street":"glodstreet","city":"ch"}
----------------------------------------
select t.address.street from employees t;
OK
street
mozijie
changfengjie
ste
glodstreet
----------------------------------------
- Computing with column values
- Basic arithmetic
select upper(name) name,salary before,(deduction['s1']+deduction['s2']) deduction, (salary-(deduction['s1']+deduction['s2'])) after from employees;
name before deduction after
ZHANGSAN 8010.5 160.5 7850.0
LISI 11000.4 240.0 10760.4
WANGWU 9000.0 200.5 8799.5
LIMEI 6000.0 90.299995 5909.7
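The 90.299995 above is ordinary float noise; round() tidies it up for display, e.g.:
select upper(name) name,
       round(deduction['s1'] + deduction['s2'], 2) deduction
from employees;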
- Functions
Aggregate functions (set hive.map.aggr=true can improve aggregation performance):
select count(*) count from employees;
Query ID = hadoop_20181217113232_b5851024-9264-4fee-9ed8-c3c515fb58e3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2018-12-17 15:41:58,628 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local424395412_0014
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 28040 HDFS Write: 9224 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
count
4
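A grouped aggregate over the same table (map-side aggregation helps here too):
select gender, count(*) cnt, avg(salary) avg_salary
from employees
group by gender;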
- Table-generating functions
1、select explode(subordinates) sub from employees;
OK
sub
emp1
emp2
emp3
emp5
emp6
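explode() cannot be mixed with other select columns directly; the standard companion is lateral view, sketched here (rows with empty arrays disappear unless lateral view outer is used):
select e.name, sub
from employees e
lateral view explode(subordinates) subview as sub;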
2. The case..when..then..else..end expression
select upper(name) name,salary before ,(deduction['s1']+deduction['s2']) deduction, (salary-(deduction['s1']+deduction['s2'])) after,case gender when 'man' then 'nan' when 'women' then 'nv' else 'xx' end from employees;
OK
name before deduction after c4
ZHANGSAN 8010.5 160.5 7850.0 nan
LISI 11000.4 240.0 10760.4 nv
WANGWU 9000.0 200.5 8799.5 nan
LIMEI 6000.0 90.299995 5909.7 nv
Have Hive prefer local mode when running queries: set hive.exec.mode.local.auto=true
3. Comparing floating-point numbers:
select t.deduction['s1'] from employees t where t.deduction['s1']>120.4;
OK
c0
150.0
120.4
----------------------------------------
The 120.4 row is clearly wrong. Cause: Hive reads the literal 120.4 as a double, so the float column is widened to a double before comparing; the stored float widens to roughly 120.4000015258789, which is greater than the double literal 120.4.
Fix: cast the literal down to float instead:
select t.deduction['s1'] from employees t where t.deduction['s1']>cast(120.4 as float);
OK
c0
150.0
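The widening is easy to observe directly on a reasonably recent Hive (0.13+ allows select without from); the exact printed digits may vary:
select cast(cast(120.4 as float) as double);  -- roughly 120.4000015258789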
4. The LIKE and RLIKE keywords
LIKE:
select name from employees t where t.name like 'li%';
OK
name
lisi
limei
----------------------------------------
5. RLIKE (one regex can replace several LIKEs):
Find rows in which li or mei appears:
select name from employees t where t.name rlike '.*(li|mei).*';
OK
name
lisi
limei
----------------------------------------
6. JOINs
1. inner join: equi-join; rows with no match are dropped
2. left join
select * from employees e left join country c on e.country=c.name;
e.name e.salary e.subordinates e.deduction e.address e.country e.gender c.name c.level
zhangsan 8010.5 ["emp1","emp2","emp3"] {"s2":50.1,"s1":110.4} {"street":"mozijie","city":"cd"} china man china A
lisi 11000.4 ["emp5","emp6 "] {"s2":90.0,"s1":150.0} {"street":"changfengjie","city":"cd "} china women china A
wangwu 9000.0 [] {"s2":80.1,"s1":120.4} {"street":"ste","city":"ch"} england man england B
limei 6000.0 [] {"s2":10.2,"s1":80.1} {"street":"glodstreet","city":"ch"} england women england B
----------------------------------------
select * from country c left join employees e on e.country=c.name;
china A zhangsan 8010.5 ["emp1","emp2","emp3"] {"s2":50.1,"s1":110.4} {"street":"mozijie","city":"cd"} china man
china A lisi 11000.4 ["emp5","emp6 "] {"s2":90.0,"s1":150.0} {"street":"changfengjie","city":"cd "} china women
england B limei 6000.0 [] {"s2":10.2,"s1":80.1} {"street":"glodstreet","city":"ch"} england women
england B wangwu 9000.0 [] {"s2":80.1,"s1":120.4} {"street":"ste","city":"ch"} england man
>> italian B NULL NULL NULL NULL NULL NULL NULL   (the unmatched left-table row is kept, padded with NULLs)
3. left semi join: returns rows of the left table whose join key appears in the right table
select * from country c left semi join employees e on e.country=c.name;
Rows of country whose name value appears in employees (columns of the right table cannot appear in the select list):
c.name c.level
china A
england B
7. Sampling
select * from stu tablesample (bucket 1 out of 2 on rand()) s;
stu has 9 rows; tablesample splits them into 2 random buckets and returns one bucket
select * from stu tablesample (0.5 percent) s;
Percentage-based sampling, whose smallest unit is one HDFS block. Note: if the table's data is smaller than one HDFS block (128 MB), all rows are returned.
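Bucket sampling is most efficient when the table is physically clustered into matching buckets; a sketch, assuming stu has id and name columns (stu_bucketed is hypothetical):
create table stu_bucketed (id int, name string)
clustered by (id) into 2 buckets;
set hive.enforce.bucketing=true;  -- older Hive: make inserts honor the bucketing
insert overwrite table stu_bucketed select id, name from stu;
select * from stu_bucketed tablesample (bucket 1 out of 2 on id) s;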
- Views
Create:
create view vw_c as select * from country c1 where c1.name='china' union all select * from country c2 where c2.name='england';
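The view is then queried like any table, e.g.:
select * from vw_c where level = 'A';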
- Check HDFS storage usage:
Total:
hadoop fs -df -h
18/12/18 10:47:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Filesystem Size Used Available Use%
hdfs://s101:8020 39.2 G 63.1 M 28.5 G 0%
Per-database usage:
hadoop fs -du -h /user/hive/warehouse
18/12/18 10:49:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
62.3 M 62.3 M /user/hive/warehouse/mydb.db
72 72 /user/hive/warehouse/school.db
223 223 /user/hive/warehouse/score1