hive纪要

最新推荐文章于 2022-05-31 23:41:19 发布

斗者_2013

最新推荐文章于 2022-05-31 23:41:19 发布

阅读量898

点赞数

分类专栏： Hadoop 文章标签： hive

本文链接：https://blog.csdn.net/w1014074794/article/details/50732294

版权

Hadoop 专栏收录该内容

34 篇文章 1 订阅

订阅专栏

用文件执行hive查询

1、在系统终端下

[root@cloud4 shell] Hive -f /path/to/file/withqueries.hql

[root@cloud4 shell] Hive -e ‘短sql语句’

2.在hive shell中用户可以使用source命令执行一个脚本

Hive>Source /path/to/file/withqueries.hql

Hive中源表src的使用？

类似Oracle中的dual。查询不设计具体的表。

Select xpath(……) from src limit 1;

如何在hive CLI窗口执行bash shell?

只需要命令前加上！并且以分号（；）结尾就可以了；

! /bin/echo “what up dog”;

在hive CLI窗口执行hadoop的dfs命令？

只需去掉hadoop命令中的关键字hadoop去掉，然后以分好结尾就可以了。

Hive> dfs -ls /;

Hive中的数据类型？

Int

Boolean

Float

Double

String

集合类型：struct，map和array

Hive中创建表时怎么指定分割符？

Row format delimited

Fields terminated by ‘\001’ --使用^A字符作为列分隔符

Collection items terminated by ‘\002’ --使用^B字符作为集合元素间的分隔符

Map keys terminated by ‘003’ --使用^C字符作为map的键和值之间的分隔符

Lines terminated by ‘\n’ --行与行之间的分割符只支持“\n”

Stored by textfile; --使用文档格式存储文件

Hive中的元数据存储的作用？

元数据存储中存储了如表的模式和分区信息等元数据信息。一般源数据存储使用mysql数据库。用户在执行create table x…或者aler table y…等命令时会指定这些信息。

Hive数据信息的存储位置的配置？

Hive.metastore.warehouse.dir=/user/myname/hive/warehouse;

HiveQL:数据定义？

Create database if not exists financials

With dbproperties(‘creator’=’WanLi’,’data’=’2012-01-02’);--增加额外相关的键值对属性信息

;

Describe database financials;--查看数据库描述信息

Describe database extended financials;--查看信息，包含数据库的键值对补充信息

Show databases;

Use financials;

Drop database if exists financials;--数据库中含有表就不能删除

Drop database if exists financials cascade; --hive自行先删除表再删除数据库

Alert database financials set dbproperties(‘edited-by’=’Joe Dba’);

创建表：

Create table if not exists mydb.employees(

Name string comment ‘Employee name’,

Salary float comment ‘Employee salary’,

Subordinates array<String> comment ‘names of subordinated’,

Decuctions map<String,Float> comment ‘keys are deducitons names,values arepercentates’,

Address struct<street:String,city:String,state:String,zip:int> comment ‘home address’

)

Comment ‘Desciption of the table’

Tblproperties(‘creator’=’me’,’created_at’=’2012-01-02 10:00:00’,…)

Location ‘/user/hive/warehouse/mydb.db/employees’;

拷贝表：

Create table if notexists mydb.employees2 like mydb.employees;

Use mydb;

Show tables;

Describe extended mydb.employee;--查看表信息

Describe formatted mydb.employee; --输出更多的表相关信息

内部表

内部表又称为管理表，hive能够控制这些表数据的生命周期。删除表的同时会删除表数据。

但是管理表不适合和其他工作共享数据。

外部表

使用场景：hive没有部分数据的所有权，但是想要在这份数据上执行一些查询，这时，我们可以创建一个外部表指向这些数据，而不需要对其具有所有权。

使用external关键字创建外部表：

Create external table if notexists stocks(

……

)

Row format delimited fields terminated by ’,’

Partitioned by(year int ,month int ,day int)

Location ‘/data/stocks’;

增加分区：

Alert table log_messages add partition(year-2012,month=1,day=2)

Location‘hdfs://master_server/data/log_messages/2012/01/02’;

拷贝分区数据：

Hadoop distcp /data/log_message/2011/12/02 s3n://ourbucket/logs/2012/01/02

重新指定分区：

Alter table log_messages partition(year=2011.month-12,day=2)

Set location ‘s3n://ourbucket/logs/2012/01/02’;

删除旧分区数据：

Hadoop fs -rmr /data/log_messages/2011/01/02

查看分区：

Show Partitions log_messages;

查看指定分区的存放路径

Describe extended log_messages partition(year=2011,month=1,day=2);

删除表

Drop table if exists employees;

修改表

Alter table log_messages rename to logmsgs;--修改表名

--增加分区

Alter table log_messages add if notexists

Partition(year=2011,month=1,day=1) location ‘/logs/2011/01/01’

Partition(year=2012,month=1,day=2) location ‘/logs/2011/01/02’

Partition(year=2012,month=1,day=3) location ‘/logs/2011/01/03’

……

;

--修改分区路径

Alter table log_messages partition(year=2011,month=12,day=2)

Set location ‘s3n://ourbucket/logs/2011/01/02’;

--删除表分区

Alert table log_messages drop if exists partition(year=2011,month=12,day=2);

修改列信息：

Alter table log_messages change column hms hours_minutes_seconds int

Comment ‘The hours,minutes,and seconds partof the timestamp’

After severity;

--增加列

Alter table log_messages add columns(

App_name string comment ‘Application name’,

Session_id Long comment ‘The current session id’

);

--删除或替换列

Alter table log_messages replace columns(

Hours_mins_secs int comment ‘hour,minute,seconds fromtimestamp’,

Severity String comment ‘The message severity’,

Message String comment ‘The rest of the message’

);

--修改表的属性

Alter table log_messages set tblproperties(

‘notes’=’the process id is no longer captured;’

);

--修改分区存储属性

Alter table log_messages partition(year=2012,month=1,day=1)

Set fileformat sequencefile;

HiveQL:数据操作？

Hive中没有行级别的数据插入、数据更新和删除操作。那么向表中装载数据的唯一途径就是

使用一种“大量”的数据装载操作。

或者通过其他方式仅仅将文件写入到正确的目录下。

使用load装载数据

Load data local inpath ‘${env:HOME}/California-employees’

Overwrite into table employee

Partition(country=’US’,state=’CA’);

注意：1.如果分区不存在会自动先创建分区目录

2．Local关键字，意思是本地文件系统路径。数据会被拷贝到目标位置。

如果不使用local关键字，那么这个路径应该是分布式文件系统中的路径，这种情况下，数据是从这个路径转移到目标位置的。

3.overwrite关键字，如果使用了overwrite关键字，那么目标文件夹中之前存在的数据会先被删除掉。

如果没有使用overwrite关键字，仅仅是把新增的文件增加到目标文件夹中而不会删除之前的数据。

4、inpath路径下不可以包含任何文件夹

通过查询语句向表中插入数据

--将staged_employees中查询出的数据载入employees表的对应分区中

Insert overwrite table employees partition(country=’US’,state=’OR’)

Select * fromstaged_employees se

Where se.cnty=’US’ and se.st=’OR’;

一次扫描多次划分的sql？

--如何为3个州创建表employees分区

From staged_employees se

Insertoverwrite table employees

Partition (country=’US’ ,state=’OR’)

Select * where se.cnty=’US’ and se.st=’OR’

Insert overwrite table employees

Partition (country=’US’ ,state=’ CA’)

Select * where se.cnty=’US’ and se.st=’CA’

Insert overwrite table employees

Partition (country=’US’ ,state=’ IL’)

Select * where se.cnty=’US’ and se.st=’IL’

注意：这里从staged_employees表中读取的每条记录都会经过一条select……where ^句子进行判断。如果某条记录满足某个select……where……语句的话，那么这条记录就会被写入到指定的表和分区中。

动态分区插入

--hive基于查询参数推断出需要创建的分区

Hive.exec.dynamic.partition=true 表示开启动态分区功能

Hive.exec.dynamic.partition.mode=strict

(strict”严格模式”，要求至少有一列分区字段是静态的。设置成nonstrict表示允许所有分区都是动态的)

--根据select语句中最后2列来确定分区字段country和state的值。如果共有100个国家和州的话，就会生成100个分区

Insert overwrite table employees partition(country,state)

Select …，se.cnty,se.st

From staged_employees se;

混合使用动态和静态分区

静态分区：分区字段指定值

动态分区：分区字段的值由查询结果决定

Insert overwrite table employees

Partition(country=’US’,state)

Select …，se.cnty,se.st

Fromstaged_employees se

Where se.cnty=’US’;

单个查询语句中创建表并加载数据

--从宽表中提取部分数据到字表中

Create table ca_employees

As select name,salary,address

Fromemployees

Wherese.state=’CA’;

导出数据：

--直接复制数据保存目录文件到指定目录就行了

Hadoop fs -cp source_path target_path;

--也可以使用insert…directory…，将查询数据存放到指定目录

Insert overwrite local directory ‘/tmp/ca_employees’

Selectname,salary,address

From employees

Where se.state=’CA’;

指定多个输入文件夹目录存放数据：

From staged_employees se

Insert overwrite directory ‘/tmp/or_employees’

Select* where se.cty=’US’ and se.st=’OR’

Insert overwrite directory ‘/tmp/ca_employees’

Select* where se.cty=’US’ and se.st=’CA’

Insert overwrite directory ‘/tmp/il_employees’

Select* where se.cty=’US’ and se.st=’IL’

HiveQL:查询

Case … when … then句式

用于处理单个记录的多条件判断问题

Select name,salary,

Case

Whensalary<50000.0 then ‘low’

Whensalary>=50000.0 and salary<70000.0 then ‘middle’

Whensalary>=70000.0 andsalary<100000.0 then ‘high’

Else ‘very high’

Endas bracket from employees;

什么情况下hive可以避免进行mapreduce？

Hive中某些查询不用使用mapreduce，也就是本地模式。

Select * from employees;--直接读取employees对应的存储目录文件

Select * fromemployees --直接读取对于分区数据

Where country=’US’ and state=’CA’

Limit 100;

一般查询语句中包含group by的时候，会执行reduce过程。

设置开启额外的本地模式

Hive.exec.mode.local.auto=true

浮点数比较的陷阱：

一定要2边都是相同的浮点型，才能避免误差

Where deductionsp[‘Federal Taxes’] > cast(0.2 as Float);

如果是java中，支持0.2F表示float类型，但是HIVE中并不支持。

Like和RLike：

Select * from employees whereaddress.street like ‘%Ave.’;

Select * from employees whereaddress.street Rlike ‘.*(Chicage|Ontario).*’;

Join语句：

Hive只支持等值连接,Pig提供交叉查询功能，所以支持非等值的关联查询。

内连接inner join

Select * from stocks a join stocks b on

a.ymd=b.ymd

where a.symbol=’AAPL’ and b.symbol=’IBM’;

join优化：

1. 当对3个或者更多表进行JOIN连接时，如果每个on子句都使用相同的连接键，那么只会产生一个MapReduce job；

2. 保证连续查询的表的大小从左到右是依次增加的。

Hive在对每行记录进行连接操作时，它会尝试将其他表缓存起来，然后扫描最后那个表进行计算。

使用“标记”显示指出最大表

Select /*+STREAMTABLE(s)*/ s.ymd,s.symbol

From stocks s join dividends d

On s.ymd=d.ymd and s.symbol=d.symbol

Where s.symbol=’AAPL’;

3.小表载入内存中/*+mapjoin(d)*/

Hive>set hive.auto.convert.join=true --自动将小表缓存

Hive.mapjoin.smalltable.filesize=25000000 --判断是否是小表的阀值

左外连接left outer join

右边的没有对应项使用null

Select s.ymd,s,symbol,s.price_close

Fromstocks s left outer join dividends d on s.ymd=d.ymd and s.symbol=d.symbol

Wheres.symbol=’AAPL’

外连接查询中添加分区过滤条件：

使用嵌套select，先进行分区过滤查询，然后在进行外连接

Select s.ymd,s.symbol,s.price_close,d.dividend from

(select * from stocks wheresymbol=’AAPL’ and exchange=’NASDAQ’) s

Left outer join

(select * from dividends wheresymbol=’AAPL’ and exchange=’NASDAQ’) d

On s.ymd=d.ymd;

注意：on语句中的分区过滤条件在外连接（outer join）中是无效的，不过在内连接（inner join）中是有效的。

Right outer join 右连接查询

Full outer join 完全外连接查询

Hive中如何实现sql中的in…exists结构？

Hive中是不支持in和exists的，不过可以使用左半开连接（left semi-join）来实现。

--查询stocks表中ymd和symbol在dividends 表中的记录

Select s.ymd,s.symbol,s.price_close

From stocks s left semi join dividends d

On s.ymd=d.ymd and s.symbol=d.symbol;

注意：hive不支持右半开连接（right semi-join）

笛卡尔积join

注意：hive中不会对where条件查询自动优化为内连接查询

Select * from stocks join dividends

Where stock.symbol=dividends.symbol andstock.symbol=’AAPL’;

这个查询会先执行where语句前的查询，即先进行笛卡尔积计算，然后进行where条件判断。

Hive.mapred.mode=strict --严格模式，禁止笛卡尔积

内连接查询中针对小表的优化？

将小表的数据集全部缓存到内存中

Hive0.7前

Select /*+ mapjoin(d)*/ s.ymd,s.symbol,s.price_close,d.dividend

Fromstocks s join dividends d on s.ymd=d.ymd and s.symbol=d.symbol

Wheres.symbol=’AAPL’;

Hive0.7后

Hive>set hive.auto.convert.join=true --自动将小表缓存

Hive.mapjoin.smalltable.filesize=25000000 --判断是否是小表的阀值

Order by和sort by的区别：

Order by：全局排序

Sort by：局部排序，只会在每个reducer中对数据进行排序

Asc 表示升序排序（默认）

Desc 表示降序排序

Select s.ymd,s.symbol,s.price_close

From stock s

Order by s.ymd asc,s.symbol desc;

注意：由于order by全局排序会导致运行时间过长，如果属性hive.mapred.mode=strict的话，那么hive要求这样的语句必须加有limit语句进行限制。

Distribute by

作用：控制map的输出在reducer中是如何划分的。

--使用distribute by保证具有相同股票交易码的记录会分发到同一个reducer中进行处理

--全局排序

Select s.ymd,s.symbol,s.price_close

From stock s

Distribute by s.symbol

Sort by s.symbol ASC,s.ymd ASC

Cluster by

能实现全局排序，等同于Distribute by …sort by,相当于一种简写形式，

--按照symbol字段全局排序，速度应该比直接使用order by快

Select s.ymd,s.symbol,s.price_close

From stock s

Cluster by s.symbol;

全局排序的3种实现？

1. order by

2. Distribute by …sort by

3. Cluster by

类型转换？

Case(value as type)

其中binary类型只能转换为String类型

Select (2.0*cast(cast(b as string) as double)) from src;

抽样查询

Select * from numbers tablesample(bucket 3 out of 10 on rand()) s;

Union all

将2个或多个表进行合并

Select log.ymd,log.level,log.message

From (

Select l1.ymd,l1.level,l1.message,’Log1’ as source

From log1 l1

Union all

Select l2.ymd,l2.level,l2.message,’Log2’ as source

From log2 l2

)log

Sort by log.ymd asc;

第七章 HiveQL：视图

Hive不支持物理视图，hive中视图只是一个逻辑结构。

Hive先执行这个视图，然后使用这个结果进行余下的查询。

作用：

1. 降低查询的负责度

2. 通过创建视图限制数据访问

Create view shorter_join as

Select * from people join cart

On (cart.people_id=people.id)

Where firstname=’john’;

查询语句简化为：

Select lastname from shorter_join where id=3;

创建视图时也能使用if not exists和comment子句

Create view if not exists shipments(time,part)

Comment ‘time and parts for shipments’

Tblproperties(‘creator’=’me’)

As select …;

删除视图：

Delete view if exists shipments;

查看视图：

Show tables;--注意，没有show view

第八章 HiveQL:索引

Hive只有有限的索引功能。Hive中一张表的索引存储在另一张表中。

1. 提高查询速度

剪裁掉一张表的部分数据块，减少MapReduce的输入数据量。

通过explain命令可以查看某个查询语句是否用到了索引。

--创建索引

Create index employees_index

Ontable employees(country)

As ‘org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’ --索引处理器

With deferred rebuild

Idxproperties(‘creator’=’me’,’created_at’=’some_time’) --索引额外描述信息

In table employees_index_table --索引信息存储的表

Partitioned by (country,name) --索引包含的分区

Comment ‘Employees indexed by country and name’;

--创建内置的Bitmao索引

Create index employees_index

Ontable employees(country)

As ‘BITMAP’ --索引处理器

With deferred rebuild

Idxproperties(‘creator’=’me’,’created_at’=’some_time’) --索引额外描述信息

In table employees_index_table --索引信息存储的表

Partitioned by (country,name) --索引包含的分区

Comment ‘Employees indexed by country and name’;

--重建索引

Alter index employees_index

On table employees

Partition(country=’US’)

Rebuild;

--显示索引

Show formatted index on employees;

--删除索引

Drop index if exists employees_index on table employees;

第10章调优

1.使用explain查看执行计划

Hive>Explain select sum(number) from onecol;

2.使用explain extended 查看更多信息

3.限制调整

主要是对limit的一些控制优化。

Hive.limit.optimize.enable=true --当开启后，使用limit会对源数据进行抽样

4.join优化

大表放在join语句的最右边或使用/*+STREAMTABLE(s)*/显示指出大表

小表载入内存中/*+mapjoin(d)*/

Hive>set hive.auto.convert.join=true --自动将小表缓存

Hive.mapjoin.smalltable.filesize=25000000 --判断是否是小表的阀值

5.本地模式

Set hive.exec.mode.local.auto=true

不进行MapReduce计算

6.并行执行

特定job包含众多的阶段，有些阶段是可以并行执行的，那么job可能越快完成

Hive.exec.parallel=true

7.严格模式

Hive.mapred.mode=strict

禁止3中查询：

Ø 对于分区表，用户不允许扫描所有分区

Ø 对于使用了order by语句的查询，要求必须使用limit语句

Ø 限制笛卡尔积

8.调整mapper和reducer个数

9.JVM重用

Hadoop的默认配置通常是使用派生JVM来执行map和reduce任务的。

而启用JVM重用可以使得JVM实例在同一个job中重新使用N次。

Mapred.job.reuser.jvm.num.tasks=10

10.索引

11.动态分区调整

开启动态分区的“严格模式”

Hive.exec.dynamic.partition.mode=strict

12.开启中间压缩

Hive.exec.compress.intermediate=true

13.开启最终输出结果压缩

Hive.exec.compress.output=true --开启输入结果压缩

Mapred.output.compression.codec=org.apache.hadoop.id.compress.GzipCodec–指定压缩格式

第18章安全和授权

Hive.security.authorization.enabled=true --开启授权

Hive.security.authorization.createtable.owner.grants=All --文件的创建者具有所有权限

--授权，给用户Edward授予创建表的权限

Hive> Grant create ondatabase default to user Edward;

--查看权限

Hive>show grant user Edward on database default;

--使用revoke取消权限

Hive> revoke create ondatabase default to user Edward;

权限列表：

All 赋予所有的权限

Alter 有修改表结构的权限

Create 有创建表的权限

Drop 有删除表或表的分区的权限

Index 有创建索引的权限

Lock 开启并发后，锁定和解锁表的权限

Select 查询表或者分区中数据的权限

Show_database 查看所有数据库的权限

Update 向表或者分区中插入或加载数据的权限

自动授权：

Hive.security.authorization.createtable.owner.grants=select,drop

设置创建表的用户自动授予对这张表的查询和删除的权限

在创建表时自动授予指定用户指定的权限：

--hive管理员admin1和Edward用户默认授予所有表的读权限，user1只有创建的权限

<name>Hive.security.authorization.createtable.owner.grants</name>

<value>admin1,Edward:select;user1:create</value>

</property>

第19章锁

Hive缺少update和insert类型的查询中用到的对于列、行或者查询级别的锁支持。

多用户操作时，锁和协调时非常有用的。

Hive的锁机制需要Zookeeper的支持。

Hive.zookeeper.quorum --配置Zookeeper集群所有机器的ip

Hive.support.concurrency=true --开启hive的锁机制

--查看锁

Show locks；

--对表people创建一个显示锁

Hive> lock table people exclusive;

--解锁

Hive> unlock table people;

斗者_2013

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
打赏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录