Frequently Used Hive Commands

HDP_CDH

Category: Hadoop ecosystem. Tags: hive

Article link: https://blog.csdn.net/Java_Hadoop/article/details/84335368

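1. Force-drop a Hive database that still contains tables:

drop database database_name cascade;

2. Show the current Hive database in the CLI prompt:

hive> set hive.cli.print.current.db=true;

Query the current database: select current_database();

3. Set the execution queue from the Hive client (Tez):

hive> set tez.queue.name=mr;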

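4. Drop partitions in Hive:

alter table fastdo_lte.tdlte_mro_locate_hour drop if exists partition(city="HUZHOU",sdate=2019012609);

TRUNCATE cannot be used to delete data from an external table, because the data of an external table is managed in HDFS rather than by Hive.

Drop a range of date partitions in one statement: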

alter table cdr.cdr_mdt_h drop if exists partition(day>="20190509",day<="20190511");
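If the range form above is not accepted by a particular Hive version (an assumption worth testing on your cluster), one fallback is to generate one statement per day. The sketch below is a hypothetical Python helper, not part of Hive; it only builds the statement text and leaves running it to the Hive client.

```python
from datetime import date, timedelta


def drop_partition_statements(table, column, start, end):
    """Build one ALTER TABLE ... DROP PARTITION statement per day in [start, end]."""
    statements = []
    day = start
    while day <= end:
        value = day.strftime("%Y%m%d")  # same yyyyMMdd format as the day partition
        statements.append(
            f"alter table {table} drop if exists partition({column}='{value}');"
        )
        day += timedelta(days=1)
    return statements


stmts = drop_partition_statements(
    "cdr.cdr_mdt_h", "day", date(2019, 5, 9), date(2019, 5, 11)
)
for stmt in stmts:
    print(stmt)
```

The printed statements can be saved to a file and executed with the Hive client.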

4-1. Rename an existing partition in Hive:

alter table wxwzzx_hive_db.DIM_CFG_GSM_DIST_STATA partition(type_n='newest',slicetime=20200424000000) RENAME TO PARTITION (type_n='old',slicetime=20200424000000);

5. Load data into a Hive partition:

From the local file system:

load data local inpath '/home/daxin/jdata/cfg_grid10.csv' into table cfg_grid10 partition(ds="20180821",ct="HANGZHOU");

From HDFS:

load data inpath '/user/hadoop/cfg_grid10.csv' into table cfg_grid10 partition(ds="20180821",ct="HANGZHOU");

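6. Show a table's structure in Hive:

desc tdlte_mro_locate_hour;

desc formatted tdlte_mro_locate_hour; (full, formatted details)

7. List Hive functions and show their usage:

show functions; (list the functions)

desc function extended upper; (show the usage of a function)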

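8. Write a custom UDF in Hive (one value in, one value out):

8.1 Extend the UDF class and implement an evaluate method:

import org.apache.hadoop.hive.ql.exec.UDF;

// Returns 1 if rawWords contains targetWords, otherwise 0.
public class FindSpecifiedWords extends UDF {
    public int evaluate(String targetWords, String rawWords) {
        // Treat NULL inputs as "not found" to avoid a NullPointerException.
        if (targetWords == null || rawWords == null) {
            return 0;
        }
        return rawWords.contains(targetWords) ? 1 : 0;
    }
}

8.2 Package the class as a jar and upload it to Hive's lib directory or another directory.

8.3 Register the jar with add jar:

add jar /var/lib/hadoop-hdfs/sqoop_b/udf_jar/udf_stringfilter.jar;

Create a temporary function:

create temporary function stringfilterEmoji as 'com.hy.StringFilterEmoji';

Create a permanent function:

create function stringfilterEmoji as 'com.hy.StringFilterEmoji' using jar 'hdfs:///user/hive/warehouse/ods.db/udf_jar/udf_stringfilter.jar';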

9. Common Hive command-line options:

hive -help (show the command usage)

10. Browse HDFS from the Hive client:

!hdfs dfs -ls /;

For example: !hdfs dfs -ls /apps/hive/warehouse/fastdo_lte.db;

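11. Different ways to create a table in Hive:

Create a table from a query on the source table (CTAS); working on the extracted subset can speed up data processing:

create table if not exists wang_mro_locate_hour as select * from fastdo_lte.tdlte_mro_locate_hour limit 10000;

Create a table with the same structure as an existing table; only the schema is copied, not the data: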

create table if not exists wang_mro_locate_hour like fastdo_lte.tdlte_mro_locate_hour;

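12. Data put directly into a partition directory of a managed (internal) table is not returned by SQL queries; how to fix it:

Method 1: the partition has no metadata in the MySQL metastore, so SQL queries return nothing for it.

dfs -mkdir -p /user/hive/warehouse/dept_part/day=20150913 ;

dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150913 ;

Repairing the table adds the partition information to the metastore; afterwards queries work normally.

hive (default)> msck repair table dept_part ;

MSCK REPAIR TABLE can process partitions in batches to avoid an out-of-memory error (OOME). Set the property hive.msck.repair.batch.size to the batch size and the repair runs in batches internally. The default value of the property is zero, which means all partitions are processed at once.

Method 2: again the partition has no metadata in the MySQL metastore, so SQL queries return nothing for it.

dfs -mkdir -p /user/hive/warehouse/dept_part/day=20150914 ;

dfs -put /opt/datas/dept.txt /user/hive/warehouse/dept_part/day=20150914 ;

Add the partition to the metastore manually; afterwards queries work normally.

alter table dept_part add if not exists partition(day='20150914');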
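When many partition directories were created by hand, the statement for the second method can be derived from the directory path. The following is a minimal sketch under the assumption that directories follow the key=value naming shown above; add_partition_statement is a hypothetical helper name, and it only builds the statement text.

```python
def add_partition_statement(table, partition_dir):
    """Turn a path such as .../dept_part/day=20150914 into an ADD PARTITION statement."""
    specs = []
    for segment in partition_dir.strip("/").split("/"):
        if "=" in segment:
            # Each key=value path segment is one level of the partition spec.
            key, value = segment.split("=", 1)
            specs.append(f"{key}='{value}'")
    if not specs:
        raise ValueError(f"no key=value segment in {partition_dir!r}")
    return f"alter table {table} add if not exists partition({','.join(specs)});"


stmt = add_partition_statement(
    "dept_part", "/user/hive/warehouse/dept_part/day=20150914"
)
print(stmt)
```

Multi-level partitions work the same way, because every key=value segment of the path becomes one entry in the partition spec.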

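13. Using group by and having in Hive:

With group by / having, every column in the select list must be either a group by column or an aggregate function.

* where filters individual rows

* having filters the grouped results and is usually used together with group by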

select city, sdate,count(1) from tdlte_mro_locate_hour group by city,sdate;
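The difference between where and having can be tried out without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made purely for illustration; the sample rows are invented), since both follow the same rule for these two clauses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tdlte_mro_locate_hour (city text, sdate text)")
rows = [
    ("HUZHOU", "2019012609"),
    ("HUZHOU", "2019012609"),
    ("HUZHOU", "2019012610"),
    ("JINHUA", "2019012609"),
]
conn.executemany("insert into tdlte_mro_locate_hour values (?, ?)", rows)

# where removes individual rows before the groups are formed
where_result = conn.execute(
    "select city, sdate, count(1) from tdlte_mro_locate_hour "
    "where city = 'HUZHOU' group by city, sdate order by city, sdate"
).fetchall()

# having removes whole groups after the aggregate has been computed
having_result = conn.execute(
    "select city, sdate, count(1) from tdlte_mro_locate_hour "
    "group by city, sdate having count(1) > 1 order by city, sdate"
).fetchall()

print(where_result)
print(having_result)
```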

14. Show column names without the table-name prefix in Hive client query results:

set hive.resultset.use.unique.column.names=false;

set hive.cli.print.header=true;

15. Restore partitions after migrating the data of a multi-level partitioned external table:

alter table tdlte_mro_locate_hour_old add if not exists partition(city='JINHUA',sdate='2018100108') location '/apps/hive/warehouse/fastdo_lte.db/tdlte_mro_locate_hour_old/JINHUA/2018100108';

16. Re-link an external table to its data after the data directory has been moved:

alter table tdlte_mro_locate_hour_old set location '/apps/hive/warehouse/fastdo_lte.db/tdlte_mro_locate_hour_old';

17. Run SQL statements from a shell script:

#!/bin/bash

hive<<EOF

TRUNCATE TABLE data.fact_teacher_info_stunum;

EOF

18. Set the Hive execution engine:

1. Use the MapReduce engine

set hive.execution.engine=mr;

2. Use the Spark engine

set hive.execution.engine=spark;

3. Use the Tez engine

set hive.execution.engine=tez;

19. Hive query tuning:

1) Set the number of reduce tasks:

-- adjust to the data volume
set mapred.reduce.tasks=1000;

-- amount of data handled by each reducer (about 500 MB)
set hive.exec.reducers.bytes.per.reducer=500000000;

-- start reduce tasks once 80% of the map tasks have completed
set mapreduce.job.reduce.slowstart.completedmaps=0.8;
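How the two reducer settings interact can be approximated with a small calculation. This is a rough sketch, assuming Hive estimates the reducer count as the input size divided by hive.exec.reducers.bytes.per.reducer, capped by hive.exec.reducers.max, and that an explicit mapred.reduce.tasks overrides the estimate; the cap of 1009 is an assumed value, so check the setting on your cluster.

```python
import math


def estimate_reducers(input_bytes, bytes_per_reducer=500_000_000,
                      max_reducers=1009, fixed_reducers=-1):
    """Approximate the number of reducers Hive would launch for one job."""
    if fixed_reducers > 0:
        # mapred.reduce.tasks set explicitly: the estimate is not used.
        return fixed_reducers
    estimated = math.ceil(input_bytes / bytes_per_reducer)
    # At least one reducer, at most the configured cap (assumed here).
    return max(1, min(max_reducers, estimated))


print(estimate_reducers(10_000_000_000))  # 10 GB of input, 500 MB per reducer
```

A larger bytes-per-reducer value therefore means fewer, heavier reducers, while a fixed reducer count ignores the data volume entirely.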

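2) Reduce the number of map tasks:

Hadoop divides the input files into splits and processes the splits in parallel. The number of map tasks depends on the number of splits, so on Tez the number of maps can be adjusted through the lower and upper bounds of the split grouping size:

-- default 52428800
set tez.grouping.min-size=1073741824;

-- default 1073741824
set tez.grouping.max-size=2147483648;

3) Adjust the number of map tasks:

mapred.max.split.size: the maximum size of a split, i.e. the maximum amount of input handled by each map task.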

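20. Export data from Hive with a specified delimiter:

set mapred.reduce.tasks=n;

This controls the number of output files.

Append distribute by rand() to the end of the SQL so that rows are spread evenly over the reducers and the result files have roughly the same size.

1) Export to the local file system (with the local keyword the target must be a directory on the local file system; the path below is only an example):

> insert overwrite local directory '/tmp/HUIYONG_DATA'

> row format delimited

> fields terminated by ',' -- use a comma as the field delimiter

> select * from tdlte_mro_locate_hour;

2) Export to HDFS:

> insert overwrite directory 'hdfs://ns2/jc_wxwzzx/aa_data_dir/HUIYONG_DATA'

> row format delimited

> fields terminated by ',' -- use a comma as the field delimiter

> select * from tdlte_mro_locate_hour;

Merge all files under an HDFS directory into one local file: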

hdfs dfs -getmerge /user/hadoop/output local_file

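21. Change a column's type or add columns in Hive:

1) Change a column's type:

alter table table_name change column old_column_name new_column_name data_type;

2) Add columns (new columns are appended at the end of the table):

alter table table_name add columns (column_name data_type) cascade;

Add a column (not yet verified):

alter table test.table_add_column_test add columns (added_column string comment 'newly added column') cascade;

Then reposition the column. Note: the cascade keyword is required; without it the metadata of existing partitions is not updated, whereas cascade propagates the change to the partition metadata.

alter table test.table_add_column_test change column added_column added_column string after original_column1 cascade;

3) Use REPLACE COLUMNS to redefine the whole column list, which allows a new column at any position:

ALTER TABLE test_change REPLACE COLUMNS (a int, b int);

4) Change the field delimiter in a Hive table definition: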

alter table test01 set serdeproperties('field.delim'='\t');

alter table test_wang set tblproperties( 'field.delim'='@#');

22. Create a Hive database at a specified location:

create database fastdo_lte location '/data/fastdo_lte';

23. Errors during Hive partition repair:

If msck repair table table_name fails while repairing partitions, set:

set hive.msck.path.validation=ignore;

24. Create a Parquet table in Hive:

create table table_name (....)

stored as parquet

location 'hdfs://wbdspark02cluster/data/result/mr';

Fix for a Parquet table whose queries return NULL:

create table table_name (....)

ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'

with serdeproperties ('parquet.column.index.access'='true')

stored as parquet

location 'hdfs://wbdspark02cluster/data/result/mr/mro';

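25. Locate the underlying data file, or the position inside it, for rows returned by a Hive query:

Hive provides three virtual columns that can be selected like ordinary columns:

1. INPUT__FILE__NAME: the full path of the file read by the map task.

2. BLOCK__OFFSET__INSIDE__FILE: for RCFile or block-compressed SequenceFile files, the block's file offset, i.e. the offset of the first byte of the current block in the file; for TextFile, the offset of the first byte of the current row in the file.

3. ROW__OFFSET__INSIDE__BLOCK: the row number for RCFile and SequenceFile; 0 for TextFile.

Note: to display ROW__OFFSET__INSIDE__BLOCK, first run set hive.exec.rowoffset=true;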
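For a TextFile table, the value described for BLOCK__OFFSET__INSIDE__FILE is simply the byte position at which each row starts. The sketch below reproduces that calculation on an in-memory sample so the numbers are easy to check by hand; row_offsets and the sample rows are made up for illustration and do not read any Hive data.

```python
def row_offsets(data: bytes):
    """Return (offset, row) pairs; offset is the byte position of the row's first byte."""
    offsets = []
    position = 0
    for line in data.splitlines(keepends=True):
        offsets.append((position, line.rstrip(b"\r\n").decode("utf-8")))
        position += len(line)  # the newline byte counts towards the next offset
    return offsets


sample = "10,ACCOUNTING\n20,RESEARCH\n30,SALES\n".encode("utf-8")
offsets = row_offsets(sample)
for offset, row in offsets:
    print(offset, row)
```

Because the offsets are counted in bytes, rows containing multi-byte characters shift the following offsets by more than their character count.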

26. Select all columns except certain ones in Hive:

set hive.support.quoted.identifiers=none;

-- select all columns except day, hour and city

select `(day|hour|city)?+.+` from tdlte_mro_locate_hour limit 1;
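Why this pattern excludes exactly those three columns is easier to see by running it against a list of names. The sketch assumes Hive matches the pattern against each whole column name with Java regex semantics, where ?+ is a possessive quantifier that never gives back what it matched. Python's re module does not support possessive quantifiers in every version, so the sketch emulates one with a lookahead plus a backreference; the column names are invented for the demonstration.

```python
import re

# Emulates the Java pattern (day|hour|city)?+.+ :
# the lookahead captures the optional prefix once and cannot be re-entered,
# the backreference \1 consumes it, and .+ still needs at least one more character.
PATTERN = re.compile(r"(?=((?:day|hour|city)?))\1.+")


def selected_columns(columns):
    """Keep the column names that the whole pattern matches."""
    return [name for name in columns if PATTERN.fullmatch(name)]


columns = ["day", "hour", "city", "day_id", "cell_id", "rsrp"]
kept = selected_columns(columns)
print(kept)
```

A name such as day_id is kept because characters remain after the prefix, while day alone leaves nothing for .+ to match.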

27. Connect to Hive with beeline

HiveServer2 host: 192.168.124.9

HiveServer2 port: 10000

beeline -u "jdbc:hive2://192.168.124.9:10000/"

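28. Skip the first lines of a file when creating a Hive table

The setting applies to every data file loaded into the table.

tblproperties(

"skip.header.line.count"="n", -- skip the first n lines of the file

"skip.footer.line.count"="n" -- skip the last n lines of the file

)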
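The effect of the two properties on a single file can be pictured as slicing its lines. This is a simplified model with made-up sample lines; visible_rows is a hypothetical helper and does not reflect how Hive implements the feature internally.

```python
def visible_rows(lines, skip_header=0, skip_footer=0):
    """Return the lines of one file that remain after skipping header and footer lines."""
    end = len(lines) - skip_footer
    # Guard against skipping more lines than the file contains.
    return lines[skip_header:max(end, skip_header)]


lines = ["id,name", "1,a", "2,b", "total=2"]
rows = visible_rows(lines, skip_header=1, skip_footer=1)
print(rows)
```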

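29. Change a table's file storage format in Hive

ALTER TABLE tablename SET FILEFORMAT textfile; (textfile is the new file format)

30. Fuzzy-match table names in Hive

show tables like 'test*';

This matches table names by pattern; the like keyword can be omitted here.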

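31. Ways to get a map-side join:

Before Hive 0.7, a map join ran only when the hint /*+ mapjoin(table) */ was given; otherwise a common join was executed.

Since 0.7, a join can be converted to a map join automatically. The conversion is controlled by hive.auto.convert.join, which defaults to true since Hive 0.11.

1. In SQL, add the mapjoin hint to the statement:

Syntax: select /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value from smallTable JOIN bigTable ON smallTable.key = bigTable.key;

2. Enable automatic map joins through the following settings:

set hive.auto.convert.join = true; (when true, Hive decides from the table sizes whether one side is a small table; if so it is loaded into memory and a map join is used.) Related parameters:

hive.mapjoin.smalltable.filesize: threshold for telling small tables from big ones; a table smaller than this value is treated as a small table and loaded into memory.

hive.ignore.mapjoin.hint: default true; whether to ignore the mapjoin hint.

hive.auto.convert.join.noconditionaltask: default true; when common joins are converted to map joins, whether several map joins are merged into one.

hive.auto.convert.join.noconditionaltask.size: when several map joins are merged into one, the maximum total size of the small tables.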
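The idea behind a map-side join is that the small table fits in memory as a hash table, so the big table can be streamed once and no shuffle to reducers is needed. The following is a conceptual sketch in plain Python with invented sample data, not Hive's implementation.

```python
def map_join(small_table, big_table):
    """Inner join on key: hash the small table, then stream the big table past it."""
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)  # built once, held in memory

    joined = []
    for key, value in big_table:  # each big-table row is probed independently
        for small_value in lookup.get(key, []):
            joined.append((key, small_value, value))
    return joined


small = [(1, "HUZHOU"), (2, "JINHUA")]
big = [(1, "r1"), (2, "r2"), (1, "r3"), (3, "r4")]
result = map_join(small, big)
print(result)
```

Since every big-table row is handled on its own, each map task can do this work locally, which is why the size thresholds above matter: the hash table has to fit in the memory of every task.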

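32. Hive parameter tuning

-- let queries that do not need a MapReduce job run without one

hive> set hive.fetch.task.conversion=more;

-- enable parallel execution of jobs

set hive.exec.parallel=true;

-- Explanation: when one SQL statement produces several jobs that do not depend on each other, they can run in parallel instead of one after another (typically when union all is used)

-- maximum number of threads for parallel jobs within one SQL statement

set hive.exec.parallel.thread.number=8;

-- enable JVM reuse

-- JVM reuse has a large effect on Hive performance, especially when small files are hard to avoid or when there are very many tasks, most of which run only briefly. Starting a JVM can be a considerable overhead, particularly for jobs with thousands of tasks.

set mapred.job.reuse.jvm.num.tasks=10;

-- choose a reasonable number of reducers

-- Method 1: adjust the amount of data each reducer receives (about 500 MB here)

set hive.exec.reducers.bytes.per.reducer=500000000;

-- Method 2: set the number of reducers directly

set mapred.reduce.tasks=20;

-- aggregate on the map side to reduce the data sent to the reducers

set hive.map.aggr=true;

-- enable Hive's built-in optimization for data skew

set hive.groupby.skewindata=true;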
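With hive.groupby.skewindata enabled, a group by is planned as two stages: rows are first spread randomly over the reducers for partial aggregation, and the partial results are then aggregated by key. The simulation below illustrates that idea with a count over invented keys; the reducer count and seed are arbitrary assumptions for the demonstration.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers=4, seed=0):
    """Simulate a skew-tolerant count: random partial aggregation, then a final merge."""
    rng = random.Random(seed)
    # Stage 1: rows go to reducers at random, so a hot key is spread across them.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: the much smaller partial counts are combined by key.
    final = Counter()
    for partial in partials:
        final.update(partial)
    return partials, final


keys = ["hot"] * 1000 + ["a", "b"]
partials, final = two_stage_count(keys)
print([partial["hot"] for partial in partials])
print(final)
```

The final counts are the same as for a single-stage aggregation; the gain is that no single reducer has to process all rows of the hot key.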

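33. Merging small files

Small files arise in three places: map input, map output, and reduce output. Too many small files also hurt the efficiency of Hive queries.

Merge small files on the map input side:

set mapred.max.split.size=256000000;

-- minimum size of a split on one node (determines whether files on several DataNodes need to be merged)

set mapred.min.split.size.per.node=100000000;

-- minimum size of a split within one rack (determines whether files on several racks need to be merged)

set mapred.min.split.size.per.rack=100000000;

-- merge small files before the map stage runs

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Parameters for merging map output and reduce output:

-- merge the output of map-only jobs; default true

set hive.merge.mapfiles=true;

-- merge the output of reduce tasks; default false

set hive.merge.mapredfiles=true;

-- target size of the merged files

set hive.merge.size.per.task=256000000;

-- when the average size of the output files is below this value, start a separate MapReduce job to merge the files

set hive.merge.smallfiles.avgsize=16000000;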
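The trigger described for hive.merge.smallfiles.avgsize is an average-size check over the output files of a job. A minimal sketch of that decision, with invented file sizes and a hypothetical helper name:

```python
def needs_merge_job(file_sizes, avg_size_threshold=16_000_000):
    """Return True when the average output file size is below the threshold."""
    if not file_sizes:
        return False  # nothing was written, so there is nothing to merge
    return sum(file_sizes) / len(file_sizes) < avg_size_threshold


many_small = [1_000_000] * 50             # fifty files of about 1 MB each
few_large = [256_000_000, 200_000_000]    # two large files
print(needs_merge_job(many_small), needs_merge_job(few_large))
```

Because the check uses the average, a few large files can hide many tiny ones, which is worth keeping in mind when choosing the threshold.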
