20210109_Hive study notes

yehaver

Article link: https://blog.csdn.net/yehaver/article/details/117522952

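I. Hive installation and configuration
1. Download Hive (apache-hive-3.1.2-bin.tar.gz). Hadoop and MySQL must already be installed and running.
2. Upload the archive to the Linux host with sftp and extract it: tar -xzvf apache-hive-3.1.2-bin.tar.gz
3. Edit the configuration files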
mv /root/apache-hive-3.1.2-bin /root/hive-3.1.2
mv /root/hive-3.1.2/conf/hive-env.sh.template /root/hive-3.1.2/conf/hive-env.sh
vim /root/hive-3.1.2/conf/hive-env.sh
HADOOP_HOME=/root/hadoop-2.10.1
HIVE_CONF_DIR=/root/hive-3.1.2/conf
HIVE_AUX_JARS_PATH=/root/hive-3.1.2/lib
4. Configure environment variables and the log directory
vim /root/.bash_profile
export HIVE_HOME=/root/hive-3.1.2
export PATH=$HIVE_HOME/bin:$PATH
source /root/.bash_profile
cp /root/hive-3.1.2/conf/hive-log4j2.properties.template /root/hive-3.1.2/conf/hive-log4j2.properties
vim /root/hive-3.1.2/conf/hive-log4j2.properties
property.hive.log.dir = /root/hive-3.1.2/logs # change the log directory
mkdir /root/hive-3.1.2/logs
5. Initialize the metastore schema. If you are not using the Derby database, skip straight to step 7.
cd /root/hive-3.1.2/bin
./schematool -dbType derby -initSchema
6. Run the hive command to start working in the Hive CLI.

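7. To use MySQL instead of the single-user Derby database, install MySQL first, then download mysql-connector-java-8.0.22.tar.gz (the connector version must match the MySQL version), extract it, and upload mysql-connector-java-8.0.22.jar to the Hive server: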
mv /root/mysql-connector-java-8.0.22.jar /root/hive-3.1.2/lib/
8. Edit the configuration file
touch /root/hive-3.1.2/conf/hive-site.xml
vim /root/hive-3.1.2/conf/hive-site.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--><configuration>




   <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://master:3306/metastore?createDatabaseIfNotExist=true</value>
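   <description>Running schematool -dbType mysql -initSchema creates the metastore database in MySQL. If initialization fails, drop that database before running the command again. The database name can be changed to any name not already used in MySQL.</description>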
   </property>
   <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.cj.jdbc.Driver</value>
   <description>The MySQL driver jar must be uploaded to Hive's lib directory.</description>
   </property>
   <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>root</value>
   </property>
   <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>yehaver</value>
   </property>
   <property>
   <name>hive.metastore.schema.verification</name>
   <value>false</value>
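   <description>Enforces metastore schema consistency. When true, Hive checks that the schema version stored in the metastore matches the version of the Hive jars and disables automatic schema migration, so the schema has to be migrated manually when Hive is upgraded. When false, a version mismatch only produces a warning. Older releases defaulted to false; Hive 2.0 and later are documented as defaulting to true, so the value is set explicitly here.</description>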
   </property>
</configuration>
9. Re-initialize the metastore database
schematool -dbType mysql -initSchema
10. Run hive to enter the client.

Startup problem 1:
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary
Fix: edit hive-site.xml (vim hive-site.xml) and use the new driver class:
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>

Startup problem 2:
InnoDB is limited to row-logging when transaction isolation level is READ COMMITTED or READ UNCOMMITTED
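Cause: with InnoDB at isolation level READ-COMMITTED, statement-based binary logging (binlog format STATEMENT) is unsafe.
The write to the binary log fails because BINLOG_FORMAT is STATEMENT while one or more tables use a storage engine restricted to row-based logging.
InnoDB allows only row-based logging when the isolation level is READ COMMITTED or READ UNCOMMITTED.
Connect to MySQL and run: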
SET GLOBAL binlog_format = 'MIXED';
To keep the error from coming back after a restart, edit /etc/my.cnf and set binlog-format=MIXED (the binary log format).

While a session is inside hive, jps shows the following process:
RunJar

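II. Using Hive commands. Most of them work as in MySQL. Inside hive you can run any HDFS command, e.g. dfs -ls / (just drop the leading hadoop). You can also run Linux commands by prefixing them with an exclamation mark, e.g. ! ls /
The root user's home directory also contains .hivehistory, which records every command executed in hive.
show databases like '*hive*'; # the wildcard is % in MySQL but * in Hive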
use default;
create table student(id int,name varchar(20))row format delimited fields terminated by '\t';
insert into student values(100,'yangweiming');
show tables;
show create table table_name;
desc formatted table_name;
select * from student;
select count(1) from student;
select cast('1' as int); # explicit type conversion
load data local inpath '/root/import_hive.txt' into table student; # load data from a local file
hdfs dfs -put /root/import_put.txt /user/hive/warehouse/student2 # run from the Linux shell; same effect as load data above, check with select * from student2;
set mapred.reduce.tasks; # show the current value of a parameter
set mapred.reduce.tasks=20; # set a parameter for the current session
create database if not exists hive location '/user/hive/warehose/'; # create a database
drop database hive cascade; # cascade is required if the database still contains tables

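From the Linux shell, hive -e runs a SQL statement directly, which is convenient in shell scripts:
hive -e "select * from student;" > /root/data/my.txt
From the Linux shell, hive -f runs a SQL file:
hive -f /root/hive_test.sql
# parameters for the session can be set when starting hive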
hive -hiveconf mapred.reduce.tasks=10

III. Accessing Hive over JDBC
1. Start the Hive service for external clients
cd /root/hive-3.1.2/bin
./hiveserver2
2. Connect over JDBC with beeline
beeline
beeline> !connect jdbc:hive2://master:10000
Connecting to jdbc:hive2://master:10000
Enter username for jdbc:hive2://master:10000: root
Enter password for jdbc:hive2://master:10000: *******
21/01/07 14:33:01 [main]: WARN jdbc.HiveConnection: Failed to connect to master:10000
Error: Could not open client transport with JDBC Uri: jdbc:hive2://master:10000: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: root is not allowed to impersonate root (state=08S01,code=0)
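Add the following to /root/hadoop-2.10.1/etc/hadoop/core-site.xml. The error appears because I use root for every account, so Hadoop treats the connection as root impersonating root, which has to be allowed explicitly: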
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
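3. Once connected over JDBC, you can work with the Hadoop cluster just as in the hive CLI.

IV. Hive data types, DDL and DML operations; mostly the same as in MySQL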
tinyint
smallint
int
bigint
boolean
float
double
string # like varchar, but can hold up to 2 GB of data
timestamp
binary

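struct # collection type with named fields, roughly like a Python dict with a fixed set of keys
map # collection type of key-value pairs, like a Python dict
array # collection type holding an ordered sequence of elements, like a Python list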

1. Managed table (MANAGED_TABLE)
2. External table (EXTERNAL)
3. Partitioned table
4. Bucketed table
create external table if not exists hive.table_name2( # external creates an external table; without it the table is managed. Dropping an external table leaves its data in HDFS, so re-creating the table brings the data back, which makes it the safer choice
id int,
name string,
friends array<string>,
children map<string,int>,
address struct<street:string,city:string>
)
partitioned by (yr_mon string,biz_dt string) # partition columns; do not repeat them in the column list above
row format delimited fields terminated by ','
collection items terminated by '_'
map keys terminated by ';'
lines terminated by '\n'
stored as orc # storage format, see the Hadoop storage formats part of the advanced Hadoop notes; "stored as" has to come before "location"
location '/user/hive/warehose/table_name2'
tblproperties ('orc.compress'='none'); # ORC compresses by default; none turns compression off, snappy is another option
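
To see what the three delimiters in this DDL mean for a text-format table, here is a minimal Python sketch that parses one line the same way. The sample line and the name parse_line are made up for illustration, and the sketch assumes a plain-text data file (in an ORC table the delimiters do not describe the file contents).

```python
def parse_line(line):
    """Split one text row using ',' between fields, '_' between collection
    items and ';' between a map key and its value."""
    id_, name, friends, children, address = line.rstrip("\n").split(",")
    street, city = address.split("_")  # struct fields use the collection delimiter
    return {
        "id": int(id_),
        "name": name,
        "friends": friends.split("_"),
        "children": {k: int(v) for k, v in
                     (item.split(";") for item in children.split("_"))},
        "address": {"street": street, "city": city},
    }

# Hypothetical sample row: id, name, friends, children, address
row = parse_line("1,ann,bob_cat,tom;3_amy;5,main street_paris\n")
print(row)
```

Each value comes back in the shape of the corresponding Hive type: a list for array, a dict for map, and a dict with fixed keys for struct.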

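alter table hive.table_name2 set tblproperties('EXTERNAL'='TRUE'); # convert to an external table
alter table hive.table_name2 set tblproperties('EXTERNAL'='FALSE'); # convert to a managed (internal) table
alter table hive.table_name2 add partition(yr_mon='202103') partition(yr_mon='202102'); # when adding several partitions, separate them with spaces
alter table hive.table_name2 drop partition(yr_mon='202103'),partition(yr_mon='202102'); # when dropping several partitions, separate them with commas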

show partitions hive.table_name2;

load data local inpath '/root/import_hive.txt' into table hive.table_name partition(yr_mon='202012'); # load into one fixed partition
load data local inpath '/root/import_par.txt' into table hive.table_name2 partition(yr_mon='202101',biz_dt='20210101'); # load into a fixed partition and sub-partition
load data local inpath '/root/import_par.txt' overwrite into table hive.table_name2 partition(yr_mon='202101',biz_dt='20210101'); # same, but overwrite the existing data
insert into table hive.table_name partition(yr_mon) # dynamic partitioning: Hive routes each row to its partition by the value of the partition column
select id,name,yr_mon from emp; # the partition column yr_mon must be the last column selected
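
The routing done by a dynamic-partition insert can be pictured with a small Python sketch. It only models the idea (last selected column decides the partition directory); the function name route_to_partitions, the sample rows and the warehouse path are assumptions for illustration.

```python
from collections import defaultdict

def route_to_partitions(rows, table_dir):
    """Group rows by their last field, the way a dynamic-partition insert
    uses the last selected column as the partition value."""
    partitions = defaultdict(list)
    for row in rows:
        *data_cols, yr_mon = row  # the last column is the partition column
        partitions[f"{table_dir}/yr_mon={yr_mon}"].append(tuple(data_cols))
    return dict(partitions)

# Hypothetical rows shaped like: select id, name, yr_mon from emp
rows = [(1, "ann", "202101"), (2, "bob", "202102"), (3, "cat", "202101")]
layout = route_to_partitions(rows, "/user/hive/warehouse/table_name")
for path in sorted(layout):
    print(path, layout[path])
```

The partition value ends up in the directory name rather than in the stored rows, which is why the partition column is not repeated in the table's column list.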

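Partitioned tables vs. bucketed tables
Partitioned table: data is placed in different directories, so the storage paths differ.
Bucketed table: rows are hashed on a column into different files; in effect this is also partitioning, just by hash.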

create table user_bucket(id int comment 'ID',name string comment '姓名',age int comment '年龄') comment '测试分桶'
clustered by (id) sorted by (id) into 4 buckets row format delimited fields terminated by '\t';
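# creates a bucketed table: rows are hash-partitioned on id into 4 buckets and sorted by id within each bucket
A few parameters need to be set when using bucketed tables:
set hive.enforce.bucketing=true; # enable bucketing enforcement (needed on Hive 1.x; Hive 2.0 and later always enforce it)
set mapreduce.job.reduces=-1; # let Hive choose the reducer count, which for a bucketed insert follows the number of buckets
select * from user_bucket tablesample(bucket 1 out of 4 on id); # bucketed tables are mainly used for sampling; rarely needed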
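
A rough Python sketch of how rows land in buckets and what a bucket sample reads. It assumes the simplest rule, where an int hashes to itself and the bucket is the non-negative hash modulo the bucket count; newer Hive versions may use a different hash function, so treat the exact assignment as illustrative. The names bucket_for and sample_bucket are made up here.

```python
def bucket_for(value, num_buckets):
    """Simplified bucket rule: non-negative hash modulo bucket count.
    Assumes an int column whose hash is the value itself."""
    return (value & 0x7FFFFFFF) % num_buckets

def sample_bucket(ids, bucket_no, num_buckets):
    """Rows that 'tablesample(bucket x out of y on id)' would read;
    bucket_no is 1-based as in HiveQL."""
    return [i for i in ids if bucket_for(i, num_buckets) == bucket_no - 1]

ids = list(range(1, 13))
buckets = {b: [i for i in ids if bucket_for(i, 4) == b] for b in range(4)}
print(buckets)
print(sample_bucket(ids, 1, 4))
```

With 4 buckets, sampling one bucket reads about a quarter of the rows without scanning the rest of the table.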

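Three ways to put data into a partition:
1. Upload the data first, then repair the table
dfs -mkdir -p /user/hive/warehouse/student/yr_mon=202101
msck repair table student;
2. Upload the data first, then add the partition
dfs -mkdir -p /user/hive/warehouse/student/yr_mon=202101
alter table student add partition(yr_mon='202101');
3. Create the directory, then load data into the partition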
dfs -mkdir -p /user/hive/warehouse/student/yr_mon=202101
load data local inpath '/root/student.txt' into table hive.student partition(yr_mon='202101');

alter table table_name rename to table_name_new;
ALTER TABLE STUINFO CHANGE `NAME` my_name date;

insert into table hive.table_name # append rows
select * from hive.table_name2;
insert overwrite table hive.table_name # replace the existing rows; the table keyword is required after overwrite
select * from hive.table_name2;
create table table_name as
select * from xxx;

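truncate table default.student; # works only on managed tables, not on external tables

4. Multi-table joins
A join between a and b must use equality conditions, and the condition must not contain or; otherwise the join fails. (This restriction applies to older Hive versions; according to the Hive documentation, more complex join conditions are accepted from Hive 2.2.0.)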

V. Exporting data from Hadoop
hadoop fs -get /user/hive/warehouse/student/000000_0 /root/study/hadoop_get.txt # copy a file from HDFS to the local filesystem

insert overwrite local directory '/root/study/export/' row format delimited fields terminated by '\t' # overwrite deletes everything already under the target directory, and root can write almost anywhere, so run this with care; the target must be a directory
select * from default.student distribute by deptno sort by sal; # distribute by sends each department's rows to the same reducer, then sort by orders the rows within each reducer (Hive rejects order by combined with distribute by)

hive -e 'select * from default.student;' > /root/study/hive_e/student.txt;

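export table default.student to '/root/study/export_dir'; # exports to a path on HDFS, not to the local filesystem; in effect it copies the Hive table's data and metadata into an HDFS directory. Rarely used
import table default.student partition(biz_dt='20210111') from '/root/study/export_dir'; # fails if the target table already contains data, and the path must have been produced by export (the exported metadata is required). Rarely used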

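VI. SELECT queries in Hive
1. Ordering clauses: order by, sort by, distribute by, cluster by
order by (global sort): sorts the whole input, so it runs on a single reducer (several reducers cannot guarantee a global order). Because of that single reducer, large inputs take a long time.
sort by (sort within each reducer): each reducer sorts its own rows; the overall result set is not globally sorted.
distribute by (partitioning): controls which reducer a given row goes to, usually in preparation for a later aggregation.
cluster by: when distribute by and sort by use the same columns, cluster by can replace them; it combines the behavior of both. It sorts in ascending order only: unlike sort by, it does not accept ASC or DESC, and specifying one raises an error.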

Create a test table employInfo:
create table employInfo(deptID int,employID int,employName string,employSalary double)
row format delimited fields terminated by ',';

Load test data into it:
load data local inpath '/home/hadoop/datas/employInfo.txt' into table employInfo;

(1) Set the number of reducers to 4
set mapreduce.job.reduces=4;
(2) Write the query result to files (distributed by department ID, sorted by salary in descending order)
insert overwrite local directory '/root/study/distribute-result'
select * from employInfo distribute by deptID sort by employSalary desc; # distribute by and sort by are usually used together
The following two statements are equivalent:
select * from employInfo cluster by deptID;
select * from employInfo distribute by deptID sort by deptID;
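
The difference between these clauses can be simulated in a few lines of Python. This is only a sketch: it assumes rows are assigned to reducers by key modulo the reducer count (an int hashing to itself), and the sample rows and the name distribute_sort are invented for illustration.

```python
def distribute_sort(rows, num_reducers, dist_key, sort_key, reverse=False):
    """Send each row to reducer key % n (distribute by),
    then sort inside each reducer (sort by)."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[row[dist_key] % num_reducers].append(row)
    return [sorted(r, key=lambda x: x[sort_key], reverse=reverse) for r in reducers]

# Hypothetical employInfo rows
rows = [
    {"deptID": 10, "employSalary": 3000.0},
    {"deptID": 20, "employSalary": 5000.0},
    {"deptID": 10, "employSalary": 4500.0},
    {"deptID": 21, "employSalary": 2000.0},
    {"deptID": 20, "employSalary": 6000.0},
]

# distribute by deptID sort by employSalary desc, with 4 reducers
outputs = distribute_sort(rows, 4, "deptID", "employSalary", reverse=True)
# order by employSalary desc: one reducer, one globally sorted result
order_by = sorted(rows, key=lambda x: x["employSalary"], reverse=True)
# cluster by deptID: same column for both steps, ascending only
cluster_by = distribute_sort(rows, 4, "deptID", "deptID")

for i, out in enumerate(outputs):
    print("reducer", i, [(r["deptID"], r["employSalary"]) for r in out])
```

Each reducer's output is sorted on its own, but concatenating the reducer outputs does not give a globally sorted result; only order by does that.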

2. Window functions
select name,
   sum(cost) over (distribute by name sort by sal) windows_sum_sal, # Hive supports window functions, and the over clause accepts the four ordering clauses described above
   lag(orderdate,1) over (partition by name order by orderdate) windows_lag_orderdate,
   lead(orderdate,1) over (partition by name order by orderdate) windows_lead_orderdate,
   sum(cost) over (rows between unbounded preceding and current row) win_1, # unbounded preceding: the frame has no lower bound; current row: the frame ends at the current row
   sum(cost) over (partition by name rows between 3 preceding and current row) win_2, # frame from 3 rows before the current row up to the current row
   sum(cost) over (partition by name rows between current row and 3 following) win_3, # frame from the current row through the 3 rows after it
   ntile(5) over (order by orderdate ASC NULLS LAST) win_ntile # splits all rows into five groups by order date, earliest first; when the row count is not divisible by 5, the later groups hold fewer rows
from empl;
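
The frame clauses and ntile are easier to follow with numbers. The Python sketch below recomputes them on one small, already ordered partition; rolling_sum and ntile are hypothetical helper names, and ntile follows the standard SQL rule that the earlier groups receive the extra rows.

```python
def rolling_sum(values, preceding=None, following=0):
    """Sum over a ROWS frame around each row.
    preceding=None stands for 'unbounded preceding'."""
    out = []
    for i in range(len(values)):
        start = 0 if preceding is None else max(0, i - preceding)
        end = min(len(values), i + following + 1)
        out.append(sum(values[start:end]))
    return out

def ntile(n_rows, k):
    """Group number (1..k) for each of n_rows ordered rows;
    the first n_rows % k groups receive one extra row."""
    base, extra = divmod(n_rows, k)
    tiles = []
    for group in range(1, k + 1):
        tiles.extend([group] * (base + (1 if group <= extra else 0)))
    return tiles

cost = [10, 20, 30, 40, 50]
win_1 = rolling_sum(cost)                            # unbounded preceding .. current row
win_2 = rolling_sum(cost, preceding=3)               # 3 preceding .. current row
win_3 = rolling_sum(cost, preceding=0, following=3)  # current row .. 3 following
tiles = ntile(7, 5)
print(win_1, win_2, win_3, tiles)
```

For 7 rows split into 5 groups, the first two groups hold two rows each and the remaining three hold one.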

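VII. Hive functions
1. Built-in functions
show functions; # list all functions
desc function trim; # show the description of trim
desc function extended trim; # show detailed usage examples for trim

select date_format(current_timestamp(),'yyyy-MM-dd'); # lowercase dd is day of month; uppercase DD would be day of year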
select date_add(current_timestamp(),5);
select datediff(current_timestamp(),'2021-01-01');
select regexp_replace('2021/09/21','/','-');

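if(10<5,'big','small') # similar to case when ... else ... end in Oracle; MySQL has the same case when as Oracle
select concat('a','b','c'),collect_set(col_to_row); # collect_set gathers the distinct values of a column into one array, similar in purpose to Oracle's wm_concat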

select movie,category_name from movie_info lateral view explode(category) table_tmp as category_name;
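
What lateral view explode does to each row can be sketched in Python: one output row per array element, with the other columns repeated. The sample movie_info rows and the function name lateral_view_explode are assumptions for illustration.

```python
def lateral_view_explode(rows, array_col, alias):
    """Emit one output row per element of row[array_col],
    repeating the remaining columns alongside it."""
    out = []
    for row in rows:
        for item in row[array_col]:
            new_row = {k: v for k, v in row.items() if k != array_col}
            new_row[alias] = item
            out.append(new_row)
    return out

# Hypothetical movie_info rows; category is an array column
movie_info = [
    {"movie": "A", "category": ["action", "drama"]},
    {"movie": "B", "category": ["comedy"]},
    {"movie": "C", "category": []},
]
result = lateral_view_explode(movie_info, "category", "category_name")
for r in result:
    print(r["movie"], r["category_name"])
```

A row whose array is empty produces no output rows, which matches a plain lateral view (without outer).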

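2. User-defined functions
To create a user-defined function, first write it in Java and package it as a jar, then upload the jar to the Hive server.
sftp root@192.168.100.101;
put your_udf.jar; # your_udf.jar stands for the UDF jar you built
mv your_udf.jar /root/hive-3.1.2/lib/
Then, inside hive:
   add jar /root/hive-3.1.2/lib/your_udf.jar;
   create [temporary] function dbname.function_name as 'com.myname.myudf'; # the fully qualified name of the class inside the jar, e.g. com.myname.myudf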
   drop [temporary] function if exists dbname.function_name;

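VIII. Hive tuning
set hive.fetch.task.conversion=more; # already the default (more) on this install (Hadoop 2.10.1, Hive 3.1.2); simple reads are served by a fetch task instead of a MapReduce job
set hive.exec.mode.local.auto=true; # enable automatic local mode
   A job actually runs in local mode only when all of the following hold:
   1. the job's total input size is smaller than hive.exec.mode.local.auto.inputbytes.max (default 128 MB)
   2. the job's number of input files is smaller than hive.exec.mode.local.auto.input.files.max (default 4)
   3. the job has 0 or 1 reducers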

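set hive.auto.convert.join=true; # enable map join, which caches the small table in memory; on by default
set hive.mapjoin.smalltable.filesize=25000000; # size threshold for the small table: only tables below it are cached; default about 25 MB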
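
The idea behind a map join is an ordinary hash join: the small table is loaded into memory once, and the big table is streamed past it, so no shuffle on the join key is needed. A minimal Python sketch, with invented table contents and a hypothetical function name map_join:

```python
def map_join(big_rows, small_rows, key):
    """Inner hash join: build an in-memory lookup from the small table,
    then stream the big table through it."""
    lookup = {}
    for s in small_rows:
        lookup.setdefault(s[key], []).append(s)
    joined = []
    for b in big_rows:
        for s in lookup.get(b[key], []):
            joined.append({**b, **s})
    return joined

# Hypothetical data: dept is the small table, emp the big one
dept = [{"deptno": 10, "dname": "ops"}, {"deptno": 20, "dname": "dev"}]
emp = [
    {"ename": "ann", "deptno": 10},
    {"ename": "bob", "deptno": 30},
    {"ename": "cat", "deptno": 20},
]
joined = map_join(emp, dept, "deptno")
print(joined)
```

The approach only pays off while the lookup fits in memory, which is what the small-table size threshold guards.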

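If a column used for a join or for distribution (both trigger a MapReduce shuffle) contains NULLs, convert the NULLs to some fixed value before processing, to prevent data skew.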

set hive.map.aggr=true; # aggregate on the map side; on by default. Mainly speeds up group by
set hive.groupby.skewindata=true; # balance the load when the data is skewed; off by default
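
As I understand it, hive.groupby.skewindata=true plans the aggregation as two jobs: the first spreads rows randomly over the reducers and aggregates partially, the second merges the partial results by key. The Python sketch below imitates that for a count; the name two_stage_count, the seed and the sample keys are assumptions for illustration.

```python
import random
from collections import Counter

def two_stage_count(keys, num_reducers, seed=0):
    """Count keys in two stages so that one hot key is not
    handled by a single reducer."""
    rng = random.Random(seed)
    # Stage 1: random distribution, partial counts per reducer
    partials = [Counter() for _ in range(num_reducers)]
    for k in keys:
        partials[rng.randrange(num_reducers)][k] += 1
    # Stage 2: merge the partial counts by key
    final = Counter()
    for p in partials:
        final.update(p)
    return partials, final

# Hypothetical skewed input: one key dominates
keys = ["hot"] * 1000 + ["a", "b", "c"] * 10
partials, final = two_stage_count(keys, 4)
print([p["hot"] for p in partials], final["hot"])
```

The final counts are the same as a single-stage group by, but the work for the hot key is shared among the reducers in the first stage.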

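Avoid select *; select only the columns you need.

set hive.exec.dynamic.partition=true; # enable dynamic partitioning (on by default), i.e. partition by the values of a column in the data
set hive.exec.dynamic.partition.mode=nonstrict; # non-strict mode. The default is strict, which requires at least one static partition to be specified on insert
set hive.exec.max.dynamic.partitions=1000; # global maximum number of dynamic partitions; default 1000
set hive.exec.max.dynamic.partitions.pernode=1000; # maximum number of dynamic partitions per node; default 100; set here to match the global maximum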

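Choosing a reasonable number of map tasks:
Increase the number of maps for heavy inputs: when the input files are large and the task logic is complex so that each map runs very slowly, consider adding maps so that each one processes less data, which improves throughput.

Merging small files:
If a job reads many small files (far smaller than the 128 MB block size), each small file is treated as its own block and handled by a separate map task. Starting and initializing a map task then takes far longer than the actual processing, which wastes a lot of resources.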

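Choosing a reasonable number of reducers
(1) Too many reducers produce many small output files, which puts pressure on the NameNode, and the job does not necessarily finish sooner.
(2) Too few reducers make each one process more data, which can easily cause OOM (out of memory) errors.
set hive.exec.reducers.bytes.per.reducer=256000000; # amount of data handled by each reducer
set hive.exec.reducers.max=1009; # maximum number of reducers per job
set mapreduce.job.reduces=-1; # number of reducers per job (order by and distinct use a single reducer by default); -1 means derive it from the input size
Actual number of reducers when mapreduce.job.reduces=-1: N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer)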
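
A small Python sketch of how these three settings could combine into a reducer count. It is a simplification under stated assumptions: the division is rounded up, at least one reducer is used, an explicit positive mapreduce.job.reduces wins, and special cases such as order by (single reducer) are ignored. The function name estimate_reducers is made up.

```python
def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=256_000_000,  # hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009,              # hive.exec.reducers.max
                      job_reduces=-1):                # mapreduce.job.reduces
    """Estimate the reducer count from the input size and the settings."""
    if job_reduces > 0:
        return job_reduces  # an explicit setting overrides the estimate
    needed = max(1, -(-total_input_bytes // bytes_per_reducer))  # ceiling division
    return min(max_reducers, needed)

print(estimate_reducers(1_000_000_000))      # about 1 GB of input
print(estimate_reducers(10**12))             # about 1 TB, capped by the maximum
print(estimate_reducers(10**9, job_reduces=20))
```

With the values above, 1 GB of input needs 4 reducers, while 1 TB is capped at the configured maximum of 1009.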

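Parallel execution
set hive.exec.parallel=true; # enable parallel execution of independent stages; the default is false, and it is worth enabling only when the cluster has spare resources
set hive.exec.parallel.thread.number=8; # maximum number of stages run in parallel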

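Strict mode
set hive.mapred.mode=strict; # in strict mode Hive refuses to run several kinds of statements:
1. Cartesian products
2. Queries on a partitioned table that do not filter on a partition
3. Comparisons between bigint and string, or between bigint and double
4. order by without limit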

JVM reuse
Configured in mapred-site.xml:
mapreduce.job.jvm.numtasks=1; # lets one JVM run several tasks of the same type in sequence; the default is 1, and -1 means no limit

Analyzing the execution plan
explain select * from emp;
