Hive: From Installation to Practice


卍杺歿卍

Category: hive. Tags: hive, partitioning and bucketing

Article link: https://blog.csdn.net/qq_31108731/article/details/101649195

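This article covers the basics of Hive: an introduction, installation steps, data types, and basic DDL and DML operations, with a focus on partitioned tables, bucketing, and compression settings. As a data warehouse tool for Hadoop, Hive offers SQL-like queries that make statistical analysis of large data sets convenient.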


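1. Introduction to Hive

From Baidu Baike:

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query facility, translating SQL statements into MapReduce jobs. Its main advantage is a low learning cost: simple MapReduce statistics can be produced quickly with SQL-like statements, without developing a dedicated MapReduce application, which makes it well suited to statistical analysis in a data warehouse.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work, and a mechanism for storing, querying, and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users familiar with SQL query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.

Hive has no dedicated data format. It works well on top of Thrift, lets you control delimiters, and also allows users to specify their own data formats.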

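2. Installing Hive

2.1 Download and install

1. Download Hive from: http://mirror.bit.edu.cn/apache/hive/

2. Extract the archive and rename the directory:

tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /usr/local/

cd /usr/local && mv apache-hive-3.1.2-bin hive

3. Add the environment variables to /etc/profile:

export HIVE_HOME=/usr/local/hive

export PATH=$PATH:$HIVE_HOME/bin

4. Run source /etc/profile, then verify the installation:

hive --version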

2.2 Configuring Hive

1. Edit hive-site.xml (in $HIVE_HOME/conf). Start from the template, then set the following properties:

cp hive-default.xml.template hive-site.xml


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>username</value>
    <description>Username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>password to use against metastore database</description>
  </property>

  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Show the current database in the CLI prompt</description>
  </property>

  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Print column names in query output</description>
  </property>
</configuration>

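2.3 Setting parameters

Precedence, from lowest to highest: configuration file < startup option (hive -hiveconf param=value) < command inside the CLI (set param=value)

e.g. set mapreduce.job.reduces=3;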
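The precedence rule above can be pictured as a layered lookup. The following is a minimal Python sketch, not Hive's actual implementation; the three dictionaries and their values are hypothetical and only illustrate that a later level shadows an earlier one.

```python
from collections import ChainMap

# Hypothetical settings at each of the three levels described above.
file_conf = {"mapreduce.job.reduces": "-1", "hive.cli.print.header": "true"}
startup_conf = {"mapreduce.job.reduces": "2"}   # given when launching hive
session_conf = {"mapreduce.job.reduces": "3"}   # given with set inside the CLI

# ChainMap searches its maps left to right, so the highest-priority level goes first.
effective = ChainMap(session_conf, startup_conf, file_conf)

print(effective["mapreduce.job.reduces"])   # resolved at the session level
print(effective["hive.cli.print.header"])   # falls through to the file level
```

A key that is only present in the configuration file is still visible, while a key set at several levels resolves to the most specific one.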

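3. Hive data types

3.1 Primitive data types

   Hive type -> Java type -> description -> example
   tinyint -> byte -> 1-byte signed integer -> 20
   smallint -> short -> 2-byte signed integer -> 20
   int -> int -> 4-byte signed integer -> 20
   bigint -> long -> 8-byte signed integer -> 20
   boolean -> boolean -> true or false -> true
   float -> float -> single-precision floating point -> 3.14159
   double -> double -> double-precision floating point -> 3.14159
   string -> String -> character sequence; a character set can be specified; single or double quotes both work -> 'name' "name"
   timestamp -> -> time type
   binary -> -> byte array
  Note: the most commonly used types are int, bigint, double, and string.
  Hive's string type is comparable to varchar in a relational database: it is variable-length, but you cannot declare a maximum number of characters. In theory it can hold up to 2 GB of characters.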

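3.2 Collection data types

1. struct -> a structure, similar to a C struct; members are accessed with dot notation. e.g. for a column declared as struct<first:string,last:string>, the first member is referenced as column.first.  --select addr.city from personInfo;
2. map -> a collection of key-value pairs, similar to a Java Map.
e.g. for "first" -> "john", "last" -> "Doe", the second value is fetched with the key 'last'.  --select children['xiaohua1'] from personInfo;
3. array -> an ordered collection indexed from 0. e.g. for ["a","b"], arr[1] is "b".  --select friends[0] from personInfo;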

3.3 Example

1. Sample log lines
xiaoming,xiaohong_xiaolan,xiaohua1:17_xiaohua2:18,shanxi_hanzhong
zhangsan,lisi_wangwu,zhangsi:17_zhangwu:18,zhejiang_hangzhou
2. How the first line maps to fields:
       name: xiaoming
       friends: xiaohong, xiaolan
       children: xiaohua1 (age 17), xiaohua2 (age 18)
       addr: province: shanxi; city: hanzhong
3. Create the table:
  create table personInfo(
    name string,
    friends array<string>,
    children map<string,int>,
    addr struct<province:string,city:string>
  )
  row format delimited fields terminated by ','
  collection items terminated by '_'
  map keys terminated by ':'
  lines terminated by '\n';

  Clause by clause:
   row format delimited fields terminated by ','  --column (field) delimiter
   collection items terminated by '_'  --delimiter between the elements of a map, struct, or array
   map keys terminated by ':'  --delimiter between a map key and its value
   lines terminated by '\n'  --row delimiter

4. Put the data into a file: vim person.txt
xiaoming,xiaohong_xiaolan,xiaohua1:17_xiaohua2:18,shanxi_hanzhong
zhangsan,lisi_wangwu,zhangsi:17_zhangwu:18,zhejiang_hangzhou

5. Load the data into the table
load data local inpath '/xx/person.txt' into table personInfo;

6. Query the data: select * from personInfo;
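To make the role of the three delimiters concrete, here is a small Python sketch that splits one sample row the same way the table definition describes. It only mimics the parsing for illustration; parse_person is a hypothetical helper, not part of Hive.

```python
def parse_person(line, field_sep=",", item_sep="_", key_sep=":"):
    """Split one text row according to the personInfo delimiters."""
    name, friends, children, addr = line.strip().split(field_sep)
    province, city = addr.split(item_sep)      # struct members, in declared order
    return {
        "name": name,
        "friends": friends.split(item_sep),    # array<string>
        "children": {                          # map<string,int>
            key: int(value)
            for key, value in (pair.split(key_sep) for pair in children.split(item_sep))
        },
        "addr": {"province": province, "city": city},
    }

row = parse_person("xiaoming,xiaohong_xiaolan,xiaohua1:17_xiaohua2:18,shanxi_hanzhong")
print(row["friends"][0])            # compare: select friends[0] from personInfo;
print(row["children"]["xiaohua1"])  # compare: select children['xiaohua1'] from personInfo;
print(row["addr"]["city"])          # compare: select addr.city from personInfo;
```

Note that the same collection delimiter separates array elements, map entries, and struct members, which is why a single collection items clause covers all three column types.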

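3.4 Type conversion

Any integer type can be implicitly converted to a wider type, e.g. tinyint -> int.
All integer types, float, and string can be implicitly converted to double.
tinyint, smallint, and int can all be converted to float.
Use cast to convert explicitly, e.g. cast('1' as int) turns the string '1' into an integer. If the cast fails, as in cast('s' as int), the result is null.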

4. Basic Hive operations

4.1 DDL: data definition

4.1.1 Databases

1. Create a database
  (1) Create a database in the default location. On HDFS the default path is /user/hive/warehouse/<database_name>.db.
     create database if not exists hive_2;
  (2) Create a database at a specific HDFS location, here /hive3.db:
    create database hive_3 location '/hive3.db';
2. Inspect databases
   (1) show databases; --list databases
   (2) show databases like 'hive*'; --filter by pattern
   (3) desc database hive_3; --show database information
   (4) desc database extended hive_3; --show extended metadata
3. Alter a database:
   alter database can set key-value pairs in a database's dbproperties to describe it. Other database metadata, including the database name and its storage location, cannot be changed.
   (1) alter database hive_3 set dbproperties("createTime"="2019-09-27");
   (2) alter database hive_3 set dbproperties("createTime"="2019-09-28","createUser"="wql");
4. Drop a database
   (1) drop database hive_3; --only works on an empty database
   (2) drop database hive_3 cascade; --force-drop a database that still contains tables

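4.1.2 Tables

Creating a table

create [external] table [if not exists] table_name
[(col_name data_type [comment col_comment], ...)]
[comment table_comment]
[partitioned by (col_name data_type [comment col_comment], ...)]
[clustered by (col_name, ...) [sorted by (col_name [ASC|DESC], ...)] into num_buckets buckets]
[row format row_format]
[stored as file_format]
[location hdfs_path]

The external keyword creates an external table whose location points at the actual data path.

To see the full definition of an existing table: show create table table_name;

When data is loaded into a managed (internal) table, Hive moves it under the warehouse path. A table created with the external keyword instead points at the actual data files through location; Hive only records that path and does not move the data. On drop table, a managed table loses both its metadata and its data, whereas an external table loses only its metadata and the data files are left in place.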

Converting between managed and external tables

1. Check the table type
   desc formatted personInfo;
   Table Type: MANAGED_TABLE  (managed table)
2. Change personInfo to an external table
   alter table personInfo set tblproperties('EXTERNAL' = 'TRUE');
3. Check the table type again
   desc formatted personInfo;
   Table Type: EXTERNAL_TABLE  (external table)
4. Change personInfo back to a managed table
   alter table personInfo set tblproperties('EXTERNAL' = 'FALSE');
 Note: ('EXTERNAL' = 'TRUE') and ('EXTERNAL' = 'FALSE') are fixed forms. The key EXTERNAL is case-sensitive, while the value true/false is not.

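Partitioned tables

Each partition of a partitioned table corresponds to its own directory on HDFS, and that directory holds all of the partition's data files. In Hive, partitioning simply means splitting into directories: a large data set is divided into smaller ones according to business needs. A query can then use expressions in its where clause to select only the partitions it needs, which improves query efficiency considerably.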
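As a rough illustration of why a where clause on a partition column is cheap, the following Python sketch reads partition values from key=value directory names and keeps only matching files. The helper names and the exact file list are assumptions made for the example; Hive performs this pruning internally using the metastore.

```python
def partition_spec(path):
    """Extract {column: value} pairs from key=value directory names in a path."""
    return dict(part.split("=", 1) for part in path.split("/") if "=" in part)

def prune(paths, **wanted):
    """Keep only files whose partition directories match every requested value."""
    return [
        p for p in paths
        if all(partition_spec(p).get(col) == val for col, val in wanted.items())
    ]

files = [
    "/user/hive/warehouse/hive_1.db/order_partition/month=201909/20190927.txt",
    "/user/hive/warehouse/hive_1.db/order_partition/month=201910/20191028.txt",
]

# Roughly what "where month = '201909'" allows: other directories are never read.
print(prune(files, month="201909"))
```

The filter looks only at directory names, never at file contents, which is the reason partition predicates avoid a full scan.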

3.1 Basic partitioned-table operations

1. Why partition (managing files by date)
 /user/hive/warehouse/hive_1.db/order_partition/month=201909/20190927.txt
 /user/hive/warehouse/hive_1.db/order_partition/month=201910/20191028.txt

2. Create a partitioned table
create table order_partition(oid int,price double, desc string)
partitioned by (month string) row format delimited fields terminated by '\t';

3. Load data into the partitioned table
load data local inpath '/home/qiulin/soft/hive/data/20190927.txt' into table order_partition partition(month='201909');

4. Query data in a partition
select * from order_partition where month = '201909';

5. Add partitions
  (1) Add a single partition
     alter table order_partition add partition(month='201910');
  (2) Add several partitions (separated by spaces)
     alter table order_partition add partition(month='201911') partition(month='201912');

6. Drop partitions
   (1) Drop a single partition
      alter table order_partition drop partition(month='201910');
   (2) Drop several partitions (separated by commas)
      alter table order_partition drop partition(month='201911'),partition(month='201912');

 7. List the partitions of a table
  show partitions order_partition;

 8. Show the structure of a partitioned table
   desc formatted order_partition;

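3.2 Common partitioned-table workflows

The examples below use a second partition level, day, so they assume a table partitioned by (month string, day string).

1. Upload the data, then repair (useful when there is a lot of historical data spread over many files): msck repair table table_name

 (1) Upload the data to HDFS
    a. Create the partition directory: dfs -mkdir -p /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;
    b. Put the file into the directory:
     dfs -put /home/qiulin/soft/hive/data/20190927.txt  /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;

 (2) Query the data: nothing is returned, because the metastore has no partition entry linking the new directory to the table
 (3) Repair
       msck repair table order_partition;
 (4) Query again
       select * from order_partition where month = '201911' and day = '01';

 2. Upload the data, then add the partition

 (1) Upload the data
       dfs -put /home/qiulin/soft/hive/data/20190927.txt  /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;
 (2) Add the partition
       alter table order_partition add partition(month='201911',day='01');

 3. Create the directory, then load data into the partition (the partitioned table already exists)

 (1) Create the directory
    dfs -mkdir -p /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;
 (2) Load the data
    load data local inpath '/home/qiulin/soft/hive/data/20190927.txt' into table order_partition partition(month='201911',day='01');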
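The repair step in workflow 1 can be thought of as a set difference between the partition directories that exist on HDFS and the partitions registered in the metastore. The Python sketch below illustrates that idea under stated assumptions: the file list and the registered set are made up, discover_partitions is a hypothetical helper, and the real msck repair table command does this work inside Hive.

```python
def discover_partitions(table_root, paths):
    """Collect partition specs, as tuples of (column, value), found under table_root."""
    found = set()
    for path in paths:
        if not path.startswith(table_root + "/"):
            continue
        relative = path[len(table_root) + 1:]
        spec = tuple(
            tuple(part.split("=", 1)) for part in relative.split("/") if "=" in part
        )
        if spec:
            found.add(spec)
    return found

table_root = "/user/hive/warehouse/hive_1.db/order_partition"
hdfs_files = [
    table_root + "/month=201909/day=27/20190927.txt",
    table_root + "/month=201911/day=01/20190927.txt",
]
# Hypothetical: the partitions the metastore already knows about.
registered = {(("month", "201909"), ("day", "27"))}

missing = discover_partitions(table_root, hdfs_files) - registered
for spec in sorted(missing):
    clause = ", ".join(f"{col}='{val}'" for col, val in spec)
    print(f"alter table order_partition add partition({clause});")
```

The printed statement corresponds to workflow 2, where the partition is added by hand instead of being discovered.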

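4. Altering tables

Adding, changing, and replacing columns

1. Rename a table
 alter table table_name rename to new_table;
2. Change a column
 alter table table_name change column old_column_name new_column_name column_type;
3. Add columns
 alter table table_name add columns (column_name column_type, ...);
4. Replace columns: the table's whole column list becomes the list given to replace. The data stays in the files and is not lost; where a new column type does not match the data in the file, the value reads as null.
 alter table table_name replace columns (column_name column_type, ...);
Note: change is followed by column (singular), while add and replace are followed by columns (plural).
5. Truncate a table (managed tables only)
  truncate table table_name;
6. Drop a table
 drop table table_name;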

4.2 DML: data manipulation

1. Adding data to a table

1. Import data with load

(1) Syntax
     load data [local] inpath 'file_url' [overwrite] into table table_name [partition(part_column=xxx,...)];
     local: load from the local file system into the Hive table; without it, the file is taken from HDFS.
     overwrite: replace the data already in the table; without it, the data is appended.

2. Insert data with a query (insert)

(1) Syntax
     a. Create a partitioned table:
         create table person(id int,name string) partitioned by (month string) row format delimited fields terminated by '\t';
     b. Insert a row:
         insert into table person partition(month="201909") values(1,"wql");
     c. Insert the result of a query (overwrite replaces any existing data in the target partition):
     from person insert overwrite table person partition(month='201907') select id, name where month = "201909";

3. Create a table from a query result and load the data (as select). The new table's column names are those of the query.

 (1) Syntax
     create table if not exists table_name as select column_name, ... from source_table;
     e.g.
     create table if not exists person2 as select id,name from person where month='201909';

4. Point a table at a data path with location

 (1) Syntax
     a. Create the table:
     create table if not exists table_name(column_name column_type, ...) row format delimited fields terminated by 'field_delimiter' location 'hdfs_path_of_table_data';
     e.g.
     create table if not exists person3(id int , name string) row format delimited fields terminated by '\t' location '/user/hive/warehouse/hive_4.db/person3';
     b. Upload data to HDFS
     dfs -put /home/qiulin/soft/hive/data/person1.txt  /user/hive/warehouse/hive_4.db/person3;
     c. Query the data (when the directory holds several files, the rows of all of them are returned)
     select * from person3;

 5. Export query results with insert

 (1) Export data. Include row format delimited fields terminated by '\t'; without it the fields are separated only by Hive's default ^A control character, which is not visible in the output.
      a. Export the query result to the local file system (local)
      insert overwrite local directory '/home/qiulin/soft/hive/data/export' row format delimited fields terminated by '\t' select * from person3;
      b. Export the query result to HDFS (the target directory is overwritten, so in practice choose a directory other than the table's own location)
      insert overwrite directory '/user/hive/warehouse/hive_4.db/person3' row format delimited fields terminated by '\t' select * from person3;


6. Export to HDFS with export, then bring the data back with import (the export includes the table metadata)

 (1) Export the data to HDFS
     export table personInfo to '/user/hive/warehouse/hive_4.db/person3';
 (2) Import the data (only into a new table)
      import table person partition(month='201905') from '/user/hive/warehouse/hive_4.db/person3';

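5. Querying table data

5.1 Basic queries

1. Join
   Hive supports the usual SQL JOIN syntax. Hive releases before 2.2.0 accept only equality join conditions; non-equality conditions (!=, >, <, ...) in the on clause are rejected there.
   e.g.
     select p2.id, p2.name from person2 p2 join person3 p3 on p2.id = p3.id;
2. On those releases, join predicates do not support or
   e.g.
     select p2.id, p2.name from person2 p2 join person3 p3 on p2.id = p3.id or p2.name = p3.name;  --fails; rewrite it as a union of two equi-joins (duplicate rows are merged)
     select p2.id, p2.name from person2 p2 join person3 p3 on p2.id = p3.id
     union
     select p2.id, p2.name from person2 p2 join person3 p3 on p2.name = p3.name;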

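5.2 Sorting

1. Global sort (order by): whenever order by appears, the job uses a single reducer.
2. Sort within each reducer (sort by): each reducer's output is ordered, but the overall result is not.
   e.g.
      select * from person2 sort by id;
3. Partitioned sort (distribute by)
   distribute by works like the partitioner in MapReduce: it decides which reducer each row goes to, and is used together with sort by.
   Note: Hive requires distribute by to be written before sort by, and the job must run with several reducers, otherwise the effect of distribute by is not visible.
   e.g.
    set mapreduce.job.reduces=3;
    insert overwrite local directory '/home/qiulin/soft/hive/data/result' select * from person3 distribute by id sort by name asc;
4. When distribute by and sort by use the same column, cluster by can replace both.
    cluster by combines the behavior of distribute by and sort by, but it sorts in ascending order only; asc or desc cannot be specified.
  e.g.
  select * from person2 cluster by id;
  --equivalent to:
  select * from person2 distribute by id sort by id;  --rows with the same id always land in the same output file, but one file can contain several different ids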
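The interplay of distribute by and sort by can be simulated in a few lines of Python. This is only a sketch: it assumes integer keys are assigned to reducers by their value modulo the reducer count, which is a simplification (the hash function Hive actually applies may differ between versions), and the sample rows are invented.

```python
def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """Send each row to reducer (key mod num_reducers), then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        # Assumption: an integer key is partitioned by its value modulo the reducer count.
        reducers[row[distribute_key] % num_reducers].append(row)
    return [sorted(part, key=lambda r: r[sort_key]) for part in reducers]

rows = [
    {"id": 4, "name": "d"}, {"id": 1, "name": "z"}, {"id": 3, "name": "c"},
    {"id": 1, "name": "a"}, {"id": 6, "name": "b"}, {"id": 2, "name": "e"},
]

parts = distribute_and_sort(rows, "id", "name", 3)
for index, part in enumerate(parts):
    print(index, [(r["id"], r["name"]) for r in part])
```

Both rows with id 1 end up in the same reducer alongside id 4, and each reducer's rows are ordered by name, while there is no ordering across reducers.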

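5.3 Bucketed tables

Partitioning applies to the storage path of the data; bucketing applies to the data files themselves.

1. Create a bucketed table
  create table per_buck(id int, name string) clustered by(id) into 3 buckets row format delimited fields terminated by "\t";
2. Insert data into the bucketed table (use insert into table only: the MapReduce job behind it writes rows into separate bucket files, whereas load and import copy files as they are and cannot split them)
  (1) Set the properties (hive.enforce.bucketing applies to Hive 1.x; from Hive 2.0 bucketing is always enforced and the property was removed)
  set hive.enforce.bucketing = true;
  set mapreduce.job.reduces = -1;
  (2) Insert the data
  insert into table per_buck select * from person3;
 3. Bucket sampling: querying part of the bucketed data
   For very large data sets where a sample is enough to answer the question, a bucketed table is the best fit.
   select * from per_buck tablesample(bucket 1 out of 3 on id);
   Note: tablesample is the sampling clause. Syntax: tablesample(bucket x out of y)
   x: the bucket to start sampling from; must be less than or equal to y.
   y: a positive multiple (or factor) of the table's bucket count. With 3 buckets in total, y=3 samples one bucket's worth of data and y=6 samples half a bucket.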
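The x and y arithmetic is easier to follow with numbers. The Python sketch below models tablesample(bucket x out of y on id) as selecting the rows whose hash modulo y equals x - 1, and it assumes that the hash of an integer id is the integer itself. That assumption is a simplification; real Hive versions may hash differently, so treat this as an illustration of the proportions only.

```python
def bucket_of(row_id, num_buckets):
    """0-based bucket index for an integer id; assumes hash(id) is the id itself."""
    return row_id % num_buckets

def tablesample(ids, x, y):
    """Rows matched by tablesample(bucket x out of y on id): hash(id) mod y == x - 1."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    return [i for i in ids if i % y == x - 1]

ids = list(range(1, 13))   # twelve hypothetical ids in a 3-bucket table
first_bucket = [i for i in ids if bucket_of(i, 3) == 0]

print(first_bucket)             # contents of the first of the 3 buckets
print(tablesample(ids, 1, 3))   # y = 3: the whole first bucket
print(tablesample(ids, 1, 6))   # y = 6: half of the first bucket
```

With y equal to the bucket count the sample is exactly one bucket, and doubling y halves the sample, which matches the y=3 and y=6 cases described above.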

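6. Functions

List Hive's built-in functions: show functions;

Show how a function is used: desc function extended function_name;

User-defined functions

UDF:

Extend the class org.apache.hadoop.hive.ql.exec.UDF:

import org.apache.hadoop.hive.ql.exec.UDF;

Implement an evaluate method; evaluate can be overloaded.

Creating the function in Hive

(1) Add the jar

add jar linux_jar_path;

(2) Create the function

create [temporary] function [dbName.]function_name as 'class_name';
To drop a function in Hive: drop [temporary] function if exists [dbName.]function_name;

A UDF must have a return type. It may return null, but the return type cannot be void.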

7. Compression

1. Check which compression types Hadoop supports

hadoop checknative

2. Enable compression for the reduce output stage

1. Enable compression of Hive's final output
set hive.exec.compress.output=true;
2. Enable compression of the final MapReduce output
set mapreduce.output.fileoutputformat.compress=true;
3. Set the codec used for the final MapReduce output
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Use block compression for the final MapReduce output
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Test:
 insert overwrite local directory '/usr/local/hive/data/emp-snapy' select * from bussess distribute by costdate sort by cost desc;  --distribute by costdate, sort by cost

3. File storage formats

    1. Specify the storage format when creating the table (stored as textfile|orc|parquet).
    2. Check file sizes: dfs -du -h /user/hive/warehouse/tableName/
