hive的基本简介及安装、配置、使用（一）

最新推荐文章于 2024-08-11 15:16:37 发布

kinglyjn

最新推荐文章于 2024-08-11 15:16:37 发布

阅读量5.1k

点赞数

分类专栏： Cloud JAVA基础文章标签： java hive 数据仓库

本文链接：https://blog.csdn.net/kinglyjn/article/details/77509482

版权

JAVA基础同时被 2 个专栏收录

23 篇文章 0 订阅

订阅专栏

Cloud

7 篇文章 0 订阅

订阅专栏

hive是什么？

由facebook开源，用于解决海量结构化日志的数据统计；
基于hadoop的一个数据仓库工具，使用HDFS进行存储并将结构化数据文件映射成一张表，并提供类sql查询的功能，其底层采用MR进行计算；
本质是将HQL转化成MR程序。

hive架构图

安装前的准备

Java 1.7 (preferred)
Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward).

简单安装HIVE（以0.13.1版本为例）

# 1. 下载解压安装包
wget http://archive.apache.org/dist/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz
tar -zxvf apache-hive-0.13.1-bin.tar.gz -C /opt/module

# 2. 配置文件
# 2.1. conf/hive-env.sh
cp hive-env.sh.template  hive-env.sh
vim hive-env.sh
    HADOOP_HOME=/opt/modules/hadoop-2.5.0
    export HIVE_CONF_DIR=/opt/modules/apache-hive-0.13.1-bin/conf

# 3. 在HDFS上创建默认的存储路径(/tmp 和 /user/hive/warehouse，并且赋予权限chmod g+w)
# 通过查找hive-default.xml.template可知，该值可以通过 hive.metastore.warehouse.dir 参数设置
$HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

# 4. 启动hive的cli客户端，运行一个MR测试任务
bin/hive
> show databases;
> use default;
> show tables;
> create table test_log(ip string, user string, requrl string);
> desc test_log;
> select count(*) from test_log;

使用mysql存储hive元数据

hive元数据默认是存放在derby内存数据库中的，也就是说一次只允许一个客户端操作，这时候如果有另一个客户端运行（bin/hive），则就会报如下错误：

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader@404b9385, see the next exception for details.

在企业中通常推荐使用mysql存储hive的元数据，如下是hive与mysql的集成方法：

# 1. 拷贝mysql的驱动jar包到 hive的lib文件夹中
cp mysql-connector-java-5.1.38.jar $HIVE_HOME/lib/

# 2. 新建一个 conf/hive-site.xml 文件用于配置 hive 的环境参数
cp hive-default.xml.template hive-site.xml
vim hive-site.xml
---------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!--hive metastore with mysql-->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://dbserver:3306/hivemetastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>23wesdxc</value>
        <description>password to use against metastore database</description>
    </property>

    <!--cli header message-->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Whether to print the names of the columns in query output.</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Whether to include the current database in the Hive prompt.</description>
    </property>
</configuration>

# 3. 运行hive客户端之后会在mysql生成存储metastore数据的数据库（例如hivemetastore）
bin/hive


# [注1]
用mysql做元数据，需要在mysql命令行执行：
alter database hivemetastore character set latin1;
这样才不会报  max key length is 767 bytes 
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes
这个异常

# [注2]
在hive中,show tables,create 等命令能正常执行,删除表drop table x时,会出现卡住的现象.
进入mysql,
show variables like 'char%'
可以看到，按理说是正确的.
后面发现,是在建好hive数据库后,没有第一时间将character_set_database编码由utf8修改为latin1.而是去hive中create了一张表.而后才将character_set_database编码由utf8改成了latin

解决办法:
在mysql中将drop hive
重新create hive,
修改编码,alter database hive character set latin1
进入hive shell
创建表,drop表,正常!

日志hive的简单配置

# 1. 在模板的基础上新建 conf/hive-log4j.properties 文件（日志参数设置）
cp hive-log4j.properties.template hive-log4j.properties

# 2.1 使用hive-log4j.properties日志参数配置
bin/hive
Logging initialized using configuration in file:/opt/modules/apache-hive-0.13.1-bin/conf/hive-log4j.properties

# 2.2 使用自定义日志参数配置
> bin/hive -help
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -h <hostname>                    connecting to Hive Server on remote host
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

> bin/hive --hiveconf hive.root.logger=INFO,console

# 2.3 日志文件的位置
cat hive-log4j.properties
...
hive.log.dir=${java.io.tmpdir}/${user.name}  
hive.log.file=hive.log # 即 /tmp/ubuntu/hive.log 文件
...

hive/bin
> set
> set system:user.name; #当前会话查看
> set system:user.name=ubuntu; #当前会话设置
...
system:java.io.tmpdir=/tmp
system:user.name=ubuntu
...

hive cli常见交互命令

# hive cli 帮助
bin/hieve -help


# 查看hive的函数的相关命令：
show functions;
desc function upper;
desc function extended upper;


# 在hive cli 交互窗口执行hdfs命令：
hive (default)> dfs -rm -R /user/ubuntu/xxx.log


# 在hive cli 交互窗口运行linux本地命令：
hive (default)> !ls /opt/datas


# 执行 hive sql 的五种方式：
# 方式一：
bin/hive -e "select * from db_hive.student";
# 方式二：
bin/hive -f xxx/xxx.sql
bin/hive -f xxx/xxx.sql > xxx/result.txt
# 方式三：(UDF)
bin/hive -i <filename> 
# 方式四：
hive (default)> source xxx/xxx.sql
# 方式五：
hive (default)> show databases;
hive (default)> create database db_hive;
hive (default)> use default;
hive (default)> show tables;

hive cli DDL/DML

数据库相关：

-- 查看数据库(默认的数据库为default)
show databases;
show databases like "db*";
desc database db_hive_01;

-- 创建和使用数据库
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
          [COMMENT database_comment]
          [LOCATION hdfs_path]
          [WITH DBPROPERTIES (property_name=property_value, ...)];

create database db_hive_01;
create database if not exists db_hive_01;
create database if not exists db_hive_01 comment 'Hive数据库01';
create database if not exists db_hive_01 comment 'Hive数据库01' location '/user/ubuntu/hive/warehouse/db_hive_01.db';
use db_hive_01;


-- 修改数据库
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)


-- 删除数据库
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
drop database if exists db_hive_01 cascade; --如果数据库中存在表和数据，则删除需要cascade关键字，这样数据库的目录及其子目录也将被删掉

表相关：

-- 创建表
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];  -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE



-- 创建普通表示例(用location来指定数据文件的位置，这样创建表的时候也加载了数据)
create table if not exists db_hive.student(
  id int comment '学生id', 
  name string comment '学生姓名'
) comment '学生表'
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/ubuntu/hive/warehouse/student'; 

-- 创建一个外部表示例(企业中大部分应用场景下使用外部表，创建外部表时必须要指定location参数)
create external table if not exists db_hive.student(
  id int comment '学生id', 
  name string comment '学生姓名'
) comment '学生表'
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t'
    COLLECTION ITEMS TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/ubuntu/datas/student';

-- 普通表与外部表(external table)的区别：
1）创建表时：创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。
2）删除表时：在删除表的时候，内部表的元数据和数据会被一起删除， 而外部表只删除元数据，不删除数据。这样外部表相对来说更加安全些，数据组织也更加灵活，方便共享源数据。 
3）对于n个外部表指向HDFS同一个数据文件夹，当更改这个文件夹中的数据时，这n个外部表都会受到影响。 



-- 分区表
分区表实际上就是对应一个hdfs文件系统上的一个独立的文件夹，该文件夹下是该分区所有的数据文件。hive中的分区就是分目录，把一个大的数据集根据业务的需要划分成更小的数据集。在查询时通过where子句中的表达式来选择查询所需要的指定的分区，这样的查询效率会体改很多。
分区表的注意事项：对于普通表，先上传(hdfs put)表相关数据文件到HDFS相对应的文件目录再创建表的方式 与 先创建表再加载(hive load)表相关的数据文件到HDFS 这两种方式的效果是相同的；对于分区表，先上传(hdfs put)表相关数据文件到HDFS相对应的文件目录再创建表之后，select查询该表格，结果是查不到刚才上传的数据的，因为分区表除了在mysql数据库存放与普通表相同的信息，还在mysql存储分区的相关信息（可以查询mysql-metastore：select * from patitions），这时候HIVE端分区表需要采取修复的方法，以在mysql端增加新创建分区表的metastore-patition信息。
修复方法一：（一次性修复，msck：metastore check）
msck repair table db_hive_01.dept_patition;
修复方法二：（企业中常用）
alter table db_hive_01.dept_partition add patition(month='20170204',day='01');



-- 创建分区表
-- 先创建分区表，后向分区表中加载数据
create external table if not exists dept(deptno int, dname string, loc string)
partitioned by(month string, day string) --拥有两级分区
row format delimited fields terminated by '\t'
location '/user/ubuntu/datas/dept';

load data local inpath '/opt/datas/dept01.txt' into table db_hive_01.dept partition(month='201708', day='01');
load data local inpath '/opt/datas/dept02.txt' into table db_hive_01.dept partition(month='201708', day='02');     

-- 先存存在数据文件，后创建分区表
create external table if not exists dept(deptno int, dname string, loc string)
partitioned by(month string, day string) --拥有两级分区
row format delimited fields terminated by '\t'
location '/user/ubuntu/datas/dept';

alter table dept add partition(month='201708', day='01');
alter table dept add partition(month='201708', day='02');
或
msck repair table `db_hive_01.dept`;


-- 以as的方式创建表(CATS, create table as select)
create table if not exists db_hive_01.student_ctas
as select id,name from db_hive_01.student_01_ext; --会生成内部表的数据文件student_ctas/000000_0，包含了select查到的一些数据，删除内部表时也只删除内部表的数据文件


-- 以like的方式创建表(数据非常大，可以每天的数据放在新创建的一张表中)
create external table if not exsist db_hive_01.student_ext_03
like db_hive_01.student_ext_02 location '/user/ubuntu/datas/student02';



-- 查看表信息
desc db_hive.student;
desc extended db_hive.student;
desc formatted db_hive.student;


--查询某个分区的数据：
select * from db_hive_01.dept where month='201708';
-- 查看某张分区表的分区
show partitions db_hive_01.dept_partition;     
-- 合并分区的数据：
$vi xxx.sql
    select dname from db_hive_01.dept_partition where month='201702' and day='01'                 
    union all
    select dname from db_hive_01.dept_partition where event_month='201703' and day='01';
$bin/hive -f xxx.sql > xxx/result.txt;


-- 修改表名(如果是内部表，则相应的数据存储文件路径名也会被修改，注意修改的时候前面不能加数据库名)
alter table student rename to student01;


-- 清空表数据(只能清空内部表的数据)
truncate table db_hive_01.student;


-- group by聚合查询
-- 查询每个部门的平均工资示例：
select e.deptno, avg(e.sal) from db_hibve_01.emp e group by e.deptno;  
-- 查询每个部门中每个岗位的最高薪水示例：
select e.deptno,e.job,max(e.sal) maxsal from db_hibve_01.emp e group by e.deptno,e.job;


-- having过滤
-- having与where的区别：
-- where 是针对单条记录进行筛选，having 是针对分组的结果进行筛选
-- 求部门平均薪水大于2000的部门示例：
select e.deptno,avg(sal) avgsal from db_hive_01.emp e group by e.deptno having avgsal>2000; 


-- 多表查询join...on:
select e.empno,e.ename,d.deptno,d.dname from db_hive_01.emp e join db_hive_01.dept d on e.deptno=d.deptno;



-- Hive对查询结果的排序
-- 1. order by，对全局数据的排序，只在一台节点运行，仅仅只有一个reduce，使用需谨慎！
select * from db_hive_01.emp order by empno;

-- 2. sort by，对每个reduce内部数据进行排序（即对每个分区的数据排序，默认的分区是hash-key对reduce个数取模，执行之前先设置reduce的个数），不是全局排序，对全局结果集来说是没有排序的
set mapreduce.job.reduces = 3; --默认值为-1，一个reduce
insert overwrite local directory '/opt/datas/test/results/' 
select * from db_hive_01.emp sort by empno asc; -- 会生成mapreduce.job.reduces个结果文件

-- 3. distribute by，设置MR分区，对数据进行分区，结合sort by使用
-- 注意：distribute by 必须要在sort by 之前！
-- 像下面的例子，一般有多少个部门编号(deptno)，就使用多少个reducer执行reduce任务(这样每个结果文件存储了一个分区的数据)
insert overwrite local directory '/opt/datas/test/results/' 
select * from db_hive_01.emp e distribute by e.deptno sort by e.empno;

-- 4. cluster by
-- 当distribute和sort字段相同时，使用该方式
insert overwrite local directory '/opt/datas/test/results/' 
select * from db_hive_01.dept d cluster by d.deptno;

表数据的导入和导出：

-- hive导入源数据常见的九种形式：
-- 1. 加载本地文件到hive表：
load data local inpath 'local_path' into table db_hive_01.dept;

-- 2. 加载dfs文件档hive表（加载完成后dfs文件会被删除）
load data inpath 'dfs_path' into table db_hive_01.dept;

-- 3. 覆盖表中已有的数据
load data inpath 'dfs_path' overwrite into table db_hive_01.dept;

-- 4. 加载数据到分区表
load data [local] inpath ‘filepath’ [overwrite] db_hive_01.dept partition(month='201708', day='02',..);

-- 5. 创建表时通过select加载：
create [external] table if not exists db_hive_01.dept_ctas
as select * from db_hive_01.dept;

-- 6. 创建表时通过insert加载：
create [external] table if not exists db_hive_01.dept_like
like db_hive_01.dept；
insert into db_hive_01.dept_like select * from b_hive_01.dept;

-- 7. 创建表时通过location指定加载：
create [external] table if not exists db_hive_01.dept(
    deptno int,
    dname string,
    loc string
) 
row format delimited fields terminated by '\t'
location 'pathxxx';

-- 8. 修复表时加载：
msck repair table db_hive_01.dept_patition;
alert table db_hive_01.dept_partition add patition(month='201702', day='01'); 

-- 9. import，将外部数据导入到hive表中去：
import [external] table db_hive_01.dept_partition [patition(month='201702', day='01')] from 'xxx/db_hive_01_dept_export' [location 'import_target_path'];



-- hive导出结果数据的四种常见的方式：
-- 1. 将结果数据插入到本地磁盘：
hive(db_hive_01)> insert [overwrite] [local] directory 'filepath'
                            row format delimited
                                fields terminated by '\t'
                                collection items terminited by '\n'
                            select * from db_hive_01.dept;

-- 2. 直接在本地将执行结果重定向到指定文件
bin/hive -e "select * from db_hive_01.dept;" > xxx/result.txt

-- 3. export方式将hive表中的数据导出到外部
-- 导出的路径是hdfs上的文件路径，导出操作会拷贝数据文件夹和文件到目标路径，而且会在目标路径下生成_metadata文件存储表的元数据信息
export table db_hive_01.dept_partition [partition(month='201702',day='01')] to 'distpath';

-- 4. sqoop工具导入导出hive数据到传统关系数据库 或 Hbase
略

其他hive命令:

-- hive交互窗口执行的命令历史记录位于
~/.hivehistory

-- 查看函数
show functions;
desc function upper;
desc function extended upper;

HIVE的在ETL中的应用

1. 创建表及加载原始数据（E: extract）
        create table keyllo_log_201703(
            conetnt string
        );
        load xxx;

2. 数据预处理 （T: transafer）
        create table xxx AS select xxx
        python 
            输入 >> content
            |
            |regex
            |
            输出 << ip,req_url,http_ref


3. 子表加载数据（L: load）
        load

用户自定义函数（UDF）

UDF是用户自定义函数允许用户扩展HiveSQL的功能，具体有三种UDF（UDF、UDAF、UDTF）

UDF：一进一出
UDAF：User Defined Aggregation Function，聚集函数，多进一出，类似于 count/min/max
UDTF：User Defined Table-Generating Function，一进多出，如lateral view explore()

下面以自定义一个转化小写函数进行说明：

//1. 需要的jar包
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
</dependency>


//2. 编码定义自己的UDF
package com.keyllo.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
/**
 * 定义自己的Hive function
 *   [注意事项]  
 *      a. UDF必须要有返回类型，可以返回null，但是返回类型不能是void 
 *      b. UDF中常用mapreduce中的Text/LongWritable 等类型，不推荐使用java类型
 */
public class MyLowerUDF extends UDF {
    // 继承UDF类，实现evaluate函数（支持重载）
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
    return new Text(str.toString().toLowerCase());
}


//3. 将刚写的自定义函数打包上传到 本地 或 HDFS文件系统上，并在hive中注册使用自定义函数，使用函数
//方式一：(临时创建，企业中常用，一般放在脚本中执行)
hive(db_hive_01)> add jar /opt/datas/test/lib/hive-udf.jar;
hive(db_hive_01)> create temporary function my_lower1 as 'com.keyllo.hive.udf.MyLowerUDF';
hive(db_hive_01)> select ename,my_lower1(ename) lower_ename from emp limit 5;

//方式二：（永久创建，需要将jar包上传到HDFS文件系统）
hive(db_hive_01)> create function my_lower2 as 'com.keyllo.hive.udf.MyLowerUDF' using jar 'hdfs://ns1/user/ubuntu/lib/hive-udf.jar';
hive(db_hive_01)> select ename,my_lower2(ename) lower_ename from emp limit 5;

hiveserver2 & beeline简介（应用较少）

问题导读

1.hive允许远程客户端使用哪些编程语言？

2.已经存在HiveServer为什么还需要HiveServer2？

3.HiveServer2有哪些优点？

4.hive.server2.thrift.min.worker.threads-最小工作线程数，默认为多少？

5.启动Hiveserver2有哪两种方式？

在之前的学习和实践Hive中，使用的都是CLI或者hive –e的方式，该方式仅允许使用HiveQL执行查询、更新等操作，并且该方式比较笨拙单一。幸好Hive提供了轻客户端的实现，通过HiveServer或者HiveServer2，客户端可以在不启动CLI的情况下对Hive中的数据进行操作，两者都允许远程客户端使用多种编程语言如Java、Python向Hive提交请求，取回结果。

HiveServer或者HiveServer2都是基于Thrift的，但HiveSever有时被称为Thrift server，而HiveServer2却不会。既然已经存在HiveServer为什么还需要HiveServer2呢？这是因为HiveServer不能处理多于一个客户端的并发请求，这是由于HiveServer使用的Thrift接口所导致的限制，不能通过修改HiveServer的代码修正。因此在Hive-0.11.0版本中重写了HiveServer代码得到了HiveServer2，进而解决了该问题。HiveServer2支持多客户端的并发和认证，为开放API客户端如JDBC、ODBC提供了更好的支持。

既然HiveServer2提供了更强大的功能，将会对其进行着重学习，但也会简单了解一下HiveServer的使用方法。在命令中输入hive –service help，结果如下。

$ bin/hive --service help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: beeline cli help hiveserver2 hiveserver hwi jar lineage metastore metatool orcfiledump rcfilecat schemaTool version 
Parameters parsed:
  --auxpath : Auxillary jars 
  --config : Hive configuration directory
  --service : Starts specific service/component. cli is default
Parameters used:
  HADOOP_HOME or HADOOP_PREFIX : Hadoop install directory
  HIVE_OPT : Hive options
For help on a particular service:
  ./hive --service serviceName --help
Debug help:  ./hive --debug --help

$ bin/hive --service hiveserver -help
Starting Hive Thrift Server
usage: hiveserver
 -h,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --maxWorkerThreads <arg>      maximum number of worker threads,
                                  default:2147483647
    --minWorkerThreads <arg>      minimum number of worker threads,
                                  default:100
 -p <port>                        Hive Server port number, default:10000
 -v,--verbose                     Verbose mode

#启动hiveserver服务，默认hiveserver运行在端口10000，最小100工作线程，最大2147483647工作线程。
$ bin/hive --service hiveserver -v
Starting Hive Thrift Server
Starting hive server on port 10000 with 100 min worker threads and 2147483647 max worker threads

接下来学习更强大的hiveserver2。Hiveserver2允许在配置文件hive-site.xml中进行配置管理，具体的参数为：

hive.server2.thrift.min.worker.threads – 最小工作线程数，默认为5。
hive.server2.thrift.max.worker.threads – 最小工作线程数，默认为500。
hive.server2.thrift.port – TCP 的监听端口，默认为10000。
hive.server2.thrift.bind.host – TCP绑定的主机，默认为localhost。

也可以设置环境变量HIVE_SERVER2_THRIFT_BIND_HOST和HIVE_SERVER2_THRIFT_PORT覆盖hive-site.xml设置的主机和端口号。从Hive-0.13.0开始，HiveServer2支持通过HTTP传输消息，该特性当客户端和服务器之间存在代理中介时特别有用。与HTTP传输相关的参数如下：

hive.server2.transport.mode – 默认值为binary（TCP），可选值HTTP。
hive.server2.thrift.http.port – HTTP的监听端口，默认值为10001。
hive.server2.thrift.http.path – 服务的端点名称，默认为 cliservice。
hive.server2.thrift.http.min.worker.threads – 服务池中的最小工作线程，默认为5。
hive.server2.thrift.http.max.worker.threads – 服务池中的最小工作线程，默认为500。

启动Hiveserver2有两种方式:

一种是上面已经介绍过的 bin/hive –service hiveserver2
另一种更为简洁，为 bin/hiveserver2

默认情况下，HiveServer2以提交查询的用户执行查询（true），如果hive.server2.enable.doAs设置为false，查询将以运行hiveserver2进程的用户运行。为了防止非加密模式下的内存泄露，可以通过设置下面的参数为true禁用文件系统的缓存：

fs.hdfs.impl.disable.cache – 禁用HDFS文件系统缓存，默认值为false。
fs.file.impl.disable.cache – 禁用本地文件系统缓存，默认值为false。

简单使用hiveserver2：参考

# 后台进程启动hiveserver2服务端
$ bin/hiveserver2 &


# 查看beeline帮助
/beeline  --help

# 使用beeline连接hiveserver2服务
$ bin/beeline
> !connect jdbc:hive2://localhost:10000 ubuntu pwdxxx
> ...
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name  |
+----------------+
| db_hive_01     |
| default        |
+----------------+

# 或者使用beeline连接hiveserver2服务
$ bin/beeline -u jdbc:hive2://localhost:10000/db_hive_01 -n ubuntu -p pwdxxx

hiveserver2 jdbc 驱动使用：参考 , 通常使用hive的 select * 快速查询。

package com.keyllo.hive.jdbc;
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
/**
 * hiveserver2 jdbc的用途：
 *      将分析的结果存储在hive结果表【数据量小】中，前端可以通过dao代码进行数据的查询 
 * hiveserver2的缺点：
 *      并发性不好，企业中不常用
 */
public class HiveJdbcClient {
    private static String DRIVERNAME = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(DRIVERNAME);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }

        // get connection
        Connection conn = DriverManager.
          getConnection("jdbc:hive2://hadoop01:10000/db_hive_01", "ubuntu", "xxxxxx");


        /*
        //create table
        Statement stmt = conn.createStatement();
        String tableName = "dept";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (deptno int, dname string, loc string)");

        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }

        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: '/opt/datas/test/datatap/db_hive_01_dept.txt' is a ctrl-A separated file with two fields per line
        String filepath = "/opt/datas/test/datatap/db_hive_01_dept.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        stmt.execute(sql);
        */

        // select * query
        Statement stmt = conn.createStatement();
        String tableName = "dept";      
        String sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}

hive 常见数据压缩技术

常见的压缩格式：

压缩格式：  bzip2、gzip、lzo、snappy     等，推荐使用snappy(谷歌开源)
压缩比：    bzip2 > gzip > lzo          bzip2最节省存储空间
解压缩速度： bzip2 < gzip < lzo          lzo解压缩速度是最快的

数据压缩的好处：

减少节点的io负载
节省网络带宽
提高整个job的性能
压缩格式必须具有可分割性（hdfs block决定的）

这里写图片描述

kinglyjn

关注

0
点赞
踩
10

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

专栏目录