Notes on Hive

Hive is a data warehouse platform built on top of Hadoop, providing many features for data warehouse management:
  • It defines a SQL-like language, HiveQL, which can be viewed as a mapper from SQL to MapReduce jobs.
  • It provides access interfaces such as the Hive shell, JDBC/ODBC, and Thrift clients.
First, an architecture diagram to give an overall picture:

[Figure: Hive architecture diagram]

Installing and Configuring Hive
Installation modes:
(*) Embedded mode: no MySQL required; Hive's built-in Derby database stores the metadata
(*) Local mode and remote mode: both require MySQL
Preparation:
    1. Unpack:  tar -zxvf apache-hive-2.3.0-bin.tar.gz -C ~/training/
    2. Set environment variables:  vi ~/.bash_profile
        HIVE_HOME=/root/training/apache-hive-2.3.0-bin
        export HIVE_HOME

        PATH=$HIVE_HOME/bin:$PATH
        export PATH
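        To apply the variables and confirm Hive is on the PATH, a quick check (assuming the paths above):
            source ~/.bash_profile
            echo $HIVE_HOME     # should print /root/training/apache-hive-2.3.0-bin
            which hive          # should resolve to $HIVE_HOME/bin/hive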
         
The core configuration file:  conf/hive-site.xml
1. Installing Hive in embedded mode
Features: (1) uses the built-in Derby database
              (2) supports only a single connection
              (3) intended for development and testing
        Create hive-site.xml:
            <?xml version="1.0" encoding="UTF-8" standalone="no"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
            <configuration>
                <property>
                    <name>javax.jdo.option.ConnectionURL</name>
                    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
                </property>
                 
                <property>
                    <name>javax.jdo.option.ConnectionDriverName</name>
                    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
                </property>   
                 
                <property>
                    <name>hive.metastore.local</name>
                    <value>true</value>
                </property>   
 
                <property>
                    <name>hive.metastore.warehouse.dir</name>
                    <value>file:///root/training/apache-hive-2.3.0-bin/warehouse</value>
                </property>           
            </configuration>
 
        Initialize the metastore:
            schematool -dbType derby -initSchema
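
        After initialization, a quick smoke test (assuming HDFS/YARN are already running) is to start the Hive CLI and run a trivial statement:
            hive
            hive> show databases;    -- should at least list the built-in 'default' database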
2. Installing and configuring MySQL
Install MySQL on the virtual machine:
            yum remove mysql-libs 
            rpm -ivh mysql-community-common-5.7.18-1.el7.x86_64.rpm
            rpm -ivh mysql-community-libs-5.7.18-1.el7.x86_64.rpm
            rpm -ivh mysql-community-client-5.7.18-1.el7.x86_64.rpm
            rpm -ivh mysql-community-server-5.7.18-1.el7.x86_64.rpm
            rpm -ivh mysql-community-devel-5.7.18-1.el7.x86_64.rpm  (optional, but recommended; it is needed later when installing HUE)
                     
          Start MySQL:  service mysqld start
          or:  systemctl start mysqld.service
 
          Find the initial root password:  cat /var/log/mysqld.log | grep password
          Log in:  mysql -uroot -pxxxxx
          After logging in, change the password:  alter user 'root'@'localhost' identified by 'Welcome_1';
 
        Configure MySQL for Hive:
        Create a new database:  create database hive;
        Create a new user:
                create user 'hiveowner'@'%' identified by 'Welcome_1';
        
            Grant privileges to the user (a quick verification sketch follows below):
                       grant all on hive.* TO 'hiveowner'@'%'; 
                       grant all on hive.* TO 'hiveowner'@'localhost' identified by 'Welcome_1';
           
            A free GUI tool:  http://www.mysqlfront.de/
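
            To confirm the new account works before wiring up Hive (a sanity check; the grant ... identified by form above is MySQL 5.7 syntax):
                mysql -uhiveowner -pWelcome_1
                mysql> show grants for 'hiveowner'@'%';
                mysql> show databases;   -- the 'hive' database should be visible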
3. Local mode and remote mode
        Create hive-site.xml:
            <?xml version="1.0" encoding="UTF-8" standalone="no"?>
            <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
            <configuration>
                <property>
                    <name>javax.jdo.option.ConnectionURL</name>
                    <value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
                </property>               
 
                <property>
                    <name>javax.jdo.option.ConnectionDriverName</name>
                    <value>com.mysql.jdbc.Driver</value>
                </property>   
 
                <property>
                    <name>javax.jdo.option.ConnectionUserName</name>
                    <value>hiveowner</value>
                </property>   
 
                <property>
                    <name>javax.jdo.option.ConnectionPassword</name>
                    <value>Welcome_1</value>
                </property>                   
             
            </configuration>
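
Two steps are easy to forget in this mode: the MySQL JDBC driver jar must be copied into Hive's lib directory (the jar name below is an example; use whichever connector version you downloaded), and the metastore schema must be initialized against MySQL instead of Derby:
    cp mysql-connector-java-5.1.43-bin.jar $HIVE_HOME/lib/
    schematool -dbType mysql -initSchema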

Hive's design philosophy:
Use a relational database to keep the structural metadata (i.e., the table definitions), while the actual data is stored on HDFS.

IV. (Most important) Hive's Data Model

1. Internal (managed) tables: similar to tables in MySQL or Oracle
The default column delimiter is the non-printing character '\001' (Ctrl-A)

    Example: create an employee table emp
          Sample data: 7654,MARTIN,SALESMAN,7698,1981/9/28,1250,1400,30
           
          create table emp
          (empno int,
           ename string,
           job   string,
           mgr   int,
           hiredate string,
           sal   int,
           comm  int,
           deptno int
          );
           
          Create an employee table emp1, specifying a comma as the delimiter:
          create table emp1
          (empno int,
           ename string,
           job   string,
           mgr   int,
           hiredate string,
           sal   int,
           comm  int,
           deptno int
          )row format delimited fields terminated by ',';  
          
          
          
        Loading data:
        the insert statement
        the load statement: behaves like Ctrl+X for HDFS sources (the file is moved into the table directory; a local load copies the file)
            (1) Load data from HDFS:   load data inpath '/scott/emp.csv' into table emp;
            (2) Load data from the local Linux filesystem:  load data local inpath '/root/temp/emp.csv' into table emp1;
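
        Note the delimiter mismatch between the two tables: loading comma-separated data into emp (default '\001' delimiter) yields NULL columns, while emp1 parses it correctly. A quick check (a sketch; the output depends on your data):
            select * from emp limit 3;    -- columns come back NULL: delimiter mismatch
            select * from emp1 limit 3;   -- columns parse correctly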

2. Partitioned tables: improve query performance
Example: create a table partitioned by the department number deptno

         create table emp_part
              (empno int,
               ename string,
               job   string,
               mgr   int,
               hiredate string,
               sal   int,
               comm  int
              )partitioned by (deptno int)
              row format delimited fields terminated by ',';

Insert data into the partitioned table via subqueries:

insert into table emp_part partition(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=10;
insert into table emp_part partition(deptno=20) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=20;
insert into table emp_part partition(deptno=30) select empno,ename,job,mgr,hiredate,sal,comm from emp1 where deptno=30;
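
Each insert creates one partition, which maps to its own subdirectory under the table's HDFS directory; this can be verified with:
    show partitions emp_part;
    -- deptno=10
    -- deptno=20
    -- deptno=30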
3. External tables: only the table structure is defined in Hive; the data lives in a directory on HDFS
    create external table ext_student
    (sid int,sname string,age int)
    row format delimited fields terminated by ','
    location '/students';
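
    The defining property of an external table: dropping it removes only the metadata. A small sketch, assuming comma-separated student files have already been uploaded to /students:
        select * from ext_student;   -- reads whatever files are under /students
        drop table ext_student;      -- removes only the table definition;
                                     -- the files under /students stay on HDFS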
4. Bucketed tables: similar to hash partitioning
    Create a bucketed table on the employee's job column:
     
    create table emp_bucket
    (empno int,
     ename string,
     job string,
     mgr int,
     hiredate string,
     sal int,
     comm int,
     deptno int
    )clustered by (job) into 4 buckets
    row format delimited fields terminated by ',';
 
    Requires the setting:  set hive.enforce.bucketing = true;  (needed on Hive 1.x; Hive 2.x enforces bucketing by default)
    Insert data:
    insert into table emp_bucket select * from emp1;
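
    Each row lands in one of the 4 bucket files according to hash(job) % 4, which also enables sampling by bucket (standard TABLESAMPLE syntax; the rows returned depend on your data):
        select * from emp_bucket tablesample(bucket 1 out of 4 on job);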
5. Views
    (*) A view is a virtual table (it stores no data)
    (*) Purpose: simplify complex queries (a usage sketch follows below)
         create view view10
         as
         select * from emp1 where deptno=10;
 
    (*) Note: materialized views, by contrast, do cache data
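
    Querying the view is identical to querying the underlying select; a view by itself adds no performance benefit, it only simplifies the SQL:
         select * from view10;
         -- equivalent to: select * from emp1 where deptno=10;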

*) Note: how do we verify that performance actually improved? ----> Key point: relational databases (e.g., Oracle) and Hive handle this in a similar way
----> the SQL execution plan

Execution plans in Hive
 
Query the employees of department 10
 
 
Case 1: querying the internal table
    explain select * from emp1 where deptno=10;
    STAGE DEPENDENCIES:
      Stage-0 is a root stage
 
    STAGE PLANS:
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            TableScan
              alias: emp1
              Statistics: Num rows: 1 Data size: 629 Basic stats: COMPLETE Column stats: NONE
              Filter Operator
                predicate: (deptno = 10) (type: boolean)
                Statistics: Num rows: 1 Data size: 629 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: int), comm (type: int), 10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 1 Data size: 629 Basic stats: COMPLETE Column stats: NONE
                  ListSink
 
 
 
Case 2: querying the partitioned table
    explain select * from emp_part where deptno=10;
    STAGE DEPENDENCIES:
      Stage-0 is a root stage
 
    STAGE PLANS:
      Stage: Stage-0
        Fetch Operator
          limit: -1
          Processor Tree:
            TableScan
              alias: emp_part
              Statistics: Num rows: 3 Data size: 118 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: empno (type: int), ename (type: string), job (type: string), mgr (type: int), hiredate (type: string), sal (type: int), comm (type: int), 10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 3 Data size: 118 Basic stats: COMPLETE Column stats: NONE
                ListSink
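
Comparing the two plans: the plan for the partitioned table contains no Filter Operator at all, because partition pruning resolves deptno=10 at compile time and only the deptno=10 subdirectory is read, whereas the internal table is scanned in full and filtered row by row. The pruned input path can be seen in the extended plan:
    explain extended select * from emp_part where deptno=10;
    -- the input path listed is only .../emp_part/deptno=10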
Hive User-Defined Functions (UDFs)
  1. Create a class (e.g., CheckSalaryGrade) under the udf package, extend UDF, and override the evaluate method to implement the business logic
  2. Package it as a jar, here named myudf.jar
  3. Deploy the jar (the custom function):
    add it to Hive's classpath
         add jar /root/temp/myudf.jar;
    
    create an alias for the custom function
         create temporary function checksal as 'udf.CheckSalaryGrade';    
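
    Once registered, the function is called like any built-in (a hypothetical call, assuming evaluate takes a salary and returns a grade):
         select ename, sal, checksal(sal) from emp1;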
    

This is only a quick outline of the UDF workflow. For a more detailed treatment, this article is recommended: https://blog.csdn.net/HG_Harvey/article/details/77688735

This article explains the underlying principles clearly and is worth a read: https://www.cnblogs.com/qingyunzong/p/8707885.html

This article walks through the operations in detail and is worth bookmarking: http://blog.51cto.com/xpleaf/2084781
