Basic Hive Usage

亚存

Category: hadoop

Article link: https://blog.csdn.net/scjthree/article/details/26683071

The original post is here. It was written by an expert in our department, and I am linking it to help boost its PageRank.

http://blog.csdn.net/lihm0_1/article/details/17579903

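1. Generating test data

It is said that Hive can only recognize a single-character field delimiter in a file, so an invisible character is the practical choice. Fine. Next time let's use the bell character and see whether the file starts singing when it is opened. Here ASCII 06 is used as the delimiter. In bash, type it by pressing Ctrl+V and then Ctrl+F, which displays as ^F. Note that typing the two literal characters ^ and F does not work.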

seq 1 9999999 | awk '{print $1"^F"$1"aaaaaaaaaaaaaaaa"}' > a.txt
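Because the delimiter is invisible, it is easy to get wrong when typing. As a cross-check, here is a minimal Java sketch that builds lines of the same shape as the awk command above. The class and method names are my own, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class TestDataGenerator {
    // ASCII 6 (ACK): the character bash displays as ^F and Hive writes as '\006'.
    static final char DELIMITER = '\u0006';

    // Builds one line shaped like the awk output: <n><delimiter><n>aaaaaaaaaaaaaaaa
    static String line(int n) {
        return n + String.valueOf(DELIMITER) + n + "aaaaaaaaaaaaaaaa";
    }

    // Builds the lines for every n in [from, to], like `seq from to | awk ...`.
    static List<String> lines(int from, int to) {
        List<String> out = new ArrayList<>();
        for (int n = from; n <= to; n++) {
            out.add(line(n));
        }
        return out;
    }

    public static void main(String[] args) {
        // Print a few lines with the delimiter made visible for inspection.
        for (String l : lines(1, 3)) {
            System.out.println(l.replace(String.valueOf(DELIMITER), "^F"));
        }
    }
}
```

Writing the delimiter as an escape in code avoids depending on how a terminal or editor handles the control character.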

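Here is the file I generated, for reference.

2. Listing and switching databases

Hive can hold multiple databases, and the SQL statements are quite similar to MySQL's.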

show databases;

use test;

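The result is as follows:

3. Creating a table and loading data

The CREATE TABLE syntax resembles Oracle's, but the Hive version used here has no INSERT ... VALUES statement (it was added later, in Hive 0.14), so rows cannot be inserted into a table directly. Data usually comes from files or from tools such as Sqoop. Here the table is populated from a file.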

create table t(id int,msg string) row format delimited fields terminated by '\006' stored as textfile;

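For the detailed syntax, see the documentation or search online. A few brief notes:

row format delimited fields terminated by sets the field delimiter. Here it is the invisible character '\006' (octal notation for ASCII 6).

stored as sets the file format used to store the table's data. For plain text it is usually textfile. Running it:

After loading the data, each line is correctly split into two columns.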
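To make the two-column split concrete, here is a small Java sketch of what splitting one line on the '\006' delimiter amounts to. This is only an illustration, not Hive's actual parsing code. The DelimitedRow class is hypothetical, and it assumes each line contains exactly one delimiter.

```java
class DelimitedRow {
    final int id;
    final String msg;

    DelimitedRow(int id, String msg) {
        this.id = id;
        this.msg = msg;
    }

    // Splits a line at the first ASCII 6 character into (id int, msg string).
    // This sketch throws on malformed input, which is a simplification of my own.
    static DelimitedRow parse(String line) {
        int pos = line.indexOf('\u0006');
        if (pos < 0) {
            throw new IllegalArgumentException("no field delimiter in line");
        }
        int id = Integer.parseInt(line.substring(0, pos));
        String msg = line.substring(pos + 1);
        return new DelimitedRow(id, msg);
    }

    public static void main(String[] args) {
        DelimitedRow row = parse("5\u00065aaaaaaaaaaaaaaaa");
        System.out.println(row.id + " | " + row.msg);
    }
}
```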

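4. Partitioned tables

Hive also supports partitioning, which is really powerful. (If you are unfamiliar with partitioning, I suggest reading up on Oracle partitioning first.)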

create table t2(id int,msg string) partitioned by( indate string) row format delimited fields terminated by '\006' stored as textfile;

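It uses indate as the partition column, so different data can be placed in different partitions.

Loading data works much like the load above. Just specify the target partition.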

load data local inpath '/tmp/b.txt' overwrite into table t2 partition(indate="20131228");

b.txt is a text file generated the same way as a.txt. Its data is as follows:

Query it like this:

select count(1) from t2 where indate>='20131228' and indate<'20131229'

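I ran it twice to make it easier to capture the result in a screenshot. You can see that the partition contains two records.

How partitioning works: I have not had time to look closely. At a glance, each partition is stored separately in HDFS (Hive keeps every partition in its own subdirectory under the table directory, named like indate=20131228), so a query that filters on the partition column only needs to read the matching partitions.

That said, our cluster usually uses only 2-3 machines. Even with somewhat more data, at most one more map task is added and the run time does not increase noticeably. So my personal feeling is that partitioning does not help our current workload much.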
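On the storage side, each partition value normally maps to its own subdirectory of the table directory, named indate=<value>. A query with a range filter on indate then only reads the directories whose value passes the filter. The Java sketch below imitates that selection. The table path is a hypothetical example and the PartitionPruning class is my own illustration, not Hive code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionPruning {
    // Directory that would hold one partition of the table.
    static String partitionDir(String tableDir, String indate) {
        return tableDir + "/indate=" + indate;
    }

    // Keeps only the partitions with lowInclusive <= indate < highExclusive,
    // compared as strings, like the WHERE clause in the query above.
    static List<String> prune(String tableDir, List<String> partitions,
                              String lowInclusive, String highExclusive) {
        List<String> dirs = new ArrayList<>();
        for (String p : partitions) {
            if (p.compareTo(lowInclusive) >= 0 && p.compareTo(highExclusive) < 0) {
                dirs.add(partitionDir(tableDir, p));
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Hypothetical warehouse location; the real path depends on the installation.
        String tableDir = "/user/hive/warehouse/test.db/t2";
        List<String> all = Arrays.asList("20131227", "20131228", "20131229");
        System.out.println(prune(tableDir, all, "20131228", "20131229"));
    }
}
```

With the three example partitions, only the directory for 20131228 is selected, so the other two would not be scanned.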

5. Exporting data

insert overwrite local directory '/tmp/t' select * from t;

The data is exported to files under the specified directory.

6. Creating views

Hive can also create views.

create view v_t as select * from t where id>5 order by id asc;

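Creating the view did not launch a MapReduce job, but querying it did. So presumably creation only stores metadata without actually running anything, and Hadoop is started only at query time.

7. Execution plans

Hive statements have execution plans too, which surprised me.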

hive (test)> explain select count(1) from t a join t2 b on (a.id=b.id);

ABSTRACT SYNTAX TREE:

(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME t) a) (TOK_TABREF (TOK_TABNAME t2) b) (= (. (TOK_TABLE_OR_COL a) id) (. (TOK_TABLE_OR_COL b) id)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTION count 1)))))

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-1

Map Reduce

Alias -> Map Operator Tree: --map phase

TableScan

alias: a

Reduce Output Operator

key expressions:

expr: id

type: int

sort order: +

Map-reduce partition columns:

expr: id

type: int

tag: 0

TableScan

alias: b

Reduce Output Operator

key expressions:

expr: id

type: int

sort order: +

Map-reduce partition columns:

expr: id

type: int

tag: 1

Reduce Operator Tree: --reduce phase

Join Operator

condition map:

Inner Join 0 to 1

condition expressions:

handleSkewJoin: false

Select Operator

Group By Operator

aggregations:

expr: count(1)

bucketGroup: false

mode: hash

outputColumnNames: _col0

File Output Operator

compressed: false

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-2

Map Reduce

Alias -> Map Operator Tree:

hdfs://hc1:9000/tmp/hive-hc/hive_2013-12-30_14-05-40_946_5561877030225943803/-mr-10002

Reduce Output Operator

sort order:

tag: -1

value expressions:

expr: _col0

type: bigint

Reduce Operator Tree:

Group By Operator

aggregations:

expr: count(VALUE._col0)

bucketGroup: false

mode: mergepartial

outputColumnNames: _col0

Select Operator

expressions:

expr: _col0

type: bigint

outputColumnNames: _col0

File Output Operator

compressed: false

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0

Fetch Operator

limit: -1

Time taken: 0.576 seconds

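I am not very familiar with this part yet, so I will leave it at that for now.

8. User-defined functions

These are quite useful. For example, one of our business scenarios needs a non-equi join, which the Hive version used here does not support.

The scenario: an IP library table, ip_temp, with three columns: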
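As a rough reading of the plan output: Stage-1 sends rows of both tables to reducers keyed on id, joins them there, and emits a partial count (the Group By in hash mode). Stage-2 then merges the partial counts into the final number (the Group By in mergepartial mode). The toy Java simulation below follows that shape. The modulo routing and the in-memory arrays are my own stand-ins, not Hive internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoStageJoinCount {
    // Per-key counters inside one reducer: [0] = rows from a, [1] = rows from b.
    private static long[] counters(Map<Integer, long[]> bucket, int id) {
        return bucket.computeIfAbsent(id, k -> new long[2]);
    }

    // Stage-1: route every id to a reducer, join per key, emit one partial count per reducer.
    static List<Long> stage1(int[] aIds, int[] bIds, int reducers) {
        List<Map<Integer, long[]>> buckets = new ArrayList<>();
        for (int r = 0; r < reducers; r++) {
            buckets.add(new HashMap<>());
        }
        for (int id : aIds) {
            counters(buckets.get(Math.floorMod(id, reducers)), id)[0]++; // tag 0
        }
        for (int id : bIds) {
            counters(buckets.get(Math.floorMod(id, reducers)), id)[1]++; // tag 1
        }
        List<Long> partials = new ArrayList<>();
        for (Map<Integer, long[]> bucket : buckets) {
            long count = 0;
            for (long[] c : bucket.values()) {
                count += c[0] * c[1]; // an inner join yields a-rows times b-rows per key
            }
            partials.add(count);
        }
        return partials;
    }

    // Stage-2: merge the partial counts into the final count.
    static long stage2(List<Long> partials) {
        long total = 0;
        for (long p : partials) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 2, 3};
        int[] b = {2, 2, 3, 4};
        System.out.println(stage2(stage1(a, b, 3)));
    }
}
```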

startip endip country

123 456 China

There is also an access table, session_temp, which records each user's IP. The IP table is used to look up the country the user is in.

222

In Oracle the join looks like this:

select * from session_temp t1,ip_temp t2 where t2.startip<=t1.ip and t2.endip>=t1.ip;

In Hive, the approach used here is a user-defined function.

First write a Java function class that extends the UDF class.

   
package com.zy.hive.function;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;

public class range extends UDF {
    // Build the IP lookup library (synthetic test data, built once when the class loads)
    private static final List<IpRange> ipLib = new ArrayList<IpRange>();
    static {
        for (int i = 0; i < 5695104; i++) {
            ipLib.add(new IpRange(i, i + 5, "USA" + i));
        }
    }

    // Called by Hive for each row: returns the country of the first range
    // that contains ip, or null if no range matches
    public String evaluate(int ip) {
        for (IpRange ir : ipLib) {
            if (ip >= ir.getStartip() && ip <= ir.getEndip()) {
                return ir.getCountry();
            }
        }
        return null;
    }

    // Local smoke test, runnable without Hive
    public static void main(String[] args) {
        range a = new range();
        for (int i = 0; i < 100; i++) {
            System.out.println(a.evaluate(2));
        }
    }
}

// One row of the IP library: an inclusive [startip, endip] range and its country
class IpRange {
    private int startip;
    private int endip;
    private String country;

    public IpRange(int startip, int endip, String country) {
        this.startip = startip;
        this.endip = endip;
        this.country = country;
    }
    public int getStartip() {
        return startip;
    }
    public void setStartip(int startip) {
        this.startip = startip;
    }
    public int getEndip() {
        return endip;
    }
    public void setEndip(int endip) {
        this.endip = endip;
    }
    public String getCountry() {
        return country;
    }
    public void setCountry(String country) {
        this.country = country;
    }
}

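Then export it as a jar and place it on a node where Hive runs.

Load the jar with add jar, register the function with create temporary function, and then call it in a query.

Overall, Hive is quite powerful, but it deserves thorough investigation before you adopt it.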
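One thing to watch: the UDF above scans the whole list for every row, which gets slow with millions of ranges. If the ranges do not overlap (an assumption that the synthetic test data above does not satisfy, but a real IP library usually does), a sorted map can find the candidate range in logarithmic time. A possible sketch with a hypothetical IpRangeIndex class, shown outside Hive so it runs on its own:

```java
import java.util.Map;
import java.util.TreeMap;

class IpRangeIndex {
    // Value stored per range start: the inclusive end and the country.
    private static final class Range {
        final int endIp;
        final String country;

        Range(int endIp, String country) {
            this.endIp = endIp;
            this.country = country;
        }
    }

    // Ranges keyed by start IP. Assumes ranges never overlap.
    private final TreeMap<Integer, Range> byStart = new TreeMap<>();

    void add(int startIp, int endIp, String country) {
        byStart.put(startIp, new Range(endIp, country));
    }

    // Finds the range with the greatest start <= ip, then checks its end.
    String lookup(int ip) {
        Map.Entry<Integer, Range> e = byStart.floorEntry(ip);
        if (e == null || ip > e.getValue().endIp) {
            return null;
        }
        return e.getValue().country;
    }

    public static void main(String[] args) {
        IpRangeIndex index = new IpRangeIndex();
        index.add(123, 456, "China");
        System.out.println(index.lookup(222));
    }
}
```

The same lookup could replace the loop inside evaluate, with the index built once in the static initializer.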
