Installing and Using Hive on CentOS 7


Table of Contents

Background

Installation, Configuration, and Running

Commands

Inserting Data

Table Bucketing

Regex SerDe Storage Format

Sorting and Clustering

Subqueries

Views

User-Defined Functions

Writing a UDF

Writing a UDAF (Aggregate Function)

Generic Table-Generating Function (UDTF)

Table Partitioning

Dropping Columns

Dynamic Partitioning

Exporting a Table to the File System

Querying Elements of Composite Structures

Built-in Functions

case...when...then...else Statements

Nested select Statements

Floating-Point Comparison

like and rlike

Grouping with group by

Optimization Options

Local Mode

Parallel Execution

Strict Mode

Compression

Archiving and Unarchiving Partitions

Macros

RCFile

Custom IO Formats (Input as an Example)

xpath

Output all ids

Output the content of child tags with a given id under a given tag

Evaluate an expression over a given tag

Word Count

Installing the Tez Engine

Steps

Installing the Presto Visualization Components

Steps

Conclusion

Background

This post records how to install, configure, and use Hive on CentOS 7. Hadoop 2 should be installed first; see the earlier article on installing and deploying Hadoop 2.5.0 on CentOS 7.

Installation, Configuration, and Running

Installation is simple: just unpack the tarball.
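For reference, a rough sketch of the unpack step (the tarball name follows the standard Apache naming and is an assumption; the target directory matches the paths used in hive-env.sh below):

[root@localhost szc]# tar -zxvf apache-hive-2.3.7-bin.tar.gz
[root@localhost szc]# mv apache-hive-2.3.7-bin/ apache-hive-2.3.7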

For configuration, you need to edit hive-site.xml and hive-env.sh, both of which live in Hive's conf directory.

After editing, hive-site.xml looks like this:

<?xml version="1.0" encoding="utf-8" standalone="no"?>

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:mysql://192.168.0.102:3306/test</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
	<value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
	<value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
	<value>root</value>
  </property>

  <property>
    <name>hive.metastore.local</name>
	<value>false</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
	<value>false</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.57.141:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>

  <property>
    <name>hive.support.concurrency</name>
    <value>true</value>
  </property>
    <property>
    <name>hive.enforce.bucketing</name>
    <value>true</value>
  </property>
    <property>
    <name>hive.exec.dynamic.partition.mode</name>
    <value>nonstrict</value>
  </property>
  <property>
    <name>hive.txn.manager</name>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
  </property>
    <property>
    <name>hive.compactor.initiator.on</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.compactor.worker.threads</name>
    <value>1</value>
  </property>

  <property>
    <name>hive.server2.transport.mode</name>
    <value>http</value>
  </property>

  <property>
    <name>hive.server2.thrift.http.port</name>
    <value>10001</value>
  </property>

  <property>
    <name>hive.server2.thrift.http.path</name>
    <value>cliservice</value>
  </property>

  <property>
    <name>datanucleus.schema.autoCreateTables</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.exec.compress.intermediate</name>
    <value>true</value>
  </property>
  
  <property>
    <name>hive.exec.compress.output</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.output.compression.type</name>
    <value>RECORD</value>
  </property>

  <property>
    <name>hive.archive.enabled</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.57.141:10020</value>
  </property>

  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.exec.parallel</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.mapred.mode</name>
    <value>nonstrict</value>
  </property>

  <property>
    <name>hive.archive.enabled</name>
    <value>true</value>
  </property>

  <property>
	<name>datanucleus.schema.autoCreateTables</name>
	<value>true</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
 </property>

  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>false</value>
 </property>
</configuration>

Here 192.168.0.102 is the Windows machine's IP, where MySQL runs, and 192.168.57.141 is the CentOS machine's IP, where Hive runs.

The contents of hive-env.sh are as follows:

export HADOOP_HOME=/home/szc/hadoop-2.7.2

export HIVE_CONF_DIR=/home/szc/apache-hive-2.3.7/conf

Then copy the MySQL JDBC driver jar into Hive's lib directory and start MySQL. To keep the time zones consistent, set the time zone from the MySQL command line:

> set global time_zone='+8:00';

Now the Hive metastore can be started:

[root@localhost bin]# hive --service metastore &

Once it is up, run hive to enter the command-line interface.
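A minimal example (assuming Hive's bin directory is the current directory, matching the prompt above):

[root@localhost bin]# hive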

Commands

The commands below were originally tested on Windows 10, but they work on CentOS 7 as well.

Inserting Data

First create a table:

create table sales(widget_id int, qty int, street string, city string, state string, zip int, sale_date string) row format delimited fields terminated by ',';

The trailing clause

row format delimited fields terminated by ',';

means that when each line of the data file is read, it is split into columns on commas.

Then load data from a local file:

load data local inpath 'C:\data.txt' into table sales;

Finally, run a query to check the result:

select * from sales;

The result is as follows.

The imported data.txt file contains the following:

1,15,HuaShan Street,Anyang Henan,China,455000,2019-11-03
2,15,Dongfeng Road,Anyang Henan,China,455000,2019-11-03
3,15,Wenfeng Street,Anyang Henan,China,455000,2019-11-03
3,15,Zhangde Street,Anyang Henan,China,455000,2019-11-03

Table Bucketing

Bucketing makes querying and sampling more convenient.

First create a bucketed table:

create table sales_bkt(widget_id int, qty int, street string, city string, state string, zip int, sale_date string) clustered by(widget_id) into 4 buckets;

This creates the sales_bkt table, bucketed into 4 buckets by widget_id.

Then load data into it from the existing table:

insert overwrite table sales_bkt select * from sales;

The table directory now contains four bucket files:

hadoop fs -ls /user/hive/warehouse/sales_bkt

Inspecting one of them gives the result shown below.

This is because we bucketed the data into 4 buckets by widget_id, so Hive takes each row's widget_id modulo 4, and the remainder is the number of the bucket the row goes into.
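To spot-check a single bucket file as described above, a command like the following could be used (the 000000_0 file name is an assumption based on Hive's default bucket naming):

hadoop fs -cat /user/hive/warehouse/sales_bkt/000000_0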

Afterwards, to sample the data, we can run:

select * from sales_bkt tablesample (bucket 1 out of 3 on widget_id);

That is, take roughly 1/3 of the data; the result is shown below.

Sampling a bucketed dataset only scans some of the buckets rather than the whole dataset, which makes it efficient.

Regex SerDe Storage Format

SerDe is short for serializer/deserializer. When a data file has to be read in a particular format, a regular expression can be used to define the deserialization. The corresponding create-table statement looks like this:

create table stations(usaf string, wban string, name string)
    row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    with serdeproperties (
        'input.regex' = '(\\d{6}) (\\d{5}) (.{29}) .*'
);

Here, row format serde specifies which SerDe class to use for serialization and deserialization; the fully qualified class name is required. The with serdeproperties clause supplies extra properties, in this case RegexSerDe's input.regex property, i.e. the regular expression to match.

The expression contains three capture groups (separated by spaces): a 6-digit integer, a 5-digit integer, and 29 arbitrary characters. The capture groups map to the table columns in order.

Then load the data:

load data inpath '/user/szc/station_fixed_width.txt' overwrite into table stations;

Show 5 rows:

select * from stations limit 5;

The result is as follows.

Sorting and Clustering

Sorting uses the sort by clause, while distribute by specifies which rows go to which reducer, for the aggregation that follows.

If the sort by columns are the same as the distribute by columns, the two clauses can be combined into a single cluster by clause.
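As a quick sketch of the cluster by shorthand (the column choice here is only illustrative), the following two queries are equivalent:

select widget_id, qty, street from sales_test distribute by qty sort by qty;

select widget_id, qty, street from sales_test cluster by qty;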

An example of distribute by together with sort by:

select widget_id, qty, street from sales_test distribute by street sort by qty DESC, widget_id ASC;

Note that in the sort by clause, the order of the columns determines the sorting priority. The output is shown below.

Subqueries

A subquery is a select statement nested inside another select statement. For example, to query the average of each street's daily maximum sales quantity, we can write:

select street, sale_date, AVG(max_qty) from (
    select street, sale_date, MAX(qty) as max_qty 
    from sales_test 
    where qty >= 0
    group by street, sale_date)mq 
group by street, sale_date;

Because the outer query accesses the inner query as if it were a table, the inner query must be given an alias, here mq.

The result is shown below.

Views

To simplify query statements and improve query security, views can be used for complex queries.

For example, the subquery above can be implemented with views as follows.

First, create a view that selects the valid records:

create view valid_record as select * from sales_test where qty >= 0;

Then create a second view that computes the daily maximum per street, using the first view:

create view max_qty (street, sale_date, max_qty) as select street, sale_date, MAX(qty) from valid_record group by street, sale_date;

Finally, use the second view to query the average of each street's daily maximum:

select street, sale_date, AVG(max_qty) from max_qty group by street, sale_date;

The result is as follows.

User-Defined Functions

User-defined functions must be written in Java, compiled into a jar, and loaded when Hive starts. The loading command looks like this:

hive --auxpath D:/develop/ideaWorkspace/szc_avro_test.jar

Writing a UDF

A custom UDF class must extend org.apache.hadoop.hive.ql.exec.UDF and implement evaluate() methods. Sample code:

package hive;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {
    private Text mResult = new Text(); // reusing a Writable object avoids repeated allocation

    public Text evaluate(Text str) {

        if (str == null) {
            return null;
        }

        mResult.set(StringUtils.strip(str.toString()));

        return mResult;
    }


    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }

        mResult.set(StringUtils.strip(str.toString(), stripChars));

        return mResult;
    }
}

After compiling it into a jar and loading the jar, use it in Hive as follows: register the Strip class and name it with create function:

create function strip as 'hive.Strip';

Then strip can be used just like a built-in function:

select strip('   szc  ') from sales;

select strip('abszcba', 'ab') from sales;

Both results are shown in the figure below.

Writing a UDAF (Aggregate Function)

A UDAF is harder to write than a UDF, because values are aggregated within each block, the per-block results are then merged, and only then is the final value produced.

The sample code below shows a custom UDAF. Each static inner class implementing UDAFEvaluator corresponds to one evaluation function, and its init(), iterate(), terminatePartial(), merge(), and terminate() methods must all be implemented.

package hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.io.IntWritable;

public class Maximum extends UDAF {
    public static class MaximumIntUDAFEvaluator implements UDAFEvaluator {
        private IntWritable mResult;

        @Override
        public void init() {
            // initialize the evaluator and reset its internal state

            mResult = null;
        }

        public boolean iterate(IntWritable value) {
            // called for each new value being aggregated; the evaluator updates its internal state accordingly
            // the parameter type corresponds to the user's input type

            if (value == null) {
                return true;
            }

            if (mResult == null) {
                mResult = new IntWritable(value.get());
            } else {
                mResult.set(Math.max(mResult.get(), value.get()));
            }

            return true;
        }

        public IntWritable terminatePartial() {
            // called when Hive needs a partial aggregation result; returns an object wrapping the current aggregation state
            // the return type corresponds to the parameter type of merge()

            return mResult;

        }

        public boolean merge(IntWritable other) {
            // called when Hive merges one partial aggregate with another partial aggregate
            // the parameter type must match the return type of terminatePartial()

            return iterate(other);
        }

        public IntWritable terminate() {
            // called when Hive needs the final aggregation result; returns the evaluator's internal state object
            // the return type corresponds to the result type the user sees

            return mResult;
        }
    }
}

Here it is a coincidence that merge() has the same logic as iterate(). If we compute the mean of double values, merge() and iterate() differ, because two partial means cannot simply be added and divided by 2. The implementation is shown below.

package hive;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;

public class Mean extends UDAF {
    public static class MeanDoubleUDAFEvaluator implements UDAFEvaluator {
        public static class PartialResult {
            double sum;
            long count;
        }

        private PartialResult mResult;

        @Override
        public void init() {
            mResult = null;
        }

        public boolean iterate(DoubleWritable value) {
            if (value == null) {
                return true;
            }

            if (mResult == null) {
                mResult = new PartialResult();
            }

            mResult.sum += value.get();
            mResult.count += 1;

            return true;
        }

        public PartialResult terminatePartial() {
            return mResult;
        }

        public boolean merge(PartialResult other) {
            if (other == null) {
                return true;
            }

            if (mResult == null) {
                mResult = new PartialResult();
            }

            mResult.sum += other.sum;
            mResult.count += other.count;

            return true;
        }

        public DoubleWritable terminate() {
            if (mResult == null) {
                return null;
            }

            return new DoubleWritable(mResult.sum / mResult.count);
        }
    }
}

Run Maximum first by registering the function:

create function maximum as 'hive.Maximum';

Then invoke it:

select maximum(qty) from sales_test;

The result is shown below.

Then run Mean, again by registering and invoking the function:

create function mean_double as 'hive.Mean';

select mean_double(qty) from sales_test;

The result is shown below.

Generic UDAF (collecting values from multiple rows into one row)

package udaf;

import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import udaf.buffer.MkAggregationBuffer;

import java.util.ArrayList;
import java.util.List;

public class GenericMkListEvaluator extends GenericUDAFEvaluator {
    private PrimitiveObjectInspector mInputOI; // ObjectInspector for the primitive input type

    private StandardListObjectInspector mListOI, mInternalMergeOI;
    // ObjectInspectors for the final output and the intermediate output; both are List types

    @Override
    public ObjectInspector init(Mode m, ObjectInspector[] parameters)
            throws HiveException { // boilerplate pattern for init()
        super.init(m, parameters);

        if (m == Mode.PARTIAL1) {
            mInputOI = (PrimitiveObjectInspector) parameters[0];

            return ObjectInspectorFactory.getStandardListObjectInspector(
                    ObjectInspectorUtils.getStandardObjectInspector(mInputOI)
            );
        } else if (parameters[0] instanceof StandardListObjectInspector) {
            mInternalMergeOI = (StandardListObjectInspector) parameters[0];

            mInputOI = (PrimitiveObjectInspector) mInternalMergeOI.getListElementObjectInspector();

            mListOI = (StandardListObjectInspector) ObjectInspectorUtils.getStandardObjectInspector(mInternalMergeOI);

            return mListOI;
        } else {
            mInputOI = (PrimitiveObjectInspector) ObjectInspectorUtils.getStandardObjectInspector(parameters[0]);

            return ObjectInspectorFactory.getStandardListObjectInspector(
                    ObjectInspectorUtils.getStandardObjectInspector(mInputOI));
        }
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
        MkAggregationBuffer buffer = new MkAggregationBuffer();

        reset(buffer);

        return buffer;
    }

    @Override
    public void reset(AggregationBuffer aggregationBuffer) throws HiveException {
        ((MkAggregationBuffer) aggregationBuffer).setContainer(new ArrayList<>());
    }

    @Override
    public void iterate(AggregationBuffer aggregationBuffer, Object[] objects)
            throws HiveException {
        if (objects == null || objects.length != 1) {
            return;
        }

        Object p = objects[0];  // the column value; only single-argument calls are supported

        if (p != null) {
            MkAggregationBuffer buffer = (MkAggregationBuffer) aggregationBuffer;
            putIntoList(p, buffer);
        }
    }

    private void putIntoList(Object p, MkAggregationBuffer buffer) {
        Object copy = ObjectInspectorUtils.copyToStandardObject(p, mInputOI);

        buffer.getContainer().add(copy);
    }

    @Override
    public Object terminatePartial(AggregationBuffer aggregationBuffer)
            throws HiveException {
        MkAggregationBuffer buffer = (MkAggregationBuffer) aggregationBuffer;

        return new ArrayList<Object>(buffer.getContainer());
    }

    @Override
    public void merge(AggregationBuffer aggregationBuffer, Object o)
            throws HiveException {
        if (o == null) {
            return;
        }

        MkAggregationBuffer buffer = (MkAggregationBuffer) aggregationBuffer;

        List<Object> partialResult = (List<Object>) mInternalMergeOI.getList(o);

        for (Object ob : partialResult) {
            putIntoList(ob, buffer);
        }
    }

    @Override
    public Object terminate(AggregationBuffer aggregationBuffer) throws HiveException {
        MkAggregationBuffer buffer = (MkAggregationBuffer) aggregationBuffer;

        return new ArrayList<Object>(buffer.getContainer());
    }
}
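The evaluator above references udaf.buffer.MkAggregationBuffer, which is not shown here (nor is the udaf.GenericCollect resolver that is registered below). A minimal sketch of what the buffer class presumably looks like, assuming it only needs to hold the list being built:

package udaf.buffer;

import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;

import java.util.ArrayList;
import java.util.List;

// Assumed shape of the aggregation buffer used by GenericMkListEvaluator:
// it simply wraps the list of collected values.
public class MkAggregationBuffer extends GenericUDAFEvaluator.AbstractAggregationBuffer {
    private List<Object> container = new ArrayList<>();

    public List<Object> getContainer() {
        return container;
    }

    public void setContainer(List<Object> container) {
        this.container = container;
    }
}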

Usage is the same as before: package it into a jar, add the jar with --auxpath, register the function with create function ... as ..., and then use it:

PS C:\WINDOWS\system32> hive --auxpath D:/hive_ext.jar

...
hive> create temporary function collectx as 'udaf.GenericCollect';
...
OK
20/01/23 22:58:00 INFO ql.Driver: OK
...
Time taken: 0.699 seconds
20/01/23 22:58:00 INFO CliDriver: Time taken: 0.699 seconds
...

hive> select collectx(street) from sales;
...
["Wenfeng Street","Huasha Road","Xiange Street","Dongfeng Road","Huasha Road","Xiange Street","Huasha Road","Xiange Street","Dongfeng Road"]

Time taken: 10.405 seconds, Fetched: 1 row(s)
...

Generic Table-Generating Function (UDTF)

package udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableConstantIntObjectInspector;
import org.apache.hadoop.io.IntWritable;

import java.util.ArrayList;

public class GenericFor extends GenericUDTF {
    private IntWritable mStart, mEnd, mInc;
    private Object[] mForwardObj;

    @Override
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        mStart = ((WritableConstantIntObjectInspector) argOIs[0]).getWritableConstantValue();
        mEnd = ((WritableConstantIntObjectInspector) argOIs[1]).getWritableConstantValue();

        if (argOIs.length == 3) {
            mInc = ((WritableConstantIntObjectInspector)argOIs[2]).getWritableConstantValue();
        } else {
            mInc = new IntWritable(1);
        }

        mForwardObj = new Object[2]; // one output row, containing two columns
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();

        fieldNames.add("col0"); // name of the first generated column
        fieldNames.add("col1"); // name of the second generated column (names are arbitrary)

        fieldOIs.add(PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspector.PrimitiveCategory.INT)); // data type of the first generated column
        fieldOIs.add(PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(PrimitiveObjectInspector.PrimitiveCategory.STRING)); // data type of the second generated column

        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    public void process(Object[] objects) throws HiveException {
        for (int i = mStart.get(); i < mEnd.get(); i += mInc.get()) {
            mForwardObj[0] = i; // value of the first column
            mForwardObj[1] = "value " + i; // value of the second column
            forward(mForwardObj); // emit one row
        }
    }

    public void close() throws HiveException { }
}

After packaging, adding the jar, and registering the function, call it like this:

hive> select forx(1, 5, 2) from sales;

Output:

1       value 1
3       value 3
1       value 1
3       value 3
1       value 1
3       value 3
1       value 1
3       value 3
1       value 1
3       value 3
1       value 1
3       value 3
1       value 1
3       value 3
1       value 1
3       value 3
1       value 1
3       value 3

Time taken: 3.675 seconds, Fetched: 18 row(s)

The sales table contains nine rows, and for each of them process() is called and generates two rows.

Table Partitioning

Partition columns can be specified when creating a table; a partition column must not be one of the table's existing columns.

create table employees(
    name string, 
    salary float, 
    subordinates array<string>, 
    deductions map<string, float>, 
    address struct<
        street:string, 
        city:string, 
        zip:int>
) partitioned by(country string, state string);

The partition must be specified when loading data:

load data local inpath 'C:\Users\songzeceng\Desktop\data\employees' into table employees partition (country='US', state='CA');

The describe formatted command shows detailed information about the table with formatted output:

describe formatted employees;

The output is as follows.

The show partitions command lists the table's partitions:

show partitions employees;

The output is as follows.

Besides specifying a partition when loading data, a partition can also be added by altering the table:

alter table employees add partition(country='CN', state='Henan');

Running show partitions again gives the following output.

The location of a specific partition can also be inspected:

describe extended employees partition(country='CN', state='Henan');

The output is as follows.

Dropping Columns

Columns can be dropped with a replace columns statement. Suppose the original create-table statement is:

create table sales_out(widget_id int, quantity int, city_out string, street_out string, zip int, sale_date string, state string) partitioned by(city string, street string);

To drop the city_out and street_out columns:

alter table sales_out replace columns(widget_id int, quantity int, state string, sale_date string, zip int);

Dynamic Partitioning

First make sure the output table sales_out is partitioned, with city and street as the partition keys. Then enable dynamic partitioning and set the dynamic partition mode to nonstrict:

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

The following statement then performs a fully dynamic partitioned insert. Note that the source table's city and street columns must come last, in that order, and the remaining fields in the select clause must follow the same order as the columns of the output table sales_out.

from sales ss insert overwrite table sales_out partition(city, street) select ss.widget_id, ss.quantity, ss.state, ss.sale_date, ss.zip, ss.city, ss.street;

If an error about missing columns appears, just add the missing columns. For example, when the query is

from sales ss insert overwrite table sales_out partition(city = 'Anyang', street) select ss.widget_id, ss.quantity, ss.zip, ss.sale_date, ss.state;

the following error is reported:

ERROR parse.CalcitePlanner: CBO failed, skipping CBO.
org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:37 Cannot insert into target table because column number/types are different 'street': Table insclause-0 has 6 columns, but query has 5 columns.

This means one column is missing, so one more column simply needs to be added to the select clause.
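A hedged sketch of the corrected statement, appending the dynamic partition column ss.street as the final select expression:

from sales ss insert overwrite table sales_out partition(city = 'Anyang', street) select ss.widget_id, ss.quantity, ss.zip, ss.sale_date, ss.state, ss.street;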

Exporting a Table to the File System

insert overwrite local directory 'file:\C:\Users\songzeceng\Desktop\input' select * from sales_out where sales_out.street = 'Huashan Road';

Querying Elements of Composite Structures

Query an array element:

select name, subordinates[0] from employees;

Result:

John Doe        Mary Smith
Mary Smith      Bill King
Todd Jones      NULL
Bill King       NULL
Boss Man        John Doe
Fred Finance    Stacy Accountant
Stacy Accountant        NULL

Time taken: 2.112 seconds, Fetched: 7 row(s)

Query a map element:

select name, deductions["Insurance"] from employees;

Result:

John Doe        0.1
Mary Smith      0.1
Todd Jones      0.1
Bill King       0.1
Boss Man        0.05
Fred Finance    0.05
Stacy Accountant        0.1

Time taken: 1.776 seconds, Fetched: 7 row(s)

Query a struct field:

select name, address.street from employees;

Result:

John Doe        1 Michigan Ave.
Mary Smith      100 Ontario St.
Todd Jones      200 Chicago Ave.
Bill King       300 Obscure Dr.
Boss Man        1 Pretentious Drive.
Fred Finance    2 Pretentious Drive.
Stacy Accountant        300 Main St.

Time taken: 1.749 seconds, Fetched: 7 row(s)

Built-in Functions

upper() converts lowercase to uppercase, and round() rounds to the nearest integer:

select upper(name), salary, deductions["Federal Taxes"], round(salary * (1 - deductions["Federal Taxes"])) from employees;

count() counts rows and avg() computes the mean:

select count(*), avg(salary) from employees;

The distinct keyword removes duplicates:

select count(distinct street), count(distinct widget_id) from sales;

explode() expands composite structures. When subordinates is selected directly, the result is:

hive> select subordinates as sub from employees;
.....

["Mary Smith","Todd Jones"]
["Bill King"]
[]
[]
["John Doe","Fred Finance"]
["Stacy Accountant"]
[]

Time taken: 1.846 seconds, Fetched: 7 row(s)

With explode(), the result is:

hive> select explode(subordinates) as sub from employees;
....

Mary Smith
Todd Jones
Bill King
John Doe
Fred Finance
Stacy Accountant

Time taken: 1.833 seconds, Fetched: 6 row(s)

20/01/19 15:04:41 INFO CliDriver: Time taken: 1.833 seconds, Fetched: 6 row(s)

case...when...then...else Statements

These work like if-else; the following statement brackets salaries into tiers:

select name, salary, case when salary < 50000 then 'low' when salary >= 50000 and salary < 70000 then 'middle' when salary >= 70000 and salary < 100000 then 'high' else 'very high' end as bracket from employees;

The result is as follows:

John Doe        100000.0        very high
Mary Smith      80000.0 high
Todd Jones      70000.0 high
Bill King       60000.0 middle
Boss Man        200000.0        very high
Fred Finance    150000.0        very high
Stacy Accountant        60000.0 middle

Time taken: 1.921 seconds, Fetched: 7 row(s)

20/01/19 15:19:01 INFO CliDriver: Time taken: 1.921 seconds, Fetched: 7 row(s)

The following setting makes Hive avoid launching MapReduce jobs where possible, but if something goes wrong the logs can no longer be inspected on the history server, so use it with care:

set hive.exec.mode.local.auto = true;

Nested select Statements

Nesting select statements reduces repetition of expressions:

select e.* from (select name, salary, deductions["Federal Taxes"], salary * (1 - deductions["Federal Taxes"]) as paied from employees) e where e.paied > 70000;

Floating-Point Comparison

A cast can be used to compare floating-point numbers precisely:

select name, salary, deductions["Federal Taxes"] from employees where deductions["Federal Taxes"] > cast(0.2 as float);

like and rlike

like performs wildcard pattern matching; the following prints the rows whose address.street contains Ch:

select name, address.street from employees where address.street like '%Ch%';

rlike is the regular-expression version of like; the following prints the rows whose address.street contains Ch or On:

select name, address.street from employees where address.street rlike '.*(Ch|On).*';

Grouping with group by

group by groups rows, and having filters the groups. The following statement groups by street and outputs the street groups whose average quantity is greater than 10:

select street, avg(quantity) from sales group by street having avg(quantity) > 10.0;

Optimization Options

Local Mode

Setting mapred.job.tracker to local makes Hive run jobs locally, which shortens execution time on small datasets. Alternatively, hive.exec.mode.local.auto can be set to true so that Hive decides by itself when to run locally. As mentioned earlier, though, automatic local mode (hive.exec.mode.local.auto = true) is not recommended, because task logs may then be untraceable. The commands are:

set mapred.job.tracker = local;

set hive.exec.mode.local.auto = true;

Parallel Execution

Hive splits a query into several stages; when there are no dependencies between stages, they can be executed in parallel. Enable parallel execution with:

set hive.exec.parallel = true;

Strict Mode

Enable strict mode with the following command.

set hive.mapred.mode = strict;

Strict mode prevents three situations:

1) A where clause with no filter on the partition column, which would scan all partitions.

select * from employees; -- error

select * from employees where country = 'US'; -- correct

2) An order by clause without a limit clause.

select * from sales order by widget_id; -- error

select * from sales order by widget_id limit 100; -- correct

3) Join operations that would produce a Cartesian product.

select * from sales join sales_test where (sales.widget_id = sales_test.widget_id); -- error

select * from sales join sales_test on (sales.widget_id = sales_test.widget_id); -- correct

All of the settings above can also be written into hiveHome\conf\hive-site.xml, as shown below:

 <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>

  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
  </property>


  <property>
    <name>hive.exec.parallel</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.mapred.mode</name>
    <value>strict</value>
  </property>

Compression

Compression of MapReduce intermediate results and final output can be enabled by setting the corresponding properties in hiveHome\conf\hive-site.xml:

  <property>
    <name>hive.exec.compress.intermediate</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.exec.compress.output</name>
    <value>true</value>
  </property>

The default codec is DefaultCodec; a different codec can be specified with the following parameter:

  <property>
    <name>mapred.output.compression.codec</name>
    <value>org.apache.hadoop.io.compress.GzipCodec</value>
  </property>

Sequence files in Hadoop divide a file into blocks, and each block can be decompressed independently, so the file remains splittable. Adding stored as sequencefile when creating a table makes that table use the sequence file storage format (a sketch follows the property below). The compression type for sequence files can be chosen with the following option:

  <property>
    <name>mapred.output.compression.type</name>
    <value>block</value>
  </property>
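A hedged sketch of the stored as sequencefile clause mentioned above (the table name and columns are only illustrative):

create table sales_seq(widget_id int, qty int, street string) stored as sequencefile;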

Archiving and Unarchiving Partitions

First put hadoop-archives-2.8.5.jar into hiveHome/lib (any jar matching your Hadoop 2 version should work), then enable partition archiving in hive-site.xml:

  <property>
    <name>hive.archive.enabled</name>
    <value>true</value>
  </property>

For a partitioned table, the archive command is:

alter table employees archive partition(country='US');

After it succeeds, check the archived partition directory from inside Hive:

dfs -ls  /user/hive/warehouse/employees/country=US

The output is as follows.

To unarchive a partition that has been archived:

alter table employees unarchive partition(country='US');

Then check the result in Hive:

dfs -ls  /user/hive/warehouse/employees/country=US

The output is as follows.

Macros

A macro is defined as follows, for example a sigmoid() function:

create temporary macro sigmoid (x double) 1.0 / (1.0 + exp(-x));

Use it like this:

select sigmoid(5) from sales limit 1;

RCFile

RCFile is a columnar storage file format supported by Hive. The following command creates a table stored as RCFile, specifying the SerDe as well as the input and output formats:

create table columnTable (key int, value int) row  format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' stored as inputformat 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat';

After inserting data, the contents of the RCFile can be viewed with rcfilecat:

PS C:\WINDOWS\system32> hive --service rcfilecat /user/hive/warehouse/columntable/000000_0
...

4       5

3       2

Custom IO Formats (Input as an Example)

By implementing InputFormat, InputSplit, and RecordReader we can supply our own input format, input split, and record reader. The end result here is a table that can only return the results of function calls, but cannot display any table data.

The custom input format implements the InputFormat interface:

package ioFormat;

import org.apache.hadoop.mapred.*;

import java.io.IOException;

public class MyInputFormat implements InputFormat {

    @Override
    public InputSplit[] getSplits(JobConf jobConf, int i) throws IOException {
        InputSplit[] inputSplits = new InputSplit[1];
        inputSplits[0] = new MyInputSplit();
        return inputSplits;
    }

    @Override
    public RecordReader getRecordReader(InputSplit inputSplit, JobConf jobConf, Reporter reporter) throws IOException {
        return new MyRecordReader();
    }
}

The custom input split implements the InputSplit interface. It only returns one row, so no serialization or deserialization is needed:

package ioFormat;

import org.apache.hadoop.mapred.InputSplit;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class MyInputSplit implements InputSplit {
    @Override
    public long getLength() throws IOException {
        return 1; // only one row is returned
    }

    @Override
    public String[] getLocations() throws IOException {
        return new String[] {
                "192.168.57.141"
        };
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException { }

    @Override
    public void readFields(DataInput dataInput) throws IOException { }
}

The custom record reader implements the RecordReader interface and likewise returns only one record:

package ioFormat;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

import java.io.IOException;

public class MyRecordReader implements RecordReader<Text, Text> {
    private boolean mHasNext = true;

    @Override
    public boolean next(Text text, Text text2) throws IOException {

        // return only one row
        if (mHasNext) {
            mHasNext = false;
            return true;
        }

        return false;
    }

    @Override
    public Text createKey() {
        return new Text("");
    }

    @Override
    public Text createValue() {
        return new Text("");
    }

    @Override
    public long getPos() throws IOException {
        return 0;
    }

    @Override
    public void close() throws IOException { }

    @Override
    public float getProgress() throws IOException {
        return mHasNext ? 0F : 1F; // progress is 0 before the single row is read, 1 afterwards
    }
}

After packaging, add the jar to Hive:

hive> add jar file:///D:/hive_ext.jar;

Then create a table using the custom input format; the output format is the default HiveIgnoreKeyTextOutputFormat:

hive> create table dualText(fake string) stored as inputformat 'ioFormat.MyInputFormat' outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

After loading data, test both selecting the table's rows and calling a function on it:

hive> load data local inpath 'C:\Users\songzeceng\Desktop\input.txt' into table dualText;

hive> select * from dualText;
...

Time taken: 2.002 seconds, Fetched: 1 row(s)

20/01/24 15:50:31 INFO CliDriver: Time taken: 2.002 seconds, Fetched: 1 row(s)

...

hive> select count(*) from dualText;
....

2

Time taken: 5.989 seconds, Fetched: 1 row(s)

xpath

xpath is a built-in Hive function for parsing XML. A few examples follow.

Output all ids

hive> select xpath('<a><b id="szc">b1</b><b id="songzeceng">b2</b></a>', '//@id') from sales limit 1;

...

["szc","songzeceng"]

...

Output the content of child tags with a given id under a given tag

The following outputs the content of the child tags of <a> whose id is szc:

hive> select xpath('<a><b id="szc">b1</b><b id="songzeceng">b2</b><b id="szc">b3</b></a>', 'a/*[@id="szc"]/text()') from sales limit 1;
...

["b1","b3"]

...

Evaluate an expression over a given tag

The following computes (b + d) * c under tag a:

hive> select xpath_double('<a><b id="szc">3</b><d>5</d><c>4</c></a>', '(a/b + a/d) * a/c') from sales limit 1;
...

32.0

...

Word Count

create table docs(line string); -- create the raw table for the documents

load data inpath '/user/szc/file.txt' overwrite into table docs; -- load the data

create table word_count as select word, count(1) as count from (select explode(split(line, " ")) as word from docs) w group by word order by word;
-- word count, ordered by word

split() splits each line into words on spaces, and explode() turns the resulting array into multiple rows. The trailing w is the alias of the query (select explode(split(line, " ")) as word from docs); when statements are nested, the inner query must be given an alias, otherwise it cannot be parsed.

Installing the Tez Engine

The Tez engine is much faster than MR, so it is well worth installing. The commands below were run on CentOS 7.

Steps

1. Unpack the Tez tarball:

[root@localhost szc]# tar -zxvf apache-tez-0.9.1-bin.tar.gz

2. Rename the extracted directory:

[root@localhost szc]# mv apache-tez-0.9.1-bin/ apache-tez-0.9.1

3. Go into hiveHome and copy conf/hive-env.sh.template to conf/hive-env.sh:

[root@localhost apache-hive-2.3.7]# cp conf/hive-env.sh.template conf/hive-env.sh

4. Edit hive-env.sh:

[root@localhost apache-hive-2.3.7]# vim conf/hive-env.sh

Add the following:

export TEZ_HOME=/home/szc/apache-tez-0.9.1

export TEZ_JARS=""

for jar in `ls $TEZ_HOME | grep jar`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
done

for jar in `ls $TEZ_HOME/lib`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
done

export HIVE_AUX_JARS_PATH=$HADOOP_HOME/share/hadoop/common/hadoop-lzo-0.4.20.jar$TEZ_JARS

5. Create conf/tez-site.xml:

[root@localhost apache-hive-2.3.7]# vim conf/tez-site.xml

Add the following:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/tez/apache-tez-0.9.1/lib,${fs.defaultFS}/tez/apache-tez-0.9.1</value>
    </property>

    <property>
        <name>tez.lib.uris.classpath</name>
        <value>${fs.defaultFS}/tez/apache-tez-0.9.1/lib,${fs.defaultFS}/tez/apache-tez-0.9.1</value>
    </property>

    <property>
        <name>tez.use.cluster.hadoop-libs</name>
        <value>true</value>
    </property>

    <property>
        <name>tez.history.logging.service.class</name>
        <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
    </property>
</configuration>

6. Upload share/tez.tar.gz from the Tez directory to /tez/share/lib on HDFS:

[root@localhost bin]# ./hadoop fs -mkdir  /tez

[root@localhost bin]# ./hadoop fs -mkdir /tez/share

[root@localhost bin]# ./hadoop fs -mkdir /tez/share/lib

[root@localhost bin]# ./hadoop fs -put /home/szc/apache-tez-0.9.1/share/tez.tar.gz /tez/share/lib

7. Edit conf/hive-site.xml to enable Tez and turn off local MR:

[root@localhost apache-hive-2.3.7]# vim conf/hive-site.xml

Add the configuration:

<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>

<property>
    <name>hive.exec.mode.local.auto</name>
    <value>false</value>
</property>

8. Start the metastore and Hive, and you are done.

Installing the Presto Visualization Components

Presto (together with the yanagishima web client below) provides a visual query interface for Hive. The commands below were run on CentOS 7.

Steps

1. Prepare the Presto server package and the yanagishima visualization client package, and upload both to the CentOS 7 server.

2. Deploy the server first: unpack it, rename the directory, and cd into it:

[root@localhost szc]# tar -zxvf presto-server-0.196.tar.gz

[root@localhost szc]# mv presto-server-0.196 presto

[root@localhost szc]# cd presto/

3. Create a data directory and an etc directory; the former stores data, the latter stores configuration files:

[root@localhost presto]# mkdir data

[root@localhost presto]# mkdir etc

4. Go into the etc directory and create jvm.config to set the JVM options:

[root@localhost presto]# cd etc

[root@localhost etc]# vim jvm.config

This runs the JVM in server mode and sets the maximum heap size, the garbage collector to use, the G1 region size, and so on. If you are interested in these JVM flags, see the earlier article on JVM garbage collection and garbage collectors.

-server

-Xmx2G

-XX:+UseG1GC

-XX:G1HeapRegionSize=512K

-XX:+UseGCOverheadLimit

-XX:+ExplicitGCInvokesConcurrent

-XX:+HeapDumpOnOutOfMemoryError

-XX:+ExitOnOutOfMemoryError

5. Create a catalog directory for the Hive configuration, cd into it, and create a hive.properties file that configures the Hive connection:

[root@localhost etc]# mkdir catalog

[root@localhost etc]# cd catalog/

[root@localhost catalog]# vim hive.properties

The key setting is the address of the Hive metastore:

connector.name=hive-hadoop2

hive.metastore.uri=thrift://192.168.57.141:9083

6. Back in the etc directory, create node.properties to configure the node (environment, node id, data directory):

node.environment=production

node.id=1

node.data=/home/szc/presto/data

7. Create config.properties to configure the coordinator and workers:

[root@localhost etc]# vim config.properties

This mainly sets whether the current node is the coordinator, whether the coordinator also works as a worker, the coordinator's address, and so on:

coordinator=true

node-scheduler.include-coordinator=true

http-server.http.port=8881

query.max-memory=1GB

discovery-server.enabled=true

discovery.uri=http://192.168.57.141:8881

8. The server side is now configured. Switch to the server's root directory and start the server:

[root@localhost presto]# bin/launcher run

9. Configure the yanagishima client. Again, unpack it first (no renaming this time) and cd straight into it:

[root@localhost szc]# unzip yanagishima-18.0.zip

[root@localhost szc]# cd yanagishima-18.0/

10. Edit the conf/yanagishima.properties file:

[root@localhost yanagishima-18.0]# vim conf/yanagishima.properties

The key settings are the Presto server's address and the client's port:

jetty.port=7080


# 30 minutes. If presto query exceeds this time, yanagishima cancel the query.

presto.query.max-run-time-seconds=1800


# 1GB. If presto query result file size exceeds this value, yanagishima cancel the query.

presto.max-result-file-byte-size=1073741824

# you can specify freely. But you need to specify same name to presto.coordinator.server.[...] and presto.redirect.server.[...] and catalog.[...] and schema.[...]

presto.datasources=szc-presto
auth.your-presto=false

# presto coordinator url

presto.coordinator.server.szc-presto=http://192.168.57.141:8881

# almost same as presto coordinator url. If you use reverse proxy, specify it

presto.redirect.server.szc-presto=http://192.168.57.141:8881


# presto catalog name

catalog.szc-presto=hive

# presto schema name

schema.szc-presto=default

# if query result exceeds this limit, to show rest of result is skipped

select.limit=500


# limit to convert from tsv to values query

to.values.query.limit=500

# authorization feature

check.datasource=false
sql.query.engines=presto

# 1GB. If hive query result file size exceeds this value, yanagishima cancel the query.
hive.max-result-file-byte-size=1073741824

cors.enabled=false

11. Start yanagishima:

[root@localhost yanagishima-18.0]# bin/yanagishima-start.sh

12. After exposing port 7080, open 192.168.57.141:7080 in a browser on Windows to see the visual Hive interface.

Type an HQL statement there and click Run at the bottom right of the input box to get the results.

13. Notes

When comparing timestamps in Presto, the timestamp keyword must be added explicitly:

SELECT t FROM a WHERE t > timestamp '2017-01-01 00:00:00';

Presto does not support insert overwrite; you have to delete first and then insert into.
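A hedged sketch of that workaround (the table, column, and staging-table names are hypothetical):

delete from my_table where dt = '2020-05-01';

insert into my_table select * from my_staging_table where dt = '2020-05-01';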

14. Theory

The Presto architecture is shown in the figure below.

Pros and cons

Conclusion

That covers quite a lot. If you have any questions, feel free to discuss them in the comments.
