Hive--Connecting to HiveServer2 from the Command Line and from Code, Common Hive Commands, Custom UDF Functions, and Sorting

Table of Contents

1 Hive--Connecting to HiveServer2 from the Command Line and from Code

1.1 Configuring the HiveServer2 Web UI Parameters

1.2 Starting HiveServer2

1.3 Connecting to HiveServer2 with Beeline

1.4 Querying HiveServer2 from Code

1.5 Connecting to Hive with DBeaver

 

2 Hive--Common Hive Commands

2.1 The hive Command

2.2 Hive Shell Commands

 

3 Hive--Custom UDF Functions (User-Defined Functions)

3.1 Required POM Dependency

3.2 Requirement

3.3 MyLower Code

3.4 Using a UDF as a Temporary Function

3.5 Registering a UDF Permanently

3.6 Initializing UDFs When the Hive Shell Starts

3.7 Turning a UDF into a Hive Built-in Function

3.8 Custom UDTF Code: Implement split, Where the First Argument Is the String to Split and the Second Is the Delimiter

4 Hive--Sorting

4.1 Creating a Table in Hive

4.2 Changing the Number of Reduce Tasks

4.3 The Four BY Clauses Explained



 

  • Everything below refers to Hive version 1.1.0-cdh5.16.2

1 Hive--Connecting to HiveServer2 from the Command Line and from Code

1.1 Configuring the HiveServer2 Web UI Parameters

  • HiveServer2 is a server process; once it is started, clients can connect to it over JDBC
  • It also comes with a web UI that shows which sessions are currently connected, which statements they have executed, and other details
  • Enabling the HiveServer2 web UI requires the following settings in hive-site.xml
<property>
	<name>hive.server2.webui.host</name>
	<value>your hostname</value>
</property>

<property>
	<name>hive.server2.webui.port</name>
	<value>19990</value>
</property>

1.2 Starting HiveServer2

$HIVE_HOME/bin/hiveserver2
OR
$HIVE_HOME/bin/hive --service hiveserver2
  • View the hiveserver2 command help
[work@bigdatatest01 ~]$ $HIVE_HOME/bin/hiveserver2 -H
usage: hiveserver2
    --deregister <versionNumber>   Deregister all instances of given
                                   version from dynamic service discovery
 -H,--help                         Print help information
    --hiveconf <property=value>    Use value for given property
  • Appending --hiveconf lets you set any hive-site.xml property directly on the command line as key=value
  • Change the HiveServer2 port
$HIVE_HOME/bin/hiveserver2 --hiveconf hive.server2.thrift.port=14000

1.3 Connecting to HiveServer2 with Beeline

Hive--HiveServer2+Clients

  • hostname: the hostname of the machine where HiveServer2 is running
  • port: the port HiveServer2 is listening on
  • username: the user name
  • passwd: the password
[work@bigdatatest01 ~]$ $HIVE_HOME/bin/beeline -u jdbc:hive2://hostname:port/default -n hadoop

OR

[work@bigdatatest01 ~]$ $HIVE_HOME/bin/beeline
beeline> !connect jdbc:hive2://hostname:port/default username passwd

1.4 Querying HiveServer2 from Code

1.4.1 Requirement

Query all of the databases from code.

1.4.2 Code

1.4.2.1 POM dependency

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.1.0-cdh5.16.2</version>
</dependency>

1.4.2.2 HiveJDBCClient Code

package com.xk.bigdata.hive.jdbc;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJDBCClient {

    private static String driverName = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver
        Class.forName(driverName);
        // replace "hive" here with the name of the user the queries should run as
        Connection con = DriverManager.getConnection("jdbc:hive2://bigdatatest03:10000/default", "hive", "");
        Statement stmt = con.createStatement();
        // show databases
        String sql = "show databases";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
        // Release the JDBC resources once the result set has been consumed
        res.close();
        stmt.close();
        con.close();
    }
}

HiveJDBCClient Code

1.4.3 Result

1.4.3.1 Console output

Running: show databases
bigdata
default
ods_bigdata
test
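  • The same query can also be written with try-with-resources so that the connection, statement, and result set are always closed, even when the query fails. This is only a sketch, not part of the original post; the class name HiveJDBCClientTryWithResources is made up, and the JDBC URL and user are the same assumptions as in the listing above
package com.xk.bigdata.hive.jdbc;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJDBCClientTryWithResources {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String sql = "show databases";
        System.out.println("Running: " + sql);
        // try-with-resources closes the connection, statement, and result set even if the query throws
        try (Connection con = DriverManager.getConnection("jdbc:hive2://bigdatatest03:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet res = stmt.executeQuery(sql)) {
            while (res.next()) {
                System.out.println(res.getString(1));
            }
        }
    }
}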

1.5 Connecting to Hive with DBeaver

  • This was covered in an earlier post; see the link below

DBeaver Connection to Hive

 

2 Hive--Common Hive Commands

Hive command-line usage

usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
 -h <hostname>                    Connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        Connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

2.1 The hive Command

  • The commonly used options are as follows
  • ${HIVE_HOME}/bin/hive -e + ${SQL statement}
[work@bigdatatest02 ~]$ ${HIVE_HOME}/bin/hive -e "show databases"
bigdata
default
ods_bigdata
test
  • ${HIVE_HOME}/bin/hive -f + ${SQL file}
    • Since Hive 0.14 the SQL file can also be located on HDFS
[work@bigdatatest02 hive]$ vim demo.sql
show databases;
[work@bigdatatest02 hive]$ ${HIVE_HOME}/bin/hive -f demo.sql 
bigdata
default
ods_bigdata
test
  • ${HIVE_HOME}/bin/hive -i : initialization file
    • Typically useful for registering custom UDF functions at startup
  • ${HIVE_HOME}/bin/hive --hiveconf <property=value>
    • Typically used to pass parameters into Hive; the value can then be referenced from a SQL statement or SQL file

2.2 Hive Shell Commands

  • Exit the CLI
quit
exit

Use quit or exit to leave the interactive shell.
hive> exit;
[work@bigdatatest02 hive]$ 
  • View or set a configuration parameter
  • Only effective for the current session
set <key>=<value>
Sets the value of a particular configuration variable (key).
Note: If you misspell the variable name, the CLI will not show an error.
# Check the hive.cli.print.current.db parameter
hive> set hive.cli.print.current.db;
hive.cli.print.current.db=false
# Set hive.cli.print.current.db to true
hive> set hive.cli.print.current.db=true;
# Check the hive.cli.print.current.db parameter again
hive (default)> set hive.cli.print.current.db;
hive.cli.print.current.db=true
  • Add, list, and delete files or jars
    • The jar or file can live on the local filesystem or on HDFS
add FILE[S] <filepath> <filepath>*
add JAR[S] <filepath> <filepath>*
add ARCHIVE[S] <filepath> <filepath>*

Adds one or more files, jars, or archives to the list of resources in the distributed cache. See Hive Resources below for more information.

list FILE[S]
list JAR[S]
list ARCHIVE[S]

Lists the resources already added to the distributed cache. See Hive Resources below for more information.

delete FILE[S] <filepath>*
delete JAR[S] <filepath>*
delete ARCHIVE[S] <filepath>*

Removes the resource(s) from the distributed cache.
hive (default)> add FILE /home/work/data/demo.txt;
Added resources: [/home/work/data/demo.txt]
hive (default)> list FILE;
/home/work/data/demo.txt
hive (default)> delete FILE /home/work/data/demo.txt;
hive (default)> list FILE;

hive (default)> add FILE hdfs://nameservice1/data/wc/demo.txt;
Added resources: [hdfs://nameservice1/data/wc/demo.txt]
hive (default)> list FILE;
/tmp/c26afbf8-601a-4590-8d21-57ba20ce5f58_resources/demo.txt
hive (default)> delete FILEs;
hive (default)> list FILE;

 

3 Hive--Custom UDF Functions (User-Defined Functions)

Hive official UDF documentation

  • UDFs fall into the following categories
    • UDF: one row in, one row out
    • UDAF: many rows in, one row out (aggregation)
    • UDTF: one row in, many rows out (table-generating)
A User-defined function (UDF) for use with Hive.
<p>
New UDF classes need to inherit from this UDF class (or from {@link
org.apache.hadoop.hive.ql.udf.generic.GenericUDF GenericUDF} which provides more flexibility at
the cost of more complexity).
<p>
Requirements for all classes extending this UDF are:
<ul>
<li>Implement one or more methods named {@code evaluate} which will be called by Hive (the exact
way in which Hive resolves the method to call can be configured by setting a custom {@link
UDFMethodResolver}). The following are some examples:
  • According to the official docs a custom UDF extends the UDF class; that class is now marked as deprecated, and you can instead extend org.apache.hadoop.hive.ql.udf.generic.GenericUDF to implement the function (a minimal sketch of this variant follows below)
  • Input and output types can be plain Java types or the Hadoop writable types
  • The method implemented in the custom class must be named evaluate
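  • For comparison, the same lowercasing logic built on GenericUDF could look roughly like the sketch below. This is not code from the original post: the class name MyLowerGenericUDF is made up, and a production version would normally convert its argument with an ObjectInspector converter instead of relying on toString()
package com.xk.bigdata.hive.udf.lower;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class MyLowerGenericUDF extends GenericUDF {

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        // Validate the argument count and declare that the function returns a string
        if (arguments.length != 1) {
            throw new UDFArgumentLengthException("my_lower takes exactly one argument");
        }
        return PrimitiveObjectInspectorFactory.javaStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object value = arguments[0].get();
        if (value == null) {
            return null;
        }
        // Sketch only: relies on toString(); a production UDF would use an ObjectInspector converter
        return value.toString().toLowerCase();
    }

    @Override
    public String getDisplayString(String[] children) {
        return "my_lower(" + children[0] + ")";
    }
}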

3.1 Required POM Dependency

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.1.0-cdh5.16.2</version>
</dependency>

3.2 Requirement

Implement a custom function that converts uppercase letters to lowercase.

3.3 MyLower Code

package com.xk.bigdata.hive.udf.lower;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * MyLower
 */
@Description(name = "my_lower",
        value = "_FUNC_(str) - returns lower str",
        extended = "Example:\n"
                + "  > SELECT _FUNC_('HADOOP') FROM src LIMIT 1;\n"
                + "  'hadoop'")

public class MyLower extends UDF {
    public Text evaluate(final Text s) {
        if (s == null) {
            return null;
        }
        return new Text(s.toString().toLowerCase());
    }
}

MyLower Code
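  • Because evaluate is an ordinary Java method, it can be smoke-tested locally before the jar is built. A minimal sketch (the class name MyLowerLocalTest is made up; it only needs the hive-exec dependency above on the classpath):
package com.xk.bigdata.hive.udf.lower;

import org.apache.hadoop.io.Text;

public class MyLowerLocalTest {

    public static void main(String[] args) {
        MyLower myLower = new MyLower();
        // Expected output: hadoop
        System.out.println(myLower.evaluate(new Text("HADOOP")));
        // A null input should come back as null
        System.out.println(myLower.evaluate(null));
    }
}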

3.4 Using a UDF as a Temporary Function

3.4.1 Package the jar and place it in a directory on the Linux host

[work@bigdatatest02 hive]$ pwd
/home/work/lib/hive
[work@bigdatatest02 hive]$ ll
total 8
-rw-r--r-- 1 work work 4257 Dec 15 16:40 hive-basic-1.0.jar

3.4.2 Register the temporary UDF

  • A temporary UDF is only valid for the current session
hive (default)> add JAR /home/work/lib/hive/hive-basic-1.0.jar;
Added [/home/work/lib/hive/hive-basic-1.0.jar] to class path
Added resources: [/home/work/lib/hive/hive-basic-1.0.jar]
hive (default)> create temporary function my_lower as 'com.xk.bigdata.hive.udf.lower.MyLower';
OK
Time taken: 0.328 seconds
  • Searching with the show functions command now finds the my_lower function
  • Test it with my_lower
hive (default)> select my_lower("HADOOP");
OK
hadoop
Time taken: 0.167 seconds, Fetched: 1 row(s)
  • The function works as expected

3.5 Registering a UDF Permanently

  • CREATE FUNCTION my_lower AS "${fully qualified class name}" USING JAR "${path to the custom UDF jar}"
hive> CREATE FUNCTION sayhello AS "com.xk.bigdata.hive.udf.lower.MyLower" USING JAR "/home/work/lib/hive/hive-basic-1.0.jar";
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask. Hive warehouse is non-local, but /home/work/lib/hive/hive-basic-1.0.jar specifies file on local filesystem. Resources on non-local warehouse should specify a non-local scheme/path
  • The custom UDF jar needs to be placed on HDFS
hive> CREATE FUNCTION my_lower AS "com.xk.bigdata.hive.udf.lower.MyLower" USING JAR "hdfs://nameservice1/lib/hive-basic-1.0.jar";
Added [/tmp/893136b4-bc59-44b0-8caa-a99961f8bd34_resources/hive-basic-1.0.jar] to class path
Added resources: [hdfs://nameservice1/lib/hive-basic-1.0.jar]
OK
Time taken: 0.414 seconds
  • The my_lower function can now be found; test whether it works
hive>  select my_lower("HADOOP");
OK
hadoop
Time taken: 0.675 seconds, Fetched: 1 row(s)
  • Exit the current session and test again
  • The my_lower function can still be found via show functions and keeps working in new sessions (the JDBC sketch after the drop example below shows the same call through HiveServer2)
  • Drop the permanent function
hive> drop function default.my_lower;
OK
Time taken: 0.122 seconds
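  • While the function was still registered (before the drop above), it was also callable from HiveServer2 sessions, because a permanent function is stored in the metastore together with the HDFS location of its jar. A minimal JDBC sketch, reusing the connection details assumed in section 1.4:
package com.xk.bigdata.hive.jdbc;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MyLowerJdbcDemo {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection("jdbc:hive2://bigdatatest03:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             // The permanent function is resolved via the metastore; the jar is pulled from its HDFS location
             ResultSet res = stmt.executeQuery("select my_lower('HADOOP')")) {
            while (res.next()) {
                // Expected output: hadoop
                System.out.println(res.getString(1));
            }
        }
    }
}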

3.6 Initializing UDFs When the Hive Shell Starts

[work@bigdatatest02 hive]$ vim hive-init.sql
add JAR /home/work/lib/hive/hive-basic-1.0.jar;
create temporary function my_lower as 'com.xk.bigdata.hive.udf.lower.MyLower';
[work@bigdatatest02 hive]$ hive -i hive-init.sql 
  • In other words, every time the Hive shell is opened with the -i flag, the custom UDF is registered as a temporary function
  • The my_lower function can then be found via show functions and used normally

3.7 Turning a UDF into a Hive Built-in Function

  • Modify the Hive source code
  • Edit the static block in org.apache.hadoop.hive.ql.exec.FunctionRegistry
  • Add the following line
system.registerUDF("my_lower", MyLower.class, false);
  • Then rebuild, and replace hive-exec-1.1.0-cdh5.16.2.jar in the cluster environment with the new jar

3.8 Custom UDTF Code: Implement split, Where the First Argument Is the String to Split and the Second Is the Delimiter

3.8.1 Code

package com.xk.bigdata.hive.udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;
import java.util.List;

public class MySplitUDTF extends GenericUDTF {

    List<String> list = new ArrayList<>();

    /**
     * Declare the schema (field names and object inspectors) of the rows this UDTF outputs
     *
     * @param argOIs
     * @return
     * @throws UDFArgumentException
     */
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<String> structFieldNames = new ArrayList<>();
        List<ObjectInspector> structFieldObjectInspectors = new ArrayList<>();
        structFieldNames.add("word");
        structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(structFieldNames, structFieldObjectInspectors);
    }

    /**
     * args[0] : the input string to be split
     * args[1] : the second argument, i.e. the delimiter
     *
     * @param args
     * @throws HiveException
     */
    @Override
    public void process(Object[] args) throws HiveException {
        if (args.length == 2) {
            String words = args[0].toString();
            String splitField = args[1].toString();
            String[] splits = words.split(splitField);
            for (String word : splits) {
                // Reuse the same list: clear it, add the current word, and emit it as one output row
                list.clear();
                list.add(word);
                forward(list);
            }
        } else {
            throw new HiveException("parameter error");
        }

    }

    @Override
    public void close() throws HiveException {

    }
}

MySplitUDTF Code
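  • Like the UDF, the UDTF can be smoke-tested locally before packaging by wiring in a Collector by hand. A hedged sketch, assuming the public setCollector hook that GenericUDTF exposes in Hive 1.x; the class name MySplitLocalTest is made up:
package com.xk.bigdata.hive.udtf;

import org.apache.hadoop.hive.ql.metadata.HiveException;

public class MySplitLocalTest {

    public static void main(String[] args) throws HiveException {
        MySplitUDTF udtf = new MySplitUDTF();
        // Every forward() call inside process() is delivered to this collector
        udtf.setCollector(row -> System.out.println(row));
        udtf.process(new Object[]{"hadoop,spark,flink", ","});
        udtf.close();
        // Expected output: [hadoop], [spark], [flink] -- one emitted row per line
    }
}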

3.8.2 Register the temporary function

hive>  add JAR /home/work/lib/hive/hive-basic-1.0.jar;
Added [/home/work/lib/hive/hive-basic-1.0.jar] to class path
Added resources: [/home/work/lib/hive/hive-basic-1.0.jar]
hive> create temporary function my_split as 'com.xk.bigdata.hive.udtf.MySplitUDTF';
OK
Time taken: 0.038 seconds

3.8.3 Results

hive> select * from bigdata.wc;
OK
hadoop,spark,flink
hbase,hadoop,spark,flink
spark
hadoop
hadoop,spark,flink
hbase,hadoop,spark,flink
spark
hadoop
hbase,hadoop,spark,flink

hive> select my_split(words,',') from bigdata.wc;
hadoop
spark
flink
hbase
hadoop
spark
flink
spark
hadoop
hadoop
spark
flink
hbase
hadoop
spark
flink
spark
hadoop
hbase
hadoop
spark
flink

 

4 Hive--Sorting

4.1 Creating a Table in Hive

CREATE TABLE IF NOT EXISTS bigdata.emp(
emp_no String,
emp_name String,
dept_no String
)ROW FORMAT 
DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
  • Load the data
hive> load data local inpath '/home/work/data/hive/emp.txt' overwrite into table bigdata.emp;
Loading data to table bigdata.emp
OK
Time taken: 3.162 seconds
hive> select * from bigdata.emp;
OK
7369	SMITH	20
7499	ALLEN	30
7521	WARD	30
7566	JONES	20
7654	MARTIN	30
7698	BLAKE	30
7782	CLARK	10
7788	SCOTT	20
7839	KING	10
7844	TURNER	30
7876	ADAMS	20
7900	JAMES	30
7902	FORD	20
7934	MILLER	10
Time taken: 0.641 seconds, Fetched: 14 row(s)

4.2 Changing the Number of Reduce Tasks

hive> set mapred.reduce.tasks;
mapred.reduce.tasks=-1
hive> set mapred.reduce.tasks=3;
hive> set mapred.reduce.tasks;
mapred.reduce.tasks=3
The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive will automatically figure out what should be the number of reducers.
  • The default is -1, which lets Hive work out the number of reduce tasks automatically

4.3 The Four BY Clauses Explained

  • Hive provides the following sorting-related clauses

4.3.1 order by

  • order by: global (total) ordering
  • All of the data is sent to a single reducer and sorted there, so the number of reduce tasks is 1
hive> insert overwrite local directory '/tmp/hive/sort' select * from bigdata.emp order by emp_no desc;
Query ID = work_20201215180311_b4845985-f372-4112-8c9a-9d19236a3910
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1608016084001_0004, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0004/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-15 18:03:28,726 Stage-1 map = 0%,  reduce = 0%
2020-12-15 18:03:36,970 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.39 sec
2020-12-15 18:03:43,169 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.53 sec
MapReduce Total cumulative CPU time: 6 seconds 530 msec
Ended Job = job_1608016084001_0004
Moving data to local directory /tmp/hive/sort
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.53 sec   HDFS Read: 7440 HDFS Write: 196 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 530 msec
OK
Time taken: 34.088 seconds
  • In strict mode, using ORDER BY requires adding a LIMIT
  • Switch to strict mode
hive> set hive.mapred.mode=strict;
hive> set hive.mapred.mode;
hive.mapred.mode=strict
  • Re-run the previous ORDER BY
hive> select * from bigdata.emp order by emp_no desc;
FAILED: SemanticException 1:35 Order by-s without limit are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.orderby.no.limit to false and make sure that hive.mapred.mode is not set to 'strict' to proceed. Note that you may get errors or incorrect results if you make a mistake while using some of the unsafe features.. Error encountered near token 'emp_no'
  • Add a LIMIT after the ORDER BY
hive> select * from bigdata.emp order by emp_no desc limit 1;
Query ID = work_20201215181124_d047171a-8fd9-4b83-b68b-6f0675aae83a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1608016084001_0005, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0005/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-12-15 18:11:43,143 Stage-1 map = 0%,  reduce = 0%
2020-12-15 18:11:52,553 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.04 sec
2020-12-15 18:11:58,794 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.01 sec
MapReduce Total cumulative CPU time: 6 seconds 10 msec
Ended Job = job_1608016084001_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.01 sec   HDFS Read: 7913 HDFS Write: 114 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 10 msec
OK
7934	MILLER	10
Time taken: 35.612 seconds, Fetched: 1 row(s)

4.3.2 sort by

  • sort by: local ordering within each reducer (partition)
hive> insert overwrite local directory '/tmp/hive/sort' select * from bigdata.emp sort by emp_no desc;
Query ID = work_20201215181405_80904231-c85b-481b-8ee3-ce14a09bb4d9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1608016084001_0006, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0006/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2020-12-15 18:14:24,793 Stage-1 map = 0%,  reduce = 0%
2020-12-15 18:14:33,072 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.26 sec
2020-12-15 18:14:42,335 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.21 sec
MapReduce Total cumulative CPU time: 15 seconds 210 msec
Ended Job = job_1608016084001_0006
Moving data to local directory /tmp/hive/sort
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 15.21 sec   HDFS Read: 14162 HDFS Write: 196 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 210 msec
OK
Time taken: 39.026 seconds
  • Inspect the output files (the columns are separated by Hive's default ^A delimiter, which does not show up when the files are printed)
[work@bigdatatest02 sort]$ cat 000000_0 
7844TURNER30
7839KING10
7788SCOTT20
7782CLARK10
7698BLAKE30
7654MARTIN30
[work@bigdatatest02 sort]$ cat 000001_0 
7934MILLER10
7900JAMES30
7876ADAMS20
7566JONES20
7521WARD30
7499ALLEN30
[work@bigdatatest02 sort]$ cat 000002_0 
7902FORD20
7369SMITH20
  • You can see that within each partition the rows are sorted by emp_no (the first column)

4.3.3 distribute by

  • distribute by: not for sorting; it is analogous to the partitioning concept in MapReduce
  • Partition the rows by the length of emp_name
hive> insert overwrite local directory '/tmp/hive/distribute' select * from bigdata.emp distribute by length(emp_name);
Query ID = work_20201215194310_ac24b005-4385-44b6-b785-b5beb3f67227
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1608016084001_0007, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0007/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0007
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2020-12-15 19:43:26,280 Stage-1 map = 0%,  reduce = 0%
2020-12-15 19:43:34,496 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.61 sec
2020-12-15 19:43:40,687 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 13.91 sec
MapReduce Total cumulative CPU time: 13 seconds 910 msec
Ended Job = job_1608016084001_0007
Moving data to local directory /tmp/hive/distribute
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 13.91 sec   HDFS Read: 14729 HDFS Write: 196 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 910 msec
OK
Time taken: 32.279 seconds
  • Inspect the output files
[work@bigdatatest02 distribute]$ ls
000000_0  000001_0  000002_0
[work@bigdatatest02 distribute]$ cat 000000_0 
7934MILLER10
7844TURNER30
7654MARTIN30
[work@bigdatatest02 distribute]$ cat 000001_0 
7902FORD20
7839KING10
7521WARD30
[work@bigdatatest02 distribute]$ cat 000002_0 
7782CLARK10
7900JAMES30
7876ADAMS20
7788SCOTT20
7698BLAKE30
7566JONES20
7499ALLEN30
7369SMITH20

4.3.4 cluster by

  • cluster by == distribute by + sort by
  • If distribute by and sort by use the same column, that is equivalent to cluster by on that column
hive> insert overwrite local directory '/tmp/hive/distributeandsort' select * from bigdata.emp distribute by emp_no sort by emp_no;
Query ID = work_20201215194756_b5359ad0-df8b-4523-9ca7-51b9cbbad376
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1608016084001_0008, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0008/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2020-12-15 19:48:12,418 Stage-1 map = 0%,  reduce = 0%
2020-12-15 19:48:21,652 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.39 sec
2020-12-15 19:48:29,858 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 12.81 sec
MapReduce Total cumulative CPU time: 12 seconds 810 msec
Ended Job = job_1608016084001_0008
Moving data to local directory /tmp/hive/distributeandsort
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 12.81 sec   HDFS Read: 14280 HDFS Write: 196 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 810 msec
OK
Time taken: 34.157 seconds
hive> insert overwrite local directory '/tmp/hive/cluster' select * from bigdata.emp cluster by emp_no;
Query ID = work_20201215194933_6479ba1e-d9e9-41b4-991d-a9e599e45410
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 3
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1608016084001_0009, Tracking URL = http://bigdatatest02:8088/proxy/application_1608016084001_0009/
Kill Command = /opt/cloudera/parcels/CDH/lib/hadoop/bin/hadoop job  -kill job_1608016084001_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2020-12-15 19:49:53,537 Stage-1 map = 0%,  reduce = 0%
2020-12-15 19:50:03,407 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.16 sec
2020-12-15 19:50:12,660 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.76 sec
MapReduce Total cumulative CPU time: 15 seconds 760 msec
Ended Job = job_1608016084001_0009
Moving data to local directory /tmp/hive/cluster
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 15.76 sec   HDFS Read: 14220 HDFS Write: 196 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 760 msec
OK
Time taken: 39.878 seconds
  • Compare the two sets of output files
[work@bigdatatest02 hive]$ cd distributeandsort/
[work@bigdatatest02 distributeandsort]$ ls
000000_0  000001_0  000002_0
[work@bigdatatest02 distributeandsort]$ cat 000000_0 
7521WARD30
7566JONES20
7698BLAKE30
7782CLARK10
7788SCOTT20
7839KING10
7902FORD20
[work@bigdatatest02 distributeandsort]$ cat 000001_0 
7369SMITH20
7654MARTIN30
7876ADAMS20
7900JAMES30
[work@bigdatatest02 distributeandsort]$ cat 000002_0 
7499ALLEN30
7844TURNER30
7934MILLER10
[work@bigdatatest02 distributeandsort]$ cd ../cluster/
[work@bigdatatest02 cluster]$ ls
000000_0  000001_0  000002_0
[work@bigdatatest02 cluster]$ cat 000000_0 
7521WARD30
7566JONES20
7698BLAKE30
7782CLARK10
7788SCOTT20
7839KING10
7902FORD20
[work@bigdatatest02 cluster]$ cat 000001_0 
7369SMITH20
7654MARTIN30
7876ADAMS20
7900JAMES30
[work@bigdatatest02 cluster]$ cat 000002_0 
7499ALLEN30
7844TURNER30
7934MILLER10
  • The two sets of output files are exactly the same

 
