I. EXPORT / IMPORT: exporting and importing tables
---------------------------------------------------
$hive> EXPORT TABLE p1 TO '/user/ubuntu/data/ex';      // exports to HDFS; the target directory must be empty
$hive> IMPORT FROM '/user/ubuntu/data/ex';             // imports under the table name recorded at export time, so it fails if that table already exists

II. ORDER BY / SORT BY / DISTRIBUTE BY / CLUSTER BY -- global sort / map-side sort / partitioning / clustering
---------------------------------------------------------------
1. ORDER BY (ASC|DESC) -- global sort; uses a single reducer
    $hive> SELECT name FROM customer ORDER BY name DESC;
2. SORT BY (ASC|DESC) -- sorts the map output before the data reaches the reducer, i.e. a local, per-reducer sort; with a single reducer it is equivalent to ORDER BY
    $hive> SELECT name FROM customer SORT BY name DESC;
3. DISTRIBUTE BY -- sends rows with equal column values to the same reducer; must appear before SORT BY; used on its own it guarantees no ordering, and the column distributed on must appear in the SELECT list
    $hive> SELECT cid FROM orders DISTRIBUTE BY id;    // error: id is not in the SELECT list
4. DISTRIBUTE BY + SORT BY
    $hive> SELECT name, id FROM customer DISTRIBUTE BY id SORT BY name;
5. CLUSTER BY = DISTRIBUTE BY + SORT BY on the same column
    $hive> SELECT id FROM customer DISTRIBUTE BY id SORT BY id;    // both clauses use id, so CLUSTER BY works too
    $hive> SELECT id FROM customer CLUSTER BY id;

III. Job parameters
-----------------------------------------------------------------
$hive> set hive.exec.reducers.bytes.per.reducer=xxx    // bytes handled per reduce task
$hive> set hive.exec.reducers.max=0                    // maximum number of reduce tasks
$hive> set mapreduce.job.reduces=0                     // number of reduce tasks

IV. Functions (hive supports Tab completion for function names)
--------------------------------------------------------------------
$hive> select concat('100','tom');    // string concatenation
...
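To illustrate how DISTRIBUTE BY routes rows to reducers while SORT BY orders rows only within each reducer, here is a minimal Python sketch. The two-reducer setup, the sample rows, and the modulo hash partitioner are simplifying assumptions, not Hive's actual implementation:

```python
# Simulate: SELECT name, id FROM customer DISTRIBUTE BY id SORT BY name
# Rows are (id, name) pairs. Rows with the same id always land in the
# same partition, and each partition is sorted independently, so there
# is no global order across partitions -- exactly the SORT BY contract.
NUM_REDUCERS = 2  # assumed reducer count for the sketch

rows = [(3, "tom"), (1, "ann"), (3, "bob"), (2, "joe"), (1, "zed")]

# DISTRIBUTE BY id: route each row to a reducer by hashing the id.
partitions = [[] for _ in range(NUM_REDUCERS)]
for row in rows:
    partitions[row[0] % NUM_REDUCERS].append(row)

# SORT BY name: sort within each reducer only.
for part in partitions:
    part.sort(key=lambda r: r[1])

for i, part in enumerate(partitions):
    print(i, part)
# partition 0: [(2, 'joe')]
# partition 1: [(1, 'ann'), (3, 'bob'), (3, 'tom'), (1, 'zed')]
```

Note that reading the two partitions back to back does not yield a globally sorted result; that is why ORDER BY (one reducer) is still needed for a total order.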
V. Aggregate queries
---------------------------------------------------------------------
// query table orders: group by cid, drop groups with 1 record or fewer, and show cid, count, max_price
$hive> select cid, count(*) c, max(price) from orders group by cid having c > 1;

VI. WordCount in Hive
-----------------------------------------------------------------
1. Create a table doc
    $hive> CREATE TABLE doc(line string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ';
2. Load the document into the table
    $hive> LOAD DATA LOCAL INPATH '/mnt/hgfs/share/wc' OVERWRITE INTO TABLE doc;
    +------------------+--+
    |     doc.line     |
    +------------------+--+
    | hello tom ad     |
    | ads hello tom    |
    | hello ads        |
    | tom ads ads ads  |
    +------------------+--+
3. Use the split function to cut each line on spaces into an array
    $hive> select split(line,' ') from doc;
    +----------------------------+--+
    |             c0             |
    +----------------------------+--+
    | ["hello","tom","ad"]       |
    | ["ads","hello","tom"]      |
    | ["hello","ads"]            |
    | ["tom","ads","ads","ads"]  |
    +----------------------------+--+
4. Use the explode function to blow the arrays apart into one record per element
    $hive> select explode(split(line,' ')) from doc;
    +--------+--+
    |  col   |
    +--------+--+
    | hello  |
    | tom    |
    | ad     |
    | ads    |
    | hello  |
    | tom    |
    | hello  |
    | ads    |
    | tom    |
    | ads    |
    | ads    |
    | ads    |
    +--------+--+
5. Build up the word count
    $hive> select explode(split(line,' ')) as word from doc;
    $hive> select t.word from ((select explode(split(line,' ')) as word from doc) as t);
    $hive> select t.word, count(*) from ((select explode(split(line,' ')) as word from doc) as t) group by t.word;
    $hive> select t.word, count(*) from ((select explode(split(line,' ')) as word from doc where word = 'tom') as t) group by t.word;
    $hive> select t.word, count(*) c from ((select explode(split(line,' ')) as word from doc) as t) group by t.word order by c desc limit 2;
    $hive> select t.word, count(*) c from ((select explode(split(line,' ')) as word from doc) as t) where t.word != 'tom' group by t.word order by c desc limit 2;
    +---------+----+--+
    | t.word  | c  |
    +---------+----+--+
    | ads     | 5  |
    | hello   | 3  |
    +---------+----+--+
6. Store the result in a table
    $hive> create table wcresult as select t.word, count(*) c from ((select explode(split(line,' ')) as word from doc) as t) where t.word != 'tom' group by t.word order by c desc limit 2;
    +----------------+-------------+--+
    | wcresult.word  | wcresult.c  |
    +----------------+-------------+--+
    | ads            | 5           |
    | hello          | 3           |
    +----------------+-------------+--+

VII. Transactions
---------------------------------------------------------------------
1. Requirements
    a. Hive version > 0.13
    b. the table must be a bucketed table
    c. only the ORC format is supported
    d. all transactions are auto-commit
    e. the parameters below must be set
    f. the table must be created with tblproperties('transactional'='true')
2. Hive parameters, set in [hive-site.xml]
    SET hive.support.concurrency = true;
    SET hive.enforce.bucketing = true;                    // enforce bucketing
    SET hive.exec.dynamic.partition.mode = nonstrict;     // dynamic partitioning
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;    // transaction manager
    SET hive.compactor.initiator.on = true;
    SET hive.compactor.worker.threads = 1;
3. Create the table
    create table tx(id int, name string, age int) clustered by (id) into 3 buckets row format delimited fields terminated by ',' stored as orc tblproperties('transactional'='true');
4. Transactional insert / update / delete / select
    $hive> insert into tx (id,name,age) values(1,'tom',11);
    $hive> update tx set name = 'tom111' where id = 1;
    $hive> delete from tx where id = 1;
    $hive> select * from tx;

VIII. Views (virtual tables; they exist only logically)
---------------------------------------------------------------------
0. When a long query is run frequently, its result can be stored as a view, a virtual table; later the same result is obtained by simply querying the view
1. Create a view
    a. Syntax
        CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)] [COMMENT table_comment] AS SELECT ...
    b. Example: store a left outer join as a view, so that select * from v1 reproduces the join
        $hive> create view v1 as select a.id aid, a.name, b.id bid, b.orderno from customer a left outer join orders b on a.id = b.cid;
        $hive> select * from v1;
2. Drop a view
    DROP VIEW [IF EXISTS] view_name
3. Alter a view
    ALTER VIEW view_name

IX. Map join, a special join
-------------------------------------------------------------
1. The join happens entirely on the map side, with no reducer: the small table is loaded into memory and joined against the large table there
2. $hive> set hive.auto.convert.join = true;    // auto-convert to map join; enabled by default
3.
The join hint /*+mapjoin(customer)*/ forces a map join -- customer is the small table to load into memory
    $hive> select /*+mapjoin(customer)*/ a.id aid, a.name, b.id bid, b.orderno from customer a left outer join orders b on a.id = b.cid;

X. Hive tuning
-------------------------------------------------------------------
1. explain: show the query plan -- a detailed breakdown of the statement, used for analysis
    a. $hive> explain select count(*) from customer;
+------------------------------------------------------------------------------------------------------+--+
|                                               Explain                                                |
+------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                  |
|   Stage-1 is a root stage                                                                            |
|   Stage-0 depends on stages: Stage-1                                                                 |
|                                                                                                      |
| STAGE PLANS:                                                                                         |
|   Stage: Stage-1                                                                                     |
|     Map Reduce                                                                                       |
|       Map Operator Tree:                                                                             |
|           TableScan                                                                                  |
|             alias: customer                                                                          |
|             Statistics: Num rows: 1 Data size: 88 Basic stats: COMPLETE Column stats: COMPLETE       |
|             Select Operator                                                                          |
|               Statistics: Num rows: 1 Data size: 88 Basic stats: COMPLETE Column stats: COMPLETE     |
|               Group By Operator                                                                      |
|                 aggregations: count()                                                                |
|                 mode: hash                                                                           |
|                 outputColumnNames: _col0                                                             |
|                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE    |
|                 Reduce Output Operator                                                               |
|                   sort order:                                                                        |
|                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE  |
|                   value expressions: _col0 (type: bigint)                                            |
|       Reduce Operator Tree:                                                                          |
|         Group By Operator                                                                            |
|           aggregations: count(VALUE._col0)                                                           |
|           mode: mergepartial                                                                         |
|           outputColumnNames: _col0                                                                   |
|           Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE          |
|           File Output Operator                                                                       |
|             compressed: false                                                                        |
|             Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE        |
|             table:                                                                                   |
|                 input format: org.apache.hadoop.mapred.SequenceFileInputFormat                       |
|                 output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat             |
|                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                            |
|                                                                                                      |
|   Stage: Stage-0                                                                                     |
|     Fetch Operator                                                                                   |
|       limit: -1                                                                                      |
|       Processor Tree:                                                                                |
|         ListSink                                                                                     |
|                                                                                                      |
+------------------------------------------------------------------------------------------------------+--+
    b. explain extended shows even more detail
2. LIMIT tuning
    a. Avoids running a full-table query before emitting the LIMITed rows; LIMIT tuning instead works on a sampled subset of the data
    b. Enabled via a property
        $hive> set hive.limit.optimize.enable = true;
        <property>
            <name>hive.limit.optimize.enable</name>
            <value>true</value>
            <description>Whether to enable to optimization to try a smaller subset of data for simple LIMIT first.</description>
        </property>
    c. Sampling controls
        <property>
            <name>hive.limit.row.max.size</name>
            <value>100000</value>
            <description>When trying a smaller subset of data for simple LIMIT, how much size we need to guarantee each row to have at least.</description>
        </property>
        <property>
            <name>hive.limit.optimize.limit.file</name>
            <value>10</value>
            <description>When trying a smaller subset of data for simple LIMIT, maximum number of files we can sample.</description>
        </property>
3. Local mode (for development, debugging, and testing)
    a. For jobs over small amounts of data, run the MR job on the local machine only
    b. $hive> set mapred.job.tracker=local;
    c. Hive can also be left to apply this automatically (personally tested, works well):
        $hive> set hive.exec.mode.local.auto = true;
        [hive-site.xml]
        <property>
            <name>hive.exec.mode.local.auto</name>
            <value>true</value>
            <description>Let hive determine whether to run in local mode automatically</description>
        </property>
4. Parallel execution
    a. By default Hive runs its MR stages serially, one after another, yet some stages have no dependency between them and could run concurrently
    b. Enable parallel execution of independent stages:
        $hive> set hive.exec.parallel = true;
5. Strict mode: guards against queries over huge amounts of data
    a. Strict mode is a Hive setting that blocks queries likely to have unintended, undesirable effects
    b. $hive> set hive.mapred.mode = strict;
    c. Note that under strict mode:
        1) queries on partitioned tables must specify a partition (disabled: select * from t1) (allowed: select * from t1 where id = 3, where id is the partition column)
        2) order by must be combined with a limit clause (disabled: select * from t1 order by id) (allowed: select * from t1 order by id limit 2)
        3) Cartesian-product queries are rejected (disabled: select a.*, b.* from customer a, orders b) (allowed: select a.*, b.* from customer a, orders b where a.id = b.cid)
6. Number of reducers
    a. Set the number of bytes handled per reducer; the default is 1 GB
        $hive> set hive.exec.reducers.bytes.per.reducer = 750000000;
7. JVM reuse
    a. Lets a JVM be reused across tasks; the drawback is that with an unbalanced job the reserved slots can sit idle, since they cannot be reclaimed until the whole job finishes
    b. Best suited to jobs with many small files and many small MR tasks
    c. Hadoop's mapred-site.xml (in $HADOOP_HOME/conf):
        <property>
            <name>mapreduce.job.jvm.numtasks</name>
            <value>10</value>
            <description>How many tasks to run per jvm. If set to -1, there is no limit. JVM Reuse</description>
        </property>
    d. $hive> set mapreduce.job.jvm.numtasks = 10;    // default is 1; -1 means no limit
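Outside Hive, the core idea of the map join described above can be sketched in a few lines of Python: the small table is loaded into an in-memory hash map, and the large table is streamed against it, so no shuffle or reduce phase is needed. The customer and order rows below are made-up sample data, and the hash map plus single streaming pass is a simplification of Hive's actual map-join implementation:

```python
# Map-side (hash) join sketch: build a hash map from the small table,
# then stream the large table and probe the map. The join completes
# entirely "map side", with no shuffle or reduce step.
customers = [(1, "tom"), (2, "ann")]               # small table: (id, name)
orders = [(101, 1), (102, 1), (103, 2), (104, 9)]  # large table: (orderno, cid)

# Load the small table into memory, keyed by the join column.
name_by_id = {cid: name for cid, name in customers}

# Outer-join the streamed orders against the in-memory customers:
# unmatched cids (here 9) get None, mirroring an outer join's NULL.
joined = [(orderno, cid, name_by_id.get(cid)) for orderno, cid in orders]
print(joined)
# [(101, 1, 'tom'), (102, 1, 'tom'), (103, 2, 'ann'), (104, 9, None)]
```

The sketch also shows why the technique only works when one side is small: the whole `customers` table must fit in each mapper's memory.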
Big Data with Hive (III): export/import, sort and cluster queries, WordCount in Hive, transactions, views, map join, Hive tuning