Big Data: Hive (Notes, Part 2)

张章章Sam

Article link: https://blog.csdn.net/qq_16103331/article/details/53011591

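Table types
Internal tables (managed tables)
The lifecycle of the data is tied to the table definition: when the table definition is dropped, the data in the table is deleted along with it.
An internal table is identified by the field table_type: managed_table
Use desc extended tblName; to view a table's detailed information.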
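External tables
The lifecycle of the data is independent of the table definition: when the table definition is dropped, the data in the table still exists.
An external table is identified by the field table_type: external_table
create external table t_external(id int);
After running drop table t_external; the data is still present on HDFS,
which makes the data safer.
create external table t_external_1(id int);
Attach data to an external table by reference:
alter table t_external_1 set location "/data/hive-external.txt";
Or specify the external table's data when creating the table:
create external table t_external_2(id int) location "/data/";
Note: the data path specified when creating the table must be a directory.
Benefits of external tables:
They make the data safer, because dropping the table does not delete the data.
The data can be shared.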
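Two additional functional table types
Partitioned tables
/user/hive/warehouse/t
    access.log.2016-10-30
    access.log.2016-10-31
    access.log.2016-11-01
    access.log.2016-11-02
==> To query the data of access.log.2016-11-01, Hive has to load all of the data under the t directory
and then filter out the 2016-11-01 rows with a where condition, which wastes a lot of resources. How can this be solved?
==> Under /user/hive/warehouse/t, create one directory per day and put that day's access.log
into that directory: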
/user/hive/warehouse/t
    /dt=2016-10-30
        access.log.2016-10-30
    /dt=2016-10-31
        access.log.2016-10-31
    /dt=2016-11-01
        access.log.2016-11-01
    /dt=2016-11-02
        access.log.2016-11-02
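The benefit of the per-day directory layout can be illustrated with a small Python sketch. It is only a model of the idea, not Hive's implementation: the `warehouse` dictionary and the `files_to_scan` helper are hypothetical stand-ins for the HDFS directory tree and for the query planner.

```python
# Hypothetical in-memory stand-in for the partitioned directory tree.
warehouse = {
    "dt=2016-10-30": ["access.log.2016-10-30"],
    "dt=2016-10-31": ["access.log.2016-10-31"],
    "dt=2016-11-01": ["access.log.2016-11-01"],
    "dt=2016-11-02": ["access.log.2016-11-02"],
}


def files_to_scan(partitions, dt=None):
    """Return the files a query has to read.

    dt=None models a query without a partition filter (full scan);
    otherwise only the matching partition directory is read.
    """
    if dt is None:
        return [f for files in partitions.values() for f in files]
    return list(partitions.get("dt=" + dt, []))


print(files_to_scan(warehouse))                   # every file is read
print(files_to_scan(warehouse, dt="2016-11-01"))  # only one directory is read
```

With the filter on dt, the other three directories are never opened, which is the saving the partitioned layout is meant to provide.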
create table t_partition(id int comment "student ID", name string comment "student name")
partitioned by(dt date comment "partitioned by date: yyyy-MM-dd")
row format delimited
fields terminated by '\t';
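Load data into the partitioned table (the partition is created automatically):
load data local inpath '/opt/data/hive-data/hive-partition.txt' into table
t_partition partition(dt="2016-10-31");
Query the data of a single partition:
select * from t_partition where dt="2016-10-31";
The partition field can be used just like an ordinary column of the table.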
Partition DDL
Show partitions:
show partitions t_partition;
Add a partition manually:
alter table t_partition add partition(dt="2016-11-03");
Drop a partition manually:
alter table t_partition drop partition(dt="2016-11-03");
(Dropping a partition also deletes that partition's data.)
Multiple partition columns
create table t_partition_1(
id int comment "student ID",
name string comment "student name")
partitioned by(year int comment "enrollment year", class string comment "course")
row format delimited
fields terminated by '\t';
Load data:
load data local inpath '/opt/data/hive-data/hive-partition.txt' into table t_partition_1 partition(year=2016, class="linux");
Query data:
select * from t_partition_1 where year=2015;
select * from t_partition_1 where class="bigdata" and year=2016;
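Bucketed tables
A bucketed table hashes the values of a column and stores the rows in different files according to the hash.
create table t_bucket(id int) clustered by(id) into 3 buckets;
insert into table t_bucket select * from t_external_2 limit 5;
This distributes the data into buckets. Newer versions have removed the hive.enforce.bucketing parameter; bucketing is enforced by default.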
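A rough Python sketch of how rows could be assigned to the 3 buckets follows. It assumes that the hash of an int column is the value itself and that the bucket index is the non-negative hash modulo the bucket count; this is how Hive's bucketing is commonly described, but treat it as an assumption. The names `bucket_for` and `bucketize` are made up for illustration.

```python
NUM_BUCKETS = 3  # matches "into 3 buckets" in the DDL above


def bucket_for(id_value, num_buckets=NUM_BUCKETS):
    """Pick a bucket for an int id.

    Assumption: the hash of an int is the int itself, masked to a
    non-negative 31-bit value before taking the modulus.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets


def bucketize(ids, num_buckets=NUM_BUCKETS):
    """Group ids by bucket index; each bucket models one output file."""
    buckets = {b: [] for b in range(num_buckets)}
    for i in ids:
        buckets[bucket_for(i, num_buckets)].append(i)
    return buckets


print(bucketize([1, 2, 3, 4, 5]))  # {0: [3], 1: [1, 4], 2: [2, 5]}
```

Rows with the same id always land in the same bucket, which is the property that bucketed sampling and bucketed joins rely on.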

    Deletion in Hive is implemented with insert: overwrite the table with only the rows that should be kept.
        insert overwrite table A select id,name from A where id !=2;

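Local mode
Setting hive.exec.mode.local.auto=true; enables local mode, in which MR jobs run locally instead of
on the cluster, which can speed up execution.
Local mode is typically used for testing, to make development and debugging more efficient.
Note: bucketed tables do not support local mode!

A setting made with set is temporary: it is valid only in the current session and lapses as soon as the session ends.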

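Loading and exporting data
[] ==> optional, <> ==> required
Loading
load
load data [local] inpath 'path' [overwrite] into table tblName [partition_spec];
local:
present ==> load the data from the local Linux file system
absent ==> load the data from HDFS
overwrite:
present ==> overwrite the existing data in the table
absent ==> append the new data to the existing data
Loading from another table
insert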

group by w.word;

    row_number: numbering rows within a group
        Query requirement (see hive-demo):
            List the employees of each department, ordered by salary from high to low.
                select
                 e.id, e.name, if(e.sex == 0, 'female', 'male') gender, d.name dept_name,
                 s.salary, row_number() over(partition by e.deptid order by s.salary desc) rank
                from t_dept d
                left join t_employee e on d.id = e.deptid
                left join t_salary s on e.id = s.empid
                where s.salary is not null;
            Building on that, find the two highest salaries in each department.
                Use the result of the query above as an intermediate table and keep only the top two rows of each department:
                select
                   temp.*
                from
                (select
                 e.id, e.name, if(e.sex == 0, 'female', 'male') gender, d.name dept_name,
                 s.salary, row_number() over(partition by e.deptid order by s.salary desc) rank
                from t_dept d
                left join t_employee e on d.id = e.deptid
                left join t_salary s on e.id = s.empid
                where s.salary is not null) temp
                where temp.rank < 3;
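
What row_number() over(partition by ... order by ... desc) computes, and how filtering on it yields a top two per group, can be emulated in a few lines of Python. The sample rows and the helper name `with_row_number` are invented for illustration; they are not taken from hive-demo.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample rows: (department, employee name, salary).
rows = [
    ("dev", "alice", 9000),
    ("dev", "bob", 12000),
    ("dev", "carol", 7000),
    ("ops", "dave", 8000),
    ("ops", "erin", 8500),
]


def with_row_number(rows):
    """Append a 1-based row number per department, highest salary first."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    ranked = []
    for _dept, group in groupby(ordered, key=itemgetter(0)):
        for n, row in enumerate(group, start=1):
            ranked.append(row + (n,))
    return ranked


# Same filter as "where temp.rank < 3": keep the top two per department.
top2 = [r for r in with_row_number(rows) if r[3] < 3]
for r in top2:
    print(r)
```

The numbering restarts at 1 for every department, which is why a single filter on the row number is enough to pick the top N of each group.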

The employee information consists of: emp-id, name, salary, sex (shown as male or female), and department name.

    collect_set
    3°. Rows to columns
        Step 1: join the tables and inspect the result
            select u.id, u.name, a.address from t11_user u join address a on u.id = a.uid;
        Step 2: for each id, concatenate the multiple results into one string
            select u.id, max(u.name) as name, concat_ws(",", collect_set(a.address)) as addr
                from t11_user u join address a
                on u.id = a.uid group by u.id;
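
The effect of group by with collect_set and concat_ws can be sketched in Python as follows. The joined rows and the function name `rows_to_column` are hypothetical. The sketch keeps addresses in first-seen order, whereas the element order produced by collect_set in Hive should not be relied on.

```python
# Hypothetical result of the join in step 1: (user id, name, address).
joined = [
    (1, "tom", "beijing"),
    (1, "tom", "shanghai"),
    (1, "tom", "beijing"),   # duplicate, dropped like collect_set does
    (2, "amy", "shenzhen"),
]


def rows_to_column(joined_rows):
    """Collapse the rows of each user into one row with a joined address list."""
    grouped = {}
    for uid, name, address in joined_rows:
        entry = grouped.setdefault(uid, {"name": name, "addresses": []})
        entry["name"] = max(entry["name"], name)   # mirrors max(u.name)
        if address not in entry["addresses"]:      # set semantics
            entry["addresses"].append(address)
    return [
        (uid, e["name"], ",".join(e["addresses"]))  # mirrors concat_ws(",", ...)
        for uid, e in sorted(grouped.items())
    ]


print(rows_to_column(joined))
```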
    4°. Columns to rows
        Prepare the data
        create table t_user_addr as
            select u.id, max(u.name) as name, concat_ws(",", collect_set(a.address)) as addr
                    from t11_user u join address a
                    on u.id = a.uid group by u.id;
        Then simply use the explode function
        select explode(split(addr, ",")) from t_user_addr;
        To select other fields together with the exploded values, use lateral view
        select id, name, address from t_user_addr lateral view explode(split(addr, ",")) a as address;
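
As a rough Python analogue, lateral view explode(split(...)) pairs every source row with each element of its split string. The sample data and the function name `lateral_view_explode` are assumptions made for this sketch.

```python
# Hypothetical content of the table built in the "prepare the data" step:
# (id, name, comma-separated addresses).
user_addr = [
    (1, "tom", "beijing,shanghai"),
    (2, "amy", "shenzhen"),
]


def lateral_view_explode(rows, sep=","):
    """Emit one output row per address, repeating id and name each time."""
    return [
        (uid, name, address)
        for uid, name, addr in rows
        for address in addr.split(sep)
    ]


for row in lateral_view_explode(user_addr):
    print(row)
```

A bare explode returns only the exploded values; the lateral view is what keeps the other columns of the source row next to each value.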

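User-defined functions
    1°. Write a custom function class that extends Hive's user-defined function class (UDF)
    2°. Implement evaluate() with the actual business logic
    3°. Package it (demo.jar) and put it on the Hive server
    4°. In the Hive client, run add jar /path/demo.jar to add demo.jar to Hive's classpath (temporary)
    5°. In the Hive client, create a temporary function
        create temporary function function_name as 'fully qualified name of the custom class';
    6°. When the function is no longer needed, drop it
        drop temporary function function_name;
    Example: a custom function that determines the zodiac sign and the Chinese zodiac animal for each person's birthday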
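
The notes do not include the body of that function, so the following is only a sketch of the logic such an evaluate() method might contain, written in Python for brevity; the real UDF would be a Java class. The sign boundaries differ by a day between sources, and the Chinese zodiac is simplified to the Gregorian year, ignoring the lunar new year. Both are assumptions, as are the function names.

```python
from datetime import date

ANIMALS = ["Rat", "Ox", "Tiger", "Rabbit", "Dragon", "Snake",
           "Horse", "Goat", "Monkey", "Rooster", "Dog", "Pig"]

# (month, day) on which each sign starts, in calendar order.
# Assumption: commonly quoted boundaries; sources vary by a day.
SIGN_STARTS = [
    ((1, 20), "Aquarius"), ((2, 19), "Pisces"), ((3, 21), "Aries"),
    ((4, 20), "Taurus"), ((5, 21), "Gemini"), ((6, 22), "Cancer"),
    ((7, 23), "Leo"), ((8, 23), "Virgo"), ((9, 23), "Libra"),
    ((10, 24), "Scorpio"), ((11, 23), "Sagittarius"), ((12, 22), "Capricorn"),
]


def zodiac_sign(birthday):
    """Return the western zodiac sign for a birthday."""
    sign = "Capricorn"  # dates before January 20 still belong to Capricorn
    for start, name in SIGN_STARTS:
        if (birthday.month, birthday.day) >= start:
            sign = name  # the last start date not after the birthday wins
    return sign


def chinese_zodiac(birthday):
    """Return the Chinese zodiac animal, simplified to the Gregorian year."""
    return ANIMALS[(birthday.year - 4) % 12]


d = date(2016, 10, 31)
print(zodiac_sign(d), chinese_zodiac(d))  # Scorpio Monkey
```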

    pom.xml
        <properties>
           <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
           <hive-api.version>2.1.0</hive-api.version>
           <hadoop-api.version>2.6.4</hadoop-api.version>
           <hadoop-core.version>1.2.1</hadoop-core.version>
        </properties>

      <dependencies>
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.12</version>
          <scope>test</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>${hadoop-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-mapreduce-client-core</artifactId>
          <version>${hadoop-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>${hadoop-core.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-exec</artifactId>
          <version>${hive-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-serde</artifactId>
          <version>${hive-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-service</artifactId>
          <version>${hive-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-metastore</artifactId>
          <version>${hive-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-common</artifactId>
          <version>${hive-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-cli</artifactId>
          <version>${hive-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-jdbc</artifactId>
          <version>${hive-api.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.thrift</groupId>
          <artifactId>libfb303</artifactId>
          <version>0.9.0</version>
        </dependency>
      </dependencies>

File storage formats in Hive
The default file storage format is given by the hive.default.fileformat property in the hive-site.xml configuration file:

Name: hive.default.fileformat
Value: TextFile

Description: Expects one of [textfile, sequencefile, rcfile, orc].
Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]

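1. TextFile
The default Hive format. The data is not compressed, so both disk usage and parsing overhead are high. It can be combined with Gzip, Bzip2, Snappy and so on (Hive detects the compression automatically and decompresses when the query runs), but a text file compressed with a non-splittable codec such as Gzip cannot be split by Hive,
so its data cannot be processed in parallel.
Enable compressed output before running the job:
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
(the codec can also be the Snappy, Lz4, BZip2 or Default codec)
set io.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
create table t_text(
line string
) stored as textfile;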
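2. SequenceFile
SequenceFile is a binary file format provided by the Hadoop API; it is easy to use, splittable, and compressible.
SequenceFile supports three compression types: NONE, RECORD, and BLOCK. RECORD has a low compression ratio, so BLOCK compression is generally recommended.

    In addition to the settings above:
    set mapred.output.compression.type=BLOCK;
    create table t_sequence(
        line string
    ) stored as sequencefile;
    insert overwrite table t_sequence select * from t_text;
    The data turns out to be compressed substantially.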
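3. RCFile
    RCFile combines row-oriented and column-oriented storage. First, it splits the data into blocks by row, guaranteeing that one record stays within one block, so reading a record never requires reading multiple blocks. Second, within each block the data is stored column by column, which helps compression and fast column access.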
    create table t_rcfile(
        line string
    ) stored as rcfile;
    insert overwrite table t_rcfile select * from t_text;

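Summary:
    textfile: high storage consumption; compressed text cannot be split or merged; lowest query efficiency; data can be stored directly, so loading is fastest
    sequencefile: high storage consumption; compressed files can be split and merged; high query efficiency; must be loaded by converting from text files
    rcfile: lowest storage consumption; highest query efficiency; must be loaded by converting from text files; slowest to load
    Compared with TEXTFILE and SEQUENCEFILE, RCFILE costs more at load time because of its columnar storage, but it offers a better compression ratio and query response. A data warehouse is typically written once and read many times, so overall RCFILE has a clear advantage over the other two formats.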

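One problem came up during the steps above:
    SequenceFile doesn't work with GzipCodec without native-hadoop code!
    Cause:
        The default Hadoop installation package does not support compression; Hadoop has to be compiled first.
        Running hadoop checknative -a shows which native compression libraries are available; in this case none of them was supported.
    Fix:
        Overwrite the original native libraries ($HADOOP_HOME/lib/native) with those from a compiled Hadoop build (preferably after backing up the originals),
        then run the job again.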

Query                           Storage size    Time (s)
select count(1) from t_txt;     104.5 M         36.189
select count(1) from t_seq;     582.6 K         34.92
select count(1) from t_rc;      527.3 K         35.659
select count(1) from t_orc;     4.5 K           36.562
==> each query returns 1000005
