Hive Study Notes (1) -- Introduction to Hive and Basic Concepts

Mr_Wuuuuuuu

Article link: https://blog.csdn.net/wwyzxb/article/details/87346691

Contents

1. Introduction to Hive
2. Basic Concepts of Hive

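1. Introduction to Hive

1.1 What is Hive

Hive is a data warehouse built on top of Hadoop.
Similarities to a traditional data warehouse:
- It is mainly used to access and manage data (as a data warehouse, it stores the various kinds of reported data).
- It also provides a SQL-like query language, and Hive compiles each query into the corresponding MapReduce (MR) jobs.
Differences from a traditional data warehouse:
- It can process very large data sets.
- It scales out and is highly fault tolerant (jobs run on YARN, so adding machines horizontally increases compute capacity).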
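
To see how a query is turned into jobs, Hive's EXPLAIN statement prints the execution plan without running the query. The following is a minimal sketch; the table `page_views` and its columns are hypothetical and exist only for illustration.

```sql
-- Hypothetical table, created only so the example is self-contained
CREATE TABLE IF NOT EXISTS page_views (
    user_id STRING,
    url     STRING
);

-- Print the plan (stage dependencies and per-stage operators)
-- instead of executing the query
EXPLAIN
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

With the MapReduce engine, the plan for a GROUP BY like this one typically contains a map phase that emits the grouping key and a reduce phase that aggregates it.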

1.2 What Hive can do

Traditional data warehouse tasks
- ETL
- Report generation
- Ad-hoc data analysis
Large-scale data analysis
- Batch processing jobs
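
As an illustration of a batch ETL step, the sketch below aggregates one day of raw data into a summary table. The table names (`raw_events`, `daily_event_counts`), the columns, and the date are assumptions made up for this example.

```sql
-- Hypothetical source and target tables, both partitioned by date
CREATE TABLE IF NOT EXISTS raw_events (
    user_id STRING,
    event   STRING
)
PARTITIONED BY (ds STRING);

CREATE TABLE IF NOT EXISTS daily_event_counts (
    event STRING,
    cnt   BIGINT
)
PARTITIONED BY (ds STRING);

-- Rebuild one day's partition of the summary table from the raw data;
-- INSERT OVERWRITE replaces whatever that partition held before
INSERT OVERWRITE TABLE daily_event_counts PARTITION (ds = '2019-02-14')
SELECT event, COUNT(*) AS cnt
FROM raw_events
WHERE ds = '2019-02-14'
  AND event IS NOT NULL
GROUP BY event;
```

Because the statement overwrites a single partition, rerunning the job for the same day is safe and gives the same result.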

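1.3 Typical use cases for Hive

Log analysis
- Computing a website's PV/UV (page views / unique visitors) over a time period
- Multi-dimensional data analysis
- Most internet companies use Hive for log analysis
Other scenarios
- Offline analysis of massive structured data
- Low-cost data analysis (no need to write MR code directly)
Usage in our project team
- Hive serves as our data warehouse. Tracking points in the ROM or in individual apps are triggered by the business side and report data; the data is preprocessed by Flume and then lands in Hive, partitioned by date (ds) and application ID (appid). These tables generally serve as the raw and intermediate tables for ETL, and the data can be queried ad hoc through Presto.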
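
The PV/UV scenario above can be sketched as a single query. The table `access_log`, its columns, the date range, and the appid value are all hypothetical; only the partition columns ds and appid follow the layout described above.

```sql
-- Hypothetical log table partitioned by date and application
CREATE TABLE IF NOT EXISTS access_log (
    user_id STRING,
    url     STRING
)
PARTITIONED BY (ds STRING, appid STRING);

-- Daily PV and UV for one application over one week
SELECT
    ds,
    COUNT(*)                AS pv,  -- page views: every log row counts
    COUNT(DISTINCT user_id) AS uv   -- unique visitors: distinct users only
FROM access_log
WHERE ds BETWEEN '2019-02-01' AND '2019-02-07'
  AND appid = 'demo_app'
GROUP BY ds;
```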

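1.4 What Hive cannot do

Hive is not an OLTP system
- Response times are slow (Presto can be used to read Hive data for faster responses)
- Data cannot be updated in real time
Hive's expressiveness is limited
- Iterative computation is not supported
- Some complex computations are hard to express in SQL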

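2. Basic Concepts of Hive

2.1 Hive's data model

Databases
- Same as databases in a relational database (each corresponds to a directory under the warehouse directory on HDFS)
Tables
- Same as tables in a relational database (each corresponds to a directory under the warehouse directory on HDFS)
Partitions (optional)
- Special columns used to optimize data storage and queries
- Partitioning a table reduces unnecessary full scans of the data
- To avoid producing too many small files, partition only on discrete (low-cardinality) fields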

Partition example: [figure: example partition columns]
Partitions exist as directories: [figure not preserved]

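Notes:
1. Hive creates a directory structure that mirrors the partitions, so a query that specifies partitions can skip every directory outside them;
2. Avoid creating too many partitions, because that produces many small files; partitioning is generally recommended only for discrete data.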
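
A minimal sketch of a partitioned table follows, reusing the ds/appid partitioning mentioned in section 1.3. The table name `app_events`, its columns, and the partition values are assumptions for illustration.

```sql
-- Hypothetical table; ds and appid are partition columns, not stored in the data files
CREATE TABLE IF NOT EXISTS app_events (
    user_id STRING,
    event   STRING
)
PARTITIONED BY (ds STRING, appid STRING);

-- Register one partition. Under the table's directory this appears as
--   .../app_events/ds=2019-02-14/appid=demo_app/
ALTER TABLE app_events
    ADD IF NOT EXISTS PARTITION (ds = '2019-02-14', appid = 'demo_app');

-- List the partitions Hive knows about
SHOW PARTITIONS app_events;

-- Filtering on the partition columns lets Hive read only the matching directory
SELECT COUNT(*)
FROM app_events
WHERE ds = '2019-02-14'
  AND appid = 'demo_app';
```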

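Buckets (optional)
- A special way of organizing the data within a partition
- A field with many distinct values can be split into a number of buckets, declared with CLUSTERED BY together with BUCKETS

Note:
When there are a very large number of user_id values, we cannot partition by user_id, but we can bucket by it: hashing user_id into 96 buckets lets a query that targets a single bucket scan only about 1/96 of the data.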

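Note:
With a table bucketed on user_id, Hive automatically places each row into the bucket that corresponds to its user_id when data is inserted.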
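
The 96-bucket setup described above might look like the sketch below. The table names (`user_actions`, `raw_user_actions`) and columns are hypothetical; the bucket count is taken from the note above.

```sql
-- Hypothetical source table
CREATE TABLE IF NOT EXISTS raw_user_actions (
    user_id BIGINT,
    action  STRING
)
PARTITIONED BY (ds STRING);

-- Bucketed table: rows are assigned to one of 96 buckets by hashing user_id
CREATE TABLE IF NOT EXISTS user_actions (
    user_id BIGINT,
    action  STRING
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (user_id) INTO 96 BUCKETS;

-- On Hive 1.x, inserts honor the bucket definition only with this setting;
-- the property was removed in Hive 2.x, where bucketing is always enforced.
-- SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE user_actions PARTITION (ds = '2019-02-14')
SELECT user_id, action
FROM raw_user_actions
WHERE ds = '2019-02-14';

-- Read a single bucket, roughly 1/96 of the partition
SELECT *
FROM user_actions TABLESAMPLE (BUCKET 1 OUT OF 96 ON user_id)
WHERE ds = '2019-02-14';
```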

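Files
The physical storage units of the actual data; several storage formats are available
- Chosen by the user
  - The default is a text file (TEXTFILE)
  - For text files, specify the field delimiter explicitly (Hive otherwise falls back to its default delimiter, Ctrl-A, '\001')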
```
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
```
- Other supported formats
  - SequenceFile
  - Avro
  - RCFile / ORC / Parquet
  - User-defined (InputFormat and OutputFormat)
- Data compression is supported
  - Bzip2, Gzip
  - LZO
  - Snappy
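
To show where the storage clauses sit in a full statement, here is a sketch with one tab-delimited text table and one ORC table compressed with Snappy. Both table names and their columns are hypothetical.

```sql
-- Text table with an explicit field delimiter
CREATE TABLE IF NOT EXISTS events_text (
    user_id STRING,
    event   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Columnar ORC table; the compression codec is set through a table property
CREATE TABLE IF NOT EXISTS events_orc (
    user_id STRING,
    event   STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Converting from text to ORC is an ordinary insert
INSERT OVERWRITE TABLE events_orc
SELECT user_id, event
FROM events_text;
```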

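2.2 Data types

Primitive data types (for example TINYINT, INT, BIGINT, FLOAT, DOUBLE, BOOLEAN, STRING)
Collection data types (ARRAY, MAP, STRUCT)

Example: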

CREATE TABLE employees (
    name         STRING,
    salary       FLOAT,
    subordinates ARRAY<STRING>,
    deductions   MAP<STRING, FLOAT>,
    address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
) PARTITIONED BY (country STRING, state STRING);

-- Query an array element by index (0-based)
SELECT subordinates[0] FROM employees;
-- Query a map value by key
SELECT deductions['Federal Taxes'] FROM employees;
-- Query a struct field with dot notation
SELECT address.state FROM employees;
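
Beyond element access, Hive's built-in functions `size`, `map_keys`, and `explode` are often used with collection columns. The sketch below assumes the `employees` table defined above already exists; the filter value 'US' is made up for illustration.

```sql
-- Per-employee summary of the collection columns
SELECT
    name,
    size(subordinates)   AS num_subordinates,  -- number of array elements
    map_keys(deductions) AS deduction_names,   -- array of the map's keys
    address.city         AS city
FROM employees
WHERE country = 'US';

-- Turn each array element into its own row
SELECT name, subordinate
FROM employees
LATERAL VIEW explode(subordinates) s AS subordinate;
```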

2.3 Clients and commands

Hive CLI
Hive Beeline (recommended)

Beeline's command-line options are as follows:

Usage: java org.apache.hive.cli.beeline.BeeLine 
   -u <database url>               the JDBC URL to connect to
   -n <username>                   the username to connect as
   -p <password>                   the password to connect as
   -d <driver class>               the driver class to use
   -i <init file>                  script file for initialization
   -e <query>                      query that should be executed
   -f <exec file>                  script file that should be executed
   -w (or) --password-file <password file>  the password file to read password from
   --hiveconf property=value       Use value for given property
   --hivevar name=value            hive variable name and value
                                   This is Hive specific settings in which variables
                                   can be set at session level and referenced in Hive
                                   commands or queries.
   --color=[true/false]            control whether color is used for display
   --showHeader=[true/false]       show column names in query results
   --headerInterval=ROWS;          the interval between which heades are displayed
   --fastConnect=[true/false]      skip building table/column list for tab-completion
   --autoCommit=[true/false]       enable/disable automatic transaction commit
   --verbose=[true/false]          show verbose error messages and debug info
   --showWarnings=[true/false]     display connection warnings
   --showNestedErrs=[true/false]   display nested errors
   --numberFormat=[pattern]        format numbers using DecimalFormat pattern
   --force=[true/false]            continue running script even after errors
   --maxWidth=MAXWIDTH             the maximum width of the terminal
   --maxColumnWidth=MAXCOLWIDTH    the maximum width to use when displaying columns
   --silent=[true/false]           be more silent
   --autosave=[true/false]         automatically save preferences
   --outputformat=[table/vertical/csv2/tsv2/dsv/csv/tsv]  format mode for result display
                                   Note that csv, and tsv are deprecated - use csv2, tsv2 instead
  --truncateTable=[true/false]    truncate table column when it exceeds length
   --delimiterForDSV=DELIMITER     specify the delimiter for delimiter-separated values output format (default: |)
   --isolation=LEVEL               set the transaction isolation level
   --nullemptystring=[true/false]  set to true to get historic behavior of printing null as empty string
   --addlocaldriverjar=DRIVERJARNAME Add driver jar file in the beeline client side
   --addlocaldrivername=DRIVERNAME Add drvier name needs to be supported in the beeline client side
   --help                          display this message
Beeline version 1.2.1.spark2 by Apache Hive

The Beeline command used in our environment is:

/usr/lib/hive-current/bin/beeline -n hadoop -u "jdbc:hive2://emr-header-1:2181,emr-header-2:2181,emr-header-3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"
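
The `-f` and `--hivevar` options listed above can be combined to run a parameterized script. The following is a sketch of such a script; the file name `daily_pv.sql`, the table `app_events`, and the variable name `ds` are hypothetical. It could be run as `beeline -u <database url> -n <username> --hivevar ds=2019-02-14 -f daily_pv.sql`.

```sql
-- daily_pv.sql (hypothetical script file)
-- ${hivevar:ds} is replaced with the value passed via --hivevar ds=...

CREATE TABLE IF NOT EXISTS app_events (
    user_id STRING,
    event   STRING
)
PARTITIONED BY (ds STRING, appid STRING);

-- Page views per application for the requested day
SELECT appid, COUNT(*) AS pv
FROM app_events
WHERE ds = '${hivevar:ds}'
GROUP BY appid;
```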

For more details on Beeline, see:

[1]: Hive official user manual: the new Hive CLI (Beeline CLI)