Big Data Development Tutorial: Advanced Apache Hive

比屋大数据

Last modified: 2022-06-30 09:54:40


First published: 2022-06-13 10:41:32

Article link: https://blog.csdn.net/qq_42285599/article/details/125255336


Hive Metadata Management

To support features such as schemas and data partitioning, Hive keeps its metadata in a relational database.
By default, Hive is packaged with Derby.
- The default embedded Derby setup is good for evaluation and testing.
- The schema is not shared between users, because each user has their own instance of embedded Derby.
- The metadata is stored in the metastore_db directory, which is created in the directory Hive was started from.
Hive can easily be switched to another SQL database such as MySQL or Oracle.
HCatalog, as part of Hive, exposes Hive metadata to other tools in the ecosystem.
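Note: Hive 3.x does not store its metadata in HBase by default; the metastore still uses a relational database. An HBase-backed metastore was only an experimental option in the Hive 2.x line. For high availability, the usual approach is to run several metastore services on top of a highly available database.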

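Hive Architecture

Note: a word on the frequently encountered terms hiveServer1 and hiveServer2. The early hiveServer (hiveServer1) could not handle concurrent requests from more than one client because of limitations in its Thrift interface. In hive-0.11.0 the hiveServer code was rewritten as hiveServer2, which supports multi-client concurrency and authentication and gives better support to open API clients such as JDBC and ODBC.

There are three main user interfaces: the CLI (command line interface), JDBC, and the Web UI. The CLI is the interface most commonly used during development. HiveServer2 introduces a new command, beeline, which uses SQLLine syntax and is covered in a separate chapter.
metaStore: the repository that describes Hive's metadata structures. It can be stored in different relational databases, and the database settings are viewed and changed through the configuration file.
Driver: the interpreter, compiler, and optimizer take an HQL query through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and later executed by MapReduce.
Hive data is stored in HDFS, and most queries and computations are carried out by MapReduce.
- select * from emp; -> does not run a MapReduce job
- select count(*) from emp; -> runs a MapReduce job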

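Hive Interface – Differences Between the CLI and Beeline

There are two tools: Beeline and the command line interface (CLI).
There are two modes: command-line mode and interactive mode.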

[Table not preserved: a comparison with the columns Purpose, HiveServer2 Beeline, and HiveServer1 CLI]

Getting Familiar with the HDP Hive Environment

# Using the Hive CLI
# hive -e : run the given SQL statement
# hive -f : run the given SQL script

hive -e "show databases"
echo "show databases" > demo.sql && hive -f demo.sql && rm -f demo.sql

# Using Beeline
hive --service hiveserver2    # start the HiveServer2 service
beeline -u jdbc:hive2://hadoop5:10000/db10 -n root    # connect to Hive with Beeline

Hive Interface – Other Environments

Hive Web Interface (As part of Apache Hive)
Hue (Cloudera)
Ambari Hive View (Hortonworks)
Zeppelin – Hive
JDBC/ODBC (ETL tools, business intelligence tools, integrated development environments)

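Data Type – Primitive Types

[Figure not preserved: table of Hive primitive data types]

Note: the types highlighted in purple in the figure are the important ones.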

Data Type – Complex Types

ARRAY: all elements have the same type. It is equivalent to a MAP whose keys are a sequence starting from 0.
MAP: all key-value pairs share the same key type and the same value type.
STRUCT: resembles a table row or record, with named fields.

Hive Metadata Structure (Overview)

[Figure not preserved: overview of the Hive metadata structure]

Hive Database

A database is a collection of tables that serve a similar purpose or belong to the same group.
If no database is selected (with use database_name), the database named default is used.
Hive creates a directory for each database under /user/hive/warehouse, a location that can be changed through the hive.metastore.warehouse.dir property. The default database is the exception: its tables are created directly in that directory.

hive-site.xml
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/data/wapage/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>
Database statements:
create database if not exists myhivebook;
use myhivebook;
show databases;
describe database default; -- more details than 'show', such as location
alter database myhivebook set owner user zs;
-- Cascading drop: removes the database even if it still contains tables
drop database if exists myhivebook cascade;
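
To make the directory rule above concrete, here is a minimal Python sketch of where a database's directory would end up. The function name database_dir is hypothetical, and the sketch assumes the usual Hive convention that a non-default database gets a lower-cased "<name>.db" subdirectory of the warehouse directory; it only models the path, it does not talk to Hive or HDFS.

```python
import posixpath


def database_dir(db_name: str, warehouse_dir: str = "/user/hive/warehouse") -> str:
    """Return the directory Hive would use for a database by default (assumed convention)."""
    if db_name.lower() == "default":
        # Tables of the default database live directly in the warehouse directory.
        return warehouse_dir
    # Assumption: other databases get a lower-cased "<name>.db" subdirectory.
    return posixpath.join(warehouse_dir, db_name.lower() + ".db")


print(database_dir("default"))
print(database_dir("myhivebook"))
# With the hive-site.xml value shown above:
print(database_dir("myhivebook", "/data/wapage/hive/warehouse"))
```

On a real cluster, describe database myhivebook; reports the actual location, which is the value to trust.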

Hive Tables

External Tables
- Data is kept in the HDFS path specified by the LOCATION keyword. Hive does not fully manage the data: dropping the table removes only the metadata and leaves the data in place.
Internal Tables/Managed Tables
- Data is kept in the default path, such as /user/hive/warehouse/employee. Hive fully manages the data: dropping the table removes the metadata and also deletes the data.
The key difference: whether dropping the table also deletes the data.
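
The drop behaviour can be illustrated with a toy model in Python. This is only a sketch: ToyMetastore and its methods are hypothetical names, a local temporary directory stands in for HDFS, and the sample file content is invented. It shows that a drop always removes the metadata entry but deletes the data directory only for a managed table.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """Toy stand-in for the metastore: table name -> (data directory, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, data_dir, external):
        data_dir.mkdir(parents=True, exist_ok=True)
        self.tables[name] = (data_dir, external)

    def drop_table(self, name):
        # The metadata entry is removed for both kinds of table.
        data_dir, external = self.tables.pop(name)
        if not external:
            # Managed table: the data directory is deleted as well.
            shutil.rmtree(data_dir)


root = Path(tempfile.mkdtemp())  # local directory standing in for HDFS
ms = ToyMetastore()
ms.create_table("employee", root / "warehouse" / "employee", external=False)
ms.create_table("employee_external", root / "data" / "employee_data", external=True)
(root / "data" / "employee_data" / "part-0.txt").write_text("Michael|Montreal\n")

ms.drop_table("employee")
ms.drop_table("employee_external")

managed_left = (root / "warehouse" / "employee").exists()
external_left = (root / "data" / "employee_data" / "part-0.txt").exists()
print(managed_left, external_left)
```

After both drops, the managed table's directory is gone while the external table's file is still on disk.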

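Three Common Questions About Hive Tables

What is internal and external tables? 90%
What is key difference between them? 80%
What is best practice to use them? 20%
- Use external tables for raw data and for data supplied by a customer, which must not be modified.
- Use external tables when the data needs to be shared.
- Use internal tables for data cleansing and transformation.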

Basic Hive CREATE TABLE Statement

CREATE EXTERNAL TABLE IF NOT EXISTS employee_external (
    name string,
    work_place ARRAY<string>,
    sex_age STRUCT<sex:string,age:int>,
    skills_score MAP<string,int>,
    depart_title MAP<STRING,ARRAY<STRING>>
)

COMMENT 'This is an external table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
STORED AS TEXTFILE
LOCATION '/user/data/employee_data/';
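
As a rough illustration of how the three delimiters in this statement split a text row, the Python sketch below parses one line into the first four columns. The sample line and the function name parse_employee_line are invented for the example. The nested depart_title column is left out on purpose, because a collection nested inside a map depends on a further delimiter level that the statement does not declare.

```python
def parse_employee_line(line: str) -> dict:
    """Split one text row using '|' for fields, ',' for items and ':' for map keys."""
    fields = line.rstrip("\n").split("|")
    name, work_place, sex_age, skills_score = fields[:4]

    sex, age = sex_age.split(",")  # STRUCT<sex:string,age:int>

    scores = {}
    for pair in skills_score.split(","):  # MAP<string,int>
        key, value = pair.split(":")
        scores[key] = int(value)

    return {
        "name": name,
        "work_place": work_place.split(","),  # ARRAY<string>
        "sex_age": {"sex": sex, "age": int(age)},
        "skills_score": scores,
    }


# Hypothetical sample row, not taken from the original data file.
row = parse_employee_line("Michael|Montreal,Toronto|Male,30|DB:80,Python:75")
print(row)
```

Hive does this parsing itself when it reads the files under LOCATION; the sketch only mirrors the delimiter roles.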

-- Inspect the table structure
desc employee_external;
desc formatted employee_external;
show create table employee_external;

Delimiters

The default delimiters in Hive are as follows:

Field delimiter: Ctrl + A, written ^A (use \001 when creating the table)
Collection item delimiter: Ctrl + B, written ^B (\002)
Map key delimiter: Ctrl + C, written ^C (\003)
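
For the opposite direction, a short Python sketch can show what a row looks like when it is written with the default control-character delimiters. The helper to_hive_text_row and the sample values are hypothetical, and the sketch covers only a string, an array, and a map column.

```python
# Default delimiters: \001 (^A), \002 (^B), \003 (^C)
FIELD, ITEM, KEY = "\x01", "\x02", "\x03"


def to_hive_text_row(name, work_place, skills_score):
    """Serialize a string, an array and a map column using the default delimiters."""
    array_part = ITEM.join(work_place)
    map_part = ITEM.join(f"{k}{KEY}{v}" for k, v in skills_score.items())
    return FIELD.join([name, array_part, map_part])


line = to_hive_text_row("Lucy", ["Vancouver"], {"Sales": 89, "HR": 94})
print(repr(line))  # repr makes the otherwise invisible control characters visible
```

Because these characters rarely occur in ordinary text, they are a safer default than visible separators such as commas.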

Hands-on: Creating Tables and Using Data Types

Create a database called hive_demo
Create an external table called employee over the data file at the HDFS path above, using the schema shown earlier
Query the fields with different types
Practice the table lookup commands

show tables;
show tables '*sam*'; show tables '*sam|lily*' ;
show table extended like 'o*';
desc [formatted|extended] table_name
show create table table_name;
select work_place,work_place[1] from employee_external;
select sex_age.age from employee_external;
select name, skills_score['DB'] from employee_external;
select name, depart_title['Product'][0] from employee_external;
A basic understanding of complex data type operations is sufficient.
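
The access patterns in the queries above can be mirrored in plain Python. The row below is invented sample data, and array_get and map_get are hypothetical helpers. They model that array indexes start at 0, and that Hive returns NULL (None here) instead of raising an error for an index out of range or a missing map key.

```python
row = {
    "name": "Michael",
    "work_place": ["Montreal", "Toronto"],
    "sex_age": {"sex": "Male", "age": 30},
    "skills_score": {"DB": 80},
    "depart_title": {"Product": ["Developer", "Lead"]},
}


def array_get(values, index):
    """ARRAY access: zero-based, None when the array is missing or the index is out of range."""
    if values is None or not 0 <= index < len(values):
        return None
    return values[index]


def map_get(mapping, key):
    """MAP access: None when the map or the key is missing."""
    return None if mapping is None else mapping.get(key)


second_place = array_get(row["work_place"], 1)                       # work_place[1]
age = row["sex_age"]["age"]                                          # sex_age.age
db_score = map_get(row["skills_score"], "DB")                        # skills_score['DB']
first_title = array_get(map_get(row["depart_title"], "Product"), 0)  # depart_title['Product'][0]
missing = array_get(map_get(row["depart_title"], "Sales"), 0)        # key not present

print(second_place, age, db_score, first_title, missing)
```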
