day15Hive

lhh123lhh123

Tags: data warehouse, hive, big data

Article link: https://blog.csdn.net/lhhaini/article/details/122679582

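I. Hive

Data warehouse: stores data from many different sources so that it can be analyzed. Getting data into the warehouse requires collection tools.
Characteristics of a data warehouse: subject-oriented (data is extracted by subject through ETL); integrated (many data sources, which are extracted, cleaned, and transformed); non-volatile (historical data may not be modified within its retention period); time-variant (updated periodically: monthly, quarterly, yearly).
Databases target online transaction processing (OLTP); data warehouses target online analytical processing (OLAP).
Data warehouse layers: source data -> data warehouse -> data application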

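Source data layer (ODS): raw data, fairly messy.
Data warehouse layer (DW): data is not modified; it is consistent, accurate, clean data obtained by cleaning the source data.
Data application layer (DA): departmental or subject-specific data.
ETL: Extract, Transform, Load.
Hive: a Hadoop-based framework that maps structured data files to database tables and provides SQL querying. Under the hood, queries are translated into MapReduce jobs.
Hive characteristics: not suited to real-time computation; suited to offline analysis. It also supports Spark as a distributed execution engine.
Hive architecture:

Hive is well suited to batch statistics over large data sets.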
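
As a rough illustration of what "batch statistics" means here, the query below aggregates a whole table in one pass. It is only a sketch: it assumes the partitioned table score (sid, cid, score, partitioned by month) that is created later in these notes, and the column aliases are made up for the example.

```sql
-- Hypothetical batch report: row count and average score per month.
-- Assumes the table score(sid int, cid int, score int) partitioned by (month string)
-- from the partitioned-table part of these notes.
-- Hive compiles this into one or more distributed jobs (MapReduce by default).
select month,
       count(*)   as rows_in_month,
       avg(score) as avg_score
from score
group by month;
```

A query like this may take seconds to minutes even on small data, because job startup dominates; that is why Hive fits offline analysis rather than interactive lookups.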
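Hive installation modes
9.1 Embedded mode: uses the embedded Derby database to store metadata, so no separate metastore service is needed. The services are embedded in the main HiveServer process, but only one client can connect at a time. Mainly used for experiments.
9.2 Local mode: metadata is stored in an external database such as MySQL, while the metastore still runs in the same process as Hive.
9.3 Remote mode: the metastore service is started separately, and every client is configured (in its configuration file) to connect to it. The metastore and Hive run in different processes. Remote mode is recommended for production.
Installing Hive: install MySQL beforehand.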

vim hive-site.xml

Upload mysql-connector-java-5.1.41-bin.jar to the lib directory, then run:

cp /export/server/hive-2.1.0/jdbc/hive-jdbc-2.1.0-standalone.jar /export/server/hive-2.1.0/lib/

Configure the Hive environment variables:

export HIVE_HOME=/export/server/hive-2.1.0
export PATH=$HIVE_HOME/bin:$PATH

Initialize the metadata:

cd /export/server/hive-2.1.0/

bin/schematool -dbType mysql -initSchema

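Using Hive
11.1 Run bin/hive (or just hive) to enter the Hive shell; type quit; to exit.

11.2 hive -e "SQL statement" or hive -f script (a script file containing SQL)
11.3 Beeline Client (the second-generation client):
First edit Hadoop's configuration file: vim core-site.xml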

<property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
</property>

Distribute the file to the other nodes:

scp core-site.xml node2:$PWD

Start the metastore, then HiveServer2:

nohup /export/server/hive-2.1.0/bin/hive --service metastore &
nohup /export/server/hive-2.1.0/bin/hive --service hiveserver2 &

beeline
 !connect jdbc:hive2://node3:10000

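Troubleshooting Hive errors:
Check that MySQL is running and that you can log in.
Check that Hadoop is running.
Check that the metastore and hiveserver2 are running.
One-step entry into Hive:
Write an expect script that drives Beeline.
Install expect:
yum -y install expect
Write the following into the script (beeline.exp):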

#!/bin/expect
spawn beeline 
set timeout 5
expect "beeline>"
send "!connect jdbc:hive2://node3:10000\r"
expect "Enter username for jdbc:hive2://node3:10000:"
send "root\r"
expect "Enter password for jdbc:hive2://node3:10000:"
send "123456\r"
interact

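One-step MySQL login: write a similar expect script named mysql.exp, then make the scripts executable and run one with expect:
chmod 777 beeline.exp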
chmod 777 mysql.exp
expect mysql.exp

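Hive operations
14.1 Create a database: create database if not exists myhive;

Show a database's details: desc database db_name;

Drop a database: drop database db_name; (append cascade to force-drop a database that still contains tables)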
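
The database commands above can be tried end to end with a throwaway database. This is a minimal sketch; demo_db and t1 are hypothetical names used only for illustration.

```sql
-- demo_db and t1 are made-up names for this example.
create database if not exists demo_db;
desc database demo_db;            -- shows the name, HDFS location, and owner

use demo_db;
create table if not exists t1(id int);

use default;
-- A plain "drop database demo_db;" is rejected here because the database
-- still contains table t1; cascade drops the tables first.
drop database if exists demo_db cascade;
```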
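14.2 Create tables:
Table types. Managed (internal) table: created without the external keyword. It owns its data exclusively, so dropping a managed table deletes the underlying data as well. Typically used for intermediate result tables in your own analysis; not suitable for sharing.
External table: the underlying data files are shared; dropping an external table only removes the mapping between the table and the files. Typically used for raw data and for transformed data tables.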
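
The difference between the two table types shows up when a table is dropped. The sketch below assumes the myhive database from 14.1; the table names and the HDFS path /tmp/hive_demo/external_demo are hypothetical.

```sql
use myhive;

-- Managed table: Hive owns the data directory.
create table if not exists managed_demo(id int, name string)
    row format delimited fields terminated by '\t';

-- External table: Hive only records where the files live.
-- The location below is an assumed path for this example.
create external table if not exists external_demo(id int, name string)
    row format delimited fields terminated by '\t'
    location '/tmp/hive_demo/external_demo';

desc formatted managed_demo;    -- Table Type: MANAGED_TABLE
desc formatted external_demo;   -- Table Type: EXTERNAL_TABLE

drop table managed_demo;    -- removes the metadata and the table's data directory
drop table external_demo;   -- removes the metadata only; files under the location remain
```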
Specify the field delimiter:

create table if not exists stu3(id int,name string) row format delimited fields terminated by '\t';

Create a new table from a query result (copies both structure and data):

create table stux as select * from stu3;

Copy another table's structure (no data is copied):

create table stu4 like stu3;

Check a table's type:

desc formatted stu3;

Associate existing data with a Hive table by uploading the file into the table's directory on HDFS:

hadoop fs -put data_flow.dat  /user/hive/warehouse/myflow.db/flow

14.3 External table operations:
Create teacher and student external tables and load data into them:

use myhive;
create external table  teach(tid int,tname string) row format delimited
    fields terminated by ' ';
create external table  student(sid int,sname string,sbirth string,ssex string) row format delimited
    fields terminated by ' ';

Load data into the tables from the local filesystem:

load data local inpath '/export/data/hivedata/student.data' into table student;
load data local inpath '/export/data/hivedata/teacher.data' into table  teach;

Overwriting a table's contents: add the overwrite keyword before into in the load statement.
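
For example, reloading the student file from the previous step with overwrite replaces whatever the table held before instead of appending to it (a sketch that reuses the path and table from above):

```sql
-- Replaces the existing contents of student instead of appending.
load data local inpath '/export/data/hivedata/student.data' overwrite into table student;

-- Check the result.
select * from student;
```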
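Loading data into a table from HDFS
Upload the data to HDFS first. Loading from HDFS actually moves the source file into the table's directory.

load data inpath '<path of the file in HDFS>' into table <table name>;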

Sharing data across multiple tables (each table points at the same directory with location):

create external table student1(sid int,sname string,sbirth string,ssex string) row format delimited
    fields terminated by ' ' location '<data directory>';
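
To make the sharing concrete, two external tables can be declared over one directory. This is a sketch under assumptions: the table names student_a and student_b and the HDFS path /hive_demo/student_shared are hypothetical.

```sql
-- Both tables point at the same (assumed) HDFS directory.
create external table if not exists student_a(sid int, sname string, sbirth string, ssex string)
    row format delimited fields terminated by ' '
    location '/hive_demo/student_shared';

create external table if not exists student_b(sid int, sname string, sbirth string, ssex string)
    row format delimited fields terminated by ' '
    location '/hive_demo/student_shared';

-- Any file placed in that directory is visible through both tables,
-- so these two counts should match.
select count(*) from student_a;
select count(*) from student_b;
```

Because both tables are external, dropping either one leaves the files in place for the other.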

Complex types:
15.1 array type:

create external  table  hive_array(
    name string,
    city array<string>
)row format delimited fields terminated by '\t'
collection items terminated by ',';
load data  local inpath '/export/data/hivedata/arr.data' into table hive_array ;

Sample data (name and city list separated by a single tab):

zhangsan	beijing,shanghai,tianjin,hangzhou
wangwu	changchun,chengdu,wuhan,beijin

Array length: size

select  name ,size(city) v from hive_array;

Membership test: array_contains

select  name from hive_array where array_contains(city,"tianjin");
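
Two other common array operations are indexing and flattening. The following is a small sketch against the hive_array table above; the aliases first_city, t, and one_city are made up for the example.

```sql
-- Index into the array (0-based); yields NULL if the index is out of range.
select name, city[0] as first_city from hive_array;

-- Turn each array element into its own row.
select name, one_city
from hive_array
lateral view explode(city) t as one_city;
```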

15.2 map type

create  external  table  hive_map(
    id int ,
    name string,
    membbers map<string,string>,
    age int
)row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
load data local inpath '/export/data/hivedata/map.data' into table hive_map;
select * from hive_map;

You can query by key:

select  name ,age,membbers from hive_map where membbers['father']="xiaoming";

Get all keys: map_keys(membbers)
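
The sketch below shows the key and value functions together. The sample row in the comment is an assumption about what map.data looks like, inferred from the delimiters in the table definition; note that the column is spelled membbers in that definition.

```sql
-- Assumed shape of one line in map.data (fields ',', entries '#', key/value ':'):
--   1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28

select name,
       map_keys(membbers)   as relations,     -- array of keys
       map_values(membbers) as relatives,     -- array of values
       size(membbers)       as relative_count
from hive_map;

-- Look up a single key; yields NULL when the key is absent.
select name, membbers['father'] as father from hive_map;
```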
15.3 struct type, similar to a JavaBean

create external  table  hive_struct(
    ip string,
    info struct<name:string,age:int>
)row format delimited fields terminated by '#'
collection items terminated by ':';
load data local inpath '/export/data/hivedata/struct.data' into table hive_struct;
select  * from hive_struct;
select ip from hive_struct where info.name="zhangsan";
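
Struct fields are read with dot notation and can be used in both the select list and the where clause. The sample row in the comment is an assumed example of the struct.data format implied by the delimiters above.

```sql
-- Assumed shape of one line in struct.data (fields '#', struct members ':'):
--   192.168.1.1#zhangsan:40

select ip,
       info.name as person_name,
       info.age  as person_age
from hive_struct
where info.age > 30;
```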

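Partitioned tables
16.1 Concept: the table's data is split into separate directories by category.
16.2 Purpose: manage data by category and speed up queries, since a query can skip directories it does not need.
16.3 Kinds: managed partitioned tables and external partitioned tables.
Keyword for creating a partitioned table: partitioned by (partition_column string). The column name and value form the directory name, for example month=202202.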

create table score(
    sid int,
    cid int,
    score int
)partitioned by (month string) row format delimited fields terminated by '\t';
-- when loading into a partitioned table, specify which partition (directory) receives the data
load data local inpath '/export/data/hivedata/score.data' into table score
    partition (month='202202');
-- query by month
select * from score where month='202202';

Note: a Hive partition splits data into directories, whereas a MapReduce partition splits data into files.
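
The directory-per-partition layout can be checked from Hive itself. This sketch assumes the score table lives in the myhive database and that the warehouse uses the default /user/hive/warehouse location.

```sql
-- List the partitions Hive knows about; each one is a directory.
show partitions score;

-- Filtering on the partition column lets Hive read only the matching directory.
select sid, cid, score from score where month = '202202';

-- Expected layout on HDFS under the assumptions above:
--   /user/hive/warehouse/myhive.db/score/month=202202/score.data
```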
16.4 Multi-level partitions:

create table score2(
    sid int,
    cid int,
    score int
)partitioned by (year string,month string,day string) row
    format delimited fields terminated by'\t';
-- when loading into a partitioned table, specify which partition (directory) receives the data
load data local inpath '/export/data/hivedata/score.data' into table score2
    partition (year='2022',month='01',day='1');
select * from score2 where year='2022' and month='01';
desc score2;

List all partitions of a table:

show partitions score2;

Adding partitions

Add one:
alter table score add partition (month='20201');

Add several partitions at once:

alter table score add partition (month='20201') partition (month='202012');

Dropping a partition:

alter table score drop partition (month='20201');

-- insert data into a partitioned table
insert into table score partition (month='202201') values (1,1,1);
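
Section 16.3 mentions external partitioned tables without showing one. A minimal sketch might look like the following; the table name score_ext and the HDFS paths are hypothetical.

```sql
-- External partitioned table over an assumed HDFS directory.
create external table if not exists score_ext(
    sid int,
    cid int,
    score int
)partitioned by (month string)
row format delimited fields terminated by '\t'
location '/hive_demo/score_ext';

-- Register a directory that already holds data as a partition.
alter table score_ext add if not exists partition (month='202201')
    location '/hive_demo/score_ext/month=202201';

show partitions score_ext;

-- For an external table, dropping the partition removes only the metadata;
-- the files under month=202201 stay on HDFS.
alter table score_ext drop if exists partition (month='202201');
```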
