TTS Project
1. Flume's Hive sink
2. Hive's transactional table mechanism
3. The RCFile and ORCFile file formats
4. Operating Hive and MySQL through the Spring Data JPA specification
5. Data warehouse modeling and design
Storing JSON-formatted data in Hive
1. Data preparation
[root@hadoop01 home]# vim jsontest.json
{"location":"(32.121,45.262)","ip":"2.12.1.211","action":"click button","uid":"694666ee-8918-48f8-bb92-0a756a3f1f31","time":"1496999163163"}
{"location":"(32.121,45.262)","ip":"7.12.1.211","action":"click button","uid":"868be529-f59e-4f16-8ea5-08c4612ede9a","time":"1496999164165"}
{"location":"(32.121,45.262)","ip":"2.12.1.211","action":"click button","uid":"49b63380-2a5a-453d-aa5b-57e74cd1a8d7","time":"1496999165170"}
{"location":"(32.121,45.262)","ip":"3.12.1.211","action":"click button","uid":"60fd093a-9f6a-4716-81c6-72a5f07d3b4f","time":"1496999166171"}
{"location":"(32.121,45.262)","ip":"3.12.1.211","action":"click button","uid":"056cf558-56b1-4312-aba1-Ob0eb71ba78e","time":"1496999167172"}
{"location":"(32.121,45.262)","ip":"8.12.1.211","action":"click button","uid":"a2ecd34c-5458-43a2-b851-89eb6b3985ae","time":"1496999168174"}
Log in to Hive
[root@hadoop01 bin]# sh hive
2. Create a Hive table that can handle JSON-formatted data
1) Load the HCatalog jar, which already ships with Hive:
hive> add jar ../hcatalog/share/hcatalog/hive-hcatalog-core-1.2.0.jar;
2) Create a table that can process JSON-formatted data:
hive> create table jsonTest(location string,ip string,action string,uid string,time string) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as textfile;
3) Load the data:
load data local inpath '/home/jsontest.json' into table jsonTest;
hive> select * from jsontest;
OK
(32.121,45.262) 2.12.1.211 click button 694666ee-8918-48f8-bb92-0a756a3f1f31 1496999163163
(32.121,45.262) 7.12.1.211 click button 868be529-f59e-4f16-8ea5-08c4612ede9a 1496999164165
(32.121,45.262) 2.12.1.211 click button 49b63380-2a5a-453d-aa5b-57e74cd1a8d7 1496999165170
(32.121,45.262) 3.12.1.211 click button 60fd093a-9f6a-4716-81c6-72a5f07d3b4f 1496999166171
(32.121,45.262) 3.12.1.211 click button 056cf558-56b1-4312-aba1-0b0eb71ba78e 1496999167172
(32.121,45.262) 8.12.1.211 click button a2ecd34c-5458-43a2-b851-89eb6b3985ae 1496999168174
Time taken: 0.089 seconds, Fetched: 6 row(s)
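Each JSON key is now mapped to an ordinary column by the SerDe, so fields can be queried directly; an illustrative query:
hive> select ip, action from jsontest where ip = '2.12.1.211';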
Prepare the data
[root@hadoop01 home]# vim vip.json
{"student_email":"pq@tedu.cn","student_name":"pq","student_gender":"male","student_phone":"110","student_code":"0","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"jsd1902","center_code":"999","course_id":"jsd","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[{"student_email":"pq@tedu.cn"},{"student_phone":"110"}],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"jsd"}
{"student_email":"mxb@tedu.cn","student_name":"mxb","student_gender":"male","student_phone":"110","student_code":"1","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[{"student_email":"mxb@tedu.cn"},{"student_phone":"110"}],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"jsd","after_course_id":"big"}
{"student_email":"swk@tedu.cn","student_name":"swk","student_gender":"male","student_phone":"119","student_code":"2","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"zbj@qq.cn","student_name":"zbj","student_gender":"female","student_phone":"110","student_code":"3","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"zbj@qq.cn"},{"student_phone":"110"}],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"jsd"}
{"student_email":"swj@163.com","student_name":"swj","student_gender":"female","student_phone":"110","student_code":"4","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"swj@163.com"},{"student_phone":"110"} ],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"jsd"}
{"student_email":"ts@qq.cn","student_name":"ts","student_gender":"female","student_phone":"110","student_code":"5","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"ts@qq.cn"},{"student_phone":"110"} ],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"jsd","after_course_id":"big"}
{"student_email":"blm@souhu.cn","student_name":"blm","student_gender":"female","student_phone":"110","student_code":"6","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"bgj@souhu.cn","student_name":"bgj","student_gender":"female","student_phone":"110","student_code":"7","student_education_background":"x","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"bilg","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"zzj@qq.com","student_name":"zzj","student_gender":"female","student_phone":"110","student_code":"8","student_education_background":"x","student_order_code":"112233","class_code":"123321","series_class_code":"jsd1808","center_code":"999","course_id":"jsd","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"hxg@qq.com","student_name":"hxg","student_gender":"female","student_phone":"110","student_code":"9","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"jsd1808","center_code":"999","course_id":"jsd","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"Ig@qq.cn","student_name":"lg","student_gender":"female","student_phone":"110","student_code":"10","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"Ig@qq.cn"},{"student_phone":"110"}],"before_class_study_mode":"offline","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"big"}
{"student_email":"xxw@qq.cn","student_name":"xxw","student_gender":"female","student_phone":"110","student_code":"11","student_education_background":"x","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"xxw@qq.cn"},{"student_phone":"110"} ],"before_class_study_mode":"offline","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"big"}
2. Create a Hive table that can handle the JSON-formatted data
# Load the HCatalog jar, which already ships with Hive
add jar ../hcatalog/share/hcatalog/hive-hcatalog-core-1.2.0.jar;
# Create a table that can process JSON-formatted data
CREATE TABLE vip(student_email string,student_name string,student_gender string,student_phone string,student_code string,student_education_background string,student_order_code string,class_code string,series_class_code string,center_code string,course_id string,study_mode string,account_state string,create_time string,operation_mode string,modified_fields array<struct<student_email:string,student_name:string,student_gender:string,student_phone:string,education_background:string,student_order_code:string,class_code:string,series_class_code:string,center_code:string,study_mode:string,course_id:string,operation_mode:string,create_time:string,study_type:string>>,before_class_study_mode string,after_class_study_mode string,before_course_id string,after_course_id string)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
Load the data
load data local inpath '/home/vip.json' into table vip;
hive> select * from vip tablesample(1 ROWS);
OK
pq@tedu.cn pq male 110 0 xx 112233 123321 jsd1902 999 jsd vip normal 2019-02-18 update [{"student_email":"pq@tedu.cn","student_name":null,"student_gender":null,"student_phone":null,"education_background":null,"student_order_code":null,"class_code":null,"series_class_code":null,"center_code":null,"study_mode":null,"course_id":null,"operation_mode":null,"create_time":null,"study_type":null},{"student_email":null,"student_name":null,"student_gender":null,"student_phone":"110","education_background":null,"student_order_code":null,"class_code":null,"series_class_code":null,"center_code":null,"study_mode":null,"course_id":null,"operation_mode":null,"create_time":null,"study_type":null}] vip vip big jsd
Time taken: 0.051 seconds, Fetched: 1 row(s)
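Fields inside the nested modified_fields column can be reached with array indexing and struct dot syntax; an illustrative query:
hive> select student_name, modified_fields[0].student_email from vip where operation_mode = 'update';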
Receiving data with Flume and saving it to Hive
1. Set up the Hive and HCatalog dependencies for Flume
The Hive sink depends on HCatalog, which already ships with Hive, so the HCatalog jars need to be copied into Flume.
Go to the HCatalog jar directory, shown below:
[root@hadoop01 hcatalog]# pwd
/home/presoftware/apache-hive-1.2.0-bin/hcatalog/share/hcatalog
[root@hadoop01 hcatalog]# cp ./* /home/presoftware/apache-flume-1.6.0-bin/lib/
2. Initialize the Hive metastore database
By default, Hive creates its metastore database in MySQL automatically, but the metastore tables created that way are incomplete, so the metastore database needs to be initialized manually.
1) Drop the previous Hive metastore database in MySQL and remove Hive's data directories on HDFS
mysql> drop database hive;
Query OK, 35 rows affected (1.24 sec)
Also delete the hive directory and the tmp directory on HDFS.
2) Manually create the Hive metastore database in MySQL:
create database hive character set latin1;
3) Run the following in Hive's bin directory to create the metastore tables (after it takes effect, the hive database contains 53 tables):
./schematool -initSchema -dbType mysql
3. Start Hive's remote metastore service
When Flume's Hive sink operates on Hive, it needs Hive's metadata, so it must connect to Hive's remote metastore service; this service has to be started first.
Go to Hive's bin directory and run:
sh hive --service metastore
This service listens on port 9083 by default and can be reached via thrift://ip:9083.
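Clients can also pick up the metastore address from configuration via the hive.metastore.uris property; a minimal hive-site.xml sketch, assuming the metastore runs on hadoop01:
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop01:9083</value>
</property>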
4. Open the Hive client and re-create the table
[root@hadoop01 bin]# sh hive
hive> create database ttsdb;
hive> use ttsdb;
CREATE TABLE vip(student_email string,student_name string,student_gender string,student_phone string,student_code string,student_education_background string,student_order_code string,class_code string,series_class_code string,center_code string,course_id string,study_mode string,account_state string,create_time string,operation_mode string,modified_fields array<struct<student_email:string,student_name:string,student_gender:string,student_phone:string,education_background:string,student_order_code:string,class_code:string,series_class_code:string,center_code:string,study_mode:string,course_id:string,operation_mode:string,create_time:string,study_type:string>>,before_class_study_mode string,after_class_study_mode string,before_course_id string,after_course_id string)
partitioned by (reporttime string)
clustered by (student_order_code) into 4 buckets
stored as orc
TBLPROPERTIES('transactional'='true');
Notes: 1. The table must be bucketed; 2. it must be stored in ORC format; 3. it must be transactional;
4. with ORC storage the data is compressed and no longer JSON, so do not include this clause:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
5. Configure Flume
a1.sources=r1
a1.channels=c1
a1.sinks=s1
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/home/vipdata
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.sinks.s1.type=hive
a1.sinks.s1.hive.metastore=thrift://hadoop01:9083
a1.sinks.s1.hive.database=ttsdb
a1.sinks.s1.hive.table=vip
a1.sinks.s1.serializer=JSON
a1.sinks.s1.hive.partition=%Y-%m-%d
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
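The Hive sink also accepts a few optional tuning properties; the values below are the defaults listed in the Flume 1.6 user guide (verify them against your version):
a1.sinks.s1.hive.txnsPerBatchAsk=100
a1.sinks.s1.batchSize=15000
a1.sinks.s1.heartBeatInterval=240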
Note: if an error is reported, Flume may be missing Hive's dependency jars; copy the jars from Hive's lib directory into Flume's lib directory.
[root@hadoop01 lib]# pwd
/home/presoftware/apache-hive-1.2.0-bin/lib
[root@hadoop01 lib]# cp ./*.jar /home/presoftware/apache-flume-1.6.0-bin/lib/
Create the spooling directory
[root@hadoop01 home]# mkdir vipdata
Start Flume
[root@hadoop01 data]# ../bin/flume-ng agent -n a1 -c ./ -f ./hive-sink.conf -Dflume.root.logger=INFO,console
Overall workflow
1. Make Flume load the HCatalog jars
2. Make Flume load the Hive jars
3. Configure Flume's Hive sink
4. Create the matching table structure in Hive (partitioned, bucketed, ORC format, transactions enabled)
5. Start Hive's remote metastore service
6. Start Flume and test
# Excel
# Function (random character)
=CHAR(RANDBETWEEN(100,110))
# Concatenate multiple columns
=F21&G21&H21
File formats supported by Hadoop
1. TextFile: plain text format
2. SequenceFile: binary format
3. RCFile
4. ORCFile
5. Parquet
The RCFile file format:
Combines the advantages of row storage and column storage.
Row storage:
Advantage: guarantees that all fields of a record stay in the same block, avoiding cross-node access.
Disadvantage: in production, data is mostly read by column, so row storage pulls in redundant columns and performs poorly.
Column storage:
Advantage: no redundant columns at query time.
Disadvantage: a single record may be scattered across different server nodes, so reads may incur cross-node access; performance is also poor.
This leads to RCFile's design idea:
first split the data row-wise into row groups (4 MB each), then store each row group column by column,
combining the advantages of row storage and column storage.
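In Hive this storage format is chosen in the DDL; a minimal sketch using a hypothetical table:
hive> create table rc_demo(id int, name string) stored as rcfile;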
[Figure: RCFile design idea]
[Figure: RCFile format]
Within each HDFS block, RCFile organizes records in units of row groups; that is, all records stored in one HDFS block are divided into row groups. All row groups of a table have the same size, and one HDFS block holds one or more row groups.
A row group consists of three parts:
1) The first part is a sync marker at the head of the row group, mainly used to separate two consecutive row groups within an HDFS block (128 MB);
2) The second part is the row group's metadata header, which stores information about the row group: the number of records in it, the byte count of each column, and the byte count of each field within a column;
3) The third part is the table data segment, i.e., the actual column-stored data. Within this part, all fields of the same column are stored consecutively: as the format figure shows, all fields of column A are stored first, then all fields of column B, and so on.
In each RCFile row group, the metadata header and the table data segment (with each column compressed independently) are compressed separately. RCFile uses the heavyweight gzip algorithm to obtain a good compression ratio. In addition, because of its lazy decompression strategy, RCFile only decompresses the columns actually used when processing a row group, so the relatively high cost of gzip decompression is reduced.
RCFile's lazy decompression mechanism
select col3 from table where col1 > 1
In the cache: the gzipped metadata is loaded and decompressed;
col1's gzip-compressed data is decompressed to evaluate the predicate;
if no row satisfies the condition,
col3's gzip-compressed data is never decompressed.
select id, name from stu where age > 60
In memory:
Row group 1: id, name, age columns; age is decompressed, some rows match, so the id and name columns are decompressed as well.
Row group 2: id, name, age columns; age is decompressed, no rows match, so the id and name columns are not decompressed.
To compress data further, RCFile uses run-length encoding underneath. For example, a column's data:
1111111232222444445555
(7,1)(1,2)(1,3)(4,2)(5,4)(4,5)
where each pair is (run length, value). When the data contains long runs of repeated values, this encoding saves a lot of storage; if the data has few runs, it can actually cost more than the raw data.
So RCFile optimizes the encoding:
1111111232222444445555
1 ~6 2 3 2 ~3 4 ~4 5 ~3
A value followed by ~n means "repeat the previous value n more times"; values that do not repeat are stored as-is, so no pairs are wasted on non-repeating data.
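A minimal sketch of this optimized run-length encoding in Java (my own illustration of the idea, not RCFile's actual on-disk format):

import java.util.ArrayList;
import java.util.List;

public class RunLengthDemo {
    // Encode a string of symbols: a value is emitted as-is, and a run of
    // n+1 equal values is emitted as the value followed by "~n".
    static List<String> encode(String data) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < data.length()) {
            char v = data.charAt(i);
            int run = 1;
            while (i + run < data.length() && data.charAt(i + run) == v) {
                run++;
            }
            out.add(String.valueOf(v));
            if (run > 1) {
                out.add("~" + (run - 1)); // repeat marker for the extra copies
            }
            i += run;
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [1, ~6, 2, 3, 2, ~3, 4, ~4, 5, ~3]
        System.out.println(encode("1111111232222444445555"));
    }
}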
=========================
ORC File
This file format is essentially an improved version of RCFile.
Advantages of the ORC File format
1) Each task outputs only a single file, which reduces the NameNode's load; with RCFile, each task may output multiple small files;
2) Supports various complex data types, such as datetime, decimal, and the complex types (struct, list, map, and union);
3) Stores some lightweight index data inside the file;
4) Block-mode compression based on data type: a) integer columns use run-length encoding; b) string columns use dictionary encoding;
Note (these steps describe the LZ77-style sliding-window matching; a worked example follows):
1. Set the coding position to the start of the input data stream.
2. Find the longest matching string in the window.
3. Output it in the form "(Pointer, Length) Character", where Pointer points to the matched string in the window, Length is the length of the match, and Character is the first non-matching character in the look-ahead buffer.
4. If the look-ahead buffer is not empty, move the coding position and the window forward by (Length + 1) characters, then return to step 2.
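A small worked example of these steps (my own illustration, with an initially empty window and matches allowed to run into the look-ahead buffer, as in classic LZ77), encoding abcabcabcd:
a -> no match -> (0, 0) a
b -> no match -> (0, 0) b
c -> no match -> (0, 0) c
abcabc -> matches 3 positions back, length 6 -> (3, 6) d
So abcabcabcd encodes as (0,0)a (0,0)b (0,0)c (3,6)d.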
5) Multiple mutually independent RecordReaders can read the same file in parallel;
6) Metadata is stored with Protocol Buffers, so fields can be added and removed.
Configuring Hive transactions
[root@hadoop01 conf]# vim hive-site.xml
<!--开启hive表锁机制-->
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
Hive currently has two main kinds of locks: SHARED (shared lock, S) and EXCLUSIVE (exclusive lock, X).
1) Query operations use the shared lock; a shared lock can be held multiple times, concurrently.
2) Table-modification operations use the exclusive lock, which blocks other query and modification operations.
3) Locks can also be taken on partitions.
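With DbTxnManager enabled, the locks currently held can be inspected from the Hive CLI; illustrative commands:
hive> show locks;
hive> show locks vip;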
Delete data
hive> delete from vip where student_email rlike '(^\\w+@tedu.cn$)|(^\\w+@tarena.com.cn$)';
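UPDATE works on this transactional table in the same way; a hypothetical example (bucketing and partition columns cannot be updated, but other columns such as student_phone can):
hive> update vip set student_phone = '120' where student_code = '2';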