TTS Project
1. Flume's Hive sink
2. Hive's transactional table mechanism
3. The RCFile and ORCFile file formats
4. Operating Hive and MySQL through the Spring Data JPA specification
5. Data warehouse modeling and design
Storing JSON-formatted data in Hive
1. Data preparation
[root@hadoop01 home]# vim jsontest.json
{"location":"(32.121,45.262)","ip":"2.12.1.211","action":"click button","uid":"694666ee-8918-48f8-bb92-0a756a3f1f31","time":"1496999163163"}
{"location":"(32.121,45.262)","ip":"7.12.1.211","action":"click button","uid":"868be529-f59e-4f16-8ea5-08c4612ede9a","time":"1496999164165"}
{"location":"(32.121,45.262)","ip":"2.12.1.211","action":"click button","uid":"49b63380-2a5a-453d-aa5b-57e74cd1a8d7","time":"1496999165170"}
{"location":"(32.121,45.262)","ip":"3.12.1.211","action":"click button","uid":"60fd093a-9f6a-4716-81c6-72a5f07d3b4f","time":"1496999166171"}
{"location":"(32.121,45.262)","ip":"3.12.1.211","action":"click button","uid":"056cf558-56b1-4312-aba1-Ob0eb71ba78e","time":"1496999167172"}
{"location":"(32.121,45.262)","ip":"8.12.1.211","action":"click button","uid":"a2ecd34c-5458-43a2-b851-89eb6b3985ae","time":"1496999168174"}
Log in to Hive
[root@hadoop01 bin]# sh hive
2. Create a Hive table that can handle JSON-formatted data
1) Load the HCatalog jar, which already ships with Hive:
hive> add jar ../hcatalog/share/hcatalog/hive-hcatalog-core-1.2.0.jar;
2) Create a table that can process JSON-formatted data:
hive> create table jsonTest(location string,ip string,action string,uid string,time string) row format serde 'org.apache.hive.hcatalog.data.JsonSerDe' stored as textfile;
3) Load the data:
load data local inpath '/home/jsontest.json' into table jsonTest;
hive> select * from jsontest;
OK
(32.121,45.262) 2.12.1.211 click button 694666ee-8918-48f8-bb92-0a756a3f1f31 1496999163163
(32.121,45.262) 7.12.1.211 click button 868be529-f59e-4f16-8ea5-08c4612ede9a 1496999164165
(32.121,45.262) 2.12.1.211 click button 49b63380-2a5a-453d-aa5b-57e74cd1a8d7 1496999165170
(32.121,45.262) 3.12.1.211 click button 60fd093a-9f6a-4716-81c6-72a5f07d3b4f 1496999166171
(32.121,45.262) 3.12.1.211 click button 056cf558-56b1-4312-aba1-0b0eb71ba78e 1496999167172
(32.121,45.262) 8.12.1.211 click button a2ecd34c-5458-43a2-b851-89eb6b3985ae 1496999168174
Time taken: 0.089 seconds, Fetched: 6 row(s)
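Each JSON key is now mapped to an ordinary column by the SerDe, so fields can be queried directly; an illustrative query:
hive> select ip, action from jsontest where ip = '2.12.1.211';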
Prepare the data
[root@hadoop01 home]# vim vip.json
{"student_email":"pq@tedu.cn","student_name":"pq","student_gender":"male","student_phone":"110","student_code":"0","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"jsd1902","center_code":"999","course_id":"jsd","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[{"student_email":"pq@tedu.cn"},{"student_phone":"110"}],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"jsd"}
{"student_email":"mxb@tedu.cn","student_name":"mxb","student_gender":"male","student_phone":"110","student_code":"1","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[{"student_email":"mxb@tedu.cn"},{"student_phone":"110"}],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"jsd","after_course_id":"big"}
{"student_email":"swk@tedu.cn","student_name":"swk","student_gender":"male","student_phone":"119","student_code":"2","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"zbj@qq.cn","student_name":"zbj","student_gender":"female","student_phone":"110","student_code":"3","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"zbj@qq.cn"},{"student_phone":"110"}],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"jsd"}
{"student_email":"swj@163.com","student_name":"swj","student_gender":"female","student_phone":"110","student_code":"4","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"swj@163.com"},{"student_phone":"110"} ],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"jsd"}
{"student_email":"ts@qq.cn","student_name":"ts","student_gender":"female","student_phone":"110","student_code":"5","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"ts@qq.cn"},{"student_phone":"110"} ],"before_class_study_mode":"vip","after_class_study_mode":"vip","before_course_id":"jsd","after_course_id":"big"}
{"student_email":"blm@souhu.cn","student_name":"blm","student_gender":"female","student_phone":"110","student_code":"6","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"bgj@souhu.cn","student_name":"bgj","student_gender":"female","student_phone":"110","student_code":"7","student_education_background":"x","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"bilg","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"zzj@qq.com","student_name":"zzj","student_gender":"female","student_phone":"110","student_code":"8","student_education_background":"x","student_order_code":"112233","class_code":"123321","series_class_code":"jsd1808","center_code":"999","course_id":"jsd","study_mode":"vip","account_state":"normal","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"hxg@qq.com","student_name":"hxg","student_gender":"female","student_phone":"110","student_code":"9","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"jsd1808","center_code":"999","course_id":"jsd","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"add"}
{"student_email":"Ig@qq.cn","student_name":"lg","student_gender":"female","student_phone":"110","student_code":"10","student_education_background":"xx","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"Ig@qq.cn"},{"student_phone":"110"}],"before_class_study_mode":"offline","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"big"}
{"student_email":"xxw@qq.cn","student_name":"xxw","student_gender":"female","student_phone":"110","student_code":"11","student_education_background":"x","student_order_code":"112233","class_code":"123321","series_class_code":"big1808","center_code":"999","course_id":"big","study_mode":"vip","account_state":"audition","create_time":"2019-02-18","operation_mode":"update","modified_fields":[ {"student_email":"xxw@qq.cn"},{"student_phone":"110"} ],"before_class_study_mode":"offline","after_class_study_mode":"vip","before_course_id":"big","after_course_id":"big"}
2. Create a Hive table that can handle the JSON-formatted data
# Load the HCatalog jar, which already ships with Hive
add jar ../hcatalog/share/hcatalog/hive-hcatalog-core-1.2.0.jar;
# Create a table that can process JSON-formatted data
CREATE TABLE vip(student_email string,student_name string,student_gender string,student_phone string,student_code string,student_education_background string,student_order_code string,class_code string,series_class_code string,center_code string,course_id string,study_mode string,account_state string,create_time string,operation_mode string,modified_fields array<struct<student_email:string,student_name:string,student_gender:string,student_phone:string,education_background:string,student_order_code:string,class_code:string,series_class_code:string,center_code:string,study_mode:string,course_id:string,operation_mode:string,create_time:string,study_type:string>>,before_class_study_mode string,after_class_study_mode string,before_course_id string,after_course_id string)ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
Load the data
load data local inpath '/home/vip.json' into table vip;
hive> select * from vip tablesample(1 ROWS);
OK
pq@tedu.cn pq male 110 0 xx 112233 123321 jsd1902 999 jsd vip normal 2019-02-18 update [{"student_email":"pq@tedu.cn","student_name":null,"student_gender":null,"student_phone":null,"education_background":null,"student_order_code":null,"class_code":null,"series_class_code":null,"center_code":null,"study_mode":null,"course_id":null,"operation_mode":null,"create_time":null,"study_type":null},{"student_email":null,"student_name":null,"student_gender":null,"student_phone":"110","education_background":null,"student_order_code":null,"class_code":null,"series_class_code":null,"center_code":null,"study_mode":null,"course_id":null,"operation_mode":null,"create_time":null,"study_type":null}] vip vip big jsd
Time taken: 0.051 seconds, Fetched: 1 row(s)
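Fields inside the nested modified_fields column can be reached with array indexing and struct dot syntax; an illustrative query:
hive> select student_name, modified_fields[0].student_email from vip where operation_mode = 'update';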
Receiving data with Flume and saving it to Hive
1. Set up the Hive and HCatalog dependencies for Flume
The Hive sink depends on HCatalog, which already ships with Hive, so the HCatalog jars need to be copied into Flume.
Go to the HCatalog jar directory, shown below:
[root@hadoop01 hcatalog]# pwd
/home/presoftware/apache-hive-1.2.0-bin/hcatalog/share/hcatalog
[root@hadoop01 hcatalog]# cp ./* /home/presoftware/apache-flume-1.6.0-bin/lib/
2. Initialize the Hive metastore database
By default, Hive creates its metastore database in MySQL automatically, but the metastore tables created that way are incomplete, so the metastore database needs to be initialized manually.
1) Drop the previous Hive metastore database in MySQL and remove Hive's data directories on HDFS
mysql> drop database hive;
Query OK, 35 rows affected (1.24 sec)
Also delete the hive directory and the tmp directory on HDFS.
2) Manually create the Hive metastore database in MySQL:
create database hive character set latin1;
3) Run the following in Hive's bin directory to create the metastore tables (after it takes effect, the hive database contains 53 tables):
./schematool -initSchema -dbType mysql
3. Start Hive's remote metastore service
When Flume's Hive sink operates on Hive, it needs Hive's metadata, so it must connect to Hive's remote metastore service; this service has to be started first.
Go to Hive's bin directory and run:
sh hive --service metastore
This service listens on port 9083 by default and can be reached via thrift://ip:9083.
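Clients can also pick up the metastore address from configuration via the hive.metastore.uris property; a minimal hive-site.xml sketch, assuming the metastore runs on hadoop01:
<property>
<name>hive.metastore.uris</name>
<value>thrift://hadoop01:9083</value>
</property>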
4. Open the Hive client and re-create the table
[root@hadoop01 bin]# sh hive
hive> create database ttsdb;
hive> use ttsdb;
CREATE TABLE vip(student_email string,student_name string,student_gender string,student_phone string,student_code string,student_education_background string,student_order_code string,class_code string,series_class_code string,center_code string,course_id string,study_mode string,account_state string,create_time string,operation_mode string,modified_fields array<struct<student_email:string,student_name:string,student_gender:string,student_phone:string,education_background:string,student_order_code:string,class_code:string,series_class_code:string,center_code:string,study_mode:string,course_id:string,operation_mode:string,create_time:string,study_type:string>>,before_class_study_mode string,after_class_study_mode string,before_course_id string,after_course_id string)
partitioned by (reporttime string)
clustered by (student_order_code) into 4 buckets
stored as orc
TBLPROPERTIES('transactional'='true');
Notes: 1. The table must be bucketed; 2. it must be stored in ORC format; 3. it must be transactional;
4. with ORC storage the data is compressed and no longer JSON, so do not include this clause:
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
5. Configure Flume
a1.sources=r1
a1.channels=c1
a1.sinks=s1
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/home/vipdata
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp
a1.channels.c1.type=memory
a1.channels.c1.capacity=1000
a1.channels.c1.transactionCapacity=100
a1.sinks.s1.type=hive
a1.sinks.s1.hive.metastore=thrift://hadoop01:9083
a1.sinks.s1.hive.database=ttsdb
a1.sinks.s1.hive.table=vip
a1.sinks.s1.serializer=JSON
a1.sinks.s1.hive.partition=%Y-%m-%d
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1
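The Hive sink also accepts a few optional tuning properties; the values below are the defaults listed in the Flume 1.6 user guide (verify them against your version):
a1.sinks.s1.hive.txnsPerBatchAsk=100
a1.sinks.s1.batchSize=15000
a1.sinks.s1.heartBeatInterval=240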
Note: if an error is reported, Flume may be missing Hive's dependency jars; copy the jars from Hive's lib directory into Flume's lib directory.
[root@hadoop01 lib]# pwd
/home/presoftware/apache-hive-1.2.0-bin/lib
[root@hadoop01 lib]# cp ./*.jar /home/presoftware/apache-flume-1.6.0-bin/lib/
Create the spooling directory
[root@hadoop01 home]# mkdir vipdata
Start Flume
[root@hadoop01 data]# ../bin/flume-ng agent -n a1 -c ./ -f ./hive-sink.conf -Dflume.root.logger=INFO,console
Overall workflow
1. Make Flume load the HCatalog jars
2. Make Flume load the Hive jars
3. Configure Flume's Hive sink
4. Create the matching table structure in Hive (partitioned, bucketed, ORC format, transactions enabled)
5. Start Hive's remote metastore service
6. Start Flume and test
# Excel
# Function (random character)
=CHAR(RANDBETWEEN(100,110))
# Concatenate multiple columns
=F21&G21&H21
File formats supported by Hadoop
1. TextFile: plain text format
2. SequenceFile: binary format
3. RCFile
4. ORCFile
5. Parquet
The RCFile file format:
Combines the advantages of row storage and column storage.
Row storage:
Advantage: guarantees that all fields of a record stay in the same block, avoiding cross-node access.
Disadvantage: in production, data is mostly read by column, so row storage pulls in redundant columns and performs poorly.
Column storage:
Advantage: no redundant columns at query time.
Disadvantage: a single record may be scattered across different server nodes, so reads may incur cross-node access; performance is also poor.
This leads to RCFile's design idea:
first split the data row-wise into row groups (4 MB each), then store each row group column by column,
combining the advantages of row storage and column storage.
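In Hive this storage format is chosen in the DDL; a minimal sketch using a hypothetical table:
hive> create table rc_demo(id int, name string) stored as rcfile;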
[Figure: RCFile design idea]
[Figure: RCFile format]
Within each HDFS block, RCFile organizes records in units of row groups; that is, all records stored in one HDFS block are divided into row groups. All row groups of a table have the same size, and one HDFS block holds one or more row groups.
A row group consists of three parts:
1) The first part is a sync marker at the head of the row group, mainly used to separate two consecutive row groups within an HDFS block (128 MB);
2) The second part is the row group's metadata header, which stores information about the row group: the number of records in it, the byte count of each column, and the byte count of each field within a column;
3) The third part is the table data segment, i.e., the actual column-stored data. Within this part, all fields of the same column are stored consecutively: as the format figure shows, all fields of column A are stored first, then all fields of column B, and so on.
In each RCFile row group, the metadata header and the table data segment (with each column compressed independently) are compressed separately. RCFile uses the heavyweight gzip algorithm to obtain a good compression ratio. In addition, because of its lazy decompression strategy, RCFile only decompresses the columns actually used when processing a row group, so the relatively high cost of gzip decompression is reduced.
RCFile's lazy decompression mechanism
select col3 from table where col1 > 1
In the cache: the gzipped metadata is loaded and decompressed;
col1's gzip-compressed data is decompressed to evaluate the predicate;
if no row satisfies the condition,
col3's gzip-compressed data is never decompressed.
select id, name from stu where age > 60
In memory:
Row group 1: id, name, age columns; age is decompressed, some rows match, so the id and name columns are decompressed as well.
Row group 2: id, name, age columns; age is decompressed, no rows match, so the id and name columns are not decompressed.
To compress data further, RCFile uses run-length encoding underneath. For example, a column's data:
1111111232222444445555
(7,1)(1,2)(1,3)(4,2)(5,4)(4,5)
where each pair is (run length, value). When the data contains long runs of repeated values, this encoding saves a lot of storage; if the data has few runs, it can actually cost more than the raw data.
So RCFile optimizes the encoding:
1111111232222444445555
1 ~6 2 3 2 ~3 4 ~4 5 ~3
A value followed by ~n means "repeat the previous value n more times"; values that do not repeat are stored as-is, so no pairs are wasted on non-repeating data.
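A minimal sketch of this optimized run-length encoding in Java (my own illustration of the idea, not RCFile's actual on-disk format):

import java.util.ArrayList;
import java.util.List;

public class RunLengthDemo {
    // Encode a string of symbols: a value is emitted as-is, and a run of
    // n+1 equal values is emitted as the value followed by "~n".
    static List<String> encode(String data) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < data.length()) {
            char v = data.charAt(i);
            int run = 1;
            while (i + run < data.length() && data.charAt(i + run) == v) {
                run++;
            }
            out.add(String.valueOf(v));
            if (run > 1) {
                out.add("~" + (run - 1)); // repeat marker for the extra copies
            }
            i += run;
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [1, ~6, 2, 3, 2, ~3, 4, ~4, 5, ~3]
        System.out.println(encode("1111111232222444445555"));
    }
}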
=========================
ORC File
This file format is essentially an improved version of RCFile.
Advantages of the ORC File format
1) Each task outputs only a single file, which reduces the NameNode's load; with RCFile, each task may output multiple small files;
2) Supports various complex data types, such as datetime, decimal, and the complex types (struct, list, map, and union);
3) Stores some lightweight index data inside the file;
4) Block-mode compression based on data type: a) integer columns use run-length encoding; b) string columns use dictionary encoding;
Note (these steps describe the LZ77-style sliding-window matching; a worked example follows):
1. Set the coding position to the start of the input data stream.
2. Find the longest matching string in the window.
3. Output it in the form "(Pointer, Length) Character", where Pointer points to the matched string in the window, Length is the length of the match, and Character is the first non-matching character in the look-ahead buffer.
4. If the look-ahead buffer is not empty, move the coding position and the window forward by (Length + 1) characters, then return to step 2.
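A small worked example of these steps (my own illustration, with an initially empty window and matches allowed to run into the look-ahead buffer, as in classic LZ77), encoding abcabcabcd:
a -> no match -> (0, 0) a
b -> no match -> (0, 0) b
c -> no match -> (0, 0) c
abcabc -> matches 3 positions back, length 6 -> (3, 6) d
So abcabcabcd encodes as (0,0)a (0,0)b (0,0)c (3,6)d.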
5) Multiple mutually independent RecordReaders can read the same file in parallel;
6) Metadata is stored with Protocol Buffers, so fields can be added and removed.
Configuring Hive transactions
[root@hadoop01 conf]# vim hive-site.xml
<!--开启hive表锁机制-->
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>
Hive currently has two main kinds of locks: SHARED (shared lock, S) and EXCLUSIVE (exclusive lock, X).
1) Query operations use the shared lock; a shared lock can be held multiple times, concurrently.
2) Table-modification operations use the exclusive lock, which blocks other query and modification operations.
3) Locks can also be taken on partitions.
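With DbTxnManager enabled, the locks currently held can be inspected from the Hive CLI; illustrative commands:
hive> show locks;
hive> show locks vip;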
Delete data
hive> delete from vip where student_email rlike '(^\\w+@tedu.cn$)|(^\\w+@tarena.com.cn$)';
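UPDATE works on this transactional table in the same way; a hypothetical example (bucketing and partition columns cannot be updated, but other columns such as student_phone can):
hive> update vip set student_phone = '120' where student_code = '2';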