3.1.6 Hive Transaction Operations
Hive has supported transactions since version 0.14. HDFS files can only be created, deleted, or appended to; individual records inside a file cannot be modified in place. Because Hive stores its data as HDFS files, it cannot apply transactional updates directly to file contents and must take a different approach: the HDFS files hold the base data, while delta files record each operation like a log. When Hive data is read, the base files and the delta files are merged to produce the latest version of the data.
In short, Hive transaction support requires that the initial data and the incremental operations be stored separately. The drawback of this scheme is that the delta files stored by Hive grow steadily in number, which burdens the NameNode: the NameNode is designed to manage a small number of large files, and every HDFS directory and file adds to its memory consumption. (The compactor settings below exist for this reason: periodic compaction merges delta files back into base files to limit the growth.)
The prerequisites for enabling Hive transactions are: (1) set the six configuration properties listed below; (2) add the tables required for transaction support to the Hive metastore (whether this step has since been simplified is unclear; in theory it is rather complex); (3) restart the Hive service; (4) create the table with ORC storage; (5) the table must provide ACID[1] semantics, otherwise data correctness cannot be guaranteed during a transaction and the result may well fail to meet the parties' requirements.
The following settings enable transaction support:
SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
With the transaction properties configured, we can look at operations on transactional tables. First create a bucketed transactional table stored as ORC (the Optimized Row Columnar file format); then use the desc formatted command to inspect the resulting table structure. The command and its output are shown below:
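The creation statement itself is not reproduced here; reconstructed from the desc formatted output that follows (two buckets on id, ORC storage, transactional=true, tab-delimited serde parameters), it would have looked roughly like this sketch — the exact original DDL is an assumption:

```sql
-- Sketch of the DDL for the tx table, reconstructed from desc formatted output.
CREATE TABLE tx (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 2 BUCKETS        -- bucketing is required for ACID tables
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS ORC                           -- ACID tables must use the ORC format
TBLPROPERTIES ('transactional' = 'true');
```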
hive> desc formatted tx;
OK
# col_name data_type comment
id int
name string
# Detailed Table Information
Database: default
Owner: laura
CreateTime: Wed Aug 02 18:31:52 PDT 2017
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://hadoop0:8020/user/hive/warehouse/tx
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\"}
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transactional true
transient_lastDdlTime 1501723912
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed: No
Num Buckets: 2
Bucket Columns: [id]
Sort Columns: []
Storage Desc Params:
field.delim \t
line.delim \n
serialization.format \t
Time taken: 0.418 seconds, Fetched: 34 row(s)
All of the transaction settings introduced above must be enabled before data can be inserted into a transactional table. With them in place, insert a few rows:
hive> insert into tx(id,name) values(1,'tom');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = laura_20170802183559_92c64423-63aa-4a82-abbd-6d24039f1ceb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1501679648039_0003, Tracking URL = http://hadoop0:8888/proxy/application_1501679648039_0003/
Kill Command = /home/laura/hadoop-2.7.3/bin/hadoop job -kill job_1501679648039_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 2
2017-08-02 18:36:37,024 Stage-1 map = 0%, reduce = 0%
2017-08-02 18:36:59,492 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.5 sec
2017-08-02 18:37:15,802 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 8.94 sec
2017-08-02 18:37:23,286 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.26 sec
MapReduce Total cumulative CPU time: 13 seconds 260 msec
Ended Job = job_1501679648039_0003
Loading data to table default.tx
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 2 Cumulative CPU: 13.26 sec HDFS Read: 11762
HDFS Write: 931 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 260 msec
OK
Time taken: 88.482 seconds
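INSERT is not the only statement a transactional table accepts: with the configuration above active, row-level UPDATE and DELETE also become available, each writing a new delta file rather than rewriting the base ORC file. A minimal sketch, assuming the tx table above (the specific values are illustrative):

```sql
-- Each statement below records its change in a delta file; readers merge
-- base and delta files, and the compactor later folds them together.
UPDATE tx SET name = 'tommy' WHERE id = 1;
DELETE FROM tx WHERE id = 1;
```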
hive> insert into tx(id,name) values(2,'tomlee');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = laura_20170802183855_89b897ca-a3d8-4be3-aa7e-50da1ae39ce3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1501679648039_0004, Tracking URL = http://hadoop0:8888/proxy/application_1501679648039_0004/
Kill Command = /home/laura/hadoop-2.7.3/bin/hadoop job -kill job_1501679648039_0004
Hadoop job information for Stage-1: number of