Partition schema problem in the MySQL metastore after upgrading Hive to 0.11

After the production cluster was upgraded from Hive 0.10 to Hive 0.11, some ETL jobs started misbehaving. The symptom:

On a partitioned table, after a column's type is changed on the table, data in the old partitions may come back as NULL.
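A minimal repro sketch in HiveQL (the table and column names are made up for illustration; whether NULL actually shows up depends on the particular type change):

CREATE TABLE t (amount INT) PARTITIONED BY (dt STRING);
-- load data into a partition, e.g. dt='20130901'
ALTER TABLE t CHANGE amount amount BIGINT;   -- change the column type on the table
-- partitions written after the ALTER use the new type
SELECT amount FROM t WHERE dt = '20130901'; -- on Hive 0.11 this may now return NULL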

This is in fact a new feature of Hive 0.11: every partition can carry its own schema.

Comparing the metadata in MySQL under Hive 0.10 and Hive 0.11 shows that, after a table's schema is changed, new partitions inherit the table schema in both versions, while old partitions keep their original schema.
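The mismatch is visible directly in the metastore. A sketch of the check, run against the standard metastore tables TBLS, PARTITIONS and SDS (the table name is a placeholder):

SELECT t.TBL_NAME, ts.CD_ID AS table_cd_id,
       p.PART_NAME, ps.CD_ID AS partition_cd_id
FROM TBLS t
JOIN SDS ts ON t.SD_ID = ts.SD_ID
JOIN PARTITIONS p ON p.TBL_ID = t.TBL_ID
JOIN SDS ps ON p.SD_ID = ps.SD_ID
WHERE t.TBL_NAME = 'some_table';
-- partitions created after the schema change share the table's CD_ID;
-- older ones keep a CD_ID pointing at the old column list in COLUMNS_V2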

The description of this Hive 0.11 feature reads:

Rows in partitions are now read using the partition schema and then made to comply with the table schema, instead of being read directly using the table schema.

In other words, under Hive 0.11 each partition reads its columns from its own entries in the COLUMNS_V2 table, whereas Hive 0.10 read the table's columns. This behavior is enabled by default, which is why some jobs produced wrong results after the upgrade to Hive 0.11.

The fix:

Update the SDS table in the Hive metastore in MySQL so that each partition's CD_ID is set to the CD_ID of its table.
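Each generated statement has the following shape (the two ids are placeholders for values taken from the metastore):

UPDATE SDS SET CD_ID = <table_cd_id> WHERE SD_ID = <partition_sd_id>;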

In practice this takes a three-way join, which runs slowly in MySQL, so the metadata is loaded into Greenplum for the computation, and a Python script generates the UPDATE statements. (Because this modifies metastore data, the updates are still applied by hand for now; in addition, a monitor was set up that sends an alert mail whenever an inconsistency appears.)

The Python code is as follows:

import psycopg2
import time, sys

class gpUtil:
    """Thin wrapper around a psycopg2 connection to Greenplum."""
    def __init__(self, host, user, passwd, db):
        self.host = host
        self.user = user
        self.passwd = passwd
        self.db = db

    def db_connect(self):
        self.conn = psycopg2.connect(host=self.host, user=self.user,
                                     password=self.passwd, database=self.db)
        self.cursor = self.conn.cursor()

    def db_fetch_all(self, sql):
        self.cursor.execute(sql)
        try:
            result = self.cursor.fetchall()
        except psycopg2.Error:
            result = "Wrong"
        return result

    def db_close(self):
        self.cursor.close()
        self.conn.close()

def get_list(sql):
    conn = gpUtil('xxxx', 'xxxxx', 'xxxxx', 'xxxxxx')
    conn.db_connect()
    host_list = conn.db_fetch_all(sql)
    conn.db_close()  # release the connection before handing back the rows
    return host_list

def get_item_list():
    """Find every partition whose column descriptor (CD_ID) differs from its table's."""
    tble_list = []
    sql = """
        SELECT TBL_NAME, a.TBL_ID, TABLE_CD_ID, PART_NAME, PART_ID,
               PARTITION_CD_ID, PARTITION_SD_ID
        FROM (SELECT PART_ID, PART_NAME, TBL_ID, CD_ID AS PARTITION_CD_ID,
                     b.SD_ID AS PARTITION_SD_ID
              FROM vipbi.partitions a, vipbi.sds b
              WHERE a.SD_ID = b.SD_ID) a
        JOIN (SELECT TBL_ID, TBL_NAME, CD_ID AS TABLE_CD_ID
              FROM vipbi.tbls a, vipbi.sds b
              WHERE a.SD_ID = b.SD_ID) b
          ON a.TBL_ID = b.TBL_ID AND a.PARTITION_CD_ID <> b.TABLE_CD_ID
    """
    for line in get_list(sql):
        # each row: TBL_NAME, TBL_ID, TABLE_CD_ID, PART_NAME, PART_ID,
        #           PARTITION_CD_ID, PARTITION_SD_ID
        tble_list.append(list(line))
    return tble_list

def change_partition(tble_list):
    """Write an UPDATE script for the mismatched partitions, plus a rollback script."""
    if len(tble_list) == 0:
        print "no need to update"
        sys.exit(1)
    now = time.time()
    change_content = ''
    rollback_content = ''
    change_sql = open("/home/hdfs/bin/hadoop_tools/sql/meta_change" + str(int(now)) + ".sql", 'w+')
    rollback_sql = open("/home/hdfs/bin/hadoop_tools/sql/meta_rollback" + str(int(now)) + ".sql", 'w+')
    for table_item in tble_list:
        table_name, table_id, table_cd_id, part_name, part_id, part_cd_id, part_sd_id = table_item
        # record which table/partition each UPDATE belongs to
        comment = "-- table %s (TBL_ID=%s, CD_ID=%s); partition %s (PART_ID=%s, CD_ID=%s, SD_ID=%s)\n" % \
            (table_name, table_id, table_cd_id, part_name, part_id, part_cd_id, part_sd_id)
        change_content += comment
        rollback_content += comment
        # point the partition's storage descriptor at the table's column descriptor
        change_content += "update sds set CD_ID='%s' where SD_ID='%s';\n" % (table_cd_id, part_sd_id)
        # the rollback restores the partition's original CD_ID
        rollback_content += "update sds set CD_ID='%s' where SD_ID='%s';\n" % (part_cd_id, part_sd_id)
    change_sql.write(change_content)
    rollback_sql.write(rollback_content)
    change_sql.close()
    rollback_sql.close()

if __name__ == '__main__':
    start_time = time.time()
    tble_list = get_item_list()
    change_partition(tble_list)
    stop_time = time.time()
    elapse_time = stop_time - start_time
    print "elapse_time is %s" % (str(elapse_time))
