Partition schema problem in the MySQL metastore after upgrading Hive to 0.11

After the production cluster was upgraded from Hive 0.10 to Hive 0.11, some ETL jobs started misbehaving. The symptom:

On a partitioned table, after a column's type is changed on the table, data in the old partitions may come back as NULL.
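A minimal repro sketch in HiveQL (the table and column names are made up for illustration; whether NULL actually shows up depends on the particular type change):

CREATE TABLE t (amount INT) PARTITIONED BY (dt STRING);
-- load data into a partition, e.g. dt='20130901'
ALTER TABLE t CHANGE amount amount BIGINT;   -- change the column type on the table
-- partitions written after the ALTER use the new type
SELECT amount FROM t WHERE dt = '20130901'; -- on Hive 0.11 this may now return NULL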

This is in fact a new feature of Hive 0.11: every partition can carry its own schema.

Comparing the metadata in MySQL under Hive 0.10 and Hive 0.11 shows that, after a table's schema is changed, new partitions inherit the table schema in both versions, while old partitions keep their original schema.
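The mismatch is visible directly in the metastore. A sketch of the check, run against the standard metastore tables TBLS, PARTITIONS and SDS (the table name is a placeholder):

SELECT t.TBL_NAME, ts.CD_ID AS table_cd_id,
       p.PART_NAME, ps.CD_ID AS partition_cd_id
FROM TBLS t
JOIN SDS ts ON t.SD_ID = ts.SD_ID
JOIN PARTITIONS p ON p.TBL_ID = t.TBL_ID
JOIN SDS ps ON p.SD_ID = ps.SD_ID
WHERE t.TBL_NAME = 'some_table';
-- partitions created after the schema change share the table's CD_ID;
-- older ones keep a CD_ID pointing at the old column list in COLUMNS_V2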

The description of this Hive 0.11 feature reads:

Rows in partitions are now read using the partition schema and then made to comply with the table schema, instead of being read directly using the table schema.

In other words, under Hive 0.11 each partition reads its columns from its own entries in the COLUMNS_V2 table, whereas Hive 0.10 read the table's columns. This behavior is enabled by default, which is why some jobs produced wrong results after the upgrade to Hive 0.11.

The fix:

Update the SDS table in the Hive metastore in MySQL so that each partition's CD_ID is set to the CD_ID of its table.
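Each generated statement has the following shape (the two ids are placeholders for values taken from the metastore):

UPDATE SDS SET CD_ID = <table_cd_id> WHERE SD_ID = <partition_sd_id>;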

In practice this takes a three-way join, which runs slowly in MySQL, so the metadata is loaded into Greenplum for the computation, and a Python script generates the UPDATE statements. (Because this modifies metastore data, the updates are still applied by hand for now; in addition, a monitor was set up that sends an alert mail whenever an inconsistency appears.)

The Python code is as follows:

import psycopg2
import time, sys

class gpUtil:
    """Thin wrapper around a psycopg2 connection to Greenplum."""
    def __init__(self, host, user, passwd, db):
        self.host = host
        self.user = user
        self.passwd = passwd
        self.db = db

    def db_connect(self):
        self.conn = psycopg2.connect(host=self.host, user=self.user,
                                     password=self.passwd, database=self.db)
        self.cursor = self.conn.cursor()

    def db_fetch_all(self, sql):
        self.cursor.execute(sql)
        try:
            result = self.cursor.fetchall()
        except psycopg2.Error:
            result = "Wrong"
        return result

    def db_close(self):
        self.cursor.close()
        self.conn.close()

def get_list(sql):
    conn = gpUtil('xxxx', 'xxxxx', 'xxxxx', 'xxxxxx')
    conn.db_connect()
    host_list = conn.db_fetch_all(sql)
    conn.db_close()  # release the connection before handing back the rows
    return host_list

def get_item_list():
    """Find every partition whose column descriptor (CD_ID) differs from its table's."""
    tble_list = []
    sql = """
        SELECT TBL_NAME, a.TBL_ID, TABLE_CD_ID, PART_NAME, PART_ID,
               PARTITION_CD_ID, PARTITION_SD_ID
        FROM (SELECT PART_ID, PART_NAME, TBL_ID, CD_ID AS PARTITION_CD_ID,
                     b.SD_ID AS PARTITION_SD_ID
              FROM vipbi.partitions a, vipbi.sds b
              WHERE a.SD_ID = b.SD_ID) a
        JOIN (SELECT TBL_ID, TBL_NAME, CD_ID AS TABLE_CD_ID
              FROM vipbi.tbls a, vipbi.sds b
              WHERE a.SD_ID = b.SD_ID) b
          ON a.TBL_ID = b.TBL_ID AND a.PARTITION_CD_ID <> b.TABLE_CD_ID
    """
    for line in get_list(sql):
        # each row: TBL_NAME, TBL_ID, TABLE_CD_ID, PART_NAME, PART_ID,
        #           PARTITION_CD_ID, PARTITION_SD_ID
        tble_list.append(list(line))
    return tble_list

def change_partition(tble_list):
    """Write an UPDATE script for the mismatched partitions, plus a rollback script."""
    if len(tble_list) == 0:
        print "no need to update"
        sys.exit(1)
    now = time.time()
    change_content = ''
    rollback_content = ''
    change_sql = open("/home/hdfs/bin/hadoop_tools/sql/meta_change" + str(int(now)) + ".sql", 'w+')
    rollback_sql = open("/home/hdfs/bin/hadoop_tools/sql/meta_rollback" + str(int(now)) + ".sql", 'w+')
    for table_item in tble_list:
        table_name, table_id, table_cd_id, part_name, part_id, part_cd_id, part_sd_id = table_item
        # record which table/partition each UPDATE belongs to
        comment = "-- table %s (TBL_ID=%s, CD_ID=%s); partition %s (PART_ID=%s, CD_ID=%s, SD_ID=%s)\n" % \
            (table_name, table_id, table_cd_id, part_name, part_id, part_cd_id, part_sd_id)
        change_content += comment
        rollback_content += comment
        # point the partition's storage descriptor at the table's column descriptor
        change_content += "update sds set CD_ID='%s' where SD_ID='%s';\n" % (table_cd_id, part_sd_id)
        # the rollback restores the partition's original CD_ID
        rollback_content += "update sds set CD_ID='%s' where SD_ID='%s';\n" % (part_cd_id, part_sd_id)
    change_sql.write(change_content)
    rollback_sql.write(rollback_content)
    change_sql.close()
    rollback_sql.close()

if __name__ == '__main__':
    start_time = time.time()
    tble_list = get_item_list()
    change_partition(tble_list)
    stop_time = time.time()
    elapse_time = stop_time - start_time
    print "elapse_time is %s" % (str(elapse_time))
