2022_06 Miscellaneous Notes

Anla Likes Sunshine

Last modified: 2022-06-30 11:12:29

First published: 2022-06-30 11:06:32

Article link: https://blog.csdn.net/AnlaGodness/article/details/125104218

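This post details practical problems encountered in big data processing: abnormal field values, Hive metadata management, delayed reporting, a Flink cluster upgrade, DML on Holo tables, and a shell script error. It works through examples of the Spark engine's limits on data type conversion, what to watch for when deleting Hive metadata, and how to handle exceptions in Flink jobs. It also shares how a special time requirement was handled, including the time calculation and the business-logic check.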

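1. Background: today a test engineer found that field B was 0 in every row. B = A * a fixed value. A does contain rows greater than 0, and A is an aggregate produced by the select logic.
Debugging: I tried a lot of far-fetched ideas: dropping and recreating the table, replacing the fixed-value column with a literal in the expression, even wondering whether multiplication had somehow turned into string repetition. None of it helped, and I should not debug that way again. The only useful step was to pull the select logic out and run it on its own. The data was correct, so the problem had to be in writing the result into the table.
My guess was that the Spark engine handles this differently from the MR engine. After commenting out every Spark-related parameter of the task, the table data was correct, so the issue came from the Spark engine's rules. Falling back to MR every time this happens is not sustainable, though, so I kept looking.
The standalone select returned B as bigint. Could writing a bigint value into an int column be the problem? In principle it should not be, but that was the cause. To confirm it, I put the select logic after create table test.test_0602 as and ran it once; the column type was bigint. Then I converted the bigint B with cast(B as int), dropped the table, and wrote the data again: B was 0 in every row. So in this setup the Spark engine does not support writing bigint data into an int column; the value becomes 0.
The fix is to convert A to int, so that B is also int. After that change the data written to the table was correct.
PS. My guess is that under the Spark engine every aggregated value becomes bigint.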

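2. Hive metadata is stored in MySQL. When the partition metadata of a table is broken, delete the related records from the metastore tables. First check whether other tables use a column of the table you are cleaning as a foreign key; if so, delete the records in those referencing tables first, and only then delete the records in the referenced table. Navicat makes it easy to see which tables hold the foreign keys. For the case in the screenshot, the steps are: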

1. Look up the table's TBL_ID (A is the table name)
SELECT * FROM TBLS WHERE TBL_NAME='A';
2. Look up the partition's PART_ID (B is the TBL_ID found in step 1)
select * from PARTITIONS where tbl_id='B' order by PART_ID desc limit 5;
3. Check the tables that reference the partition by foreign key (C is the PART_ID found in step 2)
select * from partition_params where part_id = C;
select * from PARTITION_KEY_VALS where part_id=C;
select * from part_privs where part_id = C;
select * from part_col_stats where part_id = C;
4. Delete the metadata of that partition:
	delete from PARTITION_KEY_VALS where part_id=C;
	delete from PARTITION_PARAMS where part_id=C;
	delete from PARTITIONS where tbl_id='B' and part_id=C;
5. In the Hive console, restore the data:
	MSCK REPAIR TABLE db_name.A;
6. Sync the metadata in Impala:
	invalidate metadata db_name.A;
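
The order of the deletes in step 4 matters, because the referencing tables have to be cleaned before the PARTITIONS row. As a minimal sketch (not the author's script), the shell function below only prints those three statements for a given TBL_ID and PART_ID, so they can be reviewed before being pasted into a MySQL client. The function name `gen_partition_cleanup_sql` and the example IDs 101 and 2001 are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not execute) the cleanup statements from
# step 4, referencing tables first, then the PARTITIONS row itself.
gen_partition_cleanup_sql() {
    local tbl_id=$1 part_id=$2
    # Accept plain non-empty integers only, to avoid building bad SQL.
    case "$tbl_id" in
        ''|*[!0-9]*) echo "tbl_id must be an integer" >&2; return 1 ;;
    esac
    case "$part_id" in
        ''|*[!0-9]*) echo "part_id must be an integer" >&2; return 1 ;;
    esac
    echo "delete from PARTITION_KEY_VALS where part_id=${part_id};"
    echo "delete from PARTITION_PARAMS where part_id=${part_id};"
    echo "delete from PARTITIONS where tbl_id=${tbl_id} and part_id=${part_id};"
}

# Example IDs are made up; use the values found in steps 1 and 2.
gen_partition_cleanup_sql 101 2001
```

Printing instead of executing keeps a manual review step in front of any change to the metastore.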

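3. About delayed reporting
Scenario: data collected by the mini program's codeless tracking may only be reported the next time the user opens the mini program, because the mini program can store data locally. eventtime is the event time carried by the pushed log, that is, the time the user performed the action. So if you query a fixed eventtime range at different points in time, the row count can grow; this is delayed reporting. It also follows that, within the same topic, the same record has the same eventtime for every consumer group, and the whole record is identical.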

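4. Upgrading the version of a designated Session cluster in the Flink production environment
4.1 In the test environment, create a Session cluster of the target version and bring the related jobs over. Change the consumer group to a test group, the sink database to a test database, and so on, so that production data is not affected. Other changes include code changes made possible by the upgrade and code changes forced by incompatibilities. Only set a parameter once you understand what it means.
4.2 After the code is verified, let the jobs run for a while, then compare the test data with the production data for completeness and consistency.
4.3 Stop the production jobs one by one, and check Kafka consumption and Holo data generation to confirm that consuming and writing have really stopped.
4.4 Stop the Session cluster (stopping the cluster would also stop its jobs, but stopping the jobs one by one first is safer).
4.5 Modify the advanced configuration and the resource configuration of one job.

Checkpoint interval of 180 s, minimum interval between checkpoints of 60 s.
Flink restart strategy: fixed-delay restart, 10 restart attempts, 5 min between attempts.
Resource configuration: parallelism can follow the number of tasks, for example parallelism 3 for three sinks. If the data volume is small, parallelism 1 is also fine with three sinks.
cacheSize is measured in rows and has to be larger than the table's row count for the whole table to be loaded. If the table is updated once a day, cacheTTLMs can be set larger, for example one hour, 3600000.
4.6 Validate; go live; check the version information to confirm that the code and configuration changes are in place.
4.7 Run one job and check Kafka consumption and Holo data generation to confirm that consuming and writing work normally.
Repeat 4.5, 4.6 and 4.7 for each of the related jobs in turn.
4.8 The next day, check the day-over-day change of the data produced by each job.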

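5. DML on a Holo table can target a single field of a single row. So when Flink sinks to a Holo table, 'mutatetype'='insertorupdate' is the better choice. With insertorreplace, a write that only changes some fields of the Holo table sets the other fields to null, because insertorreplace works at row granularity. See the official reference documentation.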

6. In Holo, convert a text-typed time (year, month, day, hour, minute, second) to a date: select to_char(cast(text_field as timestamptz),'yyyy-MM-dd');
7. Shell script error: "[: =: unary operator expected". Logical AND should be written as:

if [[ 1 == 1 && 1 == 2 ]];then
        echo "true"
else
        echo "false"
fi

if [ "1" == "1" ] && [ "1" == "2" ];then
        echo "true"
else
        echo "false"
fi

if [[ "1" == "1" && "1" == "2" ]];then
        echo "true"
else
        echo "false"
fi
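
The message itself usually comes from an unquoted variable that expands to nothing inside single brackets, so that `[ $flag = on ]` reaches `[` as `[ = on ]`. The sketch below shows the failing form next to the quoted form that avoids it. The variable `flag` and the result variables are hypothetical names used only for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical variable for illustration; it is deliberately empty.
flag=""

# Unquoted: the test is seen as `[ = on ]`. In bash this reports
# "[: =: unary operator expected" (silenced here) and fails.
if [ $flag = "on" ] 2>/dev/null; then
    unquoted="true"
else
    unquoted="failed"
fi

# Quoted: the empty string is still passed as one argument, so the
# comparison is well-formed and simply evaluates to false.
if [ "$flag" = "on" ] && [ "1" = "1" ]; then
    quoted="true"
else
    quoted="false"
fi

echo "unquoted: ${unquoted}"
echo "quoted: ${quoted}"
```

Quoting every expansion inside `[ ]`, or using `[[ ]]` as in the forms above, avoids the error when a variable happens to be empty.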

8. Flink job error log
Under Job O&M > Job Exploration > JVM Exceptions, the log shows the following:

Caused by: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
Caused by: java.io.IOException: java.util.concurrent.ExecutionException: com.alibaba.hologres.client.exception.HoloClientException: getOrSubmitPartition fail. tableName="db_name"."table_name", partValue=202206291100
Caused by: com.alibaba.hologres.client.exception.HoloClientException: [UNKNOW:XX000]ERROR: could not open relation with OID 454128

It turned out that the dimension table, the Holo table "db_name"."table_name", had caching configured with these parameters:

  'cache' = 'LRU',
  'cacheSize' = '100000',
  'cacheTTLMs' = '120000',

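Yet a single partition of that table holds 1,000,000 rows, and cacheSize is measured in rows. My understanding is that cacheSize should be larger than the row count of one partition for the cache to be effective. Because the data volume is large, I removed these settings and will keep watching to see whether the error comes back.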

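9. Handling a special time requirement
Background: the fields are the earliest business time, the latest business time, and whether the run time falls within business hours ([earliest business time, latest business time]), all at hour granularity. The earliest business time is 6 o'clock and the latest business time is 5 o'clock. The raw business times have second granularity; the derived hour values range from 0 to 23. The earliest business time is truncated to its hour, e.g. 5:30 becomes 5, 00:30 becomes 0, 00:00 becomes 0. The latest business time is rounded up to the next hour, e.g. 5:30 becomes 6, 00:30 becomes 1, 00:00 becomes 0.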

		,hour(rent_time) as rent_time_start_hour
		,case hour(rent_time) 
			when 5 then 5 
			when 23 then 0 
			else hour(rent_time)+1 
			end 
		 as rent_time_end_hour
-- the SQL that sorts out the earliest and latest values is not listed here

The job runs every 10 minutes, at minute granularity. So 5:10 is not within business hours [2,5]: 5:10 has to be rounded up to 6 before the comparison.

if [[ -z $1 ]]; then
	datetime=$(date -d "-10 minutes" +"%Y-%m-%d %H:%M")
else
	datetime=$1
fi

echo "datetime:${datetime}"
statdate=$(date -d "${datetime}" +"%Y%m%d%H%M")
echo "statdate:${statdate}"
statdate_day_1d_ago=$(date -d "${datetime} 1 days ago" +"%Y%m%d")
echo "statdate_day_1d_ago:${statdate_day_1d_ago}"
datetime_day=$(date -d "${datetime}" +"%Y-%m-%d")
echo "datetime_day:${datetime_day}"
statdate_hour=$(date -d "${datetime}" +"%H")
echo "statdate_hour:${statdate_hour}"
# The variables below are used to decide whether the run time falls within business hours
statdate_minute=$(date -d "${datetime}" +"%M")
echo "statdate_minute:${statdate_minute}"
is_over_hour=0
statdate_hour_new=${statdate_hour}
# If the minute part is not 00, round the hour up to the next hour
if [ "${statdate_minute}" != "00" ];then
	is_over_hour=1
	statdate_hour_new=$(date -d "${datetime} 1 hours" +"%H")
fi
# Strip a leading zero, e.g. 06 becomes 6
if [ "${statdate_hour_new:0:1}" -eq 0 ];then
	statdate_hour_new=${statdate_hour_new:1:1}
fi
echo "statdate_hour:${statdate_hour}"
echo "is_over_hour:${is_over_hour}"
echo "statdate_hour_new:${statdate_hour_new}"
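
The rounding rule in the script can be isolated into a small function, which makes the edge cases easier to check (minute 00 stays on the hour, 23:xx wraps to 0). This is a sketch that assumes the input is an `HH:MM` string and does the arithmetic directly instead of calling `date`; the function name `round_up_hour` is not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: round an HH:MM time up to the next whole hour
# (0-23), leaving the hour unchanged when the minute part is 00.
round_up_hour() {
    local hh mm
    hh=${1%%:*}
    mm=${1##*:}
    # Drop one leading zero so "08" and "09" are not read as invalid octal.
    hh=${hh#0}; hh=${hh:-0}
    mm=${mm#0}; mm=${mm:-0}
    if [ "$mm" -ne 0 ]; then
        hh=$(( (hh + 1) % 24 ))
    fi
    echo "$hh"
}

round_up_hour "05:10"   # prints 6
round_up_hour "05:00"   # prints 5
round_up_hour "23:50"   # prints 0
```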

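		--case 1. Business hours cross midnight and the run time falls between midnight and the latest business time: add 24 to both the run time and the latest business time to decide whether it is within business hours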
		,(case when t_shop_business_time.shop_30days_order_latest_end_time < t_shop_business_time.shop_30days_order_earliest_start_time and cast('${statdate_hour_new}' as int)<=t_shop_business_time.shop_30days_order_latest_end_time and cast('${statdate_hour_new}' as int)+24 between t_shop_business_time.shop_30days_order_earliest_start_time and t_shop_business_time.shop_30days_order_latest_end_time+24 then 1
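		--case 2. Business hours cross midnight and the run time is after the latest business time: add 24 to the latest business time to decide whether it is within business hours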
		 when t_shop_business_time.shop_30days_order_latest_end_time < t_shop_business_time.shop_30days_order_earliest_start_time and cast('${statdate_hour_new}' as int)>t_shop_business_time.shop_30days_order_latest_end_time and cast('${statdate_hour_new}' as int) between t_shop_business_time.shop_30days_order_earliest_start_time and t_shop_business_time.shop_30days_order_latest_end_time+24 then 1
		--case 3. Business hours do not cross midnight: a plain check against [earliest business time, latest business time] is enough
		 when t_shop_business_time.shop_30days_order_latest_end_time >= t_shop_business_time.shop_30days_order_earliest_start_time and cast('${statdate_hour_new}' as int) between t_shop_business_time.shop_30days_order_earliest_start_time and t_shop_business_time.shop_30days_order_latest_end_time then 1
		 else 0 end) as is_statdate_in_business_time
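
For hour values in 0 to 23, the three CASE branches reduce to one rule: when the window crosses midnight (latest < earliest), the hour is inside if it is at or after the earliest hour, or at or before the latest hour; otherwise it has to lie between the two. The shell sketch below expresses that reduced rule so it can be tried on a few values without running the query. The function name `in_business_hours` and its argument order are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print 1 if hour $1 lies within [$2, $3] on a
# 24-hour clock, where the window may cross midnight; print 0 otherwise.
# All three arguments are assumed to be integers in 0..23.
in_business_hours() {
    local hour=$1 start=$2 end=$3
    if [ "$end" -lt "$start" ]; then
        # Window crosses midnight, e.g. [22, 5]: evening part or morning part.
        if [ "$hour" -ge "$start" ] || [ "$hour" -le "$end" ]; then
            echo 1
        else
            echo 0
        fi
    elif [ "$hour" -ge "$start" ] && [ "$hour" -le "$end" ]; then
        echo 1
    else
        echo 0
    fi
}

in_business_hours 6 2 5    # prints 0: the 5:10 run, rounded up to 6, is outside [2,5]
in_business_hours 3 22 5   # prints 1: inside the after-midnight part of [22,5]
```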

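1. Stop the job and save a Savepoint. A Savepoint acts as a manually triggered checkpoint, and this clears the historical checkpoints.
2. Start fresh, watching the Holo data and the Kafka messages at the same time. In other words, run on the current version.
From now on, if the key has not changed, the version has not changed, and there is no compatibility problem, choose Pause, save a Savepoint, and then start fresh; that way the historical checkpoints are not cleared.
If there is a compatibility problem, or the job misbehaves after starting, stop the job so that the old state is discarded, then restart.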