Misc notes_202203_week4


Anla Likes Sunshine


Article link: https://blog.csdn.net/AnlaGodness/article/details/123739966


1. sum(column): if the column contains at least one non-NULL value, the result of sum is also non-NULL; if every value is NULL, the result is NULL.

--hive
select sum(j) --NULL	
from (select cast(null as int) as j union all select cast(null as int) as j) t
--impala
select sum(j) from (select null as j union all select null as j) t ; --NULL	
--hive / impala
select sum(j) 
from (
	select 1 as j 
	union all 
	select null as j
) t; --1
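
The same NULL rule can be checked without a Hive or Impala cluster. The following is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in engine (SQLite is not part of the original notes, but its SUM treats NULLs the same way). The COALESCE query is an extra illustration of one common way to turn the all-NULL case into 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# All inputs NULL: SUM returns NULL, which sqlite3 maps to None.
all_null = conn.execute(
    "SELECT SUM(j) FROM (SELECT NULL AS j UNION ALL SELECT NULL AS j) t"
).fetchone()[0]

# At least one non-NULL input: NULLs are skipped and the rest are summed.
mixed = conn.execute(
    "SELECT SUM(j) FROM (SELECT 1 AS j UNION ALL SELECT NULL AS j) t"
).fetchone()[0]

# COALESCE replaces the NULL result of the all-NULL case with 0.
coalesced = conn.execute(
    "SELECT COALESCE(SUM(j), 0) "
    "FROM (SELECT NULL AS j UNION ALL SELECT NULL AS j) t"
).fetchone()[0]

conn.close()
print(all_null, mixed, coalesced)
```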

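2. Flink jobs fall into three types:
(1) Pipeline ETL, e.g. reading logs.
(2) Analytics, in two flavors: one where the computation happens in Flink (less common), and one where it happens in Holo, e.g. Flink writes general-purpose metrics into Holo tables and analysts then run their own custom statistics on them.
(3) Event control, e.g. monitoring for anomalies in the Airflow log metadata refresh.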

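3. Flink state
Typical use cases: deduplication, window computation, and access to historical data, i.e. any scenario where processing the current record depends on what was seen before.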
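
To make the deduplication case concrete, here is a minimal sketch in plain Python. It is not Flink API code: the dictionary `seen` only imitates the role that per-key state would play in a Flink job, and the event tuples and the function name `deduplicate` are made up for illustration.

```python
from typing import Dict, Hashable, Iterable, Iterator, Tuple

Event = Tuple[Hashable, str]  # (key, payload)


def deduplicate(events: Iterable[Event]) -> Iterator[Event]:
    """Emit only the first event seen for each key."""
    # Stand-in for per-key state: remembers which keys were already emitted.
    seen: Dict[Hashable, bool] = {}
    for key, payload in events:
        if seen.get(key):
            continue  # the stored history says this key was emitted before
        seen[key] = True
        yield key, payload


events = [
    ("u1", "click"),
    ("u2", "view"),
    ("u1", "click"),
    ("u3", "view"),
    ("u2", "view"),
]
unique_events = list(deduplicate(events))
print(unique_events)
```

Without the remembered history in `seen`, the function could not tell a first occurrence from a repeat, which is why deduplication needs state.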

4. Python regular expressions: re.match()
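
As a quick reminder of how re.match behaves, here is a minimal sketch. The pattern and the sample strings are made up for illustration. The key point is that re.match only matches at the start of the string, while re.search scans the whole string.

```python
import re

pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# re.match only succeeds if the pattern matches at the start of the string.
at_start = pattern.match("2022-03-28 job finished")
not_at_start = pattern.match("finished at 2022-03-28")

# re.search scans the whole string, so it finds the same date anywhere.
found_anywhere = pattern.search("finished at 2022-03-28")

print(at_start.groups())        # the three captured groups: year, month, day
print(not_at_start)             # None, because the string starts with text
print(found_anywhere.group(0))  # the full matched date
```
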
5. Spark job tuning. The job failed with:

ERROR : Job failed with org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 14.0 failed 4 times, most recent failure: Lost task 0.3 in stage 14.0 (TID 209, emr-worker-xx.cluster-xxxxxx, executor 15): ExecutorLostFailure (executor 15 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.3 GB of 13.2 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
[2022-04-01 01:49:27,624] {bash_operator.py:128} INFO - java.util.concurrent.ExecutionException: Exception thrown by job
[2022-04-01 01:49:27,624] {bash_operator.py:128} INFO - at org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:337)

This shows that the executor memory was too small, so I raised it from 12g to 16g.
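
The 13.2 GB limit in the error is consistent with 12g of executor memory plus the memory overhead. The sketch below assumes the job did not set the overhead explicitly, so Spark's documented default applies: 10% of executor memory, with a floor of 384 MiB. The helper name `container_limit_gb` is made up for illustration.

```python
MIN_OVERHEAD_GB = 384 / 1024  # assumed default floor of 384 MiB
OVERHEAD_FACTOR = 0.10        # assumed default overhead fraction


def container_limit_gb(executor_memory_gb: float) -> float:
    """Approximate YARN container size: executor memory plus overhead."""
    overhead = max(MIN_OVERHEAD_GB, executor_memory_gb * OVERHEAD_FACTOR)
    return executor_memory_gb + overhead


used_gb = 15.3  # physical memory reported in the error message

for executor_memory_gb in (12, 16):
    limit = container_limit_gb(executor_memory_gb)
    fits = used_gb <= limit
    print(f"{executor_memory_gb}g -> limit {limit:.1f} GB, fits: {fits}")
```

Under these assumptions 12g gives a limit of 13.2 GB, matching the log, and 16g gives about 17.6 GB, which leaves room for the 15.3 GB the task actually used.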

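One executor core runs one task at a time, so the total number of cores is the maximum parallelism (how many tasks can run at the same moment). The parallelism actually used can be read from the job log by summing the + numbers on a single line. In Figure 1, at initialization, the parallelism (the sum of the + numbers on one line) is 15. Figure 2 shows the line with the highest parallelism, which is 25.
[Figure 1]

[Figure 2]
Figure 3, from the Spark history server, also shows executor usage: many executors sit idle, so the configured number of executors should be lowered.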
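The optimizations were therefore:

Step 1. Initial executors: initialExecutors lowered from 20 to 8.
Step 2. Cores: 2 cores per executor is enough, since 2 * 8 (initialExecutors) = 16, which exceeds the initial parallelism of 15.
Step 3. Max executors: maxExecutors set to 16, since 16 * 2 (cores) = 32, which exceeds the peak parallelism of 25.
P.S. Set minExecutors to a sensible value to match.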
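
The sizing arithmetic in these steps can be sketched as follows. The parallelism figures (15 and 25) and the chosen values (8 and 16 executors, 2 cores) come from the notes above. The helper `executors_needed` is made up for illustration and only computes the smallest executor count that covers a given parallelism.

```python
import math


def executors_needed(parallelism: int, cores_per_executor: int) -> int:
    """Smallest executor count whose task slots cover the given parallelism."""
    return math.ceil(parallelism / cores_per_executor)


cores = 2
initial_parallelism = 15  # observed at initialization
peak_parallelism = 25     # highest value observed in the log

min_initial = executors_needed(initial_parallelism, cores)
min_max = executors_needed(peak_parallelism, cores)

# Values chosen in the steps above, with the task slots they provide.
chosen = {"initialExecutors": 8, "maxExecutors": 16}
slots = {name: count * cores for name, count in chosen.items()}

print(min_initial, min_max, slots)
```

With 2 cores per executor, 8 executors is the minimum that covers the initial parallelism, while the peak would need at least 13, so a maxExecutors of 16 leaves some headroom.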

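6. To look at the logs in a Kafka topic I have to consume them myself, because ops does not give the data warehouse team access to Kibana (sob). It turns out that consuming a topic is not hard, though: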

Step 1: find the Kafka bin path

find .* -name kafka-console-consumer.sh

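Step 2: the consume command
The bootstrap-server value can be found in the job scripts that connect to Kafka.
For zookeeper, locate the config file with find .* -name server.properties, look up zookeeper.connect in it, and take any one host:port entry.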

<path>/kafka-console-consumer.sh \
--zookeeper emr-worker-x.cluster-xxxxxx:2181 \
--bootstrap-server xxx.xx.xxx.xxx:9092,xxx.xx.xxx.xxx:9092,xxx.xx.xxx.xxx:9092 \
--topic xxx_topic \
--consumer-property group.id=xxx_group
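
If the command needs to be launched from a script, its arguments can be assembled programmatically. The sketch below is only an illustration: the script path and broker addresses are hypothetical placeholders, and the helper `build_consumer_command` is made up. It passes only --bootstrap-server, on the assumption that the cluster's Kafka version uses the newer consumer; whether --zookeeper is also needed or accepted depends on the Kafka version in use.

```python
import shlex
from typing import List


def build_consumer_command(
    script_path: str,
    bootstrap_servers: List[str],
    topic: str,
    group_id: str,
) -> List[str]:
    """Assemble the argument list for kafka-console-consumer.sh."""
    return [
        script_path,
        "--bootstrap-server", ",".join(bootstrap_servers),
        "--topic", topic,
        "--consumer-property", f"group.id={group_id}",
    ]


command = build_consumer_command(
    "/path/to/kafka-console-consumer.sh",  # hypothetical path
    ["broker1:9092", "broker2:9092"],      # hypothetical broker addresses
    "xxx_topic",
    "xxx_group",
)

# Print the shell-quoted command line for inspection.
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Building the arguments as a list rather than one string avoids shell quoting problems if a topic or group name ever contains special characters.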