Summary of Big Data Problems Encountered (Old)

Latest recommended article published on 2023-11-04 13:36:29

xylzc

Article link: https://blog.csdn.net/lin86182824/article/details/114433803

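1. A user runs Phoenix from code on a local machine, and the ZooKeeper (zk) connection uses intranet addresses by default, so the network is unreachable. The requirement is to switch the zk addresses to public ones.
Analysis: ZooKeeper already listens on 0.0.0.0, so zk itself is reachable from outside. The real issue is the address that the HBase RegionServer (the process that owns port 16020) registers in zk: it must either register its public IP, or register a hostname that the local machine then maps to the public IP. Set hbase.server.hostname.useip to false in hbase-site, restart HBase, and then on the local machine map hostnames such as emr-header-1 and emr-worker-x to the corresponding public IPs.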

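2. A Phoenix table was mapped onto an HBase table that uses an HBase counter. The value is a Long in HBase and the column type on the Phoenix side is BIGINT. When queried from Phoenix the value is displayed incorrectly, although it is 10 on the HBase side. Which step went wrong?
Answer: try the UNSIGNED_LONG type in Phoenix.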
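
The mismatch can be illustrated with a small Python sketch. It assumes, as the Phoenix data type documentation describes, that UNSIGNED_LONG uses the same 8-byte layout as HBase's Bytes.toBytes(long), while BIGINT stores the value with the sign bit flipped so that the bytes sort correctly. The function names are hypothetical and exist only for this illustration.

```python
import struct

def hbase_long_bytes(value: int) -> bytes:
    """8-byte big-endian encoding, the layout HBase's Bytes.toBytes(long) uses."""
    return struct.pack(">q", value)

def decode_as_unsigned_long(raw: bytes) -> int:
    """Assumed UNSIGNED_LONG decoding: read the bytes as a plain big-endian long."""
    return struct.unpack(">q", raw)[0]

def decode_as_bigint(raw: bytes) -> int:
    """Assumed BIGINT decoding: flip the sign bit back, then read a big-endian long."""
    flipped = bytes([raw[0] ^ 0x80]) + raw[1:]
    return struct.unpack(">q", flipped)[0]

raw = hbase_long_bytes(10)            # what the HBase counter stores
print(decode_as_unsigned_long(raw))   # 10
print(decode_as_bigint(raw))          # -9223372036854775798
```

Under that assumption, a counter written natively by HBase only reads back as 10 when the Phoenix column is declared UNSIGNED_LONG.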

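3. A user lost 21 DataNodes overnight. The logs showed OOM errors. The user restarted all the DataNodes, and the NameNode recovered after failing over twice on its own. Tracing the cause afterwards showed that the DataNodes could not create new threads: the tasks of the user's own risk-control Spark job spawned multiple threads and exhausted the machines' thread limit.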

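4. After adding Kudu-related configuration to Hive and restarting it, the Hive metastore would not start. Once the configuration was corrected it started but reported a class-not-found error. The likely cause was the permissions on hms-plugin.jar, combined with hive_aux_jars_path pointing directly at the jar file name. The metastore is started by the hadoop user, so a jar owned by root causes problems.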

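5. kafka-consumer-groups.sh --bootstrap-server host-kf001.ymt.io:9092 --group SA-Orderv3-fix20201014 --reset-offsets --topic pay_flume_orderv3-core --to-datetime 2020-10-13T03:50:00.000 --execute
The offset reset failed: the user reported that the command completed successfully but had no effect, whereas resetting to an explicit offset worked. The community appears to have reports of the same problem, but no fix was found.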

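6. After enabling ACID support in Hive and changing the related parameters, HQL statements failed. With transactions (txn) enabled, the metastore starts a background process that compacts small files. Transactions currently have problems, so this process keeps writing exception logs. Deleting metastore.log and restarting the metastore did not help; the logging continued. From the code this looks like a bug. The workaround is to add these four settings to hive-site (javax.jdo.option.ConnectionURL, javax.jdo.option.ConnectionUserName, javax.jdo.option.ConnectionPassword, javax.jdo.option.ConnectionDriverName); they were originally only in hivemetastore-site.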

7. https://issues.apache.org/jira/browse/HIVE-22373 is probably a Hive bug. Try adding set tez.am.container.reuse.enabled=false; as a workaround. Alternatively, the user can run the query on the MR engine with set hive.execution.engine=mr;

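8. A statement run in Hue failed with "Container exited with a non-zero exit code 143 Killed by external signal", and it failed the same way when run in Hive. The YARN UI showed nothing beyond this error, with no map or reduce information. Judging from the logs, the container had been killed. After a long investigation the suspects were the Hive environment or something else entirely. The first ideas were to have the user run select * with a limit, or to run the query on Spark from Hive. In the end the user was asked to try Tez (set hive.execution.engine=tez;), on the theory that the data volume was too large because of a join, or that MR itself was the problem. The Tez run then failed with: is running 3743744B beyond the 'PHYSICAL' memory limit. Current usage: 643.6 MB of 640 MB physical memory used; 4.4 GB of 3.1 TB virtual memory used. Killing container.
The Tez setting in use was tez.am.resource.memory.mb=640. After changing it, the job was still killed. Further investigation found many suspicious processes, and the security log showed that a suspicious IP had attempted to log in and succeeded. After restricting logins by IP and killing the suspicious processes, the query ran normally.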
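
As a quick sanity check on the numbers in that message, the sketch below recomputes the usage from the 640 MB limit and the reported overage in bytes. It assumes the MB figures in the log are binary megabytes (1024 * 1024 bytes); the helper name is hypothetical.

```python
MIB = 1024 * 1024  # assumed size of one "MB" in the container log, in bytes

def usage_mb(limit_mb: float, overage_bytes: int) -> float:
    """Container usage in MB, from the limit and the overage reported in bytes."""
    return limit_mb + overage_bytes / MIB

current = usage_mb(640, 3743744)
print(f"{current:.1f} MB used of 640 MB")          # 643.6 MB used of 640 MB
print(f"{(current / 640 - 1) * 100:.2f}% over")    # 0.56% over
```

The container was less than 1% over its limit, which is consistent with a limit that is simply set too tightly for the application master.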

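9. With Sqoop (using the MySQL JDBC driver mysql-connector-java-8.0.21.jar), when importing data from RDS (UTC) into Hive (UTC+8), how can the automatic time zone conversion be avoided? The timestamps changed when the data was extracted from RDS into Hive, and the user wanted them kept in UTC. The user was using the wrong jar. The bundled one is mysql-connector-java-5.1.38.jar, but that one fails with ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: Unknown system variable 'query_cache_size'. Switching to mysql-connector-java-5.1.49.jar solved it.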

10. Kibana keeps reporting at startup: Generating a random key for xpack.encrypted_saved_objects.encryptionKey. To be able to decrypt encrypted saved objects attributes after restart, please set xpack.encrypted_saved_objects.encryptionKey in kibana.yml.
Edit the Elasticsearch configuration file config/elasticsearch.yml and change the following items:
cluster.name: my-application
node.name: node-1
network.host: 0.0.0.0
cluster.initial_master_nodes: ["node-1"]
xpack.license.self_generated.type: basic
(1) If node.name is left commented out, Kibana fails at startup with the error Elasticsearch cluster did not respond with license information.
(2) The default is cluster.initial_master_nodes: ["node-1", "node-2"]. If node-2 is not removed, a bootstrap checks failed error occurs.
(3) xpack.license.self_generated.type: basic is set because Elasticsearch 7.2 bundles X-Pack by default and the default license is only valid for 30 days, so this switches to the basic feature set only.

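11. The NameNode stopped abnormally and would not start again. The cause was that the user had deleted data blocks under the HDFS data directory. One JournalNode was still healthy, so its data was synced to the other two JournalNodes (after first backing up the existing data on those two). After restarting the JournalNodes, the NameNode started normally. Later a job failed, and an fsck check of the blocks showed that many replicas were missing. Running the job on MR worked, so the failure was probably because the Tez packages all live on HDFS. The user was asked to repair the replicas with the following commands.
Status report: hdfs dfsadmin -report
Block health check: hdfs fsck /
Block recovery: hdfs debug recoverLease -path <file path> -retries <retry count>
Repairing under-replicated blocks:
First collect the paths of the under-replicated blocks
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
Then repair each path
for hdfsfile in $(cat /tmp/under_replicated_files); do echo "Fixing $hdfsfile :"; hadoop fs -setrep 2 "$hdfsfile"; done
After a long time the blocks were still not repaired, and more and more were going missing. fsck then showed that one DataNode was no longer connected to the cluster. After the logs confirmed which DataNode it was, it was restarted from the EMR console, and the blocks started repairing automatically. Any blocks that do not recover automatically still need to be repaired with the commands above.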
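
The grep/awk step above can also be sketched in Python, which makes it easy to drop duplicate paths when one file has several under-replicated blocks. This is only an illustration: the sample lines are made up to resemble fsck output, and the function name is hypothetical. Like the awk command, it takes everything before the first colon as the path.

```python
from typing import List

def under_replicated_paths(fsck_output: str) -> List[str]:
    """Return the unique file paths of lines that mention 'Under replicated'."""
    paths: List[str] = []
    for line in fsck_output.splitlines():
        if "Under replicated" in line:
            path = line.split(":", 1)[0]   # same idea as awk -F':' '{print $1}'
            if path not in paths:
                paths.append(path)
    return paths

# Illustrative input only, not captured from a real cluster.
sample = (
    "/data/a.txt:  Under replicated blk_1. Target Replicas is 3 but found 1 live replica(s).\n"
    "/data/a.txt:  Under replicated blk_2. Target Replicas is 3 but found 1 live replica(s).\n"
    "/data/c.txt:  Under replicated blk_3. Target Replicas is 3 but found 2 live replica(s).\n"
)
print(under_replicated_paths(sample))  # ['/data/a.txt', '/data/c.txt']
```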

12. Presto exceeded its connection limit, with the error Caused by: java.util.concurrent.RejectedExecutionException: Max requests per destination 1024 exceeded for HttpDestination[http://172.17.2.63:9090]@39f9a9e4,queue=1024,pool=DuplexConnectionPool[c=1000/1000,a=1000,i=0]
See https://zhuanlan.zhihu.com/p/57956341
for slave/worker:
sudo su -c 'echo "exchange.http-client.max-requests-queued-per-destination=5000
exchange.http-client.max-connections-per-server=5000" >> /etc/presto/conf/config.properties'
sudo restart presto-server
for master/node:
sudo su -c 'echo "scheduler.http-client.max-requests-queued-per-destination=5000
scheduler.http-client.max-connections-per-server=5000
exchange.http-client.max-requests-queued-per-destination=5000
exchange.http-client.max-connections-per-server=5000" >> /etc/presto/conf/config.properties'
sudo restart presto-server

13. A Spark program failed with the following error
Application application_1614944491376_0003 failed 1 times (global limit =2; local limit is =1) due to ApplicationMaster for attempt appattempt_1614944491376_0003_000001 timed out. Failing the application.
[?] WHY FAILED:
CHECK STATUS: has 2 tasks failed
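Investigation showed CPU usage at 100% during one time window. According to the logs, in both jobs the Spark executors lost their connection to the Driver after running for a while, and the system log on the Driver's machine was interrupted as well. Most likely the system load (for example CPU) was so high that the cluster's ECS instance went down, and the jobs failed as a result.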

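14. An HDFS question: on the standby node the edit log is still at 16xxx, but the fsimage is already at 17xxx, and the 16xxx edit log is still there. Was it simply never deleted? My understanding is that all operations are received by the active NameNode, which writes the edits to the JournalNodes, and the standby then syncs the edits from the JournalNodes. Isn't the fsimage pulled directly from the active?
The explanation is as follows. The standby builds a new fsimage from its local image plus the finalized edit logs on the JournalNodes, uploads the new image to the active, and then deletes its old local image. The standby node does not keep edit logs. At startup the NameNode also loads the edit logs from the JournalNodes. The leftover edit log here was caused by a switch of the active NameNode, and it should be deleted at the next switch.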

15. [Problem] A Hive SQL statement fails with Error: Error while compiling statement: FAILED: SemanticException UDF reflect is not allowed.
set hive.server2.builtin.udf.blacklist=empty_blacklist;
set hive.server2.builtin.udf.blacklist;
select t1.device_id, t1.appsflyer_id, t1.dt as server_date, t1.event_time, t1.platform, t1.device_model, t1.device_manufacturer, t1.app_version, t2.media_source, t2.install_time
from (
  -- each day, compute the date every device was first added from the full history of event times
  select device_id, appsflyer_id, event_time, platform, device_model, device_manufacturer, app_version, dt
  from (
    select device_id, appsflyer_id, event_time, platform, device_model, device_manufacturer, app_version, dt,
           row_number() over(partition by device_id order by event_time) rn
    from opay_dw.dwd_owallet_client_first_visit_base_di
    where dt <= '2021-04-12' and device_id != '' and platform != 'h5'
  ) tx
  where rn = 1
) t1
left join (
  -- attach channel information to each device
  SELECT appsflyer_id, media_source, app_name, install_time
  from (
    SELECT appsflyer_id, media_source, app_name, install_time,
           row_number() over(PARTITION BY appsflyer_id ORDER BY install_time) rn
    from opay_dw.dwd_opay_tracker_appsflyer_di
    where dt <= if('2021-04-12' <= '2020-11-30', '2020-11-30', '2021-04-12') and event_name = 'install' and app_id = 'team.opay.pay'
  ) t0
  where rn = 1
) t2
on nvl(t1.appsflyer_id, regexp_replace(reflect("java.util.UUID", "randomUUID"), "-", "")) = t2.appsflyer_id
When the built-in Hive UDF reflect is used through HiveServer2, it reports "semanticexception udf reflect is not allowed". By default, HiveServer2 disables some UDFs for security reasons.
This can be changed by editing hive-site.xml and restarting HiveServer2 (it cannot be changed with set).

 <property>
    <name>hive.server2.builtin.udf.blacklist</name>
    <value>empty_blacklist</value>
  </property>
 
  <property>
    <name>hive.server2.builtin.udf.whitelist</name>
    <value></value>
  </property>

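16. Tez hangs indefinitely
Cause: the cluster's mapreduce.map.cpu.vcores was changed to 4, and Hive uses this parameter by default as the vcores for Tez containers. When the Tez vcores value is greater than 1, containers cannot start properly. Setting hive.tez.cpu.vcores=1 resolves the problem.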

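17. The standby and the active NameNode show different dead nodes. Restarting the affected DataNode fixed it. The cause was an open-source bug, https://issues.apache.org/jira/browse/HDFS-14219, and restarting worker-56 repaired it. The bug is triggered very rarely (this DN process had been running for more than half a year) and mainly when memory is insufficient.
The DN on this machine had only 2 GB of memory, while each DN already holds about 9 million blocks, which is very tight. It is recommended to raise the memory of all DNs to 8 GB or more.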

18. The error is as follows
2021-06-25 21:12:29,906 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 211.65 sec
2021-06-25 21:12:31,967 Stage-1 map = 12%, reduce = 0%, Cumulative CPU 240.86 sec
2021-06-25 21:12:34,038 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 257.37 sec
2021-06-25 21:12:35,070 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 269.01 sec
2021-06-25 21:12:39,204 Stage-1 map = 15%, reduce = 0%, Cumulative CPU 296.87 sec
2021-06-25 21:12:41,266 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 364.57 sec
2021-06-25 21:12:43,328 Stage-1 map = 19%, reduce = 0%, Cumulative CPU 396.36 sec
2021-06-25 21:12:49,520 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 423.19 sec
2021-06-25 21:12:53,643 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 432.38 sec
2021-06-25 21:13:02,921 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 455.76 sec
2021-06-25 21:13:12,186 Stage-1 map = 24%, reduce = 0%, Cumulative CPU 466.36 sec
2021-06-25 21:13:15,277 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 524.85 sec
2021-06-25 21:13:26,614 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 582.57 sec
2021-06-25 21:13:28,671 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 644.05 sec
2021-06-25 21:13:31,756 Stage-1 map = 28%, reduce = 0%, Cumulative CPU 719.1 sec
2021-06-25 21:13:37,934 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 660.27 sec
MapReduce Total cumulative CPU time: 11 minutes 0 seconds 270 msec
Ended Job = job_1623809152303_0587 with errors
Error during job, obtaining debugging information…
Job Tracking URL: http://emr-header-1.cluster-230631:20888/proxy/application_1623809152303_0587/
Examining task ID: task_1623809152303_0587_m_000008 (and more) from job job_1623809152303_0587
Examining task ID: task_1623809152303_0587_m_000012 (and more) from job job_1623809152303_0587
Examining task ID: task_1623809152303_0587_m_000001 (and more) from job job_1623809152303_0587
Examining task ID: task_1623809152303_0587_m_000010 (and more) from job job_1623809152303_0587
Examining task ID: task_1623809152303_0587_m_000033 (and more) from job job_1623809152303_0587
Examining task ID: task_1623809152303_0587_m_000040 (and more) from job job_1623809152303_0587

Task with the most failures(4):

Task ID:
task_1623809152303_0587_m_000010

URL:
http://0.0.0.0:8088/taskdetails.jsp?jobid=job_1623809152303_0587&tipid=task_1623809152303_0587_m_000010

Diagnostic Messages for this Task:
Container [pid=23152,containerID=container_e03_1623809152303_0587_01_000055] is running beyond physical memory limits. Current usage: 8.7 GB of 8.2 GB physical memory used; 16.5 GB of 75.0 TB virtual memory used. Killing container.
Dump of the process-tree for container_e03_1623809152303_0587_01_000055 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23152 23150 23152 23152 (bash) 0 0 116043776 680 /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8392m -XX:ParallelGCThreads=2 -XX:CICompilerCount=2 -Djava.io.tmpdir=/mnt/disk4/yarn/usercache/root/appcache/application_1623809152303_0587/container_e03_1623809152303_0587_01_000055/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/mnt/disk2/log/hadoop-yarn/containers/application_1623809152303_0587/container_e03_1623809152303_0587_01_000055 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.20.19 37771 attempt_1623809152303_0587_m_000010_3 3298534883383 1>/mnt/disk2/log/hadoop-yarn/containers/application_1623809152303_0587/container_e03_1623809152303_0587_01_000055/stdout 2>/mnt/disk2/log/hadoop-yarn/containers/application_1623809152303_0587/container_e03_1623809152303_0587_01_000055/stderr
|- 23169 23152 23152 23152 (java) 1738 979 17575825408 2270358 /usr/lib/jvm/java-1.8.0/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx8392m -XX:ParallelGCThreads=2 -XX:CICompilerCount=2 -Djava.io.tmpdir=/mnt/disk4/yarn/usercache/root/appcache/application_1623809152303_0587/container_e03_1623809152303_0587_01_000055/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/mnt/disk2/log/hadoop-yarn/containers/application_1623809152303_0587/container_e03_1623809152303_0587_01_000055 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.20.19 37771 attempt_1623809152303_0587_m_000010_3 3298534883383

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 130 Reduce: 1 Cumulative CPU: 660.27 sec HDFS Read: 10551786453 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 11 minutes 0 seconds 270 msec
hive>

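This error occurs because the memory values configured for the MapReduce job in mapreduce.map.memory.mb and mapreduce.map.java.opts are too close together (both about 8.2 GB). The former should be about 1.2 times the latter.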
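
A minimal sketch of that rule of thumb, using the -Xmx8392m heap size visible in the log above. The 1.2 ratio comes from the recommendation in this note; the function name is hypothetical.

```python
import math

def container_mb_for_heap(heap_mb: int, ratio: float = 1.2) -> int:
    """Container size (mapreduce.map.memory.mb) for a given JVM heap (-Xmx),
    leaving headroom for non-heap memory. The ratio is a rule of thumb."""
    return math.ceil(heap_mb * ratio)

heap_mb = 8392                                   # from -Xmx8392m in the log
print(container_mb_for_heap(heap_mb))            # 10071
```

With the heap kept at 8392 MB, the container would need to be roughly 10 GB rather than 8.2 GB. The alternative is to keep the container size and lower -Xmx instead.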

19. The error is as follows
21/06/25 15:49:46 WARN [netty-rpc-env-timeout] YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 28 at RPC address 192.168.20.3:57410, but got no response. Marking as slave lost.
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from null in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply…
at …$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(Execu…
at …$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBa…
at …$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(Bat…
at …$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:157)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.s…
at …$anon$1.run(NettyRpcEnv.scala:243)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from null in 120 seconds
… 8 more
21/06/25 15:49:46 ERROR [dispatcher-event-loop-5] YarnScheduler: Lost executor 28 on emr-worker-113.cluster-230631: Slave lost
21/06/25 15:49:46 WARN [dispatcher-event-loop-5] TaskSetManager: Lost task 400.0 in stage 3.0 (TID 576, emr-worker-113.cluster-230631, executor 28): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason: Slave lost
21/06/25 15:49:46 WARN [dispatcher-event-loop-5] TaskSetManager: Lost task 403.0 in stage 3.0 (TID 579, emr-worker-113.cluster-230631, executor 28): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason: Slave lost
21/06/25 15:49:46 WARN [dispatcher-event-loop-5] TaskSetManager: Lost task 404.0 in stage 3.0 (TID 580, emr-worker-113.cluster-230631, executor 28): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason: Slave lost
21/06/25 15:49:46 WARN [dispatcher-event-loop-5] TaskSetManager: Lost task 407.0 in stage 3.0 (TID 583, emr-worker-113.cluster-230631, executor 28): ExecutorLostFailure (executor 28 exited caused by one of the running tasks) Reason: Slave lost
21/06/25 15:49:46 INFO [spark-listener-group-executorManagement] ExecutorAllocationManager: Existing executor 28 has been removed (new total is 0)

In practice the executor had simply run out of memory. Increasing the memory solved it.

20. HDFS rebalance finished but the cluster still looks unbalanced.

su -l hdfs -c "/usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 10"

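From the logs, the balancer had already finished.
The UI confirmed that usage was indeed within 10%.

This shows that the threshold is compared against the average.

According to the documentation: if overall usage across all the DataNodes in the cluster is 40% of the cluster's total disk-storage capacity, the script ensures that DataNode disk usage is between 30% and 50% of the DataNode disk-storage capacity.
Applying this to the situation above: the minimum is 10% below the average and the maximum is 10% above it, so the maximum and the minimum differ by 20%.
At first I took the median shown in the figure to be the average, but it is actually the usage of the DataNode in the middle of the ranking (the median usage), not the average.
For details see: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/admin_hdfs_balancer.html

So the only option is to adjust the rebalance command.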

sudo su -l hdfs -c "/usr/lib/hadoop-current/sbin/start-balancer.sh -threshold 5"
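
The threshold rule quoted from the documentation can be sketched as follows. The node names and usage figures are hypothetical; the cluster average is computed as total used over total capacity, and a node counts as balanced when its usage is within the threshold of that average.

```python
from typing import Dict, List, Tuple

def cluster_average(nodes: Dict[str, Tuple[float, float]]) -> float:
    """Cluster usage in percent: total used divided by total capacity."""
    used = sum(u for u, _ in nodes.values())
    capacity = sum(c for _, c in nodes.values())
    return 100.0 * used / capacity

def outside_threshold(nodes: Dict[str, Tuple[float, float]], threshold: float) -> List[str]:
    """Names of nodes whose usage differs from the average by more than threshold."""
    avg = cluster_average(nodes)
    return sorted(
        name for name, (u, c) in nodes.items()
        if abs(100.0 * u / c - avg) > threshold
    )

# Hypothetical cluster: (used, capacity) per DataNode, in the same unit.
nodes = {"dn1": (30.0, 100.0), "dn2": (40.0, 100.0), "dn3": (50.0, 100.0)}
print(cluster_average(nodes))          # 40.0
print(outside_threshold(nodes, 10))    # []
print(outside_threshold(nodes, 5))     # ['dn1', 'dn3']
```

With a threshold of 10 this cluster already counts as balanced even though the fullest and emptiest nodes are 20 points apart, which is why lowering the threshold to 5 is needed to tighten the spread.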

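21. Creating Hive tables is sometimes fast and sometimes slow
This kind of problem is usually caused by GC.
jstat -gc pid 200
The recommendation is to increase the metastore memory.
Parameter: hive_metastore_heapsize
In addition, if there are many connections, going through HiveServer2 is recommended, because it reuses existing metaStoreClient instances.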

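22. select * and select count return inconsistent results in Hue

[Problem] When querying in Hue, select count(1) from ... returns a row count, but select * from ... returns no data.
First confirm that the table really contains no data. The query had also been run with extra parameters.
When hive.compute.query.using.stats=true, select count(*) from reads the row count directly from the statistics stored in the metastore.
Those statistics were probably still the ones from when the table held data, and the count was taken straight from them. select * goes through an MR computation, so the two results differ.
Sometimes parameters such as hive.cbo.enable also need to be turned off.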

23. The YARN UI shows abnormal metrics.
With yarn.nodemanager.resource.cpu-vcores set to 8, two machines should give a total of 16 vcores.
But the value shown keeps changing.
This turned out to be a bug.
https://issues.apache.org/jira/browse/YARN-8443
When a container is allocated to a NodeManager that cannot currently satisfy the request, the container is reserved (held), and the resources are only handed to it once the requirement can be met.

Cluster metrics on the web UI will give wrong Total Vcores when there is reserved container for CapacityScheduler.
This does not affect normal use, except for jobs that collect these metrics.

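24. DataNode and NameNode fail to start
When a process will not start, check its GC activity with jstat -gc pid. If there is GC pressure, adjust the memory size. When the number of DataNodes is large, it is advisable to restart the NameNode as well, to avoid an avalanche effect.
In the DataNode case here, the process stayed up and showed no GC problem, yet nothing was written to the log. The DataNode appeared neither on the HDFS UI nor in hdfs dfsadmin -report, which was very strange.
Investigation then found very high IO usage, and after a while the DataNode joined the cluster on its own.
The problem is most likely the following:
To track the storage used on each DN, HDFS has every DN periodically run du on each disk path and report the result to the NameNode, which can then compute the total storage used and the usage of each node. By default du runs every 10 minutes, to keep the usage figures reasonably current. When storage usage is low, running du on each disk has little visible impact on system IO. When usage on the node is high, the IO load caused by du affects the whole system significantly and may well block IO and affect every service on the node.
du works by adding up the sizes of all files under the target path that have not been deleted. With few files this rarely causes problems, but when the target path contains many directories and files, du takes noticeably long to run. df uses a more efficient approach: it gets its figures from the file system. The main limitation of df is that it cannot report usage per directory; it reports at the level of the disk. In other words, running df in different sub-paths of the same physical disk returns exactly the same values. Unfortunately, this means it cannot support the case where multiple DataNode block pools share one disk.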
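
The difference between the two approaches can be sketched in Python. This is an illustration, not how the DataNode is implemented: du_bytes walks the tree and sums apparent file sizes the way du conceptually does, while shutil.disk_usage asks the file system once, as df does, and therefore returns volume-level figures for any path on that volume. The directory and file names are hypothetical.

```python
import os
import shutil
import tempfile

def du_bytes(path: str) -> int:
    """du-style: visit every file under path and add up the sizes.
    The cost grows with the number of files and directories."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

def df_total_bytes(path: str) -> int:
    """df-style: a single file system query for the volume that holds path."""
    return shutil.disk_usage(path).total

with tempfile.TemporaryDirectory() as disk:
    # Two hypothetical block pool directories sharing one disk.
    pool_a = os.path.join(disk, "pool_a")
    pool_b = os.path.join(disk, "pool_b")
    os.makedirs(pool_a)
    os.makedirs(pool_b)
    with open(os.path.join(pool_a, "blk_1"), "wb") as f:
        f.write(b"x" * 4096)
    with open(os.path.join(pool_b, "blk_2"), "wb") as f:
        f.write(b"x" * 1024)

    sizes = (du_bytes(pool_a), du_bytes(pool_b))
    same_volume_figure = df_total_bytes(pool_a) == df_total_bytes(pool_b)

print(sizes)               # (4096, 1024): du tells the two pools apart
print(same_volume_figure)  # True: df reports the same volume for both
```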
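One way to address this is to increase the interval at which the DataNode runs du.
Adjust fs.du.interval to lengthen the du interval, trading some freshness of the usage figures for relief from the high IO caused by du.
Alternatively, tune the cache-related settings at the Linux level.
Linux mainly caches inode-related information. On a Linux file system, access to files and directories always goes through the corresponding inode information first, so tuning the file system cache settings appropriately can improve access efficiency.

vm.vfs_cache_pressure
This setting controls the kernel's tendency to reclaim the memory used for the directory and inode caches:
The default value of 100 means the kernel keeps the directory and inode caches at a reasonable percentage relative to pagecache and swapcache.
Lowering the value below 100 makes the kernel prefer to retain the directory and inode caches.
Raising the value above 100 makes the kernel prefer to reclaim the directory and inode caches.

Another option is to reduce the existing directory depth of the stored data [HDFS-8791].
Reference: https://blog.csdn.net/breakout_alex/article/details/100879489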

25. Impala "file not found" errors fill up the disk.
The error is as follows

W0721 18:39:12.039170 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.054760 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.070664 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.087787 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.104749 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.120471 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.135872 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.151440 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.167306 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.183763 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.200184 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.216246 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.232846 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.248136 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.265051 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.281122 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.296733 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.312978 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao
W0721 18:39:12.329353 29442 FileSystemUtil.java:758] ErrorCode : 25002 , ErrorMsg: File not found. File not found: /taste_matters/TAB/order_data/dwd/tab_dwd_order_taste_main_order_di/ds=20210507 in bucket kaifeiyao

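Most likely the data was cleaned up without dropping it through the table with the proper commands: the data files were removed directly while the metadata was left behind, which causes the continuous errors.
This turned out to be an open-source bug that the improper operation triggered. https://issues.apache.org/jira/browse/IMPALA-10579
The only fix is to drop the table metadata, or the metadata of the affected partitions, in Hive and then restart catalogd.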

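26. Flume writes to HDFS fail and then recover automatically
If concurrency x 128 MB (block size) x 2 > remaining disk space, Flume writes fail.
The DataNode logs show that in the 20 minutes from 06:50 to 07:10 more than 10,000 blocks were written. Each block needs 128 MB, so more than 1 TB of storage would be requested. According to the dfs report that was provided, the emr-worker-1 node had only 368 GB free, so space ran out (the resulting message was There are 4 datanode(s) running and no node(s) are excluded in this operation.).
A small file no longer occupies 128 MB once it has been written, but during the write 128 MB is reserved for it. The problem arises when many files are written concurrently without being closed. If the 10,000 files were written one after another, there would be no problem.
Solutions: either expand capacity, or reduce the application's concurrent write volume.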
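
The space check above can be written out as a short sketch. It follows the formula in this note (open writers x block size x 2) and uses binary units; the function names are hypothetical, and treating all 10,000 blocks as open at the same time is a worst-case assumption.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def reserved_bytes(open_writers: int, block_size: int = 128 * MIB, factor: int = 2) -> int:
    """Space reserved while files are still open, per the formula in this note."""
    return open_writers * block_size * factor

def writes_fit(open_writers: int, free_bytes: int) -> bool:
    """True if the reservation for the open writers fits in the free space."""
    return reserved_bytes(open_writers) <= free_bytes

def max_open_writers(free_bytes: int, block_size: int = 128 * MIB, factor: int = 2) -> int:
    """Largest number of concurrently open files the free space can cover."""
    return free_bytes // (block_size * factor)

free = 368 * GIB                                   # free space on emr-worker-1
print(reserved_bytes(10000, factor=1) // GIB)      # 1250 (GiB, even without the x2)
print(writes_fit(10000, free))                     # False
print(max_open_writers(free))                      # 1472
```

Under these assumptions the node could cover roughly 1,500 concurrently open files, far fewer than the number of blocks seen in the 20-minute window.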

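27. HBase fails to start
Following the error log, the zk log showed that there were too many ZooKeeper connections. Raising the per-client connection limit in zk (maxClientCnxns) and restarting the service solved it.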

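28. Exceptions when using Sqoop
The latest version of Sqoop 1 is 1.4.7, and the jars bundled with it may be incompatible with other frameworks, for example when connecting to MySQL 8 or to Hive 3 (the default is Hive 2).
The incompatibility can show up as all kinds of errors, or even as a sync that succeeds but produces malformed data, such as incorrect date conversion.
In that case the jars have to be adjusted to your own requirements.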