TensorFlowOnSpark Troubleshooting


Troubleshooting

TensorFlow is installed in the Python environment and the zip package has been uploaded to HDFS, but the job fails at runtime with:
no module named tensorflow

The job is probably not picking up the Python inside the zip package and is instead using some other Python version installed on the system.
Add the following options to the submit command to explicitly specify which Python to use:

--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.pyspark.python=./Python/bin/python  \
--conf spark.pyspark.driver.python=./Python/bin/python \
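
For context, here is a minimal sketch of where these options go in the submit command. The archive name python38.zip and the #Python alias are assumptions based on the paths used later in this post, and your_tfos_job.py is just a placeholder:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs:///user/${USER}/python38.zip#Python \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
  --conf spark.pyspark.python=./Python/bin/python \
  --conf spark.pyspark.driver.python=./Python/bin/python \
  your_tfos_job.py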

java.lang.ClassNotFoundException: org.tensorflow.hadoop.io.TFRecordFileOutputFormat

solution:

git clone https://github.com/tensorflow/ecosystem.git
cp -r ecosystem/hadoop /tmp
cd /tmp/hadoop
mvn clean package
cd target/
hadoop fs -put tensorflow-hadoop-1.10.0.jar

then add the following config to your submit command:

--jars hdfs:///user/${USER}/tensorflow-hadoop-1.10.0.jar \

2021-04-14 15:55:10,148 ERROR (Thread-3-41283) Exception in TF background thread

AttributeError: Can't pickle local object 'start.<locals>.<lambda>'

The cause has not been found yet; I suspect it is a bug in the multiprocessing library on macOS, similar to this issue and this discussion.


Failed to download resource { { hdfs://centos7:9000/user/zhangchen/python38.zip, 1618426581429, ARCHIVE, null },pending,[(container_1618449935987_0002_01_000001)],409455709874,DOWNLOADING} ENOENT: No such file or directory

solution:
add the following lines

export LIB_HDFS=$HADOOP_HOME/lib/native
export LIB_JVM=$JAVA_HOME/lib/server

to your ~/.bashrc and then run source ~/.bashrc.

then add --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \ to your submit command.

If the above steps don't work, use hadoop fs -ls /user/zhangchen/python38.zip to check the file size, and use hadoop fs -get /user/zhangchen/python38.zip to download it locally and inspect its contents.

What I found was that the archive contained only an empty directory tree with no files in it.
I simply re-compressed the zip and re-uploaded it to HDFS, which solved the problem!
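
For reference, a minimal sketch of re-creating and re-uploading the archive. The local path /opt/python38 is an assumption; use wherever your packed Python environment actually lives, and zip the contents of the directory so that bin/python sits at the zip root (matching ./Python/bin/python after extraction):

cd /opt/python38                              # assumption: root of the packed Python environment
zip -r /tmp/python38.zip .                    # zip the contents, not the parent directory
hadoop fs -put -f /tmp/python38.zip /user/zhangchen/python38.zip
hadoop fs -ls /user/zhangchen/python38.zip    # verify the file size looks reasonable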


ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

When I checked the node manager log around that timestamp, I found the following error:

2021-04-15 11:30:30,878 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /usr/local/share/data/hadoop/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /opt/hadoop-3.2.2/logs/userlogs : used space above threshold of 90.0% ] 

reason: df -h shows that the disk usage of the virtual machine is above 90%.
solution: migrate the Hadoop data files to a directory shared with the host.
on host

mkdir ~/virtual_machines/vmdata

share vmdata with the VM (it ends up mounted at /mnt/hgfs/vmdata inside the VM)

on VM

$HADOOP_HOME/sbin/stop-all.sh
mv /usr/local/share/data/hadoop /mnt/hgfs/vmdata/
sed -i 's#/usr/local/share/data#/mnt/hgfs/vmdata#g' $HADOOP_HOME/etc/hadoop/core-site.xml
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

Use df -h to check again and make sure the disk usage has gone down.


exit code 137

reason: manual intervention or OOM.
If the job was not killed manually, check the memory configuration and resubmit the job.


Exception: Timeout while feeding partition
When this error occurred, I checked the container log and also found the following error:

2021-04-15 16:22:33.646363: E tensorflow/core/platform/hadoop/hadoop_file_system.cc:115] HadoopFileSystem load error: libjvm.so: cannot open shared object file: No such file or directory

So I just added $LIB_JVM to --conf spark.executorEnv.LD_LIBRARY_PATH, and everything was OK:

--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
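
If you are not sure where libjvm.so and the native HDFS libraries actually live, a quick check (paths depend on your JDK and Hadoop layout):

find "$JAVA_HOME" -name 'libjvm.so'
find "$HADOOP_HOME/lib/native" -name 'libhdfs*'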

Another time, going through the log, I found the problem was that the HDFS NameNode had entered standby mode.
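
In an HA setup you can check the NameNode state directly; the service id nn1 below is an assumption, use the ids defined in your hdfs-site.xml:

hdfs haadmin -getServiceState nn1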


tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0

reason: I misused the TFRecord API to read CSV files; once I fed it actual TFRecord files, everything was fine.


import pyspark reports an error:
cloudpickle.py line 127 in _make_cell_set_template_code
TypeError: an integer is required (got type bytes)

reason: the version of PySpark shipped with Spark 2.3.3 may not be compatible with my Python version 3.8.8, as described here

solution: I'm not willing to downgrade my Python version (because life is short), and since the Hadoop version at my company is so old (2.6.0), I decided to use the Hadoop-free build of Spark and then run the following command:

echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> $SPARK_HOME/conf/spark-env.sh

SparkException: When running with master ‘yarn’ either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> $SPARK_HOME/conf/spark-env.sh

Exception: No executor_id file found on this node

I added --conf spark.cores.max=${TOTAL_CORES} to the submit command, as suggested by the author of TFoS, but the problem was still there.
I checked the log again and found an earlier error:

Lost executor 1 on node3: Container killed by YARN for exceeding physical memory limits.

So I boosted --conf spark.executor.memoryOverhead to 2g and the job completed successfully.
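
For reference, the relevant flag in the submit command looks like this (2g worked for me; the right value depends on your job):

--conf spark.executor.memoryOverhead=2g \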

