Troubleshooting
TensorFlow is installed in the local Python, and the zip package has been uploaded to HDFS, but the job fails with:
no module named tensorflow
Most likely the job did not use the Python inside the zip package, but some other Python installed on the system.
Add the following options to the submit command to state explicitly which Python to use:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./Python/bin/python \
--conf spark.pyspark.python=./Python/bin/python \
--conf spark.pyspark.driver.python=./Python/bin/python \
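To verify which interpreter the executors actually picked up, one hedged trick is to log `sys.executable` from inside a task (the `sc.parallelize(...).map(...)` line in the comment is illustrative, not from the original job). Locally the same check is just:

```python
import sys

# On the cluster you would log the interpreter path from inside a task, e.g.
# (illustrative): sc.parallelize([0]).map(lambda _: sys.executable).collect()
# Locally:
print(sys.executable)  # on the executors this should print ./Python/bin/python
```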
java.lang.ClassNotFoundException: org.tensorflow.hadoop.io.TFRecordFileOutputFormat
solution:
git clone https://github.com/tensorflow/ecosystem.git
cp -r ecosystem/hadoop /tmp
cd /tmp/hadoop
mvn clean package
cd target/
hadoop fs -put tensorflow-hadoop-1.10.0.jar
then add the following option to your submit command:
--jars hdfs:///user/${USER}/tensorflow-hadoop-1.10.0.jar \
2021-04-14 15:55:10,148 ERROR (Thread-3-41283) Exception in TF background thread
AttributeError: Can't pickle local object 'start.<locals>.<lambda>'
Root cause not found yet; I suspect a bug in the multiprocessing library on macOS, similar to this issue and this discussion
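The error itself is easy to reproduce with the standard library: `pickle` cannot serialize a function-local lambda, and serializing the target callable is exactly what `multiprocessing`'s `spawn` start method (the default on macOS since Python 3.8) has to do. A minimal sketch:

```python
import pickle

def start():
    # A lambda defined inside a function is a "local object"; pickle cannot
    # serialize it by qualified name, so sending it to a spawned child fails.
    return lambda: 42

try:
    pickle.dumps(start())
except Exception as e:
    print("pickle failed:", e)  # Can't pickle local object 'start.<locals>.<lambda>'
```

Replacing the lambda with a module-level function is the usual workaround when the background-thread code is under your control.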
Failed to download resource { { hdfs://centos7:9000/user/zhangchen/python38.zip, 1618426581429, ARCHIVE, null },pending,[(container_1618449935987_0002_01_000001)],409455709874,DOWNLOADING} ENOENT: No such file or directory
solution:
add
export LIB_HDFS=$HADOOP_HOME/lib/native
export LIB_JVM=$JAVA_HOME/lib/server
to your ~/.bashrc
and source ~/.bashrc
then add --conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
to your submit command.
If the steps above don't work, use hadoop fs -ls /user/zhangchen/python38.zip
to check the file size, and hadoop fs -get /user/zhangchen/python38.zip
to download it locally and inspect its contents.
What I found was an empty directory tree with no files in it at all.
Re-compressing the zip and uploading it to HDFS again solved the problem!
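The broken-archive case above can be detected without a cluster: in the zip format, member names ending in `/` are directory entries, so an archive containing only such entries is an empty tree. A small stdlib sketch (the in-memory zip is a stand-in for `python38.zip`):

```python
import io
import zipfile

def has_real_files(zf):
    # zip members whose names end in "/" are directory entries, not files
    return any(not name.endswith("/") for name in zf.namelist())

# Stand-in for the broken python38.zip: directory entries only, no files.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Python/", b"")
    zf.writestr("Python/bin/", b"")

with zipfile.ZipFile(buf) as zf:
    print(has_real_files(zf))  # False -> re-create and re-upload the zip
```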
ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
When I checked the node manager log around that timestamp, I found the following error:
2021-04-15 11:30:30,878 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs usable space is below configured utilization percentage/no more usable space [ /usr/local/share/data/hadoop/nm-local-dir : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below configured utilization percentage/no more usable space [ /opt/hadoop-3.2.2/logs/userlogs : used space above threshold of 90.0% ]
reason: df -h
shows that the disk usage of the virtual machine is above 90%.
solution: migrate the Hadoop data files to a directory shared from the host
on host
mkdir ~/virtual_machines/vmdata
share vmdata with the VM
on VM
$HADOOP_HOME/sbin/stop-all.sh
mv /usr/local/share/data/hadoop /mnt/hgfs/vmdata/
sed -i 's#/usr/local/share/data#/mnt/hgfs/vmdata#g' $HADOOP_HOME/etc/hadoop/core-site.xml
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
use df -h
to check again and make sure the disk usage has dropped.
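The 90% threshold in the log is YARN's default for `yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage`: the node manager marks a local-dir or log-dir bad once used space crosses it. The check boils down to simple arithmetic, sketched here:

```python
def dir_is_healthy(total_bytes, free_bytes, max_util_pct=90.0):
    """Mimic YARN's per-disk utilization check (default threshold 90.0%)."""
    used_pct = 100.0 * (total_bytes - free_bytes) / total_bytes
    return used_pct < max_util_pct

print(dir_is_healthy(100, 15))  # True: 85% used, below the threshold
print(dir_is_healthy(100, 5))   # False: 95% used, the dir is marked bad
```

Raising the threshold in yarn-site.xml is an alternative to freeing disk space, but freeing space (as done above) is the safer fix.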
exit code 137
reason: manual intervention or OOM
If the job was not killed manually, check the memory configuration and resubmit it.
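Exit code 137 is 128 + 9: the process died from SIGKILL, the signal delivered both by a manual `kill -9` and by the kernel OOM killer, which is why those are the two suspects. A quick demonstration (assumes a Unix-like system):

```python
import signal
import subprocess
import sys

# Kill a child process with SIGKILL and inspect how its death is reported.
p = subprocess.run(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)
print(p.returncode)          # Python reports signal deaths as -signum: -9
print(128 + signal.SIGKILL)  # shells report 128 + signum: 137
```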
Exception: Timeout while feeding partition
When this error occurred, I checked the container log and also found the following error:
2021-04-15 16:22:33.646363: E tensorflow/core/platform/hadoop/hadoop_file_system.cc:115] HadoopFileSystem load error: libjvm.so: cannot open shared object file: No such file or directory
so I just added $LIB_JVM
to --conf spark.executorEnv.LD_LIBRARY_PATH
and everything worked:
--conf spark.executorEnv.LD_LIBRARY_PATH=$LIB_JVM:$LIB_HDFS \
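TensorFlow's HDFS support loads `libjvm.so` and `libhdfs.so` at runtime through the dynamic loader, which searches the directories listed in `LD_LIBRARY_PATH`; when `$LIB_JVM` is missing from the executor environment, that load fails as in the log above. A rough sketch of the search (the `find_lib` helper is hypothetical, not a real loader API):

```python
import os
import tempfile

def find_lib(name, search_path):
    """Walk a colon-separated directory list the way the dynamic loader
    walks LD_LIBRARY_PATH, returning the first match or None."""
    for d in search_path.split(os.pathsep):
        candidate = os.path.join(d, name)
        if d and os.path.exists(candidate):
            return candidate
    return None

# Simulate an executor whose LD_LIBRARY_PATH includes $LIB_JVM.
with tempfile.TemporaryDirectory() as lib_jvm:
    open(os.path.join(lib_jvm, "libjvm.so"), "w").close()  # stand-in file
    print(find_lib("libjvm.so", "/usr/lib" + os.pathsep + lib_jvm) is not None)  # True
    print(find_lib("libjvm.so", "/definitely-not-here") is not None)             # False
```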
Another time, when I went through the log, I found that the problem was that the HDFS NameNode had entered standby mode.
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
reason: I misused the TFRecord API to read CSV files; once I switched to real TFRecord files, everything worked fine.
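The "corrupted record at 0" message follows directly from the TFRecord framing: each record is stored as a little-endian uint64 length, a masked CRC of that length, the payload, and a masked CRC of the payload. Feeding a CSV file to the TFRecord reader makes it interpret the first 8 bytes of text as a record length, which is nonsensical from the very first record:

```python
import struct

# First 8 bytes of a CSV file, decoded as a TFRecord length header.
csv_bytes = b"col1,col2\n1,2\n"
(length,) = struct.unpack("<Q", csv_bytes[:8])
print(length > 2**50)  # True: an absurd "record length" -> corrupted record at 0
```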
import pyspark
reports the error:
cloudpickle.py line 127 in _make_cell_set_template_code
TypeError: an integer is required (got type bytes)
reason: maybe the version of pyspark (and its bundled cloudpickle) shipped with Spark 2.3.3 is not compatible with my Python version 3.8.8, as described here
solution: I’m not willing to downgrade my Python version (because life is short), and since the Hadoop version at my company is so old (2.6.0), I decided to use the Hadoop-free build of Spark and run the following command:
echo 'export SPARK_DIST_CLASSPATH=$(hadoop classpath)' >> $SPARK_HOME/conf/spark-env.sh
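For context on the cloudpickle failure: `types.CodeType` gained a new leading `posonlyargcount` parameter in Python 3.8, and the old cloudpickle bundled with Spark 2.3 constructs code objects with positional arguments, so every argument shifts by one and a `bytes` value lands where an `int` is expected. A quick probe for the changed field (requires Python 3.8+):

```python
# Positional-only parameters (the "/" marker) arrived in Python 3.8 together
# with the new co_posonlyargcount field on code objects; the matching extra
# CodeType constructor argument is what breaks Spark 2.3's bundled cloudpickle.
def f(x, /):
    return x

print(hasattr(f.__code__, "co_posonlyargcount"))  # True on Python 3.8+
```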
SparkException: When running with master ‘yarn’ either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
solution:
echo 'export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop' >> $SPARK_HOME/conf/spark-env.sh
Exception: No executor_id file found on this node
I added --conf spark.cores.max=${TOTAL_CORES}
to the submit command, as suggested by the author of TFoS, but the problem remained.
I checked the log again and found an earlier error:
Lost executor 1 on node3: Container killed by YARN for exceeding physical memory limits.
so I raised --conf spark.executor.memoryOverhead to 2G and the job ran fine.
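YARN kills a container once its physical memory passes `spark.executor.memory` plus `spark.executor.memoryOverhead`. The default overhead, max(384 MiB, 10% of executor memory), is often too small for TensorFlow jobs, because TensorFlow's native allocations live outside the JVM heap and count only against the overhead. The arithmetic, sketched:

```python
def container_limit_mb(executor_memory_mb, overhead_mb=None):
    """Physical memory limit YARN enforces on a Spark executor container.
    Default overhead is max(384 MiB, 10% of executor memory)."""
    if overhead_mb is None:
        overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + overhead_mb

print(container_limit_mb(4096))        # 4505: default overhead of only 409 MiB
print(container_limit_mb(4096, 2048))  # 6144: with memoryOverhead=2G
```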