Background
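Offline recommendation is a common scenario in production. When the results also need to be reasonably timely, the compute load multiplies, and a single machine can hardly carry that load or leave room for later scaling. The obvious move is to speed up the computation with a distributed cluster. Here we use PySpark to run inference on a TensorFlow SavedModel. I had not done this before, so I hit a lot of pitfalls with the environment, and I am recording them here.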
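For orientation, here is a minimal sketch of one common way to distribute this kind of inference: load the model once per partition, then score rows in batches. It is not necessarily the setup used in this post. The names `batched`, `predict_partition`, `load_model`, and `load_dummy_model` are hypothetical, and the dummy model stands in for a real SavedModel so the snippet runs without Spark or TensorFlow installed.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple


def batched(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def predict_partition(
    rows: Iterable,
    load_model: Callable[[], Callable[[List], List[float]]],
    batch_size: int = 256,
) -> Iterator[Tuple[object, float]]:
    """Score every row of one partition.

    load_model is a hypothetical zero-argument factory. In a real job it
    would wrap the SavedModel loading step and return a callable that maps
    a list of feature rows to a list of scores.
    """
    model = load_model()  # loaded once per partition, not once per row
    for batch in batched(rows, batch_size):
        scores = model(batch)
        for row, score in zip(batch, scores):
            yield row, score


def load_dummy_model() -> Callable[[List], List[float]]:
    """Stand-in model for illustration only: score = sum of the features."""
    return lambda batch: [float(sum(features)) for features in batch]


# Local dry run on a plain Python list instead of a Spark partition.
results = list(
    predict_partition([(1, 2), (3, 4), (5, 6)], load_dummy_model, batch_size=2)
)
```

On a cluster, the same generator could be passed to `rdd.mapPartitions(lambda rows: predict_partition(rows, load_model))`, with `load_model` replaced by a factory that loads the real SavedModel on the executor. Loading per partition keeps the model out of the serialized closure and amortizes the load cost over many rows.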
Pitfalls
- The error message is as follows:
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 5, localhost, executor driver): java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized