Deep Learning Pipelines on Spark

This article is a translation of https://github.com/databricks/spark-deep-learning.
Article URL: https://my.oschina.net/u/2306127/blog/1811876, by openthings, 2018-05-18.

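Deep learning requires a complete workflow covering sample data processing, model training, model validation, and model deployment. Traditional deep learning engines mainly cover the core functions of training and model invocation, so using them in large-scale production applications still takes substantial development work, and operations are relatively complex.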

Deep Learning Pipeline

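Deep Learning Pipelines for Apache Spark provides a high-level API for scalable deep learning in Python. It builds on Spark's cluster computing and distributed in-memory architecture, which give fast access to large datasets and to the compute capacity of many nodes.

Overview

Deep Learning Pipelines provides high-level APIs for scalable deep learning in Python, running on a Spark cluster.

The library comes from Databricks and leverages two of Spark's strongest facets:

In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code.
It uses Spark's powerful distributed engine to scale out deep learning on massive datasets.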

Currently, TensorFlow and TensorFlow-backed Keras workflows are supported, with a focus on:

Large-scale inference / scoring.
Transfer learning and hyperparameter tuning on image data.

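In addition, the library provides tools for data scientists and machine learning experts to turn deep learning models into SQL functions that a much wider group of users can use. It does not perform distributed training of a single model, which remains an active research area; instead it aims to provide practical solutions for the majority of deep learning use cases.

For an overview of the library, see the Databricks blog post introducing Deep Learning Pipelines. For the various use cases the library serves, see the Quick user guide section below.

The library is in its early days, and feedback and contributions from everyone are welcome.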

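Maintainers: Bago Amirbekian, Joseph Bradley, Yogesh Garg, Sue Ann Hong, Tim Hunter, Siddharth Murching, Tomas Nykodym, Lu Wang

Building and running unit tests

To compile this project, run build/sbt assembly from the project home directory. This will also run the Scala unit tests.

To run the Python unit tests, run the run-tests.sh script from the python/ directory (after compiling). A few environment variables need to be set first, for example: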

# Be sure to run build/sbt assembly before running the Python tests
sparkdl$ SPARK_HOME=/usr/local/lib/spark-2.3.0-bin-hadoop2.7 PYSPARK_PYTHON=python3 SCALA_VERSION=2.11.8 SPARK_VERSION=2.3.0 ./python/run-tests.sh

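Spark version compatibility

To work with the latest code, Spark 2.3.0 is required; Python 3.6 and Scala 2.11 are recommended. See the travis config for the combinations of software versions that are regularly tested.

Compatibility requirements for each release are listed in the Releases section.

Support

To ask questions or join the development discussion, go to the DL Pipelines Google group.

To report bugs or request features, create an entry in Github issues or join an existing thread.

Releases

Version 1.0.0: requires Spark 2.3.0. Python 3.6 and Scala 2.11 are recommended. Requires TensorFlow 1.6.0.
1. Uses the image definition from Spark 2.3.0. The new definition uses BGR channel ordering for 3-channel images, instead of the RGB ordering used in this project before the change.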
2. Persistence for DeepImageFeaturizer (both Python and Scala).
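
The channel-ordering change in the 1.0.0 notes matters whenever pixel data moves between libraries that assume different orderings. Below is a minimal plain-Python sketch of what the swap looks like for a tiny image stored as nested lists. The helper name rgb_to_bgr is made up for illustration and is not part of sparkdl or Spark.

```python
def rgb_to_bgr(image):
    """Reverse the channel order of every pixel in a height x width x 3 nested list.

    Hypothetical helper for illustration only; not part of sparkdl or Spark.
    """
    return [[pixel[::-1] for pixel in row] for row in image]


# A 1 x 2 image: one pure-red pixel and one pure-blue pixel, in RGB order.
rgb_image = [[[255, 0, 0], [0, 0, 255]]]
bgr_image = rgb_to_bgr(rgb_image)
print(bgr_image)  # [[[0, 0, 255], [255, 0, 0]]]
```

Applying the same reversal a second time restores the original ordering, so one function covers both directions.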

Quick user guide

Deep Learning Pipelines provides a suite of tools for working with and processing images using deep learning. The tools fall into the following categories:

Working with images in Spark: built natively into Spark DataFrames.
Transfer learning: a very fast way to put deep learning to use.
Distributed hyperparameter tuning: via Spark MLlib Pipelines.
Applying deep learning models at scale - to images: use your own or well-known popular models to make predictions or transform images into features.
Applying deep learning models at scale - to tensors: of up to 2 dimensions.
Deploying models as SQL functions: use deep learning models from SQL.

To run the examples below, get the Databricks notebook (see the Databricks docs for Deep Learning Pipelines), which works with the latest release of Deep Learning Pipelines. Databricks notebooks compatible with older releases (0.1.0, 0.2.0, 0.3.0, 1.0.0) are also available.

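Working with images in Spark

The first step in applying deep learning to images is loading the images. Spark and Deep Learning Pipelines include utility functions that load millions of images into a Spark DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale.

Using Spark's ImageSchema: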

from pyspark.ml.image import ImageSchema
image_df = ImageSchema.readImages("/data/myimages")

Or, using your own image library:

from sparkdl.image import imageIO as imageIO
image_df = imageIO.readImagesWithCustomFn("/data/myimages",decode_f=<your image library, see imageIO.PIL_decode>)

The resulting DataFrame contains a column named "image" that holds an image struct with schema == ImageSchema.

image_df.show()

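Why images? Deep learning has proven its power on image-related tasks, so we decided to add built-in support for image data to Spark. The long-term goal is to support more data types, such as text and time series, depending on community demand.

Transfer learning

Deep Learning Pipelines provides utilities to perform transfer learning on images, which is one of the fastest ways to start using deep learning. With Deep Learning Pipelines, it takes just a few lines of code.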

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from sparkdl import DeepImageFeaturizer

featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol="label")
p = Pipeline(stages=[featurizer, lr])

model = p.fit(train_images_df)    # train_images_df is a dataset of images and labels

# Inspect training error
df = model.transform(train_images_df.limit(10)).select("image", "probability", "prediction", "uri", "label")
predictionAndLabels = df.select("prediction", "label")
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Training set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
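
The regParam and elasticNetParam arguments in the example above control how the logistic regression stage is regularized. As a rough sketch of how the two interact, the function below computes the elastic-net penalty in the form given in the Spark MLlib documentation, regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), where alpha is elasticNetParam. The function name and the sample weights are made up for illustration.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of the L1 norm and half the squared L2 norm.

    Illustrative helper only; Spark computes this internally during training.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    return reg_param * mix


weights = [1.0, -2.0, 0.5]  # made-up coefficients
penalty = elastic_net_penalty(weights, reg_param=0.05, elastic_net_param=0.3)
print(penalty)
```

With elasticNetParam set to 1.0 the penalty is pure L1 (lasso), and with 0.0 it is pure L2 (ridge); 0.3 leans toward ridge while still encouraging some sparsity.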

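Distributed hyperparameter tuning

Getting the best results in deep learning requires experimenting with different values for the training parameters, an important step called hyperparameter tuning. Since Deep Learning Pipelines exposes deep learning training as a step in Spark's machine learning pipelines, users can rely on the hyperparameter tuning infrastructure already built into Spark MLlib.

For Keras users

To perform hyperparameter tuning on a Keras model, KerasImageFileEstimator can be used to build an Estimator, which is then tuned with MLlib tooling such as CrossValidator. KerasImageFileEstimator works with image URI columns (not ImageSchema columns) in order to allow for custom image loading and processing functions, which are common in Keras.

To build the estimator with KerasImageFileEstimator, we need a Keras model stored as a file. The model can be a Keras built-in model or a model the user has trained.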

from keras.applications import InceptionV3

model = InceptionV3(weights="imagenet")
model.save('/tmp/model-full.h5')

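We also need to create an image loading function that reads image data from a URI, preprocesses it, and returns the numerical tensor in the Keras model's input format. Then we create a KerasImageFileEstimator that takes the saved model file.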

import PIL.Image
import numpy as np
from keras.applications.imagenet_utils import preprocess_input
from sparkdl.estimators.keras_image_file_estimator import KerasImageFileEstimator

def load_image_from_uri(local_uri):
  img = (PIL.Image.open(local_uri).convert('RGB').resize((299, 299), PIL.Image.ANTIALIAS))
  img_arr = np.array(img).astype(np.float32)
  img_tnsr = preprocess_input(img_arr[np.newaxis, :])
  return img_tnsr

estimator = KerasImageFileEstimator( inputCol="uri",
                                     outputCol="prediction",
                                     labelCol="one_hot_label",
                                     imageLoader=load_image_from_uri,
                                     kerasOptimizer='adam',
                                     kerasLoss='categorical_crossentropy',
                                     modelFile='/tmp/model-full.h5' # local file path for the model saved above
                                   )

We can then use it for hyperparameter tuning by running a grid search with CrossValidator.

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = (
  ParamGridBuilder()
  .addGrid(estimator.kerasFitParams, [{"batch_size": 32, "verbose": 0},
                                      {"batch_size": 64, "verbose": 0}])
  .build()
)
bc = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label" )
cv = CrossValidator(estimator=estimator, estimatorParamMaps=paramGrid, evaluator=bc, numFolds=2)

cvModel = cv.fit(train_df)
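
To make the cost of the grid search above concrete, here is a small plain-Python sketch of what a parameter grid expands to and how many model fits a k-fold cross-validation then needs. The helper names expand_grid and count_fits are hypothetical; the sketch only mirrors the idea behind ParamGridBuilder and CrossValidator and does not call Spark.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of candidate values as a list of dicts."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    return [dict(zip(names, values)) for values in combos]


def count_fits(num_param_maps, num_folds):
    """One fit per (parameter map, fold) pair, plus a final refit of the
    best parameter map on the full dataset."""
    return num_param_maps * num_folds + 1


# Same shape as the grid above: one parameter with two candidate values.
grid = {"kerasFitParams": [{"batch_size": 32, "verbose": 0},
                           {"batch_size": 64, "verbose": 0}]}
param_maps = expand_grid(grid)
print(len(param_maps), count_fits(len(param_maps), num_folds=2))  # 2 5
```

Because each fit here means training a Keras model, adding a second parameter with three candidate values would triple the number of parameter maps, which is worth keeping in mind when sizing the grid.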

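Applying deep learning models at scale

Spark DataFrames are a natural construct for applying deep learning models to a large-scale dataset. Deep Learning Pipelines provides a set of Spark MLlib Transformers for applying TensorFlow Graphs and TensorFlow-backed Keras Models at scale. The Transformers are backed by the Tensorframes library and handle the distribution of models and data to Spark workers efficiently.

Applying deep learning models at scale to images

Deep Learning Pipelines provides several ways to apply models to images at scale:

Popular image models can be applied out of the box, without requiring any TensorFlow or Keras code.
TensorFlow graphs that work on images.
Keras models that work on images.

Applying popular image models

Many well-known deep learning models for images already exist. If the task at hand is very similar to what a model provides (e.g. object recognition with ImageNet classes), or for pure exploration, you can use the Transformer DeepImagePredictor by simply specifying the model name.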

from pyspark.ml.image import ImageSchema
from sparkdl import DeepImagePredictor

image_df = ImageSchema.readImages(sample_img_dir)

predictor = DeepImagePredictor(inputCol="image", outputCol="predicted_labels", modelName="InceptionV3", decodePredictions=True, topK=10)
predictions_df = predictor.transform(image_df)
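
The decodePredictions=True and topK=10 options above ask for human-readable labels for the highest-scoring classes. The idea can be sketched in plain Python as follows; the function decode_top_k, the class names, and the scores are all invented for illustration, and this is not how sparkdl implements it internally.

```python
import heapq


def decode_top_k(scores, class_names, k):
    """Return the k highest-scoring (class_name, score) pairs, best first."""
    return heapq.nlargest(k, zip(class_names, scores), key=lambda pair: pair[1])


scores = [0.05, 0.70, 0.10, 0.15]            # made-up model output
class_names = ["cat", "dog", "fox", "owl"]   # made-up labels
print(decode_top_k(scores, class_names, k=2))  # [('dog', 0.7), ('owl', 0.15)]
```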

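For TensorFlow users

Deep Learning Pipelines provides an MLlib Transformer that applies a given TensorFlow Graph to a DataFrame containing a column of images (loaded using the utilities described in the previous section). Here is a very simple example of how a TensorFlow Graph can be used with the Transformer. In practice, the TensorFlow Graph will likely be restored from files before calling TFImageTransformer.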

from pyspark.ml.image import ImageSchema
from sparkdl import TFImageTransformer
import sparkdl.graph.utils as tfx  # strip_and_freeze_until was moved from sparkdl.transformers to sparkdl.graph.utils in 0.2.0
from sparkdl.transformers import utils
import tensorflow as tf

graph = tf.Graph()
with tf.Session(graph=graph) as sess:
    image_arr = utils.imageInputPlaceholder()
    resized_images = tf.image.resize_images(image_arr, (299, 299))
    # the following step is not necessary for this graph, but can be for graphs with variables, etc
    frozen_graph = tfx.strip_and_freeze_until([resized_images], graph, sess, return_graph=True)

transformer = TFImageTransformer(inputCol="image", outputCol="predictions", graph=frozen_graph,
                                 inputTensor=image_arr, outputTensor=resized_images,
                                 outputMode="image")

image_df = ImageSchema.readImages(sample_img_dir)
processed_image_df = transformer.transform(image_df)

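For Keras users

For applying Keras models in a distributed manner using Spark, KerasImageFileTransformer works on TensorFlow-backed Keras models. It:

Internally creates a DataFrame containing a column of images, by applying the user-specified image loading and processing function to the input DataFrame containing a column of image URIs.
Loads the Keras model from the given model file path.
Applies the model to the image DataFrame.

The API differs from that of TFImageTransformer because typical Keras workflows have very specific ways of loading and resizing images that are not part of the TensorFlow Graph.

To use the transformer, we first need a Keras model stored as a file. We simply use the Keras built-in InceptionV3 model instead of training one ourselves.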

from keras.applications import InceptionV3

model = InceptionV3(weights="imagenet")
model.save('/tmp/model-full.h5')

Now we use the model for prediction:

from keras.applications.inception_v3 import preprocess_input
from keras.preprocessing.image import img_to_array, load_img
import numpy as np
import os
from pyspark.sql.types import StringType
from sparkdl import KerasImageFileTransformer

def loadAndPreprocessKerasInceptionV3(uri):
  # this is a typical way to load and prep images in keras
  image = img_to_array(load_img(uri, target_size=(299, 299)))  # image dimensions for InceptionV3
  image = np.expand_dims(image, axis=0)
  return preprocess_input(image)

transformer = KerasImageFileTransformer(inputCol="uri", outputCol="predictions",
                                        modelFile='/tmp/model-full.h5',  # local file path for the model saved above
                                        imageLoader=loadAndPreprocessKerasInceptionV3,
                                        outputMode="vector")

image_dir = "/data/myimages"  # directory containing the input images
files = [os.path.abspath(os.path.join(image_dir, f)) for f in os.listdir(image_dir) if f.endswith('.jpg')]
uri_df = sqlContext.createDataFrame(files, StringType()).toDF("uri")

keras_pred_df = transformer.transform(uri_df)
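
The preprocess_input function imported from keras.applications.inception_v3 in the loader above rescales pixel values from the range [0, 255] to [-1, 1], which is the input range InceptionV3 expects. A minimal plain-Python sketch of that arithmetic, using a made-up helper name and a flat list of pixel values instead of a NumPy array:

```python
def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1.0, 1.0].

    Illustrative stand-in for the rescaling step; the real function
    operates on whole image arrays.
    """
    return [p / 127.5 - 1.0 for p in pixels]


print(scale_to_unit_range([0, 127.5, 255]))  # [-1.0, 0.0, 1.0]
```

This is why the loader must match the model: feeding raw 0 to 255 values to a network trained on rescaled inputs usually produces meaningless predictions.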

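Applying deep learning models at scale to tensors

Deep Learning Pipelines also provides ways to apply models with tensor inputs (up to 2 dimensions) written in popular deep learning libraries:

TensorFlow graphs
Keras models

For TensorFlow users

TFTransformer applies a user-specified TensorFlow graph to tensor inputs of up to 2 dimensions. The TensorFlow graph may be specified as TensorFlow graph objects (tf.Graph or tf.GraphDef), as a checkpoint, or as SavedModel objects (see the input object class for more detail). The transform() function applies the TensorFlow graph to a column of arrays in the input DataFrame (where each array corresponds to a Tensor) and outputs a column of arrays corresponding to the output of the graph.

First, we generate a sample dataset of 2-dimensional points, Gaussian distributed around two different centers.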

import numpy as np
from pyspark.sql.types import Row

n_sample = 1000
center_0 = [-1.5, 1.5]
center_1 = [1.5, -1.5]

def to_row(args):
  xy, l = args
  return Row(inputCol = xy, label = l)

samples_0 = [np.random.randn(2) + center_0 for _ in range(n_sample//2)]
labels_0 = [0 for _ in range(n_sample//2)]
samples_1 = [np.random.randn(2) + center_1 for _ in range(n_sample//2)]
labels_1 = [1 for _ in range(n_sample//2)]

rows = map(to_row, zip(map(lambda x: x.tolist(), samples_0 + samples_1), labels_0 + labels_1))
sdf = spark.createDataFrame(rows)

Next, write a function that returns a TensorFlow graph and its input:

import tensorflow as tf

def build_graph(sess, w0):
  X = tf.placeholder(tf.float32, shape=[None, 2], name="input_tensor")
  model = tf.sigmoid(tf.matmul(X, w0), name="output_tensor")
  return model, X

Following is the code you would write to predict using TensorFlow on a single node.

w0 = np.array([[1], [-1]]).astype(np.float32)
with tf.Session() as sess:
  model, X = build_graph(sess, w0)
  output = sess.run(model, feed_dict = {
    X : samples_0 + samples_1
  })
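
To see what this graph computes without TensorFlow, the sketch below evaluates the same expression, sigmoid(x . w0), in plain Python for the two cluster centers defined earlier. The function name predict is chosen here for illustration only.

```python
import math


def predict(point, weights):
    """Return sigmoid(point . weights) for a single 2-D point."""
    z = sum(x * w for x, w in zip(point, weights))
    return 1.0 / (1.0 + math.exp(-z))


w0 = [1.0, -1.0]          # same weights as above, flattened to a list
center_0 = [-1.5, 1.5]
center_1 = [1.5, -1.5]
print(round(predict(center_0, w0), 4))  # 0.0474
print(round(predict(center_1, w0), 4))  # 0.9526
```

Points near center_0 therefore score close to 0 and points near center_1 close to 1, which is consistent with the labels assigned to the two clusters.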

Now you can use the following Spark MLlib Transformer to apply the model to a DataFrame in a distributed fashion.

from sparkdl import TFTransformer
from sparkdl.graph.input import TFInputGraph
import sparkdl.graph.utils as tfx

graph = tf.Graph()
with tf.Session(graph=graph) as session, graph.as_default():
    _, _ = build_graph(session, w0)
    gin = TFInputGraph.fromGraph(session.graph, session,
                                 ["input_tensor"], ["output_tensor"])

transformer = TFTransformer(
    tfInputGraph=gin,
    inputMapping={'inputCol': tfx.tensor_name("input_tensor")},
    outputMapping={tfx.tensor_name("output_tensor"): 'outputCol'})

odf = transformer.transform(sdf)
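
The input and output mappings above connect DataFrame column names to tensor names. In TensorFlow, a tensor name is conventionally the name of the operation that produces it followed by ":" and an output index, so the placeholder named "input_tensor" yields the tensor "input_tensor:0". The two helpers below are a hypothetical plain-Python sketch of that convention; they are not the sparkdl functions, and whether tfx.tensor_name behaves identically in every case is an assumption.

```python
def to_tensor_name(name, output_index=0):
    """Append ':<output_index>' unless the name already carries an index."""
    return name if ":" in name else "{}:{}".format(name, output_index)


def to_op_name(name):
    """Strip the ':<output_index>' suffix, leaving the operation name."""
    return name.split(":")[0]


print(to_tensor_name("input_tensor"))  # input_tensor:0
print(to_op_name("output_tensor:0"))   # output_tensor
```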

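For Keras users

KerasTransformer applies a TensorFlow-backed Keras model to tensor inputs of up to 2 dimensions. It loads a Keras model from a given model file path and applies the model to a column of arrays (where each array corresponds to a Tensor), outputting a column of arrays.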

from sparkdl import KerasTransformer
from keras.models import Sequential
from keras.layers import Dense
import numpy as np

# Generate random input data
num_features = 10
num_examples = 100
input_data = [{"features" : np.random.randn(num_features).tolist()} for i in range(num_examples)]
input_df = sqlContext.createDataFrame(input_data)

# Create and save a single-hidden-layer Keras model for binary classification
# NOTE: In a typical workflow, we'd train the model before exporting it to disk,
# but we skip that step here for brevity
model = Sequential()
model.add(Dense(units=20, input_shape=[num_features], activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model_path = "/tmp/simple-binary-classification"
model.save(model_path)

# Create transformer and apply it to our input data
transformer = KerasTransformer(inputCol="features", outputCol="predictions", modelFile=model_path)
final_df = transformer.transform(input_df)

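Deploying models as SQL functions

One way to productionize a model is to deploy it as a Spark SQL User Defined Function (UDF), which lets anyone who knows SQL use it. Deep Learning Pipelines provides mechanisms to take a deep learning model and register it as a Spark SQL UDF. In particular, Deep Learning Pipelines 0.2.0 adds support for creating SQL UDFs from Keras models that work on image data.

The resulting UDF takes a column (formatted as an image struct "SpImage") and produces the output of the given Keras model; for example, for Inception V3 it produces a real-valued score vector over the ImageNet object categories.

We can register a UDF for a Keras model that works on images as follows: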

from keras.applications import InceptionV3
from sparkdl.udf.keras_image_model import registerKerasImageUDF

registerKerasImageUDF("inceptionV3_udf", InceptionV3(weights="imagenet"))

Likewise, we can register a UDF from a model file:

registerKerasImageUDF("my_custom_keras_model_udf", "/tmp/model-full-tmp.h5")

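Keras workflows dealing with images commonly include preprocessing steps before the model is applied to the image. If our workflow requires preprocessing, we can optionally provide a preprocessor function to the UDF registration. The preprocessor should take a file path and return an image array; below is a simple example.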

from keras.applications import InceptionV3
from sparkdl.udf.keras_image_model import registerKerasImageUDF

def keras_load_img(fpath):
    from keras.preprocessing.image import load_img, img_to_array
    import numpy as np
    img = load_img(fpath, target_size=(299, 299))
    return img_to_array(img).astype(np.uint8)

registerKerasImageUDF("inceptionV3_udf_with_preprocessing", InceptionV3(weights="imagenet"), keras_load_img)

Once a UDF has been registered, it can be used in a SQL query, for example:

from pyspark.ml.image import ImageSchema

image_df = ImageSchema.readImages(sample_img_dir)
image_df.registerTempTable("sample_images")

SELECT my_custom_keras_model_udf(image) as predictions from sample_images
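
The general pattern of exposing a model as a SQL function can be tried without a Spark cluster. The sketch below is only an analogy: it uses Python's built-in sqlite3 module rather than Spark SQL, and a trivial threshold function stands in for a trained model. The names toy_model, toy_model_udf, and the samples table are invented for this example.

```python
import sqlite3


def toy_model(value):
    """Stand-in for a trained model: label a score as 'high' or 'low'."""
    return "high" if value >= 0.5 else "low"


conn = sqlite3.connect(":memory:")
# Register the Python function under a SQL-visible name (1 argument).
conn.create_function("toy_model_udf", 1, toy_model)
conn.execute("CREATE TABLE samples (score REAL)")
conn.executemany("INSERT INTO samples VALUES (?)", [(0.2,), (0.9,)])

rows = conn.execute(
    "SELECT toy_model_udf(score) AS prediction FROM samples ORDER BY score"
).fetchall()
print(rows)  # [('low',), ('high',)]
conn.close()
```

As with the Keras UDF above, the person writing the query only needs the function name and the column to pass in, not any knowledge of how the model works.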

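License

The Deep Learning Pipelines source code is released under the Apache License 2.0.
Models marked as provided by Keras (used by DeepImageFeaturizer and DeepImagePredictor) are provided under the MIT license located at https://github.com/fchollet/keras/blob/master/LICENSE, subject to any additional terms stated in the code or documentation. See the Keras applications page for more information.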
