Overview
Apache Spark 2.4 introduces a new built-in data source: the image data source. Through the DataFrame API, users can load the image files under a given directory into a DataFrame, do some simple processing on the image data, and then hand it to MLlib for training and classification.
This article walks through the implementation details and usage of the image data source.
Simple Usage
Let's first get a feel for the image data source through an example. Suppose a set of image files is stored on Alibaba Cloud OSS, and we want to watermark each image and store the results, compressed, in Parquet files. Code first:
// To keep the focus on the data source API, image-format handling is simplified
import java.awt.image.BufferedImage

import org.apache.spark.SparkConf
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.{Row, SparkSession}

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setMaster("local[*]")
  val spark = SparkSession.builder()
    .config(conf)
    .getOrCreate()

  val imageDF = spark.read.format("image").load("oss://<bucket>/path/to/src/dir")
  imageDF.select("image.origin", "image.width", "image.height", "image.nChannels", "image.mode", "image.data")
    .map { row =>
      val origin = row.getAs[String]("origin")
      val width = row.getAs[Int]("width")
      val height = row.getAs[Int]("height")
      val mode = row.getAs[Int]("mode")
      val nChannels = row.getAs[Int]("nChannels")
      val data = row.getAs[Array[Byte]]("data")
      // Rebuild a row matching the image schema, with the watermarked bytes
      Row(Row(origin, height, width, nChannels, mode,
        markWithText(width, height, BufferedImage.TYPE_3BYTE_BGR, data, "EMR")))
    }(RowEncoder(imageDF.schema)) // mapping to Row requires an explicit encoder
    .write.format("parquet").save("oss://<bucket>/path/to/dst/dir")
}
def markWithText(width: Int, height: Int, imageType: Int, data: Array[Byte], text: String): Array[Byte] = {
  // Rebuild a BufferedImage from the raw samples stored in the image row
  val image = new BufferedImage(width, height, imageType)
  val pixels = data.map(_ & 0xFF) // bytes are signed; raster samples must be 0..255
  image.getRaster.setPixels(0, 0, width, height, pixels)

  // Draw the source image plus the watermark text onto a fresh image
  val buffImg = new BufferedImage(width, height, imageType)
  val g = buffImg.createGraphics
  g.drawImage(image, 0, 0, null)
  g.setColor(java.awt.Color.WHITE)
  g.drawString(text, 10, height - 10)
  g.dispose()

  // Read the watermarked samples back out as a byte array
  val out = new Array[Int](width * height * buffImg.getRaster.getNumBands)
  buffImg.getRaster.getPixels(0, 0, width, height, out)
  out.map(_.toByte)
}
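The heart of markWithText is the round trip between the raw bytes in the image row's data column and a BufferedImage raster, and that part can be exercised without Spark at all. Below is a minimal sketch using only java.awt; the object name RasterRoundTrip and the helper names are illustrative, not part of any Spark API. Note the `& 0xFF` step: JVM bytes are signed, while raster samples are unsigned 0..255 values.

```scala
import java.awt.image.BufferedImage

object RasterRoundTrip {
  // Pack unsigned byte samples into a TYPE_3BYTE_BGR image
  def bytesToImage(width: Int, height: Int, data: Array[Byte]): BufferedImage = {
    val image = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR)
    image.getRaster.setPixels(0, 0, width, height, data.map(_ & 0xFF))
    image
  }

  // Read the samples back out as a byte array
  def imageToBytes(image: BufferedImage): Array[Byte] = {
    val raster = image.getRaster
    val out = new Array[Int](image.getWidth * image.getHeight * raster.getNumBands)
    raster.getPixels(0, 0, image.getWidth, image.getHeight, out)
    out.map(_.toByte)
  }

  def main(args: Array[String]): Unit = {
    // A 2x2 image, 3 bands per pixel, including samples above 127 (negative bytes)
    val data = Array[Byte](0, 50, 100, -1, -56, 30, 7, 8, 9, 120, 121, 122)
    val back = imageToBytes(bytesToImage(2, 2, data))
    println(back.sameElements(data)) // samples survive the round trip
  }
}
```

If the `& 0xFF` were replaced by the naive `_.toInt`, any sample above 127 would be written as a negative value, which is the kind of subtle corruption that only shows up as wrong colors in the watermarked output.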