spark-BigDL: Deep Learning with LeNet-5

I. LeNet model training and testing

(I) Convert local images on Linux into SequenceFiles and store them on HDFS.

1. The program to run is kingpoint.utils.ImageToSeqFile.

2. First upload the data to the local Linux filesystem. The data directory layout is dlDataImage/<image class>/<image name>.

For example, handwritten digit recognition has ten classes, so the images are stored in ten folders, one folder per class, each holding the images of that class (see the layout sketch below).
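A hypothetical layout (the file names are made up for illustration) would look like this:

dlDataImage/
    0/    (images of class 0, e.g. 0_1.jpg, 0_2.jpg, ...)
    1/    (images of class 1)
    ...
    9/    (images of class 9)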

(1) Image class folders (screenshot)

(2) Images inside a class folder (screenshot)


3. Program:

(1)ImageToSeqFile

package kingpoint.utils

import java.nio.file.{Files, Paths}
import com.intel.analytics.bigdl.dataset.DataSet
import com.intel.analytics.bigdl.dataset.image.{BGRImgToLocalSeqFile, LocalImgReaderWithName}

/**
 * Reads the jpg images stored on the local Linux filesystem and writes them to HDFS in sequence file format.
 * Note: images must be stored as "dir/*.jpg", where dir is the image's class; one folder per class.
 * Created by llq on 2017/6/8.
 */
object ImageToSeqFile {

  /**
   * Converts a batch of images into sequence files
   * @param blockSize maximum number of images per sequence file
   * @param hdfsSavePath HDFS directory where the sequence files are saved
   * @param hdfsSeqFile sequence file name prefix
   * @param dir local directory containing the images
   * @param imageHigh image height
   * @param imageWidth image width
   */
  def toSeqFile(blockSize:Int,hdfsSavePath:String,hdfsSeqFile:String,dir:String,imageHigh:Int,imageWidth:Int): Unit ={
    // Process image data
    val validationFolderPath = Paths.get(dir)
    require(Files.isDirectory(validationFolderPath),
      s"${validationFolderPath} is not valid")

    val validationDataSet = DataSet.ImageFolder.paths(validationFolderPath)

    validationDataSet.shuffle()
    val iter = validationDataSet.data(train = false)
    (0 until 1).map(tid => {
      val workingThread = new Thread(new Runnable {
        override def run(): Unit = {
          val imageIter =LocalImgReaderWithName(imageHigh, imageWidth, 255f)(iter)

          val fileIter = BGRImgToLocalSeqFile(blockSize, Paths.get(hdfsSavePath,
            hdfsSeqFile), true)(imageIter)

          while (fileIter.hasNext) {
            println(s"Generated file ${fileIter.next()}")
          }
        }
      })
      workingThread.setDaemon(false)
      workingThread.start()
      workingThread
    }).foreach(_.join())

  }

  def main(args: Array[String]) {

    /**
     * Argument parsing
     */
    if(args.length<6){
      System.err.println("Error:the parameter is less than 6")
      System.exit(1)
    }

    //local Linux directory containing the images ("/root/data/dlDataImage/")
    val linuxPath=args(0)
    //how many images each sequence file contains (12800)
    val blockSize: Int =args(1).toInt
    //HDFS path where the sequence files are saved ("/user/root/dlData/")
    val hdfsSavePath=args(2)
    //sequence file name prefix ("imagenet-seq")
    val hdfsSeqFile=args(3)
    //image height (28)
    val imageHigh=args(4).toInt
    //image width (28)
    val imageWidth=args(5).toInt

    //convert the images to sequence files and store them on HDFS
    println("Process image data...")
    toSeqFile(blockSize,hdfsSavePath,hdfsSeqFile,linuxPath,imageHigh,imageWidth)
    println("Done")

  }

}

4. Run command:

spark-submit \
--master local[4] \
--driver-class-path /root/data/dlLibs/lib/bigdl-0.1.0-jar-with-dependencies.jar \
--class "kingpoint.utils.ImageToSeqFile" /root/data/SparkBigDL.jar \
/root/data/dlDataImage/train/ \
12800 \
/user/root/dlData/train/ \
imagenet-seq \
28 28

(1) Local Linux directory containing the images: /root/data/dlDataImage/train/

(2) Maximum number of images per sequence file: 12800

(3) HDFS path where the sequence files are saved: /user/root/dlData/train/

(4) Sequence file name prefix: imagenet-seq

(5) Height of each image: 28

(6) Width of each image: 28


5. Each sequence file saved on HDFS contains the information of many images (label, pixel values, image name). When the number of images exceeds the value of parameter (2), a second sequence file is created; the first file is numbered 0, the second 1, and so on. A quick way to sanity-check what was written is sketched below.
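A minimal sketch (run in spark-shell; the path is the HDFS directory used above) that simply counts how many image records ended up in the generated sequence files:

import org.apache.hadoop.io.Text
// each record is one image: the key holds label and file name, the value holds the pixel bytes
val records = sc.sequenceFile("hdfs://hadoop-01.com:8020/user/root/dlData/train/", classOf[Text], classOf[Text])
println(s"number of images stored: ${records.count()}")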



6. In the image data stored on HDFS, each pixel has been expanded into 3 values (RGB), so when the pixels are read back the width is 3 times the original; that is why a 28-pixel-wide image is read with width 28 * 3 = 84 in the training step below.



(II) Read the SequenceFiles from HDFS and train the LeNet-5 model.

1. The program to run is kingpoint.lenet5.LenetTrain.

2. Generate the dataset as described in section (I) and store it on HDFS.

3. Program:

(1)LeNet5

package kingpoint.lenet5

import com.intel.analytics.bigdl._
import com.intel.analytics.bigdl.nn._
import com.intel.analytics.bigdl.numeric.NumericFloat

/**
 * LeNet-5 model definition
 * Created by llq on 2017/6/13.
 */
object LeNet5 {



  /**
   * Builds the model from user-supplied layer parameters (comma-separated strings)
   * @param input reshape dimensions "channels,height,width"
   * @param c1 C1 convolution "nInputPlane,nOutputPlane,kernelWidth,kernelHeight"
   * @param s2 S2 pooling "kernelWidth,kernelHeight,strideWidth,strideHeight"
   * @param c3 C3 convolution "nInputPlane,nOutputPlane,kernelWidth,kernelHeight"
   * @param s4 S4 pooling "kernelWidth,kernelHeight,strideWidth,strideHeight"
   * @param c5 flattened feature size after S4
   * @param f6 F6 linear layer "inputSize,outputSize"
   * @param output output linear layer "inputSize,outputSize"
   * @return the LeNet-5 model
   */
  def apply(input: String,c1: String,s2:String,c3:String,s4:String,c5:String,f6:String,output:String): Module[Float] = {
    val inputImage=input.split(",").map(_.toInt)
    val c1Image=c1.split(",").map(_.toInt)
    val s2Image=s2.split(",").map(_.toInt)
    val c3Image=c3.split(",").map(_.toInt)
    val s4Image=s4.split(",").map(_.toInt)
    val c5Image=c5.toInt
    val f6Image=f6.split(",").map(_.toInt)
    val outputImage=output.split(",").map(_.toInt)

    val model = Sequential()
    model.add(Reshape(Array(inputImage:_*)))
      //C1 layer: 1 input plane, 6 output feature maps, 5x5 convolution kernel
      .add(SpatialConvolution(c1Image(0), c1Image(1), c1Image(2), c1Image(3)).setName("conv1_5x5"))
      //activation function
      .add(Tanh())
      //S2 layer: max pooling, halves the image height and width; (kW, kH, dW, dH) = (kernel width, kernel height, stride in width, stride in height)
      .add(SpatialMaxPooling(s2Image(0), s2Image(1), s2Image(2), s2Image(3)))
      .add(Tanh())
      //C3 layer (12 feature maps)
      .add(SpatialConvolution(c3Image(0), c3Image(1), c3Image(2), c3Image(3)).setName("conv2_5x5"))
      //S4 layer
      .add(SpatialMaxPooling(s4Image(0), s4Image(1), s4Image(2), s4Image(3)))
      //C5 layer: flatten
      .add(Reshape(Array(c5Image)))
      //F6 layer: fully connected
      .add(Linear(f6Image(0), f6Image(1)).setName("fc1"))
      .add(Tanh())
      //OUTPUT layer
      .add(Linear(outputImage(0), outputImage(1)).setName("fc2"))
      .add(LogSoftMax())
  }

  /**
   * Fixed layer parameters for training on the MNIST handwritten digit dataset
   * @param classNum number of output classes
   * @return the LeNet-5 model
   */
  def apply(classNum: Int): Module[Float] = {
    val model = Sequential()
    model.add(Reshape(Array(1, 28, 28*3)))
      //C1 layer: 1 input plane, 6 output feature maps, 5x5 convolution kernel
      .add(SpatialConvolution(1, 6, 5, 5).setName("conv1_5x5"))
      //activation function
      .add(Tanh())
      //S2 layer: max pooling, halves the image height and width; (kW, kH, dW, dH) = (kernel width, kernel height, stride in width, stride in height)
      .add(SpatialMaxPooling(2, 2, 2, 2))
      .add(Tanh())
      //C3 layer (12 feature maps)
      .add(SpatialConvolution(6, 12, 5, 5).setName("conv2_5x5"))
      //S4 layer
      .add(SpatialMaxPooling(2, 2, 2, 2))
      //C5 layer: flatten (12 feature maps of size 4 x 18)
      .add(Reshape(Array(12 * 4 * 18)))
      //F6 layer: fully connected
      .add(Linear(12 * 4 * 18, 100).setName("fc1"))
      .add(Tanh())
      //OUTPUT layer
      .add(Linear(100, classNum).setName("fc2"))
      .add(LogSoftMax())
  }
}
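A minimal usage sketch of the two apply methods above, using the same parameter strings that the spark-submit command in step 4 passes to LenetTrain; the shape arithmetic in the comments shows where the 864 in the C5/F6 parameters comes from:

// Parameterized constructor, with the values used in the training command below
val model = LeNet5("1,28,84", "1,6,5,5", "2,2,2,2", "6,12,5,5", "2,2,2,2", "864", "864,100", "100,10")
// Shape arithmetic for a 1x28x84 grey image (a 28-pixel-wide image expanded x3 for RGB):
//   C1 5x5 conv : 28x84 -> 24x80    S2 2x2 pool : 24x80 -> 12x40
//   C3 5x5 conv : 12x40 -> 8x36     S4 2x2 pool : 8x36  -> 4x18
//   flatten     : 12 feature maps * 4 * 18 = 864, hence c5 = "864" and f6 = "864,100"
// Fixed MNIST variant with 10 classes
val mnistModel = LeNet5(10)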

(2)LenetTrain

package kingpoint.lenet5

import java.io.File

import com.intel.analytics.bigdl._
import com.intel.analytics.bigdl.dataset.DataSet.SeqFileFolder
import com.intel.analytics.bigdl.dataset.image._
import com.intel.analytics.bigdl.dataset.{ByteRecord, DataSet}
import com.intel.analytics.bigdl.nn.ClassNLLCriterion
import com.intel.analytics.bigdl.optim._
import com.intel.analytics.bigdl.utils.{Engine, LoggerFilter, T}
import org.apache.hadoop.io.Text
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode}
import org.apache.spark.sql.hive.HiveContext

import scala.collection.mutable.ArrayBuffer


/**
 * Holds the information of one image: label + data + fileName
 * @param label image label
 * @param data raw image bytes
 * @param imageName image file name
 */
case class LabeledDataFileName(label:Float,data:Array[Byte],imageName:String)

/**
 * Holds the saved model path and its validation accuracy
 * @param modelName path of the saved model
 * @param accuary validation accuracy
 */
case class modelNameAccuary(modelName:String,accuary:String)

/**
 * Reads the image sequence files from HDFS and trains the LeNet-5 model
 * Created by llq on 2017/6/6.
 */
object LenetTrain {
  LoggerFilter.redirectSparkInfoLogs()
  Logger.getLogger("com.intel.analytics.bigdl.optim").setLevel(Level.INFO)

  val testMean = 0.13251460696903547
  val testStd = 0.31048024

  /**
   * Reads the sequence file records into RDD[LabeledDataFileName]
   * @param url HDFS path of the sequence files
   * @param sc SparkContext
   * @return RDD of label + image bytes + image name
   */
  def imagesLoadSeq(url: String, sc: SparkContext): RDD[LabeledDataFileName] = {
    sc.sequenceFile(url, classOf[Text], classOf[Text]).map(image => {
      LabeledDataFileName(SeqFileFolder.readLabel(image._1).toInt,
        image._2.copyBytes(),
        SeqFileFolder.readName(image._1))
    })
  }

  /**
   * Converts the raw image records into RDD[ByteRecord]
   * @param imagesByteRdd RDD of label + image bytes + image name
   * @return RDD of ByteRecord (pixel bytes + label)
   */
  def inLoad(imagesByteRdd:RDD[LabeledDataFileName]): RDD[ByteRecord]={

    imagesByteRdd.mapPartitions(iter=>
      iter.map{labeledDataFileName=>
        var img=new ArrayBuffer[Byte]()
        img ++= labeledDataFileName.data
        //drop the 8-byte header that precedes the pixel data
        img.remove(0,8)
        ByteRecord(img.toArray,labeledDataFileName.label)
      })
  }

  /**
   * Scans the model checkpoint directory and returns the file name of the latest saved model
   * @param file checkpoint directory
   * @return name of the latest checkpoint file, e.g. "model.121"
   */
  def lsLinuxCheckPointPath(file:File): String ={
    val modelPattern="model".r
    val numberPattern="[0-9]+".r
    var epoch=0
    if(file.isDirectory){
      val fileArray=file.listFiles()
      for(i<- 0 to fileArray.length-1){
        //keep only files whose name contains "model"
        if(modelPattern.findFirstIn(fileArray(i).getName).mkString(",")!=""){
          //track the largest iteration number seen so far
          val epochNumber=numberPattern.findFirstIn(fileArray(i).getName).mkString(",").toInt
          if(epochNumber>epoch){
            epoch=epochNumber
          }
        }
      }
    }else{
      throw new Exception("the checkpoint path is not a directory")
    }
    "model."+epoch
  }
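  // Worked example (hypothetical checkpoint directory containing only model.* files):
  //   model.1, model.61, model.121  ->  lsLinuxCheckPointPath returns "model.121"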

  /**
   * Main method: reads the sequence files and trains the LeNet-5 model
   * @param args command-line arguments, see the run command in step 4
   */
  def main (args: Array[String]){
    val conf = Engine.createSparkConf()
      .setAppName("kingpoint.lenet5.LenetTrain")
    val sc = new SparkContext(conf)
    val hiveContext=new HiveContext(sc)
    Engine.init

    /**
     * Argument parsing
     */
    if(args.length<18){
      System.err.println("Error:the parameter is less than 18")
      System.exit(1)
    }
    //HDFS path of the image sequence files (hdfs://hadoop-01.com:8020/user/root/dlData/train/)
    val hdfsPath=args(0)
    //train/validation split ratio (7,3)
    val trainValidationRatio=args(1)
    //image height (28)
    val imageHigh=args(2).toInt
    //image width (28*3)
    val imageWidth=args(3).toInt

    //LeNet model parameters
    val input=args(4)         //input layer (channels, image height, image width) (1,28,84)
    val c1=args(5)            //C1 layer: (1 input plane, 6 output feature maps, 5x5 kernel) (1,6,5,5)
    val s2=args(6)            //S2 layer: pooling (kernel width, kernel height, stride in width, stride in height) (2,2,2,2)
    val c3=args(7)            //C3 layer: (6 input planes, 12 output feature maps, 5x5 kernel) (6,12,5,5)
    val s4=args(8)            //S4 layer: pooling (kernel width, kernel height, stride in width, stride in height) (2,2,2,2)
    val c5=args(9)            //C5 layer: flattened size (12 * 4 * 18) (864)
    val f6=args(10)           //F6 layer (12 * 4 * 18, 100) (864,100)
    val output=args(11)       //OUTPUT layer (100 input neurons, 10 output neurons: the number of classes) (100,10)
    val learningRate=args(12).toDouble      //learning rate (0.01)
    val learningRateDecay=args(13).toDouble //learning rate decay (0.0)
    val maxEpoch=args(14).toInt             //stop after this many epochs (1)
    val batchSize=args(15).toInt            //batch size (4)
    val modelSave=args(16)                  //model checkpoint path (/root/data/model)
    val outputTableName=args(17)            //Hive table where the training result is saved (dl.lenet_train)

    /**
     * Read and transform the data
     */
    //read image label + data + filename => RDD[LabeledDataFileName]
    val imagesByteRdd=imagesLoadSeq(hdfsPath,sc).coalesce(32, true)

    //split into training and validation sets
    val trainRatio=trainValidationRatio.split(",")(0).toInt
    val validataionRatio=trainValidationRatio.split(",")(1).toInt
    val imagesByteSplitRdd=imagesByteRdd.randomSplit(Array(trainRatio,validataionRatio))
    val trainSplitRdd=imagesByteSplitRdd(0)
    val validationSplitRdd=imagesByteSplitRdd(1)

    //convert to grey images -> normalize -> group into mini-batches (the weights are updated once per batch)
    val trainSet = DataSet.rdd(inLoad(trainSplitRdd)) ->
      BytesToGreyImg(imageHigh, imageWidth) -> GreyImgNormalizer(testMean, testStd) -> GreyImgToBatch(batchSize)
    val validationSet = DataSet.rdd(inLoad(validationSplitRdd)) ->
      BytesToGreyImg(imageHigh, imageWidth) -> GreyImgNormalizer(testMean, testStd) -> GreyImgToBatch(batchSize)

    /**
     * Model setup and training
     */
    //build the LeNet-5 model with the given layer parameters
    val model = LeNet5(input,c1,s2,c3,s4,c5,f6,output)

    //learning rate settings (used by the gradient descent optimizer)
    val state =
      T(
        "learningRate" -> learningRate,
        "learningRateDecay" -> learningRateDecay
      )

    //optimizer: model, training set, and the loss criterion used to update the weights
    val optimizer = Optimizer(model = model, dataset = trainSet,criterion = new ClassNLLCriterion[Float]())

    optimizer.setCheckpoint(modelSave, Trigger.everyEpoch)

    //train the model: set the validation set, learning rate state, and stopping condition, then start optimizing
    optimizer
      .setValidation(
        trigger = Trigger.everyEpoch,
        dataset = validationSet,
        vMethods = Array(new Top1Accuracy, new Top5Accuracy[Float], new Loss[Float]))
      .setState(state)
      .setEndWhen(Trigger.maxEpoch(maxEpoch))    //stop after maxEpoch epochs
      .optimize()

    //find the name of the last saved checkpoint and build its full path
    val modelEpochFile=optimizer.getCheckpointPath().get+"/"+lsLinuxCheckPointPath(new File(optimizer.getCheckpointPath().get))

    //compute the validation accuracy
    val validator = Validator(model, validationSet)
    val result = validator.test(Array(new Top1Accuracy[Float]))

    /**
     * Save the model path and accuracy
     */
    val modelNameAccuaryRdd=sc.parallelize(List(modelNameAccuary(modelEpochFile,result(0)._1.toString)))
    val modelNameAccuaryDf=hiveContext.createDataFrame(modelNameAccuaryRdd)

    //save to Hive
    modelNameAccuaryDf.show()
    modelNameAccuaryDf.write.mode(SaveMode.Overwrite).saveAsTable(outputTableName)
  }
}

4. Run command:

spark-submit \
--master local[4] \
--driver-memory 2g \
--executor-memory 2g \
--driver-class-path /root/data/dlLibs/lib/bigdl-0.1.0-jar-with-dependencies.jar \
--class "kingpoint.lenet5.LenetTrain" /root/data/SparkBigDL.jar \
hdfs://hadoop-01.com:8020/user/root/dlData/train/ \
7,3 \
28 84 \
1,28,84 \
1,6,5,5 \
2,2,2,2 \
6,12,5,5 \
2,2,2,2 \
864 \
864,100 \
100,10 \
0.01 \
0.0 \
1 \
4 \
/root/data/model \
dl.lenet_train

(1) HDFS path of the image sequence files: hdfs://hadoop-01.com:8020/user/root/dlData/train/

(2) Train/validation split ratio, in the form: 7,3

(3) Image height: 28

(4) Image width: 84

(5) Input layer (channels, image height, image width): 1,28,84

(6) C1 layer (1 input plane, 6 output feature maps, 5x5 kernel): 1,6,5,5

(7) S2 layer, pooling (kernel width, kernel height, stride in width, stride in height): 2,2,2,2

(8) C3 layer (6 input planes, 12 output feature maps, 5x5 kernel): 6,12,5,5

(9) S4 layer, pooling (kernel width, kernel height, stride in width, stride in height): 2,2,2,2

(10) C5 layer (12 * 4 * 18): 864

(11) F6 layer (12 * 4 * 18, 100): 864,100

(12) OUTPUT layer (100 input neurons, 10 output neurons, i.e. the number of classes): 100,10

(13) Learning rate: 0.01

(14) learningRateDecay: 0.0

(15) Maximum number of epochs before stopping: 1

(16) Batch size: 4

(17) Model checkpoint path: /root/data/model

(18) Hive table where the training result is saved: dl.lenet_train


5. Output

The result is saved in Hive; the output columns are the saved model path (modelName) and the validation accuracy (accuary).

As shown in the figure below. (Note: when you later test the model, look up the modelName value and pass it as the model path argument to the test program; a query sketch follows.)
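A minimal sketch for reading that value back from the Hive table written above (note the column really is spelled accuary, matching the case class field):

// Read the saved model path and accuracy from Hive (run in spark-shell with Hive support)
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT modelName, accuary FROM dl.lenet_train").show(false)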


(III) Test the trained LeNet-5 model with the test set.

1. The program to run is kingpoint.lenet5.LenetTest.

2. Train the model as described in section (II) and save it on the local Linux filesystem.

3. Program:

(1)ToByteRecords

package kingpoint.image

/**
 * Converts a Row into a ByteRecord
 * Created by llq on 2017/6/13.
 */
import com.intel.analytics.bigdl.dataset.{ByteRecord, Transformer}
import org.apache.log4j.Logger
import org.apache.spark.sql.Row

import scala.collection.Iterator

object ToByteRecords {
  val logger = Logger.getLogger(getClass)

  def apply(colName: String = "data", label:String= "label"): ToByteRecords = {
    new ToByteRecords(colName,label)
  }
}

/**
 * transform [[Row]] to [[ByteRecord]]
 * @param colName column name
 * @param label label name
 */
class ToByteRecords(colName: String,label:String)
  extends Transformer[Row, ByteRecord] {

  override def apply(prev: Iterator[Row]): Iterator[ByteRecord] = {
    prev.map(
      img => {
        val pixelLength=img.getAs[Array[Byte]](colName).length-8
        val byteData=new Array[Byte](pixelLength)
        for(j<-0 to pixelLength-1){
          byteData(j)=img.getAs[Array[Byte]](colName)(j+8)
        }
        ByteRecord(byteData, img.getAs[Float](label))
      }
    )
  }
}

(2) GreyImgToImageVector

package kingpoint.image

/**
 * grey img to (label,denseVector)
 * Created by llq on 2017/6/13.
 */
import com.intel.analytics.bigdl.dataset.Transformer
import com.intel.analytics.bigdl.dataset.image.LabeledGreyImage
import org.apache.log4j.Logger
import org.apache.spark.mllib.linalg.DenseVector

import scala.collection.Iterator

object GreyImgToImageVector {
  val logger = Logger.getLogger(getClass)

  def apply(): GreyImgToImageVector = {
    new GreyImgToImageVector()
  }
}

/**
 * Convert a Grey image to (label,denseVector) of spark mllib
 */
class GreyImgToImageVector()
  extends Transformer[LabeledGreyImage, (Float,DenseVector)] {

  private var featureData: Array[Float] = null

  override def apply(prev: Iterator[LabeledGreyImage]): Iterator[(Float,DenseVector)] = {
    prev.map(
      img => {
        if (null == featureData) {
          featureData = new Array[Float](img.height() * img.width())
        }
        featureData=img.content
        (img.label(),new DenseVector(featureData.map(_.toDouble)))
      }
    )
  }
}

(3) LenetTest

package kingpoint.lenet5

import com.intel.analytics.bigdl.dataset.DataSet.SeqFileFolder
import com.intel.analytics.bigdl.dataset.Transformer
import com.intel.analytics.bigdl.dataset.image.{BytesToGreyImg, GreyImgNormalizer}
import com.intel.analytics.bigdl.nn.Module
import com.intel.analytics.bigdl.utils.{Engine, LoggerFilter}
import kingpoint.image.{GreyImgToImageVector, ToByteRecords}
import org.apache.hadoop.io.Text
import org.apache.log4j.{Level, Logger}
import org.apache.spark.SparkContext
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.{DLClassifier => SparkDLClassifier}
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{SaveMode, DataFrame, Row}


/**
 * Holds the preprocessed image information used in the ML pipeline: label + features + imageName
 * @param label image label
 * @param features image pixels as a DenseVector
 * @param imageName image file name
 */
case class LabeledDataFloatImageName(label:Float,features:DenseVector,imageName:String)

/**
 * Holds the model evaluation result: count + accuracy
 * @param count number of test samples
 * @param accuracy fraction of correct predictions
 */
case class countAccuary(count:Double,accuracy:Double)
/**
 * LeNet model testing
 * Created by llq on 2017/6/13.
 */
object LenetTest {
  LoggerFilter.redirectSparkInfoLogs()
  Logger.getLogger("com.intel.analytics.bigdl.optim").setLevel(Level.INFO)

  val testMean = 0.13251460696903547
  val testStd = 0.31048024

  /**
   * Reads the sequence file records into RDD[LabeledDataFileName]
   * @param url HDFS path of the sequence files
   * @param sc SparkContext
   * @return RDD of label + image bytes + image name
   */
  def imagesLoadSeq(url: String, sc: SparkContext): RDD[LabeledDataFileName] = {
    sc.sequenceFile(url, classOf[Text], classOf[Text]).map(image => {
      LabeledDataFileName(SeqFileFolder.readLabel(image._1).toInt,
        image._2.copyBytes(),
        SeqFileFolder.readName(image._1))
    })
  }

  /**
   * DataFrame transformation for the ML pipeline:
   * combines label + transformed data + imageName
   * @param data input DataFrame of raw image records
   * @param f transformer that turns each Row into (label, DenseVector)
   * @return DataFrame of LabeledDataFloatImageName
   */
  def transformDF(data: DataFrame, f: Transformer[Row, (Float,DenseVector)]): DataFrame = {
    //apply the transformer pipeline, producing an RDD of (label, DenseVector)
    val vectorRdd = data.rdd.mapPartitions(f(_))
    //combine: transformed data + image name + label
    val dataRDD = data.rdd.zipPartitions(vectorRdd) { (a, b) =>
      b.zip(a.map(_.getAs[String]("imageName")))
        .map(
          v => LabeledDataFloatImageName(v._1._1, v._1._2,v._2)
        )
    }
    data.sqlContext.createDataFrame(dataRDD)
  }

  /**
   * Computes the prediction accuracy
   * @param testResult DataFrame with "label" and "predict" columns
   * @return number of test samples and the accuracy
   */
  def evaluationAccuracy(testResult:DataFrame): countAccuary ={
    //difference label - predict (0 means a correct prediction)
    val labelSubPredictArray=testResult.select("label","predict").rdd.map{row=>
      val label=row.getAs[Float]("label")
      val predict=row.getAs[Int]("predict")
      label-predict
    }.collect()

    //count the correct predictions
    var correct:Double=0.0
    for(i<-0 to labelSubPredictArray.length-1){
      if(labelSubPredictArray(i)==0){
        correct += 1
      }
    }
    val accuary=correct/labelSubPredictArray.length
    countAccuary(labelSubPredictArray.length,accuary)
  }
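  // Worked example (hypothetical numbers): 100 test rows, of which 92 have label == predict
  //   -> evaluationAccuracy returns countAccuary(100.0, 0.92)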

  def main(args: Array[String]) {
    val conf = Engine.createSparkConf()
      .setAppName("kingpoint.lenet5.LenetTest")
    val sc = new SparkContext(conf)
    Engine.init
    val hiveContext = new HiveContext(sc)

    /**
     * Argument parsing
     */
    if(args.length<7){
      System.err.println("Error:the parameter is less than 7")
      System.exit(1)
    }
    //HDFS path of the test image sequence files (hdfs://hadoop-01.com:8020/user/root/dlData/test/)
    val hdfsPath=args(0)
    //model path (/root/data/model/20170615_101109/model.121)
    val modelPath=args(1)
    //batchSize (16)
    val batchSize=args(2).toInt
    //image height (28)
    val imageHigh=args(3).toInt
    //image width (28*3)
    val imageWidth=args(4).toInt
    //Hive table where the test predictions are saved (dl.lenet_test)
    val outputTableName=args(5)
    //Hive table where the evaluation result is saved (dl.lenet_test_evaluation)
    val outputTableNameEvaluation=args(6)

    //read image label + data + filename => RDD[LabeledDataFileName]
    val imagesByteRdd=imagesLoadSeq(hdfsPath,sc).coalesce(32, true)

    /**
     * Model loading and testing
     */
    //load the trained model
    val model =  Module.load[Float](modelPath)

    val valTrans = new SparkDLClassifier[Float]()
      .setInputCol("features")
      .setOutputCol("predict")

    val paramsTrans = ParamMap(
      valTrans.modelTrain -> model,
      valTrans.batchShape ->
        Array(batchSize, 3, imageHigh, imageWidth/3))

    //dataset preprocessing pipeline
    val transf = ToByteRecords() ->
      BytesToGreyImg(imageHigh, imageWidth) ->
      GreyImgNormalizer(testMean, testStd) ->
      GreyImgToImageVector()

    //build the prediction DataFrame
    val valDF = transformDF(hiveContext.createDataFrame(imagesByteRdd), transf)
    val testResult=valTrans.transform(valDF, paramsTrans).select("label","imageName","predict")
    testResult.show()

    //compute the accuracy and build a DataFrame from it
    val countAccuracyDf=hiveContext.createDataFrame(sc.parallelize(Seq(evaluationAccuracy(testResult))))
    countAccuracyDf.show()

    /**
     * Save the results
     */
    //save to Hive
    testResult.write.mode(SaveMode.Overwrite).saveAsTable(outputTableName)
    countAccuracyDf.write.mode(SaveMode.Overwrite).saveAsTable(outputTableNameEvaluation)
  }


}

4. Run command:

spark-submit \
--master local[4] \
--driver-memory 2g \
--executor-memory 2g \
--driver-class-path /root/data/dlLibs/lib/bigdl-0.1.0-jar-with-dependencies.jar \
--class "kingpoint.lenet5.LenetTest" /root/data/SparkBigDL.jar \
hdfs://hadoop-01.com:8020/user/root/dlData/test/ \
/root/data/model/20170615_101109/model.121 \
16 \
28 84 \
dl.lenet_test \
dl.lenet_test_evaluation

(1) HDFS path of the test image sequence files: hdfs://hadoop-01.com:8020/user/root/dlData/test/

(2) Model path: /root/data/model/20170615_101109/model.121

(3) batchSize: 16

(4) Image height: 28

(5) Image width: 84

(6) Hive table where the test predictions are saved: dl.lenet_test

(7) Hive table where the evaluation result is saved: dl.lenet_test_evaluation

 

5. Saved results

(1) dl.lenet_test (screenshot)

(2) dl.lenet_test_evaluation (screenshot)




