Converting Image Formats with Spark Streaming


Background

Qimage is my own custom data type. It is only used as the value, so it does not implement WritableComparable.

QimageInputFormat<Text, Qimage> is the custom input format.

QimageRecordReader<Text, Qimage> is the custom RecordReader.

QimageOutputFormat<Text, Qimage> is the custom output format.

QimageRecordWriter<Text, Qimage> is the custom RecordWriter.

These classes already run correctly as a plain Hadoop job. I am now moving the pipeline onto Spark for stream processing: the job reads PNG images as input and writes BMP images as output. While writing the code I ran into two questions:

1) How do I use a custom data type in Spark?

2) How do I save a custom data type to HDFS?

The streaming input here is Spark Streaming's HDFS directory monitoring; a sketch of the whole pipeline under these assumptions follows below.
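A minimal sketch of the pipeline, under these assumptions: the custom classes live in a package I am calling mypackage (hypothetical name), QimageInputFormat/QimageOutputFormat extend the new Hadoop API (org.apache.hadoop.mapreduce) FileInputFormat/FileOutputFormat, the PNG-to-BMP conversion happens inside QimageRecordWriter, and QimageRegistrator is the Kryo registrator shown in the serialization notes further down. Paths and the batch interval are placeholders.

import org.apache.hadoop.io.Text
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

import mypackage.{Qimage, QimageInputFormat, QimageOutputFormat}  // hypothetical package name

object QimageStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QimageStreaming")
    // Use Kryo for the custom (non-Writable-comparable) value type; the registrator
    // is defined in the serialization notes below.
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrator", "mypackage.QimageRegistrator")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Question (1): fileStream takes the key type, value type and new-API InputFormat
    // as type parameters, so each record of the DStream is a (Text, Qimage) pair.
    val images = ssc.fileStream[Text, Qimage, QimageInputFormat[Text, Qimage]](
      "hdfs:///image/input")

    // Question (2): write every batch back to HDFS with the custom OutputFormat; a
    // per-batch output directory avoids "output path already exists" failures.
    images.foreachRDD { (rdd, time) =>
      rdd.saveAsNewAPIHadoopFile(
        "hdfs:///image/output-" + time.milliseconds,
        classOf[Text],
        classOf[Qimage],
        classOf[QimageOutputFormat[Text, Qimage]])
    }

    ssc.start()
    ssc.awaitTermination()
  }
}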

Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

· Java serialization: By default, Spark serializes objects using Java's ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.

· Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance.

You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application.

Finally, to register your classes with Kryo, create a public class that extends org.apache.spark.serializer.KryoRegistrator and set the spark.kryo.registrator config property to point to it, as follows:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass1])
    kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)

The Kryo documentation describes more advanced registration options, such as adding custom serialization code.

If your objects are large, you may also need to increase the spark.kryoserializer.buffer.mb config property. The default is 2, but this value needs to be large enough to hold the largest object you will serialize.
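Image payloads carried in Qimage can easily exceed that 2 MB default, so this job would likely need a larger buffer. A minimal sketch, using the conf object from the example above (64 is an assumed value, not a measured recommendation):

// Assumed value: make the Kryo buffer large enough for the biggest single image you expect.
conf.set("spark.kryoserializer.buffer.mb", "64")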

Finally, if you don't register your classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful.
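Applied to this article's custom type, the registrator might look like the sketch below, assuming Qimage has a no-argument constructor and fields Kryo can handle; the class name matches the spark.kryo.registrator setting used in the streaming sketch near the top.

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

import mypackage.Qimage  // hypothetical package for the custom image type

class QimageRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Registering Qimage lets Kryo write a small class id instead of the full class name.
    kryo.register(classOf[Qimage])
  }
}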

Input and output functions relevant to custom data types in Spark

Taken from the official Spark source code.

textFileStream

  /**
   * Create an input stream that monitors a Hadoop-compatible filesystem
   * for new files and reads them as text files (using key as LongWritable, value
   * as Text and input format as TextInputFormat). Files must be written to the
   * monitored directory by "moving" them from another location within the same
   * file system. File names starting with . are ignored.
   * @param directory HDFS directory to monitor for new file
   */
  def textFileStream(directory: String): DStream[String] = {
    fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
  }
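A quick usage note, assuming ssc is the StreamingContext from the sketch near the top: textFileStream yields lines of text, so it cannot carry binary image data and is shown here only for contrast with fileStream below.

// Each element is one line of text from a newly appeared file; the path is a placeholder.
val lines = ssc.textFileStream("hdfs:///text/input")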

fileStream

  /**
   * Create an input stream that monitors a Hadoop-compatible filesystem
   * for new files and reads them using the given key-value types and input format.
   * Files must be written to the monitored directory by "moving" them from another
   * location within the same file system. File names starting with . are ignored.
   * @param directory HDFS directory to monitor for new file
   * @tparam K Key type for reading HDFS file
   * @tparam V Value type for reading HDFS file
   * @tparam F Input format for reading HDFS file
   */
  def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ](directory: String): DStream[(K, V)] = {
    new FileInputDStream[K, V, F](this, directory)
  }
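The "moving" requirement in the Scaladoc matters in practice: a file is only picked up if it appears atomically in the monitored directory. One way to achieve that (a sketch; paths are placeholders) is to upload each PNG to a staging directory on the same HDFS filesystem and then rename it into the monitored directory:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())  // picks up the cluster's HDFS settings from the classpath
// rename is an atomic "move" within one filesystem, so the streaming job only ever sees complete files
fs.rename(new Path("hdfs:///image/staging/a.png"), new Path("hdfs:///image/input/a.png"))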

newAPIHadoopFile

  /**
   * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
   * and extra configuration options to pass to the input format.
   *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
   * a `map` function.
   */
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
    val job = new NewHadoopJob(conf)
    NewFileInputFormat.addInputPath(job, new Path(path))
    val updatedConf = job.getConfiguration
    new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf)
  }
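For a one-off (non-streaming) run, the same custom input format can be read through newAPIHadoopFile, assuming sc is an existing SparkContext. The map below illustrates the Scaladoc note about re-used Writable objects; copyOf is a hypothetical deep-copy helper for Qimage.

import org.apache.hadoop.io.Text
import mypackage.{Qimage, QimageInputFormat}  // hypothetical package name

val images = sc.newAPIHadoopFile(
  "hdfs:///image/input",
  classOf[QimageInputFormat[Text, Qimage]],
  classOf[Text],
  classOf[Qimage])

// Per the note above, copy records before caching because the RecordReader re-uses objects.
val cached = images.map { case (name, img) => (new Text(name), copyOf(img)) }.cache()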

 

  /**
   * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
   * and extra configuration options to pass to the input format.
   *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
   * a `map` function.
   */
  def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
      conf: Configuration = hadoopConfiguration,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V]): RDD[(K, V)] = {
    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
  }
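newAPIHadoopRDD expects the input paths (and any other InputFormat options) to already be set in the Configuration it receives. A sketch under the same assumptions as above, again assuming sc is an existing SparkContext, including the batch-mode answer to question (2):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.SparkContext._
import mypackage.{Qimage, QimageInputFormat, QimageOutputFormat}  // hypothetical package name

// Put the input path into the job configuration, just as newAPIHadoopFile does internally.
val job = new Job(sc.hadoopConfiguration)  // Job.getInstance(...) on newer Hadoop versions
FileInputFormat.addInputPath(job, new Path("hdfs:///image/input"))

val images = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[QimageInputFormat[Text, Qimage]],
  classOf[Text],
  classOf[Qimage])

// Question (2) in batch mode: save the (Text, Qimage) pairs with the custom OutputFormat.
images.saveAsNewAPIHadoopFile(
  "hdfs:///image/output",
  classOf[Text],
  classOf[Qimage],
  classOf[QimageOutputFormat[Text, Qimage]])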
