Converting Image Formats with Spark Streaming


Background

Qimage is my own custom data type. It is only used as the value, so it does not implement WritableComparable.

QimageInputFormat<Text, Qimage> is the custom input format.

QimageRecordReader<Text, Qimage> is the custom RecordReader.

QimageOutputFormat<Text, Qimage> is the custom output format.

QimageRecordWriter<Text, Qimage> is the custom RecordWriter.

These classes already run correctly as a plain Hadoop job. I am now moving the pipeline onto Spark for stream processing: the job reads PNG images as input and writes BMP images as output. While writing the code I ran into two questions:

1) How do I use a custom data type in Spark?

2) How do I save a custom data type to HDFS?

The streaming input here is Spark Streaming's HDFS directory monitoring; a sketch of the whole pipeline under these assumptions follows below.
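A minimal sketch of the pipeline, under these assumptions: the custom classes live in a package I am calling mypackage (hypothetical name), QimageInputFormat/QimageOutputFormat extend the new Hadoop API (org.apache.hadoop.mapreduce) FileInputFormat/FileOutputFormat, the PNG-to-BMP conversion happens inside QimageRecordWriter, and QimageRegistrator is the Kryo registrator shown in the serialization notes further down. Paths and the batch interval are placeholders.

import org.apache.hadoop.io.Text
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

import mypackage.{Qimage, QimageInputFormat, QimageOutputFormat}  // hypothetical package name

object QimageStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QimageStreaming")
    // Use Kryo for the custom (non-Writable-comparable) value type; the registrator
    // is defined in the serialization notes below.
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrator", "mypackage.QimageRegistrator")

    val ssc = new StreamingContext(conf, Seconds(10))

    // Question (1): fileStream takes the key type, value type and new-API InputFormat
    // as type parameters, so each record of the DStream is a (Text, Qimage) pair.
    val images = ssc.fileStream[Text, Qimage, QimageInputFormat[Text, Qimage]](
      "hdfs:///image/input")

    // Question (2): write every batch back to HDFS with the custom OutputFormat; a
    // per-batch output directory avoids "output path already exists" failures.
    images.foreachRDD { (rdd, time) =>
      rdd.saveAsNewAPIHadoopFile(
        "hdfs:///image/output-" + time.milliseconds,
        classOf[Text],
        classOf[Qimage],
        classOf[QimageOutputFormat[Text, Qimage]])
    }

    ssc.start()
    ssc.awaitTermination()
  }
}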

Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

· Java serialization: By default, Spark serializes objects using Java's ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.

· Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you'll use in the program in advance for best performance.

You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application.

Finally, to register your classes with Kryo, create a public class that extends org.apache.spark.serializer.KryoRegistrator and set the spark.kryo.registrator config property to point to it, as follows:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass1])
    kryo.register(classOf[MyClass2])
  }
}

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
val sc = new SparkContext(conf)

The Kryo documentation describes more advanced registration options, such as adding custom serialization code.

If your objects are large, you may also need to increase the spark.kryoserializer.buffer.mb config property. The default is 2, but this value needs to be large enough to hold the largest object you will serialize.
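Image payloads carried in Qimage can easily exceed that 2 MB default, so this job would likely need a larger buffer. A minimal sketch, using the conf object from the example above (64 is an assumed value, not a measured recommendation):

// Assumed value: make the Kryo buffer large enough for the biggest single image you expect.
conf.set("spark.kryoserializer.buffer.mb", "64")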

Finally, if you don't register your classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful.
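Applied to this article's custom type, the registrator might look like the sketch below, assuming Qimage has a no-argument constructor and fields Kryo can handle; the class name matches the spark.kryo.registrator setting used in the streaming sketch near the top.

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

import mypackage.Qimage  // hypothetical package for the custom image type

class QimageRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // Registering Qimage lets Kryo write a small class id instead of the full class name.
    kryo.register(classOf[Qimage])
  }
}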

Input and output functions relevant to custom data types in Spark

Taken from the official Spark source code.

textFileStream

  /**
   * Create an input stream that monitors a Hadoop-compatible filesystem
   * for new files and reads them as text files (using key as LongWritable, value
   * as Text and input format as TextInputFormat). Files must be written to the
   * monitored directory by "moving" them from another location within the same
   * file system. File names starting with . are ignored.
   * @param directory HDFS directory to monitor for new file
   */
  def textFileStream(directory: String): DStream[String] = {
    fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
  }
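A quick usage note, assuming ssc is the StreamingContext from the sketch near the top: textFileStream yields lines of text, so it cannot carry binary image data and is shown here only for contrast with fileStream below.

// Each element is one line of text from a newly appeared file; the path is a placeholder.
val lines = ssc.textFileStream("hdfs:///text/input")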

fileStream

  /**
   * Create an input stream that monitors a Hadoop-compatible filesystem
   * for new files and reads them using the given key-value types and input format.
   * Files must be written to the monitored directory by "moving" them from another
   * location within the same file system. File names starting with . are ignored.
   * @param directory HDFS directory to monitor for new file
   * @tparam K Key type for reading HDFS file
   * @tparam V Value type for reading HDFS file
   * @tparam F Input format for reading HDFS file
   */
  def fileStream[
    K: ClassTag,
    V: ClassTag,
    F <: NewInputFormat[K, V]: ClassTag
  ](directory: String): DStream[(K, V)] = {
    new FileInputDStream[K, V, F](this, directory)
  }
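The "moving" requirement in the Scaladoc matters in practice: a file is only picked up if it appears atomically in the monitored directory. One way to achieve that (a sketch; paths are placeholders) is to upload each PNG to a staging directory on the same HDFS filesystem and then rename it into the monitored directory:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())  // picks up the cluster's HDFS settings from the classpath
// rename is an atomic "move" within one filesystem, so the streaming job only ever sees complete files
fs.rename(new Path("hdfs:///image/staging/a.png"), new Path("hdfs:///image/input/a.png"))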

newAPIHadoopFile

  /**
   * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
   * and extra configuration options to pass to the input format.
   *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
   * a `map` function.
   */
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
    val job = new NewHadoopJob(conf)
    NewFileInputFormat.addInputPath(job, new Path(path))
    val updatedConf = job.getConfiguration
    new NewHadoopRDD(this, fClass, kClass, vClass, updatedConf)
  }
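For a one-off (non-streaming) run, the same custom input format can be read through newAPIHadoopFile, assuming sc is an existing SparkContext. The map below illustrates the Scaladoc note about re-used Writable objects; copyOf is a hypothetical deep-copy helper for Qimage.

import org.apache.hadoop.io.Text
import mypackage.{Qimage, QimageInputFormat}  // hypothetical package name

val images = sc.newAPIHadoopFile(
  "hdfs:///image/input",
  classOf[QimageInputFormat[Text, Qimage]],
  classOf[Text],
  classOf[Qimage])

// Per the note above, copy records before caching because the RecordReader re-uses objects.
val cached = images.map { case (name, img) => (new Text(name), copyOf(img)) }.cache()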

 

  /**
   * Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
   * and extra configuration options to pass to the input format.
   *
   * '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each
   * record, directly caching the returned RDD will create many references to the same object.
   * If you plan to directly cache Hadoop writable objects, you should first copy them using
   * a `map` function.
   */
  def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
      conf: Configuration = hadoopConfiguration,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V]): RDD[(K, V)] = {
    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
  }
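newAPIHadoopRDD expects the input paths (and any other InputFormat options) to already be set in the Configuration it receives. A sketch under the same assumptions as above, again assuming sc is an existing SparkContext, including the batch-mode answer to question (2):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.SparkContext._
import mypackage.{Qimage, QimageInputFormat, QimageOutputFormat}  // hypothetical package name

// Put the input path into the job configuration, just as newAPIHadoopFile does internally.
val job = new Job(sc.hadoopConfiguration)  // Job.getInstance(...) on newer Hadoop versions
FileInputFormat.addInputPath(job, new Path("hdfs:///image/input"))

val images = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[QimageInputFormat[Text, Qimage]],
  classOf[Text],
  classOf[Qimage])

// Question (2) in batch mode: save the (Text, Qimage) pairs with the custom OutputFormat.
images.saveAsNewAPIHadoopFile(
  "hdfs:///image/output",
  classOf[Text],
  classOf[Qimage],
  classOf[QimageOutputFormat[Text, Qimage]])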
