A SequenceFile is a flat file format designed by Hadoop for storing key-value pairs in binary form.
Spark provides a dedicated API for reading SequenceFiles: on a SparkContext, call sequenceFile[keyClass, valueClass](path).
scala> val data = sc.parallelize(List((2,"aa"),(3,"bb"),(4,"cc"),(5,"dd"),(6,"ee")))
data: org.apache.spark.rdd.RDD[(Int, String)] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> data.saveAsSequenceFile("hdfs://hadoop102:9000/sequdata")
scala> val sdata = sc.sequenceFile[Int,String]("hdfs://hadoop102:9000/sequdata/p*")
sdata: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[6] at sequenceFile at <console>:24
scala> sdata.collect
res1: Array[(Int, String)] = Array((2,aa), (3,bb), (4,cc), (5,dd), (6,ee))
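The call above relies on Spark's implicit conversions. sequenceFile can also be invoked with explicit Hadoop Writable classes, in which case the Writables must be mapped back to plain Scala values by hand. A minimal sketch, assuming a local Spark context (the app name is made up, and the path reuses the example above):

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("seq-read-sketch"))

// Read with explicit Writable classes instead of the [Int, String] type parameters.
val raw = sc.sequenceFile("hdfs://hadoop102:9000/sequdata/p*",
  classOf[IntWritable], classOf[Text])

// Convert each Writable to its plain Scala counterpart.
val sdata = raw.map { case (k, v) => (k.get, v.toString) }
```

Note that Writable objects are reused by Hadoop's record reader, which is why converting them (via get / toString) before caching or collecting is the safe pattern.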
To save a pair RDD, simply call saveAsSequenceFile(path) and Spark writes the data out for you. This requires that the keys and values can be implicitly converted to Hadoop Writable types.
Scala type | Java type | Hadoop Writable type |
---|---|---|
Int | Integer | IntWritable or VIntWritable |
Long | Long | LongWritable or VLongWritable |
Float | Float | FloatWritable |
Double | Double | DoubleWritable |
Boolean | Boolean | BooleanWritable |
Array[Byte] | byte[] | BytesWritable |
String | String | Text |
Array[T] | T[] | ArrayWritable<TW> |
List[T] | List[T] | ArrayWritable<TW> |
Map[A,B] | Map<A,B> | MapWritable<AW,BW> |
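As a sketch of how the table is used in practice: a pair RDD whose key and value types appear in the table can be saved directly, and the matching Writables are chosen for you. Here Long maps to LongWritable and String to Text (the output path is hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[*]").setAppName("seq-write-sketch"))

// (Long, String) pairs: Long -> LongWritable, String -> Text per the table above.
val pairs = sc.parallelize(Seq((1L, "one"), (2L, "two"), (3L, "three")))

// The implicit conversions handle the Writable wrapping during the write.
pairs.saveAsSequenceFile("/tmp/seq-demo")
```

Types not in the table (for example, a custom case class) have no implicit Writable conversion, so saveAsSequenceFile will not compile for them; map the RDD to supported types first.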