The difference between saveAsNewAPIHadoopFile and saveAsHadoopFile
The two methods reference different OutputFormat class paths: saveAsNewAPIHadoopFile uses the new MapReduce API's org.apache.hadoop.mapreduce.lib.output.TextOutputFormat, while saveAsHadoopFile uses the old API's org.apache.hadoop.mapred.TextOutputFormat.
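Since the two classes share the same simple name, one way to keep them apart in a single file is to import both under aliases (a minimal sketch; the alias names OldTextOutputFormat and NewTextOutputFormat are just illustrative):

import org.apache.hadoop.mapred.{TextOutputFormat => OldTextOutputFormat}
import org.apache.hadoop.mapreduce.lib.output.{TextOutputFormat => NewTextOutputFormat}
// OldTextOutputFormat goes with saveAsHadoopFile (old mapred API);
// NewTextOutputFormat goes with saveAsNewAPIHadoopFile (new mapreduce API).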
Saving an RDD to HDFS with saveAsHadoopFile
scala> import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.{IntWritable, Text}
scala> import org.apache.hadoop.mapred.TextOutputFormat
import org.apache.hadoop.mapred.TextOutputFormat
scala> val user = sc.parallelize(Array(("jack",20),("jim",10)))
user: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:32
scala> user.saveAsHadoopFile("hdfs://localhost:9000/test/hadoopkv",
| classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]])
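saveAsHadoopFile also has an overload that takes a compression codec class, in case you want the output compressed. A sketch continuing the same session (the output path hadoopkv_gz is hypothetical; GzipCodec is one of the standard Hadoop codecs):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.mapred.TextOutputFormat

// Same call as above, plus a codec class; the part files come out gzipped.
user.saveAsHadoopFile("hdfs://localhost:9000/test/hadoopkv_gz",
  classOf[Text], classOf[IntWritable],
  classOf[TextOutputFormat[Text, IntWritable]], classOf[GzipCodec])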
Saving a file with saveAsNewAPIHadoopFile
scala> import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
scala> val user = sc.parallelize(Array(("jack",20),("jim",10)))
user: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[6] at parallelize at <console>:36
scala> user.saveAsNewAPIHadoopFile("hdfs://localhost:9000/test/hadoopkv1", classOf[Text],classOf[IntWritable],classOf[TextOutputFormat[Text,IntWritable]])
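With either API, TextOutputFormat writes each record as key<TAB>value, one record per line, which is why KeyValueTextInputFormat (which splits each line on the first tab) can read the files back below. A quick way to verify from the same shell:

sc.textFile("hdfs://localhost:9000/test/hadoopkv1").collect().foreach(println)
// jack	20
// jim	10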
Reading Hadoop files: when reading files on HDFS with hadoopFile, you need to register the serialization classes in SparkConf, otherwise you will get errors like java.io.NotSerializableException: org.apache.hadoop.io.Text. These errors occur because Hadoop's common data types (IntWritable, Text, and so on) do not implement java.io.Serializable. The problem can be solved as follows:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local").setAppName("quickspark002")
// Kryo can serialize Hadoop Writables; the default Java serializer cannot.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Equivalent alternative: conf.registerKryoClasses(Array(classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.LongWritable]))
conf.set("spark.kryoserializer.classesToRegister",
  "org.apache.hadoop.io.Text,org.apache.hadoop.io.LongWritable")
val sc = new SparkContext(conf)
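An alternative, if you would rather not touch the serializer settings, is to convert the Writables to plain Strings immediately after reading, before any collect or shuffle; a minimal sketch:

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.KeyValueTextInputFormat

// Strings are Java-serializable, so no Kryo registration is needed downstream.
val kvStrings = sc.hadoopFile("hdfs://localhost:9000/test/hadoopkv1",
    classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }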
Reading files with hadoopFile
import org.apache.hadoop.mapred.KeyValueTextInputFormat

// KeyValueTextInputFormat splits each line on the first tab into (key, value).
// Note: Hadoop RecordReaders reuse Writable objects, so if you plan to cache,
// sort, or aggregate the records, map them to Strings first (see sketch above).
sc.hadoopFile("hdfs://localhost:9000/test/hadoopkv1",
    classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text])
  .collect().foreach { case (name, age) =>
    println(name + "======" + age)
  }
jack======20
jim======10
Reading files with newAPIHadoopFile
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

// Same read through the new MapReduce API. foreach runs on the executors,
// which prints to the console when the master is local.
sc.newAPIHadoopFile[Text, Text, KeyValueTextInputFormat]("hdfs://localhost:9000/test/hadoopkv1")
  .foreach { case (k, v) =>
    println(k + "-------" + v)
  }
jack-------20
jim-------10
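newAPIHadoopFile also has an overload that accepts a Hadoop Configuration, which is useful when the key/value separator in the input is not a tab. A sketch under the same file layout (the property name below is the key read by the new API's KeyValueLineRecordReader; treat it as an assumption for your Hadoop version):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

val hadoopConf = new Configuration()
// The default separator is "\t"; set it explicitly (or to another character).
hadoopConf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t")

sc.newAPIHadoopFile("hdfs://localhost:9000/test/hadoopkv1",
    classOf[KeyValueTextInputFormat], classOf[Text], classOf[Text], hadoopConf)
  .map { case (k, v) => (k.toString, v.toString) }  // Writables -> Strings before collect
  .collect()
  .foreach { case (k, v) => println(k + "-------" + v) }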