Spark: reading/writing LZO files as a pairRDD

Copyright notice: this is the blogger's original article; reproduction without the blogger's permission is prohibited. https://blog.csdn.net/AbnerSunYH/article/details/79028558


Java code

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import com.hadoop.compression.lzo.LzopCodec;
    import com.hadoop.mapreduce.LzoTextInputFormat;

    SparkConf conf = new SparkConf().setMaster("local").setAppName("CheckLog");
    JavaSparkContext sc = new JavaSparkContext(conf);
    Configuration configuration = new Configuration();
    // register the LZO codecs so the input format can find them
    configuration.set("io.compression.codecs", "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzopCodec");
    configuration.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");
    // read the LZO file as (offset, line) pairs
    JavaPairRDD<LongWritable, Text> pairRDD =
            sc.newAPIHadoopFile(readLzoFileName,
                    LzoTextInputFormat.class,
                    LongWritable.class,
                    Text.class,
                    configuration);
    System.out.println(pairRDD.keys().count());
    // write the lines back out, compressed with LzopCodec
    pairRDD.values().saveAsTextFile(saveLzoFilePath, LzopCodec.class);
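Note: LzoTextInputFormat and LzopCodec come from the hadoop-lzo project, so the hadoop-lzo jar (and the native LZO library it wraps) must be available to both the driver and the executors. readLzoFileName and saveLzoFilePath are path strings supplied by the caller.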

The key of the pairRDD is the offset of the start of each line in the file. When reading a plain text file there should normally be no duplicate keys, but reading an LZO file shows many duplicate key values.
(Screenshot: duplicate keys in the pairRDD.)

Why does the LZO file produce so many duplicate keys? Because LZO compresses by block, so the guess is that the key is the offset at the start of each block, i.e. all lines within one block share the same key.
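A quick way to check this (a minimal sketch, continuing from the pairRDD above) is to compare the total record count with the number of distinct key values; for a block-compressed file the distinct count should be far smaller, roughly one per block:

    long lines = pairRDD.count();                 // one record per line of the file
    long distinctOffsets = pairRDD.keys()
            .map(k -> k.get())                    // copy the reused LongWritable out as a primitive
            .distinct()
            .count();                             // roughly one per LZO block, if the guess holds
    System.out.println("lines=" + lines + ", distinct key offsets=" + distinctOffsets);

(The .map(k -> k.get()) step matters because Hadoop record readers reuse the same Writable object, so comparing the LongWritable instances directly can give wrong results.)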

Verification

Read the LZO file with com.hadoop.compression.lzo.LzoIndex and try to generate an index file. Debugging shows that the second pos is indeed the value of the pairRDD's second key, 42682, so the guess is correct, and pairRDD.keys().count() matches the line count of the decompressed file. In other words, in the pairRDD read via newAPIHadoopFile, the key is the offset at the start of each block and the value is a line of the original file. It is safe to use.
(Screenshots: debugging LzoIndex; the generated LZO index.)
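A minimal sketch of that verification, assuming the hadoop-lzo LzoIndex API (createIndex / readIndex / getPosition) and reusing readLzoFileName and configuration from above:

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import com.hadoop.compression.lzo.LzoIndex;

    Path lzoPath = new Path(readLzoFileName);
    FileSystem fs = lzoPath.getFileSystem(configuration);
    LzoIndex.createIndex(fs, lzoPath);             // writes an .index file next to the .lzo file
    LzoIndex index = LzoIndex.readIndex(fs, lzoPath);
    for (int i = 0; i < index.getNumberOfBlocks(); i++) {
        // each position is the byte offset where a compressed block starts;
        // these should match the duplicated keys in the pairRDD (42682 in the post)
        System.out.println("block " + i + " starts at " + index.getPosition(i));
    }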

Error

18/01/10 20:57:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Codec for file hdfs://192.168.68.102:9000/home/workspace/hdfs_admin/logs/home___mg_dc/all.log.lzo not found, cannot run
    at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:99)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/01/10 20:57:19 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Codec for file hdfs://192.168.68.102:9000/home/workspace/hdfs_admin/logs/home___mg_dc/all.log.lzo not found, cannot run
    at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:99)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/01/10 20:57:19 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job

This happens when the codec configuration is missing:

    configuration.set("io.compression.codecs", "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzopCodec");
    configuration.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");
