Reading/Writing LZO Files as a pairRDD in Spark

The code (Java)

    // hadoop-lzo must be on the classpath for the com.hadoop.* classes below.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import com.hadoop.compression.lzo.LzopCodec;
    import com.hadoop.mapreduce.LzoTextInputFormat;

    SparkConf conf = new SparkConf().setMaster("local").setAppName("CheckLog");
    JavaSparkContext sc = new JavaSparkContext(conf);
    Configuration configuration = new Configuration();
    configuration.set("io.compression.codecs", "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzopCodec");
    configuration.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");

    // Read an .lzo file; readLzoFileName is the HDFS path of the input file.
    JavaPairRDD<LongWritable, Text> pairRDD =
            sc.newAPIHadoopFile(readLzoFileName,
                    LzoTextInputFormat.class,
                    LongWritable.class,
                    Text.class,
                    configuration);
    System.out.println(pairRDD.keys().count());

    // Write the lines back out LZOP-compressed; saveLzoFilePath is the output directory.
    pairRDD.values().saveAsTextFile(saveLzoFilePath, LzopCodec.class);

The key of each pair is supposed to be the byte offset at which the corresponding line of the file begins. Reading a normal text file, no two keys should ever be equal; reading an lzo file, however, turns up many pairs with identical keys.

(screenshot: the pairRDD showing repeated keys)

Why so many repeated keys? LZO compresses block by block, so a plausible guess is that the key is the offset at the start of each compressed block, meaning every line inside a block shares the same key.
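
A quick way to confirm this from the pairRDD above (a sketch, not from the original post): compare the total key count against the number of distinct keys. If keys were per-line offsets the two numbers would match; per-block keys leave far fewer distinct values.

    long total = pairRDD.keys().count();                // one key per line
    long distinct = pairRDD.keys().distinct().count();  // roughly one key per LZO block
    System.out.println(total + " lines, " + distinct + " distinct offsets");

    // Convert to plain strings before collecting: Hadoop reuses Writable objects,
    // so collecting LongWritable/Text directly may show stale values.
    pairRDD.map(t -> t._1().get() + " -> " + t._2().toString())
           .take(5)
           .forEach(System.out::println);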

Verification

Use com.hadoop.compression.lzo.LzoIndex to read the lzo file and generate its index file. Stepping through in the debugger shows that the second pos recorded in the index is exactly the second key of the pairRDD, 42682, so the guess is correct. Furthermore, pairRDD.keys().count() matches the line count of the decompressed file. In other words, in a pairRDD read with newAPIHadoopFile, the key is the offset of the start of the block a line belongs to and the value is one line of the original file, so the result can be used with confidence.

(screenshots: debugging LzoIndex; the generated lzo index)
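
A minimal sketch of that verification, assuming the LzoIndex API shipped with hadoop-lzo (createIndex, readIndex, getNumberOfBlocks, getPosition); the file path is taken from the error log below:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import com.hadoop.compression.lzo.LzoIndex;

    Path lzoFile = new Path("hdfs://192.168.68.102:9000/home/workspace/hdfs_admin/logs/home___mg_dc/all.log.lzo");
    FileSystem fs = lzoFile.getFileSystem(new Configuration());

    // Writes all.log.lzo.index next to the file, one entry per compressed block.
    LzoIndex.createIndex(fs, lzoFile);
    LzoIndex index = LzoIndex.readIndex(fs, lzoFile);
    for (int i = 0; i < index.getNumberOfBlocks(); i++) {
        // Each position should match one of the distinct keys of the pairRDD
        // (here the second entry was 42682).
        System.out.println("block " + i + " starts at " + index.getPosition(i));
    }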

Error

    18/01/10 20:57:19 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.io.IOException: Codec for file hdfs://192.168.68.102:9000/home/workspace/hdfs_admin/logs/home___mg_dc/all.log.lzo not found, cannot run
        at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:99)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    18/01/10 20:57:19 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.io.IOException: Codec for file hdfs://192.168.68.102:9000/home/workspace/hdfs_admin/logs/home___mg_dc/all.log.lzo not found, cannot run
        at com.hadoop.mapreduce.LzoLineRecordReader.initialize(LzoLineRecordReader.java:99)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:177)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
    18/01/10 20:57:19 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job

The cause: these two configuration entries were missing:

    configuration.set("io.compression.codecs", "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzopCodec");
    configuration.set("io.compression.codec.lzo.class", "com.hadoop.compression.lzo.LzoCodec");