5.2 Avro Serialization

Latest recommended article published on 2021-09-15 21:15:45

Avalonist

Category: [Mastering Hadoop]  Tags: Avro

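Avro is a popular serialization framework. Its main features are:

It can serialize a wide range of data structures.
It supports many programming languages, serializes quickly, and produces compact binary output.
Code generation is optional: data can be read, written, or sent over RPC without generating any classes or code.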

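Avro uses a schema to read and write data. The schema gives a concise description of the serialized object. Java serialization writes metadata about the object's type into the byte stream along with the object; because an Avro schema is self-describing, that metadata does not have to be repeated in the data. Schemas are written in JavaScript Object Notation (JSON), an object notation popular in web programming. Schema changes are handled by keeping the old and new schemas side by side when the data is processed.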
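
The difference in stream size can be seen without Avro itself. The sketch below is plain Java and only imitates the schema-based layout: each string is written as a zig-zag variable-length byte count followed by its UTF-8 bytes, in schema order, with no field names or type tags. The class MetadataOverheadDemo, its nested Country class, and the method names are hypothetical and are not part of the Avro API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

public class MetadataOverheadDemo {
    // Hypothetical plain Java class holding two string fields.
    public static class Country implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String countryCode;
        public final String countryName;

        public Country(String countryCode, String countryName) {
            this.countryCode = countryCode;
            this.countryName = countryName;
        }
    }

    // Standard Java serialization: class name and field descriptors go into the stream.
    public static byte[] javaSerialize(Country c) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(c);
        }
        return bytes.toByteArray();
    }

    // Avro-style string: zig-zag varint byte length followed by the UTF-8 bytes.
    private static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        long z = ((long) utf8.length) << 1; // zig-zag of a non-negative number
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        out.write(utf8, 0, utf8.length);
    }

    // Schema-based layout: field values in schema order, no names or type tags.
    public static byte[] schemaBasedEncode(Country c) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, c.countryCode);
        writeString(out, c.countryName);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Country andorra = new Country("ad", "Andorra");
        System.out.println("Java serialization: " + javaSerialize(andorra).length + " bytes");
        System.out.println("Schema-based:       " + schemaBasedEncode(andorra).length + " bytes");
    }
}
```

The schema itself still has to be stored somewhere. An Avro data file keeps it once in the file header rather than with every record.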

Below are two Avro schema files. The first is the schema for the worldcitiespop.txt file and the second is the schema for countrycodes.txt:
worldcitiespop.avschema

{"namespace": "MasteringHadoop.avro",
 "type": "record",
 "name": "City",
 "fields": [
     {"name": "countryCode", "type": "string"},
     {"name": "cityName",  "type": "string"},
     {"name": "cityFullName", "type": "string"},
     {"name": "regionCode", "type": ["int","null"]},
     {"name": "population", "type": ["long", "null"]},
     {"name": "latitude", "type": ["float", "null"]},
     {"name": "longitude", "type": ["float", "null"]}
 ]
}

allcountries.avschema

{"namespace": "MasteringHadoop.avro",
 "type": "record",
 "name": "Country",
 "fields": [
     {"name": "countryCode", "type": "string"},
     {"name": "countryName",  "type": "string"}
 ]
}

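The schema is self-describing, and the JSON notation makes it easy to read. Avro supports all the standard primitive types as well as complex types such as unions. A nullable field is declared as a union of null and the field's type. In the schema, a union is written as a JSON array.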
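
To see what a union costs on the wire, the following plain-Java sketch imitates Avro's binary encoding of a value of the union ["long", "null"], the type used for the population field. Avro writes the zero-based position of the chosen branch first and then the value encoded under that branch. Both numbers are zig-zag variable-length integers, and null has no payload. The class NullableLongEncoding and its methods are hypothetical illustrations, not Avro API calls.

```java
import java.io.ByteArrayOutputStream;

public class NullableLongEncoding {
    // Writes n as a zig-zag variable-length integer, the coding Avro uses for int and long.
    public static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    // Encodes a value of the union ["long", "null"]: branch index first, then the value.
    public static byte[] encode(Long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value == null) {
            writeLong(out, 1); // branch 1 is "null", which has no payload
        } else {
            writeLong(out, 0); // branch 0 is "long"
            writeLong(out, value);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println("null -> " + encode(null).length + " byte(s)");
        System.out.println("42   -> " + encode(42L).length + " byte(s)");
    }
}
```

The branch order matters: the index written depends on where "null" appears in the JSON array, so ["long", "null"] and ["null", "long"] produce different bytes for the same value.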

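Using the City schema defined above, we convert the CSV text file worldcitiespop.txt into an Avro file. The code below shows the important steps in writing an Avro file. The static method CsvToAvro contains the main conversion code. It takes three parameters: csvFilePath, avroFilePath (the path of the output file), and the path of the schema file. Avro provides a Schema class, and parsing the schema file produces an instance of it. No code is generated from the schema here, so records are represented by GenericRecord objects that are built from the schema and used to write the data points. Had the schema been used to generate code, the result would be a City class that could be imported into the code below like any other Java class.

The DataFileWriter class writes the actual records to the file. Its create method creates the Avro output file. A BufferedReader reads the CSV file one line, and therefore one city record, at a time. The getCity helper method takes a line, splits it on commas into tokens, and produces a GenericRecord object. The GenericData.Record class is used to instantiate an Avro record; its constructor takes a Schema object.

A field of a GenericRecord is set by calling put with the field name and the corresponding value. The isNumeric method checks whether a token is a number. Bad records are skipped and never written to the Avro file. A field that is never set with put is treated as null: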
MasteringHadoopCsvToAvro.java

package MasteringHadoop;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

import java.io.*;

public class MasteringHadoopCsvToAvro {
   public static void CsvToAvro(String csvFilePath, String avroFilePath, String schemaFile) throws IOException{
        //Read the schema
        Schema schema  = (new Schema.Parser()).parse(new File(schemaFile));
        File avroFile = new File(avroFilePath);
        //Pick the record builder that matches the schema (City or Country)
        boolean isCity = "City".equals(schema.getName());

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        //try-with-resources closes the writer and the reader even if a line fails
        try(DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
            BufferedReader bufferedReader = new BufferedReader(new FileReader(csvFilePath))){
            dataFileWriter.create(schema, avroFile);

            String commaSeparatedLine;
            while((commaSeparatedLine = bufferedReader.readLine()) != null){
                GenericRecord record = isCity ? getCity(commaSeparatedLine, schema)
                                              : getCountry(commaSeparatedLine, schema);
                //Bad records come back as null and are skipped
                if(record != null)
                    dataFileWriter.append(record);
            }
        }
    }

    private static GenericRecord getCountry(String commaSeparatedLine, Schema schema){
        GenericRecord country = null;
        String[] tokens = commaSeparatedLine.split(",");

        if(tokens.length == 2){
            country = new GenericData.Record(schema);
            country.put("countryCode", tokens[0]);
            country.put("countryName", tokens[1]);
        }
        return country;
    }

    private static GenericRecord getCity(String commaSeparatedLine, Schema schema){

         GenericRecord city = null;
         String[] tokens = commaSeparatedLine.split(",");
        //Filter out the bad tokens
         if(tokens.length == 7){
             city = new GenericData.Record(schema);
             city.put("countryCode", tokens[0]);
             city.put("cityName", tokens[1]);
             city.put("cityFullName", tokens[2]);

             //regionCode is an int: accept whole numbers only, because isNumeric also passes decimals
             if(tokens[3] != null && tokens[3].matches("-?\\d{1,9}")){
                 city.put("regionCode", Integer.parseInt(tokens[3]));
             }

             //population is a long: same whole-number check
             if(tokens[4] != null && tokens[4].matches("-?\\d{1,18}")){
                city.put("population", Long.parseLong(tokens[4]));
             }

             if(tokens[5] != null && tokens[5].length() > 0 && isNumeric(tokens[5])){
                 city.put("latitude", Float.parseFloat(tokens[5]));
             }

             if(tokens[6] != null && tokens[6].length() > 0 && isNumeric(tokens[6])){
                 city.put("longitude", Float.parseFloat(tokens[6]));
             }
         }
         return city;
    }

    public static void main(String[] args){
         try{
             CsvToAvro(args[0], args[1], args[2]);
         }
         catch(IOException iox){
             iox.printStackTrace();
         }
         System.out.println("Task has Finished!");
    }

    public static boolean isNumeric(String str){
        try{
            double d = Double.parseDouble(str);
        }
        catch(NumberFormatException nfe){
            return false;
        }
        return true;
    }
}

Run arguments:

./input/countrycodes.txt ./output/countrycodes.avro  ./input/allcountries.avschema
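
One detail of the tokenizing step is worth checking in isolation. Java's String.split(",") discards trailing empty strings, so a line whose last columns are empty yields fewer than seven tokens and would be rejected by the length check in getCity. Whether the real input contains such lines is not stated here, so the following is only a sketch of an alternative: it uses split(",", -1) to keep empty columns and maps unparsable numeric tokens to null. It uses a plain Map in place of a GenericRecord so that it runs without Avro. The class CityLineParser, its helper methods, and the sample line are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CityLineParser {
    private static final int FIELD_COUNT = 7;

    // Returns the parsed fields, or null when the line does not have exactly seven columns.
    public static Map<String, Object> parse(String line) {
        // limit -1 keeps trailing empty columns; plain split(",") would drop them
        String[] tokens = line.split(",", -1);
        if (tokens.length != FIELD_COUNT) {
            return null;
        }
        Map<String, Object> city = new LinkedHashMap<>();
        city.put("countryCode", tokens[0]);
        city.put("cityName", tokens[1]);
        city.put("cityFullName", tokens[2]);
        city.put("regionCode", parseIntOrNull(tokens[3]));
        city.put("population", parseLongOrNull(tokens[4]));
        city.put("latitude", parseFloatOrNull(tokens[5]));
        city.put("longitude", parseFloatOrNull(tokens[6]));
        return city;
    }

    private static Integer parseIntOrNull(String s) {
        try { return Integer.valueOf(s.trim()); } catch (NumberFormatException e) { return null; }
    }

    private static Long parseLongOrNull(String s) {
        try { return Long.valueOf(s.trim()); } catch (NumberFormatException e) { return null; }
    }

    private static Float parseFloatOrNull(String s) {
        try { return Float.valueOf(s.trim()); } catch (NumberFormatException e) { return null; }
    }

    public static void main(String[] args) {
        // Made-up sample line with an empty population column
        System.out.println(parse("ad,andorra,Andorra,07,,42.5,1.5"));
    }
}
```

Parsing with the target type directly (Integer, Long, Float) avoids the case where a token passes a general numeric check but cannot be parsed as an integer.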

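5.2.1 Avro and MapReduce

Hadoop has broad support for Avro serialization and deserialization in MapReduce jobs. In Hadoop 1.x, the special classes AvroMapper and AvroReducer had to be used. In Hadoop 2.x, the built-in Mapper and Reducer classes can be reused as they are, with AvroKey serving as an input or output type of the Mapper and Reducer classes.

AvroKeyInputFormat is a special InputFormat class that reads AvroKey objects from the input file. The following code reads worldcitiespop.avro, the file produced by the previous program, and computes the total population of each country.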
MasteringHadoopAvroMapReduce.java

package MasteringHadoop;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class MasteringHadoopAvroMapReduce {
    private static String citySchema = "{\"namespace\": \"MasteringHadoop.avro\",\n" +
            " \"type\": \"record\",\n" +
            " \"name\": \"City\",\n" +
            " \"fields\": [\n" +
            "     {\"name\": \"countryCode\", \"type\": \"string\"},\n" +
            "     {\"name\": \"cityName\",  \"type\": \"string\"},\n" +
            "     {\"name\": \"cityFullName\", \"type\": \"string\"},\n" +
            "     {\"name\": \"regionCode\", \"type\": [\"int\",\"null\"]},\n" +
            "     {\"name\": \"population\", \"type\": [\"long\", \"null\"]},\n" +
            "     {\"name\": \"latitude\", \"type\": [\"float\", \"null\"]},\n" +
            "     {\"name\": \"longitude\", \"type\": [\"float\", \"null\"]}\n" +
            " ]\n" +
            "}";

    public static class MasteringHadoopAvroMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, LongWritable>{

        private Text ccode = new Text();
        private LongWritable population = new LongWritable();
        private String inputSchema;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            inputSchema = context.getConfiguration().get("citySchema");
         }

        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context) throws IOException, InterruptedException {

            GenericRecord record = key.datum();
            //Avro's generic reader returns strings as Utf8, so convert instead of casting
            String countryCode = record.get("countryCode").toString();
            Long cityPopulation = (Long) record.get("population");

            if(cityPopulation != null){
                ccode.set(countryCode);
                population.set(cityPopulation.longValue());
                context.write(ccode, population);
            }
        }
    }

    public static class MasteringHadoopAvroReducer extends Reducer<Text, LongWritable, Text, LongWritable>{

        private LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long totalPopulation = 0;

            for(LongWritable pop : values){
                totalPopulation += pop.get();
            }

            total.set(totalPopulation);
            context.write(key, total);
        }
    }

    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException{

        GenericOptionsParser parser = new GenericOptionsParser(args);
        Configuration config = parser.getConfiguration();
        String[] remainingArgs = parser.getRemainingArgs();

        config.set("citySchema", citySchema);

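        Job job = Job.getInstance(config, "MasteringHadoop-AvroMapReduce");
        job.setJarByClass(MasteringHadoopAvroMapReduce.class);

        //Map output types must match what the mapper emits: Text keys, LongWritable values
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);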

        job.addCacheFile(new URI(remainingArgs[2]));

        job.setMapperClass(MasteringHadoopAvroMapper.class);
        job.setReducerClass(MasteringHadoopAvroReducer.class);
        job.setNumReduceTasks(1);

        Schema schema  = (new Schema.Parser()).parse(new File(remainingArgs[2]));
        AvroJob.setInputKeySchema(job, schema);

        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        AvroKeyInputFormat.addInputPath(job, new Path(remainingArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));

        //Report job failure through the process exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
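
The map and reduce logic above can be checked without a cluster. The sketch below reproduces it in memory with plain Java collections: the "map" step drops cities whose population is null and emits (countryCode, population), and the "reduce" step sums the values per key. The class PopulationByCountry, its methods, and the sample figures are hypothetical and made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PopulationByCountry {
    // Convenience factory for a (countryCode, population) pair; population may be null.
    public static Map.Entry<String, Long> city(String countryCode, Long population) {
        return new AbstractMap.SimpleEntry<String, Long>(countryCode, population);
    }

    // Sums population per country code, skipping cities with no population value.
    public static Map<String, Long> totalByCountry(List<Map.Entry<String, Long>> cities) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> c : cities) {
            Long population = c.getValue();
            if (population != null) {                            // "map": filter and emit
                totals.merge(c.getKey(), population, Long::sum); // "reduce": sum per key
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> cities = new ArrayList<>();
        cities.add(city("ad", 100L));   // made-up figures
        cities.add(city("ad", null));
        cities.add(city("us", 5L));
        cities.add(city("ad", 20L));
        System.out.println(totalByCountry(cities));
    }
}
```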

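Here the schema is passed around as a string rather than as a file. In the code above it is stored under a key in the Configuration object. DistributedCache could also be used to distribute the schema file. The setup method is overridden to read the schema inside the map task.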

Run arguments:

./input/worldcitiespop.avro ./output ./input/worldcitiespop.avschema

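The program as originally posted failed at run time. Two things to check are that the declared map output classes match what the mapper emits (Text and LongWritable), and that Avro string fields are converted with toString(), because the generic reader returns Utf8 objects rather than String.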

5.2.2 Avro and Pig

5.2.3 Avro and Hive

5.2.4 Avro and Protocol Buffers/Thrift
