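5.2 Avro Serialization
Avro is a popular serialization framework. Its main features are:
- It supports serialization of a variety of data structures.
- It has bindings for multiple programming languages, serializes quickly, and produces compact bytes.
- Code generation is optional. Data can be read, written, or sent over RPC without generating any classes or code.
Avro uses schemas to read and write data. A schema identifies the serialized objects succinctly. In Java serialization, metadata about the object's type is written into the serialized byte stream; a self-describing schema makes this unnecessary. Schemas are described in JavaScript Object Notation (JSON), an object notation that is popular in web programming. Schema changes are handled by having both the old and the new schema available when the data is processed.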
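The point about type metadata in Java serialization can be seen with nothing but the JDK. The following is a minimal sketch and not part of the original listings: JavaSerializationOverhead and CountryPojo are hypothetical names introduced for illustration. It serializes one small object with ObjectOutputStream and checks that the class name travels inside the byte stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (an assumption for illustration, not from the original text).
class JavaSerializationOverhead {

    // A tiny value type standing in for one record.
    static class CountryPojo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String countryCode;
        final long population;

        CountryPojo(String countryCode, long population) {
            this.countryCode = countryCode;
            this.population = population;
        }
    }

    // Serializes one object with standard Java serialization.
    static byte[] serialize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // True if the fully qualified class name appears verbatim in the stream.
    static boolean containsClassName(byte[] stream, Class<?> type) {
        String text = new String(stream, StandardCharsets.ISO_8859_1);
        return text.contains(type.getName());
    }

    public static void main(String[] args) {
        byte[] stream = serialize(new CountryPojo("ad", 20430L));
        System.out.println("stream length in bytes: " + stream.length);
        System.out.println("class name embedded: "
                + containsClassName(stream, CountryPojo.class));
    }
}
```

The payload here is a two-character string and one long, yet the stream also carries the class descriptor and field names. An Avro data file instead stores the schema once in the file header, so the individual records do not repeat it.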
The following are two Avro schema files. The first is the schema for the worldcitiespop.txt file, and the second is the schema for countrycodes.txt:
worldcitiespop.avschema
{"namespace": "MasteringHadoop.avro",
 "type": "record",
 "name": "City",
 "fields": [
     {"name": "countryCode", "type": "string"},
     {"name": "cityName", "type": "string"},
     {"name": "cityFullName", "type": "string"},
     {"name": "regionCode", "type": ["int", "null"]},
     {"name": "population", "type": ["long", "null"]},
     {"name": "latitude", "type": ["float", "null"]},
     {"name": "longitude", "type": ["float", "null"]}
 ]
}
allcountries.avschema
{"namespace": "MasteringHadoop.avro",
 "type": "record",
 "name": "Country",
 "fields": [
     {"name": "countryCode", "type": "string"},
     {"name": "countryName", "type": "string"}
 ]
}
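Schemas are self-describing, and the JSON notation makes them easy to read. Avro supports all the standard primitive data types, and it also supports complex types such as unions. A nullable field is a union of null and the field's own type. Syntactically, a union is written as a JSON array.
We use the City schema defined earlier to convert the CSV text file worldcitiespop.txt into an Avro file. The following code shows the important steps in writing an Avro file. The static method CsvToAvro contains the main conversion code. It takes csvFilePath, avroFilePath (the path of the output file), and the path of the schema file as parameters. Avro has a dedicated Schema class, and parsing the schema file yields an instance of it. No code is generated from the schema here, so we use GenericRecord objects instantiated with the schema to write data points. If code had been generated from the schema, the result would be a City class that could be imported into the following code like any other Java class.
The DataFileWriter class writes the actual records to the file. Its create method creates the Avro output file. A BufferedReader lets us read the records from the CSV file one line at a time. The getCity helper method takes a line, splits it on commas into tokens, and produces a GenericRecord object; getCountry does the same for the two-column rows of countrycodes.txt. The GenericData.Record class is used to instantiate an Avro record, and its constructor takes a Schema object.
A GenericRecord is populated by calling its put method with the field name and the corresponding value. The isNumeric method checks whether a token is a number. Bad records are skipped and are not written to the Avro file. A field that is never set with put is treated as null, which is valid here because the optional fields are declared as unions with null: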
MasteringHadoopCsvToAvro.java
package MasteringHadoop;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

import java.io.*;

public class MasteringHadoopCsvToAvro {

    public static void CsvToAvro(String csvFilePath, String avroFilePath, String schemaFile) throws IOException {
        // Read the schema
        Schema schema = (new Schema.Parser()).parse(new File(schemaFile));
        File avroFile = new File(avroFilePath);
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);

        // try-with-resources closes the writer and the reader even if a line fails
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter);
             BufferedReader bufferedReader = new BufferedReader(new FileReader(csvFilePath))) {
            dataFileWriter.create(schema, avroFile);
            String commaSeparatedLine;
            while ((commaSeparatedLine = bufferedReader.readLine()) != null) {
                // Choose the row parser that matches the record name in the schema
                // ("City" for worldcitiespop.txt, otherwise "Country")
                GenericRecord record = "City".equals(schema.getName())
                        ? getCity(commaSeparatedLine, schema)
                        : getCountry(commaSeparatedLine, schema);
                // Bad rows come back as null and are skipped
                if (record != null) {
                    dataFileWriter.append(record);
                }
            }
        }
    }

    private static GenericRecord getCountry(String commaSeparatedLine, Schema schema) {
        GenericRecord country = null;
        String[] tokens = commaSeparatedLine.split(",");
        if (tokens.length == 2) {
            country = new GenericData.Record(schema);
            country.put("countryCode", tokens[0]);
            country.put("countryName", tokens[1]);
        }
        return country;
    }

    private static GenericRecord getCity(String commaSeparatedLine, Schema schema) {
        GenericRecord city = null;
        String[] tokens = commaSeparatedLine.split(",");
        // Filter out the bad tokens
        if (tokens.length == 7) {
            city = new GenericData.Record(schema);
            city.put("countryCode", tokens[0]);
            city.put("cityName", tokens[1]);
            city.put("cityFullName", tokens[2]);
            // Optional numeric fields stay unset (null) when the token is empty or not a number
            if (tokens[3] != null && tokens[3].length() > 0 && isNumeric(tokens[3])) {
                city.put("regionCode", Integer.parseInt(tokens[3]));
            }
            if (tokens[4] != null && tokens[4].length() > 0 && isNumeric(tokens[4])) {
                city.put("population", Long.parseLong(tokens[4]));
            }
            if (tokens[5] != null && tokens[5].length() > 0 && isNumeric(tokens[5])) {
                city.put("latitude", Float.parseFloat(tokens[5]));
            }
            if (tokens[6] != null && tokens[6].length() > 0 && isNumeric(tokens[6])) {
                city.put("longitude", Float.parseFloat(tokens[6]));
            }
        }
        return city;
    }

    public static void main(String[] args) {
        try {
            CsvToAvro(args[0], args[1], args[2]);
        } catch (IOException iox) {
            iox.printStackTrace();
        }
        System.out.println("Task has Finished!");
    }

    // Returns true if the string parses as a floating-point number
    public static boolean isNumeric(String str) {
        try {
            Double.parseDouble(str);
        } catch (NumberFormatException nfe) {
            return false;
        }
        return true;
    }
}
Run arguments:
./input/countrycodes.txt ./output/countrycodes.avro ./input/allcountries.avschema
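Two details of the token handling above are easy to miss, and the following JDK-only sketch illustrates them. It is an illustration under stated assumptions, not the original author's code: CsvFieldParsing, splitKeepingEmpty, and parseNullableLong are hypothetical names, and the sample row is made up. First, String.split(",") drops trailing empty strings, so a row whose last columns are empty yields fewer tokens than it has columns and would fail a fixed-length check. Second, a check based on Double.parseDouble accepts tokens such as "1e3" that Long.parseLong rejects, so parsing with the target type directly and treating failure as null is a stricter alternative.

```java
// Hypothetical helper class (an assumption for illustration).
class CsvFieldParsing {

    // Splits a CSV line and keeps trailing empty fields (limit -1).
    // Like the listing above, this does not handle quoted commas.
    static String[] splitKeepingEmpty(String line) {
        return line.split(",", -1);
    }

    // Returns the parsed value, or null when the token is empty
    // or is not a whole number.
    static Long parseNullableLong(String token) {
        if (token == null || token.isEmpty()) {
            return null;
        }
        try {
            return Long.valueOf(token.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Made-up row: seven columns, the last three empty.
        String line = "ad,aixas,Aixas,06,,,";
        System.out.println("split(\",\"): " + line.split(",").length + " tokens");
        System.out.println("split(\",\", -1): " + splitKeepingEmpty(line).length + " tokens");
        System.out.println("parse \"20430\": " + parseNullableLong("20430"));
        System.out.println("parse \"1e3\": " + parseNullableLong("1e3"));
    }
}
```

A null result maps naturally onto the nullable union fields of the City schema: the caller simply skips the put call for that field.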
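5.2.1 Avro and MapReduce
Hadoop has extensive support for Avro serialization and deserialization in MapReduce jobs. In Hadoop 1.x, the special classes AvroMapper and AvroReducer had to be used. In Hadoop 2.x, the built-in Mapper and Reducer classes can be reused directly. AvroKey can serve as an input or output type of the Mapper and Reducer classes.
AvroKeyInputFormat is a special InputFormat class that reads AvroKey objects from the input file. The worldcitiespop.avro file is generated by the previous program; the following code reads that file and computes the total population of each country.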
MasteringHadoopAvroMapReduce.java
package MasteringHadoop;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.File;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

public class MasteringHadoopAvroMapReduce {

    private static String citySchema = "{\"namespace\": \"MasteringHadoop.avro\",\n" +
            " \"type\": \"record\",\n" +
            " \"name\": \"City\",\n" +
            " \"fields\": [\n" +
            " {\"name\": \"countryCode\", \"type\": \"string\"},\n" +
            " {\"name\": \"cityName\", \"type\": \"string\"},\n" +
            " {\"name\": \"cityFullName\", \"type\": \"string\"},\n" +
            " {\"name\": \"regionCode\", \"type\": [\"int\",\"null\"]},\n" +
            " {\"name\": \"population\", \"type\": [\"long\", \"null\"]},\n" +
            " {\"name\": \"latitude\", \"type\": [\"float\", \"null\"]},\n" +
            " {\"name\": \"longitude\", \"type\": [\"float\", \"null\"]}\n" +
            " ]\n" +
            "}";

    public static class MasteringHadoopAvroMapper extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, LongWritable> {
        private Text ccode = new Text();
        private LongWritable population = new LongWritable();
        private String inputSchema;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Read the schema string that main() stored in the job configuration
            inputSchema = context.getConfiguration().get("citySchema");
        }

        @Override
        protected void map(AvroKey<GenericRecord> key, NullWritable value, Context context) throws IOException, InterruptedException {
            GenericRecord record = key.datum();
            // Generic Avro records return string fields as a CharSequence (typically Utf8),
            // so convert with toString() instead of casting to String
            String countryCode = record.get("countryCode").toString();
            Long cityPopulation = (Long) record.get("population");
            // Cities without a population value are skipped
            if (cityPopulation != null) {
                ccode.set(countryCode);
                population.set(cityPopulation.longValue());
                context.write(ccode, population);
            }
        }
    }

    public static class MasteringHadoopAvroReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        private LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the city populations for one country code
            long totalPopulation = 0;
            for (LongWritable pop : values) {
                totalPopulation += pop.get();
            }
            total.set(totalPopulation);
            context.write(key, total);
        }
    }

    public static void main(String args[]) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        GenericOptionsParser parser = new GenericOptionsParser(args);
        Configuration config = parser.getConfiguration();
        String[] remainingArgs = parser.getRemainingArgs();
        config.set("citySchema", citySchema);

        Job job = Job.getInstance(config, "MasteringHadoop-AvroMapReduce");
        // The map output types must match what the Mapper emits: Text keys, LongWritable values
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        job.addCacheFile(new URI(remainingArgs[2]));
        job.setMapperClass(MasteringHadoopAvroMapper.class);
        job.setReducerClass(MasteringHadoopAvroReducer.class);
        job.setNumReduceTasks(1);

        // The input key schema tells AvroKeyInputFormat how to read the records
        Schema schema = (new Schema.Parser()).parse(new File(remainingArgs[2]));
        AvroJob.setInputKeySchema(job, schema);
        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        AvroKeyInputFormat.addInputPath(job, new Path(remainingArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));
        job.waitForCompletion(true);
    }
}
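The schema is also propagated as a string by another mechanism: in the preceding code, it is set as a key on the Configuration object. DistributedCache can also be used to distribute the schema file. The setup method is overridden to read the schema in the map task.
Run arguments: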
./input/worldcitiespop.avro ./output ./input/worldcitiespop.avschema
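The aggregation this job performs can be checked on a small sample without a cluster. The following is a plain-Java sketch of the same logic, assuming nothing beyond the JDK: PopulationByCountry and CityRow are hypothetical names, and the sample figures are made up. It skips rows with a null population, as the map method does, and sums the rest per country code, as the reduce method does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the map and reduce steps (an assumption for illustration).
class PopulationByCountry {

    // One input row: a country code and an optional population (null when unknown).
    static final class CityRow {
        final String countryCode;
        final Long population;

        CityRow(String countryCode, Long population) {
            this.countryCode = countryCode;
            this.population = population;
        }
    }

    // Sums the populations per country code, ignoring rows without a population.
    // A TreeMap keeps the keys sorted, like the keys in a single reducer's output.
    static Map<String, Long> totalByCountry(List<CityRow> rows) {
        Map<String, Long> totals = new TreeMap<>();
        for (CityRow row : rows) {
            if (row.population != null) {
                totals.merge(row.countryCode, row.population, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        List<CityRow> rows = Arrays.asList(
                new CityRow("ad", 100L),
                new CityRow("ad", null),
                new CityRow("ae", 5L),
                new CityRow("ad", 50L));
        for (Map.Entry<String, Long> entry : totalByCountry(rows).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Comparing this output with the job's output for the same rows is a quick way to confirm that the Mapper and Reducer are wired correctly.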
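Note: when it was originally run, this program failed with errors. Two things to check in a job like this are that the map output key and value classes set on the Job match the types the Mapper actually emits (Text and LongWritable here), and that Avro string fields are converted with toString() rather than cast to String.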
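One common source of run-time failures when reading generic Avro records is casting a string field to String. With the generic API, string fields are usually returned as org.apache.avro.util.Utf8, which implements CharSequence but is not a String. The following JDK-only sketch shows the effect, using a StringBuilder as a stand-in for Utf8; CharSequenceCastDemo and its methods are hypothetical names introduced for illustration.

```java
// Hypothetical demo class; StringBuilder stands in for Avro's Utf8 (an assumption for illustration).
class CharSequenceCastDemo {

    // Reads a text field defensively: works for String and for any other CharSequence.
    static String asString(Object fieldValue) {
        return fieldValue == null ? null : fieldValue.toString();
    }

    // Returns true when a direct (String) cast of the value throws ClassCastException.
    static boolean castToStringFails(Object fieldValue) {
        try {
            String ignored = (String) fieldValue;
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A CharSequence that is not a String, like the value a generic record may return.
        Object fieldValue = new StringBuilder("ad");
        System.out.println("cast fails: " + castToStringFails(fieldValue));
        System.out.println("toString gives: " + asString(fieldValue));
    }
}
```

Calling toString() works for both String and Utf8 values, so it stays correct even if the reader is later configured to return String objects.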