Business scenario: a traffic network produces large volumes of real-time and historical vehicle-passage data. Using the smoothed variation of historical data to evaluate the instantaneous traffic flow at a given checkpoint and moment is highly valuable; trends in flow volume can reveal characteristics such as a city's rush hours.
Moving average: for a detailed description see
http://wiki.mbalib.com/wiki/%E7%A7%BB%E5%8A%A8%E5%B9%B3%E5%9D%87%E6%B3%95
Below is a simple comparison between traffic-flow counts aggregated per minute and the same counts smoothed with a moving average:
Figure A (raw per-minute flow) | Figure B (moving-average flow)
In the figures above, Figure A plots the per-minute vehicle count at one checkpoint; many factors make the raw counts fluctuate unevenly, which does not match expectations. Figure B applies a moving average (window length 3 here; changing the window changes its effect on the result): current' = (current + current[-1] + current[-2] + … + current[-(n-1)]) / n, where n is the window size, current' is the smoothed value, and current is the raw value. The smoothed value is thus correlated with the previous n-1 samples, which to some extent damps instability caused by factors such as traffic lights. Note: weighted moving averages are not covered here; readers can design and implement them on their own.
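The formula above can be sketched in plain Java; the counts and window size below are made up for illustration, and for the first few points (fewer than n samples) we divide by the number of samples actually seen:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class MovingAvgDemo {
    // Moving average of per-minute counts with the given window size:
    // each output is the mean of the last (at most) windowSize inputs.
    public static int[] movingAverage(int[] counts, int windowSize) {
        Deque<Integer> window = new ArrayDeque<>();
        int[] result = new int[counts.length];
        int sum = 0;
        for (int i = 0; i < counts.length; i++) {
            window.addLast(counts[i]);
            sum += counts[i];
            if (window.size() > windowSize) {
                sum -= window.pollFirst();   // evict the oldest sample
            }
            result[i] = sum / window.size();
        }
        return result;
    }
}
```

With window size 3, the raw series {3, 9, 6} smooths to {3, 6, 6}: spikes are pulled toward the recent mean, which is exactly the effect Figure B illustrates.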
Processing pipeline:
Map --> (CarAvgOrder [composite key: (checkpoint ID, pass time downgraded to the statistics granularity)]) --> Combine [must group on the full composite key, otherwise the results are wrong] --> (CarAvgOrder, count of the current key within one map) --> Reduce [same grouping as Combine] --> moving-average computation [implemented with a queue] --> multi-file output [one file per checkpoint ID; each line: time\tcount]
Implementation:
1. CarAvgOrder.java — the composite key used for secondary sort; see articles on Hadoop secondary sort for background.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key: (checkpoint ID, pass time). WritableComparable already
// extends Writable, so implementing it alone is sufficient.
class CarAvgOrder implements WritableComparable<CarAvgOrder> {
    private Text tgsid;
    private LongWritable passTime;

    // MUST: Hadoop needs a no-arg constructor for deserialization
    public CarAvgOrder() {
        tgsid = new Text();
        passTime = new LongWritable();
    }

    public CarAvgOrder(Text tgsid, LongWritable passTime) {
        this.tgsid = tgsid;
        this.passTime = passTime;
    }

    @Override
    public int compareTo(CarAvgOrder order) {
        // order by checkpoint ID first, then by pass time
        int result = this.tgsid.compareTo(order.getTgsid());
        if (result == 0) {
            result = passTime.compareTo(order.getPassTime());
        }
        return result;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        tgsid.write(out);
        passTime.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        tgsid.readFields(in);
        passTime.readFields(in);
    }

    public Text getTgsid() { return tgsid; }
    public void setTgsid(Text tgsid) { this.tgsid = tgsid; }
    public LongWritable getPassTime() { return passTime; }
    public void setPassTime(LongWritable passTime) { this.passTime = passTime; }
}
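The ordering this composite key produces can be illustrated without Hadoop by comparing plain (String, long) pairs the same way compareTo does; the checkpoint IDs below are hypothetical:

```java
// Plain-Java analogue of CarAvgOrder.compareTo: checkpoint ID first,
// then the downgraded pass time. Uses String/long instead of the
// Text/LongWritable wrappers so it runs standalone.
public class CompositeKeyDemo {
    public static int compare(String tgsidA, long timeA, String tgsidB, long timeB) {
        int result = tgsidA.compareTo(tgsidB);
        if (result == 0) {
            result = Long.compare(timeA, timeB);
        }
        return result;
    }
}
```

Because the framework sorts reducer input by this key, each checkpoint's time buckets arrive at the reducer in chronological order, which is what makes the queue-based moving average possible.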
2. CarAvgFlowMapper.java — the Mapper implementation. The tricks used here are noted in the code comments; parameters passed in through the MR framework make the statistics granularity configurable.
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class CarAvgFlowMapper extends Mapper<LongWritable, Text, CarAvgOrder, IntWritable> {
    private SimpleDateFormat sdf;
    // statistics granularity: N minutes
    private Integer granularity = 1;
    private final IntWritable base = new IntWritable(1);

    @Override
    protected void setup(Mapper<LongWritable, Text, CarAvgOrder, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // note MM (month), not mm (minutes), and the space between date and time
        String format = conf.get(CarAvgFlowPerHour.INPUT_DATE_FORMAT, "yyyy-MM-dd HH:mm:ss");
        granularity = conf.getInt(CarAvgFlowPerHour.GRANULARITY_SIZE, 1);
        sdf = new SimpleDateFormat(format);
    }

    @Override
    protected void map(LongWritable key, Text value,
            Mapper<LongWritable, Text, CarAvgOrder, IntWritable>.Context context)
            throws IOException, InterruptedException {
        String temp = value.toString();
        if (temp.length() > 13) {
            temp = temp.substring(12);
            String[] items = temp.split(",");
            // the guard must cover items[14], accessed below
            if (items.length > 14) {
                try {
                    // pass time
                    Date date = sdf.parse(items[0].substring(9));
                    // secondary sort uses minutes as its smallest unit; downgrade the
                    // timestamp to (1 minute * granularity) buckets, restored in the reducer
                    CarAvgOrder cao = new CarAvgOrder(new Text(items[14].substring(6)),
                            new LongWritable(date.getTime() / 1000 / (60 * granularity)));
                    context.write(cao, base);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
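The timestamp downgrade in the mapper, and its inverse in the reducer, can be sketched standalone; the epoch values below are made up:

```java
public class TimeBucketDemo {
    // Downgrade an epoch timestamp (ms) to a bucket index of `granularity`
    // minutes, as the mapper does: millis -> seconds -> bucket.
    public static long toBucket(long epochMillis, int granularityMinutes) {
        return epochMillis / 1000 / (60L * granularityMinutes);
    }

    // Restore the bucket to the epoch millis of the window start,
    // as the reducer does before formatting the output time.
    public static long fromBucket(long bucket, int granularityMinutes) {
        return bucket * 1000L * 60L * granularityMinutes;
    }
}
```

All timestamps inside the same granularity window collapse to one bucket, so the framework's sort/group on the composite key aggregates exactly one count per (checkpoint, window).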
3. CarAvgFlowCombine.java — optional. With large data volumes a combiner significantly reduces I/O and improves efficiency. Note that the combiner must NOT group by checkpoint ID alone but by the full composite key; this is what distinguishes this job from plain per-checkpoint or per-checkpoint-per-time counting.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

class CarAvgFlowCombine extends Reducer<CarAvgOrder, IntWritable, CarAvgOrder, IntWritable> {
    @Override
    protected void reduce(CarAvgOrder cao, Iterable<IntWritable> values,
            Reducer<CarAvgOrder, IntWritable, CarAvgOrder, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // pre-aggregate the per-record 1s into a partial count for this composite key
        int sum = 0;
        for (IntWritable lw : values) {
            sum += lw.get();
        }
        context.write(cao, new IntWritable(sum));
    }
}
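The combiner is safe here because integer summation is associative and commutative: summing map-side partial sums gives the reducer the same total it would get from the raw 1s. A standalone sketch (the sample counts are hypothetical):

```java
import java.util.Arrays;

public class CombineSafetyDemo {
    public static int sum(int[] xs) {
        int s = 0;
        for (int x : xs) s += x;
        return s;
    }

    // Compare the direct total against the total of two combiner-style
    // partial sums computed over an arbitrary split of the input.
    public static boolean combinerMatches(int[] raw, int splitAt) {
        int direct = sum(raw);
        int partialA = sum(Arrays.copyOfRange(raw, 0, splitAt));
        int partialB = sum(Arrays.copyOfRange(raw, splitAt, raw.length));
        return direct == partialA + partialB;
    }
}
```

The moving average itself is NOT associative, which is why the averaging stays in the reducer and only the counting is combined.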
4. CarAvgPartitioner.java — the partition function, optional (too many partitions is also unreasonable). Partitioning is keyed on the checkpoint ID so all of a checkpoint's records land in the same reducer.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// The value type must match the map output value class (IntWritable).
class CarAvgPartitioner extends Partitioner<CarAvgOrder, IntWritable> {
    @Override
    public int getPartition(CarAvgOrder key, IntWritable value, int numPartitions) {
        // mask the sign bit: hashCode() may be negative, and Hadoop
        // requires a partition index in [0, numPartitions)
        return (key.getTgsid().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
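One subtlety worth demonstrating: Java's hashCode() can be negative, so a bare % can produce a negative partition index, which Hadoop rejects at runtime. A standalone sketch of the usual sign-bit mask (the key strings are hypothetical):

```java
public class PartitionDemo {
    // Unsafe: may return a negative partition when hashCode() is negative.
    public static int naivePartition(String key, int numPartitions) {
        return key.hashCode() % numPartitions;
    }

    // Common fix: clear the sign bit before taking the modulus,
    // guaranteeing a result in [0, numPartitions).
    public static int safePartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The string "polygenelubricants" is a well-known example whose hashCode is Integer.MIN_VALUE, so the naive version returns a negative index for it.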
5. CarAvgFlowReduce.java — the Reduce implementation and the heart of the moving average. A queue holds the current sample plus the previous N-1 samples (N = window size, passed in via the MR framework), so the moving average reduces to the arithmetic mean of the queue contents. Results are written to one file per checkpoint, each file ordered by time.
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedList;
import java.util.Queue;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

class CarAvgFlowReduce extends Reducer<CarAvgOrder, IntWritable, Text, Text> {
    private MultipleOutputs<Text, Text> mo;
    private Integer windowSize = 3;
    private Integer granularity = 1;
    private Queue<Integer> queue;
    private SimpleDateFormat sdf;
    // last checkpoint seen, so the window can be reset between checkpoints
    private String prevTgsid = null;

    @Override
    protected void setup(Reducer<CarAvgOrder, IntWritable, Text, Text>.Context context)
            throws IOException, InterruptedException {
        mo = new MultipleOutputs<Text, Text>(context);
        Configuration conf = context.getConfiguration();
        String format = conf.get(CarAvgFlowPerHour.INPUT_DATE_FORMAT, "yyyy-MM-dd HH:mm:ss");
        sdf = new SimpleDateFormat(format);
        windowSize = conf.getInt(CarAvgFlowPerHour.WINDOW_SIZE, 3);
        granularity = conf.getInt(CarAvgFlowPerHour.GRANULARITY_SIZE, 1);
        queue = new LinkedList<Integer>();
    }

    @Override
    protected void reduce(CarAvgOrder cao, Iterable<IntWritable> values,
            Reducer<CarAvgOrder, IntWritable, Text, Text>.Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable lw : values) {
            sum += lw.get();
        }
        // one reducer handles several checkpoints; clear the window when the
        // checkpoint changes so averages never mix data from two checkpoints
        String tgsid = cao.getTgsid().toString();
        if (!tgsid.equals(prevTgsid)) {
            queue.clear();
            prevTgsid = tgsid;
        }
        // restore the downgraded timestamp (inverse of the mapper's division)
        String time = sdf.format(new Date(cao.getPassTime().get() * 1000 * (60 * granularity)));
        // moving average: keep at most windowSize samples in the queue
        queue.add(sum);
        if (queue.size() > windowSize) {
            queue.poll();
        }
        sum = 0;
        for (Integer item : queue) {
            sum += item;
        }
        int movAvg = sum / queue.size();
        mo.write(new Text(time), new Text(String.valueOf(movAvg)), tgsid);
    }

    @Override
    protected void cleanup(Reducer<CarAvgOrder, IntWritable, Text, Text>.Context context)
            throws IOException, InterruptedException {
        mo.close();
    }
}
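The reducer's core loop can be sketched without Hadoop: it walks a sorted stream of (checkpoint, count) pairs, and because one reducer sees several checkpoints, the window should be cleared whenever the checkpoint ID changes so averages never mix two checkpoints. The sample keys and counts below are made up:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ReducerWindowDemo {
    // keys must be sorted (as the MR shuffle guarantees); counts[i] is the
    // aggregated flow for keys[i]'s time bucket.
    public static List<Integer> run(String[] keys, int[] counts, int windowSize) {
        Deque<Integer> window = new ArrayDeque<>();
        List<Integer> out = new ArrayList<>();
        String prevKey = null;
        for (int i = 0; i < keys.length; i++) {
            if (!keys[i].equals(prevKey)) {
                window.clear();          // new checkpoint: start a fresh window
                prevKey = keys[i];
            }
            window.addLast(counts[i]);
            if (window.size() > windowSize) {
                window.pollFirst();      // evict the oldest sample
            }
            int sum = 0;
            for (int v : window) sum += v;
            out.add(sum / window.size());
        }
        return out;
    }
}
```

For keys {A, A, A, B, B} with counts {2, 4, 6, 10, 20} and window 3, checkpoint A yields 2, 3, 4 and checkpoint B restarts cleanly at 10, 15.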
6. CarAvgFlowPerHour.java — the MR driver; parses the arguments and launches the job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/* Moving average of per-checkpoint flow over 0-24h of each day */
public class CarAvgFlowPerHour {
    public static final String INPUT_DATE_FORMAT = "INPUT_DATE_FORMAT";
    public static final String WINDOW_SIZE = "WINDOW_SIZE";
    public static final String GRANULARITY_SIZE = "GRANULARITY_SIZE";

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        Configuration conf = new Configuration();
        // e.g. 2000-01-14 01:08:28
        conf.set(CarAvgFlowPerHour.INPUT_DATE_FORMAT, "yyyy-MM-dd HH:mm:ss");
        if (args.length >= 3) {
            conf.set(WINDOW_SIZE, args[2]);
        }
        if (args.length >= 4) {
            conf.set(GRANULARITY_SIZE, args[3]);
        }
        Job job = Job.getInstance(conf, "CarAvgFlowPerHour");
        job.setJarByClass(cn.com.zjf.MR_04.CarAvgFlowPerHour.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setMapperClass(CarAvgFlowMapper.class);
        // combine on the full composite key
        job.setCombinerClass(CarAvgFlowCombine.class);
        job.setReducerClass(CarAvgFlowReduce.class);
        job.setMapOutputKeyClass(CarAvgOrder.class);
        job.setMapOutputValueClass(IntWritable.class);
        // partitioner
        job.setPartitionerClass(CarAvgPartitioner.class);
        // job.setGroupingComparatorClass(CarAvgComparator.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileSystem fs = FileSystem.get(conf);
        // preprocessing: only read inputs whose ".writed" marker exists
        // (writing finished) and whose first file is non-empty
        {
            FileStatus childs[] = fs.globStatus(input, new PathFilter() {
                public boolean accept(Path path) {
                    return path.toString().endsWith(".writed");
                }
            });
            Path temp = null;
            for (FileStatus file : childs) {
                // use replace(), not replaceAll(): the latter treats ".writed" as a regex
                temp = new Path(file.getPath().toString().replace(".writed", ""));
                if (fs.listStatus(temp)[0].getLen() > 0) {
                    CombineTextInputFormat.addInputPath(job, temp);
                }
            }
        }
        // 64 MB max split size
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) {
            return;
        }
    }
}