Moving Averages with Hadoop/MapReduce: Averaging Time Series Data

Translation, 2017-11-15 12:54:30

Example 1: Time series data (stock prices)
Consider the following series of closing prices:

Time series index   Date  Closing price
1  2013-10-01  10
2  2013-10-02  18
3  2013-10-03  20
4  2013-10-04  30
5  2013-10-07  24
6  2013-10-08  33
7  2013-10-09  27

To compute the 3-day moving average:

Time series index   Date  Moving average    How it is computed
1  2013-10-01  10.00   = (10)/1
2  2013-10-02  14.00   = (10+18)/2
3  2013-10-03  16.00   = (10+18+20)/3
4  2013-10-04  22.67   = (18+20+30)/3
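
The arithmetic in the table can be reproduced with a few lines of plain Java. This is only a sketch: the class and method names are hypothetical, and it follows the table's convention that the first positions average over however many values are available so far.

```java
import java.util.ArrayList;
import java.util.List;

final class ThreeDayAverageExample {

    /** Returns the moving average at every position; early positions use a shorter window. */
    static List<Double> movingAverages(double[] prices, int window) {
        List<Double> result = new ArrayList<>();
        double sum = 0.0;
        for (int i = 0; i < prices.length; i++) {
            sum += prices[i];
            if (i >= window) {
                sum -= prices[i - window]; // drop the value that just left the window
            }
            int count = Math.min(i + 1, window); // number of values currently in the window
            result.add(sum / count);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] closing = {10, 18, 20, 30, 24, 33, 27};
        System.out.println(movingAverages(closing, 3));
    }
}
```

For the seven closing prices above, the first three results are 10, 14 and 16, and the fourth is 68/3, which is about 22.67.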


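Example 2: Time series data (URL visits)
For each URL, compute the moving average of the number of unique visitors per day over a given time window.

URL    Date  Unique visitors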
------------------------
URL1   2013-10-01  400
URL1   2013-10-02  200
URL1   2013-10-03  300
URL1   2013-10-04  700
URL1   2013-10-05  800
URL2   2013-10-01  10

3-day moving average of URL visits
URL    Date  Moving average
-----------------------
URL1   2013-10-01  400
URL1   2013-10-02  300
URL1   2013-10-03  300
URL1   2013-10-04  400
URL1   2013-10-05  600
URL2   2013-10-01  10
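
The per-URL computation can be sketched in plain Java by keeping one window and one running sum per URL. The class name and the row layout (`{url, date, visitors}`) are assumptions made for this illustration, and the rows are assumed to arrive in date order within each URL. As in Example 1, the first days average over the values seen so far.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PerUrlMovingAverage {

    /** Each row is {url, date, visitors}; rows must be in date order within a URL. */
    static List<String> compute(String[][] rows, int window) {
        Map<String, Deque<Double>> windows = new HashMap<>();
        Map<String, Double> sums = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            String url = row[0];
            double visitors = Double.parseDouble(row[2]);
            Deque<Double> q = windows.computeIfAbsent(url, k -> new ArrayDeque<>());
            double sum = sums.getOrDefault(url, 0.0) + visitors;
            q.addLast(visitors);
            if (q.size() > window) {
                sum -= q.removeFirst(); // slide this URL's window
            }
            sums.put(url, sum);
            out.add(url + " " + row[1] + " " + (sum / q.size()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"URL1", "2013-10-01", "400"}, {"URL1", "2013-10-02", "200"},
            {"URL1", "2013-10-03", "300"}, {"URL1", "2013-10-04", "700"},
            {"URL1", "2013-10-05", "800"}, {"URL2", "2013-10-01", "10"}
        };
        compute(rows, 3).forEach(System.out::println);
    }
}
```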


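I. POJO moving average solutions
Solution 1: Using a queue
Maintain a queue bounded by the window size, together with a running sum.
For each element, add its value to sum and append it to the tail of the queue.
If the queue size does not exceed the window size after the append, move on to the next element.
If the queue size exceeds the window size after the append, remove the head of the queue and subtract its value from sum, so that the running sum slides along with the window.
...
To read the moving average while the queue is non-empty: moving average = running sum / queue size.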

package yidongpingjun.pojo;

import java.util.Queue;
import java.util.LinkedList;

/** 
 * Simple moving average by using a queue data structure.
 *
 * @author Mahmoud Parsian
 *
 */
public class SimpleMovingAverage {

    private double sum = 0.0;
    private final int period;
    private final Queue<Double> window = new LinkedList<Double>();
 
    public SimpleMovingAverage(int period) {
        if (period < 1) {
           throw new IllegalArgumentException("period must be > 0");
        }
        this.period = period;
    }
 
    public void addNewNumber(double number) {
        sum += number;
        window.add(number);
        if (window.size() > period) {
            sum -= window.remove();
        }
    }
 
    public double getMovingAverage() {
        if (window.isEmpty()) {
            throw new IllegalArgumentException("average is undefined");
        }
        return sum / window.size();
    }
}

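Solution 2: Using an array
Use a plain array to simulate the enqueue and dequeue operations. The queue in Solution 1 is backed by a linked list (java.util.LinkedList), which is less efficient than direct array access.
A variable that acts like a pointer is needed to record the position of the head of the queue.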

package yidongpingjun.pojo;

/** 
 * Simple moving average by using an array data structure.
 *
 * @author Mahmoud Parsian
 *
 */
public class SimpleMovingAverageUsingArray {

    private double sum = 0.0;
    private final int period;
    private double[] window = null;
    private int pointer = 0;
    private int size = 0;
 
    public SimpleMovingAverageUsingArray(int period) {
        if (period < 1) {
           throw new IllegalArgumentException("period must be > 0");
        }
        this.period = period;
        window = new double[period];
    }
 
    public void addNewNumber(double number) {
        sum += number;
        if (size < period) {
            window[pointer++] = number;
            size++;
        }
        else {
            // size = period (size cannot be > period)
            pointer = pointer % period;
            sum -=  window[pointer];
            window[pointer++] = number;
        }
    }
 
    public double getMovingAverage() {
        if (size == 0) {
            throw new IllegalArgumentException("average is undefined");
        }
        return sum / size;
    }
}
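
One way to gain confidence in the pointer arithmetic is to compare a circular buffer against a naive recomputation over the last `period` values. The sketch below is an illustration only: `RingWindow` and `naive` are hypothetical names, and the buffer is a compact re-implementation of the same idea rather than the class listed above.

```java
final class RingWindowCheck {

    /** Compact circular-buffer moving average. */
    static final class RingWindow {
        private final double[] buf;
        private int next = 0;   // slot that will be overwritten next
        private int size = 0;   // number of valid values, at most buf.length
        private double sum = 0.0;

        RingWindow(int period) {
            if (period < 1) {
                throw new IllegalArgumentException("period must be > 0");
            }
            buf = new double[period];
        }

        void add(double x) {
            if (size == buf.length) {
                sum -= buf[next]; // buffer is full: evict the oldest value
            } else {
                size++;
            }
            buf[next] = x;
            sum += x;
            next = (next + 1) % buf.length;
        }

        double average() {
            if (size == 0) {
                throw new IllegalStateException("average is undefined");
            }
            return sum / size;
        }
    }

    /** Recomputes the average of the last `period` values ending at index `upto`. */
    static double naive(double[] data, int upto, int period) {
        int from = Math.max(0, upto - period + 1);
        double s = 0.0;
        for (int i = from; i <= upto; i++) {
            s += data[i];
        }
        return s / (upto - from + 1);
    }

    public static void main(String[] args) {
        double[] data = {10, 18, 20, 30, 24, 33, 27};
        RingWindow w = new RingWindow(3);
        for (int i = 0; i < data.length; i++) {
            w.add(data[i]);
            System.out.println(w.average() + " vs " + naive(data, i, 3));
        }
    }
}
```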

Test program:

package yidongpingjun.pojo;

import org.apache.log4j.Logger;
import org.apache.log4j.BasicConfigurator;

/** 
 * Basic testing of Simple moving average.
 *
 * @author Mahmoud Parsian
 *
 */
public class TestSimpleMovingAverage { 

    private static final Logger THE_LOGGER = Logger.getLogger(TestSimpleMovingAverage.class);

    public static void main(String[] args) {
        // The invocation of the BasicConfigurator.configure method 
        // creates a rather simple log4j setup. This method is hardwired 
        // to add to the root logger a ConsoleAppender.
        BasicConfigurator.configure();
        
        // time series        1   2   3  4   5   6   7
        double[] testData = {10, 18, 20, 30, 24, 33, 27};
        int[] allWindowSizes = {3, 4};
        for (int windowSize : allWindowSizes) {
            SimpleMovingAverage sma = new SimpleMovingAverage(windowSize);
            THE_LOGGER.info("windowSize = " + windowSize);
            for (double x : testData) {
                sma.addNewNumber(x);
                THE_LOGGER.info("Next number = " + x + ", SMA = " + sma.getMovingAverage());
            }
            THE_LOGGER.info("---");
        }
    }
}

II. MapReduce/Hadoop moving average solutions
Input: <name-as-string><,><date-as-timestamp><,><value-as-double>
GOOG,2004-11-04,184.70
GOOG,2004-11-03,191.67
GOOG,2004-11-02,194.87
AAPL,2013-10-09,486.59
AAPL,2013-10-08,480.94
AAPL,2013-10-07,487.75
AAPL,2013-10-04,483.03
AAPL,2013-10-03,483.41
IBM,2013-09-30,185.18
IBM,2013-09-27,186.92
IBM,2013-09-26,190.22
IBM,2013-09-25,189.47
GOOG,2013-07-19,896.60
GOOG,2013-07-18,910.68
GOOG,2013-07-17,918.55


Output: <name-as-string><,><date-as-timestamp><,><moving-average-as-double>


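All that is needed is to group the data by stock symbol, sort the values by timestamp, and then apply the moving average algorithm.
There are at least two ways to sort the time series data:
Solution 1: Sorting in memory
Create a data structure TimeSeriesData that bundles the date and the closing-price value.
First apply a map operation to each line, mapping it to a (name, TimeSeriesData) key-value pair.
In the reduce operation, all pairs with the same name arrive at the same reducer. The key is the name and the value is an unordered collection of TimeSeriesData, which is sorted by time in memory at this point.
Then apply the moving average to the sorted collection, producing key-value pairs whose key is the stock symbol and whose value is the date and the moving average, and write them to the output file.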
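
The steps above can be tried out without a cluster. The following plain-Java sketch only imitates the data flow (map, group by name, sort in memory, slide the window); it does not use any Hadoop API. The class names are hypothetical, and ISO dates are compared as strings instead of being converted to timestamps, which is a simplification made for this illustration. Like the reducer below, it emits a value only once a full window is available.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InMemorySortSketch {

    /** One (date, value) observation; yyyy-MM-dd dates sort correctly as strings. */
    static final class Point implements Comparable<Point> {
        final String date;
        final double value;

        Point(String date, double value) {
            this.date = date;
            this.value = value;
        }

        @Override
        public int compareTo(Point other) {
            return date.compareTo(other.date);
        }
    }

    static List<String> run(String[] lines, int window) {
        // "map" and "shuffle": parse each line and group the records by name
        Map<String, List<Point>> groups = new TreeMap<>();
        for (String line : lines) {
            String[] t = line.trim().split(",");
            if (t.length != 3) {
                continue; // skip malformed records
            }
            groups.computeIfAbsent(t[0], k -> new ArrayList<>())
                  .add(new Point(t[1], Double.parseDouble(t[2])));
        }
        // "reduce": sort each group in memory, then slide the window over it
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Point>> e : groups.entrySet()) {
            List<Point> series = e.getValue();
            Collections.sort(series);
            double sum = 0.0;
            for (int i = 0; i < series.size(); i++) {
                sum += series.get(i).value;
                if (i >= window) {
                    sum -= series.get(i - window).value;
                }
                if (i >= window - 1) { // only full windows produce output
                    out.add(e.getKey() + "\t" + series.get(i).date + "," + (sum / window));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {
            "IBM,2013-09-30,185.18", "IBM,2013-09-27,186.92",
            "IBM,2013-09-26,190.22", "IBM,2013-09-25,189.47"
        };
        run(lines, 2).forEach(System.out::println);
    }
}
```

The memory cost is the point to watch: every value for one name is held in a list before sorting, so this approach only suits series that fit in a reducer's heap.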

package yidongpingjun;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.SimpleDateFormat;
import org.apache.hadoop.io.Writable;


/**
 * 
 * TimeSeriesData represents a pair of 
 *  (time-series-timestamp, time-series-value).
 *  
 * @author Mahmoud Parsian
 *
 */
public class TimeSeriesData 
   implements Writable, Comparable<TimeSeriesData> {

	private long timestamp;
	private double value;
	
	public static TimeSeriesData copy(TimeSeriesData tsd) {
		return new TimeSeriesData(tsd.timestamp, tsd.value);
	}
	
	public TimeSeriesData(long timestamp, double value) {
		set(timestamp, value);
	}
	
	public TimeSeriesData() {
	}
	
	public void set(long timestamp, double value) {
		this.timestamp = timestamp;
		this.value = value;
	}	
	
	public long getTimestamp() {
		return this.timestamp;
	}
	
	public double getValue() {
		return this.value;
	}
	
	/**
	 * Deserializes the point from the underlying data.
	 * @param in a DataInput object to read the point from.
	 */
	public void readFields(DataInput in) throws IOException {
		this.timestamp  = in.readLong();
		this.value  = in.readDouble();
	}

	/**
	 * Convert a binary data into TimeSeriesData
	 * 
	 * @param in A DataInput object to read from.
	 * @return A TimeSeriesData object
	 * @throws IOException
	 */
	public static TimeSeriesData read(DataInput in) throws IOException {
		TimeSeriesData tsData = new TimeSeriesData();
		tsData.readFields(in);
		return tsData;
	}

	public String getDate() {
		return DateUtil.getDateAsString(this.timestamp);	
	}

   /**
    * Creates a clone of this object
    */
    public TimeSeriesData clone() {
       return new TimeSeriesData(timestamp, value);
    }

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeLong(this.timestamp );
		out.writeDouble(this.value );

	}

	/**
	 * Used in sorting the data in the reducer
	 */
	@Override
	public int compareTo(TimeSeriesData data) {
		if (this.timestamp  < data.timestamp ) {
			return -1;
		} 
		else if (this.timestamp  > data.timestamp ) {
			return 1;
		}
		else {
		   return 0;
		}
	}
	
	public String toString() {
       return "("+timestamp+","+value+")";
    }
}


package yidongpingjun.memorysort;

import java.util.Date;
import java.io.IOException;


import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.commons.lang.StringUtils;
import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;

/**
 * Input:
 * <pre>
 * GOOG,2004-11-04,184.70
 * GOOG,2004-11-03,191.67
 * GOOG,2004-11-02,194.87
 * AAPL,2013-10-09,486.59
 * AAPL,2013-10-08,480.94
 * AAPL,2013-10-07,487.75
 * AAPL,2013-10-04,483.03
 * AAPL,2013-10-03,483.41
 * IBM,2013-09-30,185.18
 * IBM,2013-09-27,186.92
 * IBM,2013-09-26,190.22
 * IBM,2013-09-25,189.47
 * GOOG,2013-07-19,896.60
 * GOOG,2013-07-18,910.68
 * GOOG,2013-07-17,918.55
 * </pre>
 *
 * @author chenjie
 */
public class SortInMemory_MovingAverageMapper    
    extends Mapper<LongWritable, Text, Text, TimeSeriesData> {
 
   private final Text reducerKey = new Text();
   private final TimeSeriesData reducerValue = new TimeSeriesData();
   
   
   /**
    * value:GOOG,2004-11-04,184.70
    */
   public void map(LongWritable key, Text value, Context context)
       throws IOException, InterruptedException {
       String record = value.toString();
       if ((record == null) || (record.length() == 0)) {
          return;
       }
       String[] tokens = StringUtils.split(record.trim(), ",");
       if (tokens.length == 3) {
          Date date = DateUtil.getDate(tokens[1]);//2004-11-04,
          if (date == null) {
          	 return;
          }
          reducerKey.set(tokens[0]); // GOOG
          reducerValue.set(date.getTime(), Double.parseDouble(tokens[2]));
          context.write(reducerKey, reducerValue);
       }
       else {
          // log as error, not enough tokens
       }
   }
}

package yidongpingjun.memorysort;

import java.io.IOException;
import java.util.List;
import java.util.ArrayList;
import java.util.Collections;


//
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
//


import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;


public class SortInMemory_MovingAverageReducer 
   extends Reducer<Text, TimeSeriesData, Text, Text> {

    int windowSize = 5; // default window size
   
	/**
	 *  will be run only once 
	 *  get parameters from Hadoop's configuration
	 */
	public void setup(Context context)
        throws IOException, InterruptedException {
        this.windowSize = context.getConfiguration().getInt("moving.average.window.size", 5);
        System.out.println("setup(): windowSize="+windowSize);
    }

	public void reduce(Text key, Iterable<TimeSeriesData> values, Context context)	
		throws IOException, InterruptedException {
       
        System.out.println("reduce(): key="+key.toString());

		// build the unsorted list of timeseries
		List<TimeSeriesData> timeseries = new ArrayList<TimeSeriesData>();
		for (TimeSeriesData tsData : values) {
			TimeSeriesData copy = TimeSeriesData.copy(tsData);
			timeseries.add(copy);
		} 
		
		// sort the timeseries data in memory and
        // apply moving average algorithm to sorted timeseries
        Collections.sort(timeseries);
        System.out.println("reduce(): timeseries="+timeseries.toString());
        
        
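        // calculate prefix sum of the first (windowSize - 1) values;
        // bounded by the list size so that a series shorter than the window
        // produces no output instead of throwing IndexOutOfBoundsException
        double sum = 0.0;
        for (int i = 0; i < Math.min(windowSize - 1, timeseries.size()); i++) {
        	sum += timeseries.get(i).getValue();
        }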
        
        // now we have enough timeseries data to calculate moving average
		Text outputValue = new Text(); // reuse object
        for (int i = windowSize-1; i < timeseries.size(); i++) {
            System.out.println("reduce(): key="+key.toString() + "  i="+i);
        	sum += timeseries.get(i).getValue();
        	double movingAverage = sum / windowSize;
        	long timestamp = timeseries.get(i).getTimestamp();
        	outputValue.set(DateUtil.getDateAsString(timestamp) + "," + movingAverage);
        	// send output to HDFS
        	context.write(key, outputValue);
        	
        	// prepare for next iteration
        	sum -= timeseries.get(i-windowSize+1).getValue();
        }
	} // reduce

}


Driver class:

package yidongpingjun.memorysort;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
//



import yidongpingjun.HadoopUtil;
import yidongpingjun.TimeSeriesData;

/**
 * MapReduce job for moving averages of time series data 
 * by using in memory sort (without secondary sort).
 *
 * @author Mahmoud Parsian
 *
 */
public class SortInMemory_MovingAverageDriver {
 
    private static final String INPATH = "input/gupiao1.txt"; // input file path
    private static final String OUTPATH = "output/gupiao1"; // output path
    
    public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       String[] otherArgs = new String[3];
       otherArgs[0] = "2";
       otherArgs[1] = INPATH;
       otherArgs[2] = OUTPATH;
       if (otherArgs.length != 3) {
          System.err.println("Usage: SortInMemory_MovingAverageDriver <window_size> <input> <output>");
          System.exit(1);
       }
       System.out.println("args[0]: <window_size>="+otherArgs[0]);
       System.out.println("args[1]: <input>="+otherArgs[1]);
       System.out.println("args[2]: <output>="+otherArgs[2]);
       
       Job job = new Job(conf, "SortInMemory_MovingAverageDriver");

       // add jars to distributed cache
     //  HadoopUtil.addJarsToDistributedCache(job, "/lib/");
       
       // set mapper/reducer
       job.setMapperClass(SortInMemory_MovingAverageMapper.class);
       job.setReducerClass(SortInMemory_MovingAverageReducer.class);
       
       // define mapper's output key-value
       job.setMapOutputKeyClass(Text.class);
       job.setMapOutputValueClass(TimeSeriesData.class);
              
       // define reducer's output key-value
       job.setOutputKeyClass(Text.class);
       job.setOutputValueClass(Text.class);
       
       // set window size for moving average calculation
       int windowSize = Integer.parseInt(otherArgs[0]);
       job.getConfiguration().setInt("moving.average.window.size", windowSize);      
       
       // define I/O
       FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
       FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
       
       job.setInputFormatClass(TextInputFormat.class); 
       job.setOutputFormatClass(TextOutputFormat.class);
       
       System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}


Output:

AAPL   2013-10-04,483.22
AAPL   2013-10-07,485.39
AAPL   2013-10-08,484.345
AAPL   2013-10-09,483.765
GOOG   2004-11-03,193.26999999999998
GOOG   2004-11-04,188.18499999999997
GOOG   2013-07-17,551.625
GOOG   2013-07-18,914.615
GOOG   2013-07-19,903.6400000000001
IBM    2013-09-26,189.845
IBM    2013-09-27,188.57
IBM    2013-09-30,186.05
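
The driver above hard-codes a window size of 2, so each output value is the mean of a record and its predecessor in date order for the same symbol. Two rows are checked by hand below. Note that the GOOG value for 2013-07-17 averages across the gap between 2004 and 2013, because the window counts records rather than calendar days.

```latex
\begin{align*}
\text{AAPL, 2013-10-04:}\quad & \frac{483.41 + 483.03}{2} = 483.22 \\
\text{GOOG, 2013-07-17:}\quad & \frac{184.70 + 918.55}{2} = 551.625
\end{align*}
```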

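Solution 2: Sorting with the MapReduce framework (secondary sort). A composite key is built from the stock symbol and the timestamp; records are grouped by stock symbol and sorted by stock symbol and then by timestamp.
Create a data structure TimeSeriesData that bundles the date and the closing-price value (the same class as in Solution 1).
Create a data structure CompositeKey to serve as the composite key, bundling the stock symbol and the timestamp.
The mapper class SortByMRF_MovingAverageMapper maps each input record [stock symbol, date, closing price] to a key-value pair whose key is a CompositeKey and whose value is a TimeSeriesData.
Because both the key and the value are now custom types, we have to define ourselves how keys are partitioned and sorted, and with that the order in which the values arrive.
So create a class CompositeKeyComparator that defines how keys are sorted: first by the CompositeKey's stock symbol, then by timestamp.
Create a class NaturalKeyPartitioner that defines how keys are partitioned: by the CompositeKey's stock symbol, so that records with the same stock symbol reach the same reducer.
Create a class NaturalKeyGroupingComparator that defines how keys are grouped: by the CompositeKey's stock symbol.
Create a class SortByMRF_MovingAverageReducer that defines the reduce step: given a CompositeKey key and TimeSeriesData values that are already sorted by time, compute the moving average.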
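
The interplay of the sort comparator and the grouping comparator can be imitated in plain Java. This is a sketch of the idea only: `Record`, `SORT` and `sortAndGroup` are hypothetical names, no Hadoop class is involved, and a single sorted array stands in for the framework's shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSketch {

    /** A composite key (name, timestamp) carrying its value along. */
    static final class Record {
        final String name;
        final long timestamp;
        final double value;

        Record(String name, long timestamp, double value) {
            this.name = name;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // Sort order: natural key first, then timestamp (the role of CompositeKeyComparator).
    static final Comparator<Record> SORT =
        Comparator.comparing((Record r) -> r.name).thenComparingLong(r -> r.timestamp);

    /** Sorts by the composite key, then groups by name only. */
    static Map<String, List<Double>> sortAndGroup(Record[] records) {
        Record[] sorted = records.clone();
        Arrays.sort(sorted, SORT);
        Map<String, List<Double>> groups = new LinkedHashMap<>();
        for (Record r : sorted) {
            // Grouping looks at the name only (the role of NaturalKeyGroupingComparator),
            // so each group receives its values already in timestamp order.
            groups.computeIfAbsent(r.name, k -> new ArrayList<>()).add(r.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        Record[] input = {
            new Record("IBM", 30L, 185.18), new Record("AAPL", 9L, 486.59),
            new Record("IBM", 25L, 189.47), new Record("AAPL", 3L, 483.41)
        };
        System.out.println(sortAndGroup(input));
    }
}
```

Because the values reach each group already ordered, the reducer can stream them through a moving average without buffering the whole series, which is the main advantage over sorting in memory.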

package yidongpingjun.secondarysort;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
//
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;


public class CompositeKey implements WritableComparable<CompositeKey> {
    // natural key is (name)
    // composite key is a pair (name, timestamp)
	private String name;
	private long timestamp;

	public CompositeKey(String name, long timestamp) {
		set(name, timestamp);
	}
	
	public CompositeKey() {
	}

	public void set(String name, long timestamp) {
		this.name = name;
		this.timestamp = timestamp;
	}

	public String getName() {
		return this.name;
	}

	public long getTimestamp() {
		return this.timestamp;
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.name = in.readUTF();
		this.timestamp = in.readLong();
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(this.name);
		out.writeLong(this.timestamp);
	}

	@Override
	public int compareTo(CompositeKey other) {
		if (this.name.compareTo(other.name) != 0) {
			return this.name.compareTo(other.name);
		} 
		else if (this.timestamp != other.timestamp) {
			return timestamp < other.timestamp ? -1 : 1;
		} 
		else {
			return 0;
		}

	}

	public static class CompositeKeyComparator extends WritableComparator {
		public CompositeKeyComparator() {
			super(CompositeKey.class);
		}

		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
			return compareBytes(b1, s1, l1, b2, s2, l2);
		}
	}

	static { // register this comparator
		WritableComparator.define(CompositeKey.class,
				new CompositeKeyComparator());
	}

}


package yidongpingjun.secondarysort;

import java.util.Date;
import java.io.IOException;


//
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.commons.lang.StringUtils;
//


import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;


public class SortByMRF_MovingAverageMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, CompositeKey, TimeSeriesData> {

    // reuse Hadoop's Writable objects
    private final CompositeKey reducerKey = new CompositeKey();
    private final TimeSeriesData reducerValue = new TimeSeriesData();

    @Override
    public void map(LongWritable inkey, Text value,
            OutputCollector<CompositeKey, TimeSeriesData> output,
            Reporter reporter) throws IOException {
        String record = value.toString();
        if ((record == null) || (record.length() == 0)) {
            return;
        }
        String[] tokens = StringUtils.split(record, ",");
        if (tokens.length == 3) {
            // tokens[0] = name of timeseries as string
            // tokens[1] = timestamp
            // tokens[2] = value of timeseries as double
            Date date = DateUtil.getDate(tokens[1]);
            if (date == null) {
                return;
            }
            long timestamp = date.getTime();
            reducerKey.set(tokens[0], timestamp);
            reducerValue.set(timestamp, Double.parseDouble(tokens[2]));
            // emit key-value pair
            output.collect(reducerKey, reducerValue);
        } 
        else {
            // log as error, not enough tokens
        }
    }
}


package yidongpingjun.secondarysort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;


public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CompositeKey key1 = (CompositeKey) w1;
        CompositeKey key2 = (CompositeKey) w2;

        int comparison = key1.getName().compareTo(key2.getName());
        if (comparison == 0) {
            // names are equal here
            if (key1.getTimestamp() == key2.getTimestamp()) {
                return 0;
            } else if (key1.getTimestamp() < key2.getTimestamp()) {
                return -1;
            } else {
                return 1;
            }
        } 
        else {
            return comparison;
        }
    }
}

package yidongpingjun.secondarysort;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

import yidongpingjun.TimeSeriesData;


public class NaturalKeyPartitioner implements
        Partitioner<CompositeKey, TimeSeriesData> {

    @Override
    public int getPartition(CompositeKey key,
            TimeSeriesData value,
            int numberOfPartitions) {
        return Math.abs((int) (hash(key.getName()) % numberOfPartitions));
    }

    @Override
    public void configure(JobConf jobconf) {
    }

    /**
     * adapted from String.hashCode()
     */
    static long hash(String str) {
        long h = 1125899906842597L; // prime
        int length = str.length();
        for (int i = 0; i < length; i++) {
            h = 31 * h + str.charAt(i);
        }
        return h;
    }
}
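
The property that matters for the partitioner is that the partition depends on the name alone, so every timestamp of one symbol lands on the same reducer. The sketch below copies the hash function from the listing above into a hypothetical standalone class so that it can be run without Hadoop; the `partition` helper and its unused `timestamp` parameter exist only for this illustration.

```java
final class NaturalKeyPartitionSketch {

    /** Same string hash as in NaturalKeyPartitioner. */
    static long hash(String str) {
        long h = 1125899906842597L;
        for (int i = 0; i < str.length(); i++) {
            h = 31 * h + str.charAt(i);
        }
        return h;
    }

    /** The partition depends on the name only; the timestamp is deliberately ignored. */
    static int partition(String name, long timestamp, int numberOfPartitions) {
        return Math.abs((int) (hash(name) % numberOfPartitions));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GOOG", "AAPL", "IBM"}) {
            System.out.println(name + " -> partition " + partition(name, 0L, 4));
        }
    }
}
```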



package yidongpingjun.secondarysort;

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;


public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CompositeKey key1 = (CompositeKey) w1;
        CompositeKey key2 = (CompositeKey) w2;
        return key1.getName().compareTo(key2.getName());
    }

}


package yidongpingjun.secondarysort;

import java.util.Iterator;
import java.io.IOException;


//
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.JobConf;
//


import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;


public class SortByMRF_MovingAverageReducer extends MapReduceBase
        implements Reducer<CompositeKey, TimeSeriesData, Text, Text> {

    int windowSize = 5; // default window size

    /**
     * will be run only once get parameters from Hadoop's configuration
     */
    @Override
    public void configure(JobConf jobconf) {
        this.windowSize = jobconf.getInt("moving.average.window.size", 5);
    }

    @Override
    public void reduce(CompositeKey key,
            Iterator<TimeSeriesData> values,
            OutputCollector<Text, Text> output,
            Reporter reporter)
            throws IOException {

        // note that values are sorted.
        // apply moving average algorithm to sorted timeseries
        Text outputKey = new Text();
        Text outputValue = new Text();
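        // MovingAverage is not listed in this article; the queue-based
        // SimpleMovingAverage shown earlier offers the same two methods.
        yidongpingjun.pojo.SimpleMovingAverage ma =
                new yidongpingjun.pojo.SimpleMovingAverage(this.windowSize);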
        while (values.hasNext()) {
            TimeSeriesData data = values.next();
            ma.addNewNumber(data.getValue());
            double movingAverage = ma.getMovingAverage();
            long timestamp = data.getTimestamp();
            String dateAsString = DateUtil.getDateAsString(timestamp);
            //THE_LOGGER.info("Next number = " + x + ", SMA = " + sma.getMovingAverage());
            outputValue.set(dateAsString + "," + movingAverage);
            outputKey.set(key.getName());
            output.collect(outputKey, outputValue);
        }
        //
    } 

}

package yidongpingjun.secondarysort;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobClient;
//




import yidongpingjun.HadoopUtil;
import yidongpingjun.TimeSeriesData;


public class SortByMRF_MovingAverageDriver {
    private static final String INPATH = "input/gupiao1.txt"; // input file path
    private static final String OUTPATH = "output/gupiao2"; // output path
    
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
		JobConf jobconf = new JobConf(conf, SortByMRF_MovingAverageDriver.class);
		jobconf.setJobName("SortByMRF_MovingAverageDriver");
    
		String[] otherArgs = new String[3];
	       otherArgs[0] = "2";
	       otherArgs[1] = INPATH;
	       otherArgs[2] = OUTPATH;
       if (otherArgs.length != 3) {
          System.err.println("Usage: SortByMRF_MovingAverageDriver <window_size> <input> <output>");
          System.exit(1);
       }

       // add jars to distributed cache
     //  HadoopUtil.addJarsToDistributedCache(conf, "/lib/");
       
       // set mapper/reducer
       jobconf.setMapperClass(SortByMRF_MovingAverageMapper.class);
       jobconf.setReducerClass(SortByMRF_MovingAverageReducer.class);
       
       // define mapper's output key-value
       jobconf.setMapOutputKeyClass(CompositeKey.class);
       jobconf.setMapOutputValueClass(TimeSeriesData.class);
              
       // define reducer's output key-value
       jobconf.setOutputKeyClass(Text.class);
       jobconf.setOutputValueClass(Text.class);

       // set window size for moving average calculation
       int windowSize = Integer.parseInt(otherArgs[0]);
       jobconf.setInt("moving.average.window.size", windowSize);      
       
       // define I/O
	   FileInputFormat.setInputPaths(jobconf, new Path(otherArgs[1]));
	   FileOutputFormat.setOutputPath(jobconf, new Path(otherArgs[2]));
       
       jobconf.setInputFormat(TextInputFormat.class); 
       jobconf.setOutputFormat(TextOutputFormat.class);
	   jobconf.setCompressMapOutput(true);       
       
       // the following 3 setting are needed for "secondary sorting"
       // Partitioner decides which mapper output goes to which reducer 
       // based on mapper output key. In general, different key is in 
       // different group (Iterator at the reducer side). But sometimes, 
       // we want different key in the same group. This is the time for 
       // Output Value Grouping Comparator, which is used to group mapper 
       // output (similar to group by condition in SQL).  The Output Key 
       // Comparator is used during sort stage for the mapper output key.
       jobconf.setPartitionerClass(NaturalKeyPartitioner.class);
       jobconf.setOutputKeyComparatorClass(CompositeKeyComparator.class);
       jobconf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
       
       JobClient.runJob(jobconf);
    }

}






package yidongpingjun;

import java.text.SimpleDateFormat;
import java.util.Date;


public class DateUtil {

	static final String DATE_FORMAT = "yyyy-MM-dd";
	static final SimpleDateFormat SIMPLE_DATE_FORMAT = 
	   new SimpleDateFormat(DATE_FORMAT);

    /**
     *  Returns the Date from a given dateAsString
     */
	public static Date getDate(String dateAsString)  {
        try {
        	return SIMPLE_DATE_FORMAT.parse(dateAsString);
        }
        catch(Exception e) {
        	return null;
        }
	}

    /**
     *  Returns the number of milliseconds since January 1, 1970, 
     *  00:00:00 GMT represented by this Date object.
     */
	public static long getDateAsMilliSeconds(Date date) throws Exception {
        return date.getTime();
	}
	
	
    /**
     *  Returns the number of milliseconds since January 1, 1970, 
     *  00:00:00 GMT represented by this Date object.
     */
	public static long getDateAsMilliSeconds(String dateAsString) throws Exception {
		Date date = getDate(dateAsString);	
        return date.getTime();
	}
	
	
	
	
	public static String getDateAsString(long timestamp) {
        return SIMPLE_DATE_FORMAT.format(timestamp);
	}	
	
}

package yidongpingjun;

import java.util.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.io.IOException;
//
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.filecache.DistributedCache;




public class HadoopUtil {

   /**
    * Add all jar files to HDFS's distributed cache
    *
    * @param job job which will be run
    * @param hdfsJarDirectory a directory which has all required jar files
    */ 
   public static void addJarsToDistributedCache(Job job, 
                                                String hdfsJarDirectory) 
      throws IOException {
      if (job == null) {
         return;
      }
      addJarsToDistributedCache(job.getConfiguration(), hdfsJarDirectory);
   }

   /**
    * Add all jar files to HDFS's distributed cache
    *
    * @param conf the Hadoop configuration to which the jar files are added
    * @param hdfsJarDirectory a directory which has all required jar files
    */ 
   public static void addJarsToDistributedCache(Configuration conf, 
                                                String hdfsJarDirectory) 
      throws IOException {
      if (conf == null) {
         return;
      }
      FileSystem fs = FileSystem.get(conf);
      List<FileStatus> jars = getDirectoryListing(hdfsJarDirectory, fs);
      for (FileStatus jar : jars) {
         Path jarPath = jar.getPath();
         DistributedCache.addFileToClassPath(jarPath, conf, fs);
      }
   }

   
   /**
    * Get list of files from a given HDFS directory
    * @param directory an HDFS directory name
    * @param fs an HDFS FileSystem
    */   
    public static List<FileStatus> getDirectoryListing(String directory, 
                                                       FileSystem fs) 
       throws IOException {
       Path dir = new Path(directory); 
       FileStatus[] fstatus = fs.listStatus(dir); 
       return Arrays.asList(fstatus);
    }
    
    public static List<String> listDirectoryAsListOfString(String directory, 
                                                           FileSystem fs) 
       throws IOException {
       Path path = new Path(directory); 
       FileStatus fstatus[] = fs.listStatus(path);
       List<String> listing = new ArrayList<String>();
       for (FileStatus f: fstatus) {
           listing.add(f.getPath().toUri().getPath());
       }
       return listing;
    }
    
    
   /**
    * Returns true if the HDFS path exists; otherwise returns false.
    * 
    */
   public static boolean pathExists(Path path, FileSystem fs)  {
      if (path == null) {
         return false;
      }
      
      try {
         return fs.exists(path);
      }
      catch(Exception e) {
          return false;
      }
   }   
   
}



