MapReduce Moving Average (Stock Prices as an Example)

Basic Concepts

Time Series Data

Time series data represents the values of a variable over a period of time.

Moving Average

Let A be an ordered sequence of N values:

$$A = (a_1, a_2, a_3, \ldots, a_N)$$

which can also be written as $\{a_i\}_{i=1}^{N}$.

The n-moving-average of A is a new sequence $\{S_i\}_{i=1}^{N-n+1}$, obtained from the $a_i$ by taking the arithmetic mean of each n-term subsequence:

$$S_i = \frac{1}{n}\sum_{j=i}^{i+n-1} a_j$$

Basic Example

Stock closing-price time series:

#   Date         Close
1   2013-10-01   10
2   2013-10-02   18
3   2013-10-03   20
4   2013-10-04   30

3-day moving average of the closing prices. Note that, like the code below, this emits a value for every point, averaging over a partially filled window for the first n-1 days (the strict definition above would yield only N-n+1 values):

#   Date         Moving average   Calculation
1   2013-10-01   10.00            (10)/1
2   2013-10-02   14.00            (10+18)/2
3   2013-10-03   16.00            (10+18+20)/3
4   2013-10-04   22.67            (18+20+30)/3

The MapReduce Moving Average Solution
Sample input
GOOG,2004-11-04,184.70
GOOG,2004-11-03,191.67
GOOG,2004-11-02,194.87
AAPL,2013-10-9,486.59
AAPL,2013-10-8,480.94
AAPL,2013-10-7,487.75
AAPL,2013-10-4,483.03
AAPL,2013-10-3,483.41
IBM,2013-09-30,185.18
IBM,2013-09-30,186.92
IBM,2013-09-30,190.22
IBM,2013-09-30,189.47
GOOG,2013-07-19,896.60
GOOG,2013-07-18,910.68
GOOG,2013-07-17,918.55
Sample output
AAPL	2013-10-03,483.41
AAPL	2013-10-04,483.22
AAPL	2013-10-07,484.73
AAPL	2013-10-08,483.7825
AAPL	2013-10-09,484.34400000000005
GOOG	2004-11-02,194.87
GOOG	2004-11-03,193.26999999999998
GOOG	2004-11-04,190.41333333333333
GOOG	2013-07-17,372.4475
GOOG	2013-07-18,480.09399999999994
GOOG	2013-07-19,620.4399999999999
IBM	2013-09-30,186.92
IBM	2013-09-30,188.57
IBM	2013-09-30,188.87
IBM	2013-09-30,187.9475

Once the moving average algorithm is understood, the solution is straightforward: group the data by stock symbol, sort each group's values by timestamp, and finally apply the moving average algorithm.

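Before wiring this into MapReduce, a minimal single-machine sketch of the same idea may help (illustrative only; the class and variable names here are made up): group rows by symbol, sort within each group by date, then slide a window over the sorted values. On the three 2004 GOOG rows above it reproduces the corresponding lines of the sample output.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

public class InMemoryMovingAverage {
    public static void main(String[] args) {
        String[] rows = {
                "GOOG,2004-11-04,184.70",
                "GOOG,2004-11-03,191.67",
                "GOOG,2004-11-02,194.87",
        };
        int window = 3;
        // Group by symbol; a yyyy-MM-dd date string sorts correctly as plain text.
        Map<String, TreeMap<String, Double>> bySymbol = new TreeMap<>();
        for (String row : rows) {
            String[] t = row.split(",");
            bySymbol.computeIfAbsent(t[0], k -> new TreeMap<>())
                    .put(t[1], Double.parseDouble(t[2]));
        }
        // Walk each symbol's points in date order, keeping a sliding window.
        for (Map.Entry<String, TreeMap<String, Double>> e : bySymbol.entrySet()) {
            Deque<Double> w = new ArrayDeque<>();
            double sum = 0.0;
            for (Map.Entry<String, Double> p : e.getValue().entrySet()) {
                w.addLast(p.getValue());
                sum += p.getValue();
                if (w.size() > window) {
                    sum -= w.removeFirst();
                }
                System.out.println(e.getKey() + "\t" + p.getKey() + "," + sum / w.size());
            }
        }
    }
}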
An array-based moving-average implementation that simulates a fixed-size queue:
public class MovingAverage {

        private double sum = 0.0;
        private final int period;
        private double[] window = null;
        private int pointer = 0;
        private int size = 0;

        public MovingAverage(int period) {
            if (period < 1) {
                throw new IllegalArgumentException("period must be > 0");
            }
            this.period = period;
            window = new double[period];
        }

        public void addNewNumber(double number) {
            sum += number;
            if (size < period) {
                window[pointer++] = number;
                size++;
            }
            else {
                // size = period (size cannot be > period)
                pointer = pointer % period;
                sum -= window[pointer];
                window[pointer++] = number;
            }
        }

        public double getMovingAverage() {
            if (size == 0) {
                throw new IllegalStateException("average is undefined");
            }
            return sum / size;
        }
    }
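As a quick sanity check (a throwaway main, not part of the job), replaying the closing prices from the 3-day example above reproduces the table:

MovingAverage ma = new MovingAverage(3);
for (double close : new double[]{10, 18, 20, 30}) {
    ma.addNewNumber(close);
    System.out.println(ma.getMovingAverage()); // 10.0, 14.0, 16.0, 22.666...
}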
Implementation

To apply the moving average we implement a secondary sort: the mapper's output key is a composite of the natural key (the stock symbol, a string) and a secondary key (the time-series timestamp). For example, (AAPL, 2013-10-03) sorts before (AAPL, 2013-10-04), while both still reach the same reduce() call.

Represent each time-series data point as a (timestamp, double) pair:
public static class TimeSeriesData implements WritableComparable<TimeSeriesData>{
        private long timestamp;
        private double value;
        public TimeSeriesData(){

        }

        public long getTimestamp() {
            return timestamp;
        }

        public void setTimestamp(long timestamp) {
            this.timestamp = timestamp;
        }

        public double getValue() {
            return value;
        }

        public void setValue(double value) {
            this.value = value;
        }

        public void set(long timestamp,double value){
            this.timestamp=timestamp;
            this.value=value;
        }

        @Override
        public String toString() {
            return "TimeSeriesData{" +
                    "timestamp=" + timestamp +
                    ", value=" + value +
                    '}';
        }

        public int compareTo(TimeSeriesData o) {
            if (this.timestamp < o.timestamp) {
                return -1;
            } else if (this.timestamp > o.timestamp) {
                return 1;
            } else {
                return 0;
            }
        }

        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeLong(timestamp);
            dataOutput.writeDouble(value);
        }

        public void readFields(DataInput dataInput) throws IOException {
            this.timestamp = dataInput.readLong();
            this.value = dataInput.readDouble();
        }
    }
Define a custom composite key of (string, timestamp):
public static class CompositeKey implements WritableComparable<CompositeKey>{
        private String name;
        private long timestamp;
        public CompositeKey(){

        }
        public void set(String name,long timestamp){
            this.name=name;
            this.timestamp=timestamp;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public long getTimestamp() {
            return timestamp;
        }

        public void setTimestamp(long timestamp) {
            this.timestamp = timestamp;
        }

        public int compareTo(CompositeKey o) {
            if(this.name.compareTo(o.name)!=0){
                return this.name.compareTo(o.name);
            }else if(this.timestamp!=o.timestamp){
                return timestamp>o.timestamp?1:-1;
            }else{
                return 0;
            }
        }

        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeUTF(this.name);
            dataOutput.writeLong(this.timestamp);
        }

        public void readFields(DataInput dataInput) throws IOException {
            this.name=dataInput.readUTF();
            this.timestamp=dataInput.readLong();
        }
    }

The CompositeKey class must be sorted during the shuffle phase on both the "name" and "timestamp" fields, so we next provide a class that compares composite-key objects; its job is essentially to implement the compare() method.

Define the sort order for CompositeKey:
public static class CompositeKeyComparator extends WritableComparator{
        protected CompositeKeyComparator(){
            super(CompositeKey.class,true);
        }
        public int compare(WritableComparable w1,WritableComparable w2){
            CompositeKey key1=(CompositeKey) w1;
            CompositeKey key2=(CompositeKey) w2;
            int comparison = key1.getName().compareTo(key2.getName());
            if (comparison == 0) {
                // Same symbol: order by timestamp, ascending.
                return Long.compare(key1.getTimestamp(), key2.getTimestamp());
            } else {
                return comparison;
            }
        }
    }

With the composite-key sort order defined, the NaturalKeyPartitioner class, which extends the Partitioner base class, partitions the key space produced by the mappers so that all records for a given stock symbol reach the same reducer.

Partitioner code:
public class NaturalKeyPartitioner extends Partitioner<CompositeKey, TimeSeriesData> {
        @Override
        public int getPartition(CompositeKey key, TimeSeriesData value,
                                int numberOfPartitions) {
            return Math.abs((int) (hash(key.getName()) % numberOfPartitions));
        }

        static long hash(String str) {
            long h = 1125899906842597L; // prime
            int length = str.length();
            for (int i = 0; i < length; i++) {
                h = 31 * h + str.charAt(i);
            }
            return h;
        }
    }
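A quick way to convince yourself the partitioner behaves (a hypothetical standalone check, not part of the job): two keys that share a name but differ in timestamp must land in the same partition, otherwise the secondary sort falls apart.

NaturalKeyPartitioner partitioner = new NaturalKeyPartitioner();
CompositeKey k1 = new CompositeKey();
k1.set("GOOG", 1L); // timestamps intentionally differ
CompositeKey k2 = new CompositeKey();
k2.set("GOOG", 2L);
// getPartition only looks at the name, so both calls return the same index.
System.out.println(partitioner.getPartition(k1, null, 4)
        == partitioner.getPartition(k2, null, 4)); // true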

Next, the plug-in class NaturalKeyGroupingComparator is used during Hadoop's shuffle phase to group composite keys by their natural-key part (the name) alone. For example, the composite keys sort as (AAPL, 2013-10-03) < (AAPL, 2013-10-04) < (GOOG, 2004-11-02), but this comparator treats the two AAPL keys as equal, so one reduce() call receives both AAPL values, already in timestamp order.

GroupingComparator code:
public static class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(CompositeKey.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey key1 = (CompositeKey) w1;
            CompositeKey key2 = (CompositeKey) w2;
            return key1.getName().compareTo(key2.getName());
        }
    }
A basic date-conversion utility:
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateUtil {

    static final String DATE_FORMAT = "yyyy-MM-dd";
    static final SimpleDateFormat SIMPLE_DATE_FORMAT =
            new SimpleDateFormat(DATE_FORMAT);

    public static Date getDate(String dateAsString)  {
        try {
            return SIMPLE_DATE_FORMAT.parse(dateAsString);
        }
        catch(Exception e) {
            return null;
        }
    }

    public static long getDateAsMilliSeconds(Date date) throws Exception {
        return date.getTime();
    }

    public static long getDateAsMilliSeconds(String dateAsString) throws Exception {
        Date date = getDate(dateAsString);
        return date.getTime();
    }

    public static String getDateAsString(long timestamp) {
        return SIMPLE_DATE_FORMAT.format(timestamp);
    }

}
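A quick round-trip check (illustrative only): parse a date string to epoch milliseconds and format it back. One caveat worth knowing: the shared static SimpleDateFormat is not thread-safe, which is acceptable here because each task parses dates on its own single thread.

long ts = DateUtil.getDateAsMilliSeconds("2013-10-04");
System.out.println(DateUtil.getDateAsString(ts)); // 2013-10-04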
Mapper code

The mapper splits each input line, extracts a CompositeKey and a TimeSeriesData, and emits <CompositeKey, TimeSeriesData> pairs. For example, the line GOOG,2004-11-04,184.70 becomes a (CompositeKey(GOOG, t), TimeSeriesData(t, 184.70)) pair, where t is the parsed timestamp.

public static class MovingAverageMapper extends
            Mapper<LongWritable, Text, CompositeKey, TimeSeriesData> {
        private final CompositeKey reducerKey = new CompositeKey();
        private final TimeSeriesData reducerValue = new TimeSeriesData();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if ((line == null) || (line.length() == 0)) {
                return;
            }
            String[] tokens = line.split(",");
            if (tokens.length == 3) {
                Date date = DateUtil.getDate(tokens[1]);
                if (date == null) {
                    return;
                }
                long timestamp = date.getTime();
                reducerKey.set(tokens[0], timestamp);
                reducerValue.set(timestamp, Double.parseDouble(tokens[2]));
                context.write(reducerKey, reducerValue);
            }
        }
    }
Reducer code:
public static class MovingAverageReducer extends Reducer<CompositeKey, TimeSeriesData, Text, Text> {
        int windowSize = 5;

        protected void reduce(CompositeKey key, Iterable<TimeSeriesData> values,
                              Context context) throws IOException, InterruptedException {
            Text outputKey = new Text();
            Text outputValue = new Text();
            MovingAverage ma = new MovingAverage(this.windowSize);
            for (TimeSeriesData data : values) {
                ma.addNewNumber(data.getValue());
                Double movingAverage = ma.getMovingAverage();
                long timestamp = data.getTimestamp();
                String dateAsString = DateUtil.getDateAsString(timestamp);
                outputValue.set(dateAsString + "," + movingAverage);
                outputKey.set(key.getName());
                context.write(outputKey, outputValue);
            }
        }
    }
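The window size is hardcoded to 5 here. A minimal sketch of making it configurable, assuming a made-up configuration key "moving.average.window.size" that the driver would set via conf.setInt(...):

public static class ConfigurableMovingAverageReducer
        extends Reducer<CompositeKey, TimeSeriesData, Text, Text> {
    private int windowSize = 5; // fallback when the key is not set

    @Override
    protected void setup(Context context) {
        // "moving.average.window.size" is a hypothetical key for this sketch.
        windowSize = context.getConfiguration()
                .getInt("moving.average.window.size", 5);
    }
    // reduce() is identical to MovingAverageReducer above, using this.windowSize.
}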
The complete code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Date;

public class SimpleMovingAverage {
    public static class MovingAverage {

        private double sum = 0.0;
        private final int period;
        private double[] window = null;
        private int pointer = 0;
        private int size = 0;

        public MovingAverage(int period) {
            if (period < 1) {
                throw new IllegalArgumentException("period must be > 0");
            }
            this.period = period;
            window = new double[period];
        }

        public void addNewNumber(double number) {
            sum += number;
            if (size < period) {
                window[pointer++] = number;
                size++;
            }
            else {
                pointer = pointer % period;
                sum -= window[pointer];
                window[pointer++] = number;
            }
        }

        public double getMovingAverage() {
            if (size == 0) {
                throw new IllegalStateException("average is undefined");
            }
            return sum / size;
        }
    }
    public static class CompositeKey implements WritableComparable<CompositeKey>{
        private String name;
        private long timestamp;
        public CompositeKey(){

        }
        public void set(String name,long timestamp){
            this.name=name;
            this.timestamp=timestamp;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public long getTimestamp() {
            return timestamp;
        }

        public void setTimestamp(long timestamp) {
            this.timestamp = timestamp;
        }

        public int compareTo(CompositeKey o) {
            if(this.name.compareTo(o.name)!=0){
                return this.name.compareTo(o.name);
            }else if(this.timestamp!=o.timestamp){
                return timestamp>o.timestamp?1:-1;
            }else{
                return 0;
            }
        }

        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeUTF(this.name);
            dataOutput.writeLong(this.timestamp);
        }

        public void readFields(DataInput dataInput) throws IOException {
            this.name=dataInput.readUTF();
            this.timestamp=dataInput.readLong();
        }
    }

    public static class TimeSeriesData implements WritableComparable<TimeSeriesData>{
        private long timestamp;
        private double value;
        public TimeSeriesData(){

        }

        public long getTimestamp() {
            return timestamp;
        }

        public void setTimestamp(long timestamp) {
            this.timestamp = timestamp;
        }

        public double getValue() {
            return value;
        }

        public void setValue(double value) {
            this.value = value;
        }

        public void set(long timestamp,double value){
            this.timestamp=timestamp;
            this.value=value;
        }

        @Override
        public String toString() {
            return "TimeSeriesData{" +
                    "timestamp=" + timestamp +
                    ", value=" + value +
                    '}';
        }

        public int compareTo(TimeSeriesData o) {
            if(this.timestamp<o.timestamp){
                return -1;
            }else if(this.timestamp>o.timestamp){
                return 1;
            }else{
                return 0;
            }
        }

        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeLong(timestamp);
            dataOutput.writeDouble(value);
        }

        public void readFields(DataInput dataInput) throws IOException {
            this.timestamp=dataInput.readLong();
            this.value=dataInput.readDouble();
        }
    }

    public static class CompositeKeyComparator extends WritableComparator{
        protected CompositeKeyComparator(){
            super(CompositeKey.class,true);
        }
        public int compare(WritableComparable w1,WritableComparable w2){
            CompositeKey key1=(CompositeKey) w1;
            CompositeKey key2=(CompositeKey) w2;
            int comparison = key1.getName().compareTo(key2.getName());
            if (comparison == 0) {
                // Same symbol: order by timestamp, ascending.
                return Long.compare(key1.getTimestamp(), key2.getTimestamp());
            } else {
                return comparison;
            }
        }
    }
    public static class NaturalKeyPartitioner extends Partitioner<CompositeKey, TimeSeriesData> {
        @Override
        public int getPartition(CompositeKey key, TimeSeriesData value,
                                int numberOfPartitions) {
            return Math.abs((int) (hash(key.getName()) % numberOfPartitions));
        }

      
        static long hash(String str) {
            long h = 1125899906842597L; // prime
            int length = str.length();
            for (int i = 0; i < length; i++) {
                h = 31 * h + str.charAt(i);
            }
            return h;
        }
    }
    public static class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(CompositeKey.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey key1 = (CompositeKey) w1;
            CompositeKey key2 = (CompositeKey) w2;
            return key1.getName().compareTo(key2.getName());
        }
    }

    public static class MovingAverageMapper extends
            Mapper<LongWritable, Text, CompositeKey, TimeSeriesData> {
        private final CompositeKey reducerKey = new CompositeKey();
        private final TimeSeriesData reducerValue = new TimeSeriesData();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if ((line == null) || (line.length() == 0)) {
                return;
            }
            String[] tokens = line.split(",");
            if (tokens.length == 3) {
                Date date = DateUtil.getDate(tokens[1]);
                if (date == null) {
                    return;
                }
                long timestamp = date.getTime();
                reducerKey.set(tokens[0], timestamp);
                reducerValue.set(timestamp, Double.parseDouble(tokens[2]));
                context.write(reducerKey, reducerValue);
            }
        }
    }
    public static class MovingAverageReducer extends Reducer<CompositeKey, TimeSeriesData, Text, Text> {
        int windowSize = 5;

        protected void reduce(CompositeKey key, Iterable<TimeSeriesData> values,
                              Context context) throws IOException, InterruptedException {
            Text outputKey = new Text();
            Text outputValue = new Text();
            MovingAverage ma = new MovingAverage(this.windowSize);
            for (TimeSeriesData data : values) {
                ma.addNewNumber(data.getValue());
                Double movingAverage = ma.getMovingAverage();
                long timestamp = data.getTimestamp();
                String dateAsString = DateUtil.getDateAsString(timestamp);
                outputValue.set(dateAsString + "," + movingAverage);
                outputKey.set(key.getName());
                context.write(outputKey, outputValue);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // FileUtil is a project-local helper that clears the local output directory.
        FileUtil.deleteDir("output");
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "SimpleMovingAverage");
        job.setJarByClass(SimpleMovingAverage.class);
        job.setMapperClass(MovingAverageMapper.class);
        job.setReducerClass(MovingAverageReducer.class);
        job.setMapOutputKeyClass(CompositeKey.class);
        job.setMapOutputValueClass(TimeSeriesData.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setPartitionerClass(NaturalKeyPartitioner.class);
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);
        job.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(job, new Path("input/file.txt"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true)?0:1);
    }
}