Moving Averages with MapReduce (Stock Prices as an Example)
Basic Concepts
Time-Series Data
Time-series data records the values of a variable over a period of time.
Moving Average
Let A be an ordered sequence of values:

A = (a_1, a_2, a_3, \dots, a_N)
A can also be written as

\{a_i\}_{i=1}^{N}
The n-term moving average of A is a new sequence

\{S_i\}_{i=1}^{N-n+1}

obtained by taking the arithmetic mean of each n-term subsequence:

S_i = \frac{1}{n} \sum_{j=i}^{i+n-1} a_j
Basic Example
Closing-price time series

Time order | Date | Closing price |
---|---|---|
1 | 2013-10-01 | 10 |
2 | 2013-10-02 | 18 |
3 | 2013-10-03 | 20 |
4 | 2013-10-04 | 30 |
3-day moving average of the closing prices (the first n-1 entries average only the values available so far)

Time order | Date | Moving average | How it is computed |
---|---|---|---|
1 | 2013-10-01 | 10.00 | (10)/1 |
2 | 2013-10-02 | 14.00 | (10+18)/2 |
3 | 2013-10-03 | 16.00 | (10+18+20)/3 |
4 | 2013-10-04 | 22.67 | (18+20+30)/3 |
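The arithmetic in the table can be checked with a short standalone snippet (plain Java, not part of the MapReduce job; the class name is illustrative):

```java
public class MovingAverageTableCheck {

    // n-day moving average; the first n-1 entries average only the values seen so far
    static double[] movingAverage(double[] prices, int window) {
        double[] out = new double[prices.length];
        double sum = 0.0;
        for (int i = 0; i < prices.length; i++) {
            sum += prices[i];
            if (i >= window) {
                sum -= prices[i - window]; // drop the value that left the window
            }
            out[i] = sum / Math.min(i + 1, window);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] avg = movingAverage(new double[]{10, 18, 20, 30}, 3);
        // Same values as the table: 10.00 14.00 16.00 22.67
        System.out.printf("%.2f %.2f %.2f %.2f%n", avg[0], avg[1], avg[2], avg[3]);
    }
}
```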
Sketch of a MapReduce Moving-Average Solution
Sample input
GOOG,2004-11-04,184.70
GOOG,2004-11-03,191.67
GOOG,2004-11-02,194.87
AAPL,2013-10-9,486.59
AAPL,2013-10-8,480.94
AAPL,2013-10-7,487.75
AAPL,2013-10-4,483.03
AAPL,2013-10-3,483.41
IBM,2013-09-30,185.18
IBM,2013-09-30,186.92
IBM,2013-09-30,190.22
IBM,2013-09-30,189.47
GOOG,2013-07-19,896.60
GOOG,2013-07-18,910.68
GOOG,2013-07-17,918.55
Sample output (computed with window size 5)
AAPL 2013-10-03,483.41
AAPL 2013-10-04,483.22
AAPL 2013-10-07,484.73
AAPL 2013-10-08,483.7825
AAPL 2013-10-09,484.34400000000005
GOOG 2004-11-02,194.87
GOOG 2004-11-03,193.26999999999998
GOOG 2004-11-04,190.41333333333333
GOOG 2013-07-17,372.4475
GOOG 2013-07-18,480.09399999999994
GOOG 2013-07-19,620.4399999999999
IBM 2013-09-30,186.92
IBM 2013-09-30,188.57
IBM 2013-09-30,188.87
IBM 2013-09-30,187.9475
With the moving-average algorithm understood, the MapReduce solution only needs to group the data by stock symbol, sort each group's values by timestamp, and finally apply the moving average.
A moving-average implementation backed by an array used as a circular queue:
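Before bringing in Hadoop, those three steps (group by symbol, sort by date, apply the window) can be sketched in ordinary in-memory Java; InMemorySketch and its method names are illustrative:

```java
import java.util.*;

public class InMemorySketch {
    // Group "SYMBOL,DATE,PRICE" rows by symbol, sort each group by date,
    // then compute the n-day moving average per symbol.
    public static Map<String, List<Double>> movingAverages(List<String> rows, int window) {
        Map<String, TreeMap<String, Double>> bySymbol = new TreeMap<>();
        for (String row : rows) {
            String[] t = row.split(",");
            // A TreeMap keyed by the date string sorts correctly for yyyy-MM-dd
            bySymbol.computeIfAbsent(t[0], k -> new TreeMap<>())
                    .put(t[1], Double.parseDouble(t[2]));
        }
        Map<String, List<Double>> result = new TreeMap<>();
        for (Map.Entry<String, TreeMap<String, Double>> e : bySymbol.entrySet()) {
            List<Double> out = new ArrayList<>();
            Deque<Double> win = new ArrayDeque<>();
            double sum = 0.0;
            for (double price : e.getValue().values()) {
                win.addLast(price);
                sum += price;
                if (win.size() > window) sum -= win.removeFirst();
                out.add(sum / win.size());
            }
            result.put(e.getKey(), out);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "GOOG,2004-11-04,184.70", "GOOG,2004-11-03,191.67", "GOOG,2004-11-02,194.87");
        // Matches the first three GOOG lines of the sample output above
        System.out.println(movingAverages(rows, 3).get("GOOG"));
    }
}
```

The MapReduce version below distributes exactly these steps: the partitioner does the grouping, the shuffle does the sorting, and the reducer applies the window.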
public class MovingAverage {
    private double sum = 0.0;
    private final int period;
    private double[] window = null;
    private int pointer = 0;
    private int size = 0;

    public MovingAverage(int period) {
        if (period < 1) {
            throw new IllegalArgumentException("period must be > 0");
        }
        this.period = period;
        window = new double[period];
    }

    public void addNewNumber(double number) {
        sum += number;
        if (size < period) {
            window[pointer++] = number;
            size++;
        } else {
            // size == period (size cannot exceed period):
            // overwrite the oldest value in the circular buffer
            pointer = pointer % period;
            sum -= window[pointer];
            window[pointer++] = number;
        }
    }

    public double getMovingAverage() {
        if (size == 0) {
            throw new IllegalArgumentException("average is undefined");
        }
        return sum / size;
    }
}
Implementation
To perform a secondary sort for the moving average, the mapper's output key must combine the natural key (the stock symbol, a string) with the time-series timestamp.
Represent each time-series data point as a (timestamp, double) pair:
public static class TimeSeriesData implements WritableComparable<TimeSeriesData> {
    private long timestamp;
    private double value;

    public TimeSeriesData() {
    }

    public long getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    public double getValue() {
        return value;
    }

    public void setValue(double value) {
        this.value = value;
    }

    public void set(long timestamp, double value) {
        this.timestamp = timestamp;
        this.value = value;
    }

    @Override
    public String toString() {
        return "TimeSeriesData{" +
                "timestamp=" + timestamp +
                ", value=" + value +
                '}';
    }

    public int compareTo(TimeSeriesData o) {
        if (this.timestamp < o.timestamp) {
            return -1;
        } else if (this.timestamp > o.timestamp) {
            return 1;
        } else {
            return 0;
        }
    }

    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(timestamp);
        dataOutput.writeDouble(value);
    }

    public void readFields(DataInput dataInput) throws IOException {
        this.timestamp = dataInput.readLong();
        this.value = dataInput.readDouble();
    }
}
Define a custom composite key of (string, timestamp):
public static class CompositeKey implements WritableComparable<CompositeKey> {
    private String name;
    private long timestamp;

    public CompositeKey() {
    }

    public void set(String name, long timestamp) {
        this.name = name;
        this.timestamp = timestamp;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public long getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    public int compareTo(CompositeKey o) {
        if (this.name.compareTo(o.name) != 0) {
            return this.name.compareTo(o.name);
        } else if (this.timestamp != o.timestamp) {
            return timestamp > o.timestamp ? 1 : -1;
        } else {
            return 0;
        }
    }

    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(this.name);
        dataOutput.writeLong(this.timestamp);
    }

    public void readFields(DataInput dataInput) throws IOException {
        this.name = dataInput.readUTF();
        this.timestamp = dataInput.readLong();
    }
}
During the shuffle phase, CompositeKey objects must be sorted on both the "name" and "timestamp" fields, so the next step is a comparator class whose main job is to provide the compare() method.
Define the sort order for CompositeKey:
public static class CompositeKeyComparator extends WritableComparator {
    protected CompositeKeyComparator() {
        super(CompositeKey.class, true);
    }

    public int compare(WritableComparable w1, WritableComparable w2) {
        CompositeKey key1 = (CompositeKey) w1;
        CompositeKey key2 = (CompositeKey) w2;
        int comparison = key1.getName().compareTo(key2.getName());
        if (comparison != 0) {
            return comparison;
        }
        // Same symbol: order ascending by timestamp
        return Long.compare(key1.getTimestamp(), key2.getTimestamp());
    }
}
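The effect of this composite sort order can be illustrated with plain Java (the Comparator and the String[] pair representation here are illustrative, not the Writable types used by the job):

```java
import java.util.*;

public class CompositeOrderDemo {
    // Natural key (name) first, then the timestamp as a tiebreaker
    static final Comparator<String[]> COMPOSITE = (a, b) -> {
        int byName = a[0].compareTo(b[0]);
        return byName != 0 ? byName
                           : Long.compare(Long.parseLong(a[1]), Long.parseLong(b[1]));
    };

    public static void main(String[] args) {
        List<String[]> keys = new ArrayList<>(Arrays.asList(
                new String[]{"GOOG", "300"},
                new String[]{"AAPL", "200"},
                new String[]{"GOOG", "100"},
                new String[]{"AAPL", "100"}));
        keys.sort(COMPOSITE);
        // Sorted: AAPL 100, AAPL 200, GOOG 100, GOOG 300 --
        // grouped by name, ascending by timestamp within each name
        for (String[] k : keys) System.out.println(k[0] + " " + k[1]);
    }
}
```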
With the composite-key sort order defined, a NaturalKeyPartitioner class (extending Partitioner) partitions the mapper's key space so that all records for one symbol reach the same reducer.
Partitioner code:
public class NaturalKeyPartitioner extends Partitioner<CompositeKey, TimeSeriesData> {
    @Override
    public int getPartition(CompositeKey key, TimeSeriesData value,
                            int numberOfPartitions) {
        // Partition on the natural key only, ignoring the timestamp
        return Math.abs((int) (hash(key.getName()) % numberOfPartitions));
    }

    static long hash(String str) {
        long h = 1125899906842597L; // prime seed
        int length = str.length();
        for (int i = 0; i < length; i++) {
            h = 31 * h + str.charAt(i);
        }
        return h;
    }
}
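Because the partition is computed from the symbol alone, every record of a series lands in the same partition no matter what its timestamp is. A standalone check (PartitionDemo is my own class name; hash() is copied from the partitioner above):

```java
public class PartitionDemo {
    // Same polynomial hash as NaturalKeyPartitioner, seeded with a large prime
    static long hash(String str) {
        long h = 1125899906842597L;
        for (int i = 0; i < str.length(); i++) {
            h = 31 * h + str.charAt(i);
        }
        return h;
    }

    static int partition(String symbol, int numPartitions) {
        return Math.abs((int) (hash(symbol) % numPartitions));
    }

    public static void main(String[] args) {
        // Deterministic: the same symbol always maps to the same partition,
        // which is what keeps each time series together on one reducer.
        System.out.println(partition("AAPL", 4) == partition("AAPL", 4));
        System.out.println(partition("GOOG", 4));
    }
}
```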
Next, the plug-in class NaturalKeyGroupingComparator makes Hadoop's shuffle phase group composite keys by their natural-key part (the symbol) only, so one reduce() call sees a symbol's whole series:
GroupingComparator code:
public static class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CompositeKey key1 = (CompositeKey) w1;
        CompositeKey key2 = (CompositeKey) w2;
        return key1.getName().compareTo(key2.getName());
    }
}
A basic date-conversion utility:
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateUtil {
    static final String DATE_FORMAT = "yyyy-MM-dd";
    // Note: SimpleDateFormat is not thread-safe; a shared static instance is
    // acceptable here only because each map/reduce task is single-threaded.
    static final SimpleDateFormat SIMPLE_DATE_FORMAT =
            new SimpleDateFormat(DATE_FORMAT);

    public static Date getDate(String dateAsString) {
        try {
            return SIMPLE_DATE_FORMAT.parse(dateAsString);
        } catch (Exception e) {
            return null; // unparsable dates are treated as missing
        }
    }

    public static long getDateAsMilliSeconds(Date date) throws Exception {
        return date.getTime();
    }

    public static long getDateAsMilliSeconds(String dateAsString) throws Exception {
        Date date = getDate(dateAsString);
        return date.getTime();
    }

    public static String getDateAsString(long timestamp) {
        return SIMPLE_DATE_FORMAT.format(timestamp);
    }
}
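This round trip through epoch milliseconds is what lets a date live inside CompositeKey as a sortable long. A quick standalone sanity check (DateUtilDemo is my own name, reproducing the conversion DateUtil performs):

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateUtilDemo {
    static String roundTrip(String dateAsString) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        long millis = fmt.parse(dateAsString).getTime(); // date string -> epoch millis
        return fmt.format(new Date(millis));             // epoch millis -> date string
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("2013-10-01")); // prints 2013-10-01
    }
}
```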
Mapper code
The mapper splits each input line on commas, builds a CompositeKey and a TimeSeriesData, and emits <CompositeKey, TimeSeriesData> pairs:
public static class MovingAverageMapper extends
        Mapper<LongWritable, Text, CompositeKey, TimeSeriesData> {
    private final CompositeKey reducerKey = new CompositeKey();
    private final TimeSeriesData reducerValue = new TimeSeriesData();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if ((line == null) || (line.length() == 0)) {
            return;
        }
        String[] tokens = line.split(",");
        if (tokens.length == 3) {
            Date date = DateUtil.getDate(tokens[1]);
            if (date == null) {
                return;
            }
            long timestamp = date.getTime();
            reducerKey.set(tokens[0], timestamp);
            reducerValue.set(timestamp, Double.parseDouble(tokens[2]));
            context.write(reducerKey, reducerValue);
        }
    }
}
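The mapper's input guard (skip blank lines and lines without exactly three fields) can be exercised standalone; MapperParseDemo and accepted() are illustrative names reproducing the mapper's checks:

```java
public class MapperParseDemo {
    // Same acceptance test the mapper applies to each CSV line
    static boolean accepted(String line) {
        if (line == null || line.isEmpty()) {
            return false;
        }
        return line.split(",").length == 3;
    }

    public static void main(String[] args) {
        System.out.println(accepted("GOOG,2004-11-04,184.70")); // true
        System.out.println(accepted(""));                       // false: blank line
        System.out.println(accepted("GOOG,184.70"));            // false: missing date
    }
}
```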
Reducer code
public static class MovingAverageReducer extends Reducer<CompositeKey, TimeSeriesData, Text, Text> {
    int windowSize = 5;

    protected void reduce(CompositeKey key, Iterable<TimeSeriesData> values,
                          Context context) throws IOException, InterruptedException {
        Text outputKey = new Text();
        Text outputValue = new Text();
        MovingAverage ma = new MovingAverage(this.windowSize);
        // Values arrive already sorted by timestamp (secondary sort)
        for (TimeSeriesData data : values) {
            ma.addNewNumber(data.getValue());
            double movingAverage = ma.getMovingAverage();
            long timestamp = data.getTimestamp();
            String dateAsString = DateUtil.getDateAsString(timestamp);
            outputValue.set(dateAsString + "," + movingAverage);
            outputKey.set(key.getName());
            context.write(outputKey, outputValue);
        }
    }
}
The complete code (DateUtil from above is assumed to be in the same package):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Date;

public class simpleMovingAverage {
    public static class MovingAverage {
        private double sum = 0.0;
        private final int period;
        private double[] window = null;
        private int pointer = 0;
        private int size = 0;

        public MovingAverage(int period) {
            if (period < 1) {
                throw new IllegalArgumentException("period must be > 0");
            }
            this.period = period;
            window = new double[period];
        }

        public void addNewNumber(double number) {
            sum += number;
            if (size < period) {
                window[pointer++] = number;
                size++;
            } else {
                pointer = pointer % period;
                sum -= window[pointer];
                window[pointer++] = number;
            }
        }

        public double getMovingAverage() {
            if (size == 0) {
                throw new IllegalArgumentException("average is undefined");
            }
            return sum / size;
        }
    }

    public static class CompositeKey implements WritableComparable<CompositeKey> {
        private String name;
        private long timestamp;

        public CompositeKey() {
        }

        public void set(String name, long timestamp) {
            this.name = name;
            this.timestamp = timestamp;
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public long getTimestamp() {
            return timestamp;
        }

        public void setTimestamp(long timestamp) {
            this.timestamp = timestamp;
        }

        public int compareTo(CompositeKey o) {
            if (this.name.compareTo(o.name) != 0) {
                return this.name.compareTo(o.name);
            } else if (this.timestamp != o.timestamp) {
                return timestamp > o.timestamp ? 1 : -1;
            } else {
                return 0;
            }
        }

        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeUTF(this.name);
            dataOutput.writeLong(this.timestamp);
        }

        public void readFields(DataInput dataInput) throws IOException {
            this.name = dataInput.readUTF();
            this.timestamp = dataInput.readLong();
        }
    }

    public static class TimeSeriesData implements WritableComparable<TimeSeriesData> {
        private long timestamp;
        private double value;

        public TimeSeriesData() {
        }

        public long getTimestamp() {
            return timestamp;
        }

        public void setTimestamp(long timestamp) {
            this.timestamp = timestamp;
        }

        public double getValue() {
            return value;
        }

        public void setValue(double value) {
            this.value = value;
        }

        public void set(long timestamp, double value) {
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public String toString() {
            return "TimeSeriesData{" +
                    "timestamp=" + timestamp +
                    ", value=" + value +
                    '}';
        }

        public int compareTo(TimeSeriesData o) {
            if (this.timestamp < o.timestamp) {
                return -1;
            } else if (this.timestamp > o.timestamp) {
                return 1;
            } else {
                return 0;
            }
        }

        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeLong(timestamp);
            dataOutput.writeDouble(value);
        }

        public void readFields(DataInput dataInput) throws IOException {
            this.timestamp = dataInput.readLong();
            this.value = dataInput.readDouble();
        }
    }
    public static class CompositeKeyComparator extends WritableComparator {
        protected CompositeKeyComparator() {
            super(CompositeKey.class, true);
        }

        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey key1 = (CompositeKey) w1;
            CompositeKey key2 = (CompositeKey) w2;
            int comparison = key1.getName().compareTo(key2.getName());
            if (comparison != 0) {
                return comparison;
            }
            // Same symbol: order ascending by timestamp
            return Long.compare(key1.getTimestamp(), key2.getTimestamp());
        }
    }
    public static class NaturalKeyPartitioner extends Partitioner<CompositeKey, TimeSeriesData> {
        @Override
        public int getPartition(CompositeKey key, TimeSeriesData value,
                                int numberOfPartitions) {
            return Math.abs((int) (hash(key.getName()) % numberOfPartitions));
        }

        static long hash(String str) {
            long h = 1125899906842597L; // prime seed
            int length = str.length();
            for (int i = 0; i < length; i++) {
                h = 31 * h + str.charAt(i);
            }
            return h;
        }
    }

    public static class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() {
            super(CompositeKey.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            CompositeKey key1 = (CompositeKey) w1;
            CompositeKey key2 = (CompositeKey) w2;
            return key1.getName().compareTo(key2.getName());
        }
    }

    public static class MovingAverageMapper extends
            Mapper<LongWritable, Text, CompositeKey, TimeSeriesData> {
        private final CompositeKey reducerKey = new CompositeKey();
        private final TimeSeriesData reducerValue = new TimeSeriesData();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if ((line == null) || (line.length() == 0)) {
                return;
            }
            String[] tokens = line.split(",");
            if (tokens.length == 3) {
                Date date = DateUtil.getDate(tokens[1]);
                if (date == null) {
                    return;
                }
                long timestamp = date.getTime();
                reducerKey.set(tokens[0], timestamp);
                reducerValue.set(timestamp, Double.parseDouble(tokens[2]));
                context.write(reducerKey, reducerValue);
            }
        }
    }

    public static class MovingAverageReducer extends Reducer<CompositeKey, TimeSeriesData, Text, Text> {
        int windowSize = 5;

        protected void reduce(CompositeKey key, Iterable<TimeSeriesData> values,
                              Context context) throws IOException, InterruptedException {
            Text outputKey = new Text();
            Text outputValue = new Text();
            MovingAverage ma = new MovingAverage(this.windowSize);
            for (TimeSeriesData data : values) {
                ma.addNewNumber(data.getValue());
                double movingAverage = ma.getMovingAverage();
                long timestamp = data.getTimestamp();
                String dateAsString = DateUtil.getDateAsString(timestamp);
                outputValue.set(dateAsString + "," + movingAverage);
                outputKey.set(key.getName());
                context.write(outputKey, outputValue);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        // Delete any previous output directory so the job can be rerun
        org.apache.hadoop.fs.FileUtil.fullyDelete(new java.io.File("output"));
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "simpleMovingAverage");
        job.setJarByClass(simpleMovingAverage.class);
        job.setMapperClass(MovingAverageMapper.class);
        job.setReducerClass(MovingAverageReducer.class);
        job.setMapOutputKeyClass(CompositeKey.class);
        job.setMapOutputValueClass(TimeSeriesData.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setPartitionerClass(NaturalKeyPartitioner.class);
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        job.setSortComparatorClass(CompositeKeyComparator.class);
        job.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(job, new Path("input/file.txt"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}