MapReduce: Smart Email Marketing with a Markov Model (Part 4)
In this post we continue from the state sequences generated in the previous post, MapReduce: Smart Email Marketing with a Markov Model (Part 3), and use them to build the Markov state transition matrix.
Generating the Markov state transition matrix with MapReduce
The goal of this MapReduce stage is to generate a Markov state transition matrix. The input to this stage is the state sequences, in the following format:
customer-id,State_1,State_2,...,State_n
The output is an N × N matrix, where N is the number of states in the Markov chain model (here N = 9). Each entry of the matrix indicates the probability of transitioning from one state to another. The main purpose of this MapReduce stage is to count the instances of each state transition; since N = 9, there can be up to 81 distinct transitions.
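Before diving into the MapReduce code, here is a minimal plain-Java sketch of what the job as a whole computes: a table of transition counts indexed by ("from" state, "to" state). The three states and two sample sequences below are made up for illustration; the real model uses the 9 states defined in the earlier posts.

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the end product of this stage: an N x N table of
// transition counts, built here in memory from two toy sequences.
public class CountMatrixSketch {
    public static int[][] countTransitions(List<String[]> sequences, List<String> states) {
        int n = states.size();
        int[][] counts = new int[n][n];
        for (String[] seq : sequences) {
            for (int i = 0; i < seq.length - 1; i++) {
                // row = "from" state, column = "to" state
                counts[states.indexOf(seq[i])][states.indexOf(seq[i + 1])]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> states = List.of("A", "B", "C");
        List<String[]> sequences = List.of(
                new String[]{"A", "B", "B", "C"},
                new String[]{"B", "C", "A"});
        // prints [[0, 1, 0], [0, 1, 2], [1, 0, 0]]
        System.out.println(Arrays.deepToString(countTransitions(sequences, states)));
    }
}
```

The MapReduce jobs below compute exactly these counts, but distributed: the mapper emits one record per transition, and the combiner/reducer sum them up.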
Mapper task
This step processes the state transitions: for each customer's state sequence it takes every adjacent pair of a "from" state and a "to" state and emits a <(State_i, State_{i+1}), 1> key-value pair. (Note that the customer-id itself is not part of the emitted key; only the state pair is, as the mapper code below shows.)
Mapper code
package com.deng.MarkovState;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MarkovStateTransitionModelMapper extends Mapper<LongWritable, Text, PairOfStrings, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line format: customer-id,State_1,State_2,...,State_n
        String[] items = value.toString().split(",");
        // Need at least two states; items[0] is the customer-id
        if (items.length > 2) {
            for (int i = 1; i < items.length - 1; i++) {
                // Emit each adjacent ("from" state, "to" state) pair with a count of 1
                context.write(new PairOfStrings(items[i], items[i + 1]), ONE);
            }
        }
    }
}
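To make the loop bounds concrete, here is a plain-Java sketch (no Hadoop required) of the same pair extraction. The state names in the sample line are made up for illustration; note that the loop starts at index 1 to skip the customer-id and stops one short of the end so that `items[i + 1]` is always valid.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the mapper's inner loop: given one input line
// "customer-id,State_1,...,State_n", collect every adjacent (from, to) pair.
public class TransitionPairs {
    public static List<String> extractPairs(String line) {
        List<String> pairs = new ArrayList<>();
        String[] items = line.split(",");
        // items[0] is the customer-id, so state pairs start at index 1
        if (items.length > 2) {
            for (int i = 1; i < items.length - 1; i++) {
                pairs.add(items[i] + "->" + items[i + 1]);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // A sequence of 4 states yields 3 transitions
        // prints [SL->SE, SE->ML, ML->LG]
        System.out.println(extractPairs("cust-42,SL,SE,ML,LG"));
    }
}
```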
The custom PairOfStrings class is defined as follows:
package com.deng.MarkovState;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// WritableComparable already extends Writable, so implementing it alone is enough
public class PairOfStrings implements WritableComparable<PairOfStrings> {
    private String leftElement;
    private String rightElement;

    public PairOfStrings() {
    }

    public PairOfStrings(String leftElement, String rightElement) {
        set(leftElement, rightElement);
    }

    public void set(String leftElement, String rightElement) {
        this.leftElement = leftElement;
        this.rightElement = rightElement;
    }

    public String getLeftElement() {
        return leftElement;
    }

    public String getRightElement() {
        return rightElement;
    }

    @Override
    public int compareTo(PairOfStrings o) {
        // Compare the left elements first; fall back to the right elements.
        // (The original version compared the right elements with !=, which
        // tests reference equality on Strings rather than their contents.)
        int cmp = this.leftElement.compareTo(o.leftElement);
        if (cmp != 0) {
            return cmp;
        }
        return this.rightElement.compareTo(o.rightElement);
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof PairOfStrings)) {
            return false;
        }
        PairOfStrings other = (PairOfStrings) obj;
        return leftElement.equals(other.leftElement) && rightElement.equals(other.rightElement);
    }

    @Override
    public int hashCode() {
        // Required so that equal keys hash to the same partition/reducer;
        // the default Object.hashCode would scatter identical pairs.
        return leftElement.hashCode() * 31 + rightElement.hashCode();
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(leftElement);
        dataOutput.writeUTF(rightElement);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.leftElement = dataInput.readUTF();
        this.rightElement = dataInput.readUTF();
    }

    @Override
    public String toString() {
        return "PairOfStrings[" + getLeftElement() + "," + getRightElement() + "]";
    }
}
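The write/readFields contract is the heart of any Writable: whatever is written to the DataOutput must be read back in the same order and format. The following standalone round-trip demo shows the same writeUTF/readUTF mechanics without any Hadoop dependency; the state names are placeholders.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal round-trip demo of the serialization scheme PairOfStrings uses:
// a pair written field-by-field with writeUTF can be read back with readUTF
// in the same order.
public class PairRoundTrip {
    public static byte[] write(String left, String right) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(left);   // same field order as PairOfStrings.write()
        out.writeUTF(right);
        return bytes.toByteArray();
    }

    public static String[] read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        // same field order as PairOfStrings.readFields()
        return new String[]{in.readUTF(), in.readUTF()};
    }

    public static void main(String[] args) throws IOException {
        String[] pair = read(write("StateA", "StateB"));
        System.out.println(pair[0] + "," + pair[1]); // prints StateA,StateB
    }
}
```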
Combiner task
The combiner locally aggregates the mapper output on each node to reduce the amount of data shuffled across the network. It produces a partial count for each ("from" state, "to" state) pair, i.e. <(State_1, State_2), count> key-value pairs.
Combiner code
package com.deng.MarkovState;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MarkovStateTransitionModelCombiner extends Reducer<PairOfStrings, IntWritable, PairOfStrings, IntWritable> {
    @Override
    protected void reduce(PairOfStrings key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mappers on this node into a partial count
        int partialSum = 0;
        for (IntWritable value : values) {
            partialSum += value.get();
        }
        context.write(key, new IntWritable(partialSum));
    }
}
Reducer task
Sum all of the counts for each ("from" state, "to" state) pair into a final total.
Reducer code
package com.deng.MarkovState;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MarkovStateTransitionModelReducer extends Reducer<PairOfStrings, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(PairOfStrings key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the partial counts produced by the combiners into a final count
        int finalCount = 0;
        for (IntWritable value : values) {
            finalCount += value.get();
        }
        // Output one line per transition: fromState,toState,<tab>count
        String outputKey = key.getLeftElement() + "," + key.getRightElement() + ",";
        context.write(new Text(outputKey), new IntWritable(finalCount));
    }
}
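Conceptually, the shuffle guarantees that all partial counts for the same pair key reach the same reduce call, where they are summed. The following sketch mimics that aggregation in memory; the pair keys and partial counts below are invented sample values.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of what shuffle + reduce accomplish: partial counts for the same
// (from,to) key (e.g. from combiners on different nodes) are summed into
// one final count per transition.
public class TransitionCounts {
    public static Map<String, Integer> sum(List<Map.Entry<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> e : partials) {
            totals.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> partials = List.of(
                Map.entry("S1,S2", 3),  // partial count from combiner on node 1
                Map.entry("S1,S2", 2),  // partial count from combiner on node 2
                Map.entry("S2,S1", 4));
        System.out.println(sum(partials)); // prints {S1,S2=5, S2,S1=4}
    }
}
```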
The complete driver for all stages, implemented as a job chain, is shown below:
package com.deng.MarkovState;

import com.deng.util.FileUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MarkovStateDriver {
    public static void main(String[] args) throws Exception {
        // Clear the output directories of all three jobs
        FileUtil.deleteDirs("output");
        FileUtil.deleteDirs("output2");
        FileUtil.deleteDirs("MarkovState");
        Configuration conf = new Configuration();
        String[] otherArgs = new String[]{"input/smart_email_training.txt", "output"};

        // Job 1: secondary sort, producing each customer's state sequence
        Job secondSortJob = Job.getInstance(conf, "Markov");
        FileInputFormat.setInputPaths(secondSortJob, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(secondSortJob, new Path(otherArgs[1]));
        secondSortJob.setJarByClass(MarkovStateDriver.class);
        secondSortJob.setMapperClass(SecondarySortProjectionMapper.class);
        secondSortJob.setReducerClass(SecondarySortProjectionReducer.class);
        secondSortJob.setMapOutputKeyClass(CompositeKey.class);
        secondSortJob.setMapOutputValueClass(PairOfLongInt.class);
        secondSortJob.setOutputKeyClass(NullWritable.class);
        secondSortJob.setOutputValueClass(Text.class);
        secondSortJob.setCombinerKeyGroupingComparatorClass(CompositeKeyComparator.class);
        secondSortJob.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
        if (secondSortJob.waitForCompletion(true)) {
            // Job 2: map-only job turning the sorted records into state sequences
            Job stateTransition = Job.getInstance(conf, "MarkovStateTransition");
            FileInputFormat.setInputPaths(stateTransition, new Path("output/part-r-00000"));
            FileOutputFormat.setOutputPath(stateTransition, new Path("output2"));
            stateTransition.setJarByClass(MarkovStateDriver.class);
            stateTransition.setMapperClass(StateTrainitionMapper.class);
            stateTransition.setNumReduceTasks(0);
            stateTransition.setOutputKeyClass(Text.class);
            stateTransition.setOutputValueClass(Text.class);
            if (stateTransition.waitForCompletion(true)) {
                // Job 3: count the state-transition instances
                Job markovState = Job.getInstance(conf, "MarkovState");
                markovState.setJarByClass(MarkovStateDriver.class);
                markovState.setMapperClass(MarkovStateTransitionModelMapper.class);
                // Register the combiner defined above so the partial counts
                // actually cut down the shuffle traffic
                markovState.setCombinerClass(MarkovStateTransitionModelCombiner.class);
                markovState.setReducerClass(MarkovStateTransitionModelReducer.class);
                // markovState.setPartitionerClass(MarkovStateTransitionModelPartitioner.class);
                // markovState.setNumReduceTasks(81);
                markovState.setMapOutputKeyClass(PairOfStrings.class);
                markovState.setMapOutputValueClass(IntWritable.class);
                markovState.setOutputKeyClass(Text.class);
                markovState.setOutputValueClass(IntWritable.class);
                // markovState.setCombinerKeyGroupingComparatorClass(MarkovStateKeyComparator.class);
                FileInputFormat.setInputPaths(markovState, new Path("output2/part-m-00000"));
                FileOutputFormat.setOutputPath(markovState, new Path("MarkovState"));
                System.exit(markovState.waitForCompletion(true) ? 0 : 1);
            }
        }
    }
}
The run results are as follows.
This gives us the count of each state-transition instance. The next post will cover how to generate the Markov model (the transition probabilities) from these counts.