Version:
$ hadoop version
Hadoop 0.20.2-cdh3u4
Subversion git://ubuntu-slave01/var/lib/jenkins/workspace/CDH3u4-Full-RC/build/cdh3/hadoop20/0.20.2-cdh3u4/source -r 214dd731e3bdb687cb55988d3f47dd9e248c5690
Compiled by jenkins on Mon May 7 13:01:39 PDT 2012
From source with checksum a60c9795e41a3248b212344fb131c12c
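Problem description:
When Hadoop runs a Reducer, the values obtained from the Iterable<VALUEIN> are not preserved once the iteration moves on. The code is as follows: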
protected void reduce(Text key, Iterable<VectorWritable> values, Context context)
        throws IOException, InterruptedException {
    Map<String, VectorWritable> map = new HashMap<String, VectorWritable>();
    for (VectorWritable vw : values) {
        NamedVector nv = (NamedVector) vw.get();
        Item itemi = Item.toInstance(nv.getName());
        // Problem: vw is stored directly, without copying it first
        map.put(itemi.getItemID(), vw);
    }
}
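After the loop, every element in the Map is the same.
Cause:
Within a single reduce call, the value handed out on each iteration is one and the same instance in memory. The source is as follows: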
public class ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
        extends TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
    private RawKeyValueIterator input;
    private Counter inputKeyCounter;
    private Counter inputValueCounter;
    private RawComparator<KEYIN> comparator;
    private KEYIN key; // current key
    private VALUEIN value; // this is the instance in question
    .....................
}
Each call to next() updates this same value field and returns it:
@Override
public VALUEIN next() {
    // if this is the first record, we don't need to advance
    if (firstValue) {
        firstValue = false;
        return value;
    }
    // if this isn't the first record and the next key is different, they
    // can't advance it here.
    if (!nextKeyIsSame) {
        throw new NoSuchElementException("iterate past last value");
    }
    // otherwise, go to the next key/value pair
    try {
        nextKeyValue();
        return value;
    } catch (IOException ie) {
        throw new RuntimeException("next value iterator failed", ie);
    } catch (InterruptedException ie) {
        // this is bad, but we can't modify the exception list of java.util
        throw new RuntimeException("next value iterator interrupted", ie);
    }
}
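Since the Map holds references to that single instance, all of its entries show whatever value was read last. This is what causes the problem described above.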
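The effect can be reproduced without Hadoop. The following is a minimal plain-Java sketch, not Hadoop code: Holder and ReusingIterator are hypothetical names standing in for a Writable and for the reducer's value iterator. The only assumption carried over is that the iterator refills one shared object on every next().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReusedValueDemo {

    /** Hypothetical mutable holder, standing in for a Writable such as VectorWritable. */
    static final class Holder {
        String name;

        @Override
        public String toString() {
            return name;
        }
    }

    /** Hypothetical iterator that refills one shared Holder on each call to next(). */
    static final class ReusingIterator implements Iterator<Holder> {
        private final String[] records;
        private final Holder value = new Holder(); // the single shared instance
        private int pos = 0;

        ReusingIterator(String... records) {
            this.records = records;
        }

        @Override
        public boolean hasNext() {
            return pos < records.length;
        }

        @Override
        public Holder next() {
            if (!hasNext()) {
                throw new NoSuchElementException("iterate past last value");
            }
            value.name = records[pos++]; // overwrite the shared object in place
            return value;                // the same reference every time
        }

        @Override
        public void remove() {
            throw new UnsupportedOperationException();
        }
    }

    /** Stores the references handed out by the iterator without copying them. */
    static List<Holder> collectWithoutCopy(String... records) {
        List<Holder> out = new ArrayList<Holder>();
        Iterator<Holder> it = new ReusingIterator(records);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectWithoutCopy("a", "b", "c")); // prints [c, c, c]
    }
}
```

All three list entries point at the same Holder, so they all show the last record read.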
Solution:
protected void reduce(Text key, Iterable<VectorWritable> values, Context context)
        throws IOException, InterruptedException {
    Map<String, VectorWritable> map = new HashMap<String, VectorWritable>();
    for (VectorWritable vectorWritable : values) {
        // Copy the value before storing it, so each Map entry has its own instance
        VectorWritable vw = WritableUtils.clone(vectorWritable, context.getConfiguration());
        NamedVector nv = (NamedVector) vw.get();
        Item itemi = Item.toInstance(nv.getName());
        map.put(itemi.getItemID(), vw);
    }
}
Cloning each value before storing it is enough.
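To illustrate why a copy helps, here is a rough plain-Java sketch of copying an object by writing it out and reading it back into a fresh instance. It does not reproduce Hadoop's implementation of WritableUtils.clone. Record, copyOf and collectWithCopy are hypothetical names, and the write/readFields pair only imitates the shape of the Writable interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

class CloneByRoundTripDemo {

    /** Hypothetical stand-in for a Writable: it can write itself and read itself back. */
    static final class Record {
        String name = "";

        void write(DataOutputStream out) throws IOException {
            out.writeUTF(name);
        }

        void readFields(DataInputStream in) throws IOException {
            name = in.readUTF();
        }

        @Override
        public String toString() {
            return name;
        }
    }

    /** Copies src into a fresh Record by serializing it and deserializing the bytes. */
    static Record copyOf(Record src) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            src.write(out);
            out.flush();
            Record copy = new Record();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Refills one shared Record per name, but stores an independent copy each time. */
    static Map<String, Record> collectWithCopy(String... names) {
        Record shared = new Record(); // reused across iterations, like the reducer's value
        Map<String, Record> map = new LinkedHashMap<String, Record>();
        for (String n : names) {
            shared.name = n;
            map.put(n, copyOf(shared));
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(collectWithCopy("a", "b", "c")); // prints {a=a, b=b, c=c}
    }
}
```

Each entry now keeps the value it had when it was stored, at the cost of one extra object and one serialization round trip per value.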