A prerequisite for using DistributedCache is that one of the datasets being joined is small enough to fit in memory. Since our own code controls how the file is loaded into memory, we can also filter records during loading. Note, however, that if the file is large, loading it into memory is itself time-consuming.
DistributedCache works by replicating the small file to every node.
We call DistributedCache.addCacheFile() to register the file to be distributed, and then, in the mapper's configure() initialization method, call DistributedCache.getCacheFiles(conf) to retrieve the file and load it into memory.
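Because the loading in configure() is ordinary Java, filtering can happen right there. Below is a minimal plain-Java sketch of that load-and-filter step, with no Hadoop dependencies; the class name and the filter predicate are hypothetical examples, not part of the job above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class FilteredLoad {
    // Load "key,rest-of-record" lines into a map, keeping only rows that pass the filter.
    static Map<String, String> load(List<String> lines, Predicate<String> keep) {
        Map<String, String> table = new HashMap<>();
        for (String line : lines) {
            if (!keep.test(line)) continue;       // filter while loading
            String[] parts = line.split(",", 2);  // first field is the join key
            table.put(parts[0], parts[1]);
        }
        return table;
    }
}
```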
Environment: VMware 8.0 and Ubuntu 11.04
Step 1: Create a project named HadoopTest. The directory structure is shown in the figure below:
Step 2: In the /home/tanglg1987 directory, create a start.sh script. Each time the VM starts, it deletes everything under /tmp and reformats the namenode. The script is as follows:
sudo rm -rf /tmp/*
rm -rf /home/tanglg1987/hadoop-0.20.2/logs
hadoop namenode -format
start-all.sh
hadoop fs -mkdir input
hadoop dfsadmin -safemode leave
Step 3: Make start.sh executable and start the pseudo-distributed Hadoop cluster:
chmod 777 /home/tanglg1987/start.sh
./start.sh
The execution log is as follows:
Step 4: Upload the local files to HDFS
In /home/tanglg1987, create Orders.txt with the following content:
3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.00,22-Jan-2009
In /home/tanglg1987, create Customers.txt with the following content:
1,tom,555-555-5555
2,white,123-456-7890
3,jerry,281-330-4563
4,tanglg,408-555-0000
Upload the local files to HDFS:
hadoop fs -put /home/tanglg1987/Orders.txt input
hadoop fs -put /home/tanglg1987/Customers.txt input
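The job below reads its input with KeyValueTextInputFormat and "," as the separator, so each line splits at the first comma: the first field becomes the key and everything after it becomes the value. A plain-Java illustration of that split (this mimics the input format's behavior; it is not actual Hadoop code, and the class name is hypothetical):

```java
public class KeyValueSplit {
    // Split a line at the first occurrence of the separator,
    // the way KeyValueTextInputFormat derives key and value.
    static String[] split(String line, char sep) {
        int i = line.indexOf(sep);
        if (i < 0) return new String[] { line, "" };  // no separator: whole line is the key
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }
}
```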
Step 5: Create a new DistributedCacheJoin.java with the following code:
package com.baison.action;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Hashtable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class DistributedCacheJoin extends Configured implements Tool {
public static class MapClass extends MapReduceBase implements
Mapper<Text, Text, Text, Text> {
private Hashtable<String, String> joinData = new Hashtable<String, String>();
@Override
public void configure(JobConf conf) {
try {
URI[] cacheFiles =DistributedCache.getCacheFiles(conf);
if (cacheFiles != null && cacheFiles.length > 0) {
String line;
String[] tokens;
BufferedReader joinReader = new BufferedReader(
new FileReader(cacheFiles[0].toString()));
try {
while ((line = joinReader.readLine()) != null) {
tokens = line.split(",", 2); // key + remainder of the record
joinData.put(tokens[0], tokens[1]); // duplicate keys overwrite earlier entries
}
} finally {
joinReader.close();
}
}
} catch (IOException e) {
System.err.println("Exception reading DistributedCache: " + e);
}
}
public void map(Text key, Text value,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
String joinValue = joinData.get(key.toString());
if (joinValue != null) {
output.collect(key,
new Text(value.toString() + "," + joinValue));
}
}
}
public int run(String[] args) throws Exception {
Configuration conf = getConf();
DistributedCache.addCacheFile(new Path(args[0]).toUri(), conf);
JobConf job = new JobConf(conf, DistributedCacheJoin.class);
Path in = new Path(args[1]);
Path out = new Path(args[2]);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
job.setJobName("DataJoin with DistributedCache");
job.setMapperClass(MapClass.class);
job.setNumReduceTasks(0);
job.setInputFormat(KeyValueTextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
job.set("key.value.separator.in.input.line", ",");
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
String[] arg = { "/home/tanglg1987/test/Orders.txt","/home/tanglg1987/test/Customers.txt",
"hdfs://localhost:9100/user/tanglg1987/output" };
int res = ToolRunner.run(new Configuration(), new DistributedCacheJoin(), arg);
System.exit(res);
}
}
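The map-side logic above can be simulated in plain Java on the sample records, with no Hadoop at all. One thing the simulation makes visible: because the orders are cached in a Hashtable keyed by customer id, a customer with two orders (id 3 here) keeps only the last order loaded. This sketch and its class name are illustrative, not part of the job:

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

public class JoinSimulation {
    // Mimic MapClass: cache the orders by customer id, then stream customers past it.
    static List<String> join(List<String> orders, List<String> customers) {
        Hashtable<String, String> joinData = new Hashtable<>();
        for (String o : orders) {
            String[] t = o.split(",", 2);
            joinData.put(t[0], t[1]);            // duplicate keys overwrite earlier orders
        }
        List<String> out = new ArrayList<>();
        for (String c : customers) {
            String[] t = c.split(",", 2);
            String hit = joinData.get(t[0]);
            if (hit != null)                     // unmatched customers are dropped
                out.add(t[0] + "\t" + t[1] + "," + hit);
        }
        return out;
    }
}
```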
Step 6: Run on Hadoop. The run log is as follows:
12/10/22 22:07:52 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/10/22 22:07:52 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/10/22 22:07:52 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/22 22:07:53 INFO mapred.JobClient: Running job: job_local_0001
12/10/22 22:07:53 INFO mapred.FileInputFormat: Total input paths to process : 1
12/10/22 22:07:53 INFO mapred.MapTask: numReduceTasks: 0
12/10/22 22:07:53 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/10/22 22:07:53 INFO mapred.LocalJobRunner:
12/10/22 22:07:53 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/10/22 22:07:53 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to hdfs://localhost:9100/user/tanglg1987/output
12/10/22 22:07:53 INFO mapred.LocalJobRunner: file:/home/tanglg1987/test/Customers.txt:0+85
12/10/22 22:07:53 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
12/10/22 22:07:54 INFO mapred.JobClient: map 100% reduce 0%
12/10/22 22:07:54 INFO mapred.JobClient: Job complete: job_local_0001
12/10/22 22:07:54 INFO mapred.JobClient: Counters: 7
12/10/22 22:07:54 INFO mapred.JobClient: FileSystemCounters
12/10/22 22:07:54 INFO mapred.JobClient: FILE_BYTES_READ=16860
12/10/22 22:07:54 INFO mapred.JobClient: FILE_BYTES_WRITTEN=33908
12/10/22 22:07:54 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=121
12/10/22 22:07:54 INFO mapred.JobClient: Map-Reduce Framework
12/10/22 22:07:54 INFO mapred.JobClient: Map input records=4
12/10/22 22:07:54 INFO mapred.JobClient: Spilled Records=0
12/10/22 22:07:54 INFO mapred.JobClient: Map input bytes=85
12/10/22 22:07:54 INFO mapred.JobClient: Map output records=3
Step 7: View the result set. The output is as follows: