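Overall idea
Using the Local Resource mechanism (the distributed cache), the small table is read from HDFS into memory before the Map task starts. The mapper then reads the large table from HDFS record by record, matches each record against the in-memory small table, and writes out the joined result.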
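The idea above is an in-memory hash join with a build phase and a probe phase. The following is a minimal sketch in plain Java, without any Hadoop classes, meant only to illustrate the two phases. The class name HashJoinSketch, the methods buildLookup and probe, and the sample rows are all made up for illustration and are not part of the original job.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HashJoinSketch {
    // Build phase: load the small table (key -> attributes) into memory.
    static Map<String, String> buildLookup(List<String[]> smallTable) {
        Map<String, String> lookup = new HashMap<>();
        for (String[] row : smallTable) {
            lookup.put(row[0], row[1]);
        }
        return lookup;
    }

    // Probe phase: stream the large table and keep only rows whose key is in the lookup.
    static List<String> probe(Map<String, String> lookup, List<String[]> largeTable) {
        List<String> joined = new ArrayList<>();
        for (String[] row : largeTable) {
            String match = lookup.get(row[0]);
            if (match != null) {
                joined.add(row[0] + "\t" + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: {key, attributes}.
        List<String[]> small = new ArrayList<>();
        small.add(new String[] {"p1", "Brooklyn"});
        small.add(new String[] {"p2", "Chicago"});

        List<String[]> large = new ArrayList<>();
        large.add(new String[] {"p1", "519.00"});
        large.add(new String[] {"p3", "25.00"}); // no match, dropped
        large.add(new String[] {"p1", "60.00"});

        for (String line : probe(buildLookup(small), large)) {
            System.out.println(line);
        }
    }
}
```

In the MapReduce version, the build phase corresponds to setup(), which runs once per map task, and the probe phase corresponds to map(), which runs once per large-table record.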
Basic implementation
1) Distribute the small table stored on HDFS through Local Resource.
job.addCacheFile(new URI("data/donation-project-small/opendata_projects_small"));
2) Create a HashMap in the Mapper. In setup(), read the small table from the Local Resource and load it into the in-memory HashMap.
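// In-memory HashMap holding the small table (project id -> selected project fields).
// DELIMITER is defined outside this excerpt.
HashMap<String, String> donproj = new HashMap<String, String>();

@Override
protected void setup(Context context)
        throws IOException, InterruptedException {
    // Open the Local Resource file by its name in the task's working directory;
    // try-with-resources closes the reader once loading is done.
    try (BufferedReader fs = new BufferedReader(new FileReader("opendata_projects_small"))) {
        String line;
        while ((line = fs.readLine()) != null) { // read the file line by line
            if (line.contains("_projectid"))     // skip the header line
                continue;
            String[] words = line.split("\",\"");
            // Skip short rows and rows missing any field used in the join output.
            if (words.length < 12 || words[6].isEmpty() || words[8].isEmpty() || words[11].isEmpty())
                continue;
            String str1 = words[0].replaceAll("\"", "");
            String str2 = words[6] + DELIMITER + words[8] + DELIMITER + words[11];
            donproj.put(str1, str2); // store the small-table row in the HashMap
        }
    }
}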
3) In map(), read the large table from HDFS line by line and match each record against the HashMap.
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // outputKey and outputValue are reusable output objects declared outside this excerpt.
    String str = value.toString();
    if (str.contains("_projectid")) // skip the header line
        return;
    String[] words = str.split("\",\"");
    // Skip short rows and rows missing any field used in the join output.
    if (words.length < 12 || words[4].isEmpty() || words[5].isEmpty() || words[11].isEmpty())
        return;
    String projectId = words[1].replaceAll("\"", "");
    String project = donproj.get(projectId); // look up the large-table key in the HashMap
    if (project != null) {                   // emit only when the key matches
        outputKey.set(projectId);
        outputValue.set(words[4] + DELIMITER + words[5] + DELIMITER + words[11] + DELIMITER + project);
        context.write(outputKey, outputValue);
    }
}
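Both setup() and map() split each fully quoted CSV line on the three-character sequence `","` and then strip leftover quotes from the key field. The sketch below isolates that parsing step in plain Java so it can be tried on its own. The class QuotedCsvSplitSketch and the helpers splitQuoted and hasColumns are hypothetical names, and the sketch assumes every field is quoted and that no field contains an embedded quote.

```java
class QuotedCsvSplitSketch {
    // Split a fully quoted CSV line on the "," sequence, then remove the quote
    // characters left on the first and last fields. The -1 limit keeps
    // trailing empty fields instead of silently dropping them.
    static String[] splitQuoted(String line) {
        String[] fields = line.split("\",\"", -1);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].replace("\"", "");
        }
        return fields;
    }

    // Return true only if every requested column exists and is non-empty.
    static boolean hasColumns(String[] fields, int... required) {
        for (int index : required) {
            if (index >= fields.length || fields[index].isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical three-column row with an empty middle field.
        String[] fields = splitQuoted("\"p1\",\"\",\"Brooklyn\"");
        System.out.println(fields.length);
        System.out.println(hasColumns(fields, 0, 2));
        System.out.println(hasColumns(fields, 0, 1));
    }
}
```

Checking the array length before indexing avoids an ArrayIndexOutOfBoundsException on truncated or malformed lines, which would otherwise fail the whole map task.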
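Related issues
1. The small table must fit in heap memory. Building the HashMap in the mapper allocates space on the JVM heap. On this cluster heapsize=1000MB, so a small table larger than 1000 MB makes the MapReduce job run out of memory. In practice the limit is lower, because HashMap entries and String objects take more space than the raw file. The reported error is:
Error: GC overhead limit exceeded
2. Map-side join is suitable only for inner joins and left joins (with the large, streamed table as the left side).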
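The restriction on join types follows from what a single map task can see: it holds the whole small table but only one split of the large table. The sketch below, in plain Java, shows that both an inner join and a left outer join can be decided row by row. The class JoinTypeSketch, the method joinSplit, the "NULL" placeholder, and the sample rows are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinTypeSketch {
    // Placeholder written for a large-table row that has no match (hypothetical choice).
    static final String MISSING = "NULL";

    // Join one split of the large table against the in-memory small table.
    // leftOuter=false drops unmatched rows (inner join);
    // leftOuter=true keeps them with a placeholder (left outer join).
    static List<String> joinSplit(Map<String, String> small,
                                  List<String[]> largeSplit,
                                  boolean leftOuter) {
        List<String> out = new ArrayList<>();
        for (String[] row : largeSplit) {
            String match = small.get(row[0]);
            if (match != null) {
                out.add(row[0] + "\t" + row[1] + "," + match);
            } else if (leftOuter) {
                out.add(row[0] + "\t" + row[1] + "," + MISSING);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("p1", "Brooklyn");
        small.put("p2", "Chicago"); // not referenced by this split

        List<String[]> split = new ArrayList<>();
        split.add(new String[] {"p1", "519.00"});
        split.add(new String[] {"p9", "25.00"});

        System.out.println(joinSplit(small, split, false)); // inner join
        System.out.println(joinSplit(small, split, true));  // left outer join
    }
}
```

A right or full outer join would also have to emit small-table rows that match nothing. In the sketch, the task cannot tell whether "p2" is truly unmatched, because another split of the large table may contain it. Answering that requires seeing all large-table keys together, which is what a reduce-side join does.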
Execution result
File System Counters
    FILE: Number of bytes read=13417657
    FILE: Number of bytes written=28377816
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1602284448
    HDFS: Number of bytes written=13082489
    HDFS: Number of read operations=39
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters
    Killed map tasks=1
    Launched map tasks=13
    Launched reduce tasks=1
    Data-local map tasks=13
    Total time spent by all maps in occupied slots (ms)=557269
    Total time spent by all reduces in occupied slots (ms)=4297
    Total time spent by all map tasks (ms)=557269
    Total time spent by all reduce tasks (ms)=4297
    Total vcore-milliseconds taken by all map tasks=557269
    Total vcore-milliseconds taken by all reduce tasks=4297
    Total megabyte-milliseconds taken by all map tasks=570643456
    Total megabyte-milliseconds taken by all reduce tasks=4400128
Map-Reduce Framework
    Map input records=4631337
    Map output records=167581
    Map output bytes=13082489
    Map output materialized bytes=13417723
    Input split bytes=1632
    Combine input records=0
    Combine output records=0
    Reduce input groups=69911
    Reduce shuffle bytes=13417723
    Reduce input records=167581
    Reduce output records=167581
    Spilled Records=335162
    Shuffled Maps =12
    Failed Shuffles=0
    Merged Map outputs=12
    GC time elapsed (ms)=253036
    CPU time spent (ms)=283660
    Physical memory (bytes) snapshot=3361980416
    Virtual memory (bytes) snapshot=10917888000
    Total committed heap usage (bytes)=2146959360
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=1602282816
File Output Format Counters
    Bytes Written=13082489
xuefei@xuefei:~/hdp/donation5$ hdfs dfs -cat output/donation5/p* |head
0001539fee6b098f6f196138a61a5556 West Linn,OR,519.00,Brooklyn,11210,Kings (Brooklyn)
0003977c1e69106679cde1aad78c04ec St. Clair Shores,MI,162.35,San Jose,95122,Santa Clara
00054bf9976e21324d0dc44eb7779832 New York,NY,737.65,Brooklyn,11206,Kings (Brooklyn)
00056f61e349ae92d9038f14a9793307 Cambridge,MA,3059.15,N Charleston,29406,Charleston
000678592ad8ece4dcc2f1651ee4fd9c Charlotte,NC,392.53,Morganton,28655,Burke
00077717327177b9e8f38f487cfa9fa7 Simi Valley,CA,144.59,Chatsworth,91311,Los Angeles
00077717327177b9e8f38f487cfa9fa7 Los Angeles,CA,153.05,Chatsworth,91311,Los Angeles
00092a0e6903e2e1dc61eddcb8a5049b Chicago,IL,25.00,Chicago,60647,Cook
00092a0e6903e2e1dc61eddcb8a5049b CHCIAGO,IL,50.00,Chicago,60647,Cook
00092a0e6903e2e1dc61eddcb8a5049b NEW YORK,NY,60.00,Chicago,60647,Cook