HDPCD-Java --- Exam Test -- Map-Side Join

Question:

There are two folders in HDFS in the /user/horton folder: flightdelays and weather. These are comma-separated files that contain flight delay information for airports in the U.S. for the year 2008, along with the weather data from the San Francisco airport. Write and execute a Java MapReduce application that satisfies the following criteria:

  1. Join the flight delay data in flightdelays with the weather data in weather. Join the data by the day, month and year and also where the "Dest" column in flightdelays is equal to "SFO".
  2. The output of each delayed flight into SFO consists of the following fields:
    Year,Month,DayofMonth,DepTime,ArrTime,UniqueCarrier,FlightNum,
    ActualElapsedTime,ArrDelay,DepDelay,Origin,Dest,PRCP,TMAX,TMIN
    
    For example, for the date 2008-01-03, there is a delayed flight number 488 from Las Vegas (LAS) to San Francisco (SFO). The corresponding output would be:
    2008,1,3,1426,1605,WN,488,99,35,31,LAS,SFO,43,150,94
    
  3. The output is sorted by date ascending, and on each day the output is sorted by ArrDelay descending (so that the longest arrival delays appear first).
  4. The output is in text files in a new folder in HDFS named task1 with values separated by commas
  5. The output is in two text files.

Data:
-------------------------sfo_weather.csv ---------------------------------------
STATION_NAME,YEAR,MONTH,DAY,PRCP,TMAX,TMIN
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,01,0,122,39
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,02,0,117,39
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,03,43,150,94
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,04,533,150,100
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,05,196,122,78
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,06,15,106,50
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,07,0,111,67
SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,08,20,128,61
-------------------------flight_delays1.csv, flight_delays2.csv, flight_delays3.csv----------------------------------
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,
ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,
CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,628,620,804,750,WN,448,N428WN,96,90,76,14,8,IND,BWI,515,3,17,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,926,930,1054,1100,WN,1746,N612SW,88,90,78,-6,-4,IND,BWI,515,3,7,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,1829,1755,1959,1925,WN,3920,N464WN,90,90,77,34,34,IND,BWI,515,3,10,0,,0,2,0,0,0,32
2008,1,3,4,1940,1915,2121,2110,WN,378,N726SW,101,115,87,11,25,IND,JAX,688,4,10,0,,0,NA,NA,NA,NA,NA
2008,1,3,4,1937,1830,2037,1940,WN,509,N763SW,240,250,230,57,67,IND,LAS,1591,3,7,0,,0,10,0,0,0,47
2008,1,3,4,1039,1040,1132,1150,WN,535,N428WN,233,250,219,-18,-1,IND,LAS,1591,7,7,0,,0,NA,NA,NA,NA,NA

Understand:
    1. This is an inner join. The weather data is small enough to fit in memory, so a map-side join is a good fit.
        a. Add the weather data as a cache file and use day, month and year as the join key.
        b. Filter on the "Dest" column in flightdelays, keeping only the records where "Dest" equals "SFO".
        c. Read the destination ("SFO") from the arguments passed to main().
    2. Use Year,Month,DayofMonth as the key and DepTime,ArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,ArrDelay,DepDelay,Origin,Dest,PRCP,TMAX,TMIN as the value.
        a. DepTime,ArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,ArrDelay,DepDelay,Origin,Dest come from the flight_delays CSV files.
        b. PRCP,TMAX,TMIN come from sfo_weather.csv.
    3. To get the required sort order, extend the key to Year,Month,DayofMonth,ArrDelay, and write the records with a custom output format.
    4. The output directory is "task1"; the output files are text files with comma-separated fields.
    5. Use two reduce tasks so that the output ends up in two files.
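
Before looking at the Hadoop code, the join logic from steps 1 and 2 can be tried out in plain Java. The following is only a minimal sketch without any Hadoop classes: the names MapSideJoinSketch, loadWeather and join are made up for illustration, the "NA" filtering done by the real mapper is left out, and the SFO flight row in main() fills the columns that the output does not use with placeholder values, since the original example shows only the output line.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapSideJoinSketch {

    // Builds the in-memory lookup table: "year,month,day" -> "PRCP,TMAX,TMIN".
    public static Map<String, String> loadWeather(List<String> weatherLines) {
        Map<String, String> weatherByDate = new HashMap<>();
        for (String line : weatherLines) {
            String[] w = line.split(",");
            if (w.length < 7 || w[1].equals("YEAR")) {
                continue; // skip the header row and malformed lines
            }
            // Parse to int so that "01" in the weather file matches "1" in the flight file.
            String date = Integer.parseInt(w[1]) + "," + Integer.parseInt(w[2]) + ","
                    + Integer.parseInt(w[3]);
            weatherByDate.put(date, w[4] + "," + w[5] + "," + w[6]);
        }
        return weatherByDate;
    }

    // Joins one flight record against the lookup table. Returns null when the
    // record is filtered out or has no matching weather row (inner join).
    public static String join(String flightLine, String dest, Map<String, String> weatherByDate) {
        String[] f = flightLine.split(",", -1);
        if (f.length < 18 || f[0].equals("Year") || !f[17].trim().equals(dest)) {
            return null;
        }
        String date = Integer.parseInt(f[0]) + "," + Integer.parseInt(f[1]) + ","
                + Integer.parseInt(f[2]);
        String weather = weatherByDate.get(date);
        if (weather == null) {
            return null;
        }
        // DepTime, ArrTime, UniqueCarrier, FlightNum, ActualElapsedTime,
        // ArrDelay, DepDelay, Origin, Dest, followed by PRCP, TMAX, TMIN.
        return String.join(",", date, f[4], f[6], f[8], f[9], f[11],
                f[14], f[15], f[16], f[17], weather);
    }

    public static void main(String[] args) {
        Map<String, String> weather = loadWeather(Arrays.asList(
                "STATION_NAME,YEAR,MONTH,DAY,PRCP,TMAX,TMIN",
                "SAN FRANCISCO INTERNATIONAL AIRPORT CA US,2008,01,03,43,150,94"));
        List<String> flights = Arrays.asList(
                // Placeholder row: unused columns are set to 0 or X.
                "2008,1,3,4,1426,0,1605,0,WN,488,X,99,0,0,35,31,LAS,SFO,0,0,0,0,,0,0,0,0,0,0",
                // Row taken from the sample data; its Dest is LAS, so it is dropped.
                "2008,1,3,4,1937,1830,2037,1940,WN,509,N763SW,240,250,230,57,67,IND,LAS,1591,3,7,0,,0,10,0,0,0,47");
        for (String flight : flights) {
            String joined = join(flight, "SFO", weather);
            if (joined != null) {
                System.out.println(joined);
            }
        }
    }
}
```

In the real job, the lookup table is filled once per mapper in setup() and the per-record join happens in map(), so the flight data never has to be shuffled just to be matched with the weather data.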

Code:

package task1;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.StringUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Task1 extends Configured implements Tool{
	//private static final String STATION_NAME = "StationName";
	private static final String DESTINATION = "Destination";
	public static class MapSideJoinMapper extends Mapper<LongWritable, Text, DateDelay, DelayWeather> {
		private Map<Datee, Weather> map = new HashMap<Datee, Weather>();
		//private String stationName;
		private String destionation;
		@Override
		protected void setup(
				Mapper<LongWritable, Text, DateDelay, DelayWeather>.Context context)
				throws IOException, InterruptedException {
			//stationName = context.getConfiguration().get(STATION_NAME);
			destionation = context.getConfiguration().get(DESTINATION);
			// On YARN the cache file is symlinked into the task's working directory
			// under its own file name, so it can be opened as a local file.
			try (BufferedReader reader = new BufferedReader(new FileReader("sfo_weather.csv"))) {
				String line;
				while ((line = reader.readLine()) != null) {
					String[] wStr = StringUtils.split(line, '\\', ',');
					if (wStr.length < 7 || wStr[1].equals("YEAR")) {
						continue; // skip the header row and malformed lines
					}
					Datee datee = new Datee(Integer.parseInt(wStr[1]), Integer.parseInt(wStr[2]), Integer.parseInt(wStr[3]));
					Weather weather = new Weather(Integer.parseInt(wStr[4]), Integer.parseInt(wStr[5]), Integer.parseInt(wStr[6]));
					map.put(datee, weather);
				}
			}
		}
		@Override
		protected void map(LongWritable key, Text value,
				Mapper<LongWritable, Text, DateDelay, DelayWeather>.Context context)
				throws IOException, InterruptedException {
			String[] delays = StringUtils.split(value.toString(), '\\', ',');
			DateDelay dateDelay;
			Datee date;
			if (delays[0].equals("Year")) {
				return;
			}
			if (delays[17].trim().equals(destionation)) {
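				// Despite its name, replaceNAWithZero replaces nothing: it returns true
				// when any field is "NA", and such records are skipped. In the sample data
				// the delay-cause columns are "NA" for flights with little or no arrival
				// delay, so this check also drops those flights.
				boolean hasNA = Utils.replaceNAWithZero(delays);
				if (hasNA) {
					return;
				}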
				date = new Datee(Integer.parseInt(delays[0]), Integer.parseInt(delays[1]), Integer.parseInt(delays[2]));
				if (map.containsKey(date)) {
					dateDelay = new DateDelay(date, Integer.parseInt(delays[14]));
					FlightDelay flightDelay = new FlightDelay(Integer.parseInt(delays[4]), Integer.parseInt(delays[6]), delays[8], 
							Integer.parseInt(delays[9]), Integer.parseInt(delays[11]), Integer.parseInt(delays[14]),
							Integer.parseInt(delays[15]), delays[16], delays[17]);
					DelayWeather delayWeather = new DelayWeather();
					delayWeather.flightDelay = flightDelay;
					delayWeather.weather = map.get(date);
					context.write(dateDelay, delayWeather);
		        }
			}
		}
	}
	
	public static final class MapSideJoinReducer extends Reducer<DateDelay, DelayWeather, DateDelay, DelayWeather>{
		@Override
		protected void reduce(
				DateDelay key,
				Iterable<DelayWeather> values,
				Reducer<DateDelay, DelayWeather, DateDelay, DelayWeather>.Context context)
				throws IOException, InterruptedException {
			Iterator<DelayWeather> iterator = values.iterator();
			while (iterator.hasNext()) {
				context.write(key, iterator.next());
			}
		}
	}
	
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException, URISyntaxException {
		// Start with a non-zero exit code so that a thrown exception is reported as a failure.
		int result = 1;
		try {
			result = ToolRunner.run(new Configuration(), new Task1(), args);
		} catch (Exception e) {
			e.printStackTrace();
		}
		System.exit(result);
	}

	@Override
	public int run(String[] args) throws Exception {
		Job job = Job.getInstance(getConf(), "Task1");
		job.setJarByClass(getClass());
		Configuration conf = job.getConfiguration();
		//conf.set(STATION_NAME, args[0]);
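		if (args.length < 1) {
			System.err.println("Usage: Task1 <destination airport code, e.g. SFO>");
			return 2;
		}
		// The destination airport used as the filter in the mapper.
		conf.set(DESTINATION, args[0]);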
		job.addCacheFile(new URI("/user/horton/weather/sfo_weather.csv"));
		Path out = new Path("task1");
		out.getFileSystem(conf).delete(out, true);
		FileInputFormat.setInputPaths(job, new Path("/user/horton/flightdelays"));
		FileOutputFormat.setOutputPath(job, out);
		
		conf.set(TextOutputFormat.SEPERATOR, ",");
		
		job.setMapperClass(MapSideJoinMapper.class);
		//job.setGroupingComparatorClass(DateDelayComparator.class);
		job.setReducerClass(MapSideJoinReducer.class);
		job.setInputFormatClass(TextInputFormat.class);
		job.setOutputFormatClass(DelayFileOutputFormat.class);
		job.setMapOutputKeyClass(DateDelay.class);
		job.setMapOutputValueClass(DelayWeather.class);
		job.setOutputKeyClass(DateDelay.class);
		job.setOutputValueClass(DelayWeather.class);
		job.setNumReduceTasks(2);
		//job.setNumReduceTasks(0);
		return job.waitForCompletion(true) ? 0 : 1;
	}
}

package task1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Datee implements WritableComparable<Datee> {
	public int year;
	public int month;
	public int day;
	
	public Datee() {
	}
	
	public Datee(int year, int month, int day) {
		this.year = year;
		this.month = month;
		this.day = day;
	}
	
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(year);
		out.writeInt(month);
		out.writeInt(day);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		year = in.readInt();
		month = in.readInt();
		day = in.readInt();
	}

	@Override
	public int compareTo(Datee date) {
		int response = this.year - date.year;
		if (response == 0) {
			response = this.month - date.month;
		}
		if (response == 0) {
			response = this.day - date.day;
		}
		return response;
	}
	
	@Override
	public int hashCode() {
		return year + month + day;
	}
	
	@Override
	public boolean equals(Object obj) {
		if (obj instanceof Datee) {
			Datee date = (Datee) obj;
			if (year == date.year && month == date.month && day == date.day) {
				return true;
			}
		}
		return false;
	}

	public int getYear() {
		return year;
	}

	public void setYear(int year) {
		this.year = year;
	}

	public int getMonth() {
		return month;
	}

	public void setMonth(int month) {
		this.month = month;
	}

	public int getDay() {
		return day;
	}

	public void setDay(int day) {
		this.day = day;
	}
	
	@Override
	public String toString() {
		
		return this.year + "," + this.month + "," + this.day;
	}
	

}

package task1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Weather implements Writable {
	private int prcp;
	private int tMax;
	private int tMin;
	public Weather() {
	}
	public Weather(int prcp, int tMax, int tMin){
		this.prcp = prcp;
		this.tMax = tMax;
		this.tMin = tMin;
	}
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(prcp);
		out.writeInt(tMax);
		out.writeInt(tMin);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.prcp = in.readInt();
		this.tMax = in.readInt();
		this.tMin = in.readInt();
	}
	
	@Override
	public String toString() {
		return this.prcp + "," + this.tMax + "," + this.tMin;
	}

}

package task1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class DateDelay implements WritableComparable<DateDelay> {
	public Datee date;
	public int arriveDelay;
	
	public DateDelay() {
	}
	
	public DateDelay(Datee date, int arriveDelay) {
		this.date = date;
		this.arriveDelay = arriveDelay;
	}
	
	@Override
	public void write(DataOutput out) throws IOException {
		date.write(out);
		out.writeInt(arriveDelay);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		date = new Datee();
		date.readFields(in);
		arriveDelay = in.readInt();
	}

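	@Override
	public int compareTo(DateDelay dateDelay) {
		// Date ascending first, then arrival delay descending within the same date.
		int response = this.date.compareTo(dateDelay.date);
		if (response == 0) {
			response = Integer.compare(dateDelay.arriveDelay, this.arriveDelay);
		}
		return response;
	}

	@Override
	public int hashCode() {
		// Hash on the date only, so the default HashPartitioner sends all
		// flights of one day to the same reducer.
		return date.hashCode();
	}

	@Override
	public boolean equals(Object obj) {
		if (obj instanceof DateDelay) {
			DateDelay other = (DateDelay) obj;
			return date.equals(other.date) && arriveDelay == other.arriveDelay;
		}
		return false;
	}

	@Override
	public String toString() {
		return this.date + "," + this.arriveDelay;
	}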

}

package task1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlightDelay implements Writable{
	public int depTime;
	public int arrTime;
	public String uniqueCarrier;
	public int flightNum;
	public int actualElapsedTime;
	public int arrDelay;
	public int depDelay;
	public String origin;
	public String destionation;
	public FlightDelay() {
	}
	public FlightDelay(int depTime, int arrTime, String uniqueCarrier, int flightNum, 
			int actualElapsedTime, int arrDelay, int depDelay, String origin, String destionation) {
		this.depTime = depTime;
		this.arrTime = arrTime;
		this.uniqueCarrier = uniqueCarrier;
		this.flightNum = flightNum;
		this.actualElapsedTime = actualElapsedTime;
		this.arrDelay = arrDelay;
		this.depDelay = depDelay;
		this.origin = origin;
		this.destionation = destionation;
	}
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeInt(depTime);
		out.writeInt(arrTime);
        out.writeUTF(uniqueCarrier);
		out.writeInt(flightNum);
		out.writeInt(actualElapsedTime);
		out.writeInt(arrDelay);
		out.writeInt(depDelay);
        out.writeUTF(origin);
		out.writeUTF(destionation);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.depTime = in.readInt();
		this.arrTime = in.readInt();
		this.uniqueCarrier = in.readUTF();
		this.flightNum = in.readInt();
		this.actualElapsedTime = in.readInt();
		this.arrDelay = in.readInt();
		this.depDelay = in.readInt();
		this.origin = in.readUTF();
		this.destionation = in.readUTF();
		
	}
	
	@Override
	public String toString() {
		return this.depTime + "," + this.arrTime + "," + this.uniqueCarrier + "," + this.flightNum + "," + this.actualElapsedTime + "," + this.arrDelay
				 + "," + this.depDelay + "," + this.origin + "," + this.destionation;
	}

}

package task1;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class DelayWeather implements Writable {
	public FlightDelay flightDelay;
	public Weather weather;
	@Override
	public void write(DataOutput out) throws IOException {
		flightDelay.write(out);
		weather.write(out);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		flightDelay = new FlightDelay();
		weather = new Weather();
		flightDelay.readFields(in);
		weather.readFields(in);
	}

	@Override
	public String toString() {
		return this.flightDelay + "," + this.weather;
	}
}


package task1;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DelayFileOutputFormat extends FileOutputFormat<DateDelay, DelayWeather> {
	
	@Override
	public RecordWriter<DateDelay, DelayWeather> getRecordWriter(
			TaskAttemptContext job) throws IOException, InterruptedException {
		int partition = job.getTaskAttemptID().getTaskID().getId();
		Path outDir = FileOutputFormat.getOutputPath(job);
		// Resolve the file against the full output path, not only its last component.
		Path filename = new Path(outDir, job.getJobName() + "_" + partition);
		FileSystem fSystem = filename.getFileSystem(job.getConfiguration());
		// One file per reduce task, named <job name>_<partition>.
		FSDataOutputStream out = fSystem.create(filename);
		return new DelayRecordWriter(out);
	}

}

package task1;

import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class DelayRecordWriter extends RecordWriter<DateDelay, DelayWeather> {
	
	private DataOutputStream out;
	private final static String SEPERATOR = ",";
	public DelayRecordWriter(DataOutputStream out) {
		this.out = out;
	}

	@Override
	public void write(DateDelay key, DelayWeather value) throws IOException,
			InterruptedException {
		StringBuilder builder = new StringBuilder();
		builder.append(key.date);
		builder.append(SEPERATOR);
		builder.append(value);
		builder.append("\n");
		// Use an explicit charset instead of the platform default.
		out.write(builder.toString().getBytes("UTF-8"));
	}

	@Override
	public void close(TaskAttemptContext context) throws IOException,
			InterruptedException {
		out.close();
	}

}

package task1;



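public class Utils {
	/**
	 * Returns true if any field equals "NA" (case-insensitive). Despite the
	 * name, nothing is replaced; the caller uses the result to skip the record.
	 */
	public static boolean replaceNAWithZero(String[] strs){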
		if (strs == null || strs.length == 0) {
			return false;
		}
		for (int i = 0; i < strs.length; i++) {
			if (strs[i].trim().toUpperCase().equals("NA")) {
				return true;
			}
		}
		return false;
	}
}
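
The ordering that DateDelay.compareTo implements (date ascending, then ArrDelay descending) can also be checked without a cluster. This is a rough sketch in plain Java: SortOrderSketch, Key and sorted are hypothetical names used only here, Key is a stand-in for DateDelay rather than the class itself, and the delay values in main() are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortOrderSketch {

    /** Hypothetical stand-in for the DateDelay key: a date plus the arrival delay. */
    public static final class Key {
        final int year;
        final int month;
        final int day;
        final int arrDelay;

        public Key(int year, int month, int day, int arrDelay) {
            this.year = year;
            this.month = month;
            this.day = day;
            this.arrDelay = arrDelay;
        }

        @Override
        public String toString() {
            return year + "," + month + "," + day + "," + arrDelay;
        }
    }

    // Date ascending, then arrival delay descending within the same date.
    public static final Comparator<Key> ORDER =
            Comparator.<Key>comparingInt(k -> k.year)
                    .thenComparingInt(k -> k.month)
                    .thenComparingInt(k -> k.day)
                    .thenComparing((a, b) -> Integer.compare(b.arrDelay, a.arrDelay));

    // Returns the keys as text lines in the order a single reducer would see them.
    public static List<String> sorted(List<Key> keys) {
        List<Key> copy = new ArrayList<>(keys);
        copy.sort(ORDER);
        List<String> lines = new ArrayList<>();
        for (Key key : copy) {
            lines.add(key.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key(2008, 1, 4, 10),
                new Key(2008, 1, 3, 35),
                new Key(2008, 1, 3, 57),
                new Key(2008, 1, 3, -6));
        for (String line : sorted(keys)) {
            System.out.println(line);
        }
    }
}
```

In the job itself this sorting is done by the shuffle, because the map output key is a WritableComparable. Each reducer receives its share of the keys already in this order, so the reducer only has to write the values out.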

