#Boxuegu IT Learning Technical Support#
On sorting in TopN problems
Case:
We have cumulative COVID-19 data for each US county as of 2021-01-28, covering confirmed cases and deaths. The data format is as follows:
2021-01-28,Juneau City and Borough,Alaska,02110,1108,3
2021-01-28,Kenai Peninsula Borough,Alaska,02122,3866,18
2021-01-28,Ketchikan Gateway Borough,Alaska,02130,272,1
2021-01-28,Kodiak Island Borough,Alaska,02150,1021,5
2021-01-28,Kusilvak Census Area,Alaska,02158,1099,3
2021-01-28,Lake and Peninsula Borough,Alaska,02164,5,0
2021-01-28,Matanuska-Susitna Borough,Alaska,02170,7406,27
2021-01-28,Nome Census Area,Alaska,02180,307,0
2021-01-28,North Slope Borough,Alaska,02185,973,3
2021-01-28,Northwest Arctic Borough,Alaska,02188,567,1
2021-01-28,Petersburg Borough,Alaska,02195,43,0
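The fields are: date, county, state, fips (county code), cases (cumulative confirmed cases), deaths (cumulative deaths).
Requirement: for 2021-01-28, find the 3 counties with the most confirmed cases in each US state. This is a Top 3 problem.
We implement this with MapReduce. The idea is to group records by state name and, within each state, sort by confirmed cases in descending order. This requires defining a JavaBean that implements the WritableComparable interface and overrides compareTo. It also requires a custom grouping rule so that records with the same state name fall into the same group.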
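To make the sort-then-group idea concrete before the Hadoop code, here is a minimal plain-Java sketch that runs without a cluster. It is only an illustration under stated assumptions: TopNSketch, Row, and topNPerState are hypothetical names introduced here, Row stands in for the CovidBean described above, and an in-memory sort stands in for the shuffle-phase sort. The sample rows are taken from the data shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch; TopNSketch and Row are hypothetical names, not the article's classes.
class TopNSketch {

    // Stand-in for CovidBean, reduced to the fields the ordering needs.
    static final class Row implements Comparable<Row> {
        final String county;
        final String state;
        final int cases;

        Row(String county, String state, int cases) {
            this.county = county;
            this.state = state;
            this.cases = cases;
        }

        // Sort key: state ascending, then cases descending within a state.
        @Override
        public int compareTo(Row other) {
            int byState = state.compareTo(other.state);
            if (byState != 0) {
                return byState;
            }
            return Integer.compare(other.cases, cases);
        }
    }

    // Returns, for each state, the county names of the n rows with the most cases.
    static Map<String, List<String>> topNPerState(List<String> lines, int n) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            // Input line: date,county,state,fips,cases,deaths
            String[] f = line.split(",");
            rows.add(new Row(f[1], f[2], Integer.parseInt(f[4])));
        }
        Collections.sort(rows); // plays the role of the shuffle-phase sort

        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Row row : rows) {
            // Group by state only, mirroring the custom grouping rule.
            List<String> top = result.computeIfAbsent(row.state, k -> new ArrayList<>());
            if (top.size() < n) {
                top.add(row.county); // rows arrive largest-first, so keep the first n
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "2021-01-28,Juneau City and Borough,Alaska,02110,1108,3",
            "2021-01-28,Kenai Peninsula Borough,Alaska,02122,3866,18",
            "2021-01-28,Kusilvak Census Area,Alaska,02158,1099,3",
            "2021-01-28,Matanuska-Susitna Borough,Alaska,02170,7406,27",
            "2021-01-28,Nome Census Area,Alaska,02180,307,0");
        // prints {Alaska=[Matanuska-Susitna Borough, Kenai Peninsula Borough, Juneau City and Borough]}
        System.out.println(topNPerState(sample, 3));
    }
}
```

Because the rows of one state arrive already sorted by cases in descending order, taking the top 3 is just a matter of keeping the first three rows of each group. That is the reason for putting the ordering into compareTo instead of sorting inside the reducer.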
The Mapper class is as follows:
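import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CovidTopMapper
        extends Mapper<LongWritable, Text, CovidBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input line: date,county,state,fips,cases,deaths
        String[] splitArr = value.toString().split(",");
        CovidBean covidBean = new CovidBean();
        covidBean.setDate(splitArr[0]);
        covidBean.setCounty(splitArr[1]);
        covidBean.setState(splitArr[2]);
        covidBean.setCode(splitArr[3]);
        covidBean.setCases(Integer.parseInt(splitArr[4]));
        // The deaths column may be missing; default to 0.
        covidBean.setDeaths(splitArr.length > 5 ? Integer.parseInt(splitArr[5]) : 0);
        // The bean is the output key, so records are sorted by its compareTo; no value is needed.
        context.write(covidBean, NullWritable.get());
    }
}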
The Reducer class is as follows:
public class CovidTopReducer
        extends Reducer<CovidBean, NullWritable, CovidBean, NullWritable> {
    @Override
    protected void reduce(CovidBean key,