E-commerce Log Analysis in Practice
Project Requirements
Based on the e-commerce log file, analyze:
- Total page views (each line of the log counts as one view)
- Page views per province (requires resolving the IP address)
- ETL of the logs; this project needs to extract: ip, url, pageId (the page Id corresponding to topicId), country, province, city
This post covers problem 2.
Implementation Approach
- Extract: Same as in problem 1, the log file is uploaded to HDFS (the file already uploaded for problem 1).
- Transform:
Mapper: Read each line of the log, parse out the IP address, and look up the corresponding province, possibly via an external IP address library such as GeoIP (see the sketch after this list). Emit (province, 1) as a key-value pair, where 1 represents one page view.
Reducer: Sum the view counts for each province.
- Load: Store the result in HDFS, HBase, or Hive for later querying and analysis.
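The project itself resolves IPs with its own IPParser utility (used in the mapper below). If you instead wanted to use the MaxMind GeoIP2 library mentioned above, a minimal lookup could look roughly like this sketch; the com.maxmind.geoip2:geoip2 dependency, the GeoLite2-City.mmdb path, and the sample IP are assumptions, not part of the original project.
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;
import java.io.File;
import java.net.InetAddress;

public class GeoIpLookupSketch {
    public static void main(String[] args) throws Exception {
        // Assumed local path to the GeoLite2 city database downloaded from MaxMind
        File database = new File("/path/to/GeoLite2-City.mmdb");
        try (DatabaseReader reader = new DatabaseReader.Builder(database).build()) {
            // Resolve a sample IP to country / province (subdivision) / city
            CityResponse response = reader.city(InetAddress.getByName("8.8.8.8"));
            System.out.println(response.getCountry().getName() + " / "
                    + response.getMostSpecificSubdivision().getName() + " / "
                    + response.getCity().getName());
        }
    }
}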
Implementation Code
Add the Hadoop client dependency and the baomidou (MyBatis-Plus) dependency; the latter is used here only for its StringUtils helper.
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.2.0</version>
    </dependency>
    <dependency>
        <groupId>com.baomidou</groupId>
        <artifactId>mybatis-plus-boot-starter</artifactId>
        <version>3.4.1</version> <!-- replace with the latest version if desired -->
    </dependency>
</dependencies>
ProvinceViewMapper.java
import com.baomidou.mybatisplus.core.toolkit.StringUtils;
import com.bigdata.hadoop.project.utils.IPParser;
import com.bigdata.hadoop.project.utils.LogParser;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
import java.util.Map;
public class ProvinceViewMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text province = new Text();
    private final LogParser parser = new LogParser();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Split the input line on the \u0001 delimiter; the IP address is expected in the 14th field (index 13)
        String[] fields = value.toString().split("\u0001");
        if (fields.length > 13) {
            // Parse the raw log line into a field map (ip, url, pageId, ...)
            Map<String, String> logInfo = parser.parse(value.toString());
            if (StringUtils.isNotBlank(logInfo.get("ip"))) {
                // Resolve the IP to a region and emit (province, 1)
                IPParser.RegionInfo regionInfo = IPParser.getInstance().analyseIp(logInfo.get("ip"));
                String provinceName = regionInfo.getProvince();
                if (StringUtils.isNotBlank(provinceName)) {
                    province.set(provinceName);
                } else {
                    // IP resolved but no province information found
                    province.set("-");
                }
            } else {
                // No usable IP field in this log line
                province.set("+");
            }
            context.write(province, one);
        }
    }
}
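The LogParser and IPParser classes come from the project's own utils package (com.bigdata.hadoop.project.utils) and their source is not shown in this post. Purely as a rough idea of what the mapper relies on, a minimal LogParser consistent with the code above (IP sitting in the 14th \u0001-delimited field) might look like the following; this is an illustrative assumption, not the project's actual implementation.
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal sketch; the real project class also extracts url, pageId, country, province and city
public class LogParser {
    public Map<String, String> parse(String log) {
        Map<String, String> info = new HashMap<>();
        String[] fields = log.split("\u0001");
        if (fields.length > 13) {
            info.put("ip", fields[13]); // IP assumed to be the 14th field
        }
        return info;
    }
}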
ProvinceViewReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class ProvinceViewReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum all per-province counts emitted by the mappers (and the combiner)
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
ProvinceDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class ProvinceDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: ProvinceDriver <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Province View Count");
        job.setJarByClass(ProvinceDriver.class);
        job.setMapperClass(ProvinceViewMapper.class);
        // The reducer doubles as a combiner: summing counts is associative and commutative
        job.setCombinerClass(ProvinceViewReducer.class);
        job.setReducerClass(ProvinceViewReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run Results
Package the Java project into a jar and run it on Hadoop, for example as shown below.
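A typical invocation looks roughly like this; the jar name, the (possibly fully qualified) driver class name, and the HDFS input/output paths are placeholders, not the project's actual values.
# Package the project (assuming a standard Maven layout)
mvn clean package

# Submit the job; replace the jar name, driver class and paths with your own
hadoop jar target/hadoop-project-1.0.jar ProvinceDriver /input/access.log /output/province

# Inspect the per-province view counts
hdfs dfs -cat /output/province/part-r-00000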