E-commerce in Practice: Hadoop Implementation (3)

m0_70276855

Last modified 2024-06-12 15:31:05


Tags: hadoop, big data, distributed

First published 2024-06-12 15:21:00

Article link: https://blog.csdn.net/m0_70276855/article/details/139627288


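E-commerce in Practice

Project Requirements
Implementation Approach
Implementation Code
Results
- E-commerce in Practice: Hadoop Implementation (1)
- E-commerce in Practice: Hadoop Implementation (2)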

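Project Requirements

Based on the e-commerce log file, analyze the following:

Count page views (each log line is one view)
Count page views per province (requires resolving the IP address)
Run ETL on the logs; in this project the fields to extract are ip, url, pageId (the page ID corresponding to topicId), country, province, and city

This article completes the third task, the log ETL.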
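Both the per-province count and the ETL step depend on turning an IP address into a region. In this project that work is hidden inside LogParser, whose source is not shown here, so the following is only a minimal sketch of how a range-based lookup could work. The class name IpRegionSketch, the table layout, and the sample row (a documentation-only address range with placeholder region names) are all assumptions made for illustration, not the project's actual IP library.

```java
import java.util.Map;
import java.util.TreeMap;

final class IpRegionSketch {
    /** One row of the hypothetical range table: inclusive end address plus region names. */
    static final class Region {
        final long endInclusive;
        final String country;
        final String province;
        final String city;

        Region(long endInclusive, String country, String province, String city) {
            this.endInclusive = endInclusive;
            this.country = country;
            this.province = province;
            this.city = city;
        }
    }

    // Keyed by range start. Assumes the ranges do not overlap.
    private final TreeMap<Long, Region> ranges = new TreeMap<>();

    void addRange(String startIp, String endIp, String country, String province, String city) {
        ranges.put(toLong(startIp), new Region(toLong(endIp), country, province, city));
    }

    /** Converts a dotted IPv4 string such as "192.0.2.7" to its 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Returns the region whose range contains the address, or null when no range matches. */
    Region lookup(String ip) {
        long address = toLong(ip);
        // floorEntry finds the range with the greatest start that is <= address.
        Map.Entry<Long, Region> entry = ranges.floorEntry(address);
        if (entry == null || address > entry.getValue().endInclusive) {
            return null;
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        IpRegionSketch db = new IpRegionSketch();
        // Made-up row: 192.0.2.0/24 is reserved for documentation, and the names are placeholders.
        db.addRange("192.0.2.0", "192.0.2.255", "CountryA", "ProvinceA", "CityA");
        Region hit = db.lookup("192.0.2.7");
        System.out.println(hit == null ? "unknown" : hit.province);
    }
}
```

A sorted map keeps each lookup logarithmic in the number of ranges, which matters because the lookup runs once per log line.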

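Implementation Approach

Extract: as in the earlier parts, upload the log file to HDFS (this is the same file uploaded for the first task).
Transform:

Mapper:
Read each line of the log file and parse out the required fields, including ip and url. The pageId may have to be extracted from a specific part of the url (for example, the 123 in /products/123) or matched with a regular expression. country, province, and city can likewise be looked up from the IP address in an IP address database. The Mapper emits the parsed fields as key-value pairs.
Reducer:
If the Mapper output already meets the requirement (each key corresponds to a unique record), a Reducer may not be needed, in which case the job can run map-only by setting the number of reduce tasks to 0. If records need to be merged or processed further, use a Reducer. The code below keeps a Reducer, which sums the count of each identical record.

Load: load the transformed data into the target storage, such as HDFS, HBase, Hive, or a relational database. If the target is HBase or Hive, the corresponding table structure or schema must also be defined. This project writes the result to an HDFS output directory.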
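The pageId extraction described in the Transform step can be sketched with a regular expression. This project delegates it to GetPageId.getPageId, whose source is not shown in this part, so the class below (PageIdSketch) is a hypothetical stand-in. It assumes the page ID appears in the URL as a topicId=<digits> query parameter; if the logs use a path form such as /products/123, the pattern would need to change.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PageIdSketch {
    // Assumed URL shape: the page ID is the run of digits after "topicId=".
    private static final Pattern TOPIC_ID = Pattern.compile("topicId=(\\d+)");

    /** Returns the digits after "topicId=", or an empty string if the URL has none. */
    static String extractPageId(String url) {
        if (url == null || url.isEmpty()) {
            return "";
        }
        Matcher matcher = TOPIC_ID.matcher(url);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        // Example URLs are invented for illustration.
        System.out.println(extractPageId("http://example.com/detail.html?topicId=123&ref=home"));
        System.out.println(extractPageId("http://example.com/index.html").isEmpty());
    }
}
```

Returning an empty string rather than null for a missing ID lets the caller treat "no page ID" the same way as any other empty field.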

Implementation Code

LogETLMapper.java

import com.bigdata.hadoop.project.utils.GetPageId;
import com.bigdata.hadoop.project.utils.LogParser;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.Map;

public class LogETLMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private Text outputKey = new Text();
    private LogParser logParser = new LogParser();
    private static final Logger logger = LoggerFactory.getLogger(LogETLMapper.class);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Parse the log record into named fields
        Map<String, String> logInfo = logParser.parse(value.toString());

        if (logInfo == null) {
            logger.error("Malformed log record or parse failure: " + value.toString());
            return;
        }

        // Pull out the fields we need
        String ip = logInfo.get("ip");
        String url = logInfo.get("url");
        String country = logInfo.get("country");
        String province = logInfo.get("province");
        String city = logInfo.get("city");

        // Call GetPageId to extract the topicId (the page ID) from the URL
        String topicId = GetPageId.getPageId(url);
        logInfo.put("pageId", topicId);

        // Build the output record, skipping fields that are null or empty
        StringBuilder sb = new StringBuilder();

        if (ip != null && !ip.isEmpty()) sb.append("IP: ").append(ip).append(", ");
        if (url != null && !url.isEmpty()) sb.append("URL: ").append(url).append(", ");
        if (topicId != null && !topicId.isEmpty()) sb.append("PageId: ").append(topicId).append(", ");
        if (country != null && !country.isEmpty()) sb.append("Country: ").append(country).append(", ");
        if (province != null && !province.isEmpty()) sb.append("Province: ").append(province).append(", ");
        if (city != null && !city.isEmpty()) sb.append("City: ").append(city);

        // Strip the trailing comma and space left behind when city is missing
        String outputString = sb.toString().replaceAll(", $", "");

        // Emit only when at least one field had a value; an all-empty record is logged instead
        if (!outputString.isEmpty()) {
            outputKey.set(outputString);
            context.write(outputKey, one);
        } else {
            logger.error("All fields are empty for log record: " + value.toString());
        }
    }
}
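The field concatenation in the mapper appends ", " after each field and then trims the trailing separator with a regex. As a possible alternative, java.util.StringJoiner inserts the separator only between elements, so no cleanup is needed. The sketch below is plain Java with no Hadoop dependency; the class name RecordFormatSketch and the sample values are assumptions for illustration, not part of the project.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class RecordFormatSketch {
    /** Joins the non-empty fields as "Label: value" pairs separated by ", ". */
    static String format(Map<String, String> labeledFields) {
        StringJoiner joiner = new StringJoiner(", ");
        for (Map.Entry<String, String> field : labeledFields.entrySet()) {
            String value = field.getValue();
            // Skip null and empty values, as the mapper does
            if (value != null && !value.isEmpty()) {
                joiner.add(field.getKey() + ": " + value);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves insertion order, so fields print in the order they were added
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("IP", "192.0.2.7");
        fields.put("URL", "http://example.com/detail.html?topicId=123");
        fields.put("PageId", "123");
        fields.put("Country", "");
        fields.put("Province", null);
        fields.put("City", "CityA");
        System.out.println(format(fields));
    }
}
```

An empty result from format signals that every field was missing, which is the same condition the mapper logs as an error.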

LogETLReducer.java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class LogETLReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
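        // Add up the counts emitted for this record (each mapper output carries a 1)
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // Write the record together with the number of times it occurred
        result.set(sum);
        context.write(key, result);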
    }
}

LogETLDriver.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogETLDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: LogETLDriver <input path> <output path>");
            System.exit(-1);
        }

        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Log ETL");

        job.setJarByClass(LogETLDriver.class);
        job.setMapperClass(LogETLMapper.class);
        // The reducer only sums counts, so it can also serve as the combiner
        job.setCombinerClass(LogETLReducer.class);
        job.setReducerClass(LogETLReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
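The driver registers LogETLReducer as both combiner and reducer. That is safe here because summing is associative and commutative: adding partial sums gives the same total as adding the raw values. The following plain-Java simulation is a rough sketch of that reasoning, not Hadoop code; CombinerSketch, its method names, and the record keys are made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class CombinerSketch {
    /** Builds a (key, 1) pair, like the mapper's output for one log line. */
    static Map.Entry<String, Integer> pair(String key) {
        return new AbstractMap.SimpleEntry<>(key, 1);
    }

    /** Sums the counts per key, mirroring what the reducer does for a batch of pairs. */
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    /** Sums each split locally (the combine step), then sums the partial results. */
    static Map<String, Integer> combineThenReduce(List<List<Map.Entry<String, Integer>>> splits) {
        List<Map.Entry<String, Integer>> partials = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> split : splits) {
            partials.addAll(sumByKey(split).entrySet());
        }
        return sumByKey(partials);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> split1 = Arrays.asList(pair("rec-A"), pair("rec-B"), pair("rec-A"));
        List<Map.Entry<String, Integer>> split2 = Arrays.asList(pair("rec-A"), pair("rec-B"));

        List<Map.Entry<String, Integer>> all = new ArrayList<>(split1);
        all.addAll(split2);

        // Both paths should produce identical totals per key
        System.out.println(sumByKey(all).equals(combineThenReduce(Arrays.asList(split1, split2))));
    }
}
```

A reducer that computed something non-associative, such as an average, could not be reused as a combiner this way.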

Results

After packaging the Java project, run the JAR on Hadoop, passing the input path and the output path as the two arguments.