电商日志数据分析

菜菜0319

已于 2024-06-21 14:05:51 修改

阅读量518

点赞数 14

文章标签：原型模式

于 2024-06-21 02:03:14 首次发布

本文链接：https://blog.csdn.net/A_aicai_/article/details/139846188

版权

一、项目要求：

根据电商日志文件，分析：

统计页面浏览量（每行记录就是一次浏览）
统计各个省份的浏览量（需要解析IP）
日志的ETL操作（ETL：数据从来源端经过抽取（Extract）、转换（Transform）、加载（Load）至目的端的过程）
为什么要ETL：没有必要解析出所有数据，只需要解析出有价值的字段即可。本项目中需要解析出：ip、url、pageId（topicId对应的页面Id）、country、province、city

前言

Day4 主要完成第三问：
日志的ETL操作（ETL：数据从来源端经过抽取（Extract）、转换（Transform）、加载（Load）至目的端的过程）

二、使用步骤

1.分析题目

日志的ETL操作
ETL操作是指从原始数据中提取有用信息、转换数据格式并加载到目标系统的过程。在本项目中，需要提取日志中的IP、URL、pageId（topicId对应的页面Id）、country、province、city字段。

ETL步骤：

Extract（提取）：从原始日志文件中提取出上述字段。
Transform（转换）：将提取的数据转换成适合分析的格式，例如，将IP地址转换为地理位置信息（国家、省份、城市）。
Load（加载）：将转换后的数据加载到目标数据库或数据仓库中，以便进一步分析。 ETL实现：
Map阶段：设计一个Mapper来执行ETL操作，输出键值对，其中键可以是日志的唯一标识（如日志行号或时间戳），值是一个包含所有提取字段的对象或结构。
Reduce阶段：在某些情况下，Reduce阶段可以用来进一步清洗或聚合ETL过程中的数据。

2.编写代码

private static final IntWritable one：用于计数的静态变量。
private Text outputKey：用于存储输出键的Text变量。
private LogParser logParser：日志解析器实例，用于解析日志数据。
private Logger logger：用于记录日志的SLF4J Logger实例。

Map阶段：

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 解析日志记录
        Map<String, String> logInfo = logParser.parse(value.toString());

        if (logInfo == null) {
            logger.error("日志记录的格式不正确或解析失败：" + value.toString());
            return;
        }

        // 获取需要的字段
        String ip = logInfo.get("ip");
        String url = logInfo.get("url");
        String country = logInfo.get("country");
        String province = logInfo.get("province");
        String city = logInfo.get("city");

        // 调用 GetPageId 获取 topicId
        String topicId = GetPageId.getPageId(url);
        logInfo.put("pageId", topicId);

        // 检查所有字段是否全部为空
        if (ip != null || url != null || topicId != null || country != null || province != null || city != null) {
            StringBuilder sb = new StringBuilder();

            if (ip != null && !ip.isEmpty()) sb.append("IP: ").append(ip).append(", ");
            if (url != null && !url.isEmpty()) sb.append("URL: ").append(url).append(", ");
            if (topicId != null && !topicId.isEmpty()) sb.append("PageId: ").append(topicId).append(", ");
            if (country != null && !country.isEmpty()) sb.append("Country: ").append(country).append(", ");
            if (province != null && !province.isEmpty()) sb.append("Province: ").append(province).append(", ");
            if (city != null && !city.isEmpty()) sb.append("City: ").append(city);

            // 移除末尾的逗号和空格
            String outputString = sb.toString().replaceAll(", $", "");
            outputKey.set(outputString);
            context.write(outputKey, one);
        } else {
            logger.error("所有字段为空，日志记录：" + value.toString());
        }
    }
}

Map<String, String> logInfo = logParser.parse(value.toString());：解析日志字符串，获取日志信息。

如果解析失败，使用logger记录错误，并返回。

从解析后的日志信息中提取ip、url、country、province和city字段。

String topicId = GetPageId.getPageId(url);：调用GetPageId.getPageId方法从URL中提取页面ID。

如果任何字段非空，构建输出字符串并写入Context。

如果所有字段都为空，记录错误日志。

使用StringBuilder构建包含所有非空字段的输出字符串。

context.write(outputKey, one);：将构建的输出键和计数写入Context。

Reduce阶段：

public class LogETLReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }

@Override protected void reduce(Text key, Iterable values, Context context)：重写 reduce 方法，这是 Reducer 阶段的核心方法。
累加逻辑：

int sum = 0;：初始化一个整型变量 sum，用于累计所有输入值。

for (IntWritable val : values) { sum += val.get(); }：使用增强型 for 循环遍历与当前键 key 相关联的所有 IntWritable 对象，并累加它们的值。
设置结果：

result.set(sum);：将累计的总和赋值给 result 对象。
输出结果：

context.write(key, result);：使用 Context 对象的 write 方法将最终的键值对写入输出。键是输入的键（key），值是累计后的结果（result）。

这个类的目的是接收 Mapper 阶段输出的具有相同键的所有值，计算这些值的总和，并将结果输出。例如，在统计页面浏览量的场景中，如果 Mapper 输出了每个页面的浏览次数，Reducer 将接收相同页面标识的所有浏览次数，累加它们，并输出每个页面的总浏览量。

在ETL操作中，Reducer 可以用于聚合数据、清洗数据或转换数据格式，以便为最终的加载（Load）阶段准备数据。在这个例子中，Reducer 执行了聚合操作，将分散的数据点合并成有用的统计信息。