Preface
Given an e-commerce log file, the task is to:
1. Count total page views (each log line is one view).
2. Count page views per province (requires parsing the IP address).
3. Perform an ETL pass over the logs (ETL: the process of extracting data from the source (Extract), transforming it (Transform), and loading it (Load) into the destination).
Why ETL: there is no need to parse all of the data; only the fields with analytic value are worth extracting. In this project those are: ip, url, pageId (the page Id corresponding to topicId), country, province, and city.
Part 1: Counting page views (each log line is one view)
- Read in the data
- Write the code
Map code
static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Every line maps to the same constant key, so the reducer receives a single
    // group whose values add up to the total number of page views
    private Text KEY = new Text("key");
    private LongWritable ONE = new LongWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(KEY, ONE);
    }
}
Reduce code
static class MyReducer extends Reducer<Text, LongWritable, NullWritable, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long count = 0;
        // sum the 1s emitted by the mapper
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(NullWritable.get(), new LongWritable(count));
    }
}
- Package the program and run the jar in Hadoop, for example:
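A minimal run sketch; the jar name and HDFS paths below are placeholders, not taken from the original project:

hadoop jar pv-stat.jar PVStatApp /input/trackinfo.log /output/pv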
Part 2: Counting page views per province (requires IP parsing)
- Import the data
- Write the code
Map code
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String log = value.toString();
    // Raw records are delimited by the \u0001 control character;
    // only lines with enough fields are treated as valid
    String[] fields = log.split("\u0001");
    if (fields.length > 13) {
        // LogParser and IPParser are project-specific helpers (see the sketch below)
        LogParser parser = new LogParser();
        Map<String, String> logInfo = parser.parse(log);
        if (StringUtils.isNotBlank(logInfo.get("ip"))) {
            // resolve the IP address to a region (country / province / city)
            IPParser.RegionInfo regionInfo = IPParser.getInstance().analyseIp(logInfo.get("ip"));
            String province = regionInfo.getProvince();
            if (StringUtils.isNotBlank(province)) {
                context.write(new Text(province), new IntWritable(1));
            } else {
                // IP resolved, but no province could be determined
                context.write(new Text("-"), new IntWritable(1));
            }
        } else {
            // no usable IP in this record
            context.write(new Text("+"), new IntWritable(1));
        }
    }
}
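LogParser and IPParser come from the project and their source is not shown in this post. As a hypothetical sketch of the contract the Mapper relies on (the \u0001 delimiter and the IP at index 13 match the guard above; the other field positions are pure assumptions):

import java.util.HashMap;
import java.util.Map;

public class LogParser {
    // sketch only, not the project's real implementation
    public Map<String, String> parse(String log) {
        Map<String, String> info = new HashMap<>();
        String[] fields = log.split("\u0001");
        if (fields.length > 13) {
            info.put("ip", fields[13].trim());
            info.put("url", fields[1]); // position assumed for illustration
        }
        return info;
    }
}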
Reduce code
static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        // add up the counts for this province
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Full code (for reference, this is the complete PVStatApp from Part 1; the per-province driver has the same structure, see the sketch after the listing)
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

/**
 * Page-view count
 */
public class PVStatApp {
    public static void main(String[] args) throws Exception {
        // standard driver boilerplate
        Configuration configuration = new Configuration();
        // delete the output path if it already exists, so reruns don't fail
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path(args[1]);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }
        Job job = Job.getInstance(configuration);
        job.setJarByClass(PVStatApp.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

    // Map: emit ("key", 1) for every input line
    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private Text KEY = new Text("key");
        private LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            context.write(KEY, ONE);
        }
    }

    // Reduce: sum the 1s to get the total number of page views
    static class MyReducer extends Reducer<Text, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }
            context.write(NullWritable.get(), new LongWritable(count));
        }
    }
}
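For the per-province count, the driver is nearly identical. A minimal sketch, assuming the MyMapper and MyReducer shown earlier in this part are nested in the class, with the same imports as PVStatApp plus org.apache.hadoop.io.IntWritable (the class name ProvinceStatApp is mine):

public class ProvinceStatApp {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path(args[1]);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }
        Job job = Job.getInstance(configuration);
        job.setJarByClass(ProvinceStatApp.class);
        job.setMapperClass(MyMapper.class);   // the province Mapper above
        job.setReducerClass(MyReducer.class); // the summing Reducer above
        // this job keys by province name and counts with IntWritable
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}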
- Package the code into a jar, run it, and check the final result in HDFS, for example:
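Placeholder jar name and paths again:

hadoop jar province-stat.jar ProvinceStatApp /input/trackinfo.log /output/province
hdfs dfs -cat /output/province/part-r-00000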
Part 3: ETL on the logs (ETL: the process of extracting data from the source (Extract), transforming it (Transform), and loading it (Load) into the destination)
1. Problem analysis
1. Log ETL: ETL extracts useful information from raw data, converts its format, and loads it into a target system. In this project the fields to extract are ip, url, pageId (the page Id corresponding to topicId), country, province, and city.
2. ETL steps:
Extract: pull the fields above out of the raw log file.
Transform: convert the extracted data into a format suited to analysis, for example resolving the IP address into geographic information (country, province, city).
Load: load the transformed data into the target database or data warehouse for further analysis.
3. ETL implementation:
Map phase: a Mapper performs the ETL and emits key-value pairs, where the key can be a unique identifier for the record (such as the line offset or a timestamp) and the value is an object holding all of the extracted fields.
Reduce phase: in some cases, a Reduce phase can further clean or aggregate the data produced during ETL.
- Implementation
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.Map;

// LogParser and ContentUtils are project-specific helpers (ContentUtils is sketched below)
public class ETLApp {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path(args[1]);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }
        Job job = Job.getInstance(configuration);
        job.setJarByClass(ETLApp.class);
        job.setMapperClass(MyMapper.class);
        job.setNumReduceTasks(0); // map-only job: the ETL output needs no reduce phase
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private LogParser logParser;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // create the parser once per task, not once per record
            logParser = new LogParser();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String log = value.toString();
            // parseV2 also resolves the IP to country/province/city
            Map<String, String> info = logParser.parseV2(log);
            String ip = info.get("ip");
            String country = info.get("country");
            String province = info.get("province");
            String city = info.get("city");
            String url = info.get("url");
            String time = info.get("time");
            String pageId = ContentUtils.getPageId(url);
            // emit one tab-separated record per input line
            StringBuilder builder = new StringBuilder();
            builder.append(ip).append("\t");
            builder.append(country).append("\t");
            builder.append(province).append("\t");
            builder.append(city).append("\t");
            builder.append(url).append("\t");
            builder.append(time).append("\t");
            builder.append(pageId);
            context.write(NullWritable.get(), new Text(builder.toString()));
        }
    }
}
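ContentUtils is another project helper whose source is not in the post. A hypothetical sketch of getPageId, assuming the topicId shows up in the URL as a number after "topic" (the URL shape is an assumption):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentUtils {
    // assumed URL shape for illustration, e.g. ".../topic-123.html"
    private static final Pattern TOPIC = Pattern.compile("topic[-/=]?(\\d+)");

    public static String getPageId(String url) {
        if (url == null) {
            return "-";
        }
        Matcher matcher = TOPIC.matcher(url);
        return matcher.find() ? matcher.group(1) : "-";
    }
}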
- Package the code into a jar
- Upload the jar to the Hadoop machine and run it
- View the result in HDFS
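Because the job is map-only, the output files are named part-m-*; with a placeholder output path:

hdfs dfs -cat /output/etl/part-m-00000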
Part 4: Summary
We wrote MapReduce jobs to count overall page views and per-province page views, and used an ETL pass to turn the raw e-commerce logs into data that is easy to manage and analyze. During extraction (Extract), the key point is to identify the fields that actually carry value for the business analysis. Cleaning during ETL filters out data that does not meet requirements, such as incomplete, erroneous, or duplicate records; transformation covers reconciling inconsistent data, converting data granularity, and computing business rules and aggregations. These steps help keep the data in the warehouse accurate and consistent.