Lesson 16: Big Data Analysis Case Study

吴琼老师

Last modified: 2023-08-25 09:13:42


First published: 2022-07-25 23:37:51

Article link: https://blog.csdn.net/u013280750/article/details/125899983

Data Traffic Analysis Case Study

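1. Introduction

Anyone who studies the internet soon runs into the buzzword "big data". What it means depends on the angle you approach it from. For a programmer the meaning is fairly simple: using code to solve problems that involve massive amounts of data.
- For example, nearly everyone carries a mobile phone today, and every use of the phone passes through a carrier. The carrier collects this information, and business analysis turns it into value, because data is value.
- This case study analyzes mobile traffic data to handle a few such business requirements.
- Each record corresponds to one click, so the data is produced in massive volume.
Analyzing the same batch of data along different dimensions serves different business requirements.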

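2. Data Source Fields

2.1 The http.log Log Records

A base station can record which websites a user visits. After data cleaning, these records are stored in a log.
- The log file is http.log, and every record in it follows the same field-separation rule.
Take the browsing record "18518381887 http://v.baidu.com/tv 40 9000" as an example.
- Field 1: the phone number.
- Field 2: the requested URL.
- Field 3: upstream traffic, 40 bytes.
- Field 4: downstream traffic, 9000 bytes.
Screenshot (not reproduced here):
- User browsing log records.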
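
As a minimal sketch of the field layout above, one record can be split into its four fields as follows. The class and method names are hypothetical, and the code assumes that no field contains whitespace and that fields are separated by a tab or by spaces.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the original article.
class HttpLogLineDemo {

    /** Splits one http.log record into {phone, url, upstream, downstream}. */
    static String[] parseLine(String line) {
        // "\\s+" matches any run of whitespace, so tabs and spaces are both handled.
        return line.trim().split("\\s+");
    }

    /** Total traffic of one record in bytes: upstream + downstream. */
    static long totalTraffic(String[] fields) {
        return Long.parseLong(fields[2]) + Long.parseLong(fields[3]);
    }

    public static void main(String[] args) {
        String[] fields = parseLine("18518381887\thttp://v.baidu.com/tv 40 9000");
        System.out.println(Arrays.toString(fields));
        System.out.println(totalTraffic(fields)); // 40 + 9000 = 9040
    }
}
```

Splitting on any whitespace avoids having to decide between the tab and the space separator for each field.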

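2.2 phone.txt Data Rules

Different carriers encode phone numbers differently, and each phone number also carries its home location: province, city, and other information.
- phone.txt holds the number-segment rules: it maps a phone number segment to its region, city, and carrier.
Example: 130 1300000 山东 济南 联通 250000 0531 370100 (Shandong, Jinan, China Unicom)
- The fields are: prefix, phone (the 7-digit number segment), province, city, carrier (isp), postal code, city code, and area code.
- As shown in the screenshot (not reproduced here).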
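
To make the column layout concrete, here is a small sketch that parses one row of phone.txt. It assumes the file is tab-separated and starts with a header row whose first column is named prefix, as the code comments later in this article suggest. The class name is hypothetical, and romanized values stand in for the Chinese values of the real file.

```java
// Hypothetical value class for illustration; field names follow the phone.txt header row.
class PhoneSegment {
    final String prefix;   // first 3 digits, e.g. "130"
    final String segment;  // first 7 digits, e.g. "1300000"
    final String province;
    final String city;
    final String isp;

    PhoneSegment(String prefix, String segment, String province, String city, String isp) {
        this.prefix = prefix;
        this.segment = segment;
        this.province = province;
        this.city = city;
        this.isp = isp;
    }

    /** Parses one tab-separated row; returns null for the header row or a malformed row. */
    static PhoneSegment parse(String row) {
        String[] f = row.split("\t");
        if (f.length < 5 || f[0].equals("prefix")) {
            return null;
        }
        return new PhoneSegment(f[0], f[1], f[2], f[3], f[4]);
    }

    public static void main(String[] args) {
        PhoneSegment s = parse("130\t1300000\tShandong\tJinan\tUnicom\t250000\t0531\t370100");
        System.out.println(s.segment + " -> " + s.province + " / " + s.city + " / " + s.isp);
    }
}
```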

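3. Websites with the Highest Traffic

3.1 Requirements Analysis

From the given user browsing log, compute the top 3 websites by total traffic.
- The analysis is based on the browsing records in http.log.
A brief outline of the approach:
- Each record has 4 fields: phone number, URL, upstream traffic, and downstream traffic. Which ones are needed?
- What is total traffic? Upstream traffic + downstream traffic.
- Looking closely, many URLs repeat the same site (the examples here are Baidu Zhidao and Baidu TV pages). URLs of the same site should be counted together, so each URL must be truncated to its host name before aggregation, e.g. http://v.baidu.com/tv becomes v.baidu.com. Pay attention to the truncation rule.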
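```
 /**
     * Extracts the site (host name) from a URL with a regular expression.
     * @param url the full URL, e.g. http://v.baidu.com/tv
     * @return the first host-like match such as v.baidu.com, or null if nothing matches
     */
    private static String getUrl(String url) {

        // optional sub-domain, then name '.' suffix; note that \w does not match '-'
        String s = "\\w*\\.?\\w+\\.\\w+";
        Pattern p = Pattern.compile(s);
        Matcher matcher = p.matcher(url);
        // only the first match is needed
        if (matcher.find()){
            return matcher.group();
        }
        return null;
    }
```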
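- Read the file with an IO stream, create objects, and collect the required fields into a collection.
Hint: the getUrl method above handles the URL truncation.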
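
To see what the truncation rule actually produces, the sketch below applies the same pattern as getUrl to a few sample URLs. The class name is hypothetical, and java.net.URI is shown only as an alternative for well-formed URLs, not as the method used in this article. Note that different sub-domains (for example v.baidu.com and a hypothetical zhidao.baidu.com) stay separate keys with this pattern.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one used by getUrl above.
class HostExtractDemo {
    // Compiled once and reused for every record.
    private static final Pattern HOST = Pattern.compile("\\w*\\.?\\w+\\.\\w+");

    /** Returns the first host-like match, or null if the input has none. */
    static String host(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(host("http://v.baidu.com/tv"));    // v.baidu.com
        System.out.println(host("http://v.baidu.com/movie")); // v.baidu.com
        System.out.println(host("http://movie.youku.com"));   // movie.youku.com
        // Alternative for well-formed URLs: let the standard library parse the host.
        System.out.println(URI.create("http://v.baidu.com/tv").getHost()); // v.baidu.com
    }
}
```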

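3.2 The Flow Class

The code is as follows.

The Flow class wraps one site and its total traffic.
A custom class can be sorted directly if it implements the Comparable interface and overrides compareTo. Otherwise an external comparator (Comparator) is required.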

/**
 * Total traffic of one website.
 * @author Wu
 */
public class Flow implements Comparable<Flow> {
    private  String url;// site (host name)
    private  int count;// total traffic (upstream + downstream)

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    public Flow(String url, int count) {
        this.url = url;
        this.count = count;
    }

    @Override
    public String toString() {
        return "Flow{" +
                "url='" + url + '\'' +
                ", count=" + count +
                '}';
    }

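    @Override
    public int compareTo(Flow o) {
        // descending by total traffic; Integer.compare cannot overflow the way subtraction can
        return Integer.compare(o.getCount(), this.getCount());
    }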
}
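
For comparison with the Comparable approach used by Flow, the following sketch sorts a class that does not implement Comparable by passing an external Comparator. SiteTotal and the example.com hosts are hypothetical stand-ins, not classes or data from this article.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ComparatorSketch {

    // Hypothetical stand-in for Flow that does NOT implement Comparable.
    static class SiteTotal {
        final String url;
        final int count;

        SiteTotal(String url, int count) {
            this.url = url;
            this.count = count;
        }
    }

    /** Returns a copy of the input sorted by count, largest first. */
    static List<SiteTotal> sortedDescending(List<SiteTotal> input) {
        List<SiteTotal> copy = new ArrayList<>(input);
        // External comparator: the ordering lives outside the class being sorted.
        copy.sort(Comparator.comparingInt((SiteTotal s) -> s.count).reversed());
        return copy;
    }

    public static void main(String[] args) {
        List<SiteTotal> sorted = sortedDescending(Arrays.asList(
                new SiteTotal("a.example.com", 10),
                new SiteTotal("b.example.com", 30),
                new SiteTotal("c.example.com", 20)));
        for (SiteTotal s : sorted) {
            System.out.println(s.url + " " + s.count);
        }
    }
}
```

Implementing Comparable fixes one natural order inside the class, while a Comparator lets each caller choose its own order.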

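3.3 Business Code

The business logic.

Pay attention to how each record is split.
Put the key fields of the file into a HashMap.
Extract the site from the URL with the regular expression.
Sorting is done on a List, so the data in the HashMap must be copied into a List.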

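import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Url_Flow {
    public static void main(String[] args) {
        //4.1 Totals per site are collected in a HashMap.
        HashMap<String,Integer> hashMap = new HashMap<>();

        //1. Read the log file through an IO stream.
        try(BufferedReader br = new BufferedReader(new FileReader(new
                                        File("D://case//http.log")));){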

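            // start reading
            String line;
            while((line=br.readLine())!=null){
                //2. Split the record: the first separator is a tab (\t), the others are spaces.
                // 15639120688	http://v.baidu.com/movie 3936 12058
                String[] arr = line.split(" ");// arr[0] = "phone\turl", arr[1] = upstream, arr[2] = downstream

                // total traffic: parse the two numeric strings before adding them
                int count = Integer.parseInt(arr[1])+Integer.parseInt(arr[2]);

                // get the URL
                String[] arr2 = arr[0].split("\t");// split phone number and URL on the tab

                //3. reduce the URL to its site with the helper method
                String url = getUrl(arr2[1]);

                //4. Store in the map.
                // Each iteration adds one record, so an existing key must be accumulated.
                if (hashMap.containsKey(url)){
                    hashMap.put(url,hashMap.get(url)+count);// key exists: add to the running total
                }else {
                    hashMap.put(url,count);
                }
            }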
            }

        } catch (IOException e) {
            e.printStackTrace();
        }



        //5. Wrap the map entries in Flow objects and sort them.
        ArrayList<Flow> list = getList(hashMap);

        // print the top 3
        System.out.println(list.get(0));
        System.out.println(list.get(1));
        System.out.println(list.get(2));
        // the 4th and 5th, for reference
        System.out.println(list.get(3));
        System.out.println(list.get(4));



    }

    /**
     * Copies each key and value of the map into a Flow object,
     * then sorts the resulting List.
     * @param hashMap total traffic per site
     * @return Flow objects sorted by total traffic, descending
     */
    private static ArrayList<Flow> getList(HashMap<String, Integer> hashMap) {
        //2.1 create the List
        ArrayList<Flow> list = new ArrayList<>();

        //1. iterate over the keys of the map
        Set<String> keySet = hashMap.keySet();
        Iterator<String> iterator = keySet.iterator();
        while(iterator.hasNext()){
            String key = iterator.next();
            int value = hashMap.get(key);

            //2. add to the List
            list.add(new Flow(key,value));
        }

        //3. sort before returning
        Collections.sort(list);// works because Flow implements Comparable; otherwise a Comparator is needed

        return list;

    }

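    /**
     * Extracts the site (host name) from a URL with a regular expression.
     * @param url the full URL, e.g. http://v.baidu.com/tv
     * @return the first host-like match such as v.baidu.com, or null if nothing matches
     */
    private static String getUrl(String url) {

        String s = "\\w*\\.?\\w+\\.\\w+";
        Pattern p = Pattern.compile(s);
        Matcher matcher = p.matcher(url);
        // only the first match is needed
        if (matcher.find()){
            return matcher.group();
        }
        return null;
    }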
}

Output:

Flow{url='www.jianshu.com', count=717979222}
Flow{url='v.baidu.com', count=678201065}
Flow{url='www.edu360.cn', count=482997506}
Flow{url='movie.youku.com', count=480758412}
Flow{url='blog.csdn.net', count=478625511}
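
The accumulate-then-sort steps of Url_Flow can also be sketched more compactly with Map.merge and a stream. This is an alternative formulation under stated assumptions, not the approach used in this article: the class name is hypothetical, the sample values are the sums of the three sample records quoted in this article, and long is used for the totals as a precaution against overflow.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch; not the article's Url_Flow class.
class TopSitesSketch {

    /** Adds one record's traffic to the running total of its site. */
    static void add(Map<String, Long> totals, String site, long bytes) {
        // merge() inserts the value if the key is new, otherwise combines old and new with Long::sum.
        totals.merge(site, bytes, Long::sum);
    }

    /** Returns at most n entries, largest total first. */
    static List<Map.Entry<String, Long>> topN(Map<String, Long> totals, int n) {
        return totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n) // never fails when fewer than n sites exist
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Long> totals = new HashMap<>();
        add(totals, "v.baidu.com", 40L + 9000L);
        add(totals, "v.baidu.com", 3936L + 12058L);
        add(totals, "movie.youku.com", 306L + 6184L);
        for (Map.Entry<String, Long> e : topN(totals, 3)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Unlike calling list.get(4) directly, limit(n) simply returns fewer entries when the log contains fewer than n distinct sites.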

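The remaining requirements on the same data:
Using the number-segment home-location rules, compute the top 5 provinces by total traffic.
Using the number-segment carrier rules, compute the top 3 carriers by total traffic.
Using the user browsing log, compute the top 3 phone numbers by total traffic.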

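4. Provinces with the Highest Traffic

4.1 Requirements Analysis

Computing the provinces with the highest traffic from the number-segment home-location rules requires both the phone.txt rules file and http.log.
- Hint: join the two tables on their common field, the phone number.
A brief outline of the approach:
- Match the 7-digit number segment in phone.txt against the first 7 digits of each phone number in http.log, and use it to look up the other fields.
- Store the required data in a HashMap. Why a map? A number segment belongs to exactly one province, never two, and the keys of a Map are unique.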
[Figure: matching phone.txt number segments against http.log phone numbers; image not reproduced here]
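
The join described above can be sketched as a lookup of each phone number's first 7 digits in a segment-to-province map. All names and sample values below are hypothetical, romanized province names stand in for the Chinese values of phone.txt, and skipping unmatched numbers is an assumption about the desired behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the prefix join.
class ProvinceJoinSketch {

    /** Sums per-phone traffic into per-province traffic via the 7-digit number segment. */
    static Map<String, Long> trafficByProvince(Map<String, Long> trafficByPhone,
                                               Map<String, String> provinceBySegment) {
        Map<String, Long> result = new HashMap<>();
        for (Map.Entry<String, Long> e : trafficByPhone.entrySet()) {
            String phone = e.getKey();
            if (phone.length() < 7) {
                continue; // too short to contain a full segment
            }
            String province = provinceBySegment.get(phone.substring(0, 7));
            if (province == null) {
                continue; // segment not listed in the rules file
            }
            result.merge(province, e.getValue(), Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> provinceBySegment = new HashMap<>();
        provinceBySegment.put("1300000", "Shandong");

        Map<String, Long> trafficByPhone = new HashMap<>();
        trafficByPhone.put("13000001234", 100L);
        trafficByPhone.put("13000005678", 50L);
        trafficByPhone.put("19999990000", 7L); // no rule for this segment

        System.out.println(trafficByProvince(trafficByPhone, provinceBySegment));
    }
}
```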

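4.2 The Province Class

The code is as follows.

The Province class wraps one province and its total traffic.
Sorting is done with an external comparator.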

/**
 * Total traffic of one province.
 * @author Wu
 */
public class Province {
    private String province_Name; // province
    private int  count; // total traffic

    public String getProvince_Name() {
        return province_Name;
    }

    public void setProvince_Name(String province_Name) {
        this.province_Name = province_Name;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    public Province(String province_Name, int count) {
        this.province_Name = province_Name;
        this.count = count;
    }

    @Override
    public String toString() {
        return "Province{" +
                "province_Name='" + province_Name + '\'' +
                ", count=" + count + '}';
    }
}

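4.3 Business Code

The business logic.

Read both data sources with IO streams.
Store the relevant fields of each source in its own Map.
Key point: the two HashMaps are passed as parameters, and their data ends up in a sorted List (external comparator).
To get there, find the province that matches each phone number, then accumulate province and traffic in a map.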

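import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class Province_Flow {
    public static void main(String[] args) {
        // http.log: phone number -> total traffic
        HashMap<String,Integer> map_http = new HashMap<>();
        // phone.txt: 7-digit number segment -> province
        HashMap<String,String> map_phone = new HashMap<>();

        // open both data sources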
        try(BufferedReader br_P = new BufferedReader(new
                        FileReader(new File("D://case//phone.txt")));
             BufferedReader br_H = new BufferedReader(new FileReader(new
                        File("D://case//http.log")));){

            /*1. Process http.log: split the fields and store them in the map.
                15768611816	http://movie.youku.com 306 6184
             */
            String line;
            while((line=br_H.readLine())!=null){
                // total traffic per phone number
                String[] arr = line.split(" ");
                int count = Integer.parseInt(arr[1]) + Integer.parseInt(arr[2]);// upstream + downstream

                String[] arr2 = arr[0].split("\t");
                String phone = arr2[0];// phone number

                // accumulate in the HashMap
                if (map_http.containsKey(phone)){
                    map_http.put(phone,map_http.get(phone)+count);
                }else {
                    map_http.put(phone,count);
                }
            }

            /*2. Process phone.txt: map each number segment to its province
                    prefix	phone	province	city	isp	post_code	city_code	area_code
                    130	1300000	山东	济南	联通	250000	0531	370100
             */
            String line2;
            while((line2=br_P.readLine())!=null){
                String[] arr = line2.split("\t");

                String phone = arr[1]; // 7-digit number segment
                String province = arr[2]; // province

                // segments do not repeat, so a plain put is enough
                map_phone.put(phone,province);
            }



            /*3. Combine the two maps into a sorted List
             */
            ArrayList<Province> list = getList(map_http,map_phone);
            System.out.println(list.get(0));
            System.out.println(list.get(1));
            System.out.println(list.get(2));


        }catch (IOException e){
            e.printStackTrace();
        }



    }

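    /**
     * Joins the two maps on the number segment and converts the result to a List.
     * @param map_http phone number -> total traffic
     * @param map_phone 7-digit number segment -> province
     * @return provinces sorted by total traffic, descending
     */
    private static ArrayList<Province> getList(HashMap<String, Integer> map_http, HashMap<String, String> map_phone) {
        //1. create the result List
        ArrayList<Province> list = new ArrayList<>();

        //2. province -> total traffic
        HashMap<String,Integer> hashMap = new HashMap<>();

        /* iterate over the per-phone totals
         */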
        for(Map.Entry<String,Integer> map : map_http.entrySet()){
            String key = map.getKey();
            Integer count = map.getValue();

            String phone_sub = key.substring(0, 7);

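            // look up the province by the 7-digit segment
            String value_Province = map_phone.get(phone_sub);
            if (value_Province == null) {
                continue; // no matching segment in phone.txt: skip rather than group under a null key
            }
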
            if (hashMap.containsKey(value_Province)){
                hashMap.put(value_Province,hashMap.get(value_Province)+count);
            }else {
                hashMap.put(value_Province,count);
            }

        }


        for (Map.Entry<String,Integer> map:hashMap.entrySet()){
            list.add(new Province(map.getKey(),map.getValue()));
        }

        // sort descending by total traffic
        Collections.sort(list, new Comparator<Province>() {
            @Override
            public int compare(Province o1, Province o2) {
                return Integer.compare(o2.getCount(), o1.getCount());
            }
        });
        return list;
    }
}

Output:

Province{province_Name='广东', count=438816394}
Province{province_Name='江苏', count=279100752}
Province{province_Name='山东', count=274846877}

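5. Carriers with the Highest Traffic

5.1 Requirements Analysis

Join the two data sources and rank the carriers (isp) by total traffic, where each record corresponds to one click.
- Using the number-segment carrier rules, compute the top 3 carriers by total traffic.
- Work out how the two sources are related: which field should be used for the join?
- Hint: the carrier is the isp field.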
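
Before choosing the join field, it can help to check whether a candidate key identifies exactly one carrier. The sketch below groups the isp values of phone.txt by the 3-digit prefix column. It is only a diagnostic aid under assumptions: the class name is hypothetical, the rows are invented samples with romanized values, and the file is assumed to be tab-separated with a header row.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical diagnostic; not part of the article's solution.
class IspPrefixCheck {

    /** Groups carrier names by 3-digit prefix; more than one name means the prefix is ambiguous. */
    static Map<String, Set<String>> ispsByPrefix(List<String> rows) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String row : rows) {
            String[] f = row.split("\t");
            if (f.length < 5 || f[0].equals("prefix")) {
                continue; // header or malformed row
            }
            result.computeIfAbsent(f[0], k -> new TreeSet<>()).add(f[4]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "prefix\tphone\tprovince\tcity\tisp",
                "130\t1300000\tShandong\tJinan\tUnicom",
                "170\t1700000\tProvinceA\tCityA\tVirtual/Unicom",   // invented sample row
                "170\t1700001\tProvinceB\tCityB\tVirtual/Telecom"); // invented sample row
        for (Map.Entry<String, Set<String>> e : ispsByPrefix(rows).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

If every prefix maps to a single carrier, a plain put into a HashMap is safe. If not, the 7-digit segment may be the better join key.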


5.2 The Isp Class

The wrapper class.

The custom class implements the Comparable interface so that it can be sorted.

/**
 * Total traffic of one carrier.
 */
public class Isp implements Comparable<Isp>{
    private String isp_Name;// carrier name
    private int  count; // upstream traffic + downstream traffic

    public String getIsp_Name() {
        return isp_Name;
    }

    public void setIsp_Name(String isp_Name) {
        this.isp_Name = isp_Name;
    }

    public int getCount() {
        return count;
    }

    public void setCount(int count) {
        this.count = count;
    }

    public Isp(String isp_Name, int count) {
        this.isp_Name = isp_Name;
        this.count = count;
    }

    /**
     * String form of the object, used when printing the result.
     * @return the carrier name and its total traffic
     */
    @Override
    public String toString() {
        return "Isp{" +
                "isp_Name='" + isp_Name + '\'' +
                ", count=" + count +
                '}';
    }

    @Override
    public int compareTo(Isp o) {
        return Integer.compare(o.count, this.count);// descending
    }
}

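5.3 Business Code

A brief analysis of the steps:

Read the data sources with IO streams.
Split out the relevant fields and store them in HashMaps, one per data source, with the appropriate key and value.
Convert the two maps into a List.
Key point: how to join the two tables on the key field, i.e. use a key from one map to find the value in the other.
Convert the collection: HashMap to List.
Sort.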

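import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class Isp_Flow {
    public static void main(String[] args) {
        //1.3 phone number -> total traffic
        HashMap<String,Integer> map_http = new HashMap<>();
        //2.2 3-digit prefix -> carrier
        HashMap<String,String> map_phone = new HashMap<>();

        // open both data sources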
        try(BufferedReader br_P = new BufferedReader(new
                FileReader(new File("D://case//phone.txt")));
            BufferedReader br_H = new BufferedReader(new FileReader(new
                    File("D://case//http.log")));) {

            /*1. Process http.log: total traffic per phone number
                15768611816	http://movie.youku.com 306 6184
             */
            String line;
            while((line=br_H.readLine())!=null){
                String[] arr = line.split(" ");
                int count = Integer.parseInt(arr[1])+Integer.parseInt(arr[2]);// total traffic

                String[] arr2 = arr[0].split("\t");
                String phone = arr2[0];// phone number

                //1.2 store key and value in the map

                if (map_http.containsKey(phone)){
                    map_http.put(phone,map_http.get(phone)+count);// repeated phone numbers accumulate their traffic
                }else {
                    map_http.put(phone,count);
                }
            }

             /*2. Process phone.txt: phone prefix -> carrier
                prefix	phone	province	city	isp	post_code	city_code	area_code
                130	1300000	山东	济南	联通	250000	0531	370100
             */
            String line2;
            while((line2=br_P.readLine())!=null){

                String[] arr = line2.split("\t");
                String phone_prefix = arr[0];// 3-digit phone prefix

                String isp_name = arr[4]; // carrier name

                //2.1 store in the map; is a duplicate check needed here?
                map_phone.put(phone_prefix,isp_name);

            }

            /*3. Combine the two maps
             */

            ArrayList<Isp> list =  getList(map_http,map_phone);
            System.out.println(list.get(0));
            System.out.println(list.get(1));
            System.out.println(list.get(2));
            System.out.println(list.get(3));

        }catch (IOException e){
            e.printStackTrace();
        }
    }

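    /**
     * Joins the two maps on the phone prefix, then builds and sorts the List.
     * @param map_http phone number -> total traffic
     * @param map_phone 3-digit prefix -> carrier
     * @return carriers sorted by total traffic, descending
     */
    private static ArrayList<Isp> getList(HashMap<String, Integer> map_http, HashMap<String, String> map_phone) {
        ArrayList<Isp> list = new ArrayList<>();
        HashMap<String,Integer> hashMap = new HashMap<>();
        //1. iterate over map_http and take the first 3 digits of each phone number
        for (Map.Entry<String,Integer> map : map_http.entrySet()){
            String sub_phone = map.getKey().substring(0, 3);
            Integer count  = map.getValue();// total traffic of this phone number

            //1.1 look up the carrier (isp) for the prefix in map_phone
            String isp_name = map_phone.get(sub_phone);
            if (isp_name == null) {
                continue; // unknown prefix: skip rather than group under a null key
            }

            //1.2 accumulate carrier and traffic in the map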
            if (hashMap.containsKey(isp_name)){
                hashMap.put(isp_name,hashMap.get(isp_name)+count);
            }else {
                hashMap.put(isp_name,count);
            }

        }

        // convert: copy the data of the Map into the List
        for (Map.Entry<String,Integer> map : hashMap.entrySet()){
            list.add(new Isp(map.getKey(),map.getValue()));
        }

        // sort; this requires Isp to implement Comparable and override compareTo
        Collections.sort(list);

        return list;
    }
}

Output:

Isp{isp_Name='移动', count=1977129164}
Isp{isp_Name='联通', count=1039736761}
Isp{isp_Name='电信', count=805612841}
Isp{isp_Name='虚拟/联通', count=175292373}
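
The largest total in the output above, 1977129164, is close to Integer.MAX_VALUE (2147483647), so a somewhat larger log could overflow an int counter without any error. The sketch below illustrates the effect and one precaution, accumulating in a long. The class name and the extra 200000000 bytes are hypothetical.

```java
// Hypothetical demo of int overflow near the totals seen in this case study.
class OverflowSketch {

    /** Returns true if a + b does not fit into an int. */
    static boolean overflowsInt(int a, int b) {
        try {
            Math.addExact(a, b); // throws ArithmeticException on overflow
            return false;
        } catch (ArithmeticException ex) {
            return true;
        }
    }

    /** Widens to long before adding, so the sum stays exact. */
    static long safeSum(int a, int b) {
        return (long) a + b;
    }

    public static void main(String[] args) {
        int total = 1977129164;  // largest carrier total in the output above
        int extra = 200_000_000; // hypothetical additional traffic
        System.out.println(Integer.MAX_VALUE - total); // remaining headroom: 170354483
        System.out.println(total + extra);             // plain int addition wraps to a negative number
        System.out.println(overflowsInt(total, extra)); // true
        System.out.println(safeSum(total, extra));      // 2177129164
    }
}
```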

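6. Classroom Exercises

6.1 Top 3 Phone Numbers by Traffic

From the given user browsing log, compute the top 3 phone numbers by total traffic.
- How should the data source be processed? Are both tables needed?
- This is very close to the first task (section 3); only the dimension differs.

6.2 Adding Another Dimension

The limited data can support further requirements: using the number-segment home-location rules, compute the top 2 cities by total traffic.
Think about how the data is related and how to analyze the problem.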
