access重复数据累计_极简教程: 网站日志数据分析实战(四)

最新推荐文章于 2022-11-09 13:46:46 发布

一直住顶楼

最新推荐文章于 2022-11-09 13:46:46 发布

阅读量269

点赞数

文章标签： access重复数据累计

本文链接：https://blog.csdn.net/weixin_34098296/article/details/112357667

版权

目的

阅读对象: 7天SQL基础+7天Python同学
预期收获: 接触真实的日志数据

关于本文章涉及到的日志数据:

数据来源: 访问superset学习网站生产的nginx真实的日志数据, IP属于隐私数据，所以映射为地域信息
怎么使用: 大家直接进入学习网站，点击SQL Lab，选择data_sample, 再选access_log即可看到

为方便练习SQL，提供一个公共账号, 用户名:hi-zhihu 密码:justdoit

为了更直白，我录一段操作小视频演示下:

教你如何进入数据练习环境https://www.zhihu.com/video/1142590482713632768

下面，简单对App的日志数据做简单介绍。

日志来源

数据分析面对的日志数据，主要来自两个端，如下图:

日志来源

一个是服务端，主要是产品的每个功能背后对应的数据接口请求数据，这部分数据大部分生产，对用户来说是透明的，看不到，比如: App上报一些开关状态信息，用户无感知的情况下进行的;

另外一个是大前端，用户能直接看到的，包括客户端页面和网页端页面，我们常说的用户点击行为日志。

这样说，小白听了好像还是给没听一样, 这样的效果，我干嘛花时间码字？！所以，我现在把真实的日志捞出来，大家一起观摩一下，虽不能代表所有公司的日志场景，但是作为初次接触，了解学习，你会的收获会很足！这，才是我花时间写的目的！

日志数据样本

这里，简单介绍下网站日志的一种，nginx日志。字段介绍:

客户端（用户）IP地址
访问时间。如： [07/Aug/2019:16:00:00 +0800]
请求方式（GET或者POST等）。如：上例中的 GET
请求状态（状态码，200表示成功，404表示页面不存在，301表示永久重定向等）
请求页面大小，默认为B（byte）
来源页面，即从哪个页面转到本页，专业名称叫做“referer”
用户浏览器版本、浏览器类型等

样本数据:(ip脱敏)

日志样例

日志脱敏

根据ip可以映射地域，这里我使用的是python第三方包ip2region进行ip到地域的映射。

ip映射地域的python代码

def ip_to_geo(ip, db_path):
    searcher = Ip2Region(db_path)
    res = searcher.memorySearch(ip)
    return str(res['city_id']) + '|' + res['region'].decode('utf-8')

插入数据库代码片段

# 这里略过pytho连接mysql数据的过程，直接进入逻辑处理部分
def insert_log(cursor, connect, access_log, logger):
    column_list = ['ip', 'ext1', 'ext2', 'createTime', 'microSecond', 'api', 'statusCode', 'bodyBytes', 'httpReferer',
                   'userAgent']
    df = pd.read_csv(access_log, sep=' ', header=None, names=column_list).head()
    logger.info("%s has %s lines" % (access_log, df.count(axis=1).size))

    # 不同字段进行处理, 满足mysql表插入字段定义
    df['ipLast'] = df.ip.apply(lambda ip: ip_last_n(ip, n=3))
    df['geo'] = df.ip.apply(lambda ip: ip_to_geo(ip, db_path=db_path))
    df['newCreateTime'] = df.createTime.apply(time_format)
    df['newApi'] = df.api.apply(lambda x: cut_str(x, 100))
    df['newHttpReferer'] = df.httpReferer.apply(lambda x: cut_str(x, 200))
    df['newUserAgent'] = df.userAgent.apply(lambda x: cut_str(x, 500))
    list_with_ip = ['ip', 'ipLast', 'geo', 'newCreateTime', 'newApi', 'statusCode', 'bodyBytes', 'newHttpReferer', 'newUserAgent']

    list_insert_with_ip = df[list_with_ip].values.tolist()
    logger.info("%s first line is : %s" % (access_log, list_insert_with_ip[0]))

    insert_num = 0
    for line in list_insert_with_ip:
        insert_num += 1
        line_with_ip = [line[0]] + line[2:]
        line_without_ip = line[1:]
        try:
            cur.execute(sql_without_ip, tuple(line_without_ip))
        except Exception as e:
            logger.error("%s. n %s insert failed when to insert % line!! insert line: %s" % (e, access_log, insert_num, line))
            break
    conn.commit()

以上日志数据，经过简单处理后，插入到mysql数据库中，表信息如下:

CREATE TABLE `access_log_without_ip` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `eid` varchar(20) NOT NULL DEFAULT '' COMMENT '重复可能性大',
  `geo` varchar(60) NOT NULL DEFAULT '' COMMENT '地域信息:城市Id|国家|区域|省份|城市|ISP',
  `createTime` timestamp NOT NULL DEFAULT '2000-01-01 00:00:00' COMMENT '访问时间',
  `api` varchar(100) NOT NULL DEFAULT '' COMMENT '访问api',
  `statusCode` smallint(6) NOT NULL DEFAULT '0' COMMENT 'http请求返回状态码',
  `bodyBytes` int(11) NOT NULL DEFAULT '0' COMMENT '请求文件大小',
  `httpReferer` varchar(200) NOT NULL DEFAULT '' COMMENT '访问url',
  `userAgent` varchar(500) NOT NULL DEFAULT '' COMMENT 'ua',
  PRIMARY KEY (`id`),
  UNIQUE KEY `userAccess` (`eid`,`geo`,`createTime`,`api`)
) ENGINE=InnoDB AUTO_INCREMENT=324700 DEFAULT CHARSET=utf8mb4

取前几行样例如下:

mysql> select * from access_log_without_ip where date(createTime)='2019-08-07' limit 100,1 G;
*************************** 1. row ***************************
         id: 321435
        eid: 18
        geo: 995|中国|0|上海|上海市|联通
 createTime: 2019-08-07 08:43:11
        api: GET /superset/fave_dashboards/3/ HTTP/1.1
 statusCode: 200
  bodyBytes: 2
httpReferer: http://superset.workingindata.com/superset/welcome
  userAgent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
1 row in set (0.09 sec)

日志分析

进入练习环境见文章开始的视频，下面简单的几个日志统计如下:

每日访问pv分布

mysql> select date(createTime) as day, count(*) as pv from access_log_without_ip group by date(createTime) order by day ;
+------------+-------+
| day        | pv    |
+------------+-------+
| 2019-07-18 |   157 |
| 2019-07-19 |    59 |
| 2019-07-20 |    58 |
| 2019-07-21 |  1033 |
| 2019-07-22 |  2020 |
| 2019-07-23 |  1513 |
| 2019-07-24 |  1520 |
| 2019-07-25 |  1929 |
| 2019-07-26 |  6152 |
| 2019-07-27 |    54 |
| 2019-07-28 |  4448 |
| 2019-07-29 |  5576 |
| 2019-07-30 | 47677 |
| 2019-07-31 |  6212 |
| 2019-08-01 |  5285 |
| 2019-08-02 |  7351 |
| 2019-08-03 |  2468 |
| 2019-08-04 |  1784 |
| 2019-08-05 |  2211 |
| 2019-08-06 |  2629 |
| 2019-08-07 |  1636 |
+------------+-------+
21 rows in set (0.10 sec)

24h用户访问分布

数据上看，每天的中午是访问的高峰期

mysql> select hour(createTime) as h, count(*) as pv from access_log_without_ip group by hour(createTime) order by h;
+------+-------+
| h    | pv    |
+------+-------+
|    0 |  1286 |
|    1 |   563 |
|    2 |   303 |
|    3 |   314 |
|    4 |   640 |
|    5 |   421 |
|    6 |   446 |
|    7 |   912 |
|    8 |  1790 |
|    9 |  4298 |
|   10 |  9555 |
|   11 | 11539 |
|   12 |  7925 |
|   13 | 10191 |
|   14 | 12397 |
|   15 |  9847 |
|   16 |  5508 |
|   17 |  4177 |
|   18 |  5187 |
|   19 |  3901 |
|   20 |  2237 |
|   21 |  3329 |
|   22 |  3312 |
|   23 |  1694 |
+------+-------+
24 rows in set (0.08 sec)

历史访问总量的省份top 10

看看，top3城市缩写熟不熟悉，"北上广"，所以，码农多不是没原因的。

mysql> SELECT 
    -> substring_index(SUBSTRING_INDEX(geo, '|', 4), '|', -1)
    -> as city,
    -> count(*) as pv 
    -> from access_log.access_log_without_ip 
    -> where date(createTime)> date_sub(date(createTime), interval 15 day)
    -> group by substring_index(SUBSTRING_INDEX(geo, '|', 4), '|', -1)
    -> order by pv desc limit 10;
+-----------+-------+
| city      | pv    |
+-----------+-------+
| 北京      | 23750 |
| 广东省    | 18112 |
| 上海      | 10934 |
| 重庆      | 10767 |
| 福建省    | 10698 |
| 辽宁省    |  4849 |
| 浙江省    |  4086 |
| 山东省    |  3283 |
| 江苏省    |  2605 |
| 湖北省    |  1714 |
+-----------+-------+
10 rows in set (0.22 sec)

统计访问最多的url Top3

用户目的很明确，就是进入SQL实验室练习SQL，这也是我希望看到的。大家加油！

mysql> select httpReferer, count(*) pv from access_log.access_log_without_ip group by httpReferer order by pv desc limit 3;
+----------------------------------------------------+-------+
| httpReferer                                        | pv    |
+----------------------------------------------------+-------+
| http://superset.workingindata.com/superset/sqllab  | 62033 |
| http://superset.workingindata.com/login/           | 13940 |
| http://superset.workingindata.com/superset/welcome |  4794 |
+----------------------------------------------------+-------+
3 rows in set (0.59 sec)

精选文章

数据民工来取经儿：超级菜鸟如何入坑数据分析

一直住顶楼

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
access重复数据累计_极简教程: 网站日志数据分析实战(四)

目的阅读对象: 7天SQL基础+7天Python同学预期收获: 接触真实的日志数据关于本文章涉及到的日志数据:数据来源: 访问superset学习网站生产的nginx真实的日志数据, IP属于隐私数据，所以映射为地域信息怎么使用: 大家直接进入学习网站，点击SQL Lab，选择data_sample, 再选access_log即可看到为方便练习SQL，提供一个公共账号, 用户名:hi-zhihu...
复制链接

扫一扫