Clickstream Log Analysis with PySpark

Prepare the log
# cat access.log
194.237.142.21 - - [18/Sep/2019:06:49:18 +0000] "GET /wp-content/uploads/2019/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
183.49.46.228 - - [18/Sep/2019:06:49:23 +0000] "-" 400 0 "-" "-"
163.177.71.12 - - [18/Sep/2019:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2019:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2019:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2019:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
101.226.68.137 - - [18/Sep/2019:06:49:42 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
101.226.68.137 - - [18/Sep/2019:06:49:45 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
60.208.6.156 - - [18/Sep/2019:06:49:48 +0000] "GET /wp-content/uploads/2019/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
60.208.6.156 - - [18/Sep/2019:06:49:48 +0000] "GET /wp-content/uploads/2019/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
60.208.6.156 - - [18/Sep/2019:06:49:48 +0000] "GET /wp-content/uploads/2019/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2019:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2019:06:50:08 +0000] "-" 400 0 "-" "-"
222.68.172.190 - - [18/Sep/2019:06:50:08 +0000] "-" 400 0 "-" "-"
222.68.172.190 - - [18/Sep/2019:06:50:08 +0000] "-" 400 0 "-" "-"
222.68.172.190 - - [18/Sep/2019:06:50:08 +0000] "-" 400 0 "-" "-"
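
The sample lines follow the common "combined" access log layout: client IP, two placeholder fields, timestamp, request line, status code, response size, referer, and user agent. As a rough sketch of how one line could be split into named fields, the snippet below uses a regular expression. The names `LOG_PATTERN` and `parse_line` are illustrative assumptions and are not part of the original script, which only needs the first field (the IP).

```python
import re

# Assumed layout (combined log format):
# ip identity user [time] "request" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match the layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


# One of the malformed requests from the sample log (request line is "-").
sample = '183.49.46.228 - - [18/Sep/2019:06:49:23 +0000] "-" 400 0 "-" "-"'
record = parse_line(sample)
print(record["ip"], record["status"], record["request"])
```

Lines with status 400 and a request of "-" still count as visits in the script below. If they should be excluded, a filter on the parsed `status` field would be one way to do it.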
Analyzing the clickstream log
# -*- coding: utf-8 -*-
# @Time    : 2019/12/15 8:42
# @Author  :

import os
from operator import add

from pyspark import SparkConf
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "/usr/bin/python3"

# Local mode with 2 worker threads
#master = "local[2]"
# Cluster mode (standalone master)
master = "spark://192.168.18.126:7077"
appName = "pv_uv_TopN"

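sc_conf = SparkConf()
sc_conf.setMaster(master)
# Pass the conf to the builder; otherwise the master set above is never applied
spark = SparkSession.builder.config(conf=sc_conf).appName(appName).getOrCreate()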

sc = spark.sparkContext

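# Read the data (with a file:// path in cluster mode, the file must exist
# at the same path on every worker node)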
rdd1 = sc.textFile("file:///root/access.log")

# ****************1. Count total visits for the day (PV)
# Tag every line with the key "pv"
rdd_total = rdd1.map(lambda x: ("pv", 1))

# Sum the values that share the same key
rdd_total_add = rdd_total.reduceByKey(add)
print(rdd_total_add.collect())

# ****************2. Count distinct visitors for the day (UV), one per client IP
rdd_ips = rdd1.map(lambda x: x.split(" ")).map(lambda x: x[0])
# rdd_ips_count = rdd_ips.distinct().count()
rdd_ips_count = rdd_ips.distinct().map(lambda x: ("uv", 1)).reduceByKey(lambda a, b: a+b).collect()
print(rdd_ips_count)

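# ****************3. Find the 3 most frequent visitors of the day
rdd_ips_tuple = rdd_ips.map(lambda x: (x, 1))
rdd_ips_tuple_add = rdd_ips_tuple.reduceByKey(add)
# Sort by visit count (descending), keep only IPs with at least 2 visits, take the first 3
rdd_ips_top3 = rdd_ips_tuple_add.sortBy(lambda x: x[1], ascending=False).filter(lambda x: x[1] >= 2).take(3)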
print(rdd_ips_top3)

Output:
>[('pv', 16)]
>[('uv', 6)]
>[('222.68.172.190', 5), ('163.177.71.12', 4), ('60.208.6.156', 3)]
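
On a sample this small, the three results are easy to cross-check without Spark. The sketch below recomputes PV, UV, and the top 3 with `collections.Counter`, using the client IPs of the 16 sample lines. The `ips` list is transcribed by hand from the log above. It is only a sanity check, not a replacement for the RDD pipeline.

```python
from collections import Counter

# Client IPs of the 16 sample lines, in file order (first field of each line).
ips = (
    ["194.237.142.21"]
    + ["183.49.46.228"]
    + ["163.177.71.12"] * 4
    + ["101.226.68.137"] * 2
    + ["60.208.6.156"] * 3
    + ["222.68.172.190"] * 5
)

pv = len(ips)                        # total visits: one per log line
uv = len(set(ips))                   # distinct client IPs
top3 = Counter(ips).most_common(3)   # (ip, visits) pairs, highest count first

print(pv)
print(uv)
print(top3)
```

The printed values match the Spark output: 16 visits, 6 distinct IPs, and the same three IPs in the same order.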