Data source: apache.log
86.149.9.216 10001 17/05/2015:10:05:30 GET /presentations/logstash-monitorama-2013/images/github-contributions.png
83.149.9.216 10002 17/05/2015:10:06:53 GET /presentations/logstash-monitorama-2013/css/print/paper.css
83.149.9.216 10002 17/05/2015:10:06:53 GET /presentations/logstash-monitorama-2013/css/print/paper.css
83.149.9.216 10002 17/05/2015:10:06:53 GET /presentations/logstash-monitorama-2013/css/print/paper.css
83.149.9.216 10002 17/05/2015:10:06:53 GET /presentations/logstash-monitorama-2013/css/print/paper.css
83.149.9.216 10002 17/05/2015:10:06:53 GET /presentations/logstash-monitorama-2013/css/print/paper.css
83.149.9.216 10002 17/05/2015:10:06:53 GET /presentations/logstash-monitorama-2013/css/print/paper.css
10.0.0.1 10003 17/05/2015:10:06:53 POST /presentations/logstash-monitorama-2013/css/print/paper.css
10.0.0.1 10003 17/05/2015:10:07:53 POST /presentations/logstash-monitorama-2013/css/print/paper.css
10.0.0.1 10003 17/05/2015:10:08:53 POST /presentations/logstash-monitorama-2013/css/print/paper.css
10.0.0.1 10003 17/05/2015:10:09:53 POST /presentations/logstash-monitorama-2013/css/print/paper.css
10.0.0.1 10003 17/05/2015:10:10:53 POST /presentations/logstash-monitorama-2013/css/print/paper.css
10.0.0.1 10003 17/05/2015:10:16:53 POST /presentations/logstash-monitorama-2013/css/print/paper.css
10.0.0.1 10003 17/05/2015:10:26:53 POST /presentations/logstash-monitorama-2013/css/print/paper.css
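Requirements:
Compute the site's PV (page views: the total number of requests)
Compute the site's UV (unique visitors: the number of distinct users)
List the IPs that visited the site
Find the page with the most visits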
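Each record in the sample appears to hold five space-separated fields: client IP, user ID, timestamp, HTTP method, and request path. The sketch below splits one line into those fields. The `LogRecord` type, the `parse_line` helper, and the field names are illustrative assumptions inferred from the sample, as is the timestamp layout.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record type; field names are inferred from the sample data.
LogRecord = namedtuple("LogRecord", ["ip", "user_id", "timestamp", "method", "path"])

def parse_line(line):
    """Split one log line into its five space-separated fields."""
    ip, user_id, ts, method, path = line.split(" ")
    # Timestamp layout assumed to be day/month/year:hour:minute:second.
    timestamp = datetime.strptime(ts, "%d/%m/%Y:%H:%M:%S")
    return LogRecord(ip, user_id, timestamp, method, path)

record = parse_line(
    "10.0.0.1 10003 17/05/2015:10:06:53 POST "
    "/presentations/logstash-monitorama-2013/css/print/paper.css"
)
print(record.ip, record.user_id, record.method)
```

PV counts records, UV counts distinct `user_id` values, the IP list uses `ip`, and the page ranking uses `path`.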
Code:
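from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Split each line into fields: IP, user ID, timestamp, method, path.
    file_rdd = sc.textFile("../../data/input/apache.log")
    fields_rdd = file_rdd.map(lambda line: line.split(" "))
    fields_rdd.cache()  # reused by several actions below

    # PV: every request line counts as one page view.
    print("PV of the site:", fields_rdd.count())

    # UV: distinct users, assuming the second field is the user ID.
    # (Counting distinct raw lines would count distinct requests, not users.)
    print("UV of the site:", fields_rdd.map(lambda x: x[1]).distinct().count())

    # IPs that visited the site.
    ip_rdd = fields_rdd.map(lambda x: x[0]).distinct()
    print("IPs that visited the site:", ip_rdd.collect())

    # Most visited page: count hits per path, then take the largest count.
    page_rdd = fields_rdd.map(lambda x: (x[-1], 1)).reduceByKey(lambda a, b: a + b)
    top_page = page_rdd.takeOrdered(1, lambda x: -x[1])[0]
    print("Most visited page:", top_page[0], "visited", top_page[1], "times")

    sc.stop()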
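To sanity-check the Spark results, the same four metrics can be computed in plain Python on the 14 sample records. This is a minimal sketch, not part of the original job: the records are rebuilt compactly from the sample data, and UV is taken as the number of distinct values in the second field, which is assumed to be the user ID.

```python
from collections import Counter

PAPER = "/presentations/logstash-monitorama-2013/css/print/paper.css"
PNG = "/presentations/logstash-monitorama-2013/images/github-contributions.png"

# The 14 sample records, rebuilt compactly from the data source listing.
lines = [f"86.149.9.216 10001 17/05/2015:10:05:30 GET {PNG}"]
lines += [f"83.149.9.216 10002 17/05/2015:10:06:53 GET {PAPER}"] * 6
lines += [
    f"10.0.0.1 10003 17/05/2015:10:{minute}:53 POST {PAPER}"
    for minute in ("06", "07", "08", "09", "10", "16", "26")
]

fields = [line.split(" ") for line in lines]

pv = len(fields)                      # every request counts once
uv = len({f[1] for f in fields})      # distinct user IDs (assumed field)
ips = sorted({f[0] for f in fields})  # distinct client IPs
top_page, top_hits = Counter(f[-1] for f in fields).most_common(1)[0]

print(pv, uv, ips, top_page, top_hits)
```

On this sample the sketch gives a PV of 14, a UV of 3, three distinct IPs, and paper.css as the most visited page with 13 hits.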