Log field format:
id,ip,url,ref,cookie,time_stamp
Upload the log file to HDFS; only the first 1000 lines are used here.
hadoop fs -put 1000_log hdfs://localhost:9000/user/root/input
Read the file directly in the Scala shell and compute the PV (page views) per host:
scala> val textFile = sc.textFile("hdfs://localhost:9000/user/root/input/1000_log")
scala> val textRDD = textFile.map(_.split("\t")).filter(_.length == 6)
scala> val result = textRDD.map(w => (new java.net.URL(w(2)).getHost, 1)).reduceByKey(_ + _).map(_.swap).sortByKey(false).map(_.swap)
scala> result.saveAsTextFile("hdfs://localhost:9000/user/root/out8.txt")
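The transformation chain above can be sketched in plain Scala (without Spark) to make each step visible; the sample log lines and hostnames below are made up for illustration, and `groupBy` stands in for `reduceByKey` on a local collection:

```scala
// Plain-Scala sketch of the PV pipeline above (no Spark needed).
// The log lines are hypothetical samples in the same tab-separated,
// six-field format: id, ip, url, ref, cookie, time_stamp.
val logLines = Seq(
  "1\t10.0.0.1\thttp://www.j1.com/a\t-\tck1\t1404000000",
  "2\t10.0.0.2\thttp://www.j1.com/b\t-\tck2\t1404000001",
  "3\t10.0.0.3\thttp://tieba.example.com/c\t-\tck3\t1404000002"
)

val result = logLines
  .map(_.split("\t"))
  .filter(_.length == 6)                            // keep only well-formed records
  .map(w => (new java.net.URL(w(2)).getHost, 1))    // extract the host from the url field
  .groupBy(_._1)                                    // local stand-in for reduceByKey
  .map { case (host, hits) => (host, hits.size) }
  .toSeq
  .sortBy(-_._2)                                    // descending by PV, like sortByKey(false)

result.foreach(println)
```

The swap/sortByKey(false)/swap dance in the Spark version exists because `sortByKey` orders by key; on newer Spark versions, `sortBy(_._2, ascending = false)` expresses the same thing in one call.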
Fetch the result from HDFS:
hadoop fs -get hdfs://localhost:9000/user/root/out8.txt
View the result:
$ more out8.txt/part-00000
(www.j1.com,126)
(tieba.