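Life is about accumulation. The difficulties and problems you run into will all become part of your wealth, and a blog is the best record of how you grew along the way.
-- For you, busy with your homework
A Spark project that analyzes web URL data, to deepen your understanding of RDDs.
Requirement: for each domain, find the three most-visited URLs.
Data format (each line holds a URL, a timestamp in brackets, and a number, separated by tabs):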
https://blog.csdn.net/qq_43688472/article/details/84307884 [2015-07-07 13:52:58] 54
https://item.jd.com/26838388932.html [2008-03-04 15:47:37] 81
https://blog.csdn.net/qq_24073707/article/details/80665991 [5002-10-17 09:20:02] 73
https://hizero.taobao.com/?spm=a217m.8316598.682372.3.426d33d5KRgTIh [2009-05-29 20:02:43] 63
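To make the record layout concrete, here is a minimal sketch of parsing one line, assuming the three fields are tab-separated (the job's code splits on "\t"). The names `Visit`, `ParseLine`, and `trailing` are hypothetical, and the post does not say what the trailing number means, so it is kept as a plain integer.

```scala
import java.net.URL

// Hypothetical record type for one parsed line; not part of the original job.
final case class Visit(url: String, host: String, timestamp: String, trailing: Int)

object ParseLine {
  // Assumes exactly three tab-separated fields: URL, [timestamp], integer.
  def parse(line: String): Visit = {
    val fields = line.split("\t")
    val url = fields(0)
    // java.net.URL extracts the domain, e.g. "blog.csdn.net"
    val host = new URL(url).getHost
    // Drop the surrounding square brackets from the timestamp
    val timestamp = fields(1).stripPrefix("[").stripSuffix("]")
    Visit(url, host, timestamp, fields(2).trim.toInt)
  }
}
```

Only the URL and its host matter for the task here; the other two fields are ignored by the job.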
Implementation:
import java.net.URL
import org.apache.spark.{SparkConf, SparkContext}

object UrlCount01 {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("UrlCount01")
    val sc = new SparkContext(sparkConf)
    val line = sc.textFile("file:///E:\\data.txt")

    // Optional: filter the data, because some lines may not be in URL format.
    // Caution: every sample line above contains a timestamp with colons, so
    // split(":") yields at least 3 parts and this condition would drop them all;
    // adjust it to your data before enabling.
    // println("line: " + line)
    // val rdd1 = line.filter(x => {
    //   val tmp = x.split(":")
    //   if (tmp.length >= 3 || x.contains("[["))
    //     false
    //   else
    //     true
    // }).map(x => {
    //   val data = x.split("\t")
    //   (data(0), 1)
    // })

    // Map each line to (url, 1); the fields are tab-separated
    val rdd1 = line.map(x => {
      val data = x.split("\t")
      (data(0), 1)
    })

    // Count the visits for each URL
    val rdd2 = rdd1.reduceByKey((x, y) => x + y)

    // Side note: merging two Maps by summing the values of matching keys
    // ( map1 /: map2 ) { case (map, (k,v)) => map + ( k -> (v + map.getOrElse(k, 0)) ) }

    // Extract the host of each URL: (host, url, count)
    val rdd3 = rdd2.map { case (d, t) =>
      val host = new URL(d).getHost
      (host, d, t)
    }

    // Group the records by host
    val rdd4 = rdd3.groupBy(_._1)

    // Sort each group by count in descending order and keep the top three
    val rdd5 = rdd4.map(sx => {
      val key = sx._1
      val value = sx._2
      val sorval = value.toList.sortBy(-_._3).take(3)
      (key, sorval)
    })

    rdd5.foreach(println)

    // Save the results to a local directory
    // rdd5.saveAsTextFile("E:\\data2")
    sc.stop()
  }
}
Some optional steps are left in the code as comments; uncomment whichever ones you need.
Let's take a look at the result:
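If you do want to skip malformed lines, one alternative to the commented-out filter is to validate each line by actually parsing it. The sketch below is only an illustration under the same tab-separated assumption; `LineFilter` and `urlOf` are hypothetical names, not part of the original job.

```scala
import java.net.URL
import scala.util.Try

object LineFilter {
  // Returns the URL field if the line has three tab-separated fields and the
  // first one parses as a URL with a non-empty host; otherwise returns None.
  def urlOf(line: String): Option[String] = {
    val fields = line.split("\t")
    if (fields.length != 3) None
    else {
      // new URL(...) throws MalformedURLException on bad input; Try captures it
      val parsed = Try(new URL(fields(0))).toOption
      parsed
        .filter(u => Option(u.getHost).exists(_.nonEmpty))
        .map(_ => fields(0))
    }
  }
}
```

In the Spark job this could replace the first map step with something like `line.flatMap(l => LineFilter.urlOf(l).toList).map((_, 1))`, so bad lines are dropped instead of failing the task.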
(wengna.taobao.com,List((wengna.taobao.com,https://wengna.taobao.com/?spm=a217m.8316598.682375.7.426d33d5KRgTIh,2)))
(takefired.taobao.com,List((takefired.taobao.com,https://takefired.taobao.com/?spm=a217m.8316598.711275.5.426d33d5KRgTIh,1)))
(blog.csdn.net,List((blog.csdn.net,https://blog.csdn.net/qq_43688472/article/details/84940873,1), (blog.csdn.net,https://blog.csdn.net/qq_24073707/article/details/80988329,1), (blog.csdn.net,https://blog.csdn.net/qq_24073707/article/details/80658301,1)))
(12cmlook.taobao.com,List((12cmlook.taobao.com,https://12cmlook.taobao.com/?spm=a217m.8316598.682375.9.426d33d5KRgTIh,1)))
(item.jd.com,List((item.jd.com,https://item.jd.com/25619900612.html,1), (item.jd.com,https://item.jd.com/19997245287.html,1), (item.jd.com,https://item.jd.com/100001625726.html,1)))
(unawares.taobao.com,List((unawares.taobao.com,https://unawares.taobao.com/?spm=a217m.8316598.682348.7.426d33d5KRgTIh,2)))
(mp.csdn.net,List((mp.csdn.net,https://mp.csdn.net/mdeditor/84307884#,1)))
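To check the count, group, and sort logic without starting Spark, the same steps can be sketched on plain Scala collections. This is a local illustration only, not the RDD code itself; `TopUrls` and `topPerHost` are hypothetical names.

```scala
import java.net.URL

object TopUrls {
  // For each host, keep the n URLs with the highest visit counts.
  def topPerHost(urls: Seq[String], n: Int): Map[String, List[(String, Int)]] = {
    // Step 1: count visits per URL (the local analogue of reduceByKey)
    val counts = urls.groupBy(identity).map { case (url, hits) => (url, hits.size) }
    // Step 2: group by host, then sort each group by count, highest first
    counts.toList
      .groupBy { case (url, _) => new URL(url).getHost }
      .map { case (host, pairs) =>
        host -> pairs.sortBy { case (_, count) => -count }.take(n)
      }
  }
}
```

Sorting on the negated count is what makes `take(n)` return the most-visited URLs; an ascending sort would return the least-visited ones instead.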
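Effort always pays off. Believe in yourself: even if you are slow, what matters is that you keep moving forward. Don't be afraid, and keep going!
-- For you, who keep working hard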