Requirement
Find the top 5 users by page view count.
Analysis
1. Extract the user id from each record
2. Sum the page views per user
3. Swap key and value, sort descending, then swap back
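The three steps above can be sketched on a plain Scala collection before touching Spark. The sample lines and user ids below are invented for illustration; only the tab-separated layout with the user id at index 5 mirrors the real page_views.dat:

```scala
// Hypothetical sample lines imitating page_views.dat: tab-separated,
// user id at index 5 (all field values here are made up).
val sample = Seq(
  "t\turl\tref\tip\tsess\tu1\tcountry",
  "t\turl\tref\tip\tsess\tu2\tcountry",
  "t\turl\tref\tip\tsess\tu1\tcountry"
)
val top5 = sample
  .map(line => (line.split("\t")(5), 1))                    // 1. extract user id
  .groupBy(_._1)
  .map { case (user, hits) => (user, hits.map(_._2).sum) }  // 2. sum per user
  .toSeq
  .sortBy { case (_, n) => -n }                             // 3. rank descending
  .take(5)
top5.foreach(println)
```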
1. Read the file
val file=sc.textFile("/opt/data/page_views.dat")
2. Split each line on the tab character, take the field at index 5 (the user id), and map each id to a count of 1
val user=file.map(x =>(x.split("\t")(5),1))
Preview of the result:
scala> user.collect.take(10).foreach(println)
(NULL,1)
(134939954,1)
(NULL,1)
(NULL,1)
(NULL,1)
(NULL,1)
(NULL,1)
(NULL,1)
(5305018,1)
(NULL,1)
3. Sum the page views per user (note: `collect.take(10)` below pulls the entire RDD to the driver first; `take(10)` alone would be cheaper for a preview)
scala> val pageview = user.reduceByKey(_+_)
scala> pageview.collect.take(10).foreach(println)
(125134732,7)
(123968302,1)
(121444974,1)
(85118917,2)
(134519579,1)
(126322710,12)
(116495558,1)
(134909433,5)
(124158904,9)
(114816337,3)
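reduceByKey merges all values that share a key with the given function, here `_ + _`. Its effect can be simulated on a local collection with a fold (the pairs below are made-up examples in the same shape as the (userId, 1) tuples above):

```scala
// Simulating reduceByKey(_ + _): fold (userId, 1) pairs into per-key sums.
val pairs = Seq(("125134732", 1), ("125134732", 1), ("123968302", 1))
val summed = pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
  acc.updated(k, acc.getOrElse(k, 0) + v)
}
println(summed)
```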
4. Swap key and value, sort descending, then swap back
scala> val sort = pageview.map(x => (x._2, x._1)).sortByKey(false).map(x => (x._2, x._1))
5. Final result
scala> sort.collect.take(5).foreach(println)
(NULL,60871)
(123626648,40)
(116191447,38)
(122764680,34)
(85252419,30)
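Note that the NULL key (records with no user id) dominates the top 5. If such records should be excluded, a filter before ranking drops them; a sketch on a local collection, where on the RDD the equivalent would be `pageview.filter(_._1 != "NULL")`:

```scala
// Drop the NULL user before ranking (sample pairs taken from the output above).
val pageview = Seq(("NULL", 60871), ("123626648", 40), ("116191447", 38))
val top = pageview
  .filter { case (user, _) => user != "NULL" }
  .sortBy { case (_, n) => -n }
top.foreach(println)
```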
6. Build the code with IDEA + Maven and submit with spark-submit
Code:
package com.spark.demo

import org.apache.spark.{SparkConf, SparkContext}

object sparkpageview {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    // args(0): input path, e.g. /opt/data/page_views.dat
    val files = sc.textFile(args(0))
    val result = files
      .map(x => (x.split("\t")(5), 1))  // extract user id, count 1 per view
      .reduceByKey(_ + _)               // sum views per user
      .map(x => (x._2, x._1))           // swap to (count, user)
      .sortByKey(false)                 // sort descending by count
      .map(x => (x._2, x._1))           // swap back to (user, count)
    result.take(5).foreach(println)
    sc.stop()
  }
}
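As an aside, RDDs also offer `sortBy`, which ranks by a derived key directly and avoids the swap-sort-swap dance; on the RDD above that would read `result.sortBy(_._2, ascending = false)`. The same idea on a local collection:

```scala
// Sorting (user, count) pairs by count descending without swapping key/value
// (the pairs are made-up examples).
val counts = Seq(("u1", 3), ("u2", 7), ("u3", 1))
val ranked = counts.sortBy { case (_, n) => -n }
println(ranked)
```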
Package the jar, upload it to the Linux server, and submit the job:
./spark-submit \
--class com.spark.demo.sparkpageview \
--master local[2] \
/opt/lib/SparkCore-1.0.jar \
/opt/data/page_views.dat