Note:
Just make up some simple sample data yourself; when you're done, paste your code below each question!
Question 1:
Given a data file test.txt, delimited by "\t", with fields id, time, and url, use Spark Core to implement grouped top-N: find the three most-used search engines.
Sample data:
id time url
2 11:08:23 google
3 12:09:11 baidu
1 08:45:56 sohu
2 16:42:17 yahoo
1 23:10:34 baidu
5 06:23:05 google
6 07:45:56 sohu
4 18:42:17 yahoo
5 24:10:34 baidu
1 04:23:05 google
7 16:42:17 yahoo
8 23:10:34 baidu
10 06:23:05 google
11 07:45:56 sohu
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("test"))
  val rdd: RDD[String] = sc.textFile("datas/test.txt")
  // Count each url (field index 2), sort by count descending, and take the top 3
  // on the RDD itself rather than collecting everything to the driver first.
  val tuples: Array[(String, Int)] = rdd.map(x => (x.split("\t")(2), 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(3)
  tuples.foreach(println)
  sc.stop()
}
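sortBy forces a full shuffle just to keep three rows; RDD.top instead keeps a bounded number of candidates per partition. A minimal alternative sketch, reusing the same rdd and field layout as above:

// Same result without a full sort: top(3) ordered by the count.
val top3: Array[(String, Int)] = rdd.map(x => (x.split("\t")(2), 1))
  .reduceByKey(_ + _)
  .top(3)(Ordering.by(_._2))
top3.foreach(println)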
Question 2:
For the passage below, find
1) how many times "Spark" appears
2) which word appears most often
Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website This documentation is for Spark version Spark uses Hadoop s client libraries for HDFS and YARN.Downloads are pre packaged for a handful of popular Hadoop versions Users can also download a Hadoop free binary and run Spark with any Hadoop version [by augmenting Spark s classpath](http://spark.apache.org/docs/latest/hadoop-provided.html) Scala and Java users can include Spark in their projects using its Maven coordinates and in the future Python users can also install Spark from PyPI
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("test"))
  val rdd: RDD[String] = sc.textFile("datas/words.txt")
  // Split on spaces and punctuation; both sub-questions reuse the same tokenised RDD.
  val words: RDD[String] = rdd.flatMap(_.split(" |/|\\[|\\]|:|\\.|\\(|\\)|-")).filter(_.nonEmpty)
  // 1) Occurrences of "Spark", case-insensitive so the lowercase "spark" in the URLs counts too.
  val count: Long = words.filter(_.equalsIgnoreCase("spark")).count()
  println(count)
  // 2) Word count sorted descending; the first element is the most frequent word.
  val tuples: Array[(String, Int)] = words.map((_, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).take(1)
  tuples.foreach(println)
  sc.stop()
}
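Note that take(1) returns a single word even when several words tie for the highest count. If every tied word should be reported, one possible sketch (building on the words RDD above) first finds the maximum count and then filters:

// Keep every word whose count equals the maximum, not just the first one found.
val counts = words.map((_, 1)).reduceByKey(_ + _)
val maxCount: Int = counts.map(_._2).max()
counts.filter(_._2 == maxCount).collect().foreach(println)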
Question 3:
The data file peopleinfo.txt in the HDFS directory /data contains three columns: id, gender, and height, in the following form:
1 F 170
2 M 178
3 M 174
4 F 165
5 M 179
6 F 160
Write a Spark application that processes peopleinfo.txt on HDFS and computes the total number of males, the total number of females, the maximum male height, the maximum female height, the minimum male height, and the minimum female height.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("test"))
  val rdd: RDD[String] = sc.textFile("datas/peopleinfo.txt")
  // Parse (gender, height), group by gender, then compute count, max and min per group.
  val result: Array[(String, Int, Int, Int)] = rdd.map(x => {
    val strings: Array[String] = x.split(" ")
    (strings(1), strings(2).toInt)
  }).groupBy(_._1).map(x => {
    val heights = x._2.map(_._2)
    (x._1, heights.size, heights.max, heights.min)
  }).collect()
  result.foreach(println)
  sc.stop()
}
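groupBy ships every height to a single task per gender. The same statistics can be pre-aggregated per partition with aggregateByKey, so only one (count, max, min) triple per partition is shuffled; a sketch assuming the same space-separated input and the rdd defined above:

// Fold each partition into (count, max, min) per gender before the shuffle.
val stats = rdd.map { line =>
  val f = line.split(" ")
  (f(1), f(2).toInt)
}.aggregateByKey((0, Int.MinValue, Int.MaxValue))(
  { case ((n, hi, lo), h) => (n + 1, math.max(hi, h), math.min(lo, h)) },
  { case ((n1, hi1, lo1), (n2, hi2, lo2)) => (n1 + n2, math.max(hi1, hi2), math.min(lo1, lo2)) }
)
stats.collect().foreach { case (g, (n, hi, lo)) => println(s"$g count=$n max=$hi min=$lo") }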
Question 4:
Suppose we have stock data in a CSV file with columns stock code (sid), timestamp (time), and trade price (price) (table name t1):
sh600794,2019-09-02 09:00:10,25.51
sh603066,2019-09-02 09:00:10,15.51
sh600794,2019-09-02 09:00:20,25.72
sh603066,2019-09-02 09:00:20,15.72
sh600794,2019-09-02 09:00:30,25.83
sh603066,2019-09-02 09:00:30,15.83
sh600794,2019-09-02 09:00:40,25.94
sh603066,2019-09-02 09:00:40,15.94
sh600794,2019-09-02 09:00:50,26.00
sh603066,2019-09-02 09:00:50,16.00
sh600794,2019-09-02 09:10:00,25.98
sh603066,2019-09-02 09:10:00,25.98
sh600794,2019-09-02 15:00:00,25.50
sh603066,2019-09-02 15:00:00,16.00
Use Spark Core to find all daily peak and trough prices for each stock.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("test"))
  val rdd: RDD[String] = sc.textFile("datas/test.csv")
  // Group by (sid, day) so extremes are computed per stock per day, and collect
  // before printing so the output appears on the driver, not on the executors.
  rdd.map(str => {
    val strings: Array[String] = str.split(",")
    ((strings(0), strings(1).split(" ")(0)), strings(2).toDouble)
  }).groupByKey().map { case ((sid, day), prices) =>
    (sid, day, prices.max, prices.min)
  }.collect().foreach(println)
  sc.stop()
}
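The grouped max/min above only yields each day's single high and low. If "all peaks and troughs" means every local extremum over the day's time-ordered ticks, a sketch that sorts each day's records and compares each price with its two neighbours; the 3-price window definition of a peak/trough is an assumption, not given in the problem:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("test"))
  val rdd: RDD[String] = sc.textFile("datas/test.csv")
  rdd.map { str =>
    val f = str.split(",")
    ((f(0), f(1).split(" ")(0)), (f(1), f(2).toDouble)) // ((sid, day), (timestamp, price))
  }.groupByKey().flatMap { case ((sid, day), ticks) =>
    // Sort the day's ticks chronologically, then slide a 3-price window:
    // a middle price above both neighbours is a peak, below both a trough.
    ticks.toList.sortBy(_._1).sliding(3).collect {
      case List((_, a), (t, b), (_, c)) if b > a && b > c => (sid, day, t, b, "peak")
      case List((_, a), (t, b), (_, c)) if b < a && b < c => (sid, day, t, b, "trough")
    }.toList
  }.collect().foreach(println)
  sc.stop()
}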