Common Spark Examples

Im_GaoYue

Article link: https://blog.csdn.net/weixin_42540606/article/details/81100882

Filtering and cleaning data

    // textFile() loads the data
    val data = sc.textFile("/spark/seven.txt")

    // Drop empty lines, then keep only request lines that contain GET or POST and the HTTP/1.0 marker
    val filtered = data.filter(_.length > 0).filter(line => (line.contains("GET") || line.contains("POST")) && line.contains("HTTP/1.0"))

    // Map each line to a (request, 1) pair
    val res = filtered.map(line => {
      if (line.contains("GET")) { // extract the substring from GET up to the protocol marker
        (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
      } else { // extract the substring from POST up to the protocol marker
        (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
      }
    }).reduceByKey(_ + _) // sum the counts per request

    // collect() is an action, so it triggers execution
    res.collect()
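
The extraction step can be tried without a cluster. The sketch below applies similar logic to plain Scala collections. The log lines and the helper name `requestKey` are made up for illustration, and it assumes, as the code above does, that each request line carries an `HTTP/1.0` marker.

```scala
// Hypothetical helper: returns "METHOD /path" for a log line, or None when the
// line has no GET/POST request or no HTTP/1.0 marker after it.
def requestKey(line: String): Option[String] = {
  val start = Seq("GET", "POST").map(m => line.indexOf(m)).filter(_ >= 0).sorted.headOption
  val end = line.indexOf("HTTP/1.0")
  start.filter(_ < end).map(s => line.substring(s, end).trim)
}

// Made-up sample lines in a common access-log layout
val sampleLines = Seq(
  "10.0.0.1 - - [01/Jul/1995:00:00:01] \"GET /index.html HTTP/1.0\" 200 1024",
  "10.0.0.2 - - [01/Jul/1995:00:00:02] \"POST /login HTTP/1.0\" 302 0",
  "10.0.0.3 - - [01/Jul/1995:00:00:03] \"GET /index.html HTTP/1.0\" 200 1024",
  "",
  "malformed line"
)

// Local equivalent of map + reduceByKey: count occurrences of each request
val counts: Map[String, Int] =
  sampleLines.flatMap(requestKey).groupBy(identity).map { case (k, v) => (k, v.size) }

counts.foreach { case (k, n) => println(k + "\t" + n) }
```

Returning an `Option` lets malformed lines drop out through `flatMap` instead of raising a `StringIndexOutOfBoundsException` inside `substring`.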

Data deduplication

Input data

file1:

2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c

file2:

2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c

Expected output:

2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d

The goal of deduplication is that every record occurring more than once in the input appears exactly once in the output.

val two = sc.textFile("/tmp/spark/two")
// Trim each line, drop blank lines, remove duplicates, then sort so the output order is stable
two.map(_.trim).filter(_.nonEmpty).distinct().sortBy(x => x).collect().foreach(println)
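
To check the expected output by hand, the same steps can be run on plain Scala collections. This is only a local sketch: `records` holds the sixteen sample records listed above, and the collection methods stand in for their RDD counterparts.

```scala
// The sixteen sample records from the input data, in the order listed
val records = Seq(
  "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
  "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c",
  "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
  "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c"
)

// Trim, drop blanks, remove duplicates, and sort lexicographically
val deduped = records.map(_.trim).filter(_.nonEmpty).distinct.sorted

deduped.foreach(println)
```

Note that the sort is a plain string comparison, which happens to order these dates correctly because every day number has a single digit.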

Maximum temperature per year

Input data

0067011990999991950051507004888888889999999N9+00001+9999999999999999999999

0067011990999991950051512004888888889999999N9+00221+9999999999999999999999

0067011990999991950051518004888888889999999N9-00111+9999999999999999999999

0067011990999991949032412004888888889999999N9+01111+9999999999999999999999

0067011990999991950032418004888888880500001N9+00001+9999999999999999999999

0067011990999991950051507004888888880500001N9+00781+9999999999999999999999

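Data format (offsets are 0-based, matching the code below):

Characters 15-18 hold the year.

Characters 45-49 hold the air temperature: a sign (+ for above zero, - for below zero) followed by four digits. A value of 9999 marks an invalid reading and must be excluded.

Character 50 is the quality code; only readings whose code is 0, 1, 4, 5, or 9 are valid.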
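
As a quick check of the record layout described above, the sketch below parses one of the sample lines locally. The `Reading` case class and the `parse` and `isValid` helpers are hypothetical names introduced only for this illustration; the offsets are 0-based.

```scala
// Hypothetical record type for one parsed line
case class Reading(year: String, temperature: Int, quality: Char)

// Year at offsets 15-18, sign at 45, four digits at 46-49, quality code at 50
def parse(line: String): Reading = {
  val sign = if (line.charAt(45) == '-') -1 else 1
  Reading(line.substring(15, 19), sign * line.substring(46, 50).toInt, line.charAt(50))
}

// A reading is usable when it is not the 9999 sentinel and its quality code is allowed
def isValid(r: Reading): Boolean =
  r.temperature != 9999 && "01459".contains(r.quality)

// Third sample line from the input data (a below-zero reading)
val sample = "0067011990999991950051518004888888889999999N9-00111+9999999999999999999999"
val reading = parse(sample)

println(reading)
println(isValid(reading))
```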

val one = sc.textFile("/tmp/hadoop/one")

// Parse the signed temperature at offsets 45-49; the leading '+' is skipped before converting
def parseTemp(line: String): Int =
  if (line.charAt(45) == '+') line.substring(46, 50).toInt
  else line.substring(45, 50).toInt

val yearAndTemp = one.filter(line => {
      // Keep only complete lines whose reading is not the 9999 sentinel
      // and whose quality code is one of 0, 1, 4, 5, 9
      line.length > 50 && parseTemp(line) != 9999 && line.substring(50, 51).matches("[01459]")
    }).map(line => (line.substring(15, 19), parseTemp(line))) // (year, temperature)

    // Keep the larger temperature for each year
    val res = yearAndTemp.reduceByKey(
      (x, y) => if (x > y) x else y
    )
    res.collect.foreach(x => println("year : " + x._1 + ", max : " + x._2))

Sorting data

Input files

file1:

file2:

file3:

654

756

65223

5956

650

Sample output:

1    2
2    6
3    15
4    22
5    26
6    32
7    32
8    54
9    92
10    650
11    654
12    756
13    5956
14    65223

val three = sc.textFile("/tmp/spark/three", 3)
// The input spans several files and therefore several partitions. A driver-side counter
// mutated inside map() is not updated reliably on a cluster, so sort globally and let
// zipWithIndex assign the 0-based position, which is then shifted to a 1-based rank.
three.map(_.trim).filter(_.nonEmpty).map(_.toInt).
  sortBy(x => x).
  zipWithIndex().
  map { case (num, i) => (i + 1, num) }.
  collect().foreach(x => println(x._1 + "\t" + x._2))
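
The rank assignment can be sketched locally on a plain Scala collection. The example uses only the five input values listed above; `values` and `ranked` are illustrative names.

```scala
// The five input values listed above
val values = Seq(654, 756, 65223, 5956, 650)

// Sort ascending, then pair each value with its 1-based position
val ranked = values.sorted.zipWithIndex.map { case (v, i) => (i + 1, v) }

ranked.foreach { case (rank, v) => println(rank + "\t" + v) }
```

Deriving the rank from the position in the sorted sequence avoids any shared mutable counter, which is why the same idea carries over to a distributed dataset.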

Average score

Input data

Math:

张三 88
李四 99
王五 66
赵六 77

China:

张三 78
李四 89
王五 96
赵六 67

English:

张三 80
李四 82
王五 84
赵六 86

Sample output:

张三 82
李四 90
王五 82
赵六 76

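val fourth = sc.textFile("/tmp/spark/fourth", 3)

// Parse "name score" lines (split on any whitespace), then aggregate (sum, count) per student.
// reduceByKey combines within each partition first, so it shuffles less than groupByKey.
fourth.map(_.trim).filter(_.nonEmpty).map(line => {
   val fields = line.split("\\s+")
   (fields(0), (fields(1).toInt, 1))
 }).reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)).map(x => {
   val avg = x._2._1.toDouble / x._2._2
   // Round to two decimals; this prints e.g. 76.67 where the sample output shows 76
   (x._1, math.round(avg * 100) / 100.0)
 }).collect.foreach(x => println(x._1 + "\t" + x._2))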
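
The arithmetic can be verified locally with the scores listed above. This is a sketch on plain Scala collections, with `scores` and `averages` as illustrative names; it rounds to two decimals, so the last student comes out as 76.67 rather than the truncated 76 shown in the sample output.

```scala
// (name, score) pairs from the three subjects, in the order Math, China, English
val scores = Seq(
  "张三" -> 88, "李四" -> 99, "王五" -> 66, "赵六" -> 77,
  "张三" -> 78, "李四" -> 89, "王五" -> 96, "赵六" -> 67,
  "张三" -> 80, "李四" -> 82, "王五" -> 84, "赵六" -> 86
)

// Group by student, average the scores, and round to two decimals
val averages: Map[String, Double] =
  scores.groupBy(_._1).map { case (name, xs) =>
    val avg = xs.map(_._2).sum.toDouble / xs.size
    (name, math.round(avg * 100) / 100.0)
  }

averages.foreach { case (name, avg) => println(name + "\t" + avg) }
```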

Finding the maximum and minimum

Input data

eightteen_a.txt:

102
10
39
109
200
11
3
90
28

eightteen_b.txt:

5
2
30
838
10005

Expected result

Max 10005
Min 2

val fifth = sc.textFile("/tmp/spark/fifth", 3)

// Parse one integer per line; cache the RDD because it is scanned twice
val nums = fifth.map(_.trim).filter(_.nonEmpty).map(_.toInt).cache()

// max() and min() are actions that reduce within each partition first,
// which avoids shuffling every value to a single key with groupByKey
println("max\t" + nums.max())
println("min\t" + nums.min())
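
Both extremes can also be found in a single pass. The sketch below shows the idea on a plain Scala collection holding the sample numbers above; `maxValue` and `minValue` are illustrative names. On an RDD, a similar single pass could be written with `aggregate`, though that variant is not shown here.

```scala
// All sample numbers from eightteen_a.txt and eightteen_b.txt
val nums = Seq(102, 10, 39, 109, 200, 11, 3, 90, 28, 5, 2, 30, 838, 10005)

// Carry (current max, current min) through the sequence in one pass
val (maxValue, minValue) = nums.foldLeft((Int.MinValue, Int.MaxValue)) {
  case ((hi, lo), n) => (math.max(hi, n), math.min(lo, n))
}

println("max\t" + maxValue)
println("min\t" + minValue)
```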

Top K values, sorted

Requirements

#orderid,userid,payment,productid

Find the top N payment values.

a.txt

1,9819,100,121
2,8918,2000,111
3,2813,1234,22
4,9100,10,1101
5,3210,490,111
6,1298,28,1211
7,1010,281,90
8,1818,9000,20

b.txt

100,3333,10,100
101,9321,1000,293
102,3881,701,20
103,6791,910,30
104,8888,11,39

Expected result (for top N = 5):

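val six = sc.textFile("/tmp/spark/six")
// Skip blank and '#' header lines, keep well-formed 4-field records, and extract the
// payment (third field). top(5) returns the 5 largest values as a local array in
// descending order, so the whole RDD does not need to be sorted.
six.map(_.trim).filter(l => l.nonEmpty && !l.startsWith("#")).map(_.split(",")).filter(_.length == 4).
  map(_(2).trim.toInt).top(5).
  zipWithIndex.foreach { case (payment, i) => println((i + 1) + "\t" + payment) }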
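
The selection can be checked locally against the sample records from a.txt and b.txt. This sketch uses plain Scala collections, and `orders` and `top5` are illustrative names.

```scala
// Sample records in the form orderid,userid,payment,productid
val orders = Seq(
  "1,9819,100,121", "2,8918,2000,111", "3,2813,1234,22", "4,9100,10,1101",
  "5,3210,490,111", "6,1298,28,1211", "7,1010,281,90", "8,1818,9000,20",
  "100,3333,10,100", "101,9321,1000,293", "102,3881,701,20",
  "103,6791,910,30", "104,8888,11,39"
)

// Keep 4-field records, extract the payment, sort descending, take the first five
val top5 = orders.map(_.split(",")).filter(_.length == 4)
  .map(_(2).trim.toInt).sorted(Ordering[Int].reverse).take(5)

top5.zipWithIndex.foreach { case (payment, i) => println((i + 1) + "\t" + payment) }
```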