Spark Programming Basics -- 5.4 Comprehensive Examples

The setup and run commands are as follows:

cd /usr/local/hadoop

./sbin/start-dfs.sh

./bin/hdfs dfs -mkdir -p spark/mycode/rdd/TopN

./bin/hdfs dfs -put /usr/local/spark/mycode/TopN_file1.txt spark/mycode/rdd/TopN
./bin/hdfs dfs -put /usr/local/spark/mycode/TopN_file2.txt spark/mycode/rdd/TopN

./bin/hdfs dfs -ls ./spark/mycode/rdd/TopN

./bin/hdfs dfs -cat spark/mycode/rdd/TopN/TopN_file1.txt
./bin/hdfs dfs -cat spark/mycode/rdd/TopN/TopN_file2.txt

cd ~/sparkapp

/usr/local/sbt/sbt package


/usr/local/spark/bin/spark-submit --class "TopN" ~/sparkapp/target/scala-2.11/simple-project_2.11-1.0.jar
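For the sbt packaging step to work, ~/sparkapp needs a build definition. A minimal sketch of simple.sbt that matches the jar name used above (the exact Scala and Spark versions here are assumptions; match them to the installation under /usr/local/spark):

name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
// The Spark version is an assumption; use the version of your local installation
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"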

 

 

Case 1: Finding the TOP N values

 

//TopN.scala
import org.apache.spark.{SparkConf, SparkContext}
object TopN {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TopN").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    // Read the files uploaded to HDFS above; each line has four comma-separated fields
    val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/spark/mycode/rdd/TopN", 2)
    var num = 0
    lines.filter(line => (line.trim().length > 0) && (line.split(",").length == 4))
      .map(_.split(",")(2))      // keep the third field (the value to rank by)
      .map(x => (x.toInt, ""))   // make (key, "") pairs so sortByKey can be applied
      .sortByKey(false)          // sort keys in descending order
      .map(x => x._1)
      .take(5)                   // take the five largest values
      .foreach(x => {
        num = num + 1
        println(num + "\t" + x)
      })
  }
}
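For reference, each input line is assumed to hold four comma-separated fields (for example orderid,userid,payment,productid), and the program ranks by the third field. With hypothetical input files such as

TopN_file1.txt:
1,1768,50,155
2,1218,600,211
3,2239,788,242
4,3101,28,599
5,4899,290,129

TopN_file2.txt:
6,3110,99,192
7,2103,700,872
8,1154,866,190

the program prints the five largest payments:

1	866
2	788
3	700
4	600
5	290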

Case 2: Finding the maximum and minimum values

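A minimal sketch of one way to compute the maximum and minimum, assuming the input files contain one integer per line (the file path below is an assumption):

//MaxAndMin.scala (a sketch, not the original code)
import org.apache.spark.{SparkConf, SparkContext}
object MaxAndMin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MaxAndMin").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    // Assumed input location: text files with one integer per line
    val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/maxmin", 2)
    val nums = lines.filter(_.trim().length > 0).map(_.trim.toInt)
    // max() and min() are built-in RDD actions on an RDD[Int]
    println("max:\t" + nums.max())
    println("min:\t" + nums.min())
    sc.stop()
  }
}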

Case 3: File sorting

What went wrong here?

After using a friend's code, it worked again:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.HashPartitioner

object FileSort {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("FileSort")
    val sc = new SparkContext(conf)
    val dataFile = "file:///usr/local/spark/mycode/rdd/data"
    val lines = sc.textFile(dataFile, 3)
    var index = 0
    val result = lines.filter(_.trim().length > 0)
      .map(n => (n.trim.toInt, ""))          // wrap each number as a (key, "") pair
      .partitionBy(new HashPartitioner(1))   // merge into a single partition so the counter below is global
      .sortByKey()                           // ascending sort by key
      .map(t => {
        index += 1
        (index, t._1)                        // attach a 1-based rank to each value
      })
    result.saveAsTextFile("file:///usr/local/spark/mycode/rdd/examples/result")
  }
}
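As a hypothetical example, if the data directory holds text files with one integer per line (say 33, 37, 12, 40 in one file and 4, 16, 39, 5 in another), the saved result contains one (rank, value) pair per line:

(1,4)
(2,5)
(3,12)
(4,16)
(5,33)
(6,37)
(7,39)
(8,40)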

 

Case 4: Secondary sort

The concrete steps to implement a secondary sort are:

 * Step 1: Implement a custom sort key based on the Ordered and Serializable traits

 * Step 2: Load the file to be sorted and produce an RDD of <key, value> pairs

 * Step 3: Use sortByKey to perform the secondary sort on the custom key

 * Step 4: Strip off the sort key and keep only the sorted result

 

The code for SecondarySortKey.scala is as follows:

// Order by first, then by second when the first components are equal
class SecondarySortKey(val first: Int, val second: Int) extends Ordered[SecondarySortKey] with Serializable {
  def compare(other: SecondarySortKey): Int = {
    if (this.first - other.first != 0) {
      this.first - other.first
    } else {
      this.second - other.second
    }
  }
}

The code for SecondarySortApp.scala is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object SecondarySortApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SecondarySortApp").setMaster("local")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("file:///usr/local/spark/mycode/rdd/examples/SecondarySortApp_file1.txt", 1)
    // Build (SecondarySortKey, originalLine) pairs from the two space-separated integers on each line
    val pairWithSortKey = lines.map(line => (new SecondarySortKey(line.split(" ")(0).toInt, line.split(" ")(1).toInt), line))
    // Sort in descending order using the custom key
    val sorted = pairWithSortKey.sortByKey(false)
    // Drop the key and keep only the original lines
    val sortedResult = sorted.map(sortedLine => sortedLine._2)
    sortedResult.collect().foreach(println)
  }
}
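As a hypothetical example, if SecondarySortApp_file1.txt contains the lines

5 3
1 6
4 9
8 3
4 7
5 6
3 2

then the output is sorted in descending order by the first number and, for equal first numbers, by the second:

8 3
5 6
5 3
4 9
4 7
3 2
1 6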

 

Case 5: Join operation

Task description: The recommendation field has a well-known open test set, available at http://grouplens.org/datasets/movielens/. The test set contains three files: ratings.dat, users.dat, and movies.dat (see README.txt for details). Write a program that joins ratings.dat and movies.dat to produce the list of movies whose average rating is above 4.0. The data set used here is ml-1m.

/*movies.dat
MovieID::Title::Genres
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
8::Tom and Huck (1995)::Adventure|Children's
9::Sudden Death (1995)::Action
10::GoldenEye (1995)::Action|Adventure|Thriller*/

/*ratings.dat
UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
1::594::4::978302268
1::919::4::978301368
1::595::5::978824268
1::938::4::978301752
1::2398::4::978302281
1::2918::4::978302124
1::1035::5::978301753
1::2791::4::978302188
1::2687::3::978824268*/

import org.apache.spark._
import SparkContext._

object SparkJoin {
  def main(args: Array[String]) {
    if (args.length != 3) {
      println("usage is SparkJoin <rating> <movie> <output>")
      return
    }
    val conf = new SparkConf().setAppName("SparkJoin").setMaster("local")
    val sc = new SparkContext(conf)
    // Read ratings from the first input path
    val textFile = sc.textFile(args(0))
    // Extract (movieid, rating) pairs
    val rating = textFile.map(line => {
      val fields = line.split("::")
      (fields(1).toInt, fields(2).toDouble)
    })
    // Compute (movieid, average rating)
    val movieScores = rating
      .groupByKey()
      .map(data => {
        val avg = data._2.sum / data._2.size
        (data._1, avg)
      })
    // Read movies from the second input path
    val movies = sc.textFile(args(1))
    val movieskey = movies.map(line => {
      val fields = line.split("::")
      (fields(0).toInt, fields(1))    // (MovieID, MovieName)
    }).keyBy(tup => tup._1)

    // Join to get (movieid, averageRating, movieName), keeping only averages above 4.0
    val result = movieScores
      .keyBy(tup => tup._1)
      .join(movieskey)
      .filter(f => f._2._1._2 > 4.0)
      .map(f => (f._1, f._2._1._2, f._2._2._2))

    result.saveAsTextFile(args(2))
  }
}

 

//Join operation
Download the required files from http://files.grouplens.org/datasets/movielens/ml-1m.zip and unzip them into the "~/下载/ml-1m" directory.
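SparkJoin expects three arguments: the ratings file, the movies file, and an output directory. A possible invocation, assuming the data was unzipped as described above and the jar is the one built earlier (the /home/hadoop part of the paths is an assumption; adjust it to your user):

/usr/local/spark/bin/spark-submit --class "SparkJoin" ~/sparkapp/target/scala-2.11/simple-project_2.11-1.0.jar \
  file:///home/hadoop/下载/ml-1m/ratings.dat \
  file:///home/hadoop/下载/ml-1m/movies.dat \
  file:///usr/local/spark/mycode/rdd/joinresult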

 

 
