Spark in Practice with Scala (Part 1)

得克特

Category: Big Data. Tags: Scala, Spark

Article link: https://blog.csdn.net/weixin_40548136/article/details/102707152

Happy 1024!!!

Basics

The examples below are run in spark-shell.

scala> val lines = sc.textFile("file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/README.md")
lines: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/README.md MapPartitionsRDD[5] at textFile at <console>:24

scala> lines.first()
res6: String = # Apache Spark

scala> val lineSpark = lines.filter(line => line.contains("Spark"))
lineSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at filter at <console>:25

scala> lineSpark.collect()
res7: Array[String] = Array(# Apache Spark, Spark is a fast and general cluster computing system for Big Data. It provides, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, and Spark Streaming for stream processing., You can find the latest Spark documentation, including a programming, ## Building Spark, Spark is built using [Apache Maven](http://maven.apache.org/)., To build Spark and its example programs, run:, You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3)., ["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html)., For general development tips, including info on developing Spark using an IDE,...

scala> lineSpark.count()
res8: Long = 20

scala> val words = lines.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[8] at flatMap at <console>:25

scala> val counts = words.map(word =>(word,1)).reduceByKey{case (x,y) =>x+y}
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[10] at reduceByKey at <console>:25

scala> counts.collect()
res9: Array[(String, Int)] = Array((package,1), (For,3), (Programs,1), (processing.,1), (Because,1), (The,1), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (than,1), (APIs,1), (have,1), (Try,1), (computation,1), (through,1), (several,1), (This,2), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), ("yarn",1), (Once,1), (["Useful,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,1), (the,24), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (R,,1), (given.,1), (if,4), (build,4), (when,1), (be,2), (Tests,1), (Apache,1), (thread,1), (programs,,1), (including,4), (./bin/run-example,2), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (D...

scala> counts.take(3).foreach(println) // take(3) returns only the first 3 elements, while collect() returns every element
(package,1)
(For,3)
(Programs,1)

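Passing functions: as in Python, when we pass a method or a field of an object, the function we pass carries a reference to the whole object. In Python this is easy to see because of the explicit self. In Scala it is less obvious: if a class has a field val query: String and one of its methods calls rdd.map(x => x.split(query)), then query actually means this.query, so Spark has to serialize and ship the entire this. Copying the field into a local variable first (val query = this.query) and using that local variable in the function is safe.
A NotSerializableException usually means that we passed a function or field belonging to a class that is not serializable.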
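The capture problem can be reproduced without a cluster. The sketch below is only an illustration: SearchFunctions, unsafeSplitter, safeSplitter and isSerializable are hypothetical names, and it assumes Scala 2.12 or later and that closures are checked with plain Java serialization, which is what Spark does before shipping them to executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical class for illustration; note that it does NOT extend Serializable.
class SearchFunctions(val query: String) {

  // `query` here really means `this.query`, so the function captures `this`.
  def unsafeSplitter: String => Array[String] =
    line => line.split(query)

  // Copy the field into a local variable first; the function then captures
  // only that String, not the enclosing SearchFunctions instance.
  def safeSplitter: String => Array[String] = {
    val localQuery = this.query
    line => line.split(localQuery)
  }
}

object ClosureCaptureDemo {
  // Returns true if `f` survives Java serialization, false if a
  // NotSerializableException is thrown while writing it.
  def isSerializable(f: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(f)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val functions = new SearchFunctions(" ")
    println(isSerializable(functions.unsafeSplitter))
    println(isSerializable(functions.safeSplitter))
  }
}
```

With an RDD, the safe variant would be used as rdd.map(functions.safeSplitter), so that only the String travels with the task.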

scala> val input = sc.parallelize(List(1,2,3,4))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> val result = input.map(x=>x*x)
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:25

scala> println(result.collect().mkString(","))
1,4,9,16

scala> result.collect()
res12: Array[Int] = Array(1, 4, 9, 16)

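flatMap is similar to map, but the function passed to it returns a sequence of values for each input element, and the resulting RDD contains the elements of all those sequences rather than the sequences themselves. I think of it as flattening the result of map.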

scala> val lines = sc.parallelize(List("hello world","hi"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[13] at parallelize at <console>:24

scala> val words = lines.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at flatMap at <console>:25

scala> words.first()
res13: String = hello
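To see the flattening side by side, here is a small sketch on plain Scala collections. This is an assumption made so that it runs without a SparkContext: List.map and List.flatMap are used as stand-ins for the RDD methods of the same name, and FlatMapDemo is a hypothetical name.

```scala
object FlatMapDemo {
  val lines: List[String] = List("hello world", "hi")

  // map keeps one output element per input line: each element is an array of words.
  val mapped: List[Array[String]] = lines.map(line => line.split(" "))

  // flatMap concatenates those arrays into a single flat list of words.
  val flattened: List[String] = lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    println(mapped.map(_.mkString("[", ",", "]"))) // List([hello,world], [hi])
    println(flattened)                             // List(hello, world, hi)
  }
}
```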

Aggregation

scala> val input = sc.parallelize(List(1,2,3,4))

scala> val result = input.aggregate((0,0))((acc,value)=>(acc._1+value,acc._2+1),(acc1,acc2)=>(acc1._1+acc2._1,acc1._2+acc2._2))
result: (Int, Int) = (10,4)

scala> val avg = result._1 / result._2.toDouble
avg: Double = 2.5
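The two functions passed to aggregate play different roles: the first folds each element of a partition into that partition's accumulator, and the second merges the accumulators of different partitions. A rough sketch of that flow on plain Scala lists follows. The split into two partitions and the names AggregateDemo, seqOp and combOp are assumptions for illustration; the real partitioning depends on the cluster.

```scala
object AggregateDemo {
  type Acc = (Int, Int) // (running sum, running count)

  val zero: Acc = (0, 0)

  // Folds one element into a partition-local accumulator.
  def seqOp(acc: Acc, value: Int): Acc = (acc._1 + value, acc._2 + 1)

  // Merges the accumulators of two partitions.
  def combOp(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // Hypothetical layout: the four elements spread over two partitions.
  val partitions: List[List[Int]] = List(List(1, 2), List(3, 4))

  // Step 1: every partition is folded independently, starting from zero.
  val perPartition: List[Acc] = partitions.map(_.foldLeft(zero)(seqOp))

  // Step 2: the per-partition results are merged, again starting from zero.
  val result: Acc = perPartition.foldLeft(zero)(combOp)

  val avg: Double = result._1 / result._2.toDouble

  def main(args: Array[String]): Unit = {
    println(perPartition) // List((3,2), (7,2))
    println(result)       // (10,4)
    println(avg)          // 2.5
  }
}
```

Because the zero value is used once per partition and once more in the merge, it should be a neutral element such as (0,0) here.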

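Note: in Scala, implicit conversions turn an RDD into an RDD with type-specific functions. These conversions wrap an RDD in classes such as DoubleRDDFunctions and PairRDDFunctions, which gives us extra functions such as mean() and variance(). Before Spark 1.3 they had to be enabled with import org.apache.spark.SparkContext._; from Spark 1.3 on they are picked up automatically, so the import is no longer required.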
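The mechanism itself is ordinary Scala and can be sketched without Spark. In the example below, DoubleSeqFunctions and seqToDoubleSeqFunctions are hypothetical stand-ins for DoubleRDDFunctions and its conversion, and a Seq[Double] stands in for an RDD[Double]. It is not Spark's actual implementation, only the same pattern.

```scala
import scala.language.implicitConversions

object ImplicitWrapperDemo {
  // Hypothetical wrapper class that adds numeric helpers to a Seq[Double].
  class DoubleSeqFunctions(self: Seq[Double]) {
    def mean(): Double = self.sum / self.size

    // Population variance: the average squared distance from the mean.
    def variance(): Double = {
      val m = mean()
      self.map(x => (x - m) * (x - m)).sum / self.size
    }
  }

  // The implicit conversion: when a Seq[Double] has no method called mean(),
  // the compiler wraps it in DoubleSeqFunctions and calls the method there.
  implicit def seqToDoubleSeqFunctions(s: Seq[Double]): DoubleSeqFunctions =
    new DoubleSeqFunctions(s)

  val data: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0)

  def main(args: Array[String]): Unit = {
    println(data.mean())     // 2.5
    println(data.variance()) // 1.25
  }
}
```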

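Persistence: why do we need it? Spark RDDs are lazily evaluated. If we want to use the same RDD several times and simply call actions on it, Spark recomputes the RDD and all of its dependencies for every action. To avoid this, we can compute it once and persist it for reuse. In the snippet below, without persistence, result is computed twice: once for count() and once for collect().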

 val result = input.map(x=>x*x)
 println(result.count())
 println(result.collect().mkString(","))
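One way to observe the recomputation is to count how often the map function runs. The sketch below is a standalone application rather than a spark-shell session. The object name RecomputeDemo, the helper evaluations and the local[2] master are assumptions for illustration, and the counts are only reliable on a local run without task retries, since accumulator updates made inside transformations can be applied more than once.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RecomputeDemo {
  // Runs count() and collect() on a squared RDD and returns how many times
  // the map function was evaluated in total.
  def evaluations(sc: SparkContext, persist: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map { x => calls.add(1L); x * x }

    if (persist) result.persist(StorageLevel.MEMORY_ONLY)

    result.count()   // first action: computes the RDD
    result.collect() // second action: recomputes it unless it was persisted

    if (persist) result.unpersist()
    calls.value
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("recompute-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"without persist: ${evaluations(sc, persist = false)}")
      println(s"with persist:    ${evaluations(sc, persist = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

For four input elements, I would expect the first line to report 8 evaluations and the second to report 4.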

The difference between persist and cache: cache() simply calls persist(StorageLevel.MEMORY_ONLY), whereas persist() lets us choose the storage level.

scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> val result = input.map(x=>x*x)
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26

scala> result.persist(StorageLevel.DISK_ONLY)
res5: result.type = MapPartitionsRDD[1] at map at <console>:26

scala> println(result.count())
4

scala> println(result.collect.mkString(","))
1,4,9,16

scala> result.unpersist()
res8: result.type = MapPartitionsRDD[1] at map at <console>:26

scala> input.cache()
res2: input.type = ParallelCollectionRDD[0] at parallelize at <console>:24
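The storage level that each call assigns can be checked with getStorageLevel. The following sketch is a standalone application. The object name StorageLevelDemo, the helper levels and the local[2] master are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  // Returns the storage level after cache(), after persist(DISK_ONLY),
  // and after unpersist(), in that order.
  def levels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val input = sc.parallelize(List(1, 2, 3, 4))
    input.cache()
    val afterCache = input.getStorageLevel

    val result = input.map(x => x * x).persist(StorageLevel.DISK_ONLY)
    val afterPersist = result.getStorageLevel

    result.unpersist()
    val afterUnpersist = result.getStorageLevel

    input.unpersist()
    (afterCache, afterPersist, afterUnpersist)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("storage-level-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (afterCache, afterPersist, afterUnpersist) = levels(sc)
      println(afterCache == StorageLevel.MEMORY_ONLY)
      println(afterPersist == StorageLevel.DISK_ONLY)
      println(afterUnpersist == StorageLevel.NONE)
    } finally {
      sc.stop()
    }
  }
}
```

Note that persist and cache only mark the RDD; nothing is stored until the first action runs.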