Happy 1024 (Programmers' Day)!!!
Basics
The examples below are run in spark-shell.
scala> val lines = sc.textFile("file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/README.md")
lines: org.apache.spark.rdd.RDD[String] = file:///home/hadoop/software/spark/spark-2.4.4-bin-hadoop2.7/README.md MapPartitionsRDD[5] at textFile at <console>:24
scala> lines.first()
res6: String = # Apache Spark
scala> val lineSpark = lines.filter(line => line.contains("Spark"))
lineSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at filter at <console>:25
scala> lineSpark.collect()
res7: Array[String] = Array(# Apache Spark, Spark is a fast and general cluster computing system for Big Data. It provides, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, and Spark Streaming for stream processing., You can find the latest Spark documentation, including a programming, ## Building Spark, Spark is built using [Apache Maven](http://maven.apache.org/)., To build Spark and its example programs, run:, You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3)., ["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html)., For general development tips, including info on developing Spark using an IDE,...
scala> lineSpark.count()
res8: Long = 20
scala> val words = lines.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[8] at flatMap at <console>:25
scala> val counts = words.map(word =>(word,1)).reduceByKey{case (x,y) =>x+y}
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[10] at reduceByKey at <console>:25
scala> counts.collect()
res9: Array[(String, Int)] = Array((package,1), (For,3), (Programs,1), (processing.,1), (Because,1), (The,1), (page](http://spark.apache.org/documentation.html).,1), (cluster.,1), (its,1), ([run,1), (than,1), (APIs,1), (have,1), (Try,1), (computation,1), (through,1), (several,1), (This,2), (graph,1), (Hive,2), (storage,1), (["Specifying,1), (To,2), ("yarn",1), (Once,1), (["Useful,1), (prefer,1), (SparkPi,2), (engine,1), (version,1), (file,1), (documentation,,1), (processing,,1), (the,24), (are,1), (systems.,1), (params,1), (not,1), (different,1), (refer,2), (Interactive,2), (R,,1), (given.,1), (if,4), (build,4), (when,1), (be,2), (Tests,1), (Apache,1), (thread,1), (programs,,1), (including,4), (./bin/run-example,2), (Spark.,1), (package.,1), (1000).count(),1), (Versions,1), (HDFS,1), (D...
scala> counts.take(3).foreach(println) // this shows the difference between take and collect: take(3) returns only the first 3 elements
(package,1)
(For,3)
(Programs,1)
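A small sketch of the same contrast on a tiny RDD, assuming the spark-shell session above where sc is already defined; the name nums is made up for illustration. take(n) brings only n elements back to the driver, while collect() brings back the whole RDD, so collect() is only reasonable when the data fits in driver memory.

```scala
// Assumes spark-shell, where `sc` already exists; `nums` is an illustrative name.
val nums = sc.parallelize(1 to 100)

// take(3) fetches just the first three elements to the driver.
val firstThree = nums.take(3)

// collect() materializes every element of the RDD on the driver.
val everything = nums.collect()

println(s"take(3) returned ${firstThree.length} elements")
println(s"collect() returned ${everything.length} elements")
```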
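Passing functions works much as it does in Python: when we pass a method of an object, or a function that refers to one of its fields, the closure holds a reference to the whole object. In Python the self reference makes this obvious; in Scala it is easier to miss. For example, suppose a class has a field val query: String and one of its methods calls rdd.map(x => x.split(query)). Here "query" really means "this.query", so the whole of "this" has to be serialized and shipped with the task. Copying the field into a local variable first, with val query = this.query, and using only that local variable inside the closure is safe.
A NotSerializableException usually means that we passed a function or field that belongs to a non-serializable class.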
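The risky and safe variants described above might look like the sketch below. SearchFunctions, splitByField and splitByLocal are hypothetical names chosen for illustration, and the class is deliberately left non-serializable. The code assumes spark-shell, where sc is defined.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical class for illustration; note that it does NOT extend Serializable.
class SearchFunctions(val query: String) {

  // Risky: `query` is really `this.query`, so the closure captures `this`
  // and Spark has to serialize the whole SearchFunctions instance.
  def splitByField(rdd: RDD[String]): RDD[Array[String]] =
    rdd.map(x => x.split(query))

  // Safe: copy the field into a local val first, so the closure
  // captures only a String.
  def splitByLocal(rdd: RDD[String]): RDD[Array[String]] = {
    val query = this.query
    rdd.map(x => x.split(query))
  }
}

val search = new SearchFunctions(" ")
val pieces = search.splitByLocal(sc.parallelize(List("hello world", "hi"))).collect()
pieces.foreach(words => println(words.mkString(",")))
```

Calling splitByField on the same RDD would be expected to fail with a "Task not serializable" error caused by a NotSerializableException, because the SearchFunctions instance itself cannot be serialized.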
scala> val input = sc.parallelize(List(1,2,3,4))
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24
scala> val result = input.map(x=>x*x)
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[12] at map at <console>:25
scala> println(result.collect().mkString(","))
1,4,9,16
scala> result.collect()
res12: Array[Int] = Array(1, 4, 9, 16)
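flatMap is similar to map, but the function passed to it returns a sequence of values for each input element instead of a single value, and the resulting RDD contains the elements of all those sequences. I think of it as flattening the result of map.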
scala> val lines = sc.parallelize(List("hello world","hi"))
lines: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[13] at parallelize at <console>:24
scala> val words = lines.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[14] at flatMap at <console>:25
scala> words.first()
res13: String = hello
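To make the flattening visible, this sketch runs map and flatMap with the same function on the same input. It assumes spark-shell (sc exists); the names phrases, nested and flat are illustrative.

```scala
// Assumes spark-shell (`sc` exists); `phrases` is an illustrative name.
val phrases = sc.parallelize(List("hello world", "hi"))

// map: one output per input, so the result is an RDD of arrays.
val nested = phrases.map(line => line.split(" "))

// flatMap: the arrays are flattened into a single RDD of words.
val flat = phrases.flatMap(line => line.split(" "))

println(nested.count())               // 2: one array per input line
println(flat.count())                 // 3: hello, world, hi
println(flat.collect().mkString(",")) // hello,world,hi
```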
Aggregation
scala> val input = sc.parallelize(List(1,2,3,4))
scala> val result = input.aggregate((0,0))((acc,value)=>(acc._1+value,acc._2+1),(acc1,acc2)=>(acc1._1+acc2._1,acc1._2+acc2._2))
result: (Int, Int) = (10,4)
scala> val avg = result._1 / result._2.toDouble
avg: Double = 2.5
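The aggregate call above packs three things into one line: a zero value, a function that folds one element into a partition's accumulator, and a function that merges the accumulators of two partitions. The sketch below spells them out with illustrative names (zero, seqOp, combOp) and requests two partitions explicitly so that the merge function has work to do. It assumes spark-shell (sc exists).

```scala
// Assumes spark-shell (`sc` exists). Two partitions are requested explicitly.
val numbers = sc.parallelize(List(1, 2, 3, 4), 2)

// Accumulator type: (running sum, running count).
val zero = (0, 0)

// seqOp: fold one element into a partition's accumulator.
val seqOp = (acc: (Int, Int), value: Int) => (acc._1 + value, acc._2 + 1)

// combOp: merge the accumulators of two partitions.
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)

val (sum, count) = numbers.aggregate(zero)(seqOp, combOp)
val average = sum / count.toDouble

println(s"sum=$sum count=$count average=$average") // sum=10 count=4 average=2.5
```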
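Note: in Scala, converting an RDD into one with type-specific functions is handled automatically by implicit conversions, which are brought into scope with import org.apache.spark.SparkContext._ (since Spark 1.3 they are picked up without the import). These implicit conversions wrap an RDD in classes such as DoubleRDDFunctions and PairRDDFunctions, which gives us extra functions such as mean() and variance().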
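A brief sketch of these conversions at work, assuming spark-shell (sc exists); the names doubles and pairs are illustrative. For the values 1, 2, 3, 4 the mean is 2.5 and the population variance is 1.25.

```scala
// Assumes spark-shell (`sc` exists). On Spark versions before 1.3 this import
// is what brings the implicit conversions into scope; later it is harmless.
import org.apache.spark.SparkContext._

val doubles = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))

// mean() and variance() are not defined on RDD itself; they come from
// DoubleRDDFunctions through an implicit conversion.
val mean = doubles.mean()
val variance = doubles.variance()

// reduceByKey likewise comes from PairRDDFunctions.
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.reduceByKey(_ + _).collect().toMap

println(s"mean=$mean variance=$variance sums=$sums")
```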
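Persistence: why do we need it? Spark RDDs are lazily evaluated. If we want to use the same RDD several times and simply call actions on it, Spark recomputes the RDD and all of its dependencies every time. Instead, we can compute it once and persist it for reuse. For example, each of the two actions below recomputes result: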
val result = input.map(x=>x*x)
println(result.count())
println(result.collect().mkString(","))
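One way to observe the recomputation is to count how often the map function runs. The sketch below uses a long accumulator for that, assuming spark-shell (sc exists); the names calls, source and squares are illustrative. Accumulator updates made inside transformations can be applied more than once if a task is retried, so the counts in the comments are what one would expect from a clean local run, not a guarantee.

```scala
// Assumes spark-shell (`sc` exists). `calls` counts how many times the
// map function actually runs.
val calls = sc.longAccumulator("calls")
val source = sc.parallelize(List(1, 2, 3, 4))

val squares = source.map { x => calls.add(1); x * x }

squares.count()    // first action: runs the map over all 4 elements
squares.collect()  // second action: runs the map again
val withoutPersist: Long = calls.value // expected 8 when no task is retried

calls.reset()
squares.persist()  // default storage level, MEMORY_ONLY
squares.count()    // runs the map and stores the result
squares.collect()  // served from the persisted data
val withPersist: Long = calls.value    // expected 4 when no task is retried

println(s"without persist: $withoutPersist, with persist: $withPersist")
```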
The difference between persist and cache: cache() simply calls persist(StorageLevel.MEMORY_ONLY), whereas persist() accepts any storage level.
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel
scala> val result = input.map(x=>x*x)
result: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:26
scala> result.persist(StorageLevel.DISK_ONLY)
res5: result.type = MapPartitionsRDD[1] at map at <console>:26
scala> println(result.count())
4
scala> println(result.collect.mkString(","))
1,4,9,16
scala> result.unpersist()
res8: result.type = MapPartitionsRDD[1] at map at <console>:26
scala> input.cache()
res2: input.type = ParallelCollectionRDD[0] at parallelize at <console>:24
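The storage level assigned by each call can be checked with getStorageLevel. This is a minimal sketch assuming spark-shell (sc exists); the name data is illustrative.

```scala
import org.apache.spark.storage.StorageLevel

// Assumes spark-shell (`sc` exists); `data` is an illustrative name.
val data = sc.parallelize(List(1, 2, 3, 4))

// No storage level has been assigned yet.
println(data.getStorageLevel == StorageLevel.NONE)        // true

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
data.cache()
println(data.getStorageLevel == StorageLevel.MEMORY_ONLY) // true

// To switch levels, unpersist first; Spark refuses to change the level of
// an RDD that already has one assigned.
data.unpersist()
data.persist(StorageLevel.DISK_ONLY)
println(data.getStorageLevel == StorageLevel.DISK_ONLY)   // true
```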
These are the common operations of the RDD model; the next post focuses on RDDs of key-value pairs.