Table of Contents
Spark execution principles and architecture
Spark fault tolerance: the checkpoint mechanism
See the article 彻底理解 spark 的 checkpoint 机制 (reposted).
Reading and writing common file types with SparkContext
Text file input and output
scala> sc.textFile("./README.md")
res6: org.apache.spark.rdd.RDD[String] = ./README.md MapPartitionsRDD[7] at textFile at <console>:25
scala> val readme = sc.textFile("./README.md")
readme: org.apache.spark.rdd.RDD[String] = ./README.md MapPartitionsRDD[9] at textFile at <console>:24
scala> readme.collect()
res7: Array[String] = Array(# Apache Spark, "", Spark is a fast and general cluster...
scala> readme.saveAsTextFile("hdfs://node01:8020/test")
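The same read/transform/write cycle can be sketched as a standalone application rather than a shell session. This is a minimal sketch: the app name, the local master, and both file paths are illustrative, not taken from the transcript above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TextFileExample {
  def main(args: Array[String]): Unit = {
    // Local mode for illustration; on a cluster the master is usually
    // supplied via spark-submit instead of being hard-coded.
    val conf = new SparkConf().setAppName("TextFileExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // textFile returns an RDD[String], one element per line, evaluated lazily.
    val readme = sc.textFile("./README.md")

    // A simple transformation: count the words on each line.
    val wordsPerLine = readme.map(line => line.split("\\s+").length)

    // Actions trigger execution; sum() brings a single number back to the driver.
    println(wordsPerLine.sum())

    // saveAsTextFile writes one part file per partition; the target
    // directory must not already exist or the job fails.
    readme.saveAsTextFile("hdfs://node01:8020/test")

    sc.stop()
  }
}
```

Note that `saveAsTextFile` takes a directory, not a file: Spark writes `part-00000`, `part-00001`, and so on, one per partition.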
JSON file input and output
If each line of a JSON file is a single JSON record, the file can be read as a plain text file and each line parsed with a JSON library. Parsing this way requires importing the relevant packages first.
scala> import org.json4s._
import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.JsonMethods._
scala> import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization
scala> var result = sc.textFile("examples/src/main/resources/people.json")
result: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.json MapPartitionsRDD[7] at textFile at <console>:47
scala> implicit val formats = Serialization.formats(ShortTypeHints(List()))
formats: org.json4s.Formats{val dateFormat: org.json4s.DateFormat; val typeHints: org.json4s.TypeHints} = org.json4s.Serialization$$anon$1@61f2c1da
scala> result.collect()
res3: Array[String] = Array({"name":"Michael"}, {"name":"Andy",
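Continuing in the same shell, each line can then be parsed into a Scala object. This is a sketch, not part of the transcript above: the `Person` case class, its fields, and the use of `DefaultFormats` are assumptions based on the usual contents of `people.json`; `result` is the RDD read earlier.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.Serialization

// Assumed shape of the records; age is optional because some records omit it.
case class Person(name: String, age: Option[Int])

implicit val formats: Formats = DefaultFormats

// Parse each JSON line into a Person.
val people = result.map(line => parse(line).extract[Person])
people.collect().foreach(println)

// Writing back out is the reverse: serialize each record to a JSON
// string and save as text (the output path is illustrative).
val out = people.map(p => Serialization.write(p))
out.saveAsTextFile("hdfs://node01:8020/people-out")
```

Records with unexpected fields or malformed JSON will throw at `extract`; in production code the parse step is typically wrapped in `Try` or `flatMap` so bad lines can be dropped or logged.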