1. RDD
1.1 Creating an RDD from a local collection
import org.apache.spark.rdd.RDD

// sc is the SparkContext (available by default in spark-shell)
val seq1 = Seq(1001, "liming", 24, 95)   // mixed element types, so the RDD type is RDD[Any]
val seq2 = Seq(1, 2, 3)
// The second argument (number of partitions) is optional
val rdd1: RDD[Any] = sc.parallelize(seq1, 2)
println(rdd1.collect().mkString(","))
// makeRDD is a thin wrapper around parallelize
val rdd2: RDD[Int] = sc.makeRDD(Seq(1, 2, 3, 4), 2)
// Array and List also work
val rdd3 = sc.parallelize(List(1, 2, 3, 4))
val rdd4 = sc.parallelize(Array(1, 2, 3, 4))
rdd3.take(10).foreach(println)
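Since the partition count is optional, it can be useful to check what Spark actually chose. A minimal sketch, reusing rdd1 and rdd3 from above:

// rdd1 explicitly requested 2 partitions
println(rdd1.getNumPartitions)   // 2
// rdd3 gave no count, so parallelize falls back to sc.defaultParallelism
println(rdd3.getNumPartitions)
println(sc.defaultParallelism)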
1.2 Creating an RDD from external data
// Creating an RDD from external data (a file)
val rdd1 = sc.textFile("file_path")
// 1. textFile takes a path (a local file, an HDFS path, or a directory/glob)
// 2. The default number of partitions is determined by the input splits (HDFS blocks)
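textFile also accepts an optional second argument, a minimum partition count, if the block-based default is too coarse. A small sketch, reusing the placeholder path:

// Request at least 4 partitions; Spark may create more, never fewer
val rddMin = sc.textFile("file_path", 4)
println(rddMin.getNumPartitions)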
1.3 Deriving a new RDD from an existing RDD
// Deriving one RDD from another
val rdd1 = sc.parallelize(Seq("zhangsan", "lisi", "wangwu"))
// Applying an operator (a transformation) to an RDD produces a new RDD
val rdd2 = rdd1.map(item => (item, 1))
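Transformations are lazy and can be chained, each step yielding another RDD; nothing executes until an action is called. A sketch continuing from the rdd2 pairs above:

// reduceByKey derives yet another RDD; collect() is the action that triggers execution
val rdd3 = rdd2.reduceByKey(_ + _)
rdd3.collect().foreach(println)   // (zhangsan,1), (lisi,1), (wangwu,1)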
2. DataFrame
2.1 Creating from a Seq
// spark is the SparkSession (available by default in spark-shell)
val df = spark.createDataFrame(Seq(
  ("ming", 20, 15552211521L),
  ("hong", 19, 13287994007L),
  ("zhi", 21, 15552211523L)
)).toDF("name", "age", "phone")
df.show()
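Another common pattern is to go through a case class, which names the columns up front. A sketch assuming spark-shell (the Student case class is illustrative):

// Case class fields become column names and types automatically
case class Student(name: String, age: Int, phone: Long)
import spark.implicits._
val dfCc = Seq(
  Student("ming", 20, 15552211521L),
  Student("hong", 19, 13287994007L)
).toDF()
dfCc.printSchema()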
2.2 Creating from external structured data
2.2.1 Creating by reading a JSON file
Contents of the JSON file (one JSON object per line, which is what Spark's json reader expects by default):
{"name":"ming","age":20,"phone":15552211521}
{"name":"hong", "age":19,"phone":13287994007}
{"name":"zhi", "age":21,"phone":15552211523}
Code:
val dfJson = spark.read.format("json").load("/Users/hadoop/sparkLearn/data/student.json")
dfJson.show()
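Spark infers the schema from the JSON values, so it is worth inspecting the result: inferred columns come back in alphabetical order, and whole numbers default to long. A quick check on the DataFrame above:

dfJson.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)
//  |-- phone: long (nullable = true)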
2.2.2 Creating by reading a CSV file
CSV file:
name,age,phone
ming,20,15552211521
hong,19,13287994007
zhi,21,15552211523
Code:
val dfCsv = spark.read.format("csv").option("header", true).load("/Users/hadoop/sparkLearn/data/student.csv")
dfCsv.show()
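With only the header option, every CSV column is read as string. An optional sketch enabling type inference, using the same assumed file path as above:

// inferSchema makes Spark scan the data and pick concrete types (int, bigint, ...)
val dfTyped = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("/Users/hadoop/sparkLearn/data/student.csv")
dfTyped.printSchema()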