SparkSQL Use Cases
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class People(name: String, age: Int)

val spark: SparkSession = SparkSession.builder()
  .appName("hello")
  .master("local[6]")
  .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
import spark.implicits._

val peopleDS: Dataset[People] = spark.sparkContext
  .parallelize(Seq(People("zhangsan", 12), People("lisi", 18), People("zhangsan", 8)))
  .toDS()
1.1 The Imperative API
// Chained, imperative-style transformations: keep 10 < age < 20, project the name column
val teenagers: Dataset[String] = peopleDS.where('age > 10)
  .where('age < 20)
  .select('name)
  .as[String]
teenagers.show()
val ds = spark.createDataset(Seq(People("zhangsan", 15), People("zhangsan", 15), People("lisi", 15)))
// distinct compares entire rows
ds.distinct().show()
// dropDuplicates compares only the given columns
ds.dropDuplicates("age").show()
API | Code
---|---
select | ds.select(expr("sum(age)")).show()
selectExpr | ds.selectExpr("max(age)").show()
withColumn | ds.withColumn("name_jok", 'name === "zhangsan").show()
groupBy | ds.groupBy('name).agg(avg("age")).show()
map | ds.map(person => People(person.name, person.age * 2)).show()
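The snippets in the table assume the standard SQL function and implicits imports are in scope:

import org.apache.spark.sql.functions._  // expr, avg, max, ...
import spark.implicits._                  // the 'colName symbol syntax and the Encoder used by map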
1.2 The SQL Version
import org.apache.spark.rdd.RDD

val peopleRDD: RDD[People] = spark.sparkContext.parallelize(Seq(
  People("zhangshan", 9),
  People("lisi", 15),
  People("wangwu", 25)))
val peopleDS: Dataset[People] = peopleRDD.toDS()
peopleDS.createOrReplaceTempView("people")
val teens: DataFrame = spark.sql("select * from people where age>10 and age<20")
teens.show()
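Note that createOrReplaceTempView registers a view scoped to the current SparkSession. For a view visible across sessions, Spark also offers global temp views, queried through the global_temp database; a minimal sketch:

peopleDS.createGlobalTempView("people_global")
spark.sql("select * from global_temp.people_global where age > 10 and age < 20").show()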
1.3 Creating a DataFrame
val df1: DataFrame = Seq("nihao", "hello").toDF("text")
df1.show()
val df2: DataFrame = Seq(("a", 1), ("b", 1)).toDF("word", "count")
df2.show()
# Creating from an external data source
val df = spark.read
.option("header", true)
.csv("dataset/BeijingPM20100101_20151231.csv")
df.show(10)
df.printSchema()
Step 1: Most computations do not need every column, so start by selecting only the important ones.
df.select('year, 'month, 'PM_Dongsi)
Step 2: Group by selected columns, then aggregate each group with a function.
df.select('year, 'month, 'PM_Dongsi)
.where('PM_Dongsi =!= "NA")
.groupBy('year, 'month)
.count()
.show()
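count() is just one possible aggregation; agg accepts arbitrary functions. A sketch computing the monthly average instead, assuming the PM_Dongsi strings parse as numbers once the NA rows are filtered out:

df.select('year, 'month, 'PM_Dongsi)
  .where('PM_Dongsi =!= "NA")
  .groupBy('year, 'month)
  .agg(avg('PM_Dongsi cast "double") as "avg_pm")
  .show()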
1.4 DataFrame Operations
# Count the records per year where PM_Dongsi != NA
val df = spark.read
.option("header", true)
.csv("dataset/BeijingPM20100101_20151231.csv")
df.select('year,'month,'PM_Dongsi)
.where('PM_Dongsi =!= "NA")
.groupBy('year)
.count().show()
1.5 Schema Conversion
import org.apache.spark.sql.types._

case class Student(name: String, age: Int, gpa: Float)

val schema = StructType(
  Seq(
    StructField("name", StringType),
    StructField("age", IntegerType),
    StructField("gpa", FloatType)
  )
)
val df: DataFrame = spark.read
.schema(schema)
.option("delimiter", "\t")
.csv("dataset/studenttab10k")
val ds: Dataset[Student] = df.as[Student]
ds.show()
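If writing out the schema by hand is not worth it, CSV reads can also infer column types at the cost of an extra pass over the data:

val inferred: DataFrame = spark.read
  .option("delimiter", "\t")
  .option("inferSchema", "true")
  .csv("dataset/studenttab10k")
  .toDF("name", "age", "gpa")
inferred.printSchema()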
1.6 Handling Null Values
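A minimal sketch of the sourceDF assumed below (not from the original text). The na.* functions act on genuine null/NaN values, not on the literal string "NA", so the frame needs real nulls:

val sourceDF = Seq(
  (Some(2015), Some(1), Some(1), Some(0), "30"),
  (Some(2015), None, Some(1), Some(1), "NA"),
  (None, None, None, None, null)
).toDF("year", "month", "day", "hour", "PM_Dongsi")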
sourceDF.na.drop("any").show()
sourceDF.na.drop().show()
sourceDF.na.drop("all").show()
sourceDF.na.drop("any", List("year", "month", "day", "hour")).show()
sourceDF.na.fill(0).show()
sourceDF.na.fill(0, List("year", "month")).show()
sourceDF.na.replace("PM_Dongsi", Map("NA" -> "NaN", "NULL" -> "null")).show()
1.7 Writing to MySQL
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.types._

val schema = StructType(
List(
StructField("name", StringType),
StructField("age", IntegerType),
StructField("gpa", FloatType)
)
)
val df = spark.read
.schema(schema)
.option("delimiter", "\t")
.csv("dataset/studenttab10k")
val resultDF = df.where("age < 30")
resultDF.write
.format("jdbc")
.option("url", "jdbc:mysql://node01:3306/spark02")
.option("dbtable", "student")
.option("user", "spark03")
.option("password", "Spark03!")
.mode(SaveMode.Overwrite)
.save()
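This requires the MySQL JDBC driver (e.g. mysql-connector-java) on the classpath. To verify the write, the table can be read back through the same connector:

val readBack = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://node01:3306/spark02")
  .option("dbtable", "student")
  .option("user", "spark03")
  .option("password", "Spark03!")
  .load()
readBack.show()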
1.8 Multi-level Aggregation (multiAgg)
// pmFinal is assumed: a DataFrame with source, year and pm columns built earlier.
// Average pm per (source, year)
val postAndYearDF = pmFinal.groupBy('source, 'year)
  .agg(avg('pm) as "pm")
// Average pm per source, with a null year column so both sides union cleanly
val postDF = pmFinal.groupBy('source)
  .agg(avg('pm) as "pm")
  .select('source, lit(null) as "year", 'pm)
// Combine both granularities; the per-source rows (null year) sort last
postAndYearDF.union(postDF)
  .sort('source, 'year.asc_nulls_last, 'pm)
  .show()
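The manual union above can usually be replaced by the built-in rollup, which computes both granularities in one pass. Note that rollup also emits a grand-total row with a null source, which the union version does not produce. A sketch against the same pmFinal:

pmFinal.rollup('source, 'year)
  .agg(avg('pm) as "pm")
  .sort('source.asc_nulls_last, 'year.asc_nulls_last, 'pm)
  .show()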
1.9 Window Functions
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val data = Seq(
("Thin", "Cell phone", 6000),
("Normal", "Tablet", 1500),
("Mini", "Tablet", 5500),
("Ultra thin", "Cell phone", 5500),
("Very thin", "Cell phone", 6000),
("Big", "Tablet", 2500),
("Bendable", "Cell phone", 3000),
("Foldable", "Cell phone", 3000),
("Pro", "Tablet", 4500),
("Pro2", "Tablet", 6500)
)
val source = data.toDF("product", "category", "revenue")
val window = Window.partitionBy('category)
  .orderBy('revenue.desc)
// A window-scoped column (shown for illustration, not used below)
val maxPrice: Column = max('revenue) over window
// Top two products by revenue within each category
source.select('product, 'category, 'revenue, row_number() over window as "rank")
  .where('rank <= 2)
  .show()
Type | Function | Description
---|---|---
Ranking | rank | Position of the row within its frame; ties share a rank and leave gaps after them (1, 1, 3, ...)
Ranking | dense_rank | Like rank, but ties leave no gaps (1, 1, 2, ...)
Ranking | row_number | A unique sequential number within the frame, regardless of ties
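A sketch putting the three functions side by side on the window from above, which makes the gap behaviour visible (rank: 1, 1, 3; dense_rank: 1, 1, 2; row_number: 1, 2, 3):

source.select('product, 'category, 'revenue,
    rank() over window as "rank",
    dense_rank() over window as "dense_rank",
    row_number() over window as "row_number")
  .show()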