Creating a SparkSession
Method 1: via SparkSession's inner class Builder
import org.apache.spark.sql
val spark = new sql.SparkSession.Builder().master("local[*]").appName("test").getOrCreate()
Method 2: via SparkSession's builder() method
import org.apache.spark.sql.SparkSession
val spark2 = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
Creating a DataFrame
1. Via the toDF() method
2. By reading a file
3. Derived from another DataFrame
4. Via createDataFrame (not recommended)
1. Via the toDF method
* collection.toDF
* rdd.toDF
import org.apache.spark.sql.SparkSession
val spark2 = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
//import spark's implicit conversions
import spark2.implicits._
val list = List( Person("lisi",20,"man"), Person("wagnwu",30,"woman"),Person("zhaoliu",45,"man"),Person("hanmeimei",18,"woman") )
val df = list.toDF()
//Column names default to the case class Person's field names; if the List holds tuples they become _1, _2, _3; if the List holds Strings the single column is named value
df.show
//TODO The parameterized toDF can rename columns; the number of names must match the number of columns.
list.toDF("NAME","AGE","SEX").show
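The default-column-name rules above can be seen side by side. A minimal self-contained sketch (it rebuilds its own session and Person case class rather than assuming the ones defined elsewhere in these notes):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int, sex: String)

val spark2 = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
import spark2.implicits._

// Case classes: columns take the field names (name, age, sex).
List(Person("lisi", 20, "man")).toDF().show()

// Tuples: columns default to _1, _2, _3.
List(("lisi", 20, "man")).toDF().show()

// Plain Strings: a single column named value.
List("lisi", "wangwu").toDF().show()
```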
//get the SparkContext directly from the SparkSession
val rdd = spark2.sparkContext.parallelize(list)
rdd.toDF().show
//the Person case class used in the examples above
case class Person(name:String,age:Int,sex:String)
2. By reading a file
val df = spark.read.json("filename") //expects JSON Lines by default: one JSON object per line
df.printSchema() //print the schema
df.show()
3. Derived from another DataFrame
val list = List( Person("lisi",20,"man"), Person("wagnwu",30,"woman"),Person("zhaoliu",45,"man"),Person("hanmeimei",18,"woman") )
val df = list.toDF()
val df2 = df.where("sex='man'")          //derived DataFrame: filter first, while sex is still present
val df3 = df2.selectExpr("name","age")   //then prune down to the wanted columns
4. Via createDataFrame (not recommended)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
//all row data of the DataFrame
val rdd = spark.sparkContext.parallelize(List( Row("lisi",20,"shenzhen"),Row("wangwu",33,"beijing"),Row("zhaoliu",43,"beijing") ))
//specify the DataFrame's column metadata
val fields = Array[StructField](StructField("name",StringType),StructField("age",IntegerType),StructField("address",StringType))
val schema = StructType(fields)
val df = spark.createDataFrame(rdd, schema)
Creating a DataSet
1. Via the toDS() method
2. By reading a file
3. Derived from another DataSet
4. Via createDataset (not recommended)
1. Via the toDS method
* collection.toDS
* rdd.toDS
val list = List( Person("lisi",20,"man"), Person("wagnwu",30,"woman"),Person("zhaoliu",45,"man"),Person("hanmeimei",18,"woman") )
val ds = list.toDS()
ds.printSchema()
ds.show
//to rename the columns, convert to a DataFrame first with toDF
val list2 = List( ("lisi",20,"man"), ("wagnwu",30,"woman"),("zhaoliu",45,"man"),("hanmeimei",18,"woman") )
list2.toDS().toDF("NAME","AGE","SEX").show
2. By reading a file
spark.read.textFile("filename") //returns a Dataset[String], one element per line
3. Derived from another DataSet
val ds = spark2.read.textFile("datas/wc.txt")
val ds2 = ds.flatMap(x=> x.split(" "))
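The flatMap above leaves a Dataset of single words; one way to finish the word count is groupByKey plus count. A self-contained sketch (assuming datas/wc.txt holds space-separated words, as in the section above):

```scala
import org.apache.spark.sql.SparkSession

val spark2 = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
import spark2.implicits._

val ds = spark2.read.textFile("datas/wc.txt") // Dataset[String], one element per line
val ds2 = ds.flatMap(x => x.split(" "))       // Dataset[String], one element per word
// Group identical words together, then count occurrences per key;
// the result is a Dataset[(String, Long)] of (word, count) pairs.
val counts = ds2.groupByKey(identity).count()
counts.show()
```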
4. Via createDataset (not recommended)
val list = List( Person("lisi",20,"man"), Person("wagnwu",30,"woman"),Person("zhaoliu",45,"man"),Person("hanmeimei",18,"woman") )
val ds = spark2.createDataset(list)
Spark SQL's two programming styles: 1. imperative 2. declarative
/**
 * Imperative: operate on data with methods [for reference]
 * Column pruning: selectExpr
 * Filtering: where/filter (the two methods are equivalent)
 * Deduplication:
 *   distinct: drops a row when all of its columns match another row
 *   dropDuplicates: drops a row when the specified columns match another row
 */
val list = List(
("lisi",20,"man","2020-06-14 11:25:25"),
("wagnwu",30,"woman","2020-06-17 11:25:25"),
("wagnwu",30,"woman","2020-06-17 11:25:25"),
("wagnwu",99,"woman","2020-06-22 11:25:25"),
("zhaoliu",45,"man","2020-06-15 11:25:25"),
("hanmeimei",18,"woman","2020-06-16 11:25:25") )
val df = list.toDF("name","age","sex","date_str")
//TODO column pruning: selectExpr(colName, function(..) as alias, ...)
//import org.apache.spark.sql.functions._
//val df2 = df.select('name,'age,date_format('date_str, "yyyy-MM-dd"))
//df2.show
df.selectExpr("name","age","date_format(date_str,'yyyy-MM-dd') date_str").show
//TODO filtering
df.where("age>=25").show
df.filter("age>=25").show
//TODO deduplication
df.distinct().show
df.dropDuplicates("name").show
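dropDuplicates also accepts several column names at once. A small self-contained sketch (it rebuilds a reduced version of the sample data rather than reusing the df above) of the distinct vs dropDuplicates difference:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("test").getOrCreate()
import spark.implicits._

val df = List(
  ("wagnwu", 30, "woman"),
  ("wagnwu", 99, "woman"),
  ("lisi", 20, "man")
).toDF("name", "age", "sex")

// distinct: removes nothing here, because the ages differ.
df.distinct().show()

// dropDuplicates on (name, sex): the two wagnwu/woman rows collapse to one;
// which age survives is not guaranteed.
df.dropDuplicates("name", "sex").show()
```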
/**
 * Declarative: operate on data with SQL
 * 1. Register a table:
 *    createOrReplaceTempView: creates a temp view, replacing it if it already exists
 * 2. Run SQL: spark.sql("sql statement")
 */
val list = List(
("lisi",20,"man","2020-06-14 11:25:25"),
("wagnwu",30,"woman","2020-06-17 11:25:25"),
("wagnwu",30,"woman","2020-06-17 11:25:25"),
("wagnwu",99,"woman","2020-06-22 11:25:25"),
("zhaoliu",45,"man","2020-06-15 11:25:25"),
("hanmeimei",18,"woman","2020-06-16 11:25:25") )
val df = list.toDF("name","age","sex","date_str")
//select ... from tableName where ...
//TODO register the table
df.createOrReplaceTempView("person")
//TODO run the SQL
spark.sql(
"""
|select
| name,age,date_format(date_str,'yyyy-MM-dd') date_str
|from person where age>=25
""".stripMargin).show