1. Reading files with SparkSQL
Spark can read files in formats such as CSV, JSON, and Parquet, and there are two ways to do it.
// Option 1: specify the format explicitly
val data = spark.read.format("csv").load("xx.csv")
// Option 2: use the dedicated reader method
val data = spark.read.csv("xx.csv")
Note that when reading a file you need to know whether it has a header row (i.e., the column names) and what the delimiter is; both can be set via option.
val data = spark.read.format("csv")
.option("header", "false")
.option("delimiter", "\t")
.load("xx.csv")
For files without a header, you need to either define a schema, or supply column names (colNames) and build the DataFrame with toDF.
// Option 1: define an explicit schema
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("id", DoubleType),
StructField("code", StringType),
StructField("name", StringType),
StructField("quantity", DoubleType),
StructField("amount", DoubleType),
StructField("data", StringType)
))
val data = spark.read.schema(schema)
.option("header", "false")
.option("delimiter", "\t")
.csv("xx.csv").toDF()
// Option 2: supply column names with toDF
val colNames = Seq("id", "code", "name", "quantity", "amount", "data")
val data = spark.read.format("csv")
.option("header", "false")
.option("delimiter", "\t")
.load("data/example/cc_order_process.txt")
.toDF(colNames:_*)
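To confirm that the options and column names were applied as intended, it helps to inspect the result; a minimal sketch, assuming the data DataFrame built above:
// Print the resulting schema and a few rows
data.printSchema()
data.show(5, truncate = false)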
2. Custom UDFs on SparkSQL DataFrames
- 1. Register a custom function through the SQLContext
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("example").master("local").getOrCreate()
val sqlContext = spark.sqlContext
// Create a temporary view
data.createOrReplaceTempView("data")
// Register the custom function getLen
sqlContext.udf.register("getLen", (name: String) => name.length)
// Apply the custom function
sqlContext.sql("select name,getLen(name) from data limit 10").show()
/*
+-------------+----------------+
| name|UDF:getLen(name)|
+-------------+----------------+
| 脱敏市C区日日到副食品店| 12|
| 脱敏市C区陈锦修副食品店| 12|
| 脱敏市A区张宏萍食杂店| 11|
| 脱敏市E区陈信副食品商店| 12|
| 脱敏市E区晓妹副食品商店| 12|
| 脱敏市C区和益源副食品店| 12|
| 脱敏市F区魏德放食杂店| 11|
| 脱敏市C区林雪日杂店| 10|
|脱敏市E区柯晓燕副食品商店| 13|
|脱敏市E区喜佳丽副食品商店| 13|
+-------------+----------------+
*/
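The same UDF can also be registered directly on the SparkSession, which is the more common style in newer code; a small sketch reusing the getLen function and the data view from above:
// Equivalent registration through the SparkSession
spark.udf.register("getLen", (name: String) => name.length)
spark.sql("select name, getLen(name) from data limit 10").show()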
- 2. Define a custom function with the DataFrame udf helper
import org.apache.spark.sql.functions._
val div_shark_udf = udf { (amount: Double, base: Double) => amount / base }
// Apply the function
val result = data.select("amount").withColumn("A0", div_shark_udf(col("amount"), lit(30)))
result.show(10)
/*
+--------+------------------+
| amount| A0|
+--------+------------------+
| 8714.3|290.47666666666663|
| 15914.9| 530.4966666666667|
| 6000.68|200.02266666666668|
| 7524.94|250.83133333333333|
| 7768.74|258.95799999999997|
| 6385.44|212.84799999999998|
|18346.48| 611.5493333333333|
| 5642.54|188.08466666666666|
| 9881.3|329.37666666666667|
| 737.76| 24.592|
+--------+------------------+
*/
3. spark.sql.functions.lit
When a DataFrame UDF (or any column expression) needs to combine a column with a constant, use lit, which wraps a literal value into a Column.
See the example above for details.
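For a standalone illustration, here is a minimal sketch of lit; the amount column comes from the earlier example and the base value 30 is arbitrary:
import org.apache.spark.sql.functions.{col, lit}
// lit(30) wraps the literal 30 into a Column so it can be combined with col("amount")
val withBase = data.withColumn("base", lit(30))
  .withColumn("ratio", col("amount") / lit(30))
withBase.show(5)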
4. spark.sql.functions.regexp_extract
A DataFrame can use regexp_extract to pull out the text matched by a regular expression (or by one of its capture groups) and then alias the result.
import org.apache.spark.sql.functions._
// Group index 0 returns the whole match; because the leading and trailing .*? are lazy,
// the whole match reduces to capture group 1 ("<city>市<district>区") here.
val result = data.select(regexp_extract(col("name"), ".*?(.*市.*区).*?", 0).alias("area"))
result.show(10)
/*
+-----+
| area|
+-----+
|脱敏市C区|
|脱敏市C区|
|脱敏市A区|
|脱敏市E区|
|脱敏市E区|
|脱敏市C区|
|脱敏市F区|
|脱敏市C区|
|脱敏市E区|
|脱敏市E区|
+-----+
*/
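When only part of the pattern is needed, the third argument selects the capture group; a sketch assuming every name starts with "<city>市<district>区":
// Group 1 = city, group 2 = district
val parts = data.select(
  regexp_extract(col("name"), "(.*市)(.区)", 1).alias("city"),
  regexp_extract(col("name"), "(.*市)(.区)", 2).alias("district"))
parts.show(5)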
5. SparkSQL syntactic sugar
To use SparkSQL's syntactic sugar (column expressions such as $"col"), import sqlContext.implicits._.
val spark = SparkSession.builder().appName("example_2").master("local").getOrCreate()
val sqlContext = spark.sqlContext
import sqlContext.implicits._
// A simple example of the syntactic sugar
// Supported operators include > < >= <= =!= ===
// $"data" refers to the column named data
data.filter($"data" =!= "2017-01-01 00:00:00").show()
6. A SparkSQL filter pitfall
When a DataFrame produced by filter is joined back to the DataFrame it was derived from, column references such as df2("score") can silently resolve to the original, pre-filter DataFrame's column, because both sides share the same column lineage. Re-creating the filtered result with toDF assigns fresh column identities and avoids the problem.
/*
Filter first, then join the tables
*/
val spark = SparkSession.builder().appName("example_2").master("local").getOrCreate()
val sc = spark.sparkContext
import spark.sqlContext.implicits._
val df = sc.parallelize(Array(
(1, "yuwen", "zhangsan", 80),
(2, "yuwen", "lisi", 90),
(3, "shuxue", "lisi", 95),
(4, "shuxue", "zhangsan", 90)
)).toDF("id", "course", "name", "score")
// Keep only rows where course == yuwen
val df2 = df.filter($"course" === "yuwen")
df2.show()
/*
+---+------+--------+-----+
| id|course| name|score|
+---+------+--------+-----+
| 1| yuwen|zhangsan| 80|
| 2| yuwen| lisi| 90|
+---+------+--------+-----+
*/
// Left outer join back to the original df
val joined = df.join(df2, df("id") === df2("id"), "left_outer")
joined.show()
/*
+---+------+--------+-----+----+------+--------+-----+
| id|course| name|score| id|course| name|score|
+---+------+--------+-----+----+------+--------+-----+
| 1| yuwen|zhangsan| 80| 1| yuwen|zhangsan| 80|
| 3|shuxue| lisi| 95|null| null| null| null|
| 4|shuxue|zhangsan| 90|null| null| null| null|
| 2| yuwen| lisi| 90| 2| yuwen| lisi| 90|
+---+------+--------+-----+----+------+--------+-----+
*/
// Select columns from both sides of the join
joined.select(
df("id") as "id",
df("course") as "cs",
df("name") as "nm",
df("score") as "sc",
df2("score") as "scc"
).show()
/*
+---+------+--------+---+---+
| id| cs| nm| sc|scc|
+---+------+--------+---+---+
| 1| yuwen|zhangsan| 80| 80|
| 3|shuxue| lisi| 95| 95|
| 4|shuxue|zhangsan| 90| 90|
| 2| yuwen| lisi| 90| 90|
+---+------+--------+---+---+
*/
Clearly this result is wrong: rows 3 and 4 were filtered out of df2, yet the scc column still shows their scores, because df2("score") resolved to df("score").
The correct result is obtained as follows:
val df2 = df.filter($"course" === "yuwen").toDF(df.columns: _*)
df2.show()
/*
+---+------+--------+-----+
| id|course| name|score|
+---+------+--------+-----+
| 1| yuwen|zhangsan| 80|
| 2| yuwen| lisi| 90|
+---+------+--------+-----+
*/
// Re-join using the fresh df2, then select
val joined2 = df.join(df2, df("id") === df2("id"), "left_outer")
joined2.select(
df("id") as "id",
df("course") as "cs",
df("name") as "nm",
df("score") as "sc",
df2("score") as "scc"
).show()
/*
+---+------+--------+---+----+
| id| cs| nm| sc| scc|
+---+------+--------+---+----+
| 1| yuwen|zhangsan| 80| 80|
| 3|shuxue| lisi| 95|null|
| 4|shuxue|zhangsan| 90|null|
| 2| yuwen| lisi| 90| 90|
+---+------+--------+---+----+
*/
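An alternative to the toDF workaround is to alias the two sides of the join and qualify columns by alias rather than by DataFrame reference; a sketch with the hypothetical alias names a and b:
// Alias both sides so the duplicated columns can be told apart
val joined3 = df.as("a").join(df2.as("b"), $"a.id" === $"b.id", "left_outer")
joined3.select($"a.id", $"a.course" as "cs", $"a.name" as "nm",
  $"a.score" as "sc", $"b.score" as "scc").show()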