SparkSQL: Processing Structured Data


Likewise, SparkSQL's structured-data processing comes with two broad categories of APIs: transformations and actions.
Transformation API List
Operation | Description
select | Corresponds to the SELECT statement in traditional SQL
selectExpr | SELECT with SQL expressions
filter, where | Filter rows by a condition
distinct, dropDuplicates | Remove duplicate records
sort, orderBy | Sort the result set
limit | Limit the number of rows returned
union | Combine multiple result sets
withColumn | Add a new column to the output
withColumnRenamed | Rename an existing column in the output
drop | Drop columns from the output
sample | Sample rows from the result set
randomSplit | Split the result set into several parts
join | Join result sets
groupBy | Group the result set for aggregation
describe | Show summary statistics of the result set
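join and groupBy appear in the table above but are not demonstrated later in this post. A minimal sketch, assuming the same SparkSession setup used by the examples below; the department data here is made up for illustration:
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.{col, count}
val people = spark.sparkContext.parallelize(List(("Jason","IT"),("Tom","IT"),("Mike","HR"))).toDF("Name","Dept");
val depts = spark.sparkContext.parallelize(List(("IT","Building A"),("HR","Building B"))).toDF("Dept","Location");
// Inner join on the shared Dept column
people.join(depts, "Dept").show();
// Aggregate: head count per department
people.groupBy("Dept").agg(count(col("Name")).as("headcount")).show();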
The examples below focus on how to reference output columns. There are five common forms:
Form | Example
"" (plain string) | "columnName"
col | col("columnName")
column | column("columnName")
$ | $"columnName"
' (tick mark) | 'columnName
Column reference forms
val spark = SparkSession.builder().config(conf).getOrCreate();
import org.apache.spark.sql.functions.{col, column}
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.select("Name").show();         // plain string form
df.select(col("Name")).show();    // col() form
df.select(column("Name")).show(); // column() form
Output:
+-----+
| Name|
+-----+
|Jason|
|  Tom|
+-----+

+-----+
| Name|
+-----+
|Jason|
|  Tom|
+-----+

+-----+
| Name|
+-----+
|Jason|
|  Tom|
+-----+
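The remaining two forms, $ and the tick mark, rely on the implicit conversions in spark.implicits._. A minimal sketch against the same df (the output is identical to the tables above):
import spark.implicits._
df.select($"Name").show();  // $-interpolation form
df.select('Name).show();    // tick-mark (Scala Symbol) form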
select(columns)
 val spark = SparkSession.builder().config(conf).getOrCreate();
 val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
 df.select("Name","age").show();
Output:
+-----+---+
| Name|age|
+-----+---+
|Jason| 34|
|  Tom| 20|
+-----+---+
selectExpr(expressions)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.selectExpr("*","(2019-age) as Birth_Year").show();
Output:
+-----+---+----------+
| Name|age|Birth_Year|
+-----+---+----------+
|Jason| 34|      1985|
|  Tom| 20|      1999|
+-----+---+----------+
Calling the built-in SQL functions count and distinct
selectExpr(expressions)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.selectExpr("count(distinct(Name)) as count").show();
Output:
+-----+
|count|
+-----+
|    2|
+-----+
filter(condition), where(condition)
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
df.filter("age < 30").show();
Output:
+----+---+
|Name|age|
+----+---+
| Tom| 20|
+----+---+
val spark = SparkSession.builder().config(conf).getOrCreate();
val df = spark.createDataFrame(List(("Jason",34),("Tom",20))).toDF("Name","age");
//df.filter("age < 30").show();
df.filter("age == 34 and Name == 'Jason'").show();
Output:
+-----+---+
| Name|age|
+-----+---+
|Jason| 34|
+-----+---+
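where is an alias of filter, and the same predicate can also be written with Column expressions instead of a SQL string. A small sketch against the same df:
import spark.implicits._
df.where($"age" === 34 && $"Name" === "Jason").show(); // same result as the filter() call above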
distinct, dropDuplicates
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","Dev"),("Jason","BigData"))).toDF("Name","Job")
df.select("Name").distinct().show();
df.select("Job").distinct().show();
// dropDuplicates() with no arguments behaves like distinct(), so the two calls below print the same tables
df.select("Name").dropDuplicates().show();
df.select("Job").dropDuplicates().show();
Output:
+-----+
| Name|
+-----+
|Jason|
+-----+
+-------+
|    Job|
+-------+
|    Dev|
|BigData|
|    DBA|
+-------+
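Unlike distinct, dropDuplicates can also deduplicate on a subset of columns while keeping whole rows. A sketch against the same df:
df.dropDuplicates("Name").show(); // keep one row per Name; which Job survives is arbitrary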
sort(columns), orderBy(columns)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df = spark.sparkContext.parallelize(List((2,"Jason"),(1,"Tom"),(5,"James"))).toDF("ID","Name");
df.sort("ID").show();              // ascending by default
df.orderBy(col("ID").desc).show(); // descending
Output:
+---+-----+
| ID| Name|
+---+-----+
|  1|  Tom|
|  2|Jason|
|  5|James|
+---+-----+
+---+-----+
| ID| Name|
+---+-----+
|  5|James|
|  2|Jason|
|  1|  Tom|
+---+-----+
limit(n)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Mike"),(4,"Tom"))).toDF("ID","Name");
df.orderBy(col("ID").desc).limit(1).show(); // keep only the row with the largest ID
Output:
+---+----+
| ID|Name|
+---+----+
|  4| Tom|
+---+----+
union(otherDataFrame)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df1 = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Tom"))).toDF("ID","Name");
val df2 = spark.sparkContext.parallelize(List((2,"Mike"),(5,"James"))).toDF("ID","Name");
df1.union(df2).sort(col("ID").desc).show();
Output:
+---+-----+
| ID| Name|
+---+-----+
|  5|James|
|  3|  Tom|
|  2| Mike|
|  1|Jason|
+---+-----+
withColumn(colName, column)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df1 = spark.sparkContext.parallelize(List((1,"Jason"),(3,"Tom"))).toDF("ID","Name");
val df2 = spark.sparkContext.parallelize(List((2,"Mike"),(5,"James"))).toDF("ID","Name");
df1.union(df2).sort(col("ID").desc).withColumn("New Name", col("ID") + col("ID")).show()
Output:
+---+-----+--------+
| ID| Name|New Name|
+---+-----+--------+
|  5|James|      10|
|  3|  Tom|       6|
|  2| Mike|       4|
|  1|Jason|       2|
+---+-----+--------+
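withColumn can also attach a constant-valued column via the lit function. A small sketch reusing df1 from the snippet above; the column name "Country" and its value are made up for illustration:
import org.apache.spark.sql.functions.lit
df1.withColumn("Country", lit("US")).show(); // every row gets the literal value "US"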
withColumnRenamed(existingColName, newColName)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df1 = spark.sparkContext.parallelize(List(("one",1),("Two",2),("Three",3))).toDF("Num","Number");
df1.withColumnRenamed("Num","Number1").withColumnRenamed("Number","Num1").show();
Output:
+-------+----+
|Number1|Num1|
+-------+----+
|    one|   1|
|    Two|   2|
|  Three|   3|
+-------+----+
drop(columnName1, columnName2)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
import org.apache.spark.sql.functions.col
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","BigData"),("Jason","Dev"))).toDF("Name","Job");
df.drop("Name").sort(col("Job").desc).show();
Output:
+-------+
|    Job|
+-------+
|    Dev|
|    DBA|
|BigData|
+-------+
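drop also accepts several column names at once. A sketch; the three-column DataFrame here is invented just so something is left over after dropping two columns:
import spark.implicits._
val df3 = spark.sparkContext.parallelize(List(("Jason","DBA",30),("Tom","Dev",25))).toDF("Name","Job","Age");
df3.drop("Job","Age").show(); // only the Name column remains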
sample(fraction), sample(fraction, seed), sample(withReplacement, fraction, seed)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(("Jason","DBA"),("Jason","BigData"),("Jason","Dev"))).toDF("Name","Job");
df.sample(0.2).show();
Output (sampling is random, so the rows returned vary from run to run):
+-----+---+
| Name|Job|
+-----+---+
|Jason|Dev|
+-----+---+
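The other overloads take an explicit withReplacement flag and a seed. A small sketch against the same df:
df.sample(withReplacement = true, fraction = 0.5, seed = 3L).show(); // fixed seed makes the sample repeatable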
randomSplit(weights)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10)).toDF("Num");
val splitDf = df.randomSplit(Array(0.5,0.3,0.2));
println("Total count:"+df.count);
println("split 0 count:"+splitDf(0).count);
println("split 1 count:"+splitDf(1).count);
println("split 2 count:"+splitDf(2).count);
Output (the exact counts vary from run to run since the split is random):
Total count:10
split 0 count:6
split 1 count:2
split 2 count:2
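randomSplit also has an overload that takes a seed, which makes the split reproducible across runs; a one-line sketch:
val splitDfSeeded = df.randomSplit(Array(0.5, 0.3, 0.2), seed = 42L); // same partitioning every run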
describe(columnNames)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(1,2,3,4,5,6,7,8,9,10)).toDF("Num");
df.describe("Num").show();
Output:
+-------+------------------+
|summary|               Num|
+-------+------------------+
|  count|                10|
|   mean|               5.5|
| stddev|3.0276503540974917|
|    min|                 1|
|    max|                10|
+-------+------------------+
Action API List
Operation | Description
show() | Display the contents of the DataFrame, 20 rows by default
show(numRows) | Display the first numRows rows
show(truncate) | If a String column is longer than 20 characters, truncate it
show(numRows, truncate) | Display numRows rows, truncating String columns longer than 20 characters
head(), first(), head(n), take(n) | Return the first N records
takeAsList(n) | Return the first n rows as a Java List; beware of out-of-memory errors when n is large
collect, collectAsList | Return all rows as an array or Java List; mind the size of the data
count | Return the number of records in the DataFrame
show(truncate)
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._;
val df = spark.sparkContext.parallelize(List(("abcdefghijklmn123123132"))).toDF("Num");
df.show(true);
Output:
+--------------------+
|                 Num|
+--------------------+
|abcdefghijklmn123...|
+--------------------+
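A few of the other actions from the table, sketched against a small DataFrame (the data is made up for illustration):
val spark = SparkSession.builder().config(conf).getOrCreate();
import spark.implicits._
val df = spark.sparkContext.parallelize(List((1,"Jason"),(2,"Tom"),(3,"Mike"))).toDF("ID","Name");
df.head()            // first Row
df.take(2)           // Array with the first two Rows
df.collect()         // Array with all Rows -- avoid on large data sets
println(df.count()); // 3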