1. Using a Spark SQL statement
1. Sort a DataFrame by one of its columns and add a new index column (indexes do not repeat)
import spark.implicits._
var data2 = Seq(
("0", "2002", "196", "1", "bai"),
("1", "4004", "192", "2", "wang"),
("0", "7007", "95", "2", "wang"),
("0", "4004", "4", "3", "wa"),
("1", "7007", "15", "3", "wa"),
("1", "7007", "15", "3", "wa"),
("1", "0", "14", "4", "zhu"),
("0", "9009", "96", "1", "bai"),
("1", "9009", "126", "1", "bai"),
("1", "9009", "126", "5", "wei"),
("0","10010", "19219", "5", "wei")
).toDF("label", "AMOUNT", "Pclass", "nums", "name")
import org.apache.spark.sql.functions._
data2 = data2.withColumn("AMOUNT", col("AMOUNT").cast("int"))
data2.show()
+-----+------+------+----+----+
|label|AMOUNT|Pclass|nums|name|
+-----+------+------+----+----+
|    0|  2002|   196|   1| bai|
|    1|  4004|   192|   2|wang|
|    0|  7007|    95|   2|wang|
|    0|  4004|     4|   3|  wa|
|    1|  7007|    15|   3|  wa|
|    1|  7007|    15|   3|  wa|
|    1|     0|    14|   4| zhu|
|    0|  9009|    96|   1| bai|
|    1|  9009|   126|   1| bai|
|    1|  9009|   126|   5| wei|
|    0| 10010| 19219|   5| wei|
+-----+------+------+----+----+
Group by a column (name, for example):
- row_number() over(partition by name order by AMOUNT asc) as index
Here name and AMOUNT are column names in the table; asc sorts ascending, desc descending.
data2.createOrReplaceTempView("data")
spark.sql("select *,row_number() over(partition by name order by AMOUNT asc) as index from data").show()
+-----+------+------+----+----+-----+
|label|AMOUNT|Pclass|nums|name|index|
+-----+------+------+----+----+-----+
|    1|  9009|   126|   5| wei|    1|
|    0| 10010| 19219|   5| wei|    2|
|    1|  4004|   192|   2|wang|    1|
|    0|  7007|    95|   2|wang|    2|
|    0|  2002|   196|   1| bai|    1|
|    0|  9009|    96|   1| bai|    2|
|    1|  9009|   126|   1| bai|    3|
|    1|     0|    14|   4| zhu|    1|
|    0|  4004|     4|   3|  wa|    1|
|    1|  7007|    15|   3|  wa|    2|
|    1|  7007|    15|   3|  wa|    3|
+-----+------+------+----+----+-----+
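Swapping asc for desc reverses the ranking. As a quick sketch (reusing the "data" temp view registered above), this ranks each name group from largest to smallest AMOUNT:

```scala
// Descending rank within each name group; index 1 now marks the largest AMOUNT.
spark.sql(
  "select *, row_number() over(partition by name order by AMOUNT desc) as index from data"
).show()
```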
To rank over the whole table instead, drop partition by name:
- row_number() over(order by AMOUNT asc) as index
spark.sql("select *,row_number() over(order by AMOUNT asc) as index from data").show()
+-----+------+------+----+----+-----+
|label|AMOUNT|Pclass|nums|name|index|
+-----+------+------+----+----+-----+
|    1|     0|    14|   4| zhu|    1|
|    0|  2002|   196|   1| bai|    2|
|    1|  4004|   192|   2|wang|    3|
|    0|  4004|     4|   3|  wa|    4|
|    0|  7007|    95|   2|wang|    5|
|    1|  7007|    15|   3|  wa|    6|
|    1|  7007|    15|   3|  wa|    7|
|    0|  9009|    96|   1| bai|    8|
|    1|  9009|   126|   1| bai|    9|
|    1|  9009|   126|   5| wei|   10|
|    0| 10010| 19219|   5| wei|   11|
+-----+------+------+----+----+-----+
Reference: https://blog.csdn.net/qq_35246620/article/details/56290903
2. Operating on the DataFrame directly
Group by a column (name, for example):
/**
 * Define the window: partition the data by name and order each
 * partition by AMOUNT, then number the rows within each partition.
 */
import org.apache.spark.sql.expressions.Window
var dq = data2.withColumn("index", row_number().over(Window.partitionBy("name").orderBy("AMOUNT")))
dq.show()
Result:
+-----+------+------+----+----+-----+
|label|AMOUNT|Pclass|nums|name|index|
+-----+------+------+----+----+-----+
|    1|  9009|   126|   5| wei|    1|
|    0| 10010| 19219|   5| wei|    2|
|    1|  4004|   192|   2|wang|    1|
|    0|  7007|    95|   2|wang|    2|
|    0|  2002|   196|   1| bai|    1|
|    0|  9009|    96|   1| bai|    2|
|    1|  9009|   126|   1| bai|    3|
|    1|     0|    14|   4| zhu|    1|
|    0|  4004|     4|   3|  wa|    1|
|    1|  7007|    15|   3|  wa|    2|
|    1|  7007|    15|   3|  wa|    3|
+-----+------+------+----+----+-----+
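The DataFrame API takes the same asc/desc choice via Column.desc. A small sketch reusing data2 from above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank each name group by AMOUNT descending instead of ascending.
val dqDesc = data2.withColumn(
  "index",
  row_number().over(Window.partitionBy("name").orderBy(col("AMOUNT").desc))
)
dqDesc.show()
```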
To rank over the whole table instead, drop partitionBy("name"):
/**
 * Set the window's partitioning and ordering. Because this is a global
 * sort rather than a per-group sort, no partitioning is specified.
 */
import org.apache.spark.sql.expressions.Window
var dq = data2.withColumn("index", row_number().over(Window.orderBy("AMOUNT")))
dq.show()
+-----+------+------+----+----+-----+
|label|AMOUNT|Pclass|nums|name|index|
+-----+------+------+----+----+-----+
|    1|     0|    14|   4| zhu|    1|
|    0|  2002|   196|   1| bai|    2|
|    1|  4004|   192|   2|wang|    3|
|    0|  4004|     4|   3|  wa|    4|
|    0|  7007|    95|   2|wang|    5|
|    1|  7007|    15|   3|  wa|    6|
|    1|  7007|    15|   3|  wa|    7|
|    0|  9009|    96|   1| bai|    8|
|    1|  9009|   126|   1| bai|    9|
|    1|  9009|   126|   5| wei|   10|
|    0| 10010| 19219|   5| wei|   11|
+-----+------+------+----+----+-----+
The drawback of this approach is that assigning the auto-incrementing index pulls the entire dataset into a single partition, which degrades performance.
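One common workaround (a sketch, not from the original post) is to build the index with RDD.zipWithIndex, which numbers rows partition by partition without shuffling everything into one partition. zipWithIndex follows the current row order, so sort first if the index must follow AMOUNT:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sort first so the index follows AMOUNT; orderBy does a range shuffle
// but keeps multiple partitions, unlike an unpartitioned window.
val sorted = data2.orderBy("AMOUNT")
val indexedRows = sorted.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1)) // 1-based, to match row_number()
}
val schema = StructType(sorted.schema.fields :+ StructField("index", LongType, nullable = false))
val dqFast = spark.createDataFrame(indexedRows, schema)
dqFast.show()
```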
References:
https://www.jianshu.com/p/3e998a12ec3c
https://segmentfault.com/q/1010000018238956?utm_source=tag-newest