The problem with anonymous functions on Dataset
spark.sql("select * from user").as[User].filter($"id" > 5).select("id").show()
spark.sql("select * from user").as[User].filter(_.id > 5).select("id").show()
Are these two statements the same?
The results are certainly identical, but which one is more efficient?
package org.apache.spark.scheduler.local

import org.apache.spark.sql.SparkSession

case class User(id: Int, age: Int, name: String)

object TestJdbcspark {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._
spark.read
.format("jdbc")
.option("driver", "com.mysql.jdbc.Driver")
.option("url", "jdbc:mysql://localhost:3306/test?serverTimezone=UTC")
.option("dbtable", "user")
.option("user", "root")
.option("password", "root")
.load().createOrReplaceTempView("user")
spark.sql("select * from user").as[User].
// general_log captured with filter($"id" > 5): both the filter and the
// column pruning are pushed down to MySQL.
/*mysql> SELECT * from mysql.general_log ORDER BY event_time DESC limit 4;
+----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
| event_time | user_host | thread_id | server_id | command_type | argument |
+----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
| 2019-04-28 20:06:22.390209 | root[root] @ localhost [::1] | 18 | 1 | Query | SELECT * from mysql.general_log ORDER BY event_time DESC limit 4 |
| 2019-04-28 20:04:42.167509 | root[root] @ localhost [::1] | 18 | 1 | Query | SELECT * from mysql.general_log ORDER BY event_time DESC limit 4 |
| 2019-04-28 20:04:38.370703 | root[root] @ localhost [127.0.0.1] | 40 | 1 | Quit | |
| 2019-04-28 20:04:38.363730 | root[root] @ localhost [127.0.0.1] | 40 | 1 | Query | SELECT `id` FROM user WHERE (`id` IS NOT NULL) AND (`id` > 5) |
+----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+*/
filter($"id" > 5)
// general_log captured with filter(_.id > 5) instead: MySQL receives no
// WHERE clause and must return every column of every row.
/*mysql> SELECT * from mysql.general_log ORDER BY event_time DESC limit 4;
+----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
| event_time | user_host | thread_id | server_id | command_type | argument |
+----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
| 2019-04-28 20:03:53.074452 | root[root] @ localhost [::1] | 18 | 1 | Query | SELECT * from mysql.general_log ORDER BY event_time DESC limit 4 |
| 2019-04-28 20:03:38.503333 | root[root] @ localhost [127.0.0.1] | 38 | 1 | Quit | |
| 2019-04-28 20:03:38.494745 | root[root] @ localhost [127.0.0.1] | 38 | 1 | Query | SELECT `id`,`age`,`name` FROM user |
| 2019-04-28 20:03:38.494018 | root[root] @ localhost [127.0.0.1] | 38 | 1 | Query | SET autocommit=1 |
+----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+*/
// filter(_.id>5)
.select("id").show()
}
}
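The mechanism can be sketched without Spark. In this toy model, a hypothetical JDBC-style source can translate a structural predicate (column, operator, value) into the WHERE clause it sends to MySQL, but an opaque Scala closure gives it nothing to translate, so it has to fetch everything and filter on the Spark side. (`Pushable`, `Opaque`, and `jdbcQuery` are illustrative names, not Spark APIs.)

```scala
// Toy model of why filter($"id" > 5) can be pushed down to MySQL
// while filter(_.id > 5) cannot.
case class User(id: Int, age: Int, name: String)

sealed trait Predicate
// Structural predicate: the source can read column, operator, and value.
case class Pushable(column: String, op: String, value: Int) extends Predicate
// Opaque Scala closure: the source can only call it, never inspect it.
case class Opaque(f: User => Boolean) extends Predicate

// The SQL a JDBC-style source would send for each kind of predicate.
def jdbcQuery(p: Predicate, neededColumns: Seq[String]): String = p match {
  case Pushable(col, op, v) =>
    // Both the column pruning and the filter travel to MySQL.
    s"SELECT ${neededColumns.map(c => s"`$c`").mkString(",")} FROM user " +
      s"WHERE (`$col` IS NOT NULL) AND (`$col` $op $v)"
  case Opaque(_) =>
    // Nothing is known about the closure: fetch all columns, no WHERE.
    "SELECT `id`,`age`,`name` FROM user"
}
```

`jdbcQuery(Pushable("id", ">", 5), Seq("id"))` reproduces the first general_log query above, and `jdbcQuery(Opaque(_.id > 5), Seq("id"))` reproduces the second.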
- Comparing the SQL that MySQL actually executed, we find the two versions differ greatly.
- SELECT `id` FROM user WHERE (`id` IS NOT NULL) AND (`id` > 5)
- SELECT `id`,`age`,`name` FROM user
The second query fetches far more data than necessary: every column of every row, with all filtering left to Spark.
- Using an anonymous function prevents Spark from applying pushdown optimization. When you pass a lambda to Spark, Catalyst cannot look inside the compiled function, so it has no way of knowing which columns of the DataFrame you use or what condition you test, and therefore can neither push the predicate down nor prune columns.
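Why the expression form is transparent can be sketched with a tiny Catalyst-style expression tree (the `Expr`/`Col`/`Lit`/`Gt` types below are illustrative stand-ins, not Spark's real Catalyst classes): Spark can walk $"id" > 5 and collect the columns it references, which is exactly what enables both the WHERE pushdown and the SELECT `id` pruning, whereas a compiled function value has no structure to walk.

```scala
// Minimal Catalyst-style expression tree: unlike a compiled closure,
// it can be traversed to find which columns a predicate touches.
sealed trait Expr
case class Col(name: String) extends Expr            // like $"id"
case class Lit(value: Int) extends Expr              // like 5
case class Gt(left: Expr, right: Expr) extends Expr  // like $"id" > 5

// Walk the tree and collect every column the predicate references;
// this is what makes both pushdown and column pruning possible.
def referencedColumns(e: Expr): Set[String] = e match {
  case Col(n)   => Set(n)
  case Lit(_)   => Set.empty
  case Gt(l, r) => referencedColumns(l) ++ referencedColumns(r)
}

val pushable: Expr = Gt(Col("id"), Lit(5))  // analogue of $"id" > 5
val opaque: Int => Boolean = _ > 5          // analogue of _.id > 5
// referencedColumns(pushable) can be computed; there is no equivalent
// question to ask of `opaque` -- a function value is a black box.
```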
- So when writing code, prefer the column-expression form, filter($"id" > 5), over the anonymous-function form.