spark sql dataset用匿名函数 的问题

dataset用匿名函数的问题
spark.sql("select * from user").as[User].filter($"id" > 5). select("id").show()
spark.sql("select * from user").as[User].filter(_.id>5) .select("id").show()

上面的两句话一样吗?
首先,结果肯定是一样的,那个效率高呢?

package org.apache.spark.scheduler.local

case class User(id: Int, age: Int, name: String)

import org.apache.spark.sql.SparkSession

object TestJdbcspark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    val load = spark.read
      .format("jdbc")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("url", "jdbc:mysql://localhost:3306/test?serverTimezone=UTC")
      .option("dbtable", "user")
      .option("user", "root")
      .option("password", "root")
      .load().createOrReplaceTempView("user")

    spark.sql("select * from user").as[User].
      /*mysql> SELECT * from mysql.general_log ORDER BY    event_time DESC limit 4;
    +----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
  | event_time                 | user_host                          | thread_id | server_id | command_type | argument                                                            |
  +----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
  | 2019-04-28 20:06:22.390209 | root[root] @ localhost [::1]       |        18 |         1 | Query        | SELECT * from mysql.general_log ORDER BY    event_time DESC limit 4 |
  | 2019-04-28 20:04:42.167509 | root[root] @ localhost [::1]       |        18 |         1 | Query        | SELECT * from mysql.general_log ORDER BY    event_time DESC limit 4 |
  | 2019-04-28 20:04:38.370703 | root[root] @ localhost [127.0.0.1] |        40 |         1 | Quit         |                                                                     |
  | 2019-04-28 20:04:38.363730 | root[root] @ localhost [127.0.0.1] |        40 |         1 | Query        | SELECT `id` FROM user WHERE (`id` IS NOT NULL) AND (`id` > 5)       |
  +----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+*/
      filter($"id" > 5)
      /*mysql> SELECT * from mysql.general_log ORDER BY    event_time DESC limit 4;
  +----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
  | event_time                 | user_host                          | thread_id | server_id | command_type | argument                                                            |
  +----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+
  | 2019-04-28 20:03:53.074452 | root[root] @ localhost [::1]       |        18 |         1 | Query        | SELECT * from mysql.general_log ORDER BY    event_time DESC limit 4 |
  | 2019-04-28 20:03:38.503333 | root[root] @ localhost [127.0.0.1] |        38 |         1 | Quit         |                                                                     |
  | 2019-04-28 20:03:38.494745 | root[root] @ localhost [127.0.0.1] |        38 |         1 | Query        | SELECT `id`,`age`,`name` FROM user                                  |
  | 2019-04-28 20:03:38.494018 | root[root] @ localhost [127.0.0.1] |        38 |         1 | Query        | SET autocommit=1                                                    |
  +----------------------------+------------------------------------+-----------+-----------+--------------+---------------------------------------------------------------------+*/
      // filter(_.id>5)
      .select("id").show()

  }
}
  • 通过查询mysql执行的sql,我们发现,这两个有很大的区别.
    1. SELECT id FROM user WHERE (id IS NOT NULL) AND (id > 5)
    2. SELECT id,age,name FROM user

    第二种写法查询了过多的数据.

  • 使用匿名函数会阻止spark 去做下推的优化.因为你传递一个匿名函数给spark.spark无法知道你要使用的是dataframe中的那一列.所以无法做下推优化.
  • 所以在写代码的时候,尽量使用filter($"id" > 5)这种方式
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值