A small pitfall when comparing timestamp types in Spark SQL

彼岸枫雪非

Category: Spark

Article link: https://blog.csdn.net/u012543819/article/details/82348237

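Recently a colleague on my project team ran into a strange problem. Timestamp values inserted into a Spark table carry a fractional-seconds part, but that part is always 0, as in the following format: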

2018-08-31 16:46:30.0

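In this case, a query with the following condition fails to match the row whose value equals the boundary. That is, if the table contains a row with 2018-08-31 16:46:30.0, that row is not returned.

select * from t1 where time >= '2018-08-31 16:46:30.0'

Let's look at the physical plan: explain select * from t1 where time >= '2018-08-31 16:46:30.0'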

+------------------------------------------------------------------------------------------------------------------------
|plan |
+-------------------------------------------------------------------------------------------------------------------------
*(1) Filter (isnotnull(time#3) && (cast(time#3 as string) >= 2018-08-31 16:46:30))
+- HiveTableScan [time#3], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [time#3]|
+--------------------------------------------------------------------------------------------------------------------------

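The plan shows that the cast function converts the timestamp column to a string before the comparison is made. So we go to the implementation of cast to locate the problem.

There is a method there, castToString, whose pattern match includes a case for converting a timestamp to a string.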

private[this] def castToString(from: DataType): Any => Any = from match {
  case BinaryType => buildCast[Array[Byte]](_, UTF8String.fromBytes)
  case DateType => buildCast[Int](_, d => UTF8String.fromString(DateTimeUtils.dateToString(d)))
  case TimestampType => buildCast[Long](_,
    t => UTF8String.fromString(DateTimeUtils.timestampToString(t, timeZone)))

That leads to the actual conversion function:

// Converts Timestamp to string according to Hive TimestampWritable convention.
def timestampToString(us: SQLTimestamp, timeZone: TimeZone): String = {
  val ts = toJavaTimestamp(us)
  val timestampString = ts.toString
  val timestampFormat = getThreadLocalTimestampFormat(timeZone)
  val formatted = timestampFormat.format(ts)

  if (timestampString.length > 19 && timestampString.substring(19) != ".0") {
    formatted + timestampString.substring(19)
  } else {
    // This branch drops a fractional part that is zero, so the returned string has no ".0" suffix.
    formatted
  }
}
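
To make the effect concrete, here is a minimal sketch that reproduces the trimming logic outside Spark. The object `TimestampCastSketch` and the helper `toSparkString` are hypothetical names introduced for illustration. The sketch uses a plain `SimpleDateFormat` in the JVM default time zone in place of Spark's thread-local formatter, so it only approximates the real function.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat

object TimestampCastSketch {
  // Hypothetical helper that mirrors the branch logic of timestampToString quoted above.
  // Assumption: the JVM default time zone stands in for Spark's session time zone.
  def toSparkString(ts: Timestamp): String = {
    val full = ts.toString // e.g. "2018-08-31 16:46:30.0"
    val formatted = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(ts)
    if (full.length > 19 && full.substring(19) != ".0") {
      formatted + full.substring(19) // keep a non-zero fractional part
    } else {
      formatted // a zero fractional part is dropped
    }
  }

  def main(args: Array[String]): Unit = {
    val stored = Timestamp.valueOf("2018-08-31 16:46:30.0")
    val literal = "2018-08-31 16:46:30.0"
    val asString = toSparkString(stored)
    println(asString)
    // Lexicographic comparison, as done after the column is cast to string
    println(asString.compareTo(literal) >= 0)
  }
}
```

The converted value is a strict prefix of the literal, so it sorts before it, and a `>=` comparison on strings rejects the row even though the two timestamps are equal.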

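The cause is now clear. When the column is compared with a string, the values in the table are cast to strings, and the cast drops the ".0" suffix. The stored value becomes '2018-08-31 16:46:30', which no longer equals the literal '2018-08-31 16:46:30.0' and sorts before it, so the equal case does not match. For this kind of comparison it is better to convert the filter value to the same type as the column, which avoids such errors.

For example: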

select * from t1 where time >= timestamp('2018-08-31 16:46:30.0')

Look at the physical plan again:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|plan |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|== Physical Plan ==
*(1) Filter (isnotnull(time#3) && (time#3 >= 1535705190000000))
+- HiveTableScan [time#3], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [time#3]|

Now the comparison is between timestamp values rather than strings, so the row with the equal value matches.
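
The number in the plan is the literal expressed as microseconds since the Unix epoch. The sketch below shows the conversion under one assumption: that the session time zone is Asia/Shanghai (UTC+8). The article does not state the time zone, and the names `TimestampLiteralSketch` and `toEpochMicros` are hypothetical.

```scala
import java.time.{LocalDateTime, ZoneId}

object TimestampLiteralSketch {
  // Assumption: the session time zone is Asia/Shanghai (UTC+8); the article does not state it.
  val assumedZone: ZoneId = ZoneId.of("Asia/Shanghai")

  // Microseconds since the Unix epoch for a local date-time in the assumed zone.
  def toEpochMicros(local: String): Long = {
    val instant = LocalDateTime.parse(local).atZone(assumedZone).toInstant
    instant.getEpochSecond * 1000000L + instant.getNano / 1000L
  }

  def main(args: Array[String]): Unit = {
    println(toEpochMicros("2018-08-31T16:46:30"))
  }
}
```

Under that assumption the result is 1535705190000000, the value shown in the filter. Both sides of the comparison are then plain long values, so a zero fractional part cannot cause a mismatch.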
