hive sql和spark sql查询同一条SQL语句,结果不一样(甚至条数差别很大)的原因

hive sql和spark sql查询同一条SQL语句,结果不一样(甚至条数差别很大)的原因

最近,好几位小伙伴遇到了,将同一句sql,分别在hive sql 查出来 160w ,但是在spark sql (spark-shell或者spark.sql()) 查出来只有150w,笔者之前也遇到类似的问题,遂总结形成此文,备忘!

0、结论:可以看下SQL 中 join on 两边的字段类型是否一致

我之前遇到这样的问题,发现on 条件 等号两边的字段类型不一致造成的

  • 当 on 条件两边字段类型,一边是string,一边是bigint; 在hive中   会 把 string,bigint 都转成 double,结果错误
  • 当 on 条件两边字段类型,一边是string,一边是decimal;spark中: 会 把 string,decimal 都转成了 double,结果错误

1、数据准备

user.csv

```

1,a1,1,1,1
2,a2,1111111111111111112,1111111111111111112,1111111111111111112
3,a3,1111111111111111113,1111111111111111113,1111111111111111113
4,a4,1111111111111111114,1111111111111111114,1111111111111111114
5,a5,1111111111111111115,1111111111111111115,1111111111111111115
```

create database new;

```
spark.read.format("csv").schema("id int, name string, user_id_bigint bigint, user_id_decimal decimal(20,0), user_id_string string").load("file:///home/work/workdir/test_spark_hive_sql_diff/user.csv").toDF.write.mode("overwrite").saveAsTable("new.test_type_user")

spark.sql("select * from new.test_type_user").show
+---+----+-------------------+-------------------+-------------------+
| id|name|     user_id_bigint|    user_id_decimal|     user_id_string|
+---+----+-------------------+-------------------+-------------------+
|  1|  a1|                  1|                  1|                  1|
|  2|  a2|1111111111111111112|1111111111111111112|1111111111111111112|
|  3|  a3|1111111111111111113|1111111111111111113|1111111111111111113|
|  4|  a4|1111111111111111114|1111111111111111114|1111111111111111114|
|  5|  a5|1111111111111111115|1111111111111111115|1111111111111111115|
+---+----+-------------------+-------------------+-------------------+

spark.sql("desc new.test_type_user").show
+---------------+-------------+-------+
|       col_name|    data_type|comment|
+---------------+-------------+-------+
|             id|          int|   null|
|           name|       string|   null|
| user_id_bigint|       bigint|   null|
|user_id_decimal|decimal(20,0)|   null|
| user_id_string|       string|   null|
+---------------+-------------+-------+

 ```




user-info.csv
```
1,1,1,xiaoming
1111111111111111112,1111111111111111112,1111111111111111112,k2data
1111111111111111113,1111111111111111113,1111111111111111113,xiaoming
1111111111111111114,1111111111111111114,1111111111111111114,k2data
```

```
 spark.read.format("csv").schema("user_info_id_bigint bigint, user_info_id_decimal decimal(20,0), user_info_id_string string, group string").load("file:///home/work/workdir/test_spark_hive_sql_diff/user-info.csv").toDF.write.mode("overwrite").saveAsTable("new.test_type_user_info")

spark.sql("select * from new.test_type_user_info").show
+-------------------+--------------------+-------------------+------+
|user_info_id_bigint|user_info_id_decimal|user_info_id_string| group|
+-------------------+--------------------+-------------------+------+
|                  1|                   1|                  1|xiaoming|
|1111111111111111112| 1111111111111111112|1111111111111111112|k2data|
|1111111111111111113| 1111111111111111113|1111111111111111113|xiaoming|
|1111111111111111114| 1111111111111111114|1111111111111111114|k2data|
+-------------------+--------------------+-------------------+------+


spark.sql("desc new.test_type_user_info").show
+--------------------+-------------+-------+
|            col_name|    data_type|comment|
+--------------------+-------------+-------+
| user_info_id_bigint|       bigint|   null|
|user_info_id_decimal|decimal(20,0)|   null|
| user_info_id_string|       string|   null|
|               group|       string|   null|
+--------------------+-------------+-------+


 ```

2、当 on 条件两边字段类型都一致的情况下,毫无疑问,在hive和spark中都是正确的

bigint = bigint 正确结果
```
select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint


# spark
spark.sql("select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint").show
+---+----+-------------------+-------------------+
| id|name|     user_id_bigint|user_info_id_bigint|
+---+----+-------------------+-------------------+
|  1|  a1|                  1|                  1|
|  2|  a2|1111111111111111112|1111111111111111112|
|  3|  a3|1111111111111111113|1111111111111111113|
|  4|  a4|1111111111111111114|1111111111111111114|
|  5|  a5|1111111111111111115|               null|
+---+----+-------------------+-------------------+

spark.sql("select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint").explain
== Physical Plan ==
*BroadcastHashJoin [user_id_bigint#27L], [user_info_id_bigint#89L], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_bigint#27L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_bigint:bigint>
+- *BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
   +- *FileScan parquet new.test_type_user_info[user_info_id_bigint#89L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_bigint:bigint>




# hive
select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint;

+-----+-------+----------------------+----------------------+--+
| id  | name  |    user_id_bigint    | user_info_id_bigint  |
+-----+-------+----------------------+----------------------+--+
| 1   | a1    | 1                    | 1                    |
| 2   | a2    | 1111111111111111112  | 1111111111111111112  |
| 3   | a3    | 1111111111111111113  | 1111111111111111113  |
| 4   | a4    | 1111111111111111114  | 1111111111111111114  |
| 5   | a5    | 1111111111111111115  | NULL                 |
+-----+-------+----------------------+----------------------+--+

explain select id,name, user_id_bigint,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_bigint;

+-------------------------------------------------------------------------------------------------------------------+--+
|                                                      Explain                                                      |
+-------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                               |
|   Stage-4 is a root stage                                                                                         |
|   Stage-3 depends on stages: Stage-4                                                                              |
|   Stage-0 depends on stages: Stage-3                                                                              |
|                                                                                                                   |
| STAGE PLANS:                                                                                                      |
|   Stage: Stage-4                                                                                                  |
|     Map Reduce Local Work                                                                                         |
|       Alias -> Map Local Tables:                                                                                  |
|         ui                                                                                                        |
|           Fetch Operator                                                                                          |
|             limit: -1                                                                                             |
|       Alias -> Map Local Operator Tree:                                                                           |
|         ui                                                                                                        |
|           TableScan                                                                                               |
|             alias: ui                                                                                             |
|             Statistics: Num rows: 156 Data size: 1253 Basic stats: COMPLETE Column stats: NONE                    |
|             HashTable Sink Operator                                                                               |
|               keys:                                                                                               |
|                 0 user_id_bigint (type: bigint)                                                                   |
|                 1 user_info_id_bigint (type: bigint)                                                              |
|                                                                                                                   |
|   Stage: Stage-3                                                                                                  |
|     Map Reduce                                                                                                    |
|       Map Operator Tree:                                                                                          |
|           TableScan                                                                                               |
|             alias: u                                                                                              |
|             Statistics: Num rows: 12 Data size: 1365 Basic stats: COMPLETE Column stats: NONE                     |
|             Map Join Operator                                                                                     |
|               condition map:                                                                                      |
|                    Left Outer Join0 to 1                                                                          |
|               keys:                                                                                               |
|                 0 user_id_bigint (type: bigint)                                                                   |
|                 1 user_info_id_bigint (type: bigint)                                                              |
|               outputColumnNames: _col0, _col1, _col2, _col8                                                       |
|               Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                  |
|               Select Operator                                                                                     |
|                 expressions: _col0 (type: int), _col1 (type: string), _col2 (type: bigint), _col8 (type: bigint)  |
|                 outputColumnNames: _col0, _col1, _col2, _col3                                                     |
|                 Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                |
|                 File Output Operator                                                                              |
|                   compressed: false                                                                               |
|                   Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE              |
|                   table:                                                                                          |
|                       input format: org.apache.hadoop.mapred.TextInputFormat                                      |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                   |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                   |
|       Local Work:                                                                                                 |
|         Map Reduce Local Work                                                                                     |
|                                                                                                                   |
|   Stage: Stage-0                                                                                                  |
|     Fetch Operator                                                                                                |
|       limit: -1                                                                                                   |
|       Processor Tree:                                                                                             |
|         ListSink                                                                                                  |
|                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------+--+


2、类似可知:

同样 decimal = decimal 正确结果

同样 string = string 正确结果

select id,name, user_id_decimal,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_decimal
select id,name, user_id_string,user_info_id_string from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_string

3、当 on 条件两边字段类型不一致的情况下,需要看类型转换方向决定,判断结果是否正确

3.1、当 on 条件两边字段类型,一边是decimal,一边是bigint

无论是
bigint=decimal
decimal=bigin

都转成 decimal 正确

1)、bigint=decimal  正确

spark:

select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal;

spark.sql("select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal").show()
+---+----+-------------------+--------------------+
| id|name|     user_id_bigint|user_info_id_decimal|
+---+----+-------------------+--------------------+
|  1|  a1|                  1|                   1|
|  2|  a2|1111111111111111112| 1111111111111111112|
|  3|  a3|1111111111111111113| 1111111111111111113|
|  4|  a4|1111111111111111114| 1111111111111111114|
|  5|  a5|1111111111111111115|                null|
+---+----+-------------------+--------------------+


spark.sql("select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal").explain

== Physical Plan ==
*BroadcastHashJoin [cast(user_id_bigint#27L as decimal(20,0))], [user_info_id_decimal#90], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_bigint#27L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_bigint:bigint>
+- *BroadcastExchange HashedRelationBroadcastMode(List(input[0, decimal(20,0), true]))
   +- *FileScan parquet new.test_type_user_info[user_info_id_decimal#90] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_decimal:decimal(20,0)>


hive:

select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal;

+-----+-------+----------------------+-----------------------+--+
| id  | name  |    user_id_bigint    | user_info_id_decimal  |
+-----+-------+----------------------+-----------------------+--+
| 1   | a1    | 1                    | 1                     |
| 2   | a2    | 1111111111111111112  | 1111111111111111112   |
| 3   | a3    | 1111111111111111113  | 1111111111111111113   |
| 4   | a4    | 1111111111111111114  | 1111111111111111114   |
| 5   | a5    | 1111111111111111115  | NULL                  |
+-----+-------+----------------------+-----------------------+--+

explain select id,name, user_id_bigint,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_bigint = ui.user_info_id_decimal;

+--------------------------------------------------------------------------------------------------------------------------+--+
|                                                         Explain                                                          |
+--------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                                      |
|   Stage-4 is a root stage                                                                                                |
|   Stage-3 depends on stages: Stage-4                                                                                     |
|   Stage-0 depends on stages: Stage-3                                                                                     |
|                                                                                                                          |
| STAGE PLANS:                                                                                                             |
|   Stage: Stage-4                                                                                                         |
|     Map Reduce Local Work                                                                                                |
|       Alias -> Map Local Tables:                                                                                         |
|         ui                                                                                                               |
|           Fetch Operator                                                                                                 |
|             limit: -1                                                                                                    |
|       Alias -> Map Local Operator Tree:                                                                                  |
|         ui                                                                                                               |
|           TableScan                                                                                                      |
|             alias: ui                                                                                                    |
|             Statistics: Num rows: 11 Data size: 1253 Basic stats: COMPLETE Column stats: NONE                            |
|             HashTable Sink Operator                                                                                      |
|               keys:                                                                                                      |
|                 0 CAST( user_id_bigint AS decimal(20,0)) (type: decimal(20,0))                                           |
|                 1 user_info_id_decimal (type: decimal(20,0))                                                             |
|                                                                                                                          |
|   Stage: Stage-3                                                                                                         |
|     Map Reduce                                                                                                           |
|       Map Operator Tree:                                                                                                 |
|           TableScan                                                                                                      |
|             alias: u                                                                                                     |
|             Statistics: Num rows: 12 Data size: 1365 Basic stats: COMPLETE Column stats: NONE                            |
|             Map Join Operator                                                                                            |
|               condition map:                                                                                             |
|                    Left Outer Join0 to 1                                                                                 |
|               keys:                                                                                                      |
|                 0 CAST( user_id_bigint AS decimal(20,0)) (type: decimal(20,0))                                           |
|                 1 user_info_id_decimal (type: decimal(20,0))                                                             |
|               outputColumnNames: _col0, _col1, _col2, _col9                                                              |
|               Statistics: Num rows: 13 Data size: 1501 Basic stats: COMPLETE Column stats: NONE                          |
|               Select Operator                                                                                            |
|                 expressions: _col0 (type: int), _col1 (type: string), _col2 (type: bigint), _col9 (type: decimal(20,0))  |
|                 outputColumnNames: _col0, _col1, _col2, _col3                                                            |
|                 Statistics: Num rows: 13 Data size: 1501 Basic stats: COMPLETE Column stats: NONE                        |
|                 File Output Operator                                                                                     |
|                   compressed: false                                                                                      |
|                   Statistics: Num rows: 13 Data size: 1501 Basic stats: COMPLETE Column stats: NONE                      |
|                   table:                                                                                                 |
|                       input format: org.apache.hadoop.mapred.TextInputFormat                                             |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                          |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                          |
|       Local Work:                                                                                                        |
|         Map Reduce Local Work                                                                                            |
|                                                                                                                          |
|   Stage: Stage-0                                                                                                         |
|     Fetch Operator                                                                                                       |
|       limit: -1                                                                                                          |
|       Processor Tree:                                                                                                    |
|         ListSink                                                                                                         |
|                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------+--+


2)、decimal=bigint 正确

select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint;
+-----+-------+----------------------+----------------------+--+
| id  | name  |   user_id_decimal    | user_info_id_bigint  |
+-----+-------+----------------------+----------------------+--+
| 1   | a1    | 1                    | 1                    |
| 2   | a2    | 1111111111111111112  | 1111111111111111112  |
| 3   | a3    | 1111111111111111113  | 1111111111111111113  |
| 4   | a4    | 1111111111111111114  | 1111111111111111114  |
| 5   | a5    | 1111111111111111115  | NULL                 |
+-----+-------+----------------------+----------------------+--+

explain select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint;
+--------------------------------------------------------------------------------------------------------------------------+--+
|                                                         Explain                                                          |
+--------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                                      |
|   Stage-4 is a root stage                                                                                                |
|   Stage-3 depends on stages: Stage-4                                                                                     |
|   Stage-0 depends on stages: Stage-3                                                                                     |
|                                                                                                                          |
| STAGE PLANS:                                                                                                             |
|   Stage: Stage-4                                                                                                         |
|     Map Reduce Local Work                                                                                                |
|       Alias -> Map Local Tables:                                                                                         |
|         ui                                                                                                               |
|           Fetch Operator                                                                                                 |
|             limit: -1                                                                                                    |
|       Alias -> Map Local Operator Tree:                                                                                  |
|         ui                                                                                                               |
|           TableScan                                                                                                      |
|             alias: ui                                                                                                    |
|             Statistics: Num rows: 156 Data size: 1253 Basic stats: COMPLETE Column stats: NONE                           |
|             HashTable Sink Operator                                                                                      |
|               keys:                                                                                                      |
|                 0 user_id_decimal (type: decimal(20,0))                                                                  |
|                 1 CAST( user_info_id_bigint AS decimal(20,0)) (type: decimal(20,0))                                      |
|                                                                                                                          |
|   Stage: Stage-3                                                                                                         |
|     Map Reduce                                                                                                           |
|       Map Operator Tree:                                                                                                 |
|           TableScan                                                                                                      |
|             alias: u                                                                                                     |
|             Statistics: Num rows: 6 Data size: 1365 Basic stats: COMPLETE Column stats: NONE                             |
|             Map Join Operator                                                                                            |
|               condition map:                                                                                             |
|                    Left Outer Join0 to 1                                                                                 |
|               keys:                                                                                                      |
|                 0 user_id_decimal (type: decimal(20,0))                                                                  |
|                 1 CAST( user_info_id_bigint AS decimal(20,0)) (type: decimal(20,0))                                      |
|               outputColumnNames: _col0, _col1, _col3, _col8                                                              |
|               Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                         |
|               Select Operator                                                                                            |
|                 expressions: _col0 (type: int), _col1 (type: string), _col3 (type: decimal(20,0)), _col8 (type: bigint)  |
|                 outputColumnNames: _col0, _col1, _col2, _col3                                                            |
|                 Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                       |
|                 File Output Operator                                                                                     |
|                   compressed: false                                                                                      |
|                   Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                     |
|                   table:                                                                                                 |
|                       input format: org.apache.hadoop.mapred.TextInputFormat                                             |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                          |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                          |
|       Local Work:                                                                                                        |
|         Map Reduce Local Work                                                                                            |
|                                                                                                                          |
|   Stage: Stage-0                                                                                                         |
|     Fetch Operator                                                                                                       |
|       limit: -1                                                                                                          |
|       Processor Tree:                                                                                                    |
|         ListSink                                                                                                         |
|                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------+--+
56 rows selected (0.262 seconds)


spark.sql("select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint").explain

== Physical Plan ==
*BroadcastHashJoin [user_id_decimal#28], [cast(user_info_id_bigint#89L as decimal(20,0))], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_decimal#28] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_decimal:decimal(20,0)>
+- *BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, bigint, true] as decimal(20,0))))
   +- *FileScan parquet new.test_type_user_info[user_info_id_bigint#89L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_bigint:bigint>

spark.sql("select id,name, user_id_decimal,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_decimal = ui.user_info_id_bigint").show()
+---+----+-------------------+-------------------+
| id|name|    user_id_decimal|user_info_id_bigint|
+---+----+-------------------+-------------------+
|  1|  a1|                  1|                  1|
|  2|  a2|1111111111111111112|1111111111111111112|
|  3|  a3|1111111111111111113|1111111111111111113|
|  4|  a4|1111111111111111114|1111111111111111114|
|  5|  a5|1111111111111111115|               null|
+---+----+-------------------+-------------------+

3.2、当 on 条件两边字段类型,一边是string,一边是bigint

spark-sql:  把 string 转成 bigint,结果正确
spark-shell:  把 string 转成 bigint,结果正确
hive:   把 string,bigint 都转成 double,结果错误

select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint

spark.sql("select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint").show()

spark.sql("select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint").explain

== Physical Plan ==
*BroadcastHashJoin [cast(user_id_string#29 as bigint)], [user_info_id_bigint#89L], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_string#29] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_string:string>
+- *BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
   +- *FileScan parquet new.test_type_user_info[user_info_id_bigint#89L] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_bigint:bigint>

SELECT LENGTH(1111111111111111113),length(111111111111111111);
+------+------+--+
| _c0  | _c1  |
+------+------+--+
| 19   | 18   |
+------+------+--+

## spark 结果正确
+---+----+-------------------+-------------------+
| id|name|     user_id_string|user_info_id_bigint|
+---+----+-------------------+-------------------+
|  1|  a1|                  1|                  1|
|  2|  a2|1111111111111111112|1111111111111111112|
|  3|  a3|1111111111111111113|1111111111111111113|
|  4|  a4|1111111111111111114|1111111111111111114|
|  5|  a5|1111111111111111115|               null|
+---+----+-------------------+-------------------+

#hive 结果不正确 

select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint;
+-----+-------+----------------------+----------------------+--+
| id  | name  |    user_id_string    | user_info_id_bigint  |
+-----+-------+----------------------+----------------------+--+
| 1   | a1    | 1                    | 1                    |
| 2   | a2    | 1111111111111111112  | 1111111111111111112  |
| 2   | a2    | 1111111111111111112  | 1111111111111111113  |
| 2   | a2    | 1111111111111111112  | 1111111111111111114  |
| 3   | a3    | 1111111111111111113  | 1111111111111111112  |
| 3   | a3    | 1111111111111111113  | 1111111111111111113  |
| 3   | a3    | 1111111111111111113  | 1111111111111111114  |
| 4   | a4    | 1111111111111111114  | 1111111111111111112  |
| 4   | a4    | 1111111111111111114  | 1111111111111111113  |
| 4   | a4    | 1111111111111111114  | 1111111111111111114  |
| 5   | a5    | 1111111111111111115  | 1111111111111111112  |
| 5   | a5    | 1111111111111111115  | 1111111111111111113  |
| 5   | a5    | 1111111111111111115  | 1111111111111111114  |
+-----+-------+----------------------+----------------------+--+

explain select id,name, user_id_string,user_info_id_bigint from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_bigint;
+-------------------------------------------------------------------------------------------------------------------+--+
|                                                      Explain                                                      |
+-------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                               |
|   Stage-4 is a root stage                                                                                         |
|   Stage-3 depends on stages: Stage-4                                                                              |
|   Stage-0 depends on stages: Stage-3                                                                              |
|                                                                                                                   |
| STAGE PLANS:                                                                                                      |
|   Stage: Stage-4                                                                                                  |
|     Map Reduce Local Work                                                                                         |
|       Alias -> Map Local Tables:                                                                                  |
|         ui                                                                                                        |
|           Fetch Operator                                                                                          |
|             limit: -1                                                                                             |
|       Alias -> Map Local Operator Tree:                                                                           |
|         ui                                                                                                        |
|           TableScan                                                                                               |
|             alias: ui                                                                                             |
|             Statistics: Num rows: 156 Data size: 1253 Basic stats: COMPLETE Column stats: NONE                    |
|             HashTable Sink Operator                                                                               |
|               keys:                                                                                               |
|                 0 UDFToDouble(user_id_string) (type: double)                                                      |
|                 1 UDFToDouble(user_info_id_bigint) (type: double)                                                 |
|                                                                                                                   |
|   Stage: Stage-3                                                                                                  |
|     Map Reduce                                                                                                    |
|       Map Operator Tree:                                                                                          |
|           TableScan                                                                                               |
|             alias: u                                                                                              |
|             Statistics: Num rows: 6 Data size: 1365 Basic stats: COMPLETE Column stats: NONE                      |
|             Map Join Operator                                                                                     |
|               condition map:                                                                                      |
|                    Left Outer Join0 to 1                                                                          |
|               keys:                                                                                               |
|                 0 UDFToDouble(user_id_string) (type: double)                                                      |
|                 1 UDFToDouble(user_info_id_bigint) (type: double)                                                 |
|               outputColumnNames: _col0, _col1, _col4, _col8                                                       |
|               Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                  |
|               Select Operator                                                                                     |
|                 expressions: _col0 (type: int), _col1 (type: string), _col4 (type: string), _col8 (type: bigint)  |
|                 outputColumnNames: _col0, _col1, _col2, _col3                                                     |
|                 Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                |
|                 File Output Operator                                                                              |
|                   compressed: false                                                                               |
|                   Statistics: Num rows: 171 Data size: 1378 Basic stats: COMPLETE Column stats: NONE              |
|                   table:                                                                                          |
|                       input format: org.apache.hadoop.mapred.TextInputFormat                                      |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                   |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                   |
|       Local Work:                                                                                                 |
|         Map Reduce Local Work                                                                                     |
|                                                                                                                   |
|   Stage: Stage-0                                                                                                  |
|     Fetch Operator                                                                                                |
|       limit: -1                                                                                                   |
|       Processor Tree:                                                                                             |
|         ListSink                                                                                                  |
|                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------+--+

 3.3、当 on 条件两边字段类型,一边是string,一边是decimal

spark-shell: string,decimal 都转成了 double,结果错误
spark-sql: string,decimal 都转成了 double,结果错误
hive: string,decimal 都转成了 decimal(38,18), 结果正确

spark.sql("select id,name, user_id_string,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_decimal").explain
== Physical Plan ==
*BroadcastHashJoin [cast(user_id_string#29 as double)], [cast(user_info_id_decimal#90 as double)], LeftOuter, BuildRight
:- *FileScan parquet new.test_type_user[id#25,name#26,user_id_string#29] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int,name:string,user_id_string:string>
+- *BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, decimal(20,0), true] as double)))
   +- *FileScan parquet new.test_type_user_info[user_info_id_decimal#90] Batched: false, Format: Parquet, Location: InMemoryFileIndex[hdfs://xxx/warehouse/new.db/test_type_user_info], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<user_info_id_decimal:decimal(20,0)>

spark.sql("select id,name, user_id_string,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_decimal").show()


explain select id,name, user_id_string,user_info_id_decimal from new.test_type_user u left join new.test_type_user_info ui on u.user_id_string = ui.user_info_id_decimal;
string = decimal

+--------------------------------------------------------------------------------------------------------------------------+--+
|                                                         Explain                                                          |
+--------------------------------------------------------------------------------------------------------------------------+--+
| STAGE DEPENDENCIES:                                                                                                      |
|   Stage-4 is a root stage                                                                                                |
|   Stage-3 depends on stages: Stage-4                                                                                     |
|   Stage-0 depends on stages: Stage-3                                                                                     |
|                                                                                                                          |
| STAGE PLANS:                                                                                                             |
|   Stage: Stage-4                                                                                                         |
|     Map Reduce Local Work                                                                                                |
|       Alias -> Map Local Tables:                                                                                         |
|         ui                                                                                                               |
|           Fetch Operator                                                                                                 |
|             limit: -1                                                                                                    |
|       Alias -> Map Local Operator Tree:                                                                                  |
|         ui                                                                                                               |
|           TableScan                                                                                                      |
|             alias: ui                                                                                                    |
|             Statistics: Num rows: 11 Data size: 1253 Basic stats: COMPLETE Column stats: NONE                            |
|             HashTable Sink Operator                                                                                      |
|               keys:                                                                                                      |
|                 0 CAST( user_id_string AS decimal(38,18)) (type: decimal(38,18))                                         |
|                 1 user_info_id_decimal (type: decimal(38,18))                                                            |
|                                                                                                                          |
|   Stage: Stage-3                                                                                                         |
|     Map Reduce                                                                                                           |
|       Map Operator Tree:                                                                                                 |
|           TableScan                                                                                                      |
|             alias: u                                                                                                     |
|             Statistics: Num rows: 6 Data size: 1365 Basic stats: COMPLETE Column stats: NONE                             |
|             Map Join Operator                                                                                            |
|               condition map:                                                                                             |
|                    Left Outer Join0 to 1                                                                                 |
|               keys:                                                                                                      |
|                 0 CAST( user_id_string AS decimal(38,18)) (type: decimal(38,18))                                         |
|                 1 user_info_id_decimal (type: decimal(38,18))                                                            |
|               outputColumnNames: _col0, _col1, _col4, _col9                                                              |
|               Statistics: Num rows: 12 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                          |
|               Select Operator                                                                                            |
|                 expressions: _col0 (type: int), _col1 (type: string), _col4 (type: string), _col9 (type: decimal(20,0))  |
|                 outputColumnNames: _col0, _col1, _col2, _col3                                                            |
|                 Statistics: Num rows: 12 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                        |
|                 File Output Operator                                                                                     |
|                   compressed: false                                                                                      |
|                   Statistics: Num rows: 12 Data size: 1378 Basic stats: COMPLETE Column stats: NONE                      |
|                   table:                                                                                                 |
|                       input format: org.apache.hadoop.mapred.TextInputFormat                                             |
|                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat                          |
|                       serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe                                          |
|       Local Work:                                                                                                        |
|         Map Reduce Local Work                                                                                            |
|                                                                                                                          |
|   Stage: Stage-0                                                                                                         |
|     Fetch Operator                                                                                                       |
|       limit: -1                                                                                                          |
|       Processor Tree:                                                                                                    |
|         ListSink                                                                                                         |
|                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------+--+

## spark-shell 结果
+---+----+-------------------+--------------------+
| id|name|     user_id_string|user_info_id_decimal|
+---+----+-------------------+--------------------+
|  1|  a1|                  1|                   1|
|  2|  a2|1111111111111111112| 1111111111111111114|
|  2|  a2|1111111111111111112| 1111111111111111113|
|  2|  a2|1111111111111111112| 1111111111111111112|
|  3|  a3|1111111111111111113| 1111111111111111114|
|  3|  a3|1111111111111111113| 1111111111111111113|
|  3|  a3|1111111111111111113| 1111111111111111112|
|  4|  a4|1111111111111111114| 1111111111111111114|
|  4|  a4|1111111111111111114| 1111111111111111113|
|  4|  a4|1111111111111111114| 1111111111111111112|
|  5|  a5|1111111111111111115| 1111111111111111114|
|  5|  a5|1111111111111111115| 1111111111111111113|
|  5|  a5|1111111111111111115| 1111111111111111112|
+---+----+-------------------+--------------------+

## spark-sql 结果
1	a1	1	1
2	a2	1111111111111111112	1111111111111111114
2	a2	1111111111111111112	1111111111111111113
2	a2	1111111111111111112	1111111111111111112
3	a3	1111111111111111113	1111111111111111114
3	a3	1111111111111111113	1111111111111111113
3	a3	1111111111111111113	1111111111111111112
4	a4	1111111111111111114	1111111111111111114
4	a4	1111111111111111114	1111111111111111113
4	a4	1111111111111111114	1111111111111111112
5	a5	1111111111111111115	1111111111111111114
5	a5	1111111111111111115	1111111111111111113
5	a5	1111111111111111115	1111111111111111112
Time taken: 0.401 seconds, Fetched 13 row(s)


## hive 结果
+-----+-------+----------------------+-----------------------+--+
| id  | name  |    user_id_string    | user_info_id_decimal  |
+-----+-------+----------------------+-----------------------+--+
| 1   | a1    | 1                    | 1                     |
| 2   | a2    | 1111111111111111112  | 1111111111111111112   |
| 3   | a3    | 1111111111111111113  | 1111111111111111113   |
| 4   | a4    | 1111111111111111114  | 1111111111111111114   |
| 5   | a5    | 1111111111111111115  | NULL                  |
+-----+-------+----------------------+-----------------------+--+
 
  • 5
    点赞
  • 12
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值