The problem is that once you persist the data, `second_id` is incorporated into the cached table and is no longer treated as a constant. As a result, the planner can no longer infer that the query should be expressed as a Cartesian product, and it falls back to a standard `SortMergeJoin` with hash partitioning on `second_id`.

It would be trivial to achieve the same outcome without persistence, by wrapping the literal in a `udf`:

```python
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

# An identity pandas UDF: it returns its input unchanged, but the
# planner treats its output as opaque rather than a foldable constant.
@pandas_udf('integer', PandasUDFType.SCALAR)
def identity(x):
    return x

second_df = second_df.withColumn('second_id', identity(lit(1)))

result_df = first_df.join(second_df,
                          first_df.first_id == second_df.second_id,
                          'inner')
```
```python
result_df.explain()
```

```
== Physical Plan ==
*(6) SortMergeJoin [cast(first_id#4 as int)], [second_id#129], Inner
:- *(2) Sort [cast(first_id#4 as int) ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(cast(first_id#4 as int), 200)
:     +- *(1) Filter isnotnull(first_id#4)
:        +- Scan ExistingRDD[first_id#4]
+- *(5) Sort [second_id#129 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(second_id#129, 200)
   +
```
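To see why the trick works at execution time, note the contract of a scalar pandas UDF: it receives a pandas `Series` per batch and must return a `Series` of the same length. The `identity` function simply echoes each batch back, so the column values are unchanged, yet the optimizer can no longer fold the expression into a constant. A minimal sketch of that Series-in, Series-out behavior in plain pandas (no Spark session required; the function name mirrors the UDF above):

```python
import pandas as pd

# Same body as the scalar pandas UDF: takes a batch as a pandas Series
# and returns it unchanged. Spark would call this once per Arrow batch.
def identity(x: pd.Series) -> pd.Series:
    return x

# Each row of the column arrives as an element of the batch; the output
# matches the input element-for-element.
batch = pd.Series([1, 1, 1])
print(identity(batch).tolist())  # -> [1, 1, 1]
```

Because the values pass through untouched, the join semantics are identical to using `lit(1)` directly; only the plan the optimizer can choose changes.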