The problem is that once you persist the data, `second_id` is incorporated into the cached table and is no longer treated as a constant. As a result, the planner can no longer infer that the query should be expressed as a Cartesian product, and it falls back to a standard `SortMergeJoin` with hash partitioning on `second_id`.

It would be trivial to achieve the same outcome without persistence, by wrapping the literal in a `udf`:

```python
from pyspark.sql.functions import lit, pandas_udf, PandasUDFType

# An identity pandas UDF: it returns its input unchanged, but the
# planner treats its output as opaque rather than a foldable constant.
@pandas_udf('integer', PandasUDFType.SCALAR)
def identity(x):
    return x

second_df = second_df.withColumn('second_id', identity(lit(1)))

result_df = first_df.join(second_df,
                          first_df.first_id == second_df.second_id,
                          'inner')
```
```python
result_df.explain()
```

```
== Physical Plan ==
*(6) SortMergeJoin [cast(first_id#4 as int)], [second_id#129], Inner
:- *(2) Sort [cast(first_id#4 as int) ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(cast(first_id#4 as int), 200)
:     +- *(1) Filter isnotnull(first_id#4)
:        +- Scan ExistingRDD[first_id#4]
+- *(5) Sort [second_id#129 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(second_id#129, 200)
   +
```
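To see why the trick works at execution time, note the contract of a scalar pandas UDF: it receives a pandas `Series` per batch and must return a `Series` of the same length. The `identity` function simply echoes each batch back, so the column values are unchanged, yet the optimizer can no longer fold the expression into a constant. A minimal sketch of that Series-in, Series-out behavior in plain pandas (no Spark session required; the function name mirrors the UDF above):

```python
import pandas as pd

# Same body as the scalar pandas UDF: takes a batch as a pandas Series
# and returns it unchanged. Spark would call this once per Arrow batch.
def identity(x: pd.Series) -> pd.Series:
    return x

# Each row of the column arrives as an element of the batch; the output
# matches the input element-for-element.
batch = pd.Series([1, 1, 1])
print(identity(batch).tolist())  # -> [1, 1, 1]
```

Because the values pass through untouched, the join semantics are identical to using `lit(1)` directly; only the plan the optimizer can choose changes.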