Spark 1.x Spark SQL: Data Skew Solutions

Latest recommended article published 2024-08-29 17:06:49

猿与禅

Category: spark. Tags: spark-sql

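Aggregate the source data

Filter out the keys that cause the skew with a WHERE condition

Increase shuffle parallelism with spark.sql.shuffle.partitions

  // The default parallelism is 200, which means only 200 reduce tasks
  sqlContext.setConf("spark.sql.shuffle.partitions", "1000");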

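Double GROUP BY: rewrite the SQL to group twice

Add a random prefix to one field with random_prefix()
Implement it as a UDF with the method call(String, Integer)
that returns randNum + "_" + val
Run a local aggregation on the prefixed key, then strip the random prefix
to recover the original value
Then run a second, global aggregation
With multiple keys, split them apart as RDDs
and then map the results back into a single table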
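
The two-stage aggregation described above can be sketched in plain Java, without Spark, to show why the final counts stay correct. This is only an illustration under stated assumptions: the class name `SaltedAggregation` and the helpers `randomPrefix` and `removeRandomPrefix` are hypothetical stand-ins for the `random_prefix()` and `remove_random_prefix()` UDFs, and the two loops play the role of the two GROUP BY passes.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Plain-Java sketch of a salted, two-stage count; not Spark code.
final class SaltedAggregation {

    // Stand-in for random_prefix(value, num): "<0..num-1>_<value>".
    static String randomPrefix(String value, int num, Random random) {
        return random.nextInt(num) + "_" + value;
    }

    // Stand-in for remove_random_prefix: drop everything up to the first "_".
    static String removeRandomPrefix(String prefixed) {
        return prefixed.substring(prefixed.indexOf('_') + 1);
    }

    static Map<String, Long> countByKey(List<String> keys, int num, Random random) {
        // Stage 1 (local aggregation): count per salted key, so one hot key
        // is spread over up to `num` different groups.
        Map<String, Long> partial = new HashMap<>();
        for (String key : keys) {
            partial.merge(randomPrefix(key, num, random), 1L, Long::sum);
        }
        // Stage 2 (global aggregation): strip the salt and sum the partial counts.
        Map<String, Long> total = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            total.merge(removeRandomPrefix(e.getKey()), e.getValue(), Long::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hot", "hot", "hot", "hot", "cold");
        System.out.println(countByKey(keys, 10, new Random()));
    }
}
```

Whatever prefixes the random generator picks, the second pass adds the partial counts back together, so the totals match a plain single-pass count. The benefit in Spark would come from the first pass spreading a hot key across several reduce tasks.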

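Convert a reduce join into a map join

Turn the table into an RDD and implement the map join by hand
Or rely on Spark SQL's built-in map join
By default, when one of the tables is smaller than 10 MB,
Spark SQL broadcasts that table and runs a map join
Raise this threshold to 20 MB, 50 MB, or even 1 GB

    // 20971520 bytes = 20 MB
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "20971520");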
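
As a rough illustration of what a hand-written map join does, the sketch below joins a large list of click records against a small in-memory lookup table. It is plain Java rather than Spark code, and the names (`MapJoinSketch`, `mapJoin`, the record layout) are made up for the example. In a Spark job the small map would typically be shipped to the executors as a broadcast variable and the loop body would run inside a map-style operator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the map-join idea; not Spark code.
final class MapJoinSketch {

    // Joins {productId, clickCount} records against a small productId -> name table.
    // The small table is a hash map held in memory, so the large side is scanned
    // once and never has to be regrouped by key.
    static List<String> mapJoin(List<long[]> clicks, Map<Long, String> smallTable) {
        List<String> joined = new ArrayList<>();
        for (long[] click : clicks) {
            String name = smallTable.get(click[0]);
            if (name != null) { // inner join: rows without a match are dropped
                joined.add(click[0] + "," + click[1] + "," + name);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<Long, String> products = new HashMap<>();
        products.put(1L, "phone");
        products.put(2L, "laptop");
        List<long[]> clicks = Arrays.asList(new long[]{1L, 5L}, new long[]{2L, 7L}, new long[]{3L, 1L});
        System.out.println(mapJoin(clicks, products));
    }
}
```

Because no record of the large side moves to another task, a skewed key on that side cannot overload a single reducer. The price is that the small table must fit in memory on every executor.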

Sample the skewed keys and join them separately

Relies heavily on Spark Core
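
The sampling step can be pictured with a small plain-Java sketch, given here as an assumption about how the approach is usually carried out rather than as the author's code. `SkewedKeySampling`, `mostFrequentKey`, and `split` are hypothetical names. With Spark Core, the same steps would typically be expressed with RDD operators such as `sample` and `filter`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Plain-Java sketch of "sample, find the hot key, split"; not Spark code.
final class SkewedKeySampling {

    // Counts keys in a random sample and returns the most frequent one
    // (null when the sample happens to be empty).
    static String mostFrequentKey(List<String> keys, double fraction, Random random) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            if (random.nextDouble() < fraction) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        String hottest = null;
        int best = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                hottest = e.getKey();
            }
        }
        return hottest;
    }

    // Splits the records into [records with the skewed key, all other records].
    static List<List<String>> split(List<String> keys, String skewedKey) {
        List<String> skewed = new ArrayList<>();
        List<String> rest = new ArrayList<>();
        for (String key : keys) {
            (key.equals(skewedKey) ? skewed : rest).add(key);
        }
        return Arrays.asList(skewed, rest);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "a", "c");
        String hot = mostFrequentKey(keys, 0.5, new Random());
        System.out.println(split(keys, hot));
    }
}
```

Each of the two parts is then joined on its own and the results are combined. The skewed part is the one that benefits from extra treatment such as random prefixes.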
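Random keys with an expanded table
Convert the Spark SQL step into Spark Core
Expand product_info 10x by adding a numeric prefix (0 to 9) to each row
Use a SQL subquery to add a random prefix on the other side of the join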

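        // Needs java.util.ArrayList, java.util.Arrays, java.util.List,
        // org.apache.spark.api.java.JavaRDD,
        // org.apache.spark.api.java.function.FlatMapFunction,
        // org.apache.spark.sql.DataFrame, Row, RowFactory,
        // and org.apache.spark.sql.types.DataTypes, StructType.
        JavaRDD<Row> rdd = sqlContext.sql("select * from product_info").javaRDD();

        // Emit 10 copies of every product row, with product_id prefixed by 0_ ... 9_
        JavaRDD<Row> flattedRDD = rdd.flatMap(new FlatMapFunction<Row, Row>() {

            private static final long serialVersionUID = 1L;

            @Override
            public Iterable<Row> call(Row row) throws Exception {
                List<Row> list = new ArrayList<Row>();
                long productid = row.getLong(0);

                for(int i = 0; i < 10; i ++) {
                    String _productid = i + "_" + productid;

                    Row _row = RowFactory.create(_productid, row.get(1), row.get(2));
                    list.add(_row);
                }

                return list;
            }

        });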

        // The third column is named extend_info because the join query reads pi.extend_info
        StructType _schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("product_id", DataTypes.StringType, true),
                DataTypes.createStructField("product_name", DataTypes.StringType, true),
                DataTypes.createStructField("extend_info", DataTypes.StringType, true)));

        DataFrame _df = sqlContext.createDataFrame(flattedRDD, _schema);
        _df.registerTempTable("tmp_product_info");

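        // The subquery salts product_id with one of 10 random prefixes; the outer query
        // joins on the salted key and then strips the prefix again.
        // '自营商品' means self-operated product, '第三方商品' means third-party product.
        String _sql =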
                "SELECT "
                    + "tapcc.area,"
                    + "remove_random_prefix(tapcc.product_id) product_id,"
                    + "tapcc.click_count,"
                    + "tapcc.city_infos,"
                    + "pi.product_name,"
                    + "if(get_json_object(pi.extend_info,'product_status')=0,'自营商品','第三方商品') product_status "
                + "FROM ("
                    + "SELECT "
                        + "area,"
                        + "random_prefix(product_id, 10) product_id,"
                        + "click_count,"
                        + "city_infos "
                    + "FROM tmp_area_product_click_count "
                + ") tapcc "
                + "JOIN tmp_product_info pi ON tapcc.product_id=pi.product_id ";
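
To see why the expanded table and the random prefix still produce the right join result, the logic can be mimicked in plain Java. The sketch below is an illustration only: `SaltedJoinSketch` and its methods are hypothetical, the inline prefixing stands in for what `random_prefix(product_id, 10)` and `remove_random_prefix` are assumed to do, and a `HashMap` replaces the `tmp_product_info` table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Plain-Java sketch of a salted join against a 10x expanded table; not Spark code.
final class SaltedJoinSketch {

    static final int N = 10; // expansion factor, matching the loop bound in the flatMap

    // One copy of each product row per prefix 0..N-1, keyed by "<i>_<productId>".
    static Map<String, String> expand(Map<Long, String> productInfo) {
        Map<String, String> expanded = new HashMap<>();
        for (Map.Entry<Long, String> e : productInfo.entrySet()) {
            for (int i = 0; i < N; i++) {
                expanded.put(i + "_" + e.getKey(), e.getValue());
            }
        }
        return expanded;
    }

    static List<String> join(List<Long> clickedIds, Map<Long, String> productInfo, Random random) {
        Map<String, String> expanded = expand(productInfo);
        List<String> result = new ArrayList<>();
        for (Long id : clickedIds) {
            // Salt the skewed side; any prefix in [0, N) has a matching expanded row.
            String salted = random.nextInt(N) + "_" + id;
            String name = expanded.get(salted);
            if (name != null) {
                // Strip the prefix to get the original product id back.
                String original = salted.substring(salted.indexOf('_') + 1);
                result.add(original + "," + name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, String> products = new HashMap<>();
        products.put(1L, "phone");
        System.out.println(join(Arrays.asList(1L, 1L, 1L, 2L), products, new Random()));
    }
}
```

Every salted key from the skewed side finds exactly one partner, because the expanded table contains all N prefixes for each product. The join output is therefore the same as without salting, while a hot product_id is spread over N distinct join keys.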