A detailed look at the parameters of Spark's sample() operator

sample(withReplacement: scala.Boolean, fraction: scala.Double, seed: scala.Long)

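The sample operator draws a random sample from an RDD. It takes three parameters:

withReplacement: whether an element is put back after it has been drawn. true means it is put back, so the same element may appear in the sample more than once.

fraction: how much to sample, given as a double. Without replacement it must lie between 0 and 1 and is the probability that each element is selected, e.g. 0.3 samples roughly 30% of the data. With replacement it is the expected number of times each element is drawn, so it only has to be non-negative. In both cases the sample size is approximate, not exact.

seed: the seed for the random number generator that drives the sampling. In most cases the first two parameters are enough, so what is this one for? It is mainly useful for debugging: when you cannot tell whether the program or the data is at fault, fix the seed to a constant so that repeated runs over the same data draw the same sample.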
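To make the three parameters concrete, here is a minimal sketch in plain Java with no Spark dependency. It is not Spark's implementation. It assumes the usual model: sampling without replacement keeps each element independently with probability fraction, and sampling with replacement emits each element a Poisson-distributed number of times with mean fraction. The class and method names (SampleSketch, sampleWithoutReplacement, sampleWithReplacement) are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class SampleSketch {

    // Without replacement: keep each element independently with probability `fraction`,
    // so an element can never appear more often than it does in the input.
    public static <T> List<T> sampleWithoutReplacement(List<T> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        for (T item : data) {
            if (rng.nextDouble() < fraction) {
                out.add(item);
            }
        }
        return out;
    }

    // With replacement: emit each element k times, where k is drawn from a
    // Poisson distribution with mean `fraction` (Knuth's multiplication method).
    public static <T> List<T> sampleWithReplacement(List<T> data, double fraction, long seed) {
        Random rng = new Random(seed);
        List<T> out = new ArrayList<>();
        double limit = Math.exp(-fraction);
        for (T item : data) {
            int k = 0;
            double p = rng.nextDouble();
            while (p > limit) {
                k++;
                p *= rng.nextDouble();
            }
            for (int i = 0; i < k; i++) {
                out.add(item);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");

        // The sample size varies around fraction * data.size(); it is not exact.
        System.out.println("without replacement: " + sampleWithoutReplacement(data, 0.3, 42L));
        // Duplicates are possible here because elements are "put back".
        System.out.println("with replacement:    " + sampleWithReplacement(data, 0.3, 42L));
        // A fixed seed reproduces the same sample, which is what makes it useful for debugging.
        System.out.println("same seed, same sample: "
                + sampleWithoutReplacement(data, 0.3, 42L).equals(sampleWithoutReplacement(data, 0.3, 42L)));
    }
}
```

In a Spark job the equivalent of fixing the seed is to pass the third argument, for example sample(true, 0.4, 42L) on a JavaRDD or JavaPairRDD.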

================================================================================

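The code follows.

The rough idea: draw a sample of the data, run a wordCount over the sample, sort the result, and take the key that occurs most often. That key is the one causing the data skew.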

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
 
public class Day05 {
 
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("Day05");
        JavaSparkContext jsc = new JavaSparkContext(conf);
 
        List<String> keys = getKeyBySample(jsc);
        System.out.println("Key(s) causing data skew: "+keys);
        jsc.stop();
    }
 
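    /**
     * Samples the data with the sample operator and finds the key(s) causing data skew,
     * so that the computation can then be optimized specifically for them.
     * @param jsc the Spark context used to build the RDD
     * @return the keys judged to be skewed
     */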
    public static List<String> getKeyBySample(JavaSparkContext jsc){
        List<String> data = Arrays.asList("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
                "A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
                "A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
                "B","B","B","B","B","B","B","B","C","D","E","F","G");
 
        JavaRDD<String> rdd =  jsc.parallelize(data,2);
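        // Sample about 40% with replacement, count each key,
        // then swap to (count, key) and sort by count in descending order.
        List<Tuple2<Integer,String>> item =
                rdd.mapToPair(x->new Tuple2<String,Integer>(x,1))
                .sample(true,0.4)
                .reduceByKey((x,y)->x+y)
                .map(x->new Tuple2<Integer,String>(x._2,x._1))
                .sortBy(x->x._1,false,2)
                .take(3);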
 
        List<String> keys = new ArrayList<>();
        System.out.println("keys="+item);
        // item is sorted by count in descending order; compare each count with the next one.
        for(int i=0;i<item.size()-1;i++){
            Tuple2 current = item.get(i);
            Tuple2 next = item.get(i+1);
            int v1 = Integer.parseInt(current._1.toString());
            int v2 = Integer.parseInt(next._1.toString());
            System.out.println(v1+"   "+v2);
 
            /**
             * This logic is flawed: how to identify the key causing the skew
             * also depends on the specific business scenario.
             * This is only a simple check and it is quite limited.
             */
            if(v1/v2 >= 3){
                System.out.println("===");
                keys.add(current._2.toString());
            }
        }
        return keys;
    }
}
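As the comment in the listing admits, comparing neighbouring counts is a limited test. One possible alternative, shown here only as a sketch and not as the author's method, is to flag every key whose share of the sampled records reaches a threshold. The class name SkewCheck, the method findSkewedKeys and the 0.5 threshold are all assumptions made for this illustration; a suitable threshold depends on the workload. The code is plain Java and takes the per-key counts as a Map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkewCheck {

    // Returns the keys whose share of all sampled records is at least `minShare`.
    // `counts` maps each key to the number of times it occurs in the sample.
    public static List<String> findSkewedKeys(Map<String, Integer> counts, double minShare) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        List<String> skewed = new ArrayList<>();
        if (total == 0) {
            return skewed; // empty sample: nothing to report
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if ((double) e.getValue() / total >= minShare) {
                skewed.add(e.getKey());
            }
        }
        return skewed;
    }

    public static void main(String[] args) {
        // Hypothetical counts, similar in shape to what the sampled wordCount might produce.
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("A", 30);
        counts.put("B", 3);
        counts.put("C", 1);

        // A holds 30 of 34 sampled records, well above the 50% threshold.
        System.out.println(findSkewedKeys(counts, 0.5)); // prints [A]
    }
}
```

In the Spark job above, such a map could be obtained by calling collectAsMap() on the result of reduceByKey, which is reasonable only when the number of distinct keys in the sample is small enough to fit on the driver.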


--------------------- 

Original article: https://blog.csdn.net/lyzx_in_csdn/article/details/79948799
 
