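I. sample
1. Description
Randomly selects a portion of the records in an RDD, at the specified fraction and using the given random seed, and creates a new RDD from them. Returns RDD[T]
2. Source code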
// Return a sampled subset of this RDD
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")
  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      // With replacement: Poisson sampler
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      // Without replacement: Bernoulli sampler
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}
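- Parameters
withReplacement: whether to sample with replacement. true = with replacement, false = without replacement
fraction: the expected size of the sample as a fraction of this RDD's size
When withReplacement=false, this is the probability that each element is selected; fraction must be in [0, 1]
When withReplacement=true, this is the expected number of times each element is selected; fraction must be >= 0
seed: seed for the random number generator. Usually left at its default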
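The two meanings of fraction can be illustrated without a cluster. The following is a minimal sketch in plain Scala, not Spark's implementation: the object SamplingSketch and its methods bernoulliSample, poisson and poissonSample are hypothetical names introduced here, and Spark's own BernoulliSampler and PoissonSampler run per partition and are implemented differently. It only models the behaviour described in the parameter list.

```scala
import scala.util.Random

// Hypothetical helper object, for illustration only (not part of Spark).
object SamplingSketch {
  // Without replacement: one Bernoulli trial per element, so each element is
  // kept with probability `fraction` and appears at most once.
  def bernoulliSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0 && fraction <= 1.0, s"fraction must be in [0, 1], got $fraction")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draw from a Poisson distribution with the given mean (Knuth's
  // multiplication method; adequate for the small means used here).
  def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement: each element is emitted a Poisson(fraction)-distributed
  // number of times, so it can appear zero, one, or several times.
  def poissonSample[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    require(fraction >= 0.0, s"fraction must be nonnegative, got $fraction")
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(2, 3, 7, 4, 8)
    println(bernoulliSample(data, 0.5, seed = 42L)) // a subset, no duplicates
    println(poissonSample(data, 2.0, seed = 42L))   // duplicates are possible
  }
}
```

Calling either method twice with the same seed gives the same result, which is why a fixed seed is useful when a sample has to be reproducible.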
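3. Examples
- Sampling without replacement, each element selected with probability 0.5: fraction=0.5
val rdd = sc.parallelize(List(2, 3, 7, 4, 8))
val sampleRdd = rdd.sample(false, 0.5)
// collect() brings the sampled elements to the driver, so they are printed on the driver's console
sampleRdd.collect().foreach(println)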
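- Sampling with replacement, each element selected an expected 2 times: fraction=2
// Arguments: (with/without replacement, sampling fraction, random seed); the seed is omitted here, so its default is used
val rdd = sc.parallelize(List(2, 3, 7, 4, 8))
val sampleRdd = rdd.sample(true, 2.0)
// collect() brings the sampled elements to the driver, so they are printed on the driver's console
sampleRdd.collect().foreach(println)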
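The number of elements returned by sample is random, which is why repeated runs of the examples above can print different amounts of output. As a rough worked calculation, assume each of the n elements is treated independently (a Bernoulli trial with probability f without replacement, a Poisson count with mean f with replacement). Here n, f and S (the sample) are symbols introduced for this illustration:

```latex
\mathbb{E}\big[|S|\big] = n f, \qquad
\operatorname{Var}\big[|S|\big] =
\begin{cases}
n f (1 - f) & \text{without replacement (Bernoulli)} \\
n f & \text{with replacement (Poisson)}
\end{cases}
```

For the 5-element RDD above, fraction=0.5 without replacement gives an expected size of 2.5 with variance 1.25, and fraction=2 with replacement gives an expected size of 10 with variance 10.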
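II. takeSample
1. Description
Returns a fixed-size sampled subset of this RDD. Returns Array[T]
Note: use this method only when the resulting array is expected to be small, because all of the data is loaded into the driver's memory
2. Source code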
def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = withScope {
  val numStDev = 10.0
  require(num >= 0, "Negative number of elements requested")
  require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    "Cannot support a sample size > Int.MaxValue - " +
    s"$numStDev * math.sqrt(Int.MaxValue)")
  if (num == 0) {
    new Array[T](0)
  } else {
    val initialCount = this.count()
    if (initialCount == 0) {
      new Array[T](0)
    } else {
      val rand = new Random(seed)
      if (!withReplacement && num >= initialCount) {
        Utils.randomizeInPlace(this.collect(), rand)
      } else {
        val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
          withReplacement)
        var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
        var numIters = 0
        while (samples.length < num) {
          logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
          samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
          numIters += 1
        }
        Utils.randomizeInPlace(samples, rand).take(num)
      }
    }
  }
}
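- Parameters
withReplacement: whether to sample with replacement. true = with replacement, false = without replacement
num: size of the returned sample
seed: seed for the random number generator. Usually left at its default
3. Examples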
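The control flow of takeSample (return early for empty input, return everything when sampling without replacement and num is at least the element count, otherwise over-sample, retry if too few elements came back, then shuffle and trim to num) can be sketched in plain Scala on an ordinary Seq. This is an illustration under stated assumptions, not Spark's code: TakeSampleSketch, samplePass and poisson are hypothetical names, and the fixed 1.5x over-sampling factor is a stand-in for SamplingUtils.computeFractionForSampleSize, whose formula is not shown in this article.

```scala
import scala.util.Random

// Hypothetical helper object, for illustration only (not part of Spark).
object TakeSampleSketch {
  // Poisson draw (Knuth's method); adequate for the small means used in this sketch.
  private def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) { k += 1; p *= rng.nextDouble() }
    k
  }

  // One sampling pass: Poisson counts with replacement, Bernoulli trials without.
  private def samplePass[T](data: Seq[T], withReplacement: Boolean,
                            fraction: Double, rng: Random): Seq[T] =
    if (withReplacement) data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
    else data.filter(_ => rng.nextDouble() < fraction)

  def takeSample[T](data: Seq[T], withReplacement: Boolean, num: Int, seed: Long): Seq[T] = {
    require(num >= 0, "Negative number of elements requested")
    val rng = new Random(seed)
    if (num == 0 || data.isEmpty) {
      Seq.empty
    } else if (!withReplacement && num >= data.size) {
      // Without replacement there are at most data.size picks: shuffle everything.
      rng.shuffle(data)
    } else {
      // ASSUMPTION: a fixed 1.5x over-sampling factor replaces Spark's
      // computeFractionForSampleSize; it is chosen only to keep the sketch short.
      val raw = 1.5 * num / data.size
      val fraction = if (withReplacement) raw else math.min(1.0, raw)
      var samples = samplePass(data, withReplacement, fraction, rng)
      // Re-sample until at least `num` elements were drawn.
      while (samples.size < num) {
        samples = samplePass(data, withReplacement, fraction, rng)
      }
      // Shuffle, then trim to exactly `num` elements.
      rng.shuffle(samples).take(num)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(2, 3, 7, 4, 8)
    println(takeSample(data, withReplacement = false, num = 10, seed = 1L).size) // 5
    println(takeSample(data, withReplacement = false, num = 3, seed = 1L).size)  // 3
    println(takeSample(data, withReplacement = true, num = 8, seed = 1L).size)   // 8
  }
}
```

Over-sampling first and trimming afterwards is what lets the result have exactly num elements even though each sampling pass returns a random number of elements.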
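- Sampling without replacement, num > number of elements in the RDD: the result has as many elements as the RDD (5 here)
val rdd = sc.parallelize(List(2, 3, 7, 4, 8))
// takeSample returns an Array[Int] on the driver, not an RDD
val samples = rdd.takeSample(false, 10)
samples.foreach(println)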
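- Sampling without replacement, num <= number of elements in the RDD: the result has num elements
val rdd = sc.parallelize(List(2, 3, 7, 4, 8))
// takeSample returns an Array[Int] on the driver, not an RDD
val samples = rdd.takeSample(false, 3)
samples.foreach(println)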
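- Sampling with replacement: the result has num elements
val rdd = sc.parallelize(List(2, 3, 7, 4, 8))
// withReplacement = true; the same element may appear more than once
val samples = rdd.takeSample(true, 3)
samples.foreach(println)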