    public interface Sampler<K,V> {
      K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException;
    }

    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);
      Random r = new Random();
      long seed = r.nextLong();
      r.setSeed(seed);
      LOG.debug("seed: " + seed);
      // shuffle splits
      for (int i = 0; i < splits.length; ++i) {
        InputSplit tmp = splits[i];
        int j = r.nextInt(splits.length);
        splits[i] = splits[j];
        splits[j] = tmp;
      }
      // our target rate is in terms of the maximum number of sample splits,
      // but we accept the possibility of sampling additional splits to hit
      // the target sample keyset
      for (int i = 0; i < splitsToSample ||
          (i < splits.length && samples.size() < numSamples); ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
            Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          if (r.nextDouble() <= freq) {
            if (samples.size() < numSamples) {
              samples.add(key);
            } else {
              // When exceeding the maximum number of samples, replace a
              // random element with this one, then adjust the frequency
              // to reflect the possibility of existing elements being
              // pushed out
              int ind = r.nextInt(numSamples);
              if (ind != numSamples) {
                samples.set(ind, key);
              }
              freq *= (numSamples - 1) / (double) numSamples;
            }
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }

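First, all input splits are obtained through the InputFormat's getSplits method. Next, the number of splits to scan, splitsToSample, is set to the smaller of the total number of input splits and the user-supplied maxSplitsSampled. The split array is then shuffled to randomize its original order. Finally, the splits are scanned one at a time and their records sampled. The loop continues while the number of splits scanned so far is less than splitsToSample, or, once that count has been reached, while unscanned splits remain and the number of samples collected is still less than the maximum numSamples.
Sampling the records within each split works as follows:
For each record read from the split, a random floating-point number is drawn and compared with the sampling frequency freq; if it is greater than freq, the record is skipped. Otherwise, if the current number of samples is below the maximum numSamples, the record's key is added to the sample set. If the sample set is already full, a random index is drawn from [0, numSamples) and the record's key replaces the sample at that index. Because r.nextInt(numSamples) never returns numSamples itself, the ind != numSamples check in the code always passes. The sampling frequency is then reduced to freq * (numSamples - 1) / numSamples. The remaining records in the split are processed the same way.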
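The random sampling procedure described above can be tried out without a Hadoop cluster. The following is a minimal sketch, not the library's code: the class name RandomSamplingSketch and its sample method are hypothetical, each inner list stands in for one input split, the seed is passed in so runs are repeatable, and Collections.shuffle is swapped in for the hand-written swap loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the sampler: plain lists replace InputSplit/RecordReader.
final class RandomSamplingSketch {

    static <K> List<K> sample(List<List<K>> splits, double freq, int numSamples,
                              int maxSplitsSampled, long seed) {
        Random r = new Random(seed);
        // Copy before shuffling so the caller's list is left untouched.
        List<List<K>> shuffled = new ArrayList<>(splits);
        Collections.shuffle(shuffled, r); // assumption: library shuffle instead of the swap loop
        int splitsToSample = Math.min(maxSplitsSampled, shuffled.size());
        List<K> samples = new ArrayList<>(numSamples);

        // Visit at least splitsToSample splits; keep going past that only
        // while splits remain and the sample set is not yet full.
        for (int i = 0; i < shuffled.size()
                && (i < splitsToSample || samples.size() < numSamples); ++i) {
            for (K key : shuffled.get(i)) {
                if (r.nextDouble() <= freq) {
                    if (samples.size() < numSamples) {
                        samples.add(key);
                    } else {
                        // Sample set is full: overwrite a random slot, then lower
                        // freq so that later records are accepted less often.
                        samples.set(r.nextInt(numSamples), key);
                        freq *= (numSamples - 1) / (double) numSamples;
                    }
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e", "f"));
        // freq = 1.0 accepts every record; one split cannot fill 10 slots,
        // so the second split is scanned as well.
        System.out.println(sample(splits, 1.0, 10, 1, 42L).size());
    }
}
```

With freq set to 1.0 and fewer records than numSamples, the sketch scans beyond maxSplitsSampled, which is the "additional splits" behavior mentioned in the source comment.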

    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);
      int splitStep = splits.length / splitsToSample;
      int samplesPerSplit = numSamples / splitsToSample;
      long records = 0;
      for (int i = 0; i < splitsToSample; ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
            job, Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          samples.add(key);
          key = reader.createKey();
          ++records;
          if ((i+1) * samplesPerSplit <= records) {
            break;
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }

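First, the array of input splits is obtained from the InputFormat. The number of splits to sample, splitsToSample, is the smaller of maxSplitsSampled and the total number of input splits. The stride between sampled splits, splitStep, is the integer quotient of the total number of input splits divided by splitsToSample, and the per-split sample quota, samplesPerSplit, is the integer quotient of numSamples divided by splitsToSample. The sampled splits are those at index i * splitStep, and sampling stops once splitsToSample splits have been processed.
Within each split, records are read one at a time and each key is added to the sample set. As soon as the total number of records collected reaches the cumulative quota for the current split, (i + 1) * samplesPerSplit, sampling of that split stops; otherwise it continues until the split has no more records.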
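The stride and quota arithmetic is easier to follow on a small in-memory example. This is a rough sketch under stated assumptions: SplitSamplingSketch and its sample method are hypothetical names, each inner list stands in for one input split, and a guard for empty input has been added that the quoted code does not have.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: takes the first records of evenly spaced splits.
final class SplitSamplingSketch {

    static <K> List<K> sample(List<List<K>> splits, int numSamples, int maxSplitsSampled) {
        List<K> samples = new ArrayList<>();
        int splitsToSample = Math.min(maxSplitsSampled, splits.size());
        if (splitsToSample <= 0) {
            return samples; // added guard: avoids dividing by zero below
        }
        int splitStep = splits.size() / splitsToSample;     // stride between sampled splits
        int samplesPerSplit = numSamples / splitsToSample;  // quota per split
        long records = 0;                                   // running total across splits
        for (int i = 0; i < splitsToSample; ++i) {
            for (K key : splits.get(i * splitStep)) {
                samples.add(key);
                ++records;
                // Stop once the cumulative quota for splits 0..i is reached.
                if ((long) (i + 1) * samplesPerSplit <= records) {
                    break;
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < 6; i++) {
            List<String> split = new ArrayList<>();
            for (int j = 0; j < 10; j++) {
                split.add("s" + i + "-k" + j);
            }
            splits.add(split);
        }
        // 6 splits, 3 sampled (indices 0, 2, 4), 2 keys taken from each.
        System.out.println(sample(splits, 6, 3));
        // prints [s0-k0, s0-k1, s2-k0, s2-k1, s4-k0, s4-k1]
    }
}
```

Because records is a running total, a split that ends before filling its quota leaves room that the next sampled split makes up.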

    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>();
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);
      int splitStep = splits.length / splitsToSample;
      long records = 0;
      long kept = 0;
      for (int i = 0; i < splitsToSample; ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i * splitStep],
            job, Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          ++records;
          if ((double) kept / records < freq) {
            ++kept;
            samples.add(key);
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }

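First, the array of input splits is obtained from the InputFormat. The number of splits to sample, splitsToSample, is the smaller of maxSplitsSampled and the total number of input splits. The stride between sampled splits, splitStep, is the integer quotient of the total number of input splits divided by splitsToSample. The sampled splits are those at index i * splitStep, and sampling stops once splitsToSample splits have been processed.
Within each split, records are read one at a time. If the ratio of samples kept so far to records read so far is less than freq, the record's key is added to the sample set; otherwise the record is skipped. This continues until every record in the split has been read.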
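The kept-to-read ratio test yields roughly evenly spaced samples, which a small example makes visible. This is a minimal sketch, assuming in-memory lists in place of input splits; IntervalSamplingSketch and its sample method are hypothetical names, and the empty-input guard is an addition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in: keeps a record whenever kept/records falls below freq.
final class IntervalSamplingSketch {

    static <K> List<K> sample(List<List<K>> splits, double freq, int maxSplitsSampled) {
        List<K> samples = new ArrayList<>();
        int splitsToSample = Math.min(maxSplitsSampled, splits.size());
        if (splitsToSample <= 0) {
            return samples; // added guard: avoids dividing by zero below
        }
        int splitStep = splits.size() / splitsToSample;
        long records = 0; // records read so far, across all sampled splits
        long kept = 0;    // records kept so far
        for (int i = 0; i < splitsToSample; ++i) {
            for (K key : splits.get(i * splitStep)) {
                ++records;
                // Keep the record only while the running ratio is below freq.
                if ((double) kept / records < freq) {
                    ++kept;
                    samples.add(key);
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add(i);
        }
        // One split holding keys 0..9, sampled at freq = 0.25.
        System.out.println(sample(Collections.singletonList(keys), 0.25, 1));
        // prints [0, 4, 8]
    }
}
```

The first record is always kept, since 0 / 1 is below any positive freq; after that one record is kept about every 1 / freq records.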