Reading the code: RandomSeedGenerator

package org.apache.mahout.clustering.kmeans;
public final class RandomSeedGenerator
Performs the random sampling of the initial cluster centers.

HDFS operations follow a common pattern here: delete the output path first, then create the output file anew.

FileSystem fs = FileSystem.get(output.toUri(), conf);
HadoopUtil.delete(conf, output);
Path outFile = new Path(output, "part-randomSeed");
boolean newFile = fs.createNewFile(outFile);
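
To make the delete-then-create pattern easier to try without a cluster, here is a minimal sketch of the same idea on the local filesystem using java.nio. It is only an analogue, not the Hadoop API: the class name DeleteThenCreateSketch and the method recreate are hypothetical, and creating the parent directory explicitly is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local-filesystem analogue of "delete the output, then create the file".
final class DeleteThenCreateSketch {

  // Removes outputDir recursively if it exists, then creates an empty file inside it.
  static Path recreate(Path outputDir, String fileName) {
    try {
      if (Files.exists(outputDir)) {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(outputDir)) {
          // Reverse order lists children before their parent directories.
          paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
          Files.delete(p);
        }
      }
      // Assumption of this sketch: the parent directory is created explicitly.
      Files.createDirectories(outputDir);
      return Files.createFile(outputDir.resolve(fileName));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling DeleteThenCreateSketch.recreate(dir, "part-randomSeed") twice in a row leaves exactly one empty file in dir, because whatever an earlier run left behind is deleted first.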


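Skeleton for traversing HDFS paths
fs.globStatus(inputPathPattern, PathFilters.logsCRCFilter());
globStatus returns every path that matches the pattern.
logsCRCFilter filters out log entries whose names start with _, hidden files whose names start with a dot, and .crc files.
Inside the loop, directories are skipped and only files are processed.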

if (newFile) {
    Path inputPathPattern;

    if (fs.getFileStatus(input).isDir()) {
        inputPathPattern = new Path(input, "*");
    } else {
        inputPathPattern = input;
    }

    FileStatus[] inputFiles = fs.globStatus(inputPathPattern, PathFilters.logsCRCFilter());
    for (FileStatus fileStatus : inputFiles) {
        if (fileStatus.isDir()) {
            continue;
        }
        //process file
    }
}
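
The filtering rule described above can be tried on plain file names, without Hadoop. The following is a minimal sketch that assumes the rule is exactly "reject names starting with _ or a dot, and names ending in .crc". LogsCrcFilterSketch, accept, and filter are hypothetical names for illustration, not Mahout's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the rule these notes attribute to PathFilters.logsCRCFilter().
final class LogsCrcFilterSketch {

  // Returns true for names that should be kept as input files.
  static boolean accept(String fileName) {
    return !(fileName.startsWith("_")
        || fileName.startsWith(".")
        || fileName.endsWith(".crc"));
  }

  // Keeps only the accepted names, preserving their order.
  static List<String> filter(List<String> fileNames) {
    List<String> kept = new ArrayList<>();
    for (String name : fileNames) {
      if (accept(name)) {
        kept.add(name);
      }
    }
    return kept;
  }
}
```

For a directory listing such as part-00000, _logs, _SUCCESS and .part-00000.crc, only part-00000 passes this filter.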



Initialize the writer
Prepare lists with capacity k to hold the chosen values

SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, outFile, Text.class, Cluster.class);
Random random = RandomUtils.getRandom();
List<Text> chosenTexts = new ArrayList<Text>(k);
List<Cluster> chosenClusters = new ArrayList<Cluster>(k);
int nextClusterId = 0;



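The core of the randomness: reservoir sampling
The first k records fill the lists directly. After that, each new record is picked with probability 1/(currentSize+1) and, if picked, replaces one randomly chosen entry.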

for (Pair<Writable,VectorWritable> record
    : new SequenceFileIterable<Writable,VectorWritable>(fileStatus.getPath(), true, conf)) {
    Writable key = record.getFirst();
    VectorWritable value = record.getSecond();
    Cluster newCluster = new Cluster(value.get(), nextClusterId++, measure);
    newCluster.observe(value.get(), 1);
    Text newText = new Text(key.toString());
    int currentSize = chosenTexts.size();
    if (currentSize < k) {
        chosenTexts.add(newText);
        chosenClusters.add(newCluster);
    } else if (random.nextInt(currentSize + 1) == 0) { // with chance 1/(currentSize+1) pick new element
        int indexToRemove = random.nextInt(currentSize); // evict one chosen randomly
        chosenTexts.remove(indexToRemove);
        chosenClusters.remove(indexToRemove);
        chosenTexts.add(newText);
        chosenClusters.add(newCluster);
    }
}
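
For comparison, here is a minimal sketch of the textbook form of reservoir sampling (often called Algorithm R). It is not the Mahout code. In the loop quoted above, currentSize stays at k once the lists are full, so the replacement chance is fixed at 1/(k+1). The textbook form instead accepts the n-th record with probability k/n, which is what gives every record the same chance of ending up in the sample. ReservoirSketch and sample are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Textbook reservoir sampling (Algorithm R), shown only for comparison.
final class ReservoirSketch {

  // Returns up to k items drawn uniformly at random from a stream of unknown length.
  static <T> List<T> sample(Iterable<T> stream, int k, Random random) {
    List<T> reservoir = new ArrayList<>(k);
    int seen = 0; // number of items read so far
    for (T item : stream) {
      seen++;
      if (reservoir.size() < k) {
        // The first k items fill the reservoir directly.
        reservoir.add(item);
      } else {
        // j is uniform in [0, seen); it falls below k with probability k/seen.
        int j = random.nextInt(seen);
        if (j < k) {
          reservoir.set(j, item); // overwrite the slot that j selects
        }
      }
    }
    return reservoir;
  }
}
```

A call such as ReservoirSketch.sample(records, k, new Random()) reads the input once and never holds more than k items, which is the same memory profile as the loop above.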