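Randomly sampling N lines from a file of M lines (assume M >= N) is a very common need when you have to subsample data.
The first approach that comes to mind: for each line read, draw a random number in [0, M); if it is less than N, output the line, otherwise skip it. Each line is therefore kept with probability N/M. The Perl source is as follows: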
#!/usr/bin/perl
# subset.pl
# Usage: subset.pl file sample_num
use strict;
use warnings;
if (@ARGV != 2) {
    die "Usage: $0 sample_file sample_num\n";
}
my ($sample_file, $sample_num) = @ARGV;
open my $fh, '<', $sample_file or die "Cannot open $sample_file to read: $!";
# Count lines with wc; reading from stdin makes wc print only the number.
my ($all_num) = `wc -l < "$sample_file"` =~ /(\d+)/;
while (<$fh>) {
    # Keep each line independently with probability sample_num / all_num.
    if (rand($all_num) < $sample_num) {
        print;
    }
}
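The drawback of this program is that the number of output lines is only approximately N, because each line is kept or dropped independently at random.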
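To make "approximately N" concrete: if each of the M lines is kept independently with probability N/M (assuming an ideal uniform rand), the number of output lines follows a binomial distribution. A short derivation, where X is a symbol introduced here for the output line count:

```latex
X \sim \mathrm{Binomial}\!\left(M, \tfrac{N}{M}\right), \qquad
\mathbb{E}[X] = M \cdot \frac{N}{M} = N, \qquad
\mathrm{Var}(X) = M \cdot \frac{N}{M}\left(1 - \frac{N}{M}\right) = N\left(1 - \frac{N}{M}\right)
```

For example, with M = 10000 and N = 100 the standard deviation is sqrt(99), about 9.95, so a run that prints roughly 90 or 110 lines instead of 100 is entirely normal.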
The version below outputs exactly N lines:
#!/usr/bin/perl
# fixed_num_subset.pl
# Usage: fixed_num_subset.pl file sample_num
use strict;
use warnings;
if (@ARGV != 2) {
    die "Usage: $0 sample_file sample_num\n";
}
my ($sample_file, $sample_num) = @ARGV;
open my $fh, '<', $sample_file or die "Cannot open $sample_file to read: $!";
# Count lines with wc; reading from stdin makes wc print only the number.
my ($all_num) = `wc -l < "$sample_file"` =~ /(\d+)/;
my %labels;
my $k = $sample_num;    # lines still to be selected
for (my $i = 0; $i < $all_num; $i++) {
    # Select line $i with probability k / (lines remaining).
    if (rand($all_num - $i) < $k) {
        $labels{$i} = 1;
        $k--;
    }
}
my $i = 0;
while (<$fh>) {
    # Print only the lines whose index was selected above.
    if (exists $labels{$i}) {
        print;
    }
    $i++;
}
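The selection loop can also be tried on its own, without a file. Below is a minimal sketch, assuming a hypothetical helper `pick_indices($m, $n)` (a name introduced here, not part of the scripts above) that returns the chosen line indices. It should always return exactly n indices when n <= m: once the number of remaining lines equals k, the test `rand(remaining) < k` always succeeds, so every remaining line is taken; and once k reaches 0, the test can never succeed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: choose exactly $n distinct indices from 0 .. $m-1,
# returned in increasing order, using the same test as the script above.
sub pick_indices {
    my ($m, $n) = @_;
    my @picked;
    my $k = $n;    # indices still to be chosen
    for my $i (0 .. $m - 1) {
        # rand($m - $i) is uniform in [0, $m - $i), so this test
        # succeeds with probability $k / ($m - $i).
        if (rand($m - $i) < $k) {
            push @picked, $i;
            $k--;
        }
    }
    return @picked;
}

# Example: pick 10 line indices out of 1000.
my @sample = pick_indices(1000, 10);
print scalar(@sample), " indices: @sample\n";
```

Because the indices come out already sorted, the file-reading loop could walk through this list in order instead of looking each line number up in a hash; whether that is worth the change depends on how large N is.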