Tuesday, October 27, 2009

Random lines in a text file

Sometimes I need to sample large data sets by randomly selecting some lines from a file. My files are generally text records, one record per line, so I wrote this small script to do the task:

#!/usr/bin/perl -w
use strict;

=head1 NAME

selectRandomLines.pl

=head1 DESCRIPTION

Select random lines in a file.

=cut

@ARGV == 3 or die "Usage: selectRandomLines.pl TOTAL_LINES_IN_FILE NUM_LINES_WANTED FILE_NAME\n";

my $total = shift @ARGV; # Total lines in the file
my $want = shift @ARGV; # Total lines to select
my $file = shift @ARGV; # The file
my $line = 0; # Line counter
my %select = randomSelect($total, $want); # Hash with selected lines

open my $fh, '<', $file or die "cannot open $file: $!\n";
while (<$fh>) { print if exists $select{$line++}; }
close $fh;

=head1 SUBROUTINES

randomSelect()

CALL: randomSelect(TOTAL_ELEMENTS, TOTAL_WANTED) [NUM, NUM]

RETURN: %s [HASH]

=cut

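sub randomSelect {
    my ($t, $n) = @_;
    # Asking for more lines than the file has would loop forever
    die "cannot select $n lines from a file with $t lines\n" if $n > $t;
    my %s = ();
    # Draw 0-based line numbers until $n distinct ones are collected
    while (scalar(keys %s) < $n) {
        my $v = int(rand $t);
        $s{$v} = 1;
    }
    return %s;
}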

=head1 AUTHOR

JC @ 2009

=head1 LICENSE

This is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with code. If not, see <http://www.gnu.org/licenses/>.

=cut
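
The script above needs the total number of lines as its first argument, so the file has to be counted beforehand. When that count is not at hand, one alternative is reservoir sampling, which picks the lines in a single pass without knowing the file length. This is a different technique from the one used in the script, shown only as a minimal sketch: reservoir_sample() is a hypothetical helper name, and the in-memory "file" of 100 records is made up for the demo. Note that, unlike the script, the selected lines do not necessarily come out in file order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reservoir sampling: keep $k lines chosen uniformly at random from a
# filehandle of unknown length, reading it only once.
# reservoir_sample() is a hypothetical helper, not part of the script above.
sub reservoir_sample {
    my ($fh, $k) = @_;
    my @reservoir;
    my $seen = 0;                       # lines read so far
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            push @reservoir, $line;     # the first $k lines fill the reservoir
        }
        else {
            # Keep the new line with probability $k / $seen,
            # overwriting a randomly chosen slot
            my $j = int(rand $seen);
            $reservoir[$j] = $line if $j < $k;
        }
    }
    return @reservoir;
}

# Demo on an in-memory "file" of 100 records (made-up data)
my $data = join '', map { "record $_\n" } 1 .. 100;
open my $fh, '<', \$data or die "cannot open in-memory file: $!\n";
my @sample = reservoir_sample($fh, 5);
close $fh;
print @sample;
```

To use it on a real file, open the file with a three-argument open and pass that filehandle instead of the in-memory one. Memory use grows with the number of lines wanted, not with the size of the file.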
