Tuesday, October 27, 2009

Random lines in a text file

Sometimes I need to sample large data sets, so I randomly select some lines from the file. My files are generally text records, one record per line, so I wrote this small script to do the task:

#!/usr/bin/perl -w
use strict;

=head1 NAME

selectRandomLines.pl

=head1 DESCRIPTION

Select random lines in a file.

=cut

@ARGV == 3 or die "Usage: selectRandomLines.pl TOTAL_LINES_IN_FILE NUM_LINES_WANTED FILE_NAME\n";

my $total = shift @ARGV; # Total lines in the file
my $want = shift @ARGV; # Total lines to select
my $file = shift @ARGV; # The file
my $line = 0; # Line counter
my %select = randomSelect($total, $want); # Hash with selected lines

open my $fh, '<', $file or die "cannot open $file: $!\n";
while (<$fh>) { print if defined $select{$line++}; } # print only the selected lines
close $fh;

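sub randomSelect {
    my ($t, $n) = @_;
    die "cannot select $n lines from a file with $t lines\n" if $n > $t; # otherwise the loop never ends
    my %s = ();
    for (my $i = 0; $i < $n; $i++) {
        my $v = int(rand $t); # random line index in 0 .. $t-1
        (defined $s{$v}) ? $i-- : $s{$v}++; # retry when the line was already picked
    }
    return %s;
}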

=head1 AUTHOR

JC @ 2009

=head1 LICENSE

This is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this code. If not, see <http://www.gnu.org/licenses/>.
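
The script above needs the total number of lines up front. When that count is not known, one alternative is reservoir sampling, which picks the lines in a single pass. This is only a minimal sketch of that different technique, not the script from this post: the name reservoir_sample and the ten in-memory demo records are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reservoir sampling (Algorithm R): keep $want lines chosen uniformly at
# random from a stream whose length is not known in advance.
sub reservoir_sample {
    my ($fh, $want) = @_;
    my @reservoir;
    my $seen = 0;                              # lines read so far
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $want) {
            push @reservoir, $line;            # fill the reservoir first
        } else {
            my $j = int(rand $seen);           # random index in 0 .. $seen-1
            $reservoir[$j] = $line if $j < $want;  # replace with probability $want/$seen
        }
    }
    return @reservoir;
}

# Demo on an in-memory file handle holding ten hypothetical records
my $data = join '', map { "record $_\n" } 1 .. 10;
open my $fh, '<', \$data or die "cannot open in-memory file: $!\n";
my @sample = reservoir_sample($fh, 3);
close $fh;
print @sample;
```

Unlike the script above, this version does not print the lines in file order, and it returns fewer lines than requested when the input is shorter than the requested sample.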


Monday, October 12, 2009

Error in DBD::SQLite

I was writing code to query a SQLite database using Perl's DBI module. I had a small script like this:
#!/usr/bin/perl -w
use strict;
use DBI;
my $sql = "SELECT * FROM my_table"; # placeholder query, my_table is a hypothetical table
my $dbh = DBI->connect("dbi:SQLite:dbname=db_file", "", "") or die "error in connection\n";
my $sth = $dbh->prepare($sql) or die "cannot prepare SQL\n";
$sth->execute() or die "cannot execute SQL\n";
while (my @data = $sth->fetchrow_array()) {
    # process data
}
But every time I ran this script I got this warning:
closing dbh with active statement handles .... blah, blah, blah
I double-checked the code and the documentation and ran some debugging examples. I got the expected results, so why this message? Finally, I found the solution on the PerlMonks website: the problem is a bad status returned by the module when the database handle is closed, so the simple solution is to "undef $sth" at the end:
undef $sth;
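
To show where the workaround sits in a complete run, here is a minimal, self-contained sketch. It assumes DBD::SQLite is installed; the in-memory database and the table t are placeholders, not the database from this post. It also calls $sth->finish, the standard DBI method for marking a statement handle as done, before the undef.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# The in-memory database and the table "t" are placeholders for this sketch
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 });
$dbh->do("CREATE TABLE t (id INTEGER, name TEXT)");
$dbh->do("INSERT INTO t VALUES (?, ?)", undef, $_, "row$_") for 1 .. 3;

my $sth = $dbh->prepare("SELECT id, name FROM t ORDER BY id");
$sth->execute();
my @rows;
while (my @data = $sth->fetchrow_array()) {
    push @rows, join(",", @data);   # process data
}
$sth->finish;      # tell DBI that no more rows will be fetched from this handle
undef $sth;        # the workaround from the post: drop the handle before disconnecting
$dbh->disconnect;

print "$_\n" for @rows;
```

Whether finish alone is enough to silence the warning may depend on the DBD::SQLite version, so the sketch keeps both calls.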

Thursday, October 8, 2009

Reading the DNA

Current DNA sequencing technologies are based on sequencing-by-synthesis or similar techniques, where enzymatic reactions add one or more labeled nucleotide bases. But IBM is working on another way to determine the composition of a DNA sequence, using nanotechnology and electronics to "read" it base by base. No enzymes means a faster and cheaper process. We're closer to the $1,000 genome ...

Source: http://www.engadget.com/2009/10/08/ibms-ultra-cheap-dna-transistor-dream-could-lead-to-personalize/