Tuesday, November 25, 2008

Human annotation of sequences

The "Metagenome Annotation Using a Distributed Grid of Undergraduate Students" is a nice article in PLoS Biology, it remind me than the annotation of sequences is a common problem and hard to implement automatically by programs and a human can "decide" better than a machine (what about AI for sequence annotation?).

I like the article tittle using "distributed grid" terms, maybe it is also considered as "heterogeneous nodes". LOL

So, the strategy is to mix some students, computers with internet access, sequences and a control version system (validated by a supervisor), the work flow is:

Step 1: screen a sequence

Step 2. Validate the annotation

And the best part you can resolve 2 problems in one hit: 1. annotate your sequences, 2. teach the students how to use bioinformatic tools and annotation.

Some years ago, I participate in a similar version of "annotathon", more simplistic because was based in the sequence homology in DBs, but it was very fun.

Monday, November 24, 2008

BASIC Stamp supercomputer

The BASIC Stamp supercomputer is a cool project of humanoido, the system is a small portable cluster with batteries, he remarks the advantages:
  • Smaller
  • Lighter
  • Portable
  • Field operable
  • Runs on batteries
  • Has the greatest number of (I/O)
  • Has the greates number of sensors/variety
  • Lower power consumption
  • Lower unit cost
  • Easy to program

The nodes are 12 Parallax Basic Stamp microcontrollers, many wires and some LCD screens. I love the motivations:
  • Learning experiences & challenges
  • Expanding education & knowledge
  • Gaining useful background for career
  • Research Benefits
  • Extending Basic Stamp power
  • Creating new inventions, ideas, applications
  • Own your own, prestige
  • School project, credit
  • Involvement, sense of great accomplishment
  • Psychological relaxation, Symbolic Value
  • Sharing, making new friends
And the video is the coolest part (great sound!):

Thursday, November 20, 2008

Perl universe

I know it!

Original in XKCD: http://xkcd.com/224/
What's your favorite programmer computer cartoon?: http://stackoverflow.com/questions/84556/whats-your-favorite-programmer-cartoon

Wednesday, November 19, 2008

Perl recursion for oligo creation

#!/usr/bin/perl -w
use strict;

=head1 NAME



Perl script to generate all possible combinations of size k
using an alphabet @a, we use function recursion.

My intention is to create all possible oligonucleotides (DNA
alphabet or ACGT) but can be extended to any other field
using a different alphabet.

Output also can be printed in other forms, you can put other
delimiters in the push function or in the final array printed.


my ($k) = 4; # definition of the word size
my @a = qw/A C G T/; # definition of the alphabet
my @words = createWords($k, @a);# main function
print join("\n", @words), "\n"; # print output

sub createWords {
my $k = shift @_; $k--;
my @old = @_;
my @new = ();
if ($k < 1) {
return @old;
else {
foreach my $e (@old) {
foreach my $n (@a) {
push @new, "$e$n"; # add new element
createWords($k, @new); # recursion call

=head1 AUTHOR

Juan Caballero @ 2008

=head1 CONTACT

linxe __a__ glib.org.mx

=head1 LICENSE

Perl Artistic License v2.0


Tuesday, November 18, 2008


This is an environment like "Minority Report", some actions are just visually stunning but video editing and 3D molding are fantastic. The gloves looks very weird.

g-speak overview 1828121108 from john underkoffler on Vimeo.

Monday, November 17, 2008

I'm a PC ... but I use Linux

Last commercials from Windows show us a "real PC guy", because the "I'm a Mac" from Apple, this Spanish-English video show the PC but with Linux.

Personally, I'm a complex mix of Linux/Windows/Mac/Solaris.

Wednesday, November 12, 2008

The Cray Returns

This week the new fastest and most powerful computer of the world is the NCCS's Jaguar, a big cluster build by Cray and AMD, with a 1.64 petaflops of peak capacity. This means the return of the king of super-computers, Cray Inc. who was the reference for many years of the most powerful systems for computing.

Now, with AMD build a cluster system (with Linux of course) and reclaim the Top 1.

The cabinets also looks cool:

Monday, November 10, 2008

The Matrix runs on Windows XP

From: http://www.collegehumor.com/video:1886349

Thursday, November 6, 2008

Slower with age ... 3rd. part

Again Phoronix had tested the big U, now the rival is MacOS X. They used a MacMini and installed Ubuntu 32b and 64b with BootCamp.

Ok, some rounds Ubuntu won, like this:

but others MacOS X was the best:

Just one comment, remember than MacOS X is still in 32b, some parts are migrated to 64b but the kernel and base system is 32b native. Should MacOS X in 64b native be a big difference? Only the Snow Leopard knows and I predict: Yes.

Tuesday, November 4, 2008

Simple assembly

Last week my good friend M asks me to create a script to draw some sequences, the main problem is to visually see differences in transcript orientation (sense and antisense), some time ago I created similar tools for mapping short sequences like 454 pyro or siRNAs. Now the big problem is to have a good assembly, many regions can extend a "contig" to the left or the right, so internal coordinates change every time you add a new element, with short sequences you have a target well defined and the extension is relative to it.

After thinking a little, I decide not extend my code to the assembly (yes, I'm lazy) and use Phrap to perform the job. A parser read the output and extract the alignment information.

Second problem was the fasta naming convention, the original fasta file is a mixed names of other sequences, so I decided to create and tag an unique ID, "seq_##".

Extending the problem, M asks me to be possible to add more sequences, based in the sequence homology, or blast output, then I added a small option to blast against another fasta file and add the matches to the assembly sequences.

Lot of work, require GD for graphics (with 1 base = 1 pixel) and executables of phrap/blast, the output are the assemblies with reports in GFF format and pretty images in PNG.

An example:

This is the code:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

=head1 NAME

clusterSeq.pl FASTA


Script to simply assembly, the FASTA input is re-named for unique IDs,
prepared for blast and search blastn against itself, the blast output
is parsed and "contigs" are assembled and a GFF3 file is created.


my ($help); # Help flag
my ($f); # principal Fasta file
my ($a); # secondary fasta file to add
my ($plot); # Plot flag
my ($gff); # Print GFF files flag
my ($phrapdir) = '.'; # where is Phrap ?
my ($blastdir) = '.'; # where is Blast ?
my ($blastprog) = 'megablast'; # What program (megablast | blastall -p blastn)
my ($blastopt) = '-e 0.000001 -D 3 -a 2'; # some Blast options

die "Cannot find Phrap exec !!!\n" unless(-e "$phrapdir/phrap");

usage() if(!GetOptions(
'add|a:s' =>\$a,
usage() if(defined $help);
usage() unless(defined $f and -e $f);

if (defined $a) {
print "Preparinf files in $a to be blasted\n";
die "Cannot find Blast binary !!!\n" unless (-e "$blastdir/$blastprog");
die "Cannot find formatdb binary !!!\n" unless (-e "$blastdir/formatdb");
system ("$blastdir/formatdb -i $a -p F -o F");

print "Blasting $f in $a db\n";
system ("$blastdir/$blastprog -i $f -d $a $blastopt -o blast.out");

print "Adding new sequences to $f";
my %addme = ();
open B, "blast.out" or die "cannot open blast.out\n";
while (<B>) {
next if(/#/);
my ($s, $h, @rest) = split (/\t/, $_);
$h =~ s/lcl\|//;
close B;
local $/ = '>';
open A, "$a" or die "cannot open $a\n";
open F, ">>$f" or die "cannot open $f\n";
while (<A>) {
foreach my $m (keys %addme) {
if ($_ =~ /$m/) {
print F ">$_";
undef $addme{$m};
close A;
close F;

open F, "$f" or die "cannot open $f\n";
open T, "> $f.tmp" or die "cannot open $f.tmp\n";
my %ids = ();
my ($id) = 0;
while (<F>) {
if (/>/) {
s/>/>seq_$id | /;
$ids{"seq_$id"}{'name'} = $_;
else {
$ids{"seq_$id"}{'len'} += length $_;
print T "$_\n";
close F;
close T;

system ("$phrapdir/phrap $f.tmp > $f.phrap_out");

open P, "$f.phrap_out" or die "cannot open $f.phrap_out\n";
my ($active) = 0;
my ($contig) = '';
my %contigs = ();
my %contigs_len = ();
while (<P>) {
if (/^Contig\s+(\d+)\.\s+(\d+)\s+reads;\s+(\d+)/) {
if ($2 > 1) { # More than one read by contig
$contig = "contig_$1";
$contigs_len{$contig} = $3;

elsif ($active >= 1) {
if (/^\s+(\d+)\s+(\d+)\s+(seq_\d+)/) { # Capture forward sequeces
$contigs{$contig}{$3}{'ini'} = $1;
$contigs{$contig}{$3}{'end'} = $2;
$contigs{$contig}{$3}{'dir'} = '+';
#print "$contig $1 $2 $3 +\n";
elsif (/^C\s+(\d+)\s+(\d+)\s+(seq_\d+)/) { # Capture reverse sequences
$contigs{$contig}{$3}{'ini'} = $1;
$contigs{$contig}{$3}{'end'} = $2;
$contigs{$contig}{$3}{'dir'} = '-';
#print "$contig $1 $2 $3 -\n";
else {
$active = 0;
close P;

createGFF() if(defined $gff);
createPNG() if(defined $plot);


=head2 createGFF

Read the sequence info and print a GFF v3.0 file for each contig


sub createGFF {
print "Creating GFF files ...\n";
foreach my $c (keys %contigs) {
open G, ">$c.gff" or die "cannot write $c.gff\n";
my $l = $contigs_len{$c};
print G "##gff-version 3\n##sequence-region $c 1 $l\n";
#print "contig=$c len=$l\n";
foreach my $s (keys %{ $contigs{$c} }) {
my $i = $contigs{$c}{$s}{'ini'};
my $e = $contigs{$c}{$s}{'end'};
my $d = $contigs{$c}{$s}{'dir'};
my $n = $ids{$s}{'name'};
#print "contig=$c seq=$s ini=$i end=$e dir=$d name=$n\n";
print G "$c\tassembly\tblock\t$i\t$e\t.\t$d\t.\t$n\n";
close G;

=head2 createPNG

Read the sequence info and print a PNG map file for each contig


sub createPNG {
use GD;
print "Creating PNG images ...\n";

# define color schemes
my $col_f = 'silver'; # String +
my $col_r = 'gold'; # String -
my $col_t = 'black'; # Text color
my $col_l = 'navy'; # Lines
my $col_b = 'red'; # Bold

foreach my $c (keys %contigs) {
open G, ">$c.png" or die "cannot write $c.png\n";
my $l = $contigs_len{$c};
# Number of sequences to plot
my $nseq = 2; # Base sequence (contig) and the ruler
foreach my $s (keys %{ $contigs{$c} }) { $nseq++; }

# Create image
my $im = new GD::Image($l + 20, $nseq * 20);
my $x = 10;
my $y = 5;

# Allocate some colors
my %colors= ();
$colors{'white' } = $im->colorAllocate(255,255,255);
$colors{'black' } = $im->colorAllocate( 0, 0, 0);
$colors{'red' } = $im->colorAllocate(255, 0, 0);
$colors{'blue' } = $im->colorAllocate( 0, 0,255);
$colors{'grey' } = $im->colorAllocate(190,190,190);
$colors{'silver' } = $im->colorAllocate(191,191,191);
$colors{'maroon' } = $im->colorAllocate(128, 0, 0);
$colors{'purple' } = $im->colorAllocate(128, 0,128);
$colors{'green' } = $im->colorAllocate( 0,128, 0);
$colors{'lime' } = $im->colorAllocate( 0,255, 0);
$colors{'olive' } = $im->colorAllocate(107,142, 35);
$colors{'yellow' } = $im->colorAllocate(255,255, 0);
$colors{'gold' } = $im->colorAllocate(255,215, 0);
$colors{'orange' } = $im->colorAllocate(255,127, 0);
$colors{'navy' } = $im->colorAllocate( 0, 0,128);
$colors{'teal' } = $im->colorAllocate( 0,128,128);
$colors{'aqua' } = $im->colorAllocate( 0,255,255);

# Print the ruler
$im->line($x, $y, $x + $l, $y, $colors{$col_l});
for (my $i = $x; $i <= $x + $l ; $i += 100) { $im->line($i, $y - 3, $i, $y + 3, $colors{$col_l});
$im->string(gdTinyFont, $i + 1, $y, $i - 10, $colors{$col_b});
$y += 20;

# Print the contig
$im->filledRectangle($x, $y, $x + $l, $y + 12, $colors{$col_f});
$im->string(gdTinyFont, $x+5, $y, "$c($l)", $colors{$col_t});
my $arrow = new GD::Polygon;
$arrow->addPt($x + $l, $y - 3);
$arrow->addPt($x + $l + 5, $y + 6);
$arrow->addPt($x + $l, $y + 15);
$arrow->addPt($x + $l, $y - 3);
$im->filledPolygon($arrow, $colors{$col_f});

# Add each sequence
foreach my $s (keys %{ $contigs{$c} }) {
$y += 20;
my $i = $contigs{$c}{$s}{'ini'};
my $e = $contigs{$c}{$s}{'end'};
my $d = $contigs{$c}{$s}{'dir'};
my $n = $ids{$s}{'name'}; $n =~ s/>//g;

# define the color
my $col = $col_f;
$col = $col_r if ($d eq '-');

# Main block
$im->filledRectangle($i, $y, $e, $y + 12, $colors{$col});

# Create the arrow tip
my $arw = new GD::Polygon;
if ($d eq '+') { # to the left
$arw->addPt($e, $y - 3);
$arw->addPt($e + 5, $y + 6);
$arw->addPt($e, $y + 15);
$arw->addPt($e, $y - 3);
else { # to the right
$arw->addPt($i, $y - 3);
$arw->addPt($i - 5, $y + 6);
$arw->addPt($i, $y + 15);
$arw->addPt($i, $y - 3);
$im->filledPolygon($arw, $colors{$col});

# add the name
my $flen = $ids{$s}{'len'};
my $rmat = $e - $i + 1;
$im->string(gdTinyFont, $i+5, $y, "$n($rmat/$flen)", $colors{$col_t});

# make sure we are writing to a binary stream
binmode G;
# Convert the image to PNG and print it
print G $im->png;
close G;

=head2 usage

Print help if no values or aks for help


sub usage {
print <<__HELP__
Usage: clusterSeq.pl -f FASTA [-plot] [-gff] [-add FASTA]

-help | -h Print this screen
-file | -f Fasta file to use
-add | -a Add the fasta file with blast hits
-plot | -p Create PNG images for each contig
-gff | -g Create GFF3 files for each contig
exit 1;

=head1 AUTHOR

Juan Caballero

=head1 CONTACT

linxe __a__ glib.org.mx

=head1 LICENSE

Perl Artistic License v2.0