Tuesday, October 28, 2008

BioPython

Last day, I need to download some sequences from NCBI GenBank, I have a list of ids, typically I used BioPerl to connect and get the fasta sequence of each one, with a code like this:

#!/usr/bin/perl -w
use strict;
use Bio::DB::GenBank;
my $gb = new Bio::DB::GenBank;
open F, "gene_list" or die "cannot open genes_list\n";
while (<F>) {
chomp;
my $seq = $gb->get_Seq_by_id($_);
print $seq->seq;
}


Because Broadcast with Mac OS X 10.5.X cannot compile the BioPerl modules (I found many problems when you try to compile from source code, because many dependencies are broken). So, I take a look to BioPython, install it (with some warnings and missing optional packages), and use the next script:

#!/usr/bin/python
from Bio import Entrez
Entrez.email = "mymail@something.org" # Always tell NCBI who you are
f = open("genes_list", "r")
while True:
myid = f.readline()
if not myid: break
handle = Entrez.efetch(db="nucleotide", id=myid, rettype="fasta")
print handle.read()

and this works well ...

I know this is not so much efficient because require a call for each id, but some times is better when you are downloading big sequences or so many.

My two recommendations:
1. Dominate one computer language but be familiar with others. Better if you dominate more than one.
2. Don't waste time when you know there are more than one solution, if one fails, try next option.

No comments:

Post a Comment