Friday, May 30, 2008

GBK parser

The other day, le Jacou asked me about the GenBank format because he is writing a parser for it, to integrate into his super-shiny-colorful-visual man-see annotation interface. He needs the parser to split a file by sequence, with each sequence going into a separate HTML file. Later he showed me his code in C#, and after thinking about how it could be translated into Perl, I suggested:


#!/usr/bin/perl -w

# parseGBK2HTML.pl
# Juan Caballero @ 2008

$ARGV[0] or die "Usage: parseGBK2HTML.pl <genbank_file>\n";

use CGI qw(:standard);
use Bio::SeqIO;    # BioPerl sequence I/O; module names are case-sensitive

my $file_in = shift @ARGV;
my ($seq_in, $seq_out, $file_out, $seq, $features);

$seq_in = Bio::SeqIO->new(-format=>'genbank', -file=>$file_in);
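# Write one HTML file per sequence record, named after its accession number
while ($seq = $seq_in->next_seq()) {
    $file_out = $seq->accession_number . '.html';
    open my $out, '>', $file_out or die "Cannot open $file_out: $!";
    # No CGI header here: an HTTP header makes no sense inside a static file
    print {$out} start_html($file_out);
    # get_SeqFeatures returns a list of feature objects, so print each one
    # (gff_string is just one ready-made text rendering of a feature)
    for my $feature ($seq->get_SeqFeatures) {
        print {$out} p($feature->gff_string);
    }
    print {$out} end_html;
    close $out;
}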

Good: less code, easy to understand; the parsing is handled by Bio::SeqIO (BioPerl) and the HTML output by CGI.
Bad: it requires 2 modules, the modules may slow the routine down, and you cannot directly control the output format of what comes out of get_SeqFeatures (although you can build your own by calling the feature objects inside $seq). It also still needs a few eval blocks to catch errors.
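
To make the "build your own" and "eval" points a bit more concrete, here is a rough sketch, not a finished tool. The names escape_html, feature_row and gbk_to_html are made up for this example, and the table layout is only an assumption about what the interface might want. The calls on the feature objects (primary_tag, start, end, get_all_tags, get_tag_values) are the standard Bio::SeqFeatureI methods.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: turn one feature object into an HTML table row.
# Only Bio::SeqFeatureI methods are called on $feat.
sub feature_row {
    my ($feat) = @_;
    my @tags = $feat->get_all_tags;
    my @qualifiers;
    for my $tag (sort @tags) {
        push @qualifiers, "$tag=" . join(',', $feat->get_tag_values($tag));
    }
    my @cells = ($feat->primary_tag, $feat->start, $feat->end,
                 join('; ', @qualifiers));
    return '<tr>'
         . join('', map { '<td>' . escape_html($_) . '</td>' } @cells)
         . "</tr>\n";
}

# Hypothetical driver: one HTML file per record, with eval around the parser.
sub gbk_to_html {
    my ($file_in) = @_;
    require Bio::SeqIO;    # loaded here so the helpers above work without BioPerl
    my $seq_in = Bio::SeqIO->new(-format => 'genbank', -file => $file_in);
    while (1) {
        # eval traps exceptions thrown by the parser on a malformed record
        my $seq = eval { $seq_in->next_seq };
        if ($@) {
            warn "Stopped at a record that could not be parsed: $@";
            last;
        }
        last unless $seq;
        my $file_out = $seq->accession_number . '.html';
        open my $out, '>', $file_out or die "Cannot open $file_out: $!";
        print {$out} "<html><body><table>\n";
        print {$out} feature_row($_) for $seq->get_SeqFeatures;
        print {$out} "</table></body></html>\n";
        close $out;
    }
}

gbk_to_html($ARGV[0]) if @ARGV;
```

Here the sketch simply stops at the first record that fails to parse. Whether to stop or to try skipping ahead is a design choice that depends on how broken your input files usually are.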


That's why I love Perl.
