Wednesday, September 12, 2012

Programming challenge - synthetic whole genome VCF

I found a nice programming challenge on BioStar: produce a VCF file listing every possible alternative allele at each position of a complete genome sequence (why anyone would need such a file is a mystery to me). Anyway, I and many others posted solutions in C, Python, Perl and even AWK. As expected, the C solution is the fastest (but also the longest code); surprisingly, Python is really close in speed and really compact. My Perl wasn't bad, but it is still a little slow.

Here is my final code after reducing the initial solution:

print join "\t",'#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO';
print "\n";
%a = ('A'=>'C,G,T', 'C'=>'A,G,T', 'G'=>'A,C,T', 'T'=>'A,C,G');
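while (<>) {
  chomp;
  if (m/>(.+)/) { $chr = $1; $i = 0; }   # FASTA header: new sequence, restart the position counter
  else {
    @a = split(//, uc $_);
    foreach $b (@a) {
      $i++;                              # VCF positions are 1-based; count every base, including N
      if ($a{$b}) {                      # only A, C, G and T produce a record
        print join "\t", $chr, $i, '.', $b, $a{$b}, 100, 'PASS', 'DP=100';
        print "\n";
      }
    }
  }
}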
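
To check the logic on a toy input, the same idea can be wrapped in a function that takes FASTA text as a string and returns the VCF lines. This is only a minimal sketch: the name fasta_to_vcf and the chrTest record are made up for illustration, and it assumes that positions are 1-based (as in VCF) and that every base, including N, advances the position even when no record is printed for it.

```perl
use strict;
use warnings;

# For each reference base, the three possible alternative alleles.
my %alt = (A => 'C,G,T', C => 'A,G,T', G => 'A,C,T', T => 'A,C,G');

# Hypothetical helper (not part of the challenge): FASTA text in, VCF text out.
sub fasta_to_vcf {
    my ($fasta) = @_;
    my @out = (join "\t", '#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO');
    my ($chr, $pos) = ('', 0);
    for my $line (split /\n/, $fasta) {
        if ($line =~ /^>(.+)/) {        # header line: new sequence, reset position
            ($chr, $pos) = ($1, 0);
            next;
        }
        for my $base (split //, uc $line) {
            $pos++;                      # assumption: 1-based, counted for every base
            next unless $alt{$base};     # N and other symbols get no record
            push @out, join "\t", $chr, $pos, '.', $base, $alt{$base}, 100, 'PASS', 'DP=100';
        }
    }
    return join("\n", @out) . "\n";
}

# Made-up FASTA record split over two lines, with a lowercase n in the middle.
print fasta_to_vcf(">chrTest\nACn\nGT\n");
```

For this input the sketch prints the header plus four records at positions 1, 2, 4 and 5; position 3 is the N, so it is counted but skipped.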
