Wednesday, October 15, 2008

Perl BioGolf

Do you know what is a Perl Golf problem? It's a general problem formulated and you try to resolve with a minimal number of characters in a perl script, who writes less win. Some times is a good habit to see, admire and think in this beautiful pearls. Commonly there are a lot in the Perl Monks website.

Today I was looking for a more simple and effective subroutine to translate a DNA/RNA sequence into the corresponding peptide version using the typical genetic code, I used the typical solution with a hash storing the code and call the sequence in block with substr or pop/shift.

I found this solutions in a Perl Golf challenge:


# Typical solution hashing the codes:
sub f0 { #by tadman
my %g = (
# . - Stop
'UAA'=>'.','UAG'=>'.','UGA'=>'.',
# A - Alanine
'GCU'=>'A','GCC'=>'A','GCA'=>'A','GCG'=>'A',
# C - Cysteine
'UGU'=>'C','UGC'=>'C',
# D - Aspartic Acid
'GAU'=>'D','GAC'=>'D',
# E - Glutamic Acid
'GAA'=>'E','GAG'=>'E',
# F - Phenylalanine
'UUU'=>'F','UUC'=>'F',
# G - Glycine
'GGU'=>'G','GGC'=>'G','GGA'=>'G','GGG'=>'G',
# H - Histidine
'CAU'=>'H','CAC'=>'H',
# I - Isoleucine
'AUU'=>'I','AUC'=>'I','AUA'=>'I',
# K - Lysine
'AAA'=>'K','AAG'=>'K',
# L - Leucine
'CUU'=>'L','CUC'=>'L','CUA'=>'L','CUG'=>'L',
'UUA'=>'L','UUG'=>'L',
# M - Methionine
'AUG'=>'M',
# N - Asparagine
'AAU'=>'N','AAC'=>'N',
# P - Proline
'CCU'=>'P','CCC'=>'P','CCA'=>'P','CCG'=>'P',
# Q - Glutamine
'CAA'=>'Q','CAG'=>'Q',
# R - Arginine
'CGU'=>'R','CGC'=>'R','CGA'=>'R','CGG'=>'R',
'AGA'=>'R','AGG'=>'R',
# S - Serine
'UCU'=>'S','UCC'=>'S','UCA'=>'S','UCG'=>'S',
'AGU'=>'S','AGC'=>'S',
# T - Threonine
'ACU'=>'T','ACC'=>'T','ACA'=>'T','ACG'=>'T',
# V - Valine
'GUU'=>'V','GUC'=>'V','GUA'=>'V','GUG'=>'V',
# W - Tryptophan
'UGG'=>'W',
# Y - Tyrosine
'UAU'=>'Y','UAC'=>'Y',
);
$_=pop;s/.{1,3}/$g{$&}/g;$_
}

# Second solution using the non-specific code.
sub f2{ #by MeowChow
my @r = qw(UA[AG]|UGA GC. - UG[UC] GA[UC] GA[AG] UU[UC] GG. CA[UC] AU[^G] - AA[AG] CU.|UU[AG] AUG AA[UC] - CC. CA[AG] CG.|AG[AG] UC.|AG[UC] AC. - GU. UGG - UA[UC] ^);
((my$t=pop)=~s|..?.?|chr 64+(grep$&=~/$r[$_]/,0..26)[0]|eg);$t=~y/@Z/./d;
$t
}

# Third solution including regex and substitutions
sub f3 { #by no_slogan
$_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";
s/[a-z]/uc$&x4/eg;@x=/./g;join"",@x[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]
}

# Fourth solution similar to 3rd.
sub f4 { #by srawls
$_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";s/[a-z]/uc$&x4/eg;
join"",(/./g)[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]
}

# Fifth solution inverse from 3rd and 4th
sub f5 { #by tachyon
@_{'UAAUAGUGAGCUGCCGCAGCGUGUUGCGAUGACGAAGAGUUUUUCGGUGGCGGAGGGCAUCACAUUAUCAUAAAAAAGCUUCUCCUACUGUUAUUGAUGAAUAACCCUCCCCCACCGCAACAGCGUCGCCGACGGAGAAGGUCUUCCUCAUCGAGUAGCACUACCACAACGGUUGUCGUAGUGUGGUAUUAC'=~/(...)/g}=split//,'...AAAACCDDEEFFGGGGHHIIIKKLLLLLLMNNPPPPQQRRRRRRSSSSSSTTTTVVVVWYY';
$_=pop;
s/..?.?/$_{$&}/g;
$_
}

# Sixth solution and this is the fastest solution
sub f5{ #by tadman
$_=pop;
y/UCAG/0123/;
s/(.)(.)(.)/substr"FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",$1<<4|$2<<2|$3,1/ge; y/0123//d;
$_
}
All solutions have less bytes but I added some break-lines to present a more clear code (really?).

I use the last solution, just change the code for ATGC (DNA code) and not AUGC (RNA code).

That's why Perl rules in Bioinformatic.

No comments:

Post a Comment