Perl BioGolf

Do you know what is a Perl Golf problem? It's a general problem formulated and you try to resolve with a minimal number of characters in a perl script, who writes less win. Some times is a good habit to see, admire and think in this beautiful pearls. Commonly there are a lot in the Perl Monks website.

Today I was looking for a more simple and effective subroutine to translate a DNA/RNA sequence into the corresponding peptide version using the typical genetic code, I used the typical solution with a hash storing the code and call the sequence in block with substr or pop/shift.

I found this solutions in a Perl Golf challenge:


# Typical solution hashing the codes:
sub f0 { #by tadman
my %g = (
# . - Stop
'UAA'=>'.','UAG'=>'.','UGA'=>'.',
# A - Alanine
'GCU'=>'A','GCC'=>'A','GCA'=>'A','GCG'=>'A',
# C - Cysteine
'UGU'=>'C','UGC'=>'C',
# D - Aspartic Acid
'GAU'=>'D','GAC'=>'D',
# E - Glutamic Acid
'GAA'=>'E','GAG'=>'E',
# F - Phenylalanine
'UUU'=>'F','UUC'=>'F',
# G - Glycine
'GGU'=>'G','GGC'=>'G','GGA'=>'G','GGG'=>'G',
# H - Histidine
'CAU'=>'H','CAC'=>'H',
# I - Isoleucine
'AUU'=>'I','AUC'=>'I','AUA'=>'I',
# K - Lysine
'AAA'=>'K','AAG'=>'K',
# L - Leucine
'CUU'=>'L','CUC'=>'L','CUA'=>'L','CUG'=>'L',
'UUA'=>'L','UUG'=>'L',
# M - Methionine
'AUG'=>'M',
# N - Asparagine
'AAU'=>'N','AAC'=>'N',
# P - Proline
'CCU'=>'P','CCC'=>'P','CCA'=>'P','CCG'=>'P',
# Q - Glutamine
'CAA'=>'Q','CAG'=>'Q',
# R - Arginine
'CGU'=>'R','CGC'=>'R','CGA'=>'R','CGG'=>'R',
'AGA'=>'R','AGG'=>'R',
# S - Serine
'UCU'=>'S','UCC'=>'S','UCA'=>'S','UCG'=>'S',
'AGU'=>'S','AGC'=>'S',
# T - Threonine
'ACU'=>'T','ACC'=>'T','ACA'=>'T','ACG'=>'T',
# V - Valine
'GUU'=>'V','GUC'=>'V','GUA'=>'V','GUG'=>'V',
# W - Tryptophan
'UGG'=>'W',
# Y - Tyrosine
'UAU'=>'Y','UAC'=>'Y',
);
$_=pop;s/.{1,3}/$g{$&}/g;$_
}

# Second solution using the non-specific code.
sub f2{ #by MeowChow
my @r = qw(UA[AG]|UGA GC. - UG[UC] GA[UC] GA[AG] UU[UC] GG. CA[UC] AU[^G] - AA[AG] CU.|UU[AG] AUG AA[UC] - CC. CA[AG] CG.|AG[AG] UC.|AG[UC] AC. - GU. UGG - UA[UC] ^);
((my$t=pop)=~s|..?.?|chr 64+(grep$&=~/$r[$_]/,0..26)[0]|eg);$t=~y/@Z/./d;
$t
}

# Third solution including regex and substitutions
sub f3 { #by no_slogan
$_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";
s/[a-z]/uc$&x4/eg;@x=/./g;join"",@x[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]
}

# Fourth solution similar to 3rd.
sub f4 { #by srawls
$_="KNNKtIIIMRSSRQHHQplr.YY.sLFFL.CCWEDDEavg";s/[a-z]/uc$&x4/eg;
join"",(/./g)[map{$x=0;$x=$x*4|6&ord for/./g;$x/2}pop=~/.../g]
}

# Fifth solution inverse from 3rd and 4th
sub f5 { #by tachyon
@_{'UAAUAGUGAGCUGCCGCAGCGUGUUGCGAUGACGAAGAGUUUUUCGGUGGCGGAGGGCAUCACAUUAUCAUAAAAAAGCUUCUCCUACUGUUAUUGAUGAAUAACCCUCCCCCACCGCAACAGCGUCGCCGACGGAGAAGGUCUUCCUCAUCGAGUAGCACUACCACAACGGUUGUCGUAGUGUGGUAUUAC'=~/(...)/g}=split//,'...AAAACCDDEEFFGGGGHHIIIKKLLLLLLMNNPPPPQQRRRRRRSSSSSSTTTTVVVVWYY';
$_=pop;
s/..?.?/$_{$&}/g;
$_
}

# Sixth solution and this is the fastest solution
sub f5{ #by tadman
$_=pop;
y/UCAG/0123/;
s/(.)(.)(.)/substr"FFLLSSSSYY..CC.WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",$1<<4|$2<<2|$3,1/ge; y/0123//d;
$_
}
All solutions have less bytes but I added some break-lines to present a more clear code (really?).

I use the last solution, just change the code for ATGC (DNA code) and not AUGC (RNA code).

That's why Perl rules in Bioinformatic.

Comments

Popular posts from this blog

Code evolution

When visuals gone wrong ...