Tuesday, February 24, 2009

DNA fractals

Some days ago, I had a talk with I who suggested me to use some mathematical approaches to compare artificial with real sequences (one of my projects). Besides mutual information and SVM classification systems, he told me about fractals presents in the sequences. The method had been called "Chaos game for DNA" (emboss has a subprogram to do it named chaos) and the algorithm is very simple:
  1. use a square with each corner represents a "letter" (AGCT),
  2. start in the middle of the square and read the sequence,
  3. you add a point in the halfway of the previous point and the new letter,
  4. continue until you finish the sequence

This method was previously described in the 90's, recently we have more genome information to continue exploring this characteristic in the DNA sequence, some application include alignment-free comparison and other iterative systems which reproduce this effect.

After I plot some sequences, I asked me to test other species to see differences, but as G concludes, the pattern is present because genomic sequence has low CpG content (this is "CG" are poorly represent, biological implications are DNA regulation with methylation).

We still discussing if DNA can be called "fractal" and if this characteristic can be used to scan sequences looking for functional/non-functional parts or some insight.

I twitted some thoughts about this like "fractals are beauty", Marcha replied: "fractals are evil", so "evil is beauty".

Thursday, February 19, 2009

Tips And Tricks For Bioinformatics Software Engineering

Gus told me about this blog post which is the condensed version of this presentation:
Let's discuss some points:
  1. Learn UNIX. This is basic, many people comes from the MS-world and wants to do the same tasks as in UNIX, wrong, the UNIX philosophy and design is optimized for heavy task and multiple activities, besides the complete support for programming, servers, administration, ... I say: don't waste your time, use UNIX (Linux, Mac OS, BSD, Solaris, ...).
  2. Know many computer languages but be master in ONE. In this part the talk is focused in Java/Python/Ruby, I still suggest Perl. Perl big problem is the object-oriented pragma, but nothing compares the power of regular expressions and text-parsing of Perl. Besides Perl is less grammar strict than Java/Python, it's easy to learn but easy to acquire bad habits.
  3. Don’t reinvent the wheel. I agree, why do spend time and money developing something which do the same as other existing tools but in different colors? Code less means faster development.
  4. Learn one text editor really well. You can write in many text editors, but few really support code development and other features which enhance your code-fu.
  5. Control version. CVS and similar tools helps to monitor, share and backup your valuable code.
  6. Don’t be afraid to use more than 3 letters to define a variable. Common mistake in beginners (and lazy programmers), you must have a legible code, always.
  7. Balance architecture and accomplishment. Software development can be elegant, complete, extensible and with order, but this takes time, find an equilibrium.
  8. Automate documentation. Or insert the documentation in the code, I use PerlDoc features in my code, easy way to write the help files.
  9. Kill the flat file. New technologies has high throughput files, so is better to use a DB (or kind-of) to store and retrieve the data. BerkelyDB, SQLite, SOAP/ReST, ...
  10. New ways to do parallel computing. Check MapReduce, cloud systems and support for multiple cores technologies.
  11. Embrace hardware. Read and understand novel technologies in data-crunching, vectorial CPU are back with graphical processors (GPU) and FGPA chips. Many algorithms can be ported and be faster in this chips, but not all of them.
  12. Playing nice with others. Support multiple outputs like XML, JSON, YAML to easy integrate with other tools.

Wednesday, February 18, 2009

Filtering common miRNAs

Last day, M asks me how to compact the microRNA mature fasta-file from miRbase, the problem is many miRNA are conservated across species, so the fasta file with all the mature sequences has a lot of redundancy and he wants unique sequences to screen some sequences.

This is a simple Perl solution, the first step is to read and collect the sequences, the trick is to save in a hash using the nucleic sequences as "key" and append the names as "value". After this, simply extract the "keys" and parse a little the "values" to have a common identifier.


#!/usr/bin/perl -w
use strict;

=head1 NAME

uniq_mir.pl FASTA


Filter the multifasta file of miRbase (mature.fa)
into unique miR by sequence.


$ARGV[0] or die "Usage: uniq_mir.pl FASTA\n";
my $name = '';
my %seqs = ();
open F, "$ARGV[0]" or die "cannot open $ARGV[0]\n";
while (<F>) {
if (/^>/) {
my @line = split (/\s+/, $_);
$name = shift @line;
else {
$seqs{$_} .= "$name:";
close F;

foreach my $seq (keys %seqs) {
my $sp = $seqs{$s};
my @sp = split (/:/, $n);
my $mir = shift @sp;
$mir =~ s/^\w+-//;
$sp =~ s/:$//;
print ">$mir $sp\n$seq\n";

=head1 AUTHOR

Juan Caballero @ 2009

=head1 CONTACT

linxe11 _at_ gmail.com

=head1 LICENSE

Perl Artistic http://perldoc.perl.org/perlartistic.html


Thursday, February 12, 2009

Command-line Fu

Recently I found this great site Command-line Fu, a site where you can find these beautiful gems for the command line but it's different from other collections:

What's this?

Command-Line-Fu is the place to record those command-line gems that you return to again and again.

Delete that bloated snippets file you've been using and share your personal repository with the world. That way others can gain from your CLI wisdom and you from theirs too. All commands can be commented on and discussed - digg-esque voting is also encouraged so the best float to the top.

So, anyone can share their commands, you can look, learn and use or suggests better ways to do it, and beyond the Bash, also include tags (functions) for one-liners scripts in perl, python, etc. or by specific commands like curl, ssh, grep, ...

With Twitter and RSS report systems you can follow the latest commands too (but only report the link and the function, not the command).


Via: DownloadSquad

Monday, February 9, 2009


The next February 13th has a special event, not for everyone, just for people like us (yes, you know, accept it ...) because the UNIX time at 11:31:30pm UTC will read in the terminal in curious form: 1234567890.

Why? Check Wikipedia:
"Unix time, or POSIX time, is a system for describing points in time, defined as the number of seconds elapsed since midnight Coordinated Universal Time (UTC) of January 1st. 1970, not counting lap seconds. It is widely used not only on Unix-like operation systems but also in many other computing systems. It is neither a linear representation of time nor a true representation of UTC (though it is frecuently mistaken of both) as the times it represents are UTC but it has no way of repesenting UTC leap seconds (e.g. 1998-12-31 23:59:60)."

If you want to check it, or see other values for your favorite number, you can use the Perl one-line:

perl -e 'print scalar localtime(1234567890),"\n";'

Note: image from Wikimedia showing when the 1000000000 seconds since epoch was celebrated at a party held by DKUUG in Copenhagen, Denmark 2001-09-09T03:46:4.

Friday, February 6, 2009

People loves Win7 ... wait it's KDE4!

ZDNet Australia test some people in the street, they showed a Linux (which one? they didn't mention ...) with KDE4, they activate some fancy effects and told the people "this is the new windows 7", ... many people like it!

I think they must tell the true, to see if they want to switch to Linux, anyway the cheaper W7 will be $200 usd (I prefer to buy a new computer with this money, I'm still saving for my AAO, donations accepted).

Wednesday, February 4, 2009

Bigger, Faster, ...

IBM announced the new big project to build the next #1 supercomputer of the world, codename Sequoia, it's impressive the limit of 20 petaflops ... that's more than the combined power of all top 500 supercomputers (#1 is only 1 petaflops).

The Sequoia will be powered by 1.6 million cores (specific 45-nanometer chips in development) and 1.6 petabytes of memory. It will be housed in 96 refrigerators spanning roughly 280 square meters.

Of course the owner will be the USA gov, who's else?

From: Gizmodo
Image from: xkcd

Monday, February 2, 2009

Bash scripting tips

Bash is almost a complete programming language, I use every day (switching from the default tcsh in the servers, in Mac OS X is default) for automatic job launching.

Extracted from: http://hacktux.com/bash/script/efficient.

  1. Avoid full paths to bash built-ins. Except when you require specific commands, when I use systems with 32 or 64b binaries, I need to specify which one to use, or define an specific directory to use in the PATH or as a variable.
  2. Avoid external commands for integer math. Because bash has math evaluation.

  3. Avoid using cat. Commonly people use cat and send to other commands with a pipe, but many commands can read files in the parameters.

  4. Avoid piping grep to awk. Because awk can filter the inputs, use /pattern/

  5. Avoid piping sed to sed. Multiple filters can be applied to the input with the -e option ( sed -e "s/this/that/" -e "s/old/new/" filename ).

  6. Use double brackets for compound and RegEx tests. [[ ]] evaluate the operations between, ( if [[ expr1 && expr2 ]]; then something; fi )

  7. Use functions for repetitive tasks. Define some functions is simple ( function () { do_something; return $? } )

  8. Use Arrays Instead of multiple variables. Yes, bash has array support, declare as normal variables, and access with []. ARRAY = ("one", "two"); echo ${ARRAY[0]}

  9. Use /bin/mktemp to create temp files. This is new for me, good to know. tempfile = /bin/mktemp

  10. Use /bin/egrep or /bin/sed for RegEx pattern matching. Only for basic RegEx, for more complex patterns it's better to use Perl.