Thursday, July 31, 2008

Working with large sequences

I have used Perl for my projects for more than 6 years. When I worked with the Arabidopsis genome, it was easy to load a full chromosome into a string variable. Now I mine bigger genomes, such as human or mouse, and I had the habit of passing variables directly, or by reference, to subroutines. Bad idea.

The point is that when you use many subs to perform calculations on the sequences, memory use grows very quickly. One of my scripts takes the sequences, divides them into blocks, performs a calculation on each block, and stores the results in a hash (O-O programming). The first time, I made a mistake and created an infinite loop that ate all the memory, including the swap. Our server has 32 GB of RAM and 32 GB of swap; fortunately, it survived my error. After I fixed the loop, the script still required about 3 GB, which is too much.
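To make the block-wise idea concrete, here is a minimal sketch, not the actual script. The sub name `gc_by_block`, the GC-fraction calculation, and the block size are all assumptions chosen for illustration. It reads each block with `substr` through a reference, so only one block-sized copy of the sequence exists at a time.

```perl
use strict;
use warnings;

# Hypothetical helper: GC fraction of each fixed-size block.
# The sequence arrives as a reference and is never copied whole;
# substr extracts one block at a time.
sub gc_by_block {
    my ($seq_ref, $block_size) = @_;
    my %gc;                                  # block start => GC fraction
    my $len = length $$seq_ref;
    for (my $start = 0; $start < $len; $start += $block_size) {
        my $block = substr($$seq_ref, $start, $block_size);
        my $count = ($block =~ tr/GCgc//);   # tr with an empty replacement only counts
        $gc{$start} = $count / length($block);
    }
    return \%gc;
}

# Toy sequence standing in for a chromosome (assumption).
my $seq = 'GGCCAATTGCAT';
my $gc  = gc_by_block(\$seq, 4);
printf "%d\t%.2f\n", $_, $gc->{$_} for sort { $a <=> $b } keys %$gc;
```

The hash holds one small number per block, so its size depends on the number of blocks, not on the length of the sequence.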

After thinking a little, I decided not to pass anything to the subs and to declare the main variables globally instead. This reduced the memory needed to 300 MB for the same process. Nice.
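A minimal sketch of the global-variable approach follows. The names `$SEQ`, `count_base`, and `block_at` are hypothetical, and the toy string stands in for a real chromosome. One plausible reason the approach saves memory is that the usual `my ($seq) = @_;` at the top of a sub copies the whole string on every call, and so does dereferencing into a new variable with `my $seq = $$ref;`. Subs that read a shared package variable never make that copy.

```perl
use strict;
use warnings;

# Package variable shared by every sub below (hypothetical name).
# In real use it would be filled once, e.g. from a FASTA file.
our $SEQ = 'ACGTNNACGTAC';

# Count occurrences of a single base by scanning $SEQ in place.
sub count_base {
    my ($base) = @_;               # only the one-letter argument is copied
    my ($n, $pos) = (0, 0);
    while (($pos = index($SEQ, $base, $pos)) >= 0) {
        $n++;
        $pos++;                    # continue after the match just found
    }
    return $n;
}

# Return one block; only block-sized data leaves the global string.
sub block_at {
    my ($start, $size) = @_;
    return substr($SEQ, $start, $size);
}

print "A: ", count_base('A'), "\n";
print "first block: ", block_at(0, 4), "\n";
```

The trade-off is the usual one with globals: every sub depends on `$SEQ` being loaded before it is called, so the script is harder to reuse as a module.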

I'm also learning MATLAB; later I may compare it to R and other open-source math programs.
