Wednesday, February 18, 2009

Filtering common miRNAs

The other day, M asked me how to compact the mature microRNA FASTA file from miRBase. The problem is that many miRNAs are conserved across species, so the FASTA file with all the mature sequences has a lot of redundancy, and he wants only the unique sequences for a screen.

Here is a simple Perl solution. The first step is to read and collect the sequences; the trick is to store them in a hash, using the nucleotide sequence as the key and appending the names as the value. After that, simply loop over the keys and parse the values a little to get a common identifier.


#!/usr/bin/perl -w
use strict;



=head1 DESCRIPTION

Filter the multi-FASTA file of miRBase (mature.fa)
into unique miRNAs by sequence.

=cut


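$ARGV[0] or die "Usage: $0 FASTA\n";
my $name = '';
my %seqs = ();
open my $fh, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!\n";
while (<$fh>) {
    chomp;
    next unless /\S/;    # skip empty lines
    if (/^>/) {
        # Header line: keep the first token (e.g. ">hsa-miR-21") without the ">"
        my @line = split (/\s+/, $_);
        $name = shift @line;
        $name =~ s/^>//;
    }
    else {
        # Sequence line: append the name to the list stored under this sequence
        $seqs{$_} .= "$name:";
    }
}
close $fh;

foreach my $seq (keys %seqs) {
    my $sp = $seqs{$seq};
    my @sp = split (/:/, $sp);
    my $mir = shift @sp;
    $mir =~ s/^\w+-//;    # drop the species prefix (e.g. "hsa-") for a common identifier
    $sp =~ s/:$//;        # remove the trailing separator
    print ">$mir $sp\n$seq\n";
}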

=head1 AUTHOR

Juan Caballero @ 2009

=head1 CONTACT

linxe11 _at_

=head1 LICENSE

Perl Artistic
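
To see the grouping idea on its own, here is a minimal, self-contained sketch that works on a few in-memory records instead of a file. The records, the species prefixes (aaa, bbb) and the helper name group_by_sequence are made up for illustration; they are not real miRBase entries. It also swaps the colon-joined string for an array of names per sequence, which is a variation on the script above, not the same code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up records for illustration only (not real miRBase entries).
my @lines = (
    '>aaa-miR-1 species A',
    'ACGUACGUACGU',
    '>bbb-miR-1 species B',
    'ACGUACGUACGU',
    '>aaa-miR-2 species A',
    'GGGAAACCCUUU',
);

# Hypothetical helper: returns a hash mapping sequence => array ref of names.
sub group_by_sequence {
    my @records = @_;
    my (%names_for, $name);
    for my $line (@records) {
        if ($line =~ /^>(\S+)/) {
            $name = $1;                            # first token of the header
        }
        elsif ($line =~ /\S/) {
            push @{ $names_for{$line} }, $name;    # same sequence, one more name
        }
    }
    return %names_for;
}

my %names_for = group_by_sequence(@lines);

# Sort the keys so the output order is reproducible.
for my $seq (sort keys %names_for) {
    my @names = @{ $names_for{$seq} };
    (my $mir = $names[0]) =~ s/^\w+-//;            # strip the species prefix
    print ">$mir ", join(':', @names), "\n$seq\n";
}
```

With these toy records, the two entries sharing ACGUACGUACGU collapse into a single record named miR-1 that lists both source names, while miR-2 stays on its own.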

