Filtering common miRNAs

Yesterday M asked me how to compact the mature microRNA FASTA file from miRBase. The problem is that many miRNAs are conserved across species, so the FASTA file with all the mature sequences has a lot of redundancy, and he wants only the unique sequences to use in screening some sequences.

This is a simple Perl solution. The first step is to read and collect the sequences; the trick is to store them in a hash, using the nucleotide sequence as the key and appending the names as the value. After that, simply extract the keys and parse the values a little to get a common identifier.

Code:

#!/usr/bin/perl -w
use strict;

=head1 NAME

uniq_mir.pl FASTA

=head1 DESCRIPTION

Filter the miRBase multi-FASTA file (mature.fa)
into unique miRNAs by sequence.

=cut

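$ARGV[0] or die "Usage: uniq_mir.pl FASTA\n";
my $name = '';
my %seqs = ();
open my $fh, '<', $ARGV[0] or die "cannot open $ARGV[0]\n";
while (<$fh>) {
    chomp;
    if (/^>/) {
        # Header line: keep only the first word (the miRNA ID)
        s/^>//;
        my @line = split (/\s+/, $_);
        $name = shift @line;
    }
    else {
        # Sequence line: the sequence is the key, names pile up as "name:"
        $seqs{$_} .= "$name:";
    }
}
close $fh;

foreach my $seq (keys %seqs) {
    my $sp = $seqs{$seq};          # all names sharing this sequence
    my @sp = split (/:/, $sp);
    my $mir = shift @sp;           # first name seen for this sequence
    $mir =~ s/^\w+-//;             # drop the species prefix
    $sp =~ s/:$//;                 # drop the trailing colon
    print ">$mir $sp\n$seq\n";
}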

=head1 AUTHOR

Juan Caballero @ 2009

=head1 CONTACT

linxe11 _at_ gmail.com

=head1 LICENSE

Perl Artistic http://perldoc.perl.org/perlartistic.html

=cut
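
To see the hash trick in isolation, here is a minimal sketch that runs on a few in-memory records instead of a file. The record names and sequences are made up for illustration (they are not copied from miRBase), and group_by_sequence is a hypothetical helper, not part of the script above. It is also a slight variant: the names are kept in an array per sequence rather than in a colon-terminated string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up records in the mature.fa layout: [ name, sequence ].
# Names and sequences are illustrative only.
my @records = (
    [ 'aaa-miR-1', 'UGGAAUGUAAAGAAGUAUGUAU' ],
    [ 'bbb-miR-1', 'UGGAAUGUAAAGAAGUAUGUAU' ],
    [ 'aaa-miR-2', 'UAUCACAGCCAGCUUUGAUGAGC' ],
);

# Hypothetical helper: group record names by sequence.
# The sequence is the hash key, the value is an array of names.
sub group_by_sequence {
    my @recs = @_;
    my %names_for;
    for my $rec (@recs) {
        my ($name, $seq) = @$rec;
        push @{ $names_for{$seq} }, $name;
    }
    return %names_for;
}

my %names_for = group_by_sequence(@records);

for my $seq (sort keys %names_for) {
    my @names = @{ $names_for{$seq} };
    # Strip the species prefix from the first name to get a shared identifier.
    (my $mir = $names[0]) =~ s/^\w+-//;
    print ">$mir ", join(':', @names), "\n$seq\n";
}
```

The two records that share a sequence collapse into a single entry listing both names, while the third stays on its own. Sorting the keys is only there to make the output order repeatable; the original script prints in hash order.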
