Filtering common miRNAs
The other day, M asked me how to compact the mature microRNA FASTA file from miRBase. The problem is that many miRNAs are conserved across species, so the FASTA file with all the mature sequences is highly redundant, and he wants only the unique sequences for screening.
This is a simple Perl solution. The first step is to read and collect the sequences; the trick is to store them in a hash, using the nucleotide sequence as the key and appending each record name to the value. After that, simply loop over the keys and lightly parse the values to build a common identifier.
Code:
#!/usr/bin/perl -w
use strict;
=head1 NAME
uniq_mir.pl FASTA
=head1 DESCRIPTION
Filter the miRBase multi-FASTA file (mature.fa)
down to unique miRNAs by sequence.
=cut
$ARGV[0] or die "Usage: uniq_mir.pl FASTA\n";
my $name = '';
my %seqs = ();
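# Read the FASTA file: header lines set the current name,
# sequence lines become hash keys that collect the names.
open my $fh, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!\n";
while (<$fh>) {
    chomp;
    if (/^>/) {
        s/>//;
        my @line = split (/\s+/, $_);
        $name = shift @line;
    }
    else {
        $seqs{$_} .= "$name:";
    }
}
close $fh;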
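# Print one record per unique sequence.
foreach my $seq (keys %seqs) {
    my $sp = $seqs{$seq};         # colon-separated names sharing this sequence
    my @sp = split (/:/, $sp);
    my $mir = shift @sp;          # first name serves as the representative
    $mir =~ s/^\w+-//;            # drop the species prefix, e.g. "hsa-"
    $sp =~ s/:$//;                # remove the trailing colon
    print ">$mir $sp\n$seq\n";
}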
=head1 AUTHOR
Juan Caballero @ 2009
=head1 CONTACT
linxe11 _at_ gmail.com
=head1 LICENSE
Perl Artistic http://perldoc.perl.org/perlartistic.html
=cut
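To see the hash trick in isolation, here is a minimal sketch that runs on a few in-memory records instead of a file. The helper name group_by_sequence and the toy records are my own inventions for illustration (they are not real miRBase entries), and I store the names in an array per sequence rather than a colon-joined string, which is a small variation on the script above, not a replacement for it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: group FASTA lines by sequence.
# Returns a hash ref mapping each sequence to an array ref
# of the record names that carry that sequence.
sub group_by_sequence {
    my @lines = @_;
    my %names_for;
    my $name = '';
    foreach my $line (@lines) {
        chomp $line;
        next if $line eq '';
        if ($line =~ /^>(\S+)/) {
            $name = $1;                           # first word of the header
        }
        else {
            push @{ $names_for{$line} }, $name;   # sequence is the key
        }
    }
    return \%names_for;
}

# Toy records, invented for illustration only.
my @toy = (
    ">aaa-miR-X first toy record",
    "ACGU",
    ">bbb-miR-X second toy record",
    "ACGU",
    ">aaa-miR-Y third toy record",
    "GGCC",
);

my $groups = group_by_sequence(@toy);

# Sort the keys so the output order is stable between runs.
foreach my $seq (sort keys %$groups) {
    my @names = @{ $groups->{$seq} };
    (my $mir = $names[0]) =~ s/^\w+-//;           # strip the species-like prefix
    print ">$mir ", join(':', @names), "\n$seq\n";
}
```

With these toy records, the two entries sharing ACGU collapse into a single record whose header lists both names, while GGCC keeps its single name. Sorting the keys is optional; the original script prints in whatever order the hash returns.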