Maping big sequences
For my work I get some slices of 100 kb from real chromosomes, but I forget to insert in the fasta comment the name and coordinates for each sequences, now I need to know this information for each one, so I tried to map using blast, but in the server a weird error marks "Segmentation fault" (I format the full genome ~3 Gb).
I think in the Perl solution and this is the code:
I think in the Perl solution and this is the code:
#!/usr/bin/perl -w
use strict;
=head1 NAME
mapSeq2Chr.pl
=head1 DESCRIPTION
Perl script to find and get the position of each sequence in a
multifasta file into a big sequence (a chromosome).
The comparation is direct, so it only works with sequences in same
direction (typical 5' -> 3').
Output is a simple text file with the name of the sequence, the name
of the big sequence and the positions where match.
Note we match every line with a fasta comment (defined with ">") but
the last sequence is omitted, I add a ">" as last line with:
echo ">" >> fasta
=cut
$ARGV[2] or die "usage: perl mapSeq2Chr.pl CHR SEQ MAP\n";
# Global variables
my ($chr_file) = shift @ARGV; # Fasta file with chromosome
my ($seq_file) = shift @ARGV; # Fasta file with sequences
my ($map_file) = shift @ARGV; # Output file
my ($seq) = ''; # Sequence to map
my ($name) = ''; # Fasta comment
my ($chr) = ''; # Genomic sequence
my ($first) = 'y'; # Flag for first line
print "Reading genome sequences ...\n";
open C, "$chr_file" or die "Cannot open $chr_file\n";
while (<C>) {
next if (/>/);
chomp;
$chr .= $_;
}
close C;
print " loaded $chr_file with ", length $chr, " bases\n";
open F, "$seq_file" or die "Cannot open $seq_file\n";
open O, ">$map_file" or die "Cannot open $map_file\n";
print "Mapping sequences ...\n";
while (<F>) {
chomp;
if ($first eq 'y') {
s/>//;
$name = $_;
$first = 'n';
}
elsif (/>/) {
s/>//;
if ($chr =~ /$seq/) {
print " $name mapped in $chr_file ";
print O "$name\t$chr_file\t";
# We use @- to look for multiple matches
foreach my $match (@-) {
print O "$match:";
print "$match ";
}
print O "\n";
print "\n";
}
else {
print " $name not match in $chr_file\n";
}
$name = $_;
$seq = '';
}
else {
$seq .= $_;
}
}
close F;
close O;
print "Done\n";
=head1 AUTHOR
Juan Caballero (linxe _at_ glib.org.mx)
=head1 LICENSE
Perl Artistic License (http://dev.perl.org/licenses/artistic.html)
=cut
Hey Bro!
ReplyDeleteThen you are able to find the localization (numerical) of your sequences, but it is possible to obtain also a map or graphical position (draw) of your sequences?
I mean, as in mapviewer in maize, you can check the chromosome and the position (graphical view)
I am also need solve the positions of some sequences against a partial sequence (maize genome), I try your script.
See you Bro!
Cesar
Bro:
ReplyDeleteYes, you can draw a view using GD or some BioPerl module.
This script is for a perfect match sequence, I don't recommend it for partial sequences (even one base mismatch can be fatal).
Thinking a little more, you can use this code to map some oligos (for PCR or HT sequencing), if you multifasta file include the direct and reverse sequence.
ReplyDelete