Maping big sequences

August 28, 2008

For my work I get some slices of 100 kb from real chromosomes, but I forget to insert in the fasta comment the name and coordinates for each sequences, now I need to know this information for each one, so I tried to map using blast, but in the server a weird error marks "Segmentation fault" (I format the full genome ~3 Gb).

I think in the Perl solution and this is the code:


#!/usr/bin/perl -w
use strict;

=head1 NAME

mapSeq2Chr.pl

=head1 DESCRIPTION

Perl script to find and get the position of each sequence in a
multifasta file into a big sequence (a chromosome).
 
The comparation is direct, so it only works with sequences in same
direction (typical 5' -> 3').
 
Output is a simple text file with the name of the sequence, the name
of the big sequence and the positions where match.
 
Note we match every line with a fasta comment (defined with ">") but
the last sequence is omitted, I add a ">" as last line with:

echo ">" >> fasta

=cut

$ARGV[2] or die "usage: perl mapSeq2Chr.pl CHR SEQ MAP\n";

# Global variables
my ($chr_file) = shift @ARGV; # Fasta file with chromosome
my ($seq_file) = shift @ARGV; # Fasta file with sequences
my ($map_file) = shift @ARGV; # Output file
my ($seq)      =          ''; # Sequence to map
my ($name)     =          ''; # Fasta comment
my ($chr)      =          ''; # Genomic sequence
my ($first)    =         'y'; # Flag for first line

print "Reading genome sequences ...\n";
open C, "$chr_file" or die "Cannot open $chr_file\n";
while (<C>) {
 next if (/>/);
 chomp;
 $chr .= $_;
}
close C;
print "   loaded $chr_file with ", length $chr, " bases\n";

open F,  "$seq_file" or die "Cannot open $seq_file\n";
open O, ">$map_file" or die "Cannot open $map_file\n";
print "Mapping sequences ...\n";
while (<F>) {
 chomp;
 if ($first eq 'y') {
  s/>//;
  $name = $_;
  $first = 'n';
 }
 elsif (/>/) {
  s/>//;
  if ($chr =~ /$seq/) {
   print "   $name mapped in $chr_file ";
   print O "$name\t$chr_file\t";
   
   # We use @- to look for multiple matches
   foreach my $match (@-) {
    print O "$match:";
    print "$match ";
   }
   print O "\n";
   print "\n";
  }
  else {
   print "   $name not match in $chr_file\n";
  }
  $name = $_;
  $seq  = '';
 }
 else {
  $seq .= $_;
 }
}
close F;
close O;
print "Done\n";

=head1 AUTHOR

Juan Caballero (linxe _at_ glib.org.mx)

=head1 LICENSE

Perl Artistic License (http://dev.perl.org/licenses/artistic.html)

=cut

Comments

AA2August 28, 2008 at 11:27 AM
Hey Bro!
Then you are able to find the localization (numerical) of your sequences, but it is possible to obtain also a map or graphical position (draw) of your sequences?
I mean, as in mapviewer in maize, you can check the chromosome and the position (graphical view)

I am also need solve the positions of some sequences against a partial sequence (maize genome), I try your script.

See you Bro!

Cesar
ReplyDelete
Replies
LinxeAugust 28, 2008 at 11:46 AM
Bro:

Yes, you can draw a view using GD or some BioPerl module.

This script is for a perfect match sequence, I don't recommend it for partial sequences (even one base mismatch can be fatal).
ReplyDelete
Replies
LinxeAugust 28, 2008 at 11:51 AM
Thinking a little more, you can use this code to map some oligos (for PCR or HT sequencing), if you multifasta file include the direct and reverse sequence.
ReplyDelete
Replies

Add comment

Search This Blog

Linxe's eye

Maping big sequences

Comments

Post a Comment

Popular posts from this blog

When visuals gone wrong ...

New project start

Old computers Dream