Thursday, August 28, 2008

Maping big sequences

For my work I get some slices of 100 kb from real chromosomes, but I forget to insert in the fasta comment the name and coordinates for each sequences, now I need to know this information for each one, so I tried to map using blast, but in the server a weird error marks "Segmentation fault" (I format the full genome ~3 Gb).

I think in the Perl solution and this is the code:



#!/usr/bin/perl -w
use strict;

=head1 NAME

mapSeq2Chr.pl

=head1 DESCRIPTION

Perl script to find and get the position of each sequence in a
multifasta file into a big sequence (a chromosome).

The comparation is direct, so it only works with sequences in same
direction (typical 5' -> 3').

Output is a simple text file with the name of the sequence, the name
of the big sequence and the positions where match.

Note we match every line with a fasta comment (defined with ">") but
the last sequence is omitted, I add a ">" as last line with:

echo ">" >> fasta

=cut

$ARGV[2] or die "usage: perl mapSeq2Chr.pl CHR SEQ MAP\n";

# Global variables
my ($chr_file) = shift @ARGV; # Fasta file with chromosome
my ($seq_file) = shift @ARGV; # Fasta file with sequences
my ($map_file) = shift @ARGV; # Output file
my ($seq) = ''; # Sequence to map
my ($name) = ''; # Fasta comment
my ($chr) = ''; # Genomic sequence
my ($first) = 'y'; # Flag for first line

print "Reading genome sequences ...\n";
open C, "$chr_file" or die "Cannot open $chr_file\n";
while (<C>) {
next if (/>/);
chomp;
$chr .= $_;
}
close C;
print " loaded $chr_file with ", length $chr, " bases\n";

open F, "$seq_file" or die "Cannot open $seq_file\n";
open O, ">$map_file" or die "Cannot open $map_file\n";
print "Mapping sequences ...\n";
while (<F>) {
chomp;
if ($first eq 'y') {
s/>//;
$name = $_;
$first = 'n';
}
elsif (/>/) {
s/>//;
if ($chr =~ /$seq/) {
print " $name mapped in $chr_file ";
print O "$name\t$chr_file\t";

# We use @- to look for multiple matches
foreach my $match (@-) {
print O "$match:";
print "$match ";
}
print O "\n";
print "\n";
}
else {
print " $name not match in $chr_file\n";
}
$name = $_;
$seq = '';
}
else {
$seq .= $_;
}
}
close F;
close O;
print "Done\n";

=head1 AUTHOR

Juan Caballero (linxe _at_ glib.org.mx)

=head1 LICENSE

Perl Artistic License (http://dev.perl.org/licenses/artistic.html)

=cut

3 comments:

  1. Hey Bro!
    Then you are able to find the localization (numerical) of your sequences, but it is possible to obtain also a map or graphical position (draw) of your sequences?
    I mean, as in mapviewer in maize, you can check the chromosome and the position (graphical view)

    I am also need solve the positions of some sequences against a partial sequence (maize genome), I try your script.

    See you Bro!

    Cesar

    ReplyDelete
  2. Bro:

    Yes, you can draw a view using GD or some BioPerl module.

    This script is for a perfect match sequence, I don't recommend it for partial sequences (even one base mismatch can be fatal).

    ReplyDelete
  3. Thinking a little more, you can use this code to map some oligos (for PCR or HT sequencing), if you multifasta file include the direct and reverse sequence.

    ReplyDelete