Wednesday, February 17, 2010

Maping TFBS in a sequence

I needed to test some sequence for known TFBS, I was looking for: (1) a public database of TFBS and (2) a program to search the motifs in a sequence(s). First I found Jaspar as an alternative to the commercial TransFac DB, I downloaded the matrices for vertebrate.

The second problem was to find a program to search the motifs in any sequence, I did not find any updated program (many are published, but few websites are active nowadays).

So, I decided to write my own program, I share the code if anyone can use it:

#!/usr/bin/perl -w
use strict;
use Getopt::Long;

=head1 NAME

jasparScan.pl

=head1 USAGE

jasparScan.pl -f FASTA -m MATRIX [-s MODELS] [-c CUT_OFF]

OPTIONS:

-h --help Help screen

-f --fasta Fasta file to scan

-m --matrix Jaspar motif matrix

-s --models List of models, separated by ':' [Def: all]

-c --cutoff Similarity cut-off [Def: 0.90]

=head1 DESCRIPTION

Script to map a TF binding site matrix (from Jaspar DB)
into a fasta sequence. The similarity is defined as a
value between 0-1 (1 is the best match). Base by base
are compared in the matrix frequencies and normalized
by the best match expected. This scoring system is based
on a reduced MatInspector similarity matrix (Quandt et
al, 1995).

=cut

# Program parameters
my $fasta = undef;
my $matrix = undef;
my $models = undef;
my $help = undef;
my $cutoff = 0.9;

usage() unless (GetOptions('help|h' =>\$help,
'fasta|f=s' =>\$fasta,
'matrix|m=s'=>\$matrix,
'models|s:s'=>\$models,
'cutoff|c:s'=>\$cutoff ));
usage() if(defined $help);
usage() unless(defined $fasta and defined $matrix);

# Header print
print "# jasperScan.pl\n";
print "# FASTA=$fasta, MATRIX=$matrix, CUTOFF=$cutoff";
if (defined $models) { print ", MODELS=$models"; }
print "\n";

# Global variables
my %seqs = ();
my %mods = ();
my $sid = '';
my $seq = '';
my $num_seq = 0;
my $num_mod = 0;

# Loading sequences
$num_seq = loadSeqs($fasta);
if ($num_seq >= 1) {
print "# loaded $num_seq sequences\n";
} else {
die "no valid sequences\n";
}

# Loading matrix models
if (defined $models) {
$num_mod = loadMods($matrix, $models);
} else {
$num_mod = loadMods($matrix);
}

if ($num_mod <= 1) {
print "# loaded $num_mod models\n"
} else {
die "no valid models\n";
}

print "# SEQ\tPOS\tTFBS\tSIMIL\tRAW_SIM\tBEST_SIM\n";
# Main search
while ( ($sid, $seq) = each %seqs ) {
foreach my $mod (keys %mods) {
my $len = $mods{$mod}{'len'};
my $best = $mods{$mod}{'best'};
for (my $i = 0; $i <= ((length $seq) - $len); $i++) {
my $word = substr($seq, $i, $len);
next if($word =~ /[^ACGT]/);
my $sco = 0;
my $sco_n = 0;
for (my $j = 0; $j <= ($len - 1); $j++) {
my $b = substr($word, $j, 1);
$sco += $mods{$mod}{$b}[$j];
}
$sco /= $len;
$sco_n = $sco / $best;
next unless ($sco_n >= $cutoff);
my $ini = $i + 1;
my $end = $ini + $len;
print "$sid\t$ini-$end\t$mod\t$sco_n\t$sco\t$best\n";
}
}
}


=head1 SUBROUTINES

=cut

sub loadSeqs {
my $f = shift @_;
my $n = 0;
open F, "$f" or die "cannot open $f\n";
while (<F>) {
chomp;
if (/>(.+)/) {
$sid = $1;
$n++;
} else {
$seqs{$sid} .= $_;
}
}
close F;
return $n;
}

sub loadMods {
my $file = shift @_;
my $sel = shift @_;
my %sel = ();
my $n = 0;
if (defined $sel) {
my @sel = split (/:/, $sel);
foreach my $elem (@sel) { $sel{$elem}++; }
}
local $/ = "\n>";
open F, "$file" or die "cannot open $file\n";
while (<F>) {
if (defined $sel) {
/MA\d{4}/;
my $id = $&;
next unless(defined $sel{$id});
}
s/>//g;
s/\]//g;
s/\[//g;
my ($name, $fA, $fC, $fG, $fT) = split (/\n/, $_);
my @fA = split (/\s+/, $fA); shift @fA;
my @fC = split (/\s+/, $fC); shift @fC;
my @fG = split (/\s+/, $fG); shift @fG;
my @fT = split (/\s+/, $fT); shift @fT;
my $best = 0;
my $len = 0;

for (my $i = 0; $i <= $#fA; $i++) {
my $sum = $fA[$i] + $fC[$i] + $fG[$i] + $fT[$i];
$best += max($fA[$i],$fC[$i],$fG[$i],$fT[$i])/$sum;
$mods{$name}{'A'}[$i] = $fA[$i] / $sum;
$mods{$name}{'C'}[$i] = $fC[$i] / $sum;
$mods{$name}{'G'}[$i] = $fG[$i] / $sum;
$mods{$name}{'T'}[$i] = $fT[$i] / $sum;
$len++;
}
$mods{$name}{'best'} = $best / $len;
$mods{$name}{'len'} = $len;
$n++;
}
close F;
return $n;
}

sub max {
my $max = shift @_;
foreach my $x (@_) {
$max = $x if($x > $max);
}
return $max;
}

sub usage {

print <<_THIS_
jasparScan.pl -f FASTA -m MATRIX [-s MODELS] [-c CUT_OFF]

Script to map a TF binding site matrix (from Jaspar DB)
into a fasta sequence.

OPTIONS:
-h --help Help screen
-f --fasta Fasta file to scan
-m --matrix Jaspar motif matrix
-s --models List of model to use, separated by ':' [Def: all]
-c --cutoff Similarity cut-off [Def: 0.9]

Juan Caballero @ ISB 2010

_THIS_
;
exit 1;
}

=head1 AUTHOR

Juan Caballero
Institute for Systems Biology @ 2010

=head1 LICENSE

This is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with code. If not, see <http://www.gnu.org/licenses/>.


=cut

Any nice script to color my code inside Blogger?

No comments:

Post a Comment