Motivation: Genomes in higher eucaryotic organisms contain a substantial amount
of (almost) repeated sequences. Tandem Repeats (TRs) constitute a large
class of repetitive sequences that are originated via phenomena such
as replication slippage and are characterized by close spatial contiguity.
They play an important role in several molecular regulatory mechanisms, and also in
several diseases (e.g. in the group of trinucleotide repeat disorders).
While for tandem repeats with a low or medium level of divergence the current methods are
rather effective, the problem of detecting TRs with higher divergence (fuzzy TRs) is still
open. The detection of fuzzy TRs is propaedeutic to enriching our view of their role in regulatory mechanisms and diseases.
Fuzzy TRs are also important as tools to shed light on the evolutionary history of
the genome, where higher divergence correlates with more remote duplication events.
We have developed an algorithm (christened TRStalker) with the aim of detecting efficiently Tandem Repeats that are hard to detect because of their inherent fuzziness, due to high levels of base substitutions, insertions and deletions.
To attain this goal we developed heuristics to solve a Steiner version of the problem for which the fuzziness is measured with respect to a motif string not necessarily present in the input string. This problem akin to the "generalized median string" that is know to be an NP-hard problem. Experiments with both synthetic and biological sequences demonstrate that our method performs better than current state of the art for fuzzy TRs and that the fuzzy TRs of the type we detect are indeed present in important biological sequences.
Availability: TRStalker will be integrated in the Web-based Tandem Repeats Discovery Service (TReaDS) at bioalgo.iit.cnr.it.