Biological sequence data mining

Yuh-Jyh Hu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

3 Scopus citations


Biologists have established that the control and regulation of gene expression is primarily determined by relatively short sequences in the region surrounding a gene. These sequences vary in length, position, redundancy, orientation, and bases. Finding these short sequences is a fundamental problem in molecular biology with important applications. Though many different approaches to signal/motif (i.e. short sequence) finding exist, in 2000 Pevzner and Sze reported that most current motif-finding algorithms are incapable of detecting the target signals in their so-called Challenge Problem. In this paper, we show that, using an iterative-restart design, our new algorithm can correctly find the targets. Furthermore, taking into account the fact that some transcription factors form a dimer or even more complex structures, and that the transcription process can sometimes involve multiple factors, we extend the original problem to an even more challenging one. We address the issue of combinatorial signals with gaps of variable lengths. To demonstrate the efficacy of our algorithm, we tested it on a series of the original and the new challenge problems, and compared it with some representative motif-finding algorithms. In addition, to verify its feasibility in real-world applications, we also tested it on several regulatory families of yeast genes with known motifs. The purpose of this paper is two-fold. One is to introduce an improved biological data mining algorithm that is capable of dealing with more variable regulatory signals in DNA sequences. The other is to propose a new research direction for the general KDD community.
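
As a rough illustration of the kind of benchmark the abstract refers to, the following Python sketch generates a planted-motif instance. The parameter values (20 sequences of length 600, a motif of length 15 with exactly 4 substitutions per occurrence) are those commonly quoted for the Pevzner and Sze Challenge Problem and are not stated in this abstract. The function names are hypothetical, and the sketch only builds test data; it is not the algorithm introduced in the paper.

```python
import random

ALPHABET = "ACGT"


def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions changed to a different base."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)


def planted_instance(n_seqs=20, seq_len=600, motif_len=15, d=4, seed=0):
    """Build random DNA sequences, each hiding one mutated copy of a shared motif.

    Defaults are assumed values for the Challenge Problem, not taken from the abstract.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(motif_len))
    seqs, positions = [], []
    for _ in range(n_seqs):
        background = [rng.choice(ALPHABET) for _ in range(seq_len)]
        start = rng.randrange(seq_len - motif_len + 1)
        # Overwrite a window of the random background with a mutated motif copy.
        background[start:start + motif_len] = mutate(motif, d, rng)
        seqs.append("".join(background))
        positions.append(start)
    return motif, seqs, positions


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))
```

A motif finder under evaluation would receive only `seqs`; `motif` and `positions` serve as the ground truth against which its predictions are scored.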

Original language: English
Title of host publication: Principles of Data Mining and Knowledge Discovery - 5th European Conference, PKDD 2001, Proceedings
Editors: Arno Siebes, Luc De Raedt
Publisher: Springer Verlag
Number of pages: 13
ISBN (Print): 9783540425342
State: Published - 1 Jan 2001
Event: 5th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD 2001 - Freiburg, Germany
Duration: 3 Sep 2001 → 5 Sep 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 5th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD 2001
