Biological sequence data mining

Yuh-Jyh Hu*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

Biologists have established that the control and regulation of gene expression is primarily determined by relatively short sequences in the region surrounding a gene. These sequences vary in length, position, redundancy, orientation, and bases. Finding these short sequences is a fundamental problem in molecular biology with important applications. Though there exist many different approaches to signal/motif (i.e. short sequence) finding, in 2000 Pevzner and Sze reported that most current motif-finding algorithms are incapable of detecting the target signals in their so-called Challenge Problem. In this paper, we show that, using an iterative-restart design, our new algorithm can correctly find the targets. Furthermore, taking into account the fact that some transcription factors form a dimer or even more complex structures, and that the transcription process can sometimes involve multiple factors, we extend the original problem to an even more challenging one. We address the issue of combinatorial signals with gaps of variable lengths. To demonstrate the efficacy of our algorithm, we tested it on a series of the original and the new challenge problems, and compared it with some representative motif-finding algorithms. In addition, to verify its feasibility in real-world applications, we also tested it on several regulatory families of yeast genes with known motifs. The purpose of this paper is two-fold. One is to introduce an improved biological data mining algorithm that is capable of dealing with more variable regulatory signals in DNA sequences. The other is to propose a new research direction for the general KDD community.
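
The abstract does not restate the Challenge Problem it refers to. As it is usually described, a 15-base pattern is planted once in each of 20 random DNA sequences of length 600, and each planted copy carries 4 substitutions. The sketch below only generates such a test instance under those assumed parameters; it is not the algorithm introduced in the paper, and all function and parameter names are hypothetical.

```python
import random

ALPHABET = "ACGT"


def mutate(motif, d, rng):
    """Return a copy of motif with exactly d positions changed to a different base."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([b for b in ALPHABET if b != chars[pos]])
    return "".join(chars)


def planted_motif_instance(n_seqs=20, seq_len=600, motif_len=15, d=4, seed=0):
    """Build one planted-motif instance (assumed parameters, see lead-in).

    Returns the hidden motif, the sequences, and the start position of the
    planted copy in each sequence.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(motif_len))
    sequences, positions = [], []
    for _ in range(n_seqs):
        # Uniform random background sequence.
        background = [rng.choice(ALPHABET) for _ in range(seq_len)]
        # Overwrite one window with a mutated copy of the motif.
        start = rng.randrange(seq_len - motif_len + 1)
        background[start:start + motif_len] = mutate(motif, d, rng)
        sequences.append("".join(background))
        positions.append(start)
    return motif, sequences, positions


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    motif, seqs, starts = planted_motif_instance()
    print("hidden motif:", motif)
    for seq, start in list(zip(seqs, starts))[:3]:
        window = seq[start:start + len(motif)]
        print(start, window, hamming(window, motif))
```

Because every planted copy differs from the hidden motif in 4 positions, two planted copies can differ from each other in up to 8 of 15 positions, which is one way to see why the signal is hard to separate from random background.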

Original language: English
Title of host publication: Principles of Data Mining and Knowledge Discovery - 5th European Conference, PKDD 2001, Proceedings
Editors: Arno Siebes, Luc De Raedt
Publisher: Springer Verlag
Pages: 228-240
Number of pages: 13
ISBN (Print): 9783540425342
DOI: https://doi.org/10.1007/3-540-44794-6_19
State: Published - 1 Jan 2001
Event: 5th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD 2001 - Freiburg, Germany
Duration: 3 Sep 2001 - 5 Sep 2001

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2168
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 5th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD 2001
Country: Germany
City: Freiburg
Period: 3/09/01 - 5/09/01

  • Cite this

    Hu, Y-J. (2001). Biological sequence data mining. In A. Siebes, & L. De Raedt (Eds.), Principles of Data Mining and Knowledge Discovery - 5th European Conference, PKDD 2001, Proceedings (pp. 228-240). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2168). Springer Verlag. https://doi.org/10.1007/3-540-44794-6_19