Topic-Based Hierarchical Segmentation

Jen-Tzung Chien, Chuang Hua Chueh

Research output: Contribution to journal, Article, peer-reviewed

24 Scopus citations

Abstract

Latent Dirichlet allocation (LDA) is a new topic-model paradigm that is effective at capturing latent topic information from natural language. However, the topic information in text streams, e.g. meeting recordings, lecture transcriptions and conversational dialogues, is inherently heterogeneous and nonstationary, without explicit boundaries, so it is difficult to train a precise topic model from the observed text streams. Furthermore, word usage varies across the paragraphs of a document, which are written in different composition styles. In this paper, we present a new hierarchical segmentation model (HSM) that characterizes both the heterogeneous topic information at the stream level and the word variations at the document level. We incorporate contextual topic information into stream-level segmentation. The topic similarity between sentences is used to form a beta distribution reflecting the prior knowledge of document boundaries in a text stream. The distribution of the segmentation variable is adaptively updated to achieve flexible segmentation and is used to group coherent sentences into a topic-specific document. For each such pseudo-document, we further use a Markov chain to detect the stylistic segments within it. The words in a segment are generated by the same composition style, which differs from the style of the next segment. Each segment is represented by a Markov state, so that the word variations within a document are compensated. The whole model is trained by a variational Bayesian EM procedure and is evaluated on the TDT2 corpus. Experimental results show the benefits of the proposed HSM in terms of perplexity, segmentation error, detection accuracy and F-measure.
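
The stream-level idea in the abstract (topic similarity between sentences forming a beta prior over document boundaries) can be illustrated with a minimal sketch. The abstract does not give the paper's parameterization, so everything here is an assumption made for illustration: the cosine similarity measure, the concentration `kappa`, the mapping from similarity to beta parameters, the 0.5 threshold, and all function names. It also omits the adaptive variational Bayesian update of the segmentation variable and only thresholds the prior mean.

```python
import math

def cosine(u, v):
    """Cosine similarity between two topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def boundary_beta_params(theta_prev, theta_next, kappa=10.0, eps=1e-3):
    """Hypothetical mapping: low topic similarity -> prior mass on 'boundary'.

    Returns (a, b) of a Beta(a, b) prior over the boundary probability.
    kappa (prior strength) and eps (keeps a, b > 0) are assumed values.
    """
    s = min(max(cosine(theta_prev, theta_next), 0.0), 1.0)
    return kappa * (1.0 - s) + eps, kappa * s + eps

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

def segment(thetas, threshold=0.5):
    """Group sentence indices into pseudo-documents using the prior mean only."""
    docs = [[0]]
    for i in range(1, len(thetas)):
        a, b = boundary_beta_params(thetas[i - 1], thetas[i])
        if beta_mean(a, b) > threshold:
            docs.append([i])      # start a new pseudo-document
        else:
            docs[-1].append(i)    # sentence continues the current one
    return docs

# Toy per-sentence topic proportions (made-up numbers, two topics).
thetas = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
print(segment(thetas))
```

In this toy stream the first two sentences share one dominant topic and the last two share the other, so the only adjacent pair with low similarity is the second and third sentence.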

Original language: English
Pages (from-to): 55-66
Number of pages: 12
Journal: IEEE Transactions on Audio, Speech and Language Processing
Volume: 20
Issue number: 1
State: Published - 1 Jan 2012

Keywords

  • Hierarchical model
  • natural language
  • text segmentation
  • topic model
  • variational Bayes
