In the area of multimedia processing, many studies have been devoted to narrowing the semantic gap between multimedia content and human perception. Multimedia understanding remains a difficult and challenging task, even with machine-learning techniques. To address this challenge, in this paper we propose an innovative method that employs data-mining techniques and a content-based paradigm to conceptualize videos. Our method focuses on: (1) the construction of two prediction models, namely a speech-association model Model_sass and a visual-statistical model Model_CRM, and (2) the fusion of these prediction models to annotate unknown videos automatically. Without additional manual effort, the discovered speech-association patterns reveal the implicit relationships among sequential images; in turn, visual features compensate for the inadequacy of the speech-association patterns. Empirical evaluations show that, on average, our approach yields more promising video-annotation results than other methods.
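The abstract does not specify how the two prediction models are fused, so the following is only a minimal sketch of one common approach, weighted late fusion, under assumed per-concept scores in [0, 1]; the function names, weights, and thresholds are illustrative, not the paper's actual formulation.

```python
# Hypothetical late-fusion sketch: each model (speech-association and
# visual-statistical) is assumed to output per-concept annotation scores
# for a video; the fused score is a weighted combination of the two.

def fuse_annotations(speech_scores, visual_scores, alpha=0.5):
    """Combine two models' concept scores by weighted late fusion.

    speech_scores / visual_scores: dicts mapping concept -> score in [0, 1].
    alpha: weight given to the speech-association model (assumed value).
    Concepts missing from one model default to a score of 0.0.
    """
    concepts = set(speech_scores) | set(visual_scores)
    return {
        c: alpha * speech_scores.get(c, 0.0)
           + (1 - alpha) * visual_scores.get(c, 0.0)
        for c in concepts
    }

def annotate(speech_scores, visual_scores, alpha=0.5, threshold=0.4):
    """Return the concepts whose fused score reaches the threshold."""
    fused = fuse_annotations(speech_scores, visual_scores, alpha)
    return sorted(c for c, s in fused.items() if s >= threshold)

# Illustrative scores: speech patterns suggest "beach"; visual
# statistics contribute "sky", which speech alone would miss.
speech = {"beach": 0.9, "crowd": 0.3}
visual = {"beach": 0.7, "sky": 0.8}
print(annotate(speech, visual))  # ['beach', 'sky']
```

The per-concept defaulting to 0.0 mirrors the abstract's point that each modality can compensate where the other is inadequate: a concept supported by only one model can still survive fusion if its single-modality evidence is strong enough.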