Fast query evaluation through document identifier assignment for inverted file-based information retrieval systems

Cher Sheng Cheng*, Chung-Ping Chung, Jyh-Jiun Shann

*Corresponding author for this work

Research output: Contribution to journal › Article

2 Scopus citations

Abstract

Compressing an inverted file can greatly improve the query performance of an information retrieval system (IRS) by reducing disk I/Os. We observe that a good document identifier assignment (DIA) can make the document identifiers in the posting lists more clustered, resulting in better compression as well as shorter query processing time. In this paper, we tackle the NP-complete problem of finding an optimal DIA that minimizes the average query processing time in an IRS when the probability distribution of query terms is given. We show that the greedy nearest neighbor (Greedy-NN) algorithm can provide excellent performance for this problem. However, the Greedy-NN algorithm is unsuitable for large-scale IRSs because of its high complexity, O(N² × n), where N denotes the number of documents and n denotes the number of distinct terms. In real-world IRSs, the distribution of query terms is skewed. Based on this fact, we propose a fast O(N × n) heuristic, called the partition-based document identifier assignment (PBDIA) algorithm, which efficiently assigns consecutive document identifiers to the documents containing frequently used query terms and thereby improves the compression efficiency of the posting lists for those terms. This in turn reduces query processing time. The experimental results show that the PBDIA algorithm yields performance competitive with Greedy-NN for the DIA problem, and that this optimization offers significant advantages for both long queries and parallel information retrieval (IR).
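
To make the clustering argument concrete, the following is a minimal sketch of why consecutive document identifiers compress better under d-gap coding. It is not the paper's implementation: the Elias gamma code is assumed here only as a representative gap code (the abstract does not name the code used), and the posting list, the `assignment` mapping, and all function names are hypothetical.

```python
def d_gaps(posting):
    """Convert a sorted posting list of document IDs (starting at 1) into d-gaps."""
    prev = 0
    gaps = []
    for doc_id in posting:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gamma_bits(gap):
    """Number of bits needed to store a positive integer in Elias gamma code."""
    return 2 * (gap.bit_length() - 1) + 1


def posting_cost(doc_ids):
    """Total gamma-coded size, in bits, of one posting list."""
    return sum(gamma_bits(g) for g in d_gaps(sorted(doc_ids)))


def remap(posting, assignment):
    """Apply a document identifier assignment (old ID -> new ID) to a posting list."""
    return sorted(assignment[d] for d in posting)


# Hypothetical toy data: the documents that contain one frequently queried term.
scattered = [3, 17, 42, 58, 91]

# Hypothetical reassignment that gives those documents consecutive identifiers.
assignment = {3: 1, 17: 2, 42: 3, 58: 4, 91: 5}
clustered = remap(scattered, assignment)

print(d_gaps(scattered), posting_cost(scattered))
print(d_gaps(clustered), posting_cost(clustered))
```

In this toy case the scattered list has gaps 3, 14, 25, 16, 33 and costs 39 bits, while the clustered list has five gaps of 1 and costs 5 bits. A real assignment must trade off many posting lists at once, which is what makes the general problem hard.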

Original language: English
Pages (from-to): 729-750
Number of pages: 22
Journal: Information Processing and Management
Volume: 42
Issue number: 3
State: Published - 1 May 2006

Keywords

  • Document identifier assignment
  • Inverted file compression
  • Inverted index
  • Query evaluation
  • d-Gap technique
