On the web data extraction model

I-Chen Wu, Jui Yuan Su, Loon Been Chen

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

3 Scopus citations

Abstract

This paper investigates data extraction models on the Web. First, it defines the general data extraction model. Second, it introduces the URL-oriented data extraction (UODE) model, which is used in many traditional data extraction systems. In the UODE model, a system extracts URLs from pages and then uses the extracted URLs to access the next pages. However, as more and more pages use script functions, such as JavaScript and VBScript, to access the next pages, it becomes very difficult to extract URLs from script programs. To solve this problem, this paper proposes a new data extraction model, named the browser-oriented data extraction (BODE) model. In this model, data extraction systems built on top of browsers access pages by simulating users' operations on browsers to invoke script functions. A potential problem of the BODE model, however, is the consistency of the extracted data. This paper also shows how to solve this problem in the BODE model.
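The URL-oriented step described in the abstract, and the point where it breaks down, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `LinkExtractor`, the function `extract_links`, the sample page, and the `example.com` address are all hypothetical and exist only for illustration. The sketch collects anchor targets that resolve to a fetchable URL and separately records links whose target can only be reached by running a script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical URL-oriented extractor: collects targets of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []          # links that resolve to a fetchable URL
        self.script_links = []  # links reachable only by executing script

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = (attr_map.get("href") or "").strip()
        is_script_href = href.lower().startswith(("javascript:", "vbscript:"))
        if is_script_href or (not href and "onclick" in attr_map):
            # No URL can be read off the markup; a browser would have to
            # invoke the script function to find out where it leads.
            self.script_links.append(href or attr_map["onclick"])
        elif href:
            # Resolve relative links against the page's own address.
            self.urls.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return (fetchable URLs, script-only links) found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.urls, parser.script_links


# Made-up page fragment mixing a plain link with two script-driven links.
SAMPLE_PAGE = """
<a href="/list?page=2">Next</a>
<a href="javascript:goNext(3)">Next (script)</a>
<a onclick="submitForm()">More</a>
"""

urls, script_links = extract_links(SAMPLE_PAGE, "http://example.com/list?page=1")
```

In this sketch only the first link yields a URL that a crawler could request directly. The other two end up in `script_links`, which is the kind of case the abstract says motivates driving a real browser instead of parsing URLs out of the page.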

Original language: English
Title of host publication: 17th International Conference on Software Engineering and Knowledge Engineering, SEKE 2005
Pages: 330-335
Number of pages: 6
State: Published - 1 Dec 2005
Event: 17th International Conference on Software Engineering and Knowledge Engineering, SEKE 2005 - Taipei, Taiwan
Duration: 14 Jul 2005 - 16 Jul 2005

Publication series

Name: 17th International Conference on Software Engineering and Knowledge Engineering, SEKE 2005

Conference

Conference: 17th International Conference on Software Engineering and Knowledge Engineering, SEKE 2005
Country: Taiwan
City: Taipei
Period: 14/07/05 - 16/07/05

Keywords

  • BODE
  • Data extraction
  • The internet
  • URL
