Abstract
Crawlers in a knowledge management system need to collect and archive documents from websites, and also to track the change status of those documents. However, URL rewriting creates a page-tracking problem: the URLs of two instances of the same dynamic page obtained during different sessions are no longer identical. This paper proposes a series of algorithms, built in a bottom-up manner, that first find the corresponding pairs of dynamic page instances and then determine their change status. Experiments showed that the performance was very good and the outcome was 100% accurate.
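To make the page-tracking problem concrete, the following is a minimal illustrative sketch and not the algorithms proposed in the paper, whose details are not given in this abstract. It assumes that session identifiers appear either as a path parameter (for example `;jsessionid=...`) or as a query parameter; the parameter names in `SESSION_PARAMS` and the helpers `strip_session` and `change_status` are hypothetical.

```python
import hashlib
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; real sites may use others.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sid", "sessionid"}

# Matches a path parameter such as ";jsessionid=ABC123".
_PATH_SESSION = re.compile(
    r";(?:%s)=[^/?#;]*" % "|".join(sorted(SESSION_PARAMS)), re.IGNORECASE
)


def strip_session(url: str) -> str:
    """Return the URL with assumed session identifiers (and fragment) removed."""
    parts = urlsplit(url)
    path = _PATH_SESSION.sub("", parts.path)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(query), ""))


def change_status(old_pages: dict, new_pages: dict) -> dict:
    """Pair page instances from two crawls and classify each pair.

    Both arguments map a crawled URL to the page body as bytes.
    """
    old = {strip_session(u): hashlib.sha256(b).hexdigest() for u, b in old_pages.items()}
    new = {strip_session(u): hashlib.sha256(b).hexdigest() for u, b in new_pages.items()}
    status = {}
    for key in old.keys() | new.keys():
        if key not in old:
            status[key] = "added"
        elif key not in new:
            status[key] = "removed"
        elif old[key] == new[key]:
            status[key] = "unchanged"
        else:
            status[key] = "changed"
    return status
```

Under these assumptions, `http://example.com/doc.jsp;jsessionid=ABC123?id=7` and `http://example.com/doc.jsp;jsessionid=XYZ789?id=7` both normalize to `http://example.com/doc.jsp?id=7`, so the two instances can be paired across sessions. The sketch compares whole-page hashes, so a page whose body embeds rewritten links would be reported as changed even when its visible content is identical; handling that case requires finer-grained matching than is shown here.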
Original language | English |
---|---|
Pages (from-to) | 169-176 |
Number of pages | 8 |
Journal | Conferences in Research and Practice in Information Technology Series |
Volume | 61 |
State | Published - 1 Dec 2006 |
Event | 5th Australasian Data Mining Conference, AusDM 2006 - Sydney, NSW, Australia (29 Nov 2006 → 30 Nov 2006) |
Keywords
- Crawler
- HTTP session
- String matching
- URL rewriting