Title | Soft-404 pages, a crawling problem |
Publication Type | Journal Article |
Year of Publication | 2014 |
Authors | Prieto, VM, Álvarez, M, Cacheda, F |
Journal | Journal of Digital Information Management |
Volume | 12 |
Issue | 2 |
Pagination | 73 - 92 |
Date Published | 2014 |
Keywords | Algorithms, Data mining, Design, Link analysis, Performance, Soft-404 error, Statistical properties of the web, Web decay, Web spam |
Abstract | During their traversal of the Web, crawler systems have to deal with multiple challenges. Some of these involve detecting garbage content so that resources are not wasted processing it. Soft-404 pages are a type of garbage content generated when some web servers do not use the appropriate HTTP response code for dead links, causing those links to be incorrectly identified as valid. Our analysis of the Web has revealed that 7.35% of web servers send a 200 HTTP code, instead of a 404 code (which indicates that the document was not found), when a request for an unknown document is received. This paper presents a system called Soft404Detector, based on web content analysis, that identifies Soft-404 pages. Our system uses a set of content-based heuristics and combines them with a C4.5 classifier. For testing purposes, we built a dataset of Soft-404 pages. Our experiments indicate that our system is very effective, achieving a precision of 0.992 and a recall of 0.980 on Soft-404 pages. |
URL | http://www.scopus.com/inward/record.url?eid=2-s2.0-84903202899&partnerID=40&md5=fa1bd1e59e4da6c413c6527da17e6e6b |