Web information extraction using web-specific features

Title: Web information extraction using web-specific features
Publication Type: Journal Article
Year of Publication: 2008
Authors: Chen, J, Zhong, P
Journal: Journal of Digital Information Management
Volume: 6
Issue: 3
Pagination: 235 - 243
Date Published: 2008
Keywords: Hidden, Information extraction, Markov model
Abstract

Traditional HMM-based approaches to Web information extraction (IE) have several problems because they do not take Web-specific features into account. To address this issue, we present a Generalized Hidden Markov Model (GHMM) that extends HMMs by making use of Web-specific information for Web IE. In the GHMM-based approach, Web content blocks, rather than terms, are used as the basic extraction unit. In addition, instead of using the traditional sequential state transition order, GHMM decides the state transition order based on the layout structure of the corresponding Web page. Furthermore, GHMM uses multiple emission features derived from Web information instead of a single emission feature. An experimental study shows that the GHMM-based approach can effectively improve Web IE compared with traditional HMM-based approaches.
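
The three ideas in the abstract (content blocks as the extraction unit, a layout-driven ordering, and multiple emission features) can be illustrated with a small Viterbi decoder. This is only a rough sketch under stated assumptions, not the authors' GHMM: the `Block` fields, the top-to-bottom and left-to-right ordering, the two features, the naive independence assumption used to combine them, and all probabilities are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Block:
    """A Web content block; all fields are hypothetical illustration choices."""
    text: str
    top: int         # assumed vertical position on the rendered page
    left: int        # assumed horizontal position on the rendered page
    font_size: str   # assumed discretized feature: "large" or "small"
    has_digit: bool  # assumed feature: does the block contain a digit?


def layout_order(blocks):
    # Assumption: visit blocks top-to-bottom, then left-to-right,
    # as a stand-in for an ordering derived from the page layout.
    return sorted(blocks, key=lambda b: (b.top, b.left))


def emission_logp(state, block, emit):
    # Combine several emission features by summing log-probabilities,
    # i.e. treating the features as independent given the state (an assumption).
    feats = {"font": block.font_size, "digit": block.has_digit}
    return sum(math.log(emit[state][name][value]) for name, value in feats.items())


def viterbi(blocks, states, start, trans, emit):
    """Return (block text, most likely state) pairs in layout order."""
    ordered = layout_order(blocks)
    scores = {s: math.log(start[s]) + emission_logp(s, ordered[0], emit)
              for s in states}
    back = []
    for block in ordered[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans[p][s]))
            scores[s] = (prev[best] + math.log(trans[best][s])
                         + emission_logp(s, block, emit))
            pointers[s] = best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [(b.text, s) for b, s in zip(ordered, path)]


# Toy model with made-up probabilities.
states = ["title", "price"]
start = {"title": 0.8, "price": 0.2}
trans = {"title": {"title": 0.3, "price": 0.7},
         "price": {"title": 0.4, "price": 0.6}}
emit = {
    "title": {"font": {"large": 0.9, "small": 0.1},
              "digit": {True: 0.2, False: 0.8}},
    "price": {"font": {"large": 0.2, "small": 0.8},
              "digit": {True: 0.9, False: 0.1}},
}

# Blocks are given out of order on purpose; layout_order() sorts them.
blocks = [
    Block("$19.99", top=40, left=0, font_size="small", has_digit=True),
    Block("Acme Widget", top=10, left=0, font_size="large", has_digit=False),
]
result = viterbi(blocks, states, start, trans, emit)
print(result)  # [('Acme Widget', 'title'), ('$19.99', 'price')]
```

In this toy setup each observation is a whole block rather than a single term, and the decoding order comes from block positions rather than from the order of tokens in the page source.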

URL: http://www.scopus.com/inward/record.url?eid=2-s2.0-70350728184&partnerID=40&md5=f0b7ac3e808f4c890768399e96e15bca

Collaborative Partner

Institute of Electronic and Information Technology (IEIT)
