Title | Web information extraction using web-specific features |
Publication Type | Journal Article |
Year of Publication | 2008 |
Authors | Chen, J, Zhong, P |
Journal | Journal of Digital Information Management |
Volume | 6 |
Issue | 3 |
Pagination | 235 - 243 |
Date Published | 2008 |
Keywords | Hidden Markov model, Information extraction |
Abstract | Traditional HMM-based approaches to Web information extraction (IE) suffer from several problems because they do not take Web-specific features into account. To address this issue, we present a Generalized Hidden Markov Model (GHMM) that extends HMMs by making use of Web-specific information for Web IE. In the GHMM-based approach, Web content blocks, rather than terms, are used as the basic extraction unit. In addition, instead of the traditional sequential state transition order, GHMM determines the state transition order from the layout structure of the corresponding Web page. Furthermore, GHMM uses multiple emission features derived from Web information instead of a single emission feature. An experimental study shows that the GHMM-based approach can effectively improve Web IE compared with traditional HMM-based approaches. |
URL | http://www.scopus.com/inward/record.url?eid=2-s2.0-70350728184&partnerID=40&md5=f0b7ac3e808f4c890768399e96e15bca |