Extracting information from coarser-grained data in XML documents

Title: Extracting information from coarser-grained data in XML documents
Publication Type: Journal Article
Year of Publication: 2006
Authors: Badr, Y
Journal: Journal of Digital Information Management
Volume: 4
Issue: 2
Pagination: 117 - 123
Date Published: 2006
Keywords: Data exchange, Finite state transducers, XML database, Xtractor wrapper
Abstract

XML is fast emerging as the dominant standard for representing data in application-centric documents. While a great deal of recent work proposes extracting relevant data from natural language texts, most of it must confront the irregular structure hidden in the text. To this end, a large spectrum of wrappers has been conceived for web pages. Unfortunately, they cannot deal with semi-structured data and still cannot take natural language processing into consideration. In this paper, we present a specification language for writing expressive and easy extraction patterns. The specification follows a regular-expression style so that non-expert users can write patterns. In addition, we introduce the Xtractor wrapper for coarser-grained data (i.e. paragraphs). The Xtractor hinges on linguistic parsing of paragraphs and applies technical and natural language dictionaries. It then applies the extraction patterns to the pre-processed paragraphs in order to locate relevant data. The key idea of our approach consists of translating the extraction patterns to Finite State Transducers (FST) and even using the FST to build the domain-specific dictionaries.
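
To make the idea of matching an extraction pattern with a finite state transducer concrete, here is a minimal sketch in Python. It is not the paper's specification language or the Xtractor implementation, neither of which the abstract details. The dictionary entries, the token categories (`NUM`, `UNIT`, `WORD`), the transition table, and the function names are all assumptions made for illustration. The sketch tags each token of a paragraph using a small dictionary, then runs a transducer that consumes token categories and emits labeled fields.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical domain dictionary (an assumption, not from the paper):
# maps lower-cased tokens to a coarse category.
DICTIONARY: Dict[str, str] = {"mpa": "UNIT", "kg": "UNIT", "mm": "UNIT"}


def tag(token: str) -> str:
    """Assign a category to a token; a crude stand-in for linguistic parsing."""
    if token.replace(".", "", 1).isdigit():
        return "NUM"
    return DICTIONARY.get(token.lower(), "WORD")


# Transducer for the illustrative pattern "NUM UNIT":
# (state, input category) -> (next state, output label).
TRANSITIONS: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("start", "NUM"): ("after_num", "value"),
    ("after_num", "UNIT"): ("accept", "unit"),
}


def run_fst(tokens: List[str], start_index: int) -> Optional[Dict[str, str]]:
    """Run the transducer from start_index; return emitted fields or None."""
    state = "start"
    output: Dict[str, str] = {}
    i = start_index
    while state != "accept":
        if i >= len(tokens):
            return None  # input ended before reaching the accepting state
        key = (state, tag(tokens[i]))
        if key not in TRANSITIONS:
            return None  # no transition defined: the pattern does not match here
        state, label = TRANSITIONS[key]
        output[label] = tokens[i]  # emit the token under its output label
        i += 1
    return output


def extract(paragraph: str) -> List[Dict[str, str]]:
    """Try the transducer at every token position of a whitespace-split paragraph."""
    tokens = paragraph.split()
    matches = []
    for i in range(len(tokens)):
        match = run_fst(tokens, i)
        if match is not None:
            matches.append(match)
    return matches


if __name__ == "__main__":
    print(extract("The beam withstands 250 MPa under a load of 12.5 kg"))
```

Representing the pattern as a transition table keeps the matcher generic: a different pattern only needs a different table, which is one plausible reading of why translating patterns to transducers is attractive.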

URL: http://www.scopus.com/inward/record.url?eid=2-s2.0-44649098973&partnerID=40&md5=e3df331dc7e1172e6d94226eecc4f744

Collaborative Partner

Institute of Electronic and Information Technology (IEIT)
