Hybrid clustering approach for term partitioning in document data sets

Title	Hybrid clustering approach for term partitioning in document data sets
Publication Type	Journal Article
Year of Publication	2008
Authors	Reddy, KT, Shashi, M, Reddy, LP
Journal	Journal of Digital Information Management
Volume	6
Issue	3
Pagination	272 - 277
Date Published	2008
Keywords	Dimensionality reduction, Hierarchical clustering, K means algorithm, Partitional clustering, Text mining
Abstract	Information retrieval is a major research area owing to the accumulation of vast amounts of information in digital form. Various information retrieval techniques are based on the fact that the terms contained in a document, along with their frequency of occurrence, signify the semantics of the document. Recent attempts to find the relevant document for a context represent documents in a vector space model as document-term vectors containing term weights for every index term in that document. Because the number of index terms is enormous, this leads to a high-dimensionality problem. We can reduce the dimensionality based on the observation that groups of terms associated with related concepts either occur together in a document or are absent from it, depending on whether the document is relevant to that concept. Such a group of terms is identified as an equivalence class and can be viewed as a single dimension in a rough-set-based information retrieval system. In this paper we present a hybrid clustering approach for the formation of equivalence classes of terms associated with related concepts. It uses the outcome of hierarchical clustering to provide seed points for the incremental K-means algorithm. Owing to the sparsity of the term vectors, the cosine similarity estimate was found to be ineffective for term clustering. Another promising proximity measure used in information retrieval, namely Euclidean distance, has the drawback that it is biased towards changes in the term frequencies in larger documents when the term weights are represented by tf-idf estimates. Hence we propose normalization of the tf-idf estimates when representing a term as a vector in document space before clustering the terms.
URL	http://www.scopus.com/inward/record.url?eid=2-s2.0-70350735762&partnerID=40&md5=911fe260f9a7d869bf7ee8660851cdac
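
The pipeline described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy corpus, the function names, the choice of L2 normalization, centroid-linkage merging, and standard batch K-means (in place of the incremental variant the paper names) are all assumptions made for illustration. Each term is represented as a vector of tf-idf weights across documents, the vectors are normalized, hierarchical merging supplies the seed points, and K-means then refines the partition using Euclidean distance.

```python
import math

def tfidf_term_vectors(docs):
    """Return (terms, vectors): one tf-idf vector per term, one entry per document."""
    n = len(docs)
    terms = sorted({t for d in docs for t in d})
    vectors = []
    for term in terms:
        df = sum(1 for d in docs if term in d)   # document frequency
        idf = math.log(n / df)
        vectors.append([d.count(term) * idf for d in docs])
    return terms, vectors

def l2_normalize(v):
    """Scale a vector to unit length (assumed normalization; zero vectors are kept)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def hierarchical_seeds(vectors, k):
    """Merge the two closest clusters (by centroid distance) until k remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return [centroid(c) for c in clusters]

def kmeans(vectors, seeds, iterations=20):
    """Standard batch K-means started from the given seed points."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(len(centers)), key=lambda c: euclidean(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:          # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels

# Hypothetical toy corpus: each document is a list of index terms.
docs = [
    ["cluster", "kmeans", "cluster", "centroid"],
    ["kmeans", "centroid", "cluster"],
    ["query", "index", "retrieval"],
    ["retrieval", "query", "index", "query"],
]

terms, raw = tfidf_term_vectors(docs)
vectors = [l2_normalize(v) for v in raw]
labels = kmeans(vectors, hierarchical_seeds(vectors, k=2))
for term, label in zip(terms, labels):
    print(term, label)
```

Each resulting group of terms plays the role of one equivalence class, so a document could then be described by one dimension per group rather than one per term.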
