Title | Hybrid clustering approach for term partitioning in document data sets |
Publication Type | Journal Article |
Year of Publication | 2008 |
Authors | Reddy, KT, Shashi, M, Reddy, LP |
Journal | Journal of Digital Information Management |
Volume | 6 |
Issue | 3 |
Pagination | 272 - 277 |
Date Published | 2008 |
Keywords | Dimensionality reduction, Hierarchical clustering, K means algorithm, Partitional clustering, Text mining |
Abstract | Information retrieval is a major research area owing to the accumulation of huge amounts of information in digital form. Various information retrieval techniques are based on the fact that the terms contained in a document, along with their frequencies of occurrence, signify the semantics of the document. Recent attempts to find the relevant documents for a context represent documents in a vector space model, as document-term vectors containing term weights for every index term in the document. Because the number of index terms is enormous, this leads to a high-dimensionality problem. The dimensionality can be reduced based on the observation that groups of terms associated with related concepts either occur together in a document or are absent together, depending on whether the document is relevant to that concept. Such a group of terms is identified as an equivalence class and can be viewed as a single dimension in a rough-set-based information retrieval system. In this paper we present a hybrid clustering approach for forming equivalence classes of terms associated with related concepts. It uses the outcome of hierarchical clustering to provide seed points for an incremental K-means algorithm. Owing to the sparsity of the term vectors, the cosine similarity estimate was found to be ineffective for term clustering. Another promising proximity measure used in information retrieval, Euclidean distance, has the drawback that it is biased towards changes in term frequencies in larger documents when the term weights are represented by tf-idf estimates. Hence we propose normalizing the tf-idf estimates when representing a term as a vector in document space, before clustering the terms. |
URL | http://www.scopus.com/inward/record.url?eid=2-s2.0-70350735762&partnerID=40&md5=911fe260f9a7d869bf7ee8660851cdac |
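
The pipeline described in the abstract (normalized tf-idf term vectors, hierarchical clustering to obtain seed points, then K-means) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions the abstract does not specify: the idf formula `log(N / df)`, per-document L2 normalization as the normalization step, centroid linkage for the hierarchical stage, and plain batch K-means swapped in for the incremental variant. All function names are hypothetical.

```python
import math

def term_vectors(docs):
    """Map each term to its vector of tf-idf weights over the documents.

    Assumption: each document's weights are L2-normalized so that large
    documents do not dominate Euclidean distances between terms.
    """
    n = len(docs)
    vocab = sorted({t for d in docs for t in d})
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    rows = []
    for d in docs:
        # Assumed weighting: raw term count times log(N / df).
        w = {t: d.count(t) * math.log(n / df[t]) for t in set(d)}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        rows.append({t: v / norm for t, v in w.items()})
    return {t: [r.get(t, 0.0) for r in rows] for t in vocab}

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def hierarchical_seeds(vectors, k):
    """Agglomerative clustering (assumed centroid linkage) down to k
    clusters; the cluster centroids are returned as seed points."""
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return [centroid(c) for c in clusters]

def kmeans(vectors, seeds, iters=20):
    """Batch K-means started from the given seeds; returns one label per vector."""
    centers = [list(s) for s in seeds]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda c: dist(v, centers[c]))
                  for v in vectors]
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = centroid(members)
    return labels

def cluster_terms(docs, k):
    """Group terms into k clusters; returns {term: cluster label}."""
    vecs = term_vectors(docs)
    terms = list(vecs)
    data = [vecs[t] for t in terms]
    labels = kmeans(data, hierarchical_seeds(data, k))
    return dict(zip(terms, labels))
```

For example, `cluster_terms(docs, 2)` on a list of tokenized documents returns a dictionary from each term to a cluster label; terms sharing a label would play the role of one equivalence class. The pairwise search in `hierarchical_seeds` is quadratic per merge, so the sketch is only suitable for small vocabularies.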