Evaluation of topic identification methods on Arabic corpora

Title	Evaluation of topic identification methods on Arabic corpora
Publication Type	Journal Article
Year of Publication	2011
Authors	Abbas, M, Smaili, K, Berkani, D
Journal	Journal of Digital Information Management
Volume	9
Issue	5
Pagination	185 - 192
Date Published	2011
Keywords	Alwatan-2004 corpus, Arabic language, Neural network, SVM, Topic Identification, TR, TULM
Abstract	Topic identification is key to the success of many applications. However, few works in this field address the Arabic language because of the lack of standard corpora. In this study, we provide directly comparable results for six text categorization methods on a new Arabic corpus, Alwatan-2004. Topic Unigram Language Model (TULM), Term Frequency/Inverse Document Frequency (TFIDF), Neural Network, SVM, M-SVM and TR were evaluated. The results show that the TR-Classifier is the most efficient among the set of classifiers; only binary SVM outperformed it, thanks to its characteristics. Moreover, the size of the Alwatan-2004 corpus used in our experiments is considered the most important compared to any other Arabic corpus used for topic identification experiments until now. In addition, by using small vocabularies, we aim to reduce computation time. This is important for adaptive language modeling, particularly topic adaptation, which is required in real-time applications such as speech recognition and machine translation systems. Our experiments indicate that the results are better than those of other works dealing with Arabic text categorization.
URL	http://www.scopus.com/inward/record.url?eid=2-s2.0-84855410016&partnerID=40&md5=8736f4304c67d00c33512bbcb14394fa