Optimization of Topic Recognition Model for News Texts Based on LDA-

Title: Optimization of Topic Recognition Model for News Texts Based on LDA-
Publication Type: Journal Article
Year of Publication: 2019
Authors: Wang, H, Wang, J, Zhang, Y, Wang, M, Mao, C
Journal: Journal of Digital Information Management
Volume: 17
Issue: 5
Start Page: 257
Pagination: 257-269
Date Published: 10/2019
Type of Article: Research
Abstract

Latent Dirichlet Allocation (LDA) is the most commonly used technique in topic modeling, but it requires the number of topics to be specified in advance. Apart from the mainstream iterative methods based on perplexity and the nonparametric methods, recent research offers no simple way to select the optimal number of topics. To determine the number of topics appropriately and thereby optimize the LDA topic model, this paper proposes a non-iterative method that determines the number of topics automatically. The method is based on a clustering algorithm that quickly searches for and finds density peaks. It transforms the traditional problem of selecting the number of topics into a clustering problem and can thus be used to optimize the topic recognition model for news texts. It needs no iterative optimization, which simplifies model development. The method first applies Word2Vec to embed the words of the corpus, exploiting the ability of word embeddings to capture relationships between words and to express the implicit semantic relationships within the corpus. The resulting word vectors are then clustered with the density-peak algorithm, and the number of clusters obtained is used as the number of topics in the text. Experimental results show that the proposed method achieves better precision and F1 value than the perplexity-based method and is suitable for identifying the number of topics in corpora of different sizes. The method can effectively find an appropriate number of topics for a news text dataset and improve the accuracy of the LDA topic model.
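
The core idea of the abstract, counting density peaks among word vectors and using that count as the number of topics, can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `estimate_num_clusters`, the Gaussian density kernel, the percentile used for the cutoff distance, and the rule for selecting peaks (`gamma` above mean plus `n_sigma` standard deviations) are all assumptions made for illustration, and the toy data in `toy_vectors` stands in for real Word2Vec embeddings.

```python
import numpy as np


def estimate_num_clusters(vectors, dc_percentile=2.0, n_sigma=2.0):
    """Count density peaks in a set of vectors (illustrative sketch only)."""
    X = np.asarray(vectors, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances between all vectors.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Cutoff distance d_c: a low percentile of the off-diagonal distances
    # (assumed heuristic; the paper's choice is not given in the abstract).
    dc = np.percentile(dist[np.triu_indices(n, k=1)], dc_percentile)

    # Local density rho with a Gaussian kernel; subtract the self term.
    rho = np.exp(-((dist / dc) ** 2)).sum(axis=1) - 1.0

    # delta: distance to the nearest point of higher density.
    order = np.argsort(-rho)  # indices sorted by decreasing density
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()

    # Peaks have both high density and a large distance to any denser point.
    # Assumed selection rule: gamma = rho * delta is a statistical outlier.
    gamma = rho * delta
    threshold = gamma.mean() + n_sigma * gamma.std()
    return int((gamma > threshold).sum())


def toy_vectors(seed=0):
    """Hypothetical stand-in for word vectors: three well-separated blobs."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    return np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in centers])


if __name__ == "__main__":
    k = estimate_num_clusters(toy_vectors())
    print("estimated number of clusters:", k)
```

In a full pipeline, the input would be the embedding matrix of a trained Word2Vec model (for example `model.wv.vectors` in gensim), and the returned count would be passed as `num_topics` when fitting the LDA model. The pairwise distance matrix grows quadratically with vocabulary size, so a real corpus would likely need vocabulary pruning or a more memory-efficient distance computation.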

URL: http://dline.info/fpaper/jdim/v17i5/jdimv17i5_1.pdf
DOI: 10.6025/jdim/2019/17/5/257-269
Refereed Designation: Refereed

Collaborative Partner

Institute of Electronic and Information Technology (IEIT)
