GLMTopic: A Hybrid Chinese Topic Model Leveraging Large Language Models
- Publisher: TECH SCIENCE PRESS
- Publication Type: Journal Article
- Citation: Computers Materials and Continua, 2025, 85, (1), pp. 1559-1583
- Issue Date: 2025-01-01
This item is open access.
Topic modeling is a fundamental technique for content analysis in natural language processing, widely applied in domains such as the social sciences and finance. In the era of digital communication, social scientists increasingly rely on large-scale social media data to explore public discourse, collective behavior, and emerging social concerns. However, traditional models like Latent Dirichlet Allocation (LDA) and neural topic models like BERTopic struggle to capture deep semantic structures in short-text datasets, especially in complex non-English languages such as Chinese. This paper presents Generative Language Model Topic (GLMTopic), a novel hybrid topic modeling framework that leverages the capabilities of large language models and is designed to support social science research by uncovering coherent and interpretable themes from Chinese social media platforms. GLMTopic integrates Adaptive Community-enhanced Graph Embedding for advanced semantic representation, Uniform Manifold Approximation and Projection-based (UMAP-based) dimensionality reduction, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), and large language model-powered (LLM-powered) representation tuning to generate more contextually relevant and interpretable topics. By reducing dependence on extensive text preprocessing and on human expert intervention for post-analysis topic labeling, GLMTopic facilitates a fully automated and user-friendly topic extraction process. Experimental evaluations on a social media dataset sourced from Weibo demonstrate that GLMTopic outperforms LDA and BERTopic in coherence score and, through automated interpretation, in usability, providing a more scalable and semantically accurate solution for Chinese topic modeling.
Future research will explore optimizing computational efficiency, integrating knowledge graphs and sentiment analysis for more complex workflows, and extending the framework to real-time and multilingual topic modeling.
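The abstract reports gains in coherence score but does not name the specific metric. Purely as an illustration, the sketch below computes normalized pointwise mutual information (NPMI), one commonly used coherence measure; choosing NPMI is an assumption here, not necessarily what the paper uses. The function name `npmi_coherence`, the toy corpus, and the English tokens standing in for segmented Chinese words are all hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Mean NPMI over all word pairs of a topic (assumed metric, see lead-in).

    Probabilities are estimated from document-level co-occurrence:
    p(w) is the fraction of documents that contain w.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI
        elif p12 == 1:
            scores.append(0.0)   # degenerate case: both words in every document
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))

    if not scores:
        raise ValueError("a topic needs at least two words")
    return sum(scores) / len(scores)


# Toy corpus of already-tokenized posts (hypothetical data).
docs = [
    ["vaccine", "mask", "clinic"],
    ["vaccine", "mask"],
    ["stock", "market"],
    ["stock", "market", "mask"],
]

coherent_topic = npmi_coherence(["stock", "market"], docs)
mixed_topic = npmi_coherence(["vaccine", "stock"], docs)
```

On this toy corpus the topic whose words always appear together scores higher than the topic whose words never share a document, which is the behavior a coherence comparison between models relies on. Real Chinese text would first need word segmentation, a step the sketch deliberately leaves out.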