Neural Topic Modelling with Deep Generative Models

Publication Type:
Thesis
Issue Date:
2023
Full metadata record
Topic modelling is a popular task of natural language processing (NLP) aimed to automatically discover the main, shared topics of a given collection of documents. In addition, topic modelling is able to determine the topic proportions of each individual document in the collection, which can help with their categorization and organization. Over the years, topic models have found application and proved useful for a broad variety of fields including business, finance, healthcare, education, the media industry, social media, digital agriculture and many others. Like many other applications of NLP and machine learning, in recent times topic models have substantially improved their effectiveness thanks to the integration with deep learning---and deep generative models in particular---which has gained them the collective appellation of neural topic models. However, many improvements are still possible and needed, and this thesis has aimed to make significant contributions in this direction. As a first contribution, we have explored the use of reinforcement learning for refining the training of the models. To this aim, we have proposed novel training objectives based on the policy gradient theorem and contemporary gradient estimators such as REINFORCE with baseline, the Gumbel-Softmax and REBAR. The experimental results over several topic modelling datasets have invariably shown the improved performance of the models. As a second contribution, we have explored how to integrate the powerful, contextualized document representations (i.e., Transformer-based embeddings) in the training objective of the model. This, too, has led to marked performance improvements over probing datasets. Eventually, we have extended the investigation to dynamic topic models, which are models capable of analyzing time-stamped document collections and extracting sets of topics that adapt over time. For these models, we have proposed a modification of the topic distributions which allows controlling their sparsity, thus adjusting to the characteristics of the collection to be analyzed. Once more, the experimental results have given evidence to the effectiveness of the proposed approach.
Please use this identifier to cite or link to this item: