SMGKM: An Efficient Incremental Algorithm for Clustering Document Collections

Publisher:
Springer Nature
Publication Type:
Chapter
Citation:
Computational Linguistics and Intelligent Text Processing, 2023, 13397 LNCS, pp. 314-328
Issue Date:
2023-01-01
Files (filename, description, size, format):
CICLing 2018 review results, paper 132.pdf, Supporting information, 50.43 kB, Adobe PDF
ToC.pdf, Supporting information, 162.91 kB, Adobe PDF
2s2.0-85149969615 AM.pdf, Accepted version, 1.07 MB, Adobe PDF
Given a large unlabeled document collection, the aim of this paper is to develop an accurate and efficient algorithm for solving the clustering problem over this collection. Document collections typically contain tens or hundreds of thousands of documents, with thousands or tens of thousands of features (i.e., distinct words). Most existing clustering algorithms struggle to find accurate solutions on such large data sets. The proposed algorithm overcomes this difficulty with an incremental approach, increasing the number of clusters progressively from an initial value of one to a specified value. At each iteration, the new candidate cluster is initialized using a partitioning approach which is guaranteed to minimize the objective function. Experiments on six diverse datasets, using several evaluation criteria, show that the proposed algorithm outperforms comparable state-of-the-art clustering algorithms in all cases.
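The incremental scheme described in the abstract can be illustrated with a minimal sketch. This is not the SMGKM algorithm itself: the abstract does not specify the partitioning-based initialization, so the sketch substitutes a simple farthest-point heuristic for choosing each new candidate center, and refines centers with ordinary Lloyd (k-means) iterations under squared Euclidean distance. The function names `assign`, `lloyd`, and `incremental_kmeans` are hypothetical, chosen only for illustration.

```python
import numpy as np


def assign(X, centers):
    """Return each point's nearest center index and squared distance to it."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centers).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, d2[np.arange(len(X)), labels]


def lloyd(X, centers, n_iter=50):
    """Refine the given centers with standard k-means (Lloyd) iterations."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels, _ = assign(X, centers)
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous center
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers


def incremental_kmeans(X, k_max):
    """Grow the number of clusters one at a time, from 1 up to k_max."""
    X = np.asarray(X, dtype=float)
    # With a single cluster, the optimal center is the mean of all points.
    centers = X.mean(axis=0, keepdims=True)
    while len(centers) < k_max:
        _, d2 = assign(X, centers)
        # ASSUMPTION: farthest-point heuristic, standing in for the paper's
        # partitioning-based initialization of the new candidate cluster.
        candidate = X[d2.argmax()]
        centers = lloyd(X, np.vstack([centers, candidate]))
    labels, d2 = assign(X, centers)
    return centers, labels, d2.sum()


# Toy data: two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
centers, labels, objective = incremental_kmeans(X, k_max=2)
```

On real document collections the rows of `X` would be sparse, high-dimensional term vectors, and the dense pairwise distance computation used here would need to be replaced with something more memory-efficient.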