SMGKM: An Efficient Incremental Algorithm for Clustering Document Collections

Bagirov, A; Seifollahi, S; Piccardi, M; Zare Borzeshi, E; Kruger, B

Publisher: Springer Nature
Publication Type: Chapter
Citation: Computational Linguistics and Intelligent Text Processing, 2023, 13397 LNCS, pp. 314-328
Issue Date: 2023-01-01

Filename	Description	Size	Format
CICLing 2018 review results, paper 132.pdf	Supporting information	50.43 kB	Adobe PDF
ToC.pdf	Supporting information	162.91 kB	Adobe PDF
2s2.0-85149969615 AM.pdf	Accepted version	1.07 MB	Adobe PDF

This item is being processed and is not currently available.

The embargo period expires on 1 Jan 2025

Full metadata record

Field	Value	Language
dc.contributor.author	Bagirov, A
dc.contributor.author	Seifollahi, S
dc.contributor.author	Piccardi, M https://orcid.org/0000-0001-9250-6604
dc.contributor.author	Zare Borzeshi, E
dc.contributor.author	Kruger, B
dc.date.accessioned	2023-06-15T10:29:42Z
dc.date.available	2023-06-15T10:29:42Z
dc.date.issued	2023-01-01
dc.identifier.citation	Computational Linguistics and Intelligent Text Processing, 2023, 13397 LNCS, pp. 314-328
dc.identifier.isbn	9783031238031
dc.identifier.uri	http://hdl.handle.net/10453/170765
dc.description.abstract	This paper aims to develop an accurate and efficient algorithm for clustering a large unlabeled document collection. Document collections typically contain tens or hundreds of thousands of documents, with thousands or tens of thousands of features (i.e., distinct words). Most existing clustering algorithms struggle to find accurate solutions on such large data sets. The proposed algorithm overcomes this difficulty with an incremental approach that increases the number of clusters progressively from one to a preset value. At each iteration, the new candidate cluster is initialized using a partitioning approach which is guaranteed to minimize the objective function. Experiments were carried out over six diverse datasets with different evaluation criteria, and the proposed algorithm outperformed comparable state-of-the-art clustering algorithms in all cases.
dc.language	en
dc.publisher	Springer Nature
dc.relation.ispartof	Computational Linguistics and Intelligent Text Processing
dc.relation.ispartofseries	Lecture Notes in Computer Science
dc.relation.isbasedon	10.1007/978-3-031-23804-8_25
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.subject.classification	Artificial Intelligence & Image Processing
dc.title	SMGKM: An Efficient Incremental Algorithm for Clustering Document Collections
dc.type	Chapter
utslib.citation.volume	13397 LNCS
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	in_progress	*
utslib.copyright.embargo	2025-01-01T00:00:00+1000Z
dc.date.updated	2023-06-15T10:29:38Z
pubs.publication-status	Published
pubs.volume	13397 LNCS

Abstract:

This paper aims to develop an accurate and efficient algorithm for clustering a large unlabeled document collection. Document collections typically contain tens or hundreds of thousands of documents, with thousands or tens of thousands of features (i.e., distinct words). Most existing clustering algorithms struggle to find accurate solutions on such large data sets. The proposed algorithm overcomes this difficulty with an incremental approach that increases the number of clusters progressively from one to a preset value. At each iteration, the new candidate cluster is initialized using a partitioning approach which is guaranteed to minimize the objective function. Experiments were carried out over six diverse datasets with different evaluation criteria, and the proposed algorithm outperformed comparable state-of-the-art clustering algorithms in all cases.
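
The incremental idea described in the abstract can be illustrated with a short sketch. This is not the SMGKM algorithm itself: the abstract does not give the objective function or the partitioning-based initialization, so the code below assumes a sum-of-squared-Euclidean-distances objective and swaps in a simple global k-means-style heuristic (pick the data point that would reduce the objective most as the next starting center). All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iterations from the given starting centers.

    Returns (centers, objective), where objective is the sum of squared
    distances from each point to its nearest center (an assumed objective).
    """
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    objective = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, objective


def incremental_kmeans(points, k_max):
    """Grow the solution from 1 cluster to k_max clusters, one at a time."""
    n = len(points)
    # k = 1: the single center is the centroid of the whole collection.
    centers = [[sum(col) / n for col in zip(*points)]]
    for _ in range(2, k_max + 1):
        # Squared distance of each point to its nearest current center.
        nearest = [min(sq_dist(p, c) for c in centers) for p in points]

        # Heuristic stand-in for the paper's initialization: score each data
        # point by how much the objective would drop if it became a center.
        def gain(candidate):
            return sum(
                max(0.0, d - sq_dist(p, candidate))
                for p, d in zip(points, nearest)
            )

        best = max(points, key=gain)
        # Refine all centers together, starting from the previous solution
        # plus the new candidate center.
        centers, _ = kmeans(points, centers + [list(best)])
    return centers
```

Calling `incremental_kmeans(vectors, k_max)` on a list of equal-length numeric vectors returns `k_max` centers. For real document collections the vectors would be high-dimensional and sparse, so a practical implementation would need sparse data structures and a cheaper candidate search than the quadratic scan used here.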

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/170765