Sequential and Unsupervised Document Authorial Clustering Based on Hidden Markov Model

Publication Type:
Conference Proceeding
Citation:
Trustcom/BigDataSE/ICESS.2017, 2017
Issue Date:
2017-08-01
Document clustering groups documents with similar characteristics into the same cluster. It has proven useful for organizing, retrieving, navigating, and summarizing the huge volume of text documents on the Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is the extraction of contextual correlations among the sentences of each document to capture its hidden writing style, which is then used to cluster the documents. For this purpose, we build a Hidden Markov Model (HMM) to represent the relations between sequential sentences, and construct a two-level, unsupervised framework around it. The proposed approach is evaluated on four benchmark datasets that are widely used for document authorship analysis. A scientific paper is also used to demonstrate the approach's performance in clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms the state-of-the-art approaches.
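To make the idea of modelling sequential sentences with an HMM more concrete, the following is a minimal sketch of Viterbi decoding over a sentence sequence. It is not the paper's implementation: the two hidden states, the binary observation symbols (0 for a "short" sentence, 1 for a "long" one), and every probability value are hypothetical and chosen only for illustration. The abstract does not specify the features or the parameters actually used.

```python
import math

# Hypothetical parameters (assumptions, not taken from the paper):
# two hidden "writing style" states and two observation symbols.
START_P = [0.5, 0.5]                  # P(first sentence is in state s)
TRANS_P = [[0.9, 0.1], [0.1, 0.9]]    # "sticky" transitions: neighbours tend to share a state
EMIT_P = [[0.8, 0.2], [0.2, 0.8]]     # P(symbol | state)


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a list of observation symbols."""
    n_states = len(start_p)
    # Log-probability of the best path ending in each state after the first symbol.
    score = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        prev = score
        score, pointers = [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + math.log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans_p[best][s])
                         + math.log(emit_p[s][symbol]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical sentence-level symbols for one short text.
OBS = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
print(viterbi(OBS, START_P, TRANS_P, EMIT_P))
```

Because the transition matrix favours staying in the same state, an isolated outlier sentence tends to be absorbed into the label of its neighbours. This is one simple way in which context between consecutive sentences can influence the assignment.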