SUDMAD: Sequential and unsupervised decomposition of a multi-author document based on a hidden Markov model

Aldebei, K; He, X; Yeh, W; Jia, W

Publication Type: Journal Article
Citation: Journal of the Association for Information Science and Technology, 2018, 69 (2), pp. 201-214
Issue Date: 2018-02-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Aldebei, K	en_US
dc.contributor.author	He, X https://orcid.org/0000-0001-8962-540X	en_US
dc.contributor.author	Yeh, W	en_US
dc.contributor.author	Jia, W https://orcid.org/0000-0002-0940-3338	en_US
dc.date.issued	2018-02-01	en_US
dc.identifier.citation	Journal of the Association for Information Science and Technology, 2018, 69 (2), pp. 201 - 214	en_US
dc.identifier.issn	2330-1635	en_US
dc.identifier.uri	http://hdl.handle.net/10453/119930
dc.relation.ispartof	Journal of the Association for Information Science and Technology	en_US
dc.relation.isbasedon	10.1002/asi.23956	en_US
dc.rights	info:eu-repo/semantics/openAccess
dc.title	SUDMAD: Sequential and unsupervised decomposition of a multi-author document based on a hidden markov model	en_US
dc.type	Journal Article
utslib.citation.volume	2	en_US
utslib.citation.volume	69	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
utslib.for	0807 Library and Information Studies	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
pubs.organisational-group	/University of Technology Sydney/Strength - CRIN - Realtime Information Networks
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
utslib.copyright.status	open_access	*
pubs.issue	2	en_US
pubs.publication-status	Published	en_US
pubs.volume	69	en_US

Abstract:

© 2017 ASIS&T. Decomposing a document written by more than one author into sentences according to authorship is of great significance because of the increasing demand for plagiarism detection, forensic analysis, civil law (e.g., disputed copyright issues), and intelligence work involving disputed anonymous documents. Among existing studies of document decomposition, some are limited to specific languages, depend on the topics of the text, or are restricted to documents by two authors, and their accuracy leaves considerable room for improvement. In this paper, we consider the contextual correlation hidden among sentences and propose an algorithm for Sequential and Unsupervised Decomposition of a Multi-Author Document (SUDMAD) written in any language, regardless of topic, through the construction of a Hidden Markov Model (HMM) reflecting the authors' writing styles. To build and learn such a model, an unsupervised, statistical approach is first proposed to estimate the initial values of the HMM parameters for a preliminary model; it requires no information about the authors or the document's context other than the number of authors who contributed to the document. To further improve the performance of this approach, a boosted HMM learning procedure is proposed next, in which the initial classification results are used to create labeled training data for learning a more accurate HMM. Moreover, the contextual relationship among sentences is further used to refine the classification results. Our proposed approach is empirically evaluated on three benchmark datasets that are widely used for authorship analysis of documents. Comparisons with recent state-of-the-art approaches are also presented to demonstrate the significance of our new ideas and the superior performance of our approach.
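
To illustrate the general idea of treating authors as hidden states and sentences as observations, the following is a minimal Python sketch of Viterbi decoding over per-sentence style scores. It is not the SUDMAD procedure itself: the abstract does not describe the features, the parameter estimation, or the decoding step, so the use of Viterbi, the function name `viterbi`, and all probabilities below are assumptions chosen for a toy two-author example.

```python
import math


def viterbi(log_emit, log_trans, log_init):
    """Return the most likely author index for each sentence.

    log_emit[t][a]  : log-likelihood of sentence t under author a's style model
    log_trans[a][b] : log-probability that author b writes the sentence after author a
    log_init[a]     : log-probability that author a wrote the first sentence
    """
    n_authors = len(log_init)
    score = [log_init[a] + log_emit[0][a] for a in range(n_authors)]
    back = []  # back[t-1][b] = best previous author if sentence t is by b
    for t in range(1, len(log_emit)):
        prev = score
        score, pointers = [], []
        for b in range(n_authors):
            best = max(range(n_authors), key=lambda a: prev[a] + log_trans[a][b])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][b] + log_emit[t][b])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_authors), key=lambda a: score[a])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(rows):
    """Element-wise natural log of a list of probability rows."""
    return [[math.log(p) for p in row] for row in rows]


# Hypothetical per-sentence likelihoods under two authors' style models.
# Sentence 2 is ambiguous and, taken alone, leans slightly toward author 1.
emit = [[0.8, 0.2], [0.7, 0.3], [0.45, 0.55], [0.7, 0.3], [0.2, 0.8], [0.2, 0.8]]
# Assumed "sticky" transitions: consecutive sentences usually share an author.
trans = [[0.9, 0.1], [0.1, 0.9]]
init = [0.5, 0.5]

log_emit = logs(emit)
log_trans = logs(trans)
log_init = [math.log(p) for p in init]

# Labeling each sentence independently ignores its neighbours.
naive = [max(range(2), key=lambda a: row[a]) for row in emit]
# Decoding the whole sequence uses the context among sentences.
path = viterbi(log_emit, log_trans, log_init)

print("per-sentence labels:", naive)
print("sequence decoding:  ", path)
```

In this toy setup, the independent per-sentence labels and the sequence-level labels differ on the ambiguous sentence, which is the kind of contextual correction a sequential model can provide. How SUDMAD estimates the emission and transition parameters without labeled data is described in the paper, not here.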

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/119930