Clustering research across Tibetan and Chinese texts

Xu, GX; Sun, W; Peng, XP

Publication Type: Journal Article
Citation: Journal of Digital Information Management, 2015, 13 (3), pp. 162 - 168
Issue Date: 2015-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Xu, GX	en_US
dc.contributor.author	Sun, W	en_US
dc.contributor.author	Peng, XP https://orcid.org/0000-0002-8901-1472	en_US
dc.date.issued	2015-01-01	en_US
dc.identifier.citation	Journal of Digital Information Management, 2015, 13 (3), pp. 162 - 168	en_US
dc.identifier.issn	0972-7272	en_US
dc.identifier.uri	http://hdl.handle.net/10453/135304
dc.relation.ispartof	Journal of Digital Information Management	en_US
dc.subject.classification	Artificial Intelligence & Image Processing	en_US
dc.title	Clustering research across Tibetan and Chinese texts	en_US
dc.type	Journal Article
utslib.citation.volume	3	en_US
utslib.citation.volume	13	en_US
utslib.for	0806 Information Systems	en_US
utslib.for	0807 Library and Information Studies	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	open_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	13	en_US

Abstract:

Tibetan text clustering has potential in the Tibetan information processing domain. This paper proposes clustering across Chinese and Tibetan texts to benefit Chinese-Tibetan machine translation and sentence alignment. A Tibetan-Chinese keyword table is the main means of implementing text clustering across the two languages. An improved K-means algorithm and an improved density-based spatial clustering of applications with noise (DBSCAN) algorithm are proposed. Experiments show that the improved K-means algorithm yields stable clustering results and outperforms traditional K-means once the limitation of randomly selecting the initial k data points is eliminated. The improved DBSCAN algorithm performs well given reasonable parameter settings, and it outperforms the improved K-means algorithm. The study is useful for constructing a parallel corpus of Chinese and Tibetan texts.
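
The abstract does not say how the keyword table is applied or how the initial K-means centres are chosen, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes a tiny hypothetical keyword table that maps Chinese and Tibetan keywords to shared concept ids, represents each pre-tokenised document as a normalised concept-count vector, and replaces random initialisation with deterministic farthest-point seeding, a technique swapped in here purely for illustration. All names (`KEYWORD_TABLE`, `to_concept_vector`, `farthest_point_seeds`, `kmeans`) and the sample documents are invented for this example.

```python
import math

# Hypothetical bilingual keyword table: keyword -> shared concept id.
# The entries are illustrative only and are not taken from the paper.
KEYWORD_TABLE = {
    "经济": 0, "དཔལ་འབྱོར": 0,   # economy
    "教育": 1, "སློབ་གསོ": 1,     # education
    "医院": 2, "སྨན་ཁང": 2,       # hospital
}


def to_concept_vector(tokens, table=KEYWORD_TABLE):
    """Count concept ids for known keywords and L2-normalise the result."""
    vec = [0.0] * (max(table.values()) + 1)
    for token in tokens:
        if token in table:
            vec[table[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_seeds(vectors, k):
    """Deterministic seeding: start at the first vector, then repeatedly
    add the vector farthest from all seeds chosen so far."""
    seeds = [vectors[0]]
    while len(seeds) < k:
        seeds.append(max(vectors,
                         key=lambda v: min(sq_dist(v, s) for s in seeds)))
    return seeds


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd iterations starting from the deterministic seeds."""
    centroids = [list(s) for s in farthest_point_seeds(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Pre-tokenised sample documents (word segmentation is assumed done).
docs = [
    ["经济", "经济", "教育"],              # Chinese, mostly economy
    ["དཔལ་འབྱོར", "དཔལ་འབྱོར"],          # Tibetan, economy
    ["医院", "医院"],                      # Chinese, hospital
    ["སྨན་ཁང", "སྨན་ཁང", "སློབ་གསོ"],    # Tibetan, mostly hospital
]
labels = kmeans([to_concept_vector(d) for d in docs], k=2)
print(labels)
```

Because the seeds are chosen deterministically, repeated runs on the same input give the same cluster labels, which is the kind of stability the abstract attributes to removing random initial selection. In this toy data the Chinese and Tibetan documents about the same topic land in the same cluster because they share concept ids.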

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/135304