Enhanced co-occurrence distances for categorical data in unsupervised learning

Feng, JY; Wang, MC; Wang, C; Cao, LB

Publication Type: Conference Proceeding
Citation: 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010, 2010, 4, pp. 2071 - 2078
Issue Date: 2010-11-15

Closed Access

Filename	Description	Size	Format
2010003128OK.pdf		312.84 kB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Feng, JY	en_US
dc.contributor.author	Wang, MC	en_US
dc.contributor.author	Wang, C	en_US
dc.contributor.author	Cao, LB https://orcid.org/0000-0003-1562-9429	en_US
dc.date.issued	2010-11-15	en_US
dc.identifier.citation	2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010, 2010, 4 pp. 2071 - 2078	en_US
dc.identifier.isbn	9781424465262	en_US
dc.identifier.uri	http://hdl.handle.net/10453/16133
dc.relation.ispartof	2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010	en_US
dc.relation.isbasedon	10.1109/ICMLC.2010.5580500	en_US
dc.title	Enhanced co-occurrence distances for categorical data in unsupervised learning	en_US
dc.type	Conference Proceeding
utslib.citation.volume	4	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
dc.location.activity	Qingdao	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US
pubs.volume	4	en_US

Abstract:

Distance metrics for categorical data play an important role in unsupervised learning tasks such as clustering, where they strongly affect both learning accuracy and computational complexity. Recently, two co-occurrence methods, Co-occurrence Distance based on Power Set (CDPS) and Co-occurrence Distance based on Universal Set (CDUS), have been proposed to calculate distances between categorical attribute values. By exploiting co-occurrences among attributes, they significantly improve clustering accuracy. However, their computational load is high enough to restrict their application in unsupervised learning. This paper proposes two enhanced co-occurrence approaches, Co-occurrence Distance based on Join Set (CDJS) and Co-occurrence Distance based on Intersection Set (CDIS), which calculate the distance between two values of a categorical attribute by considering that attribute's relationships to other attributes. Theoretical analysis shows that CDJS and CDIS are as accurate as CDPS and CDUS while significantly reducing computational complexity. Experiments on ten benchmark and real-world data sets confirm that the proposed approaches match the accuracy of CDPS and CDUS with much higher efficiency, particularly on large-scale data sets. © 2010 IEEE.
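The general idea of a co-occurrence distance can be illustrated with a small sketch. The abstract does not give the formulas for CDPS, CDUS, CDJS or CDIS, so the code below does not reproduce them. It assumes one plausible form: two values of an attribute are compared through the conditional distributions of a second attribute, once by summing the larger probability over the union of co-occurring values and once by summing the smaller probability over the shared values only. The function names, the toy data and both formulas are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def conditional_probs(rows, target, other):
    """Estimate P(other = w | target = v) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target]][row[other]] += 1
    return {
        v: {w: n / sum(c.values()) for w, n in c.items()}
        for v, c in counts.items()
    }


def dist_union_max(p, q):
    """Hypothetical form: sum of max probabilities over the union, minus 1."""
    support = set(p) | set(q)
    return sum(max(p.get(w, 0.0), q.get(w, 0.0)) for w in support) - 1.0


def dist_intersection_min(p, q):
    """Hypothetical form: 1 minus sum of min probabilities over shared values."""
    shared = set(p) & set(q)
    return 1.0 - sum(min(p[w], q[w]) for w in shared)


# Toy data set (invented for this sketch): distance between colour values,
# judged by how they co-occur with shape values.
rows = [
    {"colour": "red", "shape": "round"},
    {"colour": "red", "shape": "round"},
    {"colour": "red", "shape": "square"},
    {"colour": "blue", "shape": "square"},
    {"colour": "blue", "shape": "square"},
    {"colour": "green", "shape": "round"},
]

probs = conditional_probs(rows, "colour", "shape")
d_max = dist_union_max(probs["red"], probs["blue"])
d_min = dist_intersection_min(probs["red"], probs["blue"])
print(d_max, d_min)
```

Under these assumptions the two forms always agree, because for two probability distributions the sum of the element-wise maxima and the sum of the element-wise minima add up to 2. The intersection form only visits values that co-occur with both attribute values, which is one way a reformulation could keep the same distances while doing less work.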

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/16133