Enhanced co-occurrence distances for categorical data in unsupervised learning

Publication Type: Conference Proceeding
Citation: 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010, 2010, 4 pp. 2071 - 2078
Issue Date: 2010-11-15
Distance metrics for categorical data play an important role in unsupervised learning tasks such as clustering, and they strongly affect both learning accuracy and computational complexity. Recently, two co-occurrence methods, Co-occurrence Distance based on Power Set (CDPS) and Co-occurrence Distance based on Universal Set (CDUS), have been proposed to calculate distances between categorical attribute values; by exploiting co-occurrences of attributes, they significantly improve clustering accuracy. However, their computational load is high enough to restrict their application in unsupervised learning. This paper proposes two enhanced co-occurrence approaches, Co-occurrence Distance based on Join Set (CDJS) and Co-occurrence Distance based on Intersection Set (CDIS), which calculate the distance between two values of a categorical attribute by considering that attribute's relationships to other attributes. Theoretical analysis shows that CDJS and CDIS are as accurate as CDPS and CDUS while significantly reducing computational complexity. Extensive experiments on ten benchmark and real-world data sets show that the proposed approaches are as accurate as CDPS and CDUS but much more efficient, particularly on large-scale data sets. © 2010 IEEE.
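The abstract does not give the definitions of CDPS, CDUS, CDJS, or CDIS, so the sketch below is not the paper's method. It only illustrates the general idea of a co-occurrence distance: two values of one categorical attribute are treated as close when they co-occur with similar values of the other attributes. The function names, the toy data, and the choice of averaging a total-variation-style gap over the other attributes are all assumptions made for illustration.

```python
from collections import Counter

def conditional_distribution(rows, target, other, value):
    """Estimate P(other = b | target = value) by counting rows."""
    counts = Counter(r[other] for r in rows if r[target] == value)
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"value {value!r} never occurs in attribute {target}")
    return {b: c / total for b, c in counts.items()}

def cooccurrence_distance(rows, target, x, y):
    """Illustrative co-occurrence distance between values x and y of one attribute.

    For every other attribute, compare the two conditional distributions with
    sum_b max(P(b|x), P(b|y)) - 1, which is 0 for identical distributions and
    1 for disjoint ones, then average over the other attributes.
    This is an assumed, generic formulation, not CDJS or CDIS.
    """
    others = [a for a in range(len(rows[0])) if a != target]
    if not others:
        raise ValueError("need at least one other attribute")
    total = 0.0
    for other in others:
        px = conditional_distribution(rows, target, other, x)
        py = conditional_distribution(rows, target, other, y)
        support = set(px) | set(py)
        total += sum(max(px.get(b, 0.0), py.get(b, 0.0)) for b in support) - 1.0
    return total / len(others)

# Hypothetical toy data: (colour, shape, taste); attribute 0 is the target.
rows = [
    ("red", "round", "sweet"),
    ("red", "round", "sweet"),
    ("green", "round", "sour"),
    ("green", "long", "sour"),
    ("yellow", "long", "sweet"),
    ("yellow", "long", "sweet"),
]

d_red_green = cooccurrence_distance(rows, 0, "red", "green")
d_red_yellow = cooccurrence_distance(rows, 0, "red", "yellow")
```

On this toy data, "red" ends up closer to "yellow" than to "green" because red and yellow items share the same taste distribution, even though their shapes differ. A matrix of such pairwise value distances could then feed a clustering routine for categorical data.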