Enhanced co-occurrence distances for categorical data in unsupervised learning
- Publication Type: Conference Proceeding
- 2010 International Conference on Machine Learning and Cybernetics, ICMLC 2010, 2010, vol. 4, pp. 2071-2078
Distance metrics for categorical data play an important role in unsupervised learning tasks such as clustering, where they strongly affect both learning accuracy and computational complexity. Two co-occurrence methods, Co-occurrence Distance based on Power Set (CDPS) and Co-occurrence Distance based on Universal Set (CDUS), were recently proposed to calculate distances between categorical attribute values. By exploiting co-occurrences of attributes, they significantly improve clustering accuracy, but their computational load is high enough to restrict their application in unsupervised learning. This paper proposes two enhanced co-occurrence approaches, Co-occurrence Distance based on Join Set (CDJS) and Co-occurrence Distance based on Intersection Set (CDIS), which calculate the distance between two values of a categorical attribute by considering that attribute's relationships to the other attributes. Theoretical analysis shows that CDJS and CDIS are as accurate as CDPS and CDUS while significantly reducing computational complexity. Extensive experiments on ten benchmark and real-world data sets confirm that the proposed approaches match the accuracy of CDPS and CDUS with much higher efficiency, particularly on large-scale data sets. © 2010 IEEE.
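The abstract does not give the CDJS or CDIS formulas, so the following is only a rough sketch of the general co-occurrence idea: two values of one attribute are treated as close when they co-occur with similar values of the other attributes. As an assumption made purely for illustration, the conditional distributions are compared with the total variation distance, which is a stand-in and not the paper's definition. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_distribution(rows, target_col, other_col):
    """Return {target_value: {other_value: P(other_value | target_value)}}."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[target_col]][row[other_col]] += 1
    return {
        t: {o: c / sum(cnt.values()) for o, c in cnt.items()}
        for t, cnt in counts.items()
    }


def cooccurrence_distance(rows, target_col, x, y):
    """Distance between values x and y of attribute target_col.

    Averages, over every other attribute, the total variation distance
    between P(. | x) and P(. | y). This is an illustrative stand-in,
    NOT the CDJS/CDIS definition from the paper.
    """
    others = [j for j in range(len(rows[0])) if j != target_col]
    total = 0.0
    for j in others:
        dist = conditional_distribution(rows, target_col, j)
        px, py = dist.get(x, {}), dist.get(y, {})
        values = set(px) | set(py)
        total += 0.5 * sum(abs(px.get(v, 0.0) - py.get(v, 0.0)) for v in values)
    return total / len(others)


# Hypothetical toy data: (colour, shape)
rows = [
    ("red", "round"),
    ("red", "round"),
    ("green", "round"),
    ("blue", "square"),
    ("blue", "round"),
]

d_red_green = cooccurrence_distance(rows, 0, "red", "green")
d_red_blue = cooccurrence_distance(rows, 0, "red", "blue")
```

In this toy data, "red" and "green" always co-occur with the same shape, so their distance is zero, while "blue" splits between two shapes and ends up farther from "red". A simple matching distance would treat all three colours as equally different from one another.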