Learning representations of ultrahigh-dimensional data for random distance-based outlier detection

Pang, G; Chen, L; Cao, L; Liu, H

Publication Type: Conference Proceeding
Citation: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2041 - 2050
Issue Date: 2018-07-19

Closed Access

	Filename	Description	Size	Format
	REPEN_KDD18_CR2.pdf	Published version	1 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Pang, G https://orcid.org/0000-0002-9877-2716	en_US
dc.contributor.author	Chen, L https://orcid.org/0000-0002-6468-5729	en_US
dc.contributor.author	Cao, L https://orcid.org/0000-0003-1562-9429	en_US
dc.contributor.author	Liu, H	en_US
dc.date.issued	2018-07-19	en_US
dc.identifier.citation	Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018, pp. 2041 - 2050	en_US
dc.identifier.isbn	9781450355520	en_US
dc.identifier.uri	http://hdl.handle.net/10453/127521
dc.description.abstract	© 2018 Association for Computing Machinery. Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands/millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving the data regularity information and learning the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance of detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach - the random distance-based approach. This customized learning yields more optimal and stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage little labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO to an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and two orders of magnitude speedup; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.	en_US
dc.relation.ispartof	Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining	en_US
dc.relation.isbasedon	10.1145/3219819.3220042	en_US
dc.title	Learning representations of ultrahigh-dimensional data for random distance-based outlier detection	en_US
dc.type	Conference Proceeding
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
pubs.organisational-group	/University of Technology Sydney/Students
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US

Abstract:

© 2018 Association for Computing Machinery. Learning expressive low-dimensional representations of ultrahigh-dimensional data, e.g., data with thousands or millions of features, has been a major way to enable learning methods to address the curse of dimensionality. However, existing unsupervised representation learning methods mainly focus on preserving data regularity information and learn the representations independently of subsequent outlier detection methods, which can result in suboptimal and unstable performance in detecting irregularities (i.e., outliers). This paper introduces a ranking model-based framework, called RAMODO, to address this issue. RAMODO unifies representation learning and outlier detection to learn low-dimensional representations that are tailored for a state-of-the-art outlier detection approach, the random distance-based approach. This customized learning yields better-optimized and more stable representations for the targeted outlier detectors. Additionally, RAMODO can leverage a small amount of labeled data as prior knowledge to learn more expressive and application-relevant representations. We instantiate RAMODO as an efficient method called REPEN to demonstrate the performance of RAMODO. Extensive empirical results on eight real-world ultrahigh-dimensional data sets show that REPEN (i) enables a random distance-based detector to obtain significantly better AUC performance and a speedup of two orders of magnitude; (ii) performs substantially better and more stably than four state-of-the-art representation learning methods; and (iii) leverages less than 1% labeled data to achieve up to 32% AUC improvement.
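
For readers unfamiliar with the detector family the abstract targets, the following is a minimal sketch of one common random distance-based scoring scheme: each point is scored by its distance to the nearest neighbor in a small random subsample, averaged over several subsamples. This is an illustration under stated assumptions, not the REPEN method or the authors' code; the function name `random_distance_scores` and the parameters `subsample_size`, `n_rounds`, and `seed` are hypothetical choices made for this example.

```python
import numpy as np


def random_distance_scores(X, subsample_size=8, n_rounds=50, seed=0):
    """Score points by nearest-neighbor distance to random subsamples.

    Higher scores suggest outliers. All parameter names and defaults are
    illustrative assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    size = min(subsample_size, n)
    scores = np.zeros(n)
    for _ in range(n_rounds):
        # Draw a small random subsample without replacement.
        idx = rng.choice(n, size=size, replace=False)
        # Euclidean distance from every point to every subsample member.
        diffs = X[:, None, :] - X[idx][None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # A sampled point must not match itself at distance zero.
        dists[idx, np.arange(size)] = np.inf
        scores += dists.min(axis=1)
    return scores / n_rounds
```

In a pipeline like the one the abstract describes, `X` would be the learned low-dimensional representation rather than the raw ultrahigh-dimensional data, which is where the claimed speedup of the distance computation would come from.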

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/127521