SKIF: A data imputation framework for concept drifting data streams

Zhang, P; Zhu, X; Tan, J; Guo, L

Publication Type: Conference Proceeding
Citation: International Conference on Information and Knowledge Management, Proceedings, 2010, pp. 1869 - 1872
Issue Date: 2010-12-01

Closed Access

Filename	Description	Size	Format
2010001768OK.pdf		769.74 kB	Adobe PDF

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, P https://orcid.org/0000-0001-7973-2746	en_US
dc.contributor.author	Zhu, X	en_US
dc.contributor.author	Tan, J	en_US
dc.contributor.author	Guo, L	en_US
dc.date.issued	2010-12-01	en_US
dc.identifier.citation	International Conference on Information and Knowledge Management, Proceedings, 2010, pp. 1869 - 1872	en_US
dc.identifier.isbn	9781450300995	en_US
dc.identifier.uri	http://hdl.handle.net/10453/16680
dc.description.abstract	Missing data commonly occur in many applications. While many data imputation methods exist to handle the missing data problem for databases, when applied to concept drifting data streams, these methods share some common difficulties. First, due to large and continuous data volumes, we are unable to maintain all stream records to form a candidate pool for missing value estimation, as most existing methods do. Second, even if we could maintain all complete stream records using a summary structure, the concept drifting problem would make some information obsolete, and thus degrade the imputation accuracy. Third, in data streams, it is necessary to develop a fast yet accurate algorithm to find the most similar data for imputation. Fourth, due to dynamic and sophisticated data collection environments, the missing rate of most stream data may be much higher than that in databases, so the imputation method should be able to handle a high missing rate in the data. To tackle these challenges, we propose a Streaming k-Nearest-Neighbors Imputation Framework (SKIF) for concept drifting data streams. To handle the concept drifting and large volume problems in data streams, SKIF first summarizes historical complete records in some micro-resources (which are high-level statistical data structures), and maintains these micro-resources in a candidate pool as benchmark data. After that, SKIF employs a novel hybrid-kNN imputation procedure, which uses a hybrid similarity search mechanism, to find the most similar micro-resources in the large-scale candidate pool efficiently. Experimental results demonstrate the effectiveness of the proposed SKIF framework for data stream imputation tasks. © 2010 ACM.	en_US
dc.relation.ispartof	International Conference on Information and Knowledge Management, Proceedings	en_US
dc.relation.isbasedon	10.1145/1871437.1871750	en_US
dc.title	SKIF: A data imputation framework for concept drifting data streams	en_US
dc.type	Conference Proceeding
utslib.for	150301 Business Information Management (incl. Records, Knowledge and Information Management, and Intelligence)	en_US
dc.location.activity	Toronto, Ontario, Canada	en_US
dc.location.activity	Incheon
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
utslib.copyright.status	closed_access
pubs.publication-status	Published	en_US

Abstract:

Missing data commonly occur in many applications. While many data imputation methods exist to handle the missing data problem for databases, when applied to concept drifting data streams, these methods share some common difficulties. First, due to large and continuous data volumes, we are unable to maintain all stream records to form a candidate pool for missing value estimation, as most existing methods do. Second, even if we could maintain all complete stream records using a summary structure, the concept drifting problem would make some information obsolete, and thus degrade the imputation accuracy. Third, in data streams, it is necessary to develop a fast yet accurate algorithm to find the most similar data for imputation. Fourth, due to dynamic and sophisticated data collection environments, the missing rate of most stream data may be much higher than that in databases, so the imputation method should be able to handle a high missing rate in the data. To tackle these challenges, we propose a Streaming k-Nearest-Neighbors Imputation Framework (SKIF) for concept drifting data streams. To handle the concept drifting and large volume problems in data streams, SKIF first summarizes historical complete records in some micro-resources (which are high-level statistical data structures), and maintains these micro-resources in a candidate pool as benchmark data. After that, SKIF employs a novel hybrid-kNN imputation procedure, which uses a hybrid similarity search mechanism, to find the most similar micro-resources in the large-scale candidate pool efficiently. Experimental results demonstrate the effectiveness of the proposed SKIF framework for data stream imputation tasks. © 2010 ACM.
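
The general idea in the abstract (summarize complete records into compact statistics, let old information fade, then impute from the nearest summaries) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the internal layout of a micro-resource or the hybrid similarity search, so the sketch assumes a simple count-plus-sums summary, an exponential decay to model drift, and a plain linear scan in place of the hybrid search. All names (`MicroSummary`, `StreamImputer`, `radius`, `decay`, `max_pool`) are hypothetical.

```python
import math


class MicroSummary:
    """Hypothetical stand-in for a micro-resource: a weight and per-attribute sums."""

    def __init__(self, record):
        self.n = 1.0
        self.sums = list(record)

    def centroid(self):
        return [s / self.n for s in self.sums]

    def absorb(self, record):
        self.n += 1.0
        self.sums = [s + x for s, x in zip(self.sums, record)]

    def fade(self, factor):
        # Scaling weight and sums together keeps the centroid fixed
        # while reducing the influence of older records.
        self.n *= factor
        self.sums = [s * factor for s in self.sums]


def partial_distance(record, centroid):
    """Euclidean distance computed over the observed (non-None) attributes only."""
    return math.sqrt(
        sum((x - c) ** 2 for x, c in zip(record, centroid) if x is not None)
    )


class StreamImputer:
    def __init__(self, k=2, radius=1.0, decay=0.99, max_pool=100):
        self.k = k                # number of nearest summaries used for imputation
        self.radius = radius      # assumed merge threshold for a new record
        self.decay = decay        # assumed per-record fading factor
        self.max_pool = max_pool  # bound on the candidate pool size
        self.pool = []

    def update(self, record):
        """Fold one complete record into the candidate pool."""
        for m in self.pool:
            m.fade(self.decay)
        if self.pool:
            nearest = min(
                self.pool, key=lambda m: partial_distance(record, m.centroid())
            )
            if partial_distance(record, nearest.centroid()) <= self.radius:
                nearest.absorb(record)
                return
        self.pool.append(MicroSummary(record))
        if len(self.pool) > self.max_pool:
            # Evict the lowest-weight summary, i.e. the most faded one.
            self.pool.remove(min(self.pool, key=lambda m: m.n))

    def impute(self, record):
        """Fill None entries with the weighted mean of the k nearest summaries."""
        if not self.pool:
            return list(record)
        nearest = sorted(
            self.pool, key=lambda m: partial_distance(record, m.centroid())
        )[: self.k]
        total = sum(m.n for m in nearest)
        return [
            sum(m.sums[i] for m in nearest) / total if x is None else x
            for i, x in enumerate(record)
        ]
```

In this sketch, complete records would be passed to `update` as they arrive and incomplete ones to `impute`, with missing attributes encoded as `None`. The pool stays bounded by `max_pool`, so memory does not grow with the stream, and the fading step is one simple way to let outdated summaries lose influence under drift.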

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/16680