TRIP: An Interactive Retrieving-Inferring Data Imputation Approach

Li, Z; Qin, L; Cheng, H; Zhang, X; Zhou, X

Publication Type: Journal Article
Citation: IEEE Transactions on Knowledge and Data Engineering, 2015, 27 (9), pp. 2550 - 2563
Issue Date: 2015-09-01

Closed Access

Filename	Description	Size	Format
[2015 TKDE] TRIP - An Interactive Retrieving-Inferring Data Imputation Approach.pdf	Published Version	2.3 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Li, Z	en_US
dc.contributor.author	Qin, L https://orcid.org/0000-0001-6068-5062	en_US
dc.contributor.author	Cheng, H	en_US
dc.contributor.author	Zhang, X	en_US
dc.contributor.author	Zhou, X	en_US
dc.date.issued	2015-09-01	en_US
dc.identifier.citation	IEEE Transactions on Knowledge and Data Engineering, 2015, 27 (9), pp. 2550 - 2563	en_US
dc.identifier.issn	1041-4347	en_US
dc.identifier.uri	http://hdl.handle.net/10453/41395
dc.relation	http://purl.org/au-research/grants/arc/DP160101513
dc.relation	http://purl.org/au-research/grants/arc/DE140100999
dc.relation.ispartof	IEEE Transactions on Knowledge and Data Engineering	en_US
dc.relation.isbasedon	10.1109/TKDE.2015.2411276	en_US
dc.subject.classification	Information Systems	en_US
dc.title	TRIP: An Interactive Retrieving-Inferring Data Imputation Approach	en_US
dc.type	Journal Article
utslib.citation.volume	9	en_US
utslib.citation.volume	27	en_US
utslib.for	080101 Adaptive Agents and Intelligent Robotics	en_US
utslib.for	080109 Pattern Recognition and Data Mining	en_US
utslib.for	08 Information and Computing Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CAI - Centre for Artificial Intelligence
utslib.copyright.status	closed_access
pubs.issue	9	en_US
pubs.publication-status	Published	en_US
pubs.volume	27	en_US

Abstract:

© 2015 IEEE. Data imputation aims at filling in missing attribute values in databases. Most existing imputation methods for string attribute values are inferring-based approaches, which usually fail to reach a high imputation recall because they infer missing values only from the complete part of the data set. Recently, some retrieving-based methods have been proposed to retrieve missing values from external resources such as the World Wide Web. These tend to reach a much higher imputation recall but inevitably incur a large overhead by issuing a large number of search queries. In this paper, we investigate the interaction between the inferring-based methods and the retrieving-based methods. We show that retrieving a small number of selected missing values can greatly improve the imputation recall of the inferring-based methods. With this intuition, we propose an inTeractive Retrieving-Inferring data imPutation approach (TRIP), which performs retrieving and inferring alternately to fill in missing attribute values in a data set. To ensure high recall at minimum cost, TRIP faces the challenge of selecting the fewest missing values for retrieving so as to maximize the number of inferable values. Our proposed solution is able to identify an optimal retrieving-inferring scheduling scheme in deterministic data imputation, and the optimality of the generated scheme is theoretically analyzed with proofs. We also show with an example that the optimal scheme cannot feasibly be achieved in τ-constrained stochastic data imputation (τ-SDI), but our proposed solution still identifies an expected-optimal scheme in τ-SDI. Extensive experiments on four data collections show that TRIP retrieves on average 20 percent of the missing values while achieving the same high recall as the retrieving-based approach.
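
The alternation between retrieving and inferring described in the abstract can be pictured with a small toy sketch. This is not the paper's scheduling algorithm, which this record does not reproduce; it is a simple greedy heuristic under an assumed model in which each missing cell lists alternative "support sets" and becomes inferable once all cells of one support set are known. The names `Rules`, `infer_closure`, and `interleave`, and the example cells, are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed toy model (not from the paper): a missing cell is inferable once
# every cell in at least one of its support sets is known.
Rules = Dict[str, List[Set[str]]]


def infer_closure(known: Set[str], missing: Set[str], rules: Rules) -> Set[str]:
    """Return all missing cells that become known through repeated inference."""
    known = set(known)
    inferred: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for cell in sorted(missing - inferred):
            if any(support <= known for support in rules.get(cell, [])):
                known.add(cell)
                inferred.add(cell)
                changed = True
    return inferred


def interleave(known: Set[str], missing: Set[str],
               rules: Rules) -> Tuple[List[str], List[str]]:
    """Alternate inferring and retrieving until no cell is missing.

    Greedy heuristic: when nothing more can be inferred, retrieve the one
    cell whose value would unlock the most further inferences.
    """
    known = set(known)
    remaining = set(missing)
    retrieved: List[str] = []
    inferred_order: List[str] = []
    while remaining:
        newly = infer_closure(known, remaining, rules)
        if newly:
            # Inferring step: free of external queries in this toy model.
            known |= newly
            remaining -= newly
            inferred_order.extend(sorted(newly))
            continue
        # Retrieving step: stands in for one external lookup (e.g. a web query).
        best = max(
            sorted(remaining),
            key=lambda c: len(infer_closure(known | {c}, remaining - {c}, rules)),
        )
        known.add(best)
        remaining.discard(best)
        retrieved.append(best)
    return retrieved, inferred_order


# Hypothetical example: "c" has no rule, so it can only be retrieved.
rules: Rules = {"b": [{"a", "c"}], "d": [{"b"}], "e": [{"d"}, {"c"}]}
retrieved, inferred = interleave({"a"}, {"b", "c", "d", "e"}, rules)
print(retrieved, inferred)
```

In this example a single retrieval of `c` makes the other three missing cells inferable, which mirrors the abstract's point that a few well-chosen retrievals can lift the recall of inference. A greedy choice like this carries no optimality guarantee, unlike the scheme the paper analyzes for the deterministic case.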

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/41395