Unsupervised feature construction for improving data representation and semantics

Rizoiu, MA; Velcin, J; Lallich, S

Publication Type: Journal Article
Citation: Journal of Intelligent Information Systems, 2013, 40 (3), pp. 501 - 527
Issue Date: 2013-06-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Rizoiu, MA https://orcid.org/0000-0003-0381-669X	en_US
dc.contributor.author	Velcin, J	en_US
dc.contributor.author	Lallich, S	en_US
dc.date.issued	2013-06-01	en_US
dc.identifier.citation	Journal of Intelligent Information Systems, 2013, 40 (3), pp. 501 - 527	en_US
dc.identifier.issn	0925-9902	en_US
dc.identifier.uri	http://hdl.handle.net/10453/137932
dc.relation.ispartof	Journal of Intelligent Information Systems	en_US
dc.relation.isbasedon	10.1007/s10844-013-0235-x	en_US
dc.subject.classification	Information Systems	en_US
dc.title	Unsupervised feature construction for improving data representation and semantics	en_US
dc.type	Journal Article
utslib.citation.volume	3	en_US
utslib.citation.volume	40	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0804 Data Format	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	open_access
pubs.issue	3	en_US
pubs.publication-status	Published	en_US
pubs.volume	40	en_US

Abstract:

The attribute-based format is the main data representation format used by machine learning algorithms. When the attributes do not properly describe the initial data, performance starts to degrade. Some algorithms address this problem by internally changing the representation space, but the newly constructed features rarely have any meaning. We seek to construct, in an unsupervised way, new attributes that are more appropriate for describing a given dataset and, at the same time, comprehensible for a human user. We propose two algorithms that construct the new attributes as conjunctions of the initial primitive attributes or their negations. The generated feature sets have reduced correlations between features and succeed in capturing some of the hidden relations between individuals in a dataset. For example, a feature like sky ∧ ¬building ∧ panorama would be true for non-urban images and is more informative than simple features expressing the presence or the absence of an object. The notion of Pareto optimality is used to evaluate feature sets and to obtain a balance between total correlation and the complexity of the resulting feature set. Statistical hypothesis testing is employed in order to automatically determine the values of the parameters used for constructing a data-dependent feature set. We experimentally show that our approaches achieve the construction of informative feature sets for multiple datasets. © 2013 Springer Science+Business Media New York.
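
As a rough illustration of two ideas mentioned in the abstract, the Python sketch below evaluates a conjunctive feature built from primitive boolean attributes and their negations, and then filters candidate feature sets down to a Pareto front. It is not the authors' algorithm: the names `conjunction` and `pareto_front`, the toy image annotations, and the (correlation, complexity) scores are all assumptions made up for this example, and the abstract does not say how the two criteria are actually measured.

```python
from typing import Dict, List, Sequence, Tuple

# A literal pairs an attribute name with the value it must take:
# ("building", False) stands for "NOT building". (Hypothetical encoding.)
Literal = Tuple[str, bool]


def conjunction(row: Dict[str, bool], literals: Sequence[Literal]) -> bool:
    """Return True when every literal holds for this individual."""
    return all(row[name] == wanted for name, wanted in literals)


def pareto_front(candidates: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the candidates that no other candidate dominates.

    Each candidate maps to (correlation, complexity); lower is better for both.
    A candidate is dominated when another one is at least as good on both
    criteria and strictly better on at least one.
    """
    front = []
    for name, (corr, cplx) in candidates.items():
        dominated = any(
            o_corr <= corr and o_cplx <= cplx and (o_corr < corr or o_cplx < cplx)
            for other, (o_corr, o_cplx) in candidates.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Toy image annotations (made-up data, not from the paper).
images = [
    {"sky": True, "building": False, "panorama": True},
    {"sky": True, "building": True, "panorama": True},
    {"sky": False, "building": True, "panorama": False},
]

# The abstract's example feature: sky AND NOT building AND panorama.
non_urban = [("sky", True), ("building", False), ("panorama", True)]
feature_values = [conjunction(img, non_urban) for img in images]
print(feature_values)  # [True, False, False]

# Made-up (correlation, complexity) scores for four candidate feature sets.
scores = {"A": (0.9, 1.0), "B": (0.4, 3.0), "C": (0.5, 4.0), "D": (0.2, 6.0)}
front = pareto_front(scores)
print(front)  # ['A', 'B', 'D']
```

In this toy run only the first image satisfies the conjunction, and candidate C is dropped from the front because B is better on both criteria. The remaining candidates represent different trade-offs between correlation and complexity, which is the kind of balance the abstract attributes to Pareto optimality.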

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/137932