Strong compound-risk factors: Efficient discovery through emerging patterns and contrast sets

Li, J; Yang, Q

Publication Type: Journal Article
Citation: IEEE Transactions on Information Technology in Biomedicine, 2007, 11 (5), pp. 544-552
Issue Date: 2007-09-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Li, J https://orcid.org/0000-0003-1833-7413	en_US
dc.contributor.author	Yang, Q	en_US
dc.date.issued	2007-09-01	en_US
dc.identifier.citation	IEEE Transactions on Information Technology in Biomedicine, 2007, 11 (5), pp. 544 - 552	en_US
dc.identifier.issn	1089-7771	en_US
dc.identifier.uri	http://hdl.handle.net/10453/15426
dc.description.abstract	Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistics measurements that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have often been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests for assessing factor interplays. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. Such a relationship has been unknown to researchers in the past, and efficient algorithms for discovering strong compound-risk factors have been lacking. In this paper, we propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a certain test threshold can be efficiently discovered. Our contribution thus represents the first of its kind in linking risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve classification accuracy in probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes. © 2007 IEEE.	en_US
dc.relation.ispartof	IEEE Transactions on Information Technology in Biomedicine	en_US
dc.relation.isbasedon	10.1109/TITB.2007.891163	en_US
dc.subject.classification	Medical Informatics	en_US
dc.subject.mesh	Data Interpretation, Statistical	en_US
dc.subject.mesh	Models, Statistical	en_US
dc.subject.mesh	Odds Ratio	en_US
dc.subject.mesh	Risk Assessment	en_US
dc.subject.mesh	Evidence-Based Medicine	en_US
dc.subject.mesh	Biometry	en_US
dc.subject.mesh	Computer Simulation	en_US
dc.subject.mesh	Pattern Recognition, Automated	en_US
dc.title	Strong compound-risk factors: Efficient discovery through emerging patterns and contrast sets	en_US
dc.type	Journal Article
utslib.citation.volume	5	en_US
utslib.citation.volume	11	en_US
utslib.for	0903 Biomedical Engineering	en_US
utslib.for	08 Information and Computing Sciences	en_US
utslib.for	09 Engineering	en_US
utslib.for	11 Medical and Health Sciences	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
utslib.copyright.status	open_access
pubs.issue	5	en_US
pubs.publication-status	Published	en_US
pubs.volume	11	en_US

Abstract:

Odds ratio (OR), relative risk (RR) (risk ratio), and absolute risk reduction (ARR) (risk difference) are biostatistics measurements that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have often been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests for assessing factor interplays. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. Such a relationship has been unknown to researchers in the past, and efficient algorithms for discovering strong compound-risk factors have been lacking. In this paper, we propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a certain test threshold can be efficiently discovered. Our contribution thus represents the first of its kind in linking risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve classification accuracy in probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes. © 2007 IEEE.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/15426