Breast cancer data analysis for survivability studies and prediction

Shukla, N; Hagenbuchner, M; Win, KT; Yang, J

Publication Type: Journal Article
Citation: Computer Methods and Programs in Biomedicine, 2018, 155, pp. 199-208
Issue Date: 2018-03-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Shukla, N	en_US
dc.contributor.author	Hagenbuchner, M	en_US
dc.contributor.author	Win, KT https://orcid.org/0000-0002-7810-6388	en_US
dc.contributor.author	Yang, J	en_US
dc.date.available	2020-05-25T19:09:22Z
dc.date.issued	2018-03-01	en_US
dc.identifier.citation	Computer Methods and Programs in Biomedicine, 2018, 155 pp. 199 - 208	en_US
dc.identifier.issn	0169-2607	en_US
dc.identifier.uri	http://hdl.handle.net/10453/121714
dc.relation.ispartof	Computer Methods and Programs in Biomedicine	en_US
dc.relation.isbasedon	10.1016/j.cmpb.2017.12.011	en_US
dc.subject.classification	Medical Informatics	en_US
dc.subject.mesh	Humans	en_US
dc.subject.mesh	Breast Neoplasms	en_US
dc.subject.mesh	SEER Program	en_US
dc.subject.mesh	Cluster Analysis	en_US
dc.subject.mesh	Survival Analysis	en_US
dc.subject.mesh	Cohort Studies	en_US
dc.subject.mesh	Models, Theoretical	en_US
dc.subject.mesh	Female	en_US
dc.subject.mesh	Machine Learning	en_US
dc.title	Breast cancer data analysis for survivability studies and prediction	en_US
dc.type	Journal Article
utslib.citation.volume	155	en_US
utslib.for	0903 Biomedical Engineering	en_US
utslib.for	0801 Artificial Intelligence and Image Processing	en_US
utslib.for	0906 Electrical and Electronic Engineering	en_US
pubs.embargo.period	Not known	en_US
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Information, Systems and Modelling
pubs.organisational-group	/University of Technology Sydney/Strength - CAMGIS - Centre for Advanced Modelling and Geospatial Information Systems
pubs.organisational-group	/University of Technology Sydney/Strength - PERSWADE - Centre on Persuasive Systems for Wise Adaptive Living
utslib.copyright.status	open_access
pubs.publication-status	Published	en_US
pubs.volume	155	en_US

Abstract:

© 2017 Elsevier B.V.

Background: Breast cancer is the most common cancer affecting females worldwide. Breast cancer survivability prediction is a challenging and complex research task. Existing approaches use statistical methods or supervised machine learning to assess or predict the survival prospects of patients.

Objective: The main objective of this paper is to develop a robust data analytical model that can assist in (i) a better understanding of breast cancer survivability in the presence of missing data, (ii) providing better insights into factors associated with patient survivability, and (iii) establishing cohorts of patients that share similar properties.

Methods: Unsupervised data mining methods, viz. the self-organising map (SOM) and density-based spatial clustering of applications with noise (DBSCAN), were used to create patient cohort clusters. These clusters, with their associated patterns, were used to train a multilayer perceptron (MLP) model for improved patient survivability analysis. A large dataset available from the SEER program was used to identify patterns associated with the survivability of breast cancer patients. Information gain was computed for variable selection. All of these methods are data-driven and require little (if any) input from users or experts.

Results: SOM consolidated patients into cohorts with similar properties. From these, DBSCAN identified and extracted nine cohorts (clusters). Patients in each of the nine clusters were found to have different survival times. The separation of patients into clusters improved the overall survival prediction accuracy of the MLP and revealed intricate conditions that affect the accuracy of a prediction.

Conclusions: A new, entirely data-driven approach based on unsupervised learning methods improves understanding and helps identify patterns associated with the survivability of patients. The results of the analysis can be used to segment historical patient data into clusters or subsets that share common variable values and survivability. The survivability prediction accuracy of an MLP is improved by using the identified patient cohorts rather than raw historical data. Analysis of the variable values in each cohort provides better insights into the survivability of a particular subgroup of breast cancer patients.
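
Two of the steps named in the abstract, information gain for variable selection and DBSCAN for extracting cohorts, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy records, the variable names (`stage`, `grade`, `survived`), the 2-D `points` standing in for SOM output, and the `eps` and `min_pts` settings are all hypothetical and chosen only for illustration. The SOM and MLP stages are omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Label entropy minus the expected label entropy after splitting on a variable."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, with -1 marking noise."""
    labels = [None] * len(points)  # None means "not yet visited"

    def neighbours(i):
        # Brute-force neighbourhood query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


# Hypothetical toy records (not SEER data): two candidate variables and an outcome.
stage = ["I", "I", "II", "II", "III", "III"]
grade = ["low", "high", "low", "high", "low", "high"]
survived = [1, 1, 1, 0, 0, 0]

# Rank the candidate variables by information gain with respect to the outcome.
ranking = sorted(
    {"stage": stage, "grade": grade}.items(),
    key=lambda kv: information_gain(kv[1], survived),
    reverse=True,
)
print([name for name, _ in ranking])

# Hypothetical 2-D points: two dense groups and one isolated outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

In this toy example `stage` separates the outcomes better than `grade`, so it ranks first, and DBSCAN labels the two dense groups as clusters 0 and 1 while marking the isolated point as noise. On real data the neighbourhood radius and minimum point count would need tuning, and the brute-force neighbour search would be replaced by a spatial index.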

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/121714