A review of the machine learning datasets in mammography, their adherence to the FAIR principles and the outlook for the future.

Logan, J; Kennedy, PJ; Catchpoole, D

Publisher: Springer Nature
Publication Type: Journal Article
Citation: Sci Data, 2023, 10, (1), pp. 595
Issue Date: 2023-09-08

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Logan, J
dc.contributor.author	Kennedy, PJ
dc.contributor.author	Catchpoole, D
dc.date.accessioned	2023-10-18T05:27:35Z
dc.date.available	2023-07-31
dc.date.available	2023-10-18T05:27:35Z
dc.date.issued	2023-09-08
dc.identifier.citation	Sci Data, 2023, 10, (1), pp. 595
dc.identifier.issn	2052-4463
dc.identifier.uri	http://hdl.handle.net/10453/172796
dc.description.abstract	The increasing rates of breast cancer, particularly in emerging economies, have led to interest in scalable deep learning-based solutions that improve the accuracy and cost-effectiveness of mammographic screening. However, such tools require large volumes of high-quality training data, which can be challenging to obtain. This paper combines the experience of an AI startup with an analysis of the FAIR principles of the eight available datasets. It demonstrates that the datasets vary considerably, particularly in their interoperability, as each dataset is skewed towards a particular clinical use-case. Additionally, the mix of digital captures and scanned film compounds the problem of variability, along with differences in licensing terms, ease of access, labelling reliability, and file formats. Improving interoperability through adherence to standards such as the BIRADS criteria for labelling and annotation, and a consistent file format, could markedly improve access and use of larger amounts of standardized data. This, in turn, could be increased further by GAN-based synthetic data generation, paving the way towards better health outcomes for breast cancer.
dc.format	Electronic
dc.language	eng
dc.publisher	Springer Nature
dc.relation.ispartof	Sci Data
dc.relation.isbasedon	10.1038/s41597-023-02430-6
dc.rights	info:eu-repo/semantics/openAccess
dc.subject.mesh	Data Accuracy
dc.subject.mesh	Machine Learning
dc.subject.mesh	Mammography
dc.subject.mesh	Motion Pictures
dc.subject.mesh	Reproducibility of Results
dc.subject.mesh	Datasets as Topic
dc.title	A review of the machine learning datasets in mammography, their adherence to the FAIR principles and the outlook for the future.
dc.type	Journal Article
utslib.citation.volume	10
utslib.location.activity	England
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - CHT - Health Technologies
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
pubs.organisational-group	/University of Technology Sydney/Centre for Health Technologies (CHT)
utslib.copyright.status	open_access	*
dc.date.updated	2023-10-18T05:27:32Z
pubs.issue	1
pubs.publication-status	Published online
pubs.volume	10
utslib.citation.issue	1

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/172796