A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment

Chen, J; Li, K; Tang, Z; Bilal, K; Yu, S; Weng, C; Li, K

Publisher: Institute of Electrical and Electronics Engineers
Publication Type: Journal Article
Citation: IEEE Transactions on Parallel and Distributed Systems, 2017, 28, (4), pp. 919-933
Issue Date: 2017-04-01

Closed Access

Filename	Description	Size	Format
A_Parallel_Random_Forest_Algorithm_for_Big_Data_in_a_Spark_Cloud_Computing_Environment.pdf		2.47 MB	Adobe PDF

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Chen, J
dc.contributor.author	Li, K
dc.contributor.author	Tang, Z
dc.contributor.author	Bilal, K
dc.contributor.author	Yu, S https://orcid.org/0000-0003-4485-6743
dc.contributor.author	Weng, C
dc.contributor.author	Li, K
dc.date.accessioned	2022-08-16T05:14:19Z
dc.date.available	2022-08-16T05:14:19Z
dc.date.issued	2017-04-01
dc.identifier.citation	IEEE Transactions on Parallel and Distributed Systems, 2017, 28, (4), pp. 919-933
dc.identifier.issn	1045-9219
dc.identifier.issn	1558-2183
dc.identifier.uri	http://hdl.handle.net/10453/160342
dc.description.abstract	With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of the classification accuracy, performance, and scalability. With the expansion of the scale of the random forest model and the Spark cluster, the advantage of the PRF algorithm is more obvious.
dc.language	English
dc.publisher	Institute of Electrical and Electronics Engineers
dc.relation.ispartof	IEEE Transactions on Parallel and Distributed Systems
dc.relation.isbasedon	10.1109/TPDS.2016.2603511
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0803 Computer Software, 0805 Distributed Computing, 1005 Communications Technologies
dc.subject.classification	Distributed Computing
dc.title	A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment
dc.type	Journal Article
utslib.citation.volume	28
utslib.for	0803 Computer Software
utslib.for	0805 Distributed Computing
utslib.for	1005 Communications Technologies
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2022-08-16T05:14:16Z
pubs.issue	4
pubs.publication-status	Published
pubs.volume	28
utslib.citation.issue	4

Abstract:

With the emergence of the big data age, the issue of how to obtain valuable knowledge from a dataset efficiently and accurately has attracted increasing attention from both academia and industry. This paper presents a Parallel Random Forest (PRF) algorithm for big data on the Apache Spark platform. The PRF algorithm is optimized based on a hybrid approach combining data-parallel and task-parallel optimization. From the perspective of data-parallel optimization, a vertical data-partitioning method is performed to reduce the data communication cost effectively, and a data-multiplexing method is performed to allow the training dataset to be reused and diminish the volume of data. From the perspective of task-parallel optimization, a dual parallel approach is carried out in the training process of RF, and a task Directed Acyclic Graph (DAG) is created according to the parallel training process of PRF and the dependence of the Resilient Distributed Datasets (RDD) objects. Then, different task schedulers are invoked for the tasks in the DAG. Moreover, to improve the algorithm's accuracy for large, high-dimensional, and noisy data, we perform a dimension-reduction approach in the training process and a weighted voting approach in the prediction process prior to parallelization. Extensive experimental results indicate the superiority and notable advantages of the PRF algorithm over the relevant algorithms implemented by Spark MLlib and other studies in terms of the classification accuracy, performance, and scalability. With the expansion of the scale of the random forest model and the Spark cluster, the advantage of the PRF algorithm is more obvious.
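
Two of the ideas named in the abstract, vertical data partitioning and weighted voting, can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' implementation and uses no Spark APIs. The function names `vertical_partition` and `weighted_vote` are hypothetical, and the per-tree weights are assumed to be some accuracy estimate for each tree, since the abstract does not say how the weights are derived.

```python
from collections import defaultdict


def vertical_partition(rows, labels):
    """Split row-oriented data into one independent subset per feature.

    Each subset pairs every value of a single feature with the row's
    label, so a per-feature computation needs no other columns.
    """
    n_features = len(rows[0])
    return {
        j: [(row[j], y) for row, y in zip(rows, labels)]
        for j in range(n_features)
    }


def weighted_vote(tree_predictions, tree_weights):
    """Return the class whose supporting trees have the largest total weight.

    Assumption: tree_weights[i] is an accuracy estimate for tree i.
    """
    totals = defaultdict(float)
    for pred, weight in zip(tree_predictions, tree_weights):
        totals[pred] += weight
    # Iterate classes in sorted order so ties resolve deterministically.
    return max(sorted(totals), key=lambda c: totals[c])


rows = [[5.1, 0.2], [6.3, 1.8], [5.9, 1.5]]
labels = ["a", "b", "b"]
subsets = vertical_partition(rows, labels)

# Two low-weight trees vote "b", one high-weight tree votes "a".
winner = weighted_vote(["a", "b", "b"], [0.9, 0.5, 0.3])
```

In the last call a plain majority vote would pick "b", while the weighted vote picks "a" because the single tree voting "a" carries more weight than the other two combined.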

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/160342