In-network Computation for Large-scale Federated Learning over Wireless Edge Networks

Dinh, TQ; Nguyen, DN; Hoang, DT; Pham, TV; Dutkiewicz, E

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type: Journal Article
Citation: IEEE Transactions on Mobile Computing, 2022, PP, (99), pp. 1-15
Issue Date: 2022-01-01

This item is open access.

The embargo period expires on 15 Jul 2024

Full metadata record

Field	Value	Language
dc.contributor.author	Dinh, TQ
dc.contributor.author	Nguyen, DN
dc.contributor.author	Hoang, DT
dc.contributor.author	Pham, TV
dc.contributor.author	Dutkiewicz, E https://orcid.org/0000-0002-4268-9286
dc.date.accessioned	2023-01-01T05:38:56Z
dc.date.available	2023-01-01T05:38:56Z
dc.date.issued	2022-01-01
dc.identifier.citation	IEEE Transactions on Mobile Computing, 2022, PP, (99), pp. 1-15
dc.identifier.issn	1536-1233
dc.identifier.issn	1558-0660
dc.identifier.uri	http://hdl.handle.net/10453/164598
dc.description.abstract	Most conventional Federated Learning (FL) models use a star network topology in which all users aggregate their local models at a single server (e.g., a cloud server). This causes significant communication and computing overhead at the server and delays the training process, especially for large-scale FL systems with straggling nodes. This paper proposes a novel edge network architecture that enables the model aggregation process at the server to be decentralized, thereby significantly reducing the training delay for the whole FL network. Specifically, we design a highly effective in-network computation (INC) framework consisting of a user scheduling mechanism, an in-network aggregation (INA) process designed for both primal and primal-dual methods in distributed machine learning problems, and a network routing algorithm with theoretical performance bounds. The in-network aggregation process, which is implemented at the edge nodes and the cloud node, can adapt two typical methods to allow edge networks to solve distributed machine learning problems effectively. Under the proposed INA, we then formulate a joint routing and resource optimization problem that aims to minimize the aggregation latency. The problem turns out to be NP-hard, so we propose a polynomial-time routing algorithm that achieves near-optimal performance with a theoretical bound. Simulation results show that the proposed algorithm can achieve more than 99% of the optimal solution and reduce the FL training latency by up to 5.6 times compared with other baselines. The proposed INC framework not only helps reduce the FL training latency but also significantly decreases the cloud’s traffic and computing overhead. By embedding the computing/aggregation tasks at the edge nodes and leveraging the multi-layer edge-network architecture, the INC framework can liberate FL from the star topology to enable large-scale FL.
dc.language	en
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation	http://purl.org/au-research/grants/arc/DE210100651
dc.relation.ispartof	IEEE Transactions on Mobile Computing
dc.relation.isbasedon	10.1109/TMC.2022.3190260
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.subject	0805 Distributed Computing, 0906 Electrical and Electronic Engineering, 1005 Communications Technologies
dc.subject.classification	Networking & Telecommunications
dc.title	In-network Computation for Large-scale Federated Learning over Wireless Edge Networks
dc.type	Journal Article
utslib.citation.volume	PP
utslib.for	0805 Distributed Computing
utslib.for	0906 Electrical and Electronic Engineering
utslib.for	1005 Communications Technologies
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - GBDTC - Global Big Data Technologies
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Electrical and Data Engineering
utslib.copyright.status	open_access	*
utslib.copyright.embargo	2024-07-15T00:00:00+1000Z
dc.date.updated	2023-01-01T05:38:50Z
pubs.issue	99
pubs.publication-status	Published
pubs.volume	PP
utslib.citation.issue	99

Abstract:

Most conventional Federated Learning (FL) models use a star network topology in which all users aggregate their local models at a single server (e.g., a cloud server). This causes significant communication and computing overhead at the server and delays the training process, especially for large-scale FL systems with straggling nodes. This paper proposes a novel edge network architecture that enables the model aggregation process at the server to be decentralized, thereby significantly reducing the training delay for the whole FL network. Specifically, we design a highly effective in-network computation (INC) framework consisting of a user scheduling mechanism, an in-network aggregation (INA) process designed for both primal and primal-dual methods in distributed machine learning problems, and a network routing algorithm with theoretical performance bounds. The in-network aggregation process, which is implemented at the edge nodes and the cloud node, can adapt two typical methods to allow edge networks to solve distributed machine learning problems effectively. Under the proposed INA, we then formulate a joint routing and resource optimization problem that aims to minimize the aggregation latency. The problem turns out to be NP-hard, so we propose a polynomial-time routing algorithm that achieves near-optimal performance with a theoretical bound. Simulation results show that the proposed algorithm can achieve more than 99% of the optimal solution and reduce the FL training latency by up to 5.6 times compared with other baselines. The proposed INC framework not only helps reduce the FL training latency but also significantly decreases the cloud’s traffic and computing overhead. By embedding the computing/aggregation tasks at the edge nodes and leveraging the multi-layer edge-network architecture, the INC framework can liberate FL from the star topology to enable large-scale FL.
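
The idea of moving aggregation from a single server into the edge network can be illustrated with a minimal sketch. It is not the paper's INA process, user scheduling, or routing algorithm, none of which are specified in this record. It assumes a FedAvg-style sample-weighted average as the aggregation rule, and all function names, edge-node names, and numbers are hypothetical. The sketch only shows that when edge nodes forward partial sums, the cloud receives one message per edge node rather than one per user and still obtains the same global model.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Update = Tuple[Vector, int]   # (local model, number of local samples)
Partial = Tuple[Vector, int]  # (sample-weighted sum of models, total samples)


def partial_aggregate(updates: List[Update]) -> Partial:
    """Edge step (assumed rule): sum sample-weighted models and count samples."""
    dim = len(updates[0][0])
    weighted_sum = [0.0] * dim
    total = 0
    for model, n in updates:
        for i, w in enumerate(model):
            weighted_sum[i] += n * w
        total += n
    return weighted_sum, total


def merge_partials(partials: List[Partial]) -> Vector:
    """Cloud step: add the edge partial sums and normalise by the sample count."""
    dim = len(partials[0][0])
    acc = [0.0] * dim
    total = 0
    for partial_sum, n in partials:
        for i, v in enumerate(partial_sum):
            acc[i] += v
        total += n
    return [v / total for v in acc]


def star_aggregate(updates: List[Update]) -> Vector:
    """Baseline star topology: every user update is averaged at one server."""
    weighted_sum, total = partial_aggregate(updates)
    return [v / total for v in weighted_sum]


# Hypothetical example: four users attached to two edge nodes.
assignment: Dict[str, List[Update]] = {
    "edge_a": [([1.0, 2.0], 10), ([3.0, 4.0], 30)],
    "edge_b": [([5.0, 6.0], 20), ([7.0, 8.0], 40)],
}
all_updates = [u for updates in assignment.values() for u in updates]

edge_partials = [partial_aggregate(updates) for updates in assignment.values()]
inc_model = merge_partials(edge_partials)  # cloud merges 2 edge messages
star_model = star_aggregate(all_updates)   # cloud averages 4 user messages

print("messages at cloud (star):", len(all_updates))
print("messages at cloud (in-network):", len(edge_partials))
print("models match:", all(abs(a - b) < 1e-9 for a, b in zip(inc_model, star_model)))
```

Forwarding the pair (weighted sum, sample count) rather than an already averaged model is what keeps the edge-level step exact under this assumed rule: partial sums can be merged in any order or grouping without changing the final average.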

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/164598