TU Darmstadt / ULB / TUprints

Generalized Cost-Based Job Scheduling in Very Large Heterogeneous Cluster Systems

Khuda Bukhsh, Wasiur R. ; Kar, Sounak ; Alt, Bastian ; Rizk, Amr ; Koeppl, Heinz (2022)
Generalized Cost-Based Job Scheduling in Very Large Heterogeneous Cluster Systems.
In: IEEE Transactions on Parallel and Distributed Systems, 31 (11)
doi: 10.26083/tuprints-00021667
Article, Secondary publication, Postprint

[img] Text
cbsTpds.pdf
Copyright Information: In Copyright.

Download (5MB)
Item Type: Article
Type of entry: Secondary publication
Title: Generalized Cost-Based Job Scheduling in Very Large Heterogeneous Cluster Systems
Language: English
Date: 2022
Place of Publication: Darmstadt
Publisher: IEEE
Journal or Publication Title: IEEE Transactions on Parallel and Distributed Systems
Volume of the journal: 31
Issue Number: 11
Collation: 14 Seiten
DOI: 10.26083/tuprints-00021667
Corresponding Links:
Origin: Secondary publication service
Abstract:

We study job assignment in large, heterogeneous resource-sharing clusters of servers with finite buffers. This load balancing problem arises naturally in today's communication and big data systems, such as Amazon Web Services, Network Service Function Chains, and Stream Processing. Arriving jobs are dispatched to a server, following a load balancing policy that optimizes a performance criterion such as job completion time. Our contribution is a randomized Cost-Based Scheduling (CBS) policy in which the job assignment is driven by general cost functions of the server queue lengths. Beyond existing schemes, such as the Join the Shortest Queue (JSQ), the power of d or the SQ(d) and the capacity-weighted JSQ, the notion of CBS yields new application-specific policies such as hybrid locally uniform JSQ. As today's data center clusters have thousands of servers, exact analysis of CBS policies is tedious. In this article, we derive a scaling limit when the number of servers grows large, facilitating a comparison of various CBS policies with respect to their transient as well as steady state behavior. A byproduct of our derivations is the relationship between the queue filling proportions and the server buffer sizes, which cannot be obtained from infinite buffer models. Finally, we provide extensive numerical evaluations and discuss several applications including multi-stage systems.

Uncontrolled Keywords: Job Scheduling, performance evaluation, mean-field limit
Status: Postprint
URN: urn:nbn:de:tuda-tuprints-216679
Classification DDC: 000 Generalities, computers, information > 004 Computer science
500 Science and mathematics > 530 Physics
Divisions: 18 Department of Electrical Engineering and Information Technology > Institute for Telecommunications > Bioinspired Communication Systems
18 Department of Electrical Engineering and Information Technology > Self-Organizing Systems Lab
Date Deposited: 20 Jul 2022 14:03
Last Modified: 13 Apr 2023 11:28
URI: https://tuprints.ulb.tu-darmstadt.de/id/eprint/21667
PPN: 506774325
Export:
Actions (login required)
View Item View Item