
Running Model Training Across Geo-Distributed Datacenters

This tutorial explains how to use exalsius to train common models across servers located in different, geographically distributed data centers. For this, we will use our own framework of distributed model training recipes. It combines Torch Elastic with different gradient compression strategies, enabling it to handle low-bandwidth network connections between data centers. However, exalsius is agnostic to specific frameworks and can also be used to run Flower, Flare, or Colossal-AI training workloads.


Prerequisites

First, make sure you have installed the exalsius CLI.

Our framework uses Weights & Biases to monitor training and Hugging Face for model and dataset storage. You need a Weights & Biases API key and a Hugging Face write user access token.

We will start by creating a cluster that supports the execution of geo-distributed training workloads. For this, you need a set of nodes in your node pool. If you have not imported nodes into your node pool yet, follow the node import guide.

No Docker Pre-Installed

exalsius uses containerd to run workloads on nodes. Therefore, nodes must not have Docker pre-installed. See node prerequisites.

Firewall Configuration

To allow model training across data centers, nodes need to allow inbound UDP traffic on port 51820. See firewall configuration.

Verify that there are available nodes in your node pool:

exls nodes list


Deploying Geo-Distributed Clusters

Follow the cluster deployment guide to deploy a cluster. Make sure that you set the corresponding option flags, or confirm the configuration during the interactive flow, to:

  • enable the cluster for multi-node training
  • enable the VPN connection between cluster nodes

exalsius supports running training across heterogeneous nodes. You can add both Nvidia and AMD GPU nodes to your cluster. Nodes without GPUs can be added as well.

Under the Hood

exalsius uses the Volcano scheduler to orchestrate model training jobs across multiple nodes.

To deploy a cluster across nodes in different data centers, exalsius internally uses the NetBird VPN service to enable peer-to-peer connections between nodes. It uses WireGuard to tunnel the communication through UDP port 51820. Nodes need to open this port for inbound traffic.
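For example, on nodes that use ufw as their host firewall, the required port could be opened as shown below. The exact commands depend on your firewall or cloud security group setup, so treat this only as an illustration.

sudo ufw allow 51820/udp
sudo ufw status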

Deploy a Geo-Distributed GPT Model Training Job

exalsius provides a distributed training workspace based on our custom framework for communication-efficient distributed training recipes. It supports a range of models and datasets that can be trained on nodes across geo-distributed data centers.

To start the training of a GPT-like model on your cluster, run:

exls workspaces deploy distributed-training \
  --model gpt-neo-x \
  --gradient-compression medium \
  --wandb-token <YOUR-API-TOKEN> \
  --hf-token <YOUR-ACCESS-KEY>

The --model option selects the model to be trained. The --gradient-compression option controls the gradient compression strategy used to cope with low-bandwidth interconnects between cluster nodes.

If you do not provide the --wandb-token and --hf-token options, exalsius will prompt you to enter them.

Before starting the workspace, exalsius will ask whether you want to open the detailed configuration file. This is optional. If you are not familiar with the distributed training framework, we recommend keeping the parameters as they are.

If you decide to edit the parameters, exalsius opens them in your default terminal editor. Change the settings, save the file, and close the editor; exalsius will then use your edited settings.
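On most Linux systems, the default terminal editor is determined by the EDITOR environment variable. If you prefer a specific editor, you can set this variable before deploying the workspace; whether exalsius honors it depends on your shell and system configuration, so take this as an assumption rather than a guarantee.

export EDITOR=vim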

You can check the status of your workspace by running:

exls workspaces list

Understanding Status

The status "RUNNING" indicates that the cluster has accepted your workspace. The first run of a distributed-training workspace requires pulling large images and datasets onto the nodes, so it might take a couple of minutes until the training actually starts.

Under the Hood

To enable effective geo-distributed training, our distributed training framework implements different strategies to reduce the communication of gradients between nodes. We use DiLoCo to reduce the number of synchronizations between nodes. Furthermore, the framework supports the compression of gradients with FP8 or FP4 quantization and decoupled momentum optimization.

exalsius abstracts these settings through the --gradient-compression option. However, fine-grained adjustments can be made by editing the configuration before submitting it.

Running Geo-Distributed Training

The distributed-training workspace supports the training of a set of integrated model architectures and datasets. Some models have specific requirements that need to be checked before starting the training.

| Model Name | Dataset            | Requirements |
|------------|--------------------|--------------|
| gpt-neo    | c4                 | -            |
| gpt-neo-x  | c4                 | -            |
| gcn        | ogbn-arxiv         | -            |
| resnet50   | ILSVRC/imagenet-1k | GPU node storage >250 GB; the account associated with the Hugging Face user key needs access to the ILSVRC/imagenet-1k dataset |
| resnet101  | ILSVRC/imagenet-1k | GPU node storage >250 GB; the account associated with the Hugging Face user key needs access to the ILSVRC/imagenet-1k dataset |
| wav2vec2   | librispeech_asr    | GPU node storage >250 GB |
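For example, to train one of the vision models listed above, pass its name to the same --model option used earlier. Make sure the storage and Hugging Face access requirements from the table are met before starting; the token placeholders are the same as in the GPT example.

exls workspaces deploy distributed-training \
  --model resnet50 \
  --gradient-compression medium \
  --wandb-token <YOUR-API-TOKEN> \
  --hf-token <YOUR-ACCESS-KEY>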

Monitoring the Logs of your Workspace

exalsius builds upon Kubernetes to deploy and manage clusters. If you have kubectl installed, you can download the kubeconfig of your cluster by running:

exls clusters import-kubeconfig

Note

This will overwrite the existing config file in $HOME/.kube/config. You can use the --kubeconfig-path option to set another file path.
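For example, to keep your existing configuration untouched, you could write the kubeconfig to a separate file and point kubectl at it via the standard KUBECONFIG environment variable. The file path below is only an illustrative choice.

exls clusters import-kubeconfig --kubeconfig-path ~/.kube/exalsius-config
export KUBECONFIG=~/.kube/exalsius-config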

Now, you can check the status of your pods via:

kubectl get pods
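To follow the training output of a specific pod, you can stream its logs with kubectl. Replace <POD-NAME> with one of the pod names from the output above.

kubectl logs -f <POD-NAME>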

Monitoring via Weights & Biases

After the workspace is started and the pods have downloaded the images, you can monitor the training progress via your Weights & Biases dashboard. For each geo-distributed model training run, exalsius creates a new project. If you want to use an existing project, you can set the Weights & Biases project name in the detailed configurations of the workspace.


Delete a Training Workspace

Distributed model training workspaces can be deleted at any time. To delete one, first list all your workspaces:

exls workspaces list

Find the ID of your workspace and delete it with:

exls workspaces delete <WORKSPACE-ID>

Warning

Workspace storage is ephemeral. All data and logs will be lost after deletion. Always export checkpoints or metrics before removing a workspace if you want to keep results.
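One way to preserve results is to copy them off a running pod with kubectl cp before deleting the workspace. Both the pod name and the checkpoint path below are placeholders; the actual location of checkpoints depends on your training configuration.

kubectl cp <POD-NAME>:<PATH-TO-CHECKPOINTS> ./checkpoints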


Troubleshooting

| Issue | Possible Fix |
|-------|--------------|
| The workspace reached the status "RUNNING" but there is no telemetry data in Weights & Biases | After deploying a fresh cluster, the first start of the distributed training workspace needs to pull large images and datasets. This might take up to 15 minutes. Check the status of your pods using kubectl to see details. |
| The pods reach the status "RUNNING" but there is still no data in Weights & Biases | After deploying a fresh cluster, the first start of the distributed training might run into rendezvous issues related to Torch Elastic and Torch Distributed. Try deleting the workspace and starting it again. This time the images and datasets are already pulled and the workspace will start immediately. |
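To see why a pod is not producing telemetry yet, for example because images are still being pulled, you can inspect its events and recent cluster activity with kubectl. Replace <POD-NAME> with a name from the kubectl get pods output.

kubectl describe pod <POD-NAME>
kubectl get events --sort-by='.metadata.creationTimestamp'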

Further Information