Serving a large language model

This tutorial walks you through deploying an LLM inference workspace on exalsius, validating the deployment, and sending your first inference request.

exalsius uses llm-d and vLLM under the hood to provide OpenAI-compatible model serving with gateway routing, model replica scheduling, and tensor parallelism.

Prerequisites

  • The exalsius CLI installed and configured
  • A cluster in READY status with --prepare-llm-inference-environment enabled (see deploy clusters)
  • Available GPU capacity on the cluster
  • A HuggingFace API token with access to the model you want to serve

Verify your setup:

exls clusters list
exls clusters show-available-resources <CLUSTER-ID-or-NAME>

LLM inference environment

The cluster must have been created with the LLM inference environment enabled. This pre-installs the Gateway API, Istio, Gateway API Inference Extension, and llm-d control-plane components. Without it, LLM inference workspaces cannot be deployed.

Step 1 — Deploy a model

Deploy a model workspace with the CLI:

exls workspaces deploy llm-inference \
  --name qwen3-1p7b \
  --hf-token <HUGGINGFACE-TOKEN> \
  --model-name Qwen/Qwen3-1.7B \
  --num-gpus 1

Flags:

  • --name: Workspace name. Use a stable, descriptive name for easier management.
  • --hf-token: HuggingFace API token. Also accepts --huggingface-token or the HUGGINGFACE_TOKEN/HF_TOKEN env vars.
  • --model-name: HuggingFace model in <repo>/<model> format.
  • --num-gpus: Number of GPUs. For multi-GPU deployments, this sets vLLM tensor parallelism.

Before submitting, the CLI shows a deployment summary and asks for confirmation. You can optionally review and edit the full workspace configuration in your editor.

Multi-GPU deployment

For larger models that require tensor parallelism:

exls workspaces deploy llm-inference \
  --name qwen3-7b-tp2 \
  --hf-token <HUGGINGFACE-TOKEN> \
  --model-name Qwen/Qwen3-7B \
  --num-gpus 2

Ensure the model supports the tensor parallelism configuration you select.
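In general, vLLM requires the model's attention head count to be evenly divisible by the tensor-parallel size (here, --num-gpus). A quick sanity check you can run before deploying; the head counts below are illustrative assumptions, not values read from any model card:

```python
def valid_tp(num_attention_heads: int, tp_size: int) -> bool:
    """vLLM splits attention heads across GPUs, so the head count
    must divide evenly by the tensor-parallel size."""
    return tp_size >= 1 and num_attention_heads % tp_size == 0

# Hypothetical head counts for illustration only.
print(valid_tp(16, 2))  # True: 16 heads split cleanly across 2 GPUs
print(valid_tp(16, 3))  # False: 3 does not divide 16
```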

Step 2 — Validate the deployment

Check that the workspace is running:

exls workspaces get <WORKSPACE-ID-or-NAME>

The Access field shows the inference endpoint and the Open WebUI URL. First startup can take several minutes while container images are pulled and model weights are downloaded.
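Because first startup takes a while, it can be convenient to poll the endpoint until the model appears rather than retrying by hand. A minimal stdlib-only sketch; the endpoint URL and timeouts are placeholders, not exalsius defaults:

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_model(endpoint: str, model: str,
                   timeout_s: float = 900, interval_s: float = 15) -> bool:
    """Poll <endpoint>/v1/models until `model` is listed or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{endpoint}/v1/models", timeout=10) as resp:
                listing = json.load(resp)
            if any(m.get("id") == model for m in listing.get("data", [])):
                return True
        except (urllib.error.URLError, OSError):
            pass  # endpoint not reachable yet; keep polling
        time.sleep(interval_s)
    return False

# wait_for_model("http://<INFERENCE-ENDPOINT>", "Qwen/Qwen3-1.7B")
```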

Confirm the model is discoverable:

curl <INFERENCE-ENDPOINT>/v1/models
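The endpoint follows the OpenAI-style listing format, so the response is a JSON object whose data array holds one entry per served model. A small helper to check the listing programmatically; the sample payload below is an assumed shape, not captured output:

```python
def model_in_listing(listing: dict, model: str) -> bool:
    """Check an OpenAI-style /v1/models response for a given model id."""
    return any(m.get("id") == model for m in listing.get("data", []))

# Assumed example of a /v1/models response body:
sample = {"object": "list", "data": [{"id": "Qwen/Qwen3-1.7B", "object": "model"}]}
print(model_in_listing(sample, "Qwen/Qwen3-1.7B"))  # True
```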

Step 3 — Send an inference request

Send an OpenAI-compatible chat completion request to the inference endpoint:

curl <INFERENCE-ENDPOINT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-1.7B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
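The same request can be sent from Python with only the standard library. This is a sketch, not an official client: the endpoint is a placeholder, and the response is assumed to follow the OpenAI chat-completion shape (choices[0].message.content):

```python
import json
import urllib.request

def build_chat_body(model: str, prompt: str) -> bytes:
    """Build an OpenAI-compatible chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def chat(endpoint: str, model: str, prompt: str) -> dict:
    """POST a chat completion request and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{endpoint}/v1/chat/completions",
        data=build_chat_body(model, prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

# reply = chat("http://<INFERENCE-ENDPOINT>", "Qwen/Qwen3-1.7B", "Hello!")
# print(reply["choices"][0]["message"]["content"])
```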

Troubleshooting

If requests fail, verify that --model-name, --num-gpus, and --hf-token were set correctly during deployment, and that the model is fully loaded (check /v1/models).

Step 4 — Add more models

Deploy additional model workspaces on the same cluster. Each workspace runs independently and shares the cluster's LLM inference environment:

exls workspaces deploy llm-inference \
  --name mistral-7b \
  --hf-token <HUGGINGFACE-TOKEN> \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --num-gpus 1

All models are accessible through the same gateway endpoint and routed by the Gateway API Inference Extension.

Step 5 — Clean up

Delete model workspaces when no longer needed:

exls workspaces delete <WORKSPACE-ID-or-NAME>

Warning

Workspace storage is ephemeral. Export any required data before deletion.

Further reading