# Serving a large language model
This tutorial walks you through deploying an LLM inference workspace on exalsius, validating the deployment, and sending your first inference request.
exalsius uses llm-d and vLLM under the hood to provide OpenAI-compatible model serving with gateway routing, model replica scheduling, and tensor parallelism.
## Prerequisites
- The exalsius CLI installed and configured
- A cluster in `READY` status with `--prepare-llm-inference-environment` enabled (see deploy clusters)
- Available GPU capacity on the cluster
- A HuggingFace API token with access to the model you want to serve
Verify your setup:
```bash
exls clusters list
exls clusters show-available-resources <CLUSTER-ID-or-NAME>
```
**LLM inference environment:** The cluster must have been created with the LLM inference environment enabled. This pre-installs the Gateway API, Istio, the Gateway API Inference Extension, and the llm-d control-plane components. Without it, LLM inference workspaces cannot be deployed.
## Step 1 — Deploy a model
Deploy a model workspace with the CLI:
```bash
exls workspaces deploy llm-inference \
  --name qwen3-1p7b \
  --hf-token <HUGGINGFACE-TOKEN> \
  --model-name Qwen/Qwen3-1.7B \
  --num-gpus 1
```
| Flag | Description |
|---|---|
| `--name` | Workspace name. Use a stable, descriptive name for easier management. |
| `--hf-token` | HuggingFace API token. Also accepts `--huggingface-token` or the `HUGGINGFACE_TOKEN`/`HF_TOKEN` environment variables. |
| `--model-name` | HuggingFace model in `<repo>/<model>` format. |
| `--num-gpus` | Number of GPUs. For multi-GPU deployments, this sets vLLM tensor parallelism. |
Before submitting, the CLI shows a deployment summary and asks for confirmation. You can optionally review and edit the full workspace configuration in your editor.
### Multi-GPU deployment
For larger models that require tensor parallelism:
```bash
exls workspaces deploy llm-inference \
  --name qwen3-7b-tp2 \
  --hf-token <HUGGINGFACE-TOKEN> \
  --model-name Qwen/Qwen3-7B \
  --num-gpus 2
```
Ensure the model architecture supports the tensor-parallel degree you select; vLLM requires the model's attention head count to be evenly divisible by the tensor-parallel size.
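As a rough sketch of that divisibility rule (the head count below is hypothetical — check the model's `config.json` on HuggingFace for the real value):

```python
def valid_tp_size(num_attention_heads: int, tp_size: int) -> bool:
    """vLLM shards attention heads evenly across GPUs, so the head
    count must be divisible by the tensor-parallel size."""
    return num_attention_heads % tp_size == 0

# Hypothetical 28-head model: TP of 1, 2, or 4 divides evenly, 8 does not.
for tp in (1, 2, 4, 8):
    print(tp, valid_tp_size(28, tp))
```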
## Step 2 — Validate the deployment
Check that the workspace is running:
```bash
exls workspaces get <WORKSPACE-ID-or-NAME>
```
The `Access` field shows the inference endpoint and the Open WebUI URL. First startup can take several minutes because model images and artifacts are large.
Confirm the model is discoverable:
```bash
curl <INFERENCE-ENDPOINT>/v1/models
```
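Because the server is OpenAI-compatible, the response follows the standard models-list shape, so you can check for your model programmatically. A minimal sketch, assuming the usual `{"object": "list", "data": [...]}` layout (the sample payload here is illustrative, not captured from a real deployment):

```python
import json

def model_ids(models_response: dict) -> list[str]:
    # OpenAI-compatible /v1/models responses carry one entry per
    # served model under the "data" key.
    return [m["id"] for m in models_response.get("data", [])]

# Illustrative response in the OpenAI models-list shape:
sample = json.loads('{"object": "list", "data": [{"id": "Qwen/Qwen3-1.7B", "object": "model"}]}')
ids = model_ids(sample)
print(ids)
```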
## Step 3 — Send an inference request
Send an OpenAI-compatible chat completion request to the inference endpoint:
```bash
curl <INFERENCE-ENDPOINT>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-1.7B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
**Troubleshooting:** If requests fail, verify that `--model-name`, `--num-gpus`, and `--hf-token` were set correctly during deployment, and that the model has finished loading (check `/v1/models`).
## Step 4 — Add more models
Deploy additional model workspaces on the same cluster. Each workspace runs independently and shares the cluster's LLM inference environment:
```bash
exls workspaces deploy llm-inference \
  --name mistral-7b \
  --hf-token <HUGGINGFACE-TOKEN> \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --num-gpus 1
```
All models are accessible through the same gateway endpoint and routed by the Gateway API Inference Extension.
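In other words, clients always target the same `<INFERENCE-ENDPOINT>/v1/chat/completions` URL and select a workspace purely through the `model` field. A small sketch of what that looks like from the client side:

```python
import json

def chat_payload(model: str, prompt: str) -> str:
    # The gateway routes on the "model" field, so one endpoint
    # serves every deployed workspace.
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })

# Two requests to the same gateway endpoint, routed to different
# model workspaces by the Gateway API Inference Extension:
for model in ("Qwen/Qwen3-1.7B", "mistralai/Mistral-7B-Instruct-v0.3"):
    print(chat_payload(model, "Hello!"))
```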
## Step 5 — Clean up
Delete model workspaces when no longer needed:
```bash
exls workspaces delete <WORKSPACE-ID-or-NAME>
```
**Warning:** Workspace storage is ephemeral. Export any required data before deletion.
## Further reading
- Start workspaces — full workspace CLI reference
- Cluster observability — monitor your inference workloads
- llm-d documentation
- vLLM documentation