This guide shows how to deploy DeepSeek-R1 using SGLang and Terraform, taking an infrastructure-as-code approach to NVFP4 inference on cloud-provisioned GPU systems.
We focus on single-node GPU inference with 4× NVIDIA B300 SXM6 262GB GPUs, local NVMe storage with Docker caching, and a reproducible benchmark run.
By the end, you will have a single Terraform config for provisioning infrastructure and a startup script for installing the runtime, downloading the model, and launching SGLang.
Why SGLang for DeepSeek-R1 inference
SGLang is the officially preferred serving framework for DeepSeek models starting from V3 and R1. The SGLang project has since extended beyond model inference toward Reinforcement Learning (RL) post-training, integrating with Miles, Slime, and veRL.
Through 2025, the SGLang team focused on leveraging the optimizations performed and presented by the DeepSeek team in the DeepSeek-V3 Technical Report. We published an in-depth overview of these optimization techniques, with Multi-Head Latent Attention getting a dedicated entry.
Why Terraform for DeepSeek-R1 inference with SGLang
One of the optimization techniques SGLang leverages for DeepSeek inference is Prefill-Decode disaggregation (PD disaggregation), which is among the most complex optimizations from a resource-orchestration perspective. Infrastructure-as-code (IaC) tools like Terraform help keep that complexity manageable.
Terraform is an open-source IaC tool developed by HashiCorp that enables defining, provisioning, and managing cloud and on-premises infrastructure using its declarative language (HCL) in configuration files. It allows for versioning, reuse, and automation for safe, efficient infrastructure deployment across cloud platforms.
This makes deployments systematic and easy to scale, while keeping every resource tracked and identifiable across the whole system.
The core Terraform concepts include:
- providers: plugins that interact with cloud provider APIs
- resources: infrastructure objects that Terraform manages via providers
- data sources: read-only lookups
- modules: reusable bundles of Terraform configs
- variables / outputs: parameters in, results out
For most use cases, we will run a combination of these commands:
- terraform init: sets up the working directory and downloads providers/modules.
- terraform plan: a dry run that shows what would change based on the current state.
- terraform apply: performs the proposed changes.
- terraform destroy: deletes the managed resources.
By default, Terraform saves the current state of infrastructure and resources in a local file called terraform.tfstate. We can list the deployed resources with terraform state list.
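The commands above are usually chained in a fixed order. The following is a minimal sketch of that loop, not a required workflow: the function names, the TF_BIN variable, and the tfplan file name are illustrative choices. Saving the plan with -out means apply executes exactly what was reviewed.

```shell
#!/usr/bin/env bash
# Sketch of the usual Terraform loop. TF_BIN, tf_deploy, tf_teardown and the
# default plan file name are illustrative, not part of the guide's repository.
set -eu

TF_BIN="${TF_BIN:-terraform}"

tf_deploy() {
  local plan_file
  plan_file="${1:-tfplan}"
  "$TF_BIN" init -input=false                     # download providers/modules
  "$TF_BIN" plan -input=false -out="$plan_file"   # dry run, saved to a plan file
  "$TF_BIN" apply -input=false "$plan_file"       # apply exactly the saved plan
  "$TF_BIN" state list                            # list the resources now tracked
}

tf_teardown() {
  "$TF_BIN" destroy                               # asks for confirmation before deleting
}
```

Run tf_deploy from the directory that holds the .tf files, and tf_teardown when the instance is no longer needed.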
Terraform provider installation for Verda
The Verda provider is available on the Terraform Registry, making the installation process very straightforward.
Our Terraform integration is fully compatible with OpenTofu. You can also find our provider on the OpenTofu Registry. However, we do not cover this installation method in this guide.
- Download and install Terraform:
    # Terraform installation
    wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(grep -oP '(?<=UBUNTU_CODENAME=).*' /etc/os-release || lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
    sudo apt update && sudo apt install terraform
- Once installed, we just need to include the following in our Terraform configs:

    terraform {
      required_providers {
        verda = {
          source  = "verda-cloud/verda"
          version = "1.0.0"
        }
      }
    }

    provider "verda" {
      client_id     = var.verda_client_id
      client_secret = var.verda_client_secret
    }
By default, our provider reads the VERDA_CLIENT_ID and VERDA_CLIENT_SECRET environment variables. If they are not defined, the credentials can be set in the provider block, as shown above.
Terraform configuration for GPU instances and storage
Once the setup is complete, we will create Terraform configurations for managing resources. Terraform configurations are .tf files that describe resources for the infrastructure required for this deployment, specifying their properties and dependencies between them.
The following configuration uses several block types:
- terraform block: the parent block that contains settings defining Terraform's own behavior.
- required_providers block: specifies the provider plugins required to create and manage the resources in the configuration.
- resource block: declares a resource that is managed through the configured provider. The resources for this deployment are specified in our GitHub repository.
- output block: exposes information about your infrastructure that you can read from the command line or reference from other Terraform configurations.
    # NOTE: Path to public key required on public_key
    terraform {
      required_providers {
        verda = {
          source  = "verda-cloud/verda"
          version = "1.0.0"
        }
      }
    }

    provider "verda" {}

    resource "verda_ssh_key" "tf_ssh" {
      name       = "tf-ssh"
      public_key = file("<path_to_key.pub>")
    }

    resource "verda_startup_script" "init_vm" {
      name   = "init-vm"
      script = file("vm_scripts/vm_init.sh")
    }

    resource "verda_volume" "tf_volume" {
      name     = "terraform-volume"
      size     = 2000 # Size in GB
      type     = "NVMe"
      location = "FIN-03"
    }

    resource "verda_instance" "terraform-sglang" {
      instance_type = "4B200.120V"
      image         = "ubuntu-24.04-cuda-13.0-open-docker"
      hostname      = "terraform-sglang"
      description   = "Example instance"
      location      = "FIN-03"

      os_volume = {
        name = "terraform-os"
        size = 200
        type = "NVMe"
      }

      ssh_key_ids       = [verda_ssh_key.tf_ssh.id]
      startup_script_id = verda_startup_script.init_vm.id
      existing_volumes  = [verda_volume.tf_volume.id]
    }

    # Outputs
    output "instance_ip" {
      value = verda_instance.terraform-sglang.ip
    }

    output "instance_status" {
      value = verda_instance.terraform-sglang.status
    }
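Before planning, a configuration like this can be checked locally for formatting and internal consistency. The sketch below wraps two standard Terraform subcommands; the tf_check name and the TF_BIN variable are illustrative. Note that terraform validate expects terraform init to have been run in the directory first.

```shell
#!/usr/bin/env bash
# Local sanity check for the .tf files. tf_check and TF_BIN are illustrative names.
set -eu

TF_BIN="${TF_BIN:-terraform}"

tf_check() {
  # fmt -check fails if any file is not canonically formatted; -diff shows why.
  # validate checks syntax and references without contacting the cloud API.
  "$TF_BIN" fmt -check -diff && "$TF_BIN" validate
}
```

A non-zero exit status from tf_check means the files need attention before terraform plan is worth running.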
To find valid values for these properties, such as instance types and images, we need to query the Verda Cloud API. For this, we need the Cloud API Credentials, which we can generate in the console. See the docs.
Next, we need an access token for calling the API, which can be obtained by following the previous steps or by consulting the request documentation.
As previously mentioned, the Verda provider supports the following methods for providing credentials:
- Provider configuration (shown above in the provider block)
- Environment variables:
  - VERDA_CLIENT_ID
  - VERDA_CLIENT_SECRET
  - VERDA_BASE_URL (optional)
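As a minimal sketch of the two methods, the helpers below check that the environment variables are present and, for the provider-configuration route, forward them to Terraform through its TF_VAR_ prefix. This assumes the configuration declares input variables named verda_client_id and verda_client_secret, which are referenced in the provider block earlier but not declared in the snippets shown here. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Credential helpers. Function names are illustrative; the TF_VAR_ mapping assumes
# matching `variable` declarations exist in the Terraform configuration.
set -eu

verda_env_ready() {
  # True only when both credentials are set and non-empty.
  [ -n "${VERDA_CLIENT_ID:-}" ] && [ -n "${VERDA_CLIENT_SECRET:-}" ]
}

export_tf_vars() {
  if ! verda_env_ready; then
    echo "VERDA_CLIENT_ID and VERDA_CLIENT_SECRET must be set" >&2
    return 1
  fi
  # Terraform reads TF_VAR_<name> as the value of the input variable <name>.
  export TF_VAR_verda_client_id="$VERDA_CLIENT_ID"
  export TF_VAR_verda_client_secret="$VERDA_CLIENT_SECRET"
}
```

With an empty provider block, verda_env_ready alone is enough; export_tf_vars is only needed when the provider block reads var.verda_client_id and var.verda_client_secret.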
We provide the following snippet for requesting relevant information from the Verda API:
    #!/bin/bash
    BASE_URL="${BASE_URL:-https://api.verda.com/v1}"

    # Both credentials must be set in the environment
    : "${VERDA_CLIENT_ID:?VERDA_CLIENT_ID is not set}"
    : "${VERDA_CLIENT_SECRET:?VERDA_CLIENT_SECRET is not set}"

    # jq is needed to parse every JSON response below
    if ! command -v jq >/dev/null 2>&1; then
      echo "jq is required" >&2
      exit 1
    fi

    # Request an access token with the client credentials
    RESP=$(
      curl -sS --request POST "$BASE_URL/oauth2/token" \
        --header "Content-Type: application/json" \
        --data "{\"grant_type\":\"client_credentials\",\"client_id\":\"$VERDA_CLIENT_ID\",\"client_secret\":\"$VERDA_CLIENT_SECRET\"}"
    )
    #echo "Raw response:"
    echo "$RESP"
    TOKEN="$(echo "$RESP" | jq -r '.access_token')"
    echo
    echo "access_token:"
    echo "$TOKEN"
    echo

    # List the available instance types
    RESP=$(
      curl -sS --request GET "$BASE_URL/instance-types" \
        --header "Authorization: Bearer $TOKEN"
    )
    echo "instance types:"
    echo "$RESP" | jq -r '.[].instance_type'
    echo

    # List the available OS images
    RESP=$(
      curl -sS --request GET "$BASE_URL/images" \
        --header "Authorization: Bearer $TOKEN"
    )
    echo "images:"
    echo "$RESP" | jq -r '.[].image_type'
    echo
Once we have specified the resources for the deployment, we can preview how they will change our current Terraform state by executing terraform plan. We can then apply the changes with terraform apply.
Tip: terraform apply prints the instance_ip and instance_status outputs when it finishes. We can consult them again at any time by executing terraform output.
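Once terraform apply has finished, the output values declared in the configuration can be read back in scripts with terraform output -raw. A minimal sketch, assuming the instance accepts SSH logins as root (the benchmark transcript later in this guide shows a root prompt); ssh_target and TF_BIN are illustrative names.

```shell
#!/usr/bin/env bash
# Build an SSH target from the Terraform outputs. ssh_target and TF_BIN are
# illustrative names; the root login is an assumption about the image.
set -eu

TF_BIN="${TF_BIN:-terraform}"

ssh_target() {
  local ip
  # -raw prints the bare string value, without quotes or a trailing newline
  ip="$("$TF_BIN" output -raw instance_ip)"
  printf 'root@%s\n' "$ip"
}
```

For example, ssh "$(ssh_target)" opens a session on the freshly provisioned instance.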
DeepSeek-R1 deployment script
To deploy the NVFP4-quantized DeepSeek-R1, we use the following script, which is executed once the instance is deployed. As shown above, the script is registered as a resource: resource "verda_startup_script" "init_vm" {...}.
A Hugging Face token is required to download the model from the Hugging Face Hub:
    #!/bin/bash
    export HOST_MODEL_PATH="/mnt/local_nvme/models"
    export C_MODEL_PATH="/root/models"
    # HF_TOKEN must already be set in the environment where this script runs
    export HF_TOKEN=$HF_TOKEN

    # Create a local NVMe volume
    sudo mkdir -p /mnt/local_nvme
    sudo parted -s /dev/vdb mklabel gpt
    sudo parted -s /dev/vdb mkpart primary ext4 0% 100%
    sudo partprobe /dev/vdb
    sudo mkfs.ext4 -F /dev/vdb1
    sudo mount /dev/vdb1 /mnt/local_nvme
    sudo mkdir -p "$HOST_MODEL_PATH"

    # Docker setup (store artifacts in local NVMe)
    sudo mkdir -p /mnt/local_nvme/docker
    sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
    {
      "data-root": "/mnt/local_nvme/docker",
      "runtimes": {
        "nvidia": {
          "args": [],
          "path": "nvidia-container-runtime"
        }
      }
    }
    EOF

    # Restart docker to apply the changes
    sudo systemctl restart docker

    # sglang setup
    # check image version in https://hub.docker.com/r/lmsysorg/sglang/tags
    # The inner quotes are escaped so they reach the container's shell intact
    docker run --gpus all --shm-size 32g --network=host --name sglang_server -d --ipc=host \
      -v "$HOST_MODEL_PATH:$C_MODEL_PATH" \
      -e HF_TOKEN="$HF_TOKEN" \
      lmsysorg/sglang:dev-cu13 \
      bash -lc "
        huggingface-cli download nvidia/DeepSeek-R1-0528-NVFP4-v2 --cache-dir \"$C_MODEL_PATH\"
        exec python3 -m sglang.launch_server \
          --model-path \"$C_MODEL_PATH/models--nvidia--DeepSeek-R1-0528-NVFP4-v2/snapshots/25a138f28f49022958b9f2d205f9b7de0cdb6e18/\" \
          --served-model-name dsr1 \
          --tp 4 \
          --attention-backend trtllm_mla \
          --disable-radix-cache \
          --moe-runner-backend flashinfer_trtllm \
          --quantization modelopt_fp4 \
          --kv-cache-dtype fp8_e4m3
      "

    # Check logs:
    #   docker logs -f sglang_server
    # Execute benchmark with:
    #   docker exec -it sglang_server bash -lc "python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 1024"
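The container starts in the background, and the server only accepts requests once the model download and weight loading have finished. A small polling helper can block until then. This is a sketch under assumptions: it probes a /health route on port 30000, which we believe the SGLang HTTP server exposes but should be checked against the image version in use; wait_for_server and CURL_BIN are illustrative names.

```shell
#!/usr/bin/env bash
# Poll an HTTP endpoint until it answers successfully. The /health path in the
# usage note is an assumption about the SGLang server; names are illustrative.
set -eu

CURL_BIN="${CURL_BIN:-curl}"

wait_for_server() {
  local url attempts delay i
  url="$1"
  attempts="${2:-120}"   # number of probes before giving up
  delay="${3:-10}"       # seconds to sleep between probes
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f turns HTTP errors into a non-zero exit status
    if "$CURL_BIN" -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, wait_for_server http://localhost:30000/health && echo "server ready" can be run on the instance before starting the benchmark.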
SGLang benchmark
As a quick test, we ran the following SGLang benchmark on the deployed instance and include its results below:
    docker exec -it sglang_server bash -lc "python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 1024"

    root@terraform-sglang:~# docker exec -it sglang_server bash -lc "python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 1024"
    ======== Warmup Begin ========
    Warmup with batch_size=[16]
    #Input tokens: 16384
    #Output tokens: 256
    batch size: 16
    input_len: 1024
    output_len: 16
    latency: 0.50 s
    input throughput: 50249.39 tok/s
    output throughput: 1477.75 tok/s
    last_ttft: 0.33 s
    last generation throughput: 1337.26 tok/s
    ======== Warmup End ========
    #Input tokens: 16384
    #Output tokens: 16384
    batch size: 16
    input_len: 1024
    output_len: 1024
    latency: 12.55 s
    input throughput: 50307.77 tok/s
    output throughput: 1340.45 tok/s
    last_ttft: 0.33 s
    last generation throughput: 1337.14 tok/s
    Results are saved to result.jsonl
With the Terraform config and the startup script in place, the deployment and benchmark are reproducible end to end, and the required GPU resources are available through self-service access on Verda.
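As a rough cross-check of the benchmark numbers above, the reported output throughput can be approximated from the other fields. This assumes output throughput is close to output tokens divided by the time spent after the first token (latency minus last_ttft), which is our reading of the numbers rather than a documented formula. The helper names are illustrative.

```shell
#!/usr/bin/env bash
# Back-of-the-envelope checks on the benchmark output. Helper names are illustrative.
set -eu

# tokens / (latency - ttft), rounded to two decimals
decode_tps() {
  awk -v tokens="$1" -v latency="$2" -v ttft="$3" \
    'BEGIN { printf "%.2f\n", tokens / (latency - ttft) }'
}

# aggregate throughput split evenly across n sequences or GPUs
per_unit() {
  awk -v total="$1" -v n="$2" 'BEGIN { printf "%.2f\n", total / n }'
}

decode_tps 16384 12.55 0.33   # prints 1340.75, close to the reported 1340.45 tok/s
per_unit 1340.45 16           # prints 83.78 tok/s per sequence at batch size 16
per_unit 1340.45 4            # prints 335.11 tok/s per GPU with --tp 4
```

The small gap between 1340.75 and the reported 1340.45 is consistent with latency and last_ttft being rounded to two decimals in the printed output.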
Future work
This guide covers a single-node deployment with 4× NVIDIA B300 SXM6 GPUs. As next steps, we propose multi-component setups, where IaC tools excel:
- PD disaggregation on Kubernetes: Use Terraform to provision a k8s cluster with separate GPU node pools for prefill and decode, and then deploy SGLang in PD-disaggregated mode on top of it.
- Autoscaling policies: Scale decode replicas on queue depth / tokens-per-second, and scale prefill on bursty traffic.
- Benchmark automation: Run reproducible PD benchmarks as k8s jobs and store results and configs for apples-to-apples comparisons.