
Deploy DeepSeek-R1 with SGLang using Terraform (NVFP4 inference)

Rodrigo Garcia · 8 min read

This guide shows how to deploy DeepSeek-R1 using SGLang and Terraform, taking an infrastructure-as-code approach to NVFP4 inference on cloud-provisioned GPU systems.

We focus on single-node GPU inference with 4× NVIDIA B300 SXM6 262GB GPUs, local NVMe storage with Docker caching, and a reproducible benchmark run.

As a result, you will have a single Terraform config for provisioning infrastructure and a startup script for installing the runtime, downloading the model, and launching SGLang.

Why SGLang for DeepSeek-R1 inference

SGLang is the officially recommended serving framework for DeepSeek models starting from V3 and R1. The SGLang project has since extended beyond model inference into Reinforcement Learning (RL) post-training, integrating with Miles, Slime, and veRL.

Through 2025, the SGLang team focused on implementing the optimizations introduced by the DeepSeek team in the DeepSeek-V3 Technical Report. We published an in-depth overview of these optimization techniques, with a dedicated entry on Multi-Head Latent Attention.

Why Terraform for DeepSeek-R1 inference with SGLang

One of the optimization techniques leveraged by SGLang for DeepSeek inference is Prefill-Decode disaggregation (PD disaggregation), one of the most complex optimizations from a resource-orchestration perspective. Infrastructure-as-Code (IaC) tools like Terraform help manage this complexity.

Terraform is an open-source IaC tool developed by HashiCorp that enables defining, provisioning, and managing cloud and on-premises infrastructure using its declarative language (HCL) in configuration files. It allows for versioning, reuse, and automation for safe, efficient infrastructure deployment across cloud platforms.

This enables systematic deployment of resources that can scale flexibly, while keeping every resource tracked and identifiable across the system.

The core Terraform concepts include providers, resources, variables, outputs, and state, all of which appear in the configuration below.

For most use cases, we will be running combinations of these commands:

terraform init: sets up the working directory and downloads providers/modules.
terraform plan: performs a dry run, showing what would change based on the current state.
terraform apply: applies the proposed changes.
terraform destroy: deletes the managed resources.

By default, Terraform saves the current state of infrastructure and resources in a local file called terraform.tfstate. We can list the deployed resources with terraform state list.
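For reference, a typical iteration loop with these commands looks like the following (a minimal sketch; the configuration files themselves are created in the next sections):

# Typical Terraform workflow
terraform init          # set up the working directory and download the Verda provider
terraform plan          # preview the changes against the current state
terraform apply         # create or update the resources
terraform state list    # inspect the resources tracked in terraform.tfstate
terraform destroy       # tear everything down when finished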

Terraform provider installation for Verda

The Verda provider is available on the Terraform Registry, making the installation process very straightforward.

Our Terraform integration is fully compatible with OpenTofu. You can also find our provider on the OpenTofu Registry. However, we do not cover this installation method in this guide.

  1. Download and install Terraform:
# Terraform installation
wget -O - https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(grep -oP '(?<=UBUNTU_CODENAME=).*' /etc/os-release || lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
  2. Once installed, we just need to include the following in our Terraform configs:

terraform {
  required_providers {
    verda = {
      source = "verda-cloud/verda"
      version = "1.0.0"
    }
  }
}

provider "verda" {
client_id     = var.verda_client_id
client_secret = var.verda_client_secret
}

By default, our provider reads the VERDA_CLIENT_ID and VERDA_CLIENT_SECRET environment variables. If they are not defined, the credentials can be passed explicitly in the provider block, as shown above.
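For example, when using an empty provider "verda" {} block (as in the full configuration below), the credentials can be exported in the shell before running Terraform; the placeholder values are illustrative and should be replaced with the credentials generated in the Verda console:

# Credentials picked up by the Verda provider from the environment (placeholders)
export VERDA_CLIENT_ID="<your_client_id>"
export VERDA_CLIENT_SECRET="<your_client_secret>"
terraform init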

Terraform configuration for GPU instances and storage

Once the setup is complete, we create the Terraform configuration for managing resources. Terraform configurations are .tf files that describe the infrastructure required for this deployment, specifying resource properties and the dependencies between them.

The following configuration declares an SSH key, a startup script, an NVMe data volume, and the GPU instance itself, along with outputs for the instance IP and status:

# NOTE: Set the path to your SSH public key in public_key below

terraform {
  required_providers {
    verda = {
      source = "verda-cloud/verda"
      version = "1.0.0"
    }
  }
}

provider "verda" {}

resource "verda_ssh_key" "tf_ssh" {
  name       = "tf-ssh"
  public_key = file("<path_to_key.pub>")
}

resource "verda_startup_script" "init_vm" {
  name   = "init-vm"
  script = file("vm_scripts/vm_init.sh")
}

resource "verda_volume" "tf_volume" {
  name     = "terraform-volume"
  size     = 2000  # Size in GB
  type     = "NVMe"
  location = "FIN-03"
}

resource "verda_instance" "terraform-sglang" {
  instance_type = "4B200.120V"
  image         = "ubuntu-24.04-cuda-13.0-open-docker"
  hostname      = "terraform-sglang"
  description   = "Example instance"
  location      = "FIN-03"
  os_volume = {
    name = "terraform-os"
    size = 200
    type = "NVMe"
  }

  ssh_key_ids = [verda_ssh_key.tf_ssh.id]
  startup_script_id = verda_startup_script.init_vm.id
  existing_volumes = [verda_volume.tf_volume.id]
}

# Outputs
output "instance_ip" {
  value = verda_instance.terraform-sglang.ip
}

output "instance_status" {
  value = verda_instance.terraform-sglang.status
}

To find valid values for these properties, we need to consult the Verda Cloud API. For this, we need Cloud API credentials, which we can generate in the console. See the docs.

Next, we need an access token for calling the API, which can be obtained with these credentials, as shown in the snippet below, or by consulting the request documentation.

As previously mentioned, the Verda provider supports two methods for providing credentials: the VERDA_CLIENT_ID and VERDA_CLIENT_SECRET environment variables, or explicit client_id and client_secret arguments in the provider block.

The following snippet requests the relevant information (an access token, the available instance types, and images) from the Verda API:

#!/bin/bash

BASE_URL="${BASE_URL:-https://api.datacrunch.io/v1}"
# Credentials are read from the environment; fail early if they are missing
VERDA_CLIENT_ID="${VERDA_CLIENT_ID:?VERDA_CLIENT_ID is not set}"
VERDA_CLIENT_SECRET="${VERDA_CLIENT_SECRET:?VERDA_CLIENT_SECRET is not set}"

RESP=$(
 curl -sS --request POST "$BASE_URL/oauth2/token" \
   --header "Content-Type: application/json" \
   --data "{\"grant_type\":\"client_credentials\",\"client_id\":\"$VERDA_CLIENT_ID\",\"client_secret\":\"$VERDA_CLIENT_SECRET\"}"
)

#echo "Raw response:"
echo "$RESP"

if command -v jq >/dev/null 2>&1; then
 TOKEN="$(echo "$RESP" | jq -r '.access_token')"
 echo
 echo "access_token:"
 echo "$TOKEN"
fi

echo
RESP=$(
 curl -sS --request GET "$BASE_URL/instance-types" \
     --header "Authorization: Bearer $TOKEN" \
   )

echo "instances types:"
echo "$RESP" | jq -r '.[].instance_type'

echo
RESP=$(
 curl -sS --request GET "$BASE_URL/images" \
     --header "Authorization: Bearer $TOKEN" \
   )

echo "images:"
echo "$RESP" | jq -r '.[].image_type'

echo
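
Assuming the snippet is saved as verda_api_info.sh (an illustrative filename) and the credentials are exported as described above, it can be run directly; jq is required for parsing the JSON responses:

# Run the helper script (filename is illustrative); export the credentials first
chmod +x verda_api_info.sh
./verda_api_info.sh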

Once we have specified the resources for the deployment, we can preview how they will change the current Terraform state by executing terraform plan, and apply the changes with terraform apply.

Tip: After terraform apply completes, we can consult the IP and status of the created instance by running terraform output and reading the instance_ip and instance_status values.
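
For example, after the apply completes (a minimal sketch; terraform output -raw prints a value without quotes, and root is the login shown in the benchmark transcript below):

# Read the outputs and connect to the instance
terraform output instance_ip
terraform output instance_status
ssh root@"$(terraform output -raw instance_ip)"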

DeepSeek-R1 deployment script

To deploy the NVFP4-quantized DeepSeek-R1, we include the following startup script, which is executed once the instance boots. As shown above, the script is registered as a resource: resource "verda_startup_script" "init_vm" {...}.

A Hugging Face token is required to download the model from the Hugging Face Hub:

#!/bin/bash

export HOST_MODEL_PATH="/mnt/local_nvme/models"
export C_MODEL_PATH="/root/models"
export HF_TOKEN=$HF_TOKEN  # set your Hugging Face token here; the startup script does not inherit your local shell environment

# Partition, format, and mount the attached NVMe data volume (assumed to appear as /dev/vdb)
sudo mkdir -p /mnt/local_nvme
sudo parted -s /dev/vdb mklabel gpt
sudo parted -s /dev/vdb mkpart primary ext4 0% 100%
sudo partprobe /dev/vdb
sudo mkfs.ext4 -F /dev/vdb1
sudo mount /dev/vdb1 /mnt/local_nvme

mkdir -p $HOST_MODEL_PATH

# Docker setup (store artifacts in local NVMe)
sudo mkdir -p /mnt/local_nvme/docker
sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
   "data-root": "/mnt/local_nvme/docker",
   "runtimes": {
       "nvidia": {
           "args": [],
           "path": "nvidia-container-runtime"
       }
   }
}
EOF

# Restart docker to apply the changes
sudo systemctl restart docker

# sglang setup
# check image version in https://hub.docker.com/r/lmsysorg/sglang/tags
docker run --gpus all --shm-size 32g --network=host --name sglang_server -d --ipc=host \
 -v "$HOST_MODEL_PATH:$C_MODEL_PATH" \
 -e HF_TOKEN="$HF_TOKEN" \
 lmsysorg/sglang:dev-cu13 \
 bash -lc "
   huggingface-cli download nvidia/DeepSeek-R1-0528-NVFP4-v2 --cache-dir "$C_MODEL_PATH"
   exec python3 -m sglang.launch_server \
     --model-path "$C_MODEL_PATH/models--nvidia--DeepSeek-R1-0528-NVFP4-v2/snapshots/25a138f28f49022958b9f2d205f9b7de0cdb6e18/" \
     --served-model-name dsr1 \
     --tp 4 \
     --attention-backend trtllm_mla \
     --disable-radix-cache \
     --moe-runner-backend flashinfer_trtllm \
     --quantization modelopt_fp4 \
     --kv-cache-dtype fp8_e4m3
 "

# Check logs: docker logs -f sglang_server
# Execute benchmark with: docker exec -it sglang_server bash -lc "python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 1024"
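
Once the container logs show that the server is ready, the deployment can be smoke-tested from the instance. SGLang exposes an OpenAI-compatible API on port 30000; the prompt and parameters below are illustrative:

# Quick smoke test against the served model (run on the instance)
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dsr1", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'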

SGLang benchmark

As a quick test, we include the results of running the following SGLang benchmark on the deployed instance:

docker exec -it sglang_server bash -lc "python3 -m sglang.bench_one_batch_server --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 1024"
======== Warmup Begin ========
Warmup with batch_size=[16]
#Input tokens: 16384
#Output tokens: 256
batch size: 16
input_len: 1024
output_len: 16
latency: 0.50 s
input throughput: 50249.39 tok/s
output throughput: 1477.75 tok/s
last_ttft: 0.33 s
last generation throughput: 1337.26 tok/s
======== Warmup End   ========

#Input tokens: 16384
#Output tokens: 16384
batch size: 16
input_len: 1024
output_len: 1024
latency: 12.55 s
input throughput: 50307.77 tok/s
output throughput: 1340.45 tok/s
last_ttft: 0.33 s
last generation throughput: 1337.14 tok/s

Results are saved to result.jsonl

This makes the deployment and benchmark reproducible end to end, with the required GPU resources available on Verda via self-service access.

Future work

As this guide covers a single-node deployment with 4× NVIDIA B300 SXM6 GPUs, a natural next step is the multi-component setups where IaC tools excel, such as multi-node serving and Prefill-Decode disaggregated deployments.

