Serverless GPU containers for AI workloads

Deploy inference services or run batch compute on scalable GPU infrastructure, from scale-to-zero to multi-GPU priority deployments.

  • Auto-scaling
  • Cost-efficient
  • Wide GPU selection

Serverless on Verda

Purpose-built for serverless deployments across the ML model lifecycle.

Verda Containers dashboard listing deployments with their status, image, compute, and replica health.

Unified AI cloud

One platform for serverless deployments, batch jobs, and other use cases across the entire ML model lifecycle.

Scaling settings with replica limits and Instant, Balanced, and Cost saver queue-load policies.

Tunable scaling sensitivity

Control how aggressively replicas scale in response to queue length. Tighten it for latency-sensitive services or relax it for cost-sensitive ones.
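
To illustrate what a scaling-sensitivity setting does, here is a minimal sketch that derives a replica count from queue length. The policy names mirror the Instant, Balanced, and Cost saver options; the function name, the per-replica queue targets, and the default replica limits are assumptions made for illustration, not Verda's actual algorithm or values.

```python
import math

# Hypothetical targets: how many queued requests one replica may hold before
# another replica is requested. Lower means more aggressive scaling.
# These numbers are illustrative assumptions, not Verda's real settings.
TARGET_QUEUE_PER_REPLICA = {
    "instant": 1,
    "balanced": 4,
    "cost_saver": 16,
}


def desired_replicas(queue_length: int, policy: str,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Return a replica count for the queue length, clamped to the limits."""
    if queue_length < 0:
        raise ValueError("queue_length must be non-negative")
    target = TARGET_QUEUE_PER_REPLICA[policy]
    wanted = math.ceil(queue_length / target)
    return max(min_replicas, min(max_replicas, wanted))
```

Under these assumed targets, a queue of 8 requests asks for 8 replicas on the most sensitive policy, 2 on the middle one, and 1 on the most relaxed one. An empty queue with a minimum of 0 scales to zero.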

GPU VRAM timeseries chart showing used and free memory with a hover tooltip.

Real-time metrics

Replica count, GPU and CPU utilization, request rates, inference duration, and queue size — streamed to the UI or to your Prometheus or Loki stack.

Code window showing a fetch call to the Verda container-deployments API.

API and SDK

Deploy and manage containers from the UI, CLI, API, or the native SDK. Pick the interface that fits your workflow.
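
As a rough idea of what an API-driven deployment could look like, the sketch below builds an HTTP request without sending it. Everything specific here is a placeholder: the base URL, the `/container-deployments` path, the payload field names, and the bearer-token header are assumptions for illustration, so check the Verda API reference for the real endpoint and schema.

```python
import json
import urllib.request

# Placeholder base URL: this is NOT a real Verda endpoint.
API_BASE = "https://api.example.invalid/v1"


def build_deployment_request(token: str, name: str, image: str,
                             gpu: str, gpu_count: int = 1) -> urllib.request.Request:
    """Build (but do not send) a hypothetical 'create deployment' request."""
    # Field names below are illustrative assumptions, not the documented schema.
    payload = {
        "name": name,
        "image": image,
        "compute": {"gpu": gpu, "count": gpu_count},
    }
    return urllib.request.Request(
        url=f"{API_BASE}/container-deployments",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The sketch stops short of that because the URL is a placeholder.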

Quickstart

Point Verda at your container image, pick a GPU, and we handle the rest. Works with any registry, any framework.

You get an endpoint, per-replica metrics, real-time logs, and a bill that scales with your traffic. Nothing to configure, nothing to provision.

Deployment types

Choose the service that suits your use case best

| | Serverless (auto-scaling) | Batch |
|---|---|---|
| Runs indefinitely | While traffic exists | No |
| Scales to zero | Yes, when idle | After completion |
| Exposes endpoint | Yes | Optional |
| Typical duration | Milliseconds to minutes per request | Minutes to hours per job |
| Cold start on request | On first request after idle | On job dispatch |
| Best for | Interactive inference, user-facing APIs, bursty or unpredictable traffic | Long-running compute, offline inference over large datasets, periodic pipelines |

Usage-based pricing

Per-replica billing in 10-minute intervals. Interruptible spot pricing at roughly 50% of on-demand. Multi-GPU configurations are available in 1×, 2×, and 4×.

| Compute | Memory | Configurations | Price |
|---|---|---|---|
| B300 SXM6 | 268GB VRAM | 1x, 2x, 4x, 8x GPUs | $8.250/h per GPU |
| B200 SXM6 | 180GB VRAM | 1x, 2x, 4x, 8x GPUs | $6.721/h per GPU |
| H200 SXM5 | 141GB VRAM | 1x, 2x, 4x, 8x GPUs | $4.400/h per GPU |
| H100 SXM5 | 80GB VRAM | 1x, 2x, 4x, 8x GPUs | $3.575/h per GPU |
| RTX PRO 6000 | 96GB VRAM | 1x, 2x, 4x, 8x GPUs | $2.079/h per GPU |
| L40S | 48GB VRAM | 1x, 2x, 4x, 8x GPUs | $1.507/h per GPU |
| AMD EPYC | 32–128GB RAM | 8x, 16x, 32x CPUs | From $0.0614/h |

Zero setup

Deploy any container from any registry — Docker Hub, GitHub Container Registry, or your own — in a few clicks. Start, stop, or hibernate instantly from the UI, CLI, or API.

Controlled auto-scaling

Scale to zero when idle or up to hundreds of GPUs during traffic spikes. Adjustable scaling sensitivity and multi-GPU priority deployments give you precise control over how your workload responds to load.

Fast cold starts

Our in-house AI research team tunes container pulls, model loads, and GPU warmups so your first request doesn't wait around. Cold starts are a first-class engineering problem at Verda, not an afterthought.

Wide GPU selection

Run on cutting-edge NVIDIA compute across the Blackwell, Hopper, and Ampere generations — including B300 SXM6, H200, H100, A100, and RTX PRO 6000. Available in 1×, 2×, and 4× configurations.

Pay-per-use

Pay only for the compute you actively use, billed in 10-minute intervals. No idle charges, no commitments, no surprise bills. Spot pricing cuts the bill roughly in half for interruptible workloads.

Full observability

Real-time logs and detailed metrics on utilization, request rates, inference duration, and queue size — in the Verda console or as endpoints for your existing Prometheus, Loki, or Grafana stack.

Already on Runpod?

Most Runpod serverless deployments move to Verda in under an hour.

We've written a step-by-step migration guide that covers image compatibility, endpoint conventions, environment variables, and scaling equivalents. Your container, unchanged — just a different platform underneath.

Read the migration guide

FAQs

How is this different from running my own Kubernetes cluster?

You bring a container image; we run it. There's no cluster to stand up, no node pools to size, no GPU operator to install, and no autoscaler to tune. Verda handles provisioning, scaling to zero, cold starts, and per-replica metrics — you get an endpoint and a bill that tracks your traffic.

How does Verda compare to Runpod or Modal?

Different priorities. Runpod and Modal are US-based; Verda runs on European data centers with renewable energy, which matters for teams with data-residency or sustainability requirements.

Our engineering team pairs directly with customer teams, whereas Runpod optimizes for self-service and Modal for code-first Python workflows.

On hardware, we bring early access to NVIDIA's latest silicon (B300 SXM6, GB300 NVL72) as an NVIDIA Preferred Partner. For interactive inference at production scale, all three can do the job; the fit depends on what you need from the team behind the platform.

Where are your data centers, and is my data sovereign?

Verda runs on European data centers powered by renewable energy. Your data and workloads stay in-region, which matters for teams with data-residency or sustainability requirements.

What's the difference between serverless deployments and batch jobs?

Serverless deployments auto-scale to live traffic, expose an endpoint, and scale to zero when idle — best for interactive inference and user-facing APIs. Batch jobs run to completion on dispatch — best for long-running compute, offline inference over large datasets, and periodic pipelines.

Do I pay when nothing is running?

No. Serverless deployments scale to zero when idle, and you're only billed for the compute you actively use, in 10-minute intervals — no idle charges.

How am I billed?

Per replica, in 10-minute intervals, for the compute you actively use. Spot pricing runs at roughly 50% of on-demand for interruptible workloads. No commitments, no surprise bills.
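
The billing rule can be turned into a quick estimate. This is a sketch under stated assumptions, not an official calculator: it assumes a partially used 10-minute interval is billed as a full one, that the listed hourly price is the rate that applies, and that spot is exactly 50% of that rate (the page says "roughly"). The function and parameter names are illustrative.

```python
import math

INTERVAL_MINUTES = 10  # billing granularity stated on this page


def estimate_cost(active_minutes: float, price_per_gpu_hour: float,
                  gpus_per_replica: int = 1, replicas: int = 1,
                  spot: bool = False) -> float:
    """Estimate the cost in dollars for replicas active for the given minutes."""
    # Assumption: partial intervals round up to the next full interval.
    intervals = math.ceil(active_minutes / INTERVAL_MINUTES)
    billed_hours = intervals * INTERVAL_MINUTES / 60
    # Assumption: spot is treated as exactly half the listed rate.
    rate = price_per_gpu_hour * (0.5 if spot else 1.0)
    return billed_hours * rate * gpus_per_replica * replicas
```

For example, under these assumptions 25 active minutes on one single-GPU H100 replica at $3.575/h is billed as three intervals (30 minutes), or about $1.79.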

Can I use any container registry?

Yes — deploy any container from any registry, including Docker Hub, GitHub Container Registry, or your own private registry.

How fast are cold starts?

Our in-house AI research team tunes container pulls, model loads, and GPU warmups to keep first-request latency low. Cold starts are a first-class engineering problem at Verda, not an afterthought.

What's the SLA?

Production deployments are backed by our uptime SLA. Talk to sales for specifics on availability guarantees and enterprise terms.

Built in Europe, trusted globally

One platform, the full AI lifecycle

From rapid prototyping to foundation training and scalable inference, on a single full-stack AI cloud.