
Deploy llm-d for Distributed LLM Inference on DigitalOcean Kubernetes

Written on July 15, 2025 by Jeff Fan.

13 min read

This tutorial will guide you through deploying llm-d on DigitalOcean Kubernetes using our automated deployment scripts. Whether you're a DevOps engineer, ML engineer, or platform architect, this guide will help you establish a distributed LLM inference service on Kubernetes.

⏱️ Estimated Deployment Time: 15-20 minutes

📋 Tutorial Scope: This tutorial focuses on basic llm-d deployment on DigitalOcean Kubernetes with automated scripts.


Overview

llm-d is a distributed LLM inference framework designed for Kubernetes environments, featuring disaggregated serving architecture and intelligent resource management. On DigitalOcean Kubernetes, you can deploy llm-d to achieve:

  1. Disaggregated LLM Inference
    Separate prefill (context processing) and decode (token generation) stages across different GPU nodes.

  2. GPU Resource Management
    Automatic GPU resource allocation with support for NVIDIA RTX 4000 Ada, RTX 6000 Ada, and L40S cards.

  3. Kubernetes-Native Architecture
    Cloud-native design with proper service discovery and resource management.


What is llm-d?

llm-d is a next-generation distributed LLM inference platform designed specifically for Kubernetes environments. Unlike traditional single-node solutions, llm-d brings distributed computing capabilities to LLM inference.

[Image: llm-d architecture]

Understanding Disaggregated LLM Inference

Think of the difference between fast fashion retail and bespoke tailoring - this perfectly captures the fundamental differences between traditional web applications and LLM inference:

Traditional Web Applications vs. LLM Inference:

| Comparison Aspect | Traditional Web Apps (Fast Fashion) | LLM Inference (Bespoke Tailoring Workshop) |
| --- | --- | --- |
| Service Process | Store displays S·M·L standard sizes, customers grab and checkout | Measurement → Pattern Making → Fitting → Alterations → Delivery |
| Request Lifespan | Milliseconds to seconds (instant checkout) | Seconds to minutes (stitch-by-stitch execution) |
| Resource Requirements | Similar fabric and manufacturing time per item | Vastly different fabric usage and handcraft time per suit |
| Statefulness | Staff don't remember your previous purchases | Tailor remembers your measurements and preferences |
| Cost | Low unit price, mass production | High unit price, precision handcraft |

Traditional LLM Serving = "One-Person-Does-Everything Tailor"

Problems with this approach:

  • Resource Imbalance: Some customers need simple hem adjustments, others want full custom suits - workload varies dramatically
  • Fabric Waste: Each customer monopolizes a pile of fabric, no sharing of leftover pieces
  • Queue Blocking: Complex orders in front block quick alterations behind

llm-d's Disaggregated Approach = "Modern Bespoke Tailoring Production Line"

| Station | Process Analogy | Specialized Optimization |
| --- | --- | --- |
| Prefill Station | Measurement + Pattern Making Room | Highly parallel computation, CPU/GPU collaboration |
| Decode Station | Sewing Room | Sequential output focus, maximum memory bandwidth |
| Smart Gateway | Master Tailor Manager | Dynamic order assignment based on KV Cache and load |

Benefits Achieved:

  1. Fabric (KV Cache) Sharing: Similar pattern orders concentrated for high hit rates
  2. Request Shape Optimization: Hem alterations express lane, formal wear slow lane - each takes its own path
  3. Independent Scaling: Add more pattern makers during measurement season, more sewers during delivery season
  4. GPU Memory Efficiency: Measurement phase needs compute-heavy/memory-light; sewing phase needs the opposite - separation allows each to take what it needs

One-Line Summary: Fast fashion emphasizes "grab and go"; bespoke tailoring pursues "measured perfection." llm-d separates measurement from sewing, with intelligent master tailor coordination, making AI inference both personalized and efficient.


Tutorial Steps

Step 1: Clone the Repository and Setup Environment

First, let's get the llm-d deployer repository and set up our environment:

```bash
# Clone the llm-d deployer repository
git clone https://github.com/iambigmomma/llm-d-deployer.git
cd llm-d-deployer/quickstart/infra/doks-digitalocean
```

Prerequisites

  • DigitalOcean account with GPU quota enabled
  • doctl CLI installed and authenticated
  • kubectl installed
  • helm installed
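Before moving on, you can quickly confirm the tooling is installed and authenticated:

```bash
# Confirm the CLIs are installed
doctl version
kubectl version --client
helm version

# Confirm doctl can reach your DigitalOcean account
doctl account get
```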

Set Required Environment Variables

```bash
# Set your HuggingFace token (required for model downloads)
export HF_TOKEN=hf_your_token_here

# Verify doctl is authenticated
doctl auth list
```

🔐 Important: Model Access Requirements

For Meta Llama Models (Llama-3.2-3B-Instruct):

The meta-llama/Llama-3.2-3B-Instruct model used in this tutorial requires special access:

  1. HuggingFace Account Required: You must have a HuggingFace account
  2. Model Access Request: Visit Llama-3.2-3B-Instruct on HuggingFace
  3. Accept License Agreement: Click "Agree and access repository" and complete the license agreement
  4. Wait for Approval: Access approval is usually granted within a few hours
  5. Generate Access Token: Create a HuggingFace access token with "Read" permissions from your Settings > Access Tokens
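Once your access request is approved and your token is created, a quick way to sanity-check it is to query the Hugging Face Hub API for the model. This is only a heuristic: the exact response fields can vary, but an authentication or gated-repo error means the approval or token is not yet in place.

```bash
# Returns model metadata as JSON if your token has access to the gated repo
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  https://huggingface.co/api/models/meta-llama/Llama-3.2-3B-Instruct | jq .
```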

Alternative Open Models (No License Required):

If you prefer to avoid the approval process, consider these open alternatives:

  • google/gemma-2b-it - Google's open instruction-tuned model
  • Qwen/Qwen2.5-3B-Instruct - Alibaba's multilingual model
  • microsoft/Phi-3-mini-4k-instruct - Microsoft's efficient small model

To use alternative models, you'll need to modify the deployment configuration files accordingly.

Step 2: Create DOKS Cluster with GPU Nodes

Our automated script will create a complete DOKS cluster with both CPU and GPU nodes:

```bash
# Run the automated cluster setup script
./setup-gpu-cluster.sh -c
```

The script will:

  1. Create a new DOKS cluster with CPU nodes
  2. Add a GPU node pool with your chosen GPU type
  3. Install NVIDIA Device Plugin for GPU support
  4. Configure proper node labeling and GPU resource management

Choose Your GPU Type

When prompted, select your preferred GPU type:

  • RTX 4000 Ada: Cost-effective for smaller models (7B-13B parameters)
  • RTX 6000 Ada: Balanced performance for medium models (13B-34B parameters)
  • L40S: Maximum performance for large models (70B+ parameters)
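If you want to see which node sizes (including GPU slugs) are available to your account before choosing, doctl can list them; the GPU sizes you actually see depend on your quota and region:

```bash
# List DOKS node sizes; GPU sizes typically include "gpu" in the slug
doctl kubernetes options sizes | grep -i gpu
```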


Verify Cluster Setup

```bash
# Check cluster status
kubectl get nodes

# Verify GPU nodes are ready
kubectl get nodes -l doks.digitalocean.com/gpu-brand=nvidia

# Check GPU resources are available
kubectl describe nodes -l doks.digitalocean.com/gpu-brand=nvidia | grep nvidia.com/gpu
```

You should see output similar to:

```bash
NAME             STATUS   ROLES    AGE   VERSION
pool-gpu-xxxxx   Ready    <none>   3m    v1.31.1
pool-gpu-yyyyy   Ready    <none>   3m    v1.31.1
```
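You can also confirm the NVIDIA Device Plugin is running; the exact DaemonSet name and namespace depend on how the setup script installs it, so the grep below is just a quick heuristic:

```bash
# Look for the NVIDIA device plugin DaemonSet (name and namespace may vary)
kubectl get daemonsets --all-namespaces | grep -i nvidia
```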

🔄 If the Setup Script Stops Unexpectedly

This is completely normal! DigitalOcean API calls may occasionally time out during node provisioning. If you see the script stop after creating the GPU node pool:

  1. Wait 30 seconds for the API operations to complete
  2. Re-run the same command:
    ./setup-gpu-cluster.sh
  3. The script will automatically detect existing components and continue from where it left off
  4. No duplicate resources will be created - the script is designed to be safely re-run

The script has intelligent state detection and will skip already completed steps, making it completely safe to re-run multiple times.

Step 3: Deploy llm-d Infrastructure

Now let's deploy llm-d using our automated deployment scripts. This is a two-step process for better reliability and troubleshooting:

Step 3A: Deploy llm-d Core Components

First, let's deploy the core llm-d inference services:

```bash
# Deploy llm-d with your chosen GPU configuration
./deploy-llm-d.sh -g rtx-6000-ada -t your_hf_token
```

What Gets Deployed:

  • Prefill Service: Handles context processing on GPU pods
  • Decode Service: Manages token generation with GPU optimization
  • Gateway Service: Routes requests and manages load balancing
  • Redis Service: Provides KV cache storage

Step 3B: Setup Monitoring (Optional)

After llm-d is running, you can optionally set up comprehensive monitoring:

```bash
# Navigate to monitoring directory
cd monitoring

# Setup Prometheus, Grafana, and llm-d dashboards
./setup-monitoring.sh
```

Monitoring Components:

  • Prometheus: Metrics collection and storage
  • Grafana: Visualization dashboards and alerts
  • llm-d Dashboard: Custom inference performance dashboard
  • ServiceMonitor: Automatic llm-d metrics discovery
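Besides Grafana, you can reach the Prometheus UI directly for ad-hoc queries. The service name below assumes a kube-prometheus-stack release named prometheus (consistent with the prometheus-grafana service used later in this tutorial); adjust it if your release is named differently:

```bash
# Port-forward the Prometheus UI (service name is an assumption based on the release name)
kubectl port-forward -n llm-d-monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Then open http://localhost:9090 in your browser
```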

Monitor Deployment Progress

```bash
# Watch llm-d deployment progress
kubectl get pods -n llm-d -w

# Check all components are running
kubectl get all -n llm-d
```

Wait until all pods show Running status:

```bash
NAME                                           READY   STATUS    RESTARTS   AGE
meta-llama-llama-3-2-3b-instruct-decode-xxx    1/1     Running   0          3m
meta-llama-llama-3-2-3b-instruct-prefill-xxx   1/1     Running   0          3m
llm-d-inference-gateway-xxx                    1/1     Running   0          3m
redis-xxx                                      1/1     Running   0          3m
```

Monitor Setup Progress (If Step 3B was completed)

```bash
# Check monitoring stack status
kubectl get pods -n llm-d-monitoring

# Access Grafana dashboard
kubectl port-forward -n llm-d-monitoring svc/prometheus-grafana 3000:80
```

Step 4: Test Your llm-d Deployment

Now let's test that everything is working correctly using our test script:

```bash
# Navigate to the test directory
cd /path/to/llm-d-deployer/quickstart

# Run the automated test
./test-request.sh
```


Manual Testing (Alternative)

If you prefer to test manually:

```bash
# Port-forward to the gateway service
kubectl port-forward -n llm-d svc/llm-d-inference-gateway-istio 8080:80 &

# Test the API with a simple request
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [
      {"role": "user", "content": "Explain Kubernetes in simple terms"}
    ],
    "max_tokens": 150,
    "stream": false
  }' | jq
```

Expected Response

You should see a successful JSON response like:

{ "choices": [ { "finish_reason": "length", "index": 0, "logprobs": null, "message": { "content": "Kubernetes (also known as K8s) is an open-source container orchestration system for automating the deployment, scaling, and management of containerized applications...", "reasoning_content": null, "role": "assistant", "tool_calls": [] }, "stop_reason": null } ], "created": 1752523066, "id": "chatcmpl-76c2a86b-5460-4752-9f20-03c67ca5b0ba", "kv_transfer_params": null, "model": "meta-llama/Llama-3.2-3B-Instruct", "object": "chat.completion", "prompt_logprobs": null, "usage": { "completion_tokens": 150, "prompt_tokens": 41, "prompt_tokens_details": null, "total_tokens": 191 } }
json
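If the port-forward from the manual test is still running, you can also query the OpenAI-compatible /v1/models endpoint to confirm which model the gateway is serving; this endpoint is standard for vLLM-based backends, though the exact response fields may vary:

```bash
# List the models exposed through the gateway
curl -s localhost:8080/v1/models | jq .
```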

Step 5: Access Monitoring and Dashboard

If you completed Step 3B (monitoring setup), you can access the comprehensive monitoring dashboard:

```bash
# Port-forward to Grafana
kubectl port-forward -n llm-d-monitoring svc/prometheus-grafana 3000:80

# Get admin password
kubectl get secret prometheus-grafana -n llm-d-monitoring -o jsonpath="{.data.admin-password}" | base64 -d
```

Grafana Access: http://localhost:3000
Username: admin
Password: (from the command above)

llm-d Dashboard and Key Metrics

After monitoring setup, you'll find:

  • Dashboard Location: Look for "llm-d" folder in Grafana
  • Dashboard Name: "llm-d Inference Gateway"

The dashboard may take 1-2 minutes to appear as it's loaded by Grafana's sidecar.

📊 Important Metrics to Monitor

Request Performance Metrics:

  • Time to First Token (TTFT): Critical for user experience - measures how quickly the first response token is generated
  • Inter-Token Latency (ITL): Speed of subsequent token generation - affects perceived responsiveness
  • Requests per Second (RPS): Overall system throughput
  • Request Duration: End-to-end request completion time

Resource Utilization Metrics:

  • GPU Memory Usage: Monitor GPU memory consumption across prefill and decode pods
  • GPU Utilization: Actual compute usage percentage of GPUs
  • KV Cache Hit Rate: Percentage of requests benefiting from cached computations
  • Queue Depth: Number of pending requests waiting for processing

llm-d Specific Metrics:

  • Prefill vs Decode Load Distribution: Balance between processing phases
  • Cache-Aware Routing Effectiveness: Success rate of intelligent request routing
  • Model Loading Time: Time to load models into GPU memory
  • Token Generation Rate: Tokens produced per second per GPU

Kubernetes Metrics:

  • Pod Autoscaling Events: HPA scaling decisions and timing
  • Node Resource Pressure: CPU, memory, and GPU pressure on nodes
  • Network Throughput: Inter-pod communication for disaggregated serving

Performance Optimization Indicators:

  • Batch Size Utilization: How well requests are batched for efficiency
  • Context Length Distribution: Understanding of typical request patterns
  • Failed Request Rate: Error rates and their causes

These metrics help you:

  • Optimize Performance: Identify bottlenecks in prefill vs decode stages
  • Right-Size Resources: Balance cost and performance based on actual usage
  • Troubleshoot Issues: Quickly identify problems with specific components
  • Plan Capacity: Predict future resource needs based on traffic patterns
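If you want to look at the raw numbers behind these dashboards, you can scrape a decode pod directly. This is a minimal sketch: it assumes the vLLM-based decode pods expose Prometheus metrics on port 8000 at /metrics and carry the app=decode label used elsewhere in this tutorial, and the exact metric names depend on the vLLM version bundled with llm-d.

```bash
# Port-forward the first decode pod's metrics port (assumed to be 8000)
kubectl port-forward -n llm-d "$(kubectl get pods -n llm-d -l app=decode -o name | head -n 1)" 8000:8000 &

# Peek at latency, cache, and queue metrics (names may differ across vLLM versions)
curl -s localhost:8000/metrics | grep -E "vllm:(time_to_first_token|gpu_cache_usage|num_requests)" | head -n 20
```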

Common Issues and Solutions

Setup Script Stops After GPU Node Pool Creation

Symptoms: Script terminates after "GPU node pool created successfully"
Cause: DigitalOcean API response delays during node provisioning (this is normal!)
Solution:

```bash
# Wait 30 seconds, then re-run the script
./setup-gpu-cluster.sh

# The script will automatically continue from where it left off
# No duplicate resources will be created
```

GPU Pod Scheduling Issues

Symptoms: Pods stuck in Pending state
Solution: Check GPU node availability and resource requests

```bash
kubectl describe pods -n llm-d | grep -A 5 "Events:"
```
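It also helps to confirm that the GPU nodes are actually advertising nvidia.com/gpu resources and to see where the llm-d pods are trying to schedule:

```bash
# Confirm GPU nodes expose nvidia.com/gpu capacity
kubectl describe nodes -l doks.digitalocean.com/gpu-brand=nvidia | grep nvidia.com/gpu

# See which node (if any) each llm-d pod landed on
kubectl get pods -n llm-d -o wide
```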

Model Download Failures

Symptoms: Pods showing download errors
Solution: Verify HF_TOKEN is set correctly

```bash
kubectl logs -n llm-d -l app=decode
```
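You can also double-check that the token actually made it into the cluster. The secret and environment variable names depend on the Helm chart, so treat these commands as a starting point:

```bash
# Look for a HuggingFace-related secret in the namespace (name depends on the chart)
kubectl get secrets -n llm-d

# Confirm the decode pods reference an HF token in their environment
kubectl describe pod -n llm-d -l app=decode | grep -i -A 2 "HF_TOKEN"
```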

Service Connectivity Issues

Symptoms: API requests failing
Solution: Check all pods are running and services are available

```bash
kubectl get pods -n llm-d
kubectl get svc -n llm-d
```
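If the pods and services look healthy but requests from your machine still fail, you can rule out the port-forward by calling the gateway from inside the cluster. This sketch reuses the gateway service name from the manual test above (llm-d-inference-gateway-istio, listening on port 80):

```bash
# Run a throwaway curl pod inside the cluster and hit the gateway service directly
kubectl run curl-test --rm -it --restart=Never --image=curlimages/curl -n llm-d -- \
  curl -s http://llm-d-inference-gateway-istio/v1/models
```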

Dashboard Not Appearing in Grafana

Symptoms: llm-d dashboard not visible in Grafana after running monitoring setup
Solution: Check the dashboard ConfigMap and Grafana sidecar

```bash
# Check if dashboard ConfigMap exists
kubectl get configmap llm-d-dashboard -n llm-d-monitoring

# Check ConfigMap labels
kubectl get configmap llm-d-dashboard -n llm-d-monitoring -o yaml | grep grafana_dashboard

# If missing, re-run monitoring setup
cd monitoring && ./setup-monitoring.sh
```

Next Steps

Congratulations! You now have a working llm-d deployment on DigitalOcean Kubernetes. Your deployment includes:

  • DOKS Cluster: With CPU and GPU nodes properly configured
  • llm-d Services: Prefill, decode, gateway, and Redis running
  • GPU Support: NVIDIA Device Plugin configured for GPU scheduling
  • Working API: Tested and confirmed LLM inference capability

What You Can Do Next

  • Scale Your Deployment: Add more GPU nodes or increase pod replicas
  • Deploy Different Models: Use different model configurations
  • Monitor Performance: Use Grafana dashboards to track metrics
  • Integrate with Applications: Use the OpenAI-compatible API in your applications
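For example, a first scaling experiment could be adding a second decode replica. The deployment name below is inferred from the pod names shown earlier and may differ in your chart, so list the deployments first; also note that each extra replica needs a free GPU, so you may have to grow the GPU node pool beforehand.

```bash
# Confirm the actual deployment names in your cluster
kubectl get deployments -n llm-d

# Scale the decode deployment (name inferred from the pod listing above) to 2 replicas
kubectl scale deployment meta-llama-llama-3-2-3b-instruct-decode -n llm-d --replicas=2

# Watch the new pod schedule onto a GPU node
kubectl get pods -n llm-d -w
```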

Cleanup (Optional)

When you're done experimenting, you have two cleanup options:

Option 1: Remove Only llm-d Components (Keep Cluster)

If you want to keep your DOKS cluster but remove llm-d components:

```bash
# Navigate back to the deployment directory
cd /path/to/llm-d-deployer/quickstart/infra/doks-digitalocean

# Remove llm-d components using the uninstall flag
./deploy-llm-d.sh -u

# Optionally remove monitoring (if installed)
# kubectl delete namespace llm-d-monitoring
```

This will:

  • Remove all llm-d pods and services
  • Delete the llm-d namespace
  • Keep monitoring components (if installed separately)
  • Keep your DOKS cluster and GPU nodes intact for future use
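After the uninstall finishes, a quick check confirms the llm-d namespace is gone while the cluster and its GPU nodes remain available:

```bash
# The llm-d namespace should eventually report NotFound
kubectl get namespace llm-d

# The cluster and GPU nodes are still there for future workloads
kubectl get nodes
```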

Option 2: Delete Entire Cluster

If you want to remove everything including the cluster:

```bash
# Delete the cluster (this will remove all resources)
doctl kubernetes cluster delete llm-d-cluster
```

💡 Tip: Use Option 1 if you plan to experiment with different llm-d configurations or other Kubernetes workloads on the same cluster. Use Option 2 for complete cleanup when you're finished with all experiments.



Happy deploying with llm-d on Kubernetes! 🚀
