Tutorial Series

Building a Hybrid LLM Platform on EKS

Across this blog we keep referring to a hybrid LLM platform — frontier models for the hard reasoning, self-hosted open-source models for the high-volume work, all on Kubernetes. This series builds it from an empty AWS account to a working inference service, one layer at a time, as reproducible AWS CDK infrastructure you can deploy and tear down yourself.

Start with Part 1 →

4 of 8 parts published

What We're Building

The Target Architecture

                          ┌─────────────────────────┐
   client requests  ───►  │   ALB (public subnets)  │
                          └────────────┬────────────┘
                                       │
                          ┌────────────▼────────────┐
                          │  hybrid router / gateway │   ← cloud vs. local
                          │     (CPU node pool)      │
                          └──────┬─────────────┬─────┘
                                 │             │
                   frontier API  │             │  local inference
                   (egress via   │             ▼
                    NAT)         │   ┌──────────────────────┐
                                 ▼   │  vLLM model servers   │
                          ┌──────────┤   (GPU node pool)     │
                          │ Claude / │└──────────────────────┘
                          │   GPT    │
                          └──────────┘
        all of it on EKS, in private subnets, observed + autoscaled

EKS · AWS CDK · vLLM · GPU · Karpenter · OpenTelemetry · Claude · Llama

Roadmap

The Parts

Each part deploys cleanly on its own. Published parts link to the full walkthrough; the rest are on the way.

Architecture & the Network Foundation

Published

The full platform architecture and the CDK network stack it all lives in — VPC, public/private subnets, NAT, EKS discovery tags, and VPC endpoints sized for a GPU cluster.

AWS CDK · VPC · TypeScript
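
To make the "EKS discovery tags" in this part concrete: the AWS Load Balancer Controller can auto-discover subnets through the `kubernetes.io/role/elb` tag on public subnets and the `kubernetes.io/role/internal-elb` tag on private ones. The snippet below is a minimal sketch in plain TypeScript, not the series' actual CDK stack; the `SubnetKind` type and the `discoveryTags` helper are hypothetical names introduced only for illustration.

```typescript
// Hypothetical helper (not from the series): returns the tag that
// load-balancer subnet discovery looks for on each kind of subnet.
type SubnetKind = "public" | "private";

function discoveryTags(kind: SubnetKind): Record<string, string> {
  // Public subnets host internet-facing load balancers;
  // private subnets host internal ones.
  return kind === "public"
    ? { "kubernetes.io/role/elb": "1" }
    : { "kubernetes.io/role/internal-elb": "1" };
}

// Example: list the tags a private subnet would carry.
for (const [key, value] of Object.entries(discoveryTags("private"))) {
  console.log(`${key}=${value}`);
}
```

In a CDK stack, pairs like these could be attached to each subnet with `Tags.of(subnet).add(key, value)`; how the published walkthrough applies them may differ.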

The EKS Control Plane

Published

Dropping the cluster into the VPC: the EKS control plane, the OIDC provider, IAM roles, and IRSA — ending with a working kubectl connection.

EKS · IAM · IRSA · OIDC

Node Groups: CPU System Pool & GPU Pool

Published

Managed node groups for the system workloads and a GPU pool for inference — GPU AMIs, the NVIDIA device plugin, and the taints and labels that keep model servers on the right nodes.

GPU · NVIDIA · Node Groups
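
As a rough illustration of how taints keep ordinary pods off GPU nodes, here is a simplified model of the Kubernetes toleration check in plain TypeScript. It is a sketch of the scheduling rule, not code from the series: the `Taint` and `Toleration` shapes mirror the Kubernetes fields, while `tolerates`, `canSchedule`, and the `nvidia.com/gpu=true:NoSchedule` taint are assumptions chosen for the example.

```typescript
type Effect = "NoSchedule" | "PreferNoSchedule" | "NoExecute";

interface Taint { key: string; value?: string; effect: Effect; }
interface Toleration { key?: string; operator?: "Equal" | "Exists"; value?: string; effect?: Effect; }

// True when a single toleration matches a single taint.
function tolerates(t: Toleration, taint: Taint): boolean {
  // An unset effect on the toleration matches every effect.
  if (t.effect !== undefined && t.effect !== taint.effect) return false;
  const op = t.operator ?? "Equal";
  // An empty key with operator Exists tolerates every taint.
  if (t.key === undefined) return op === "Exists";
  if (t.key !== taint.key) return false;
  return op === "Exists" || (t.value ?? "") === (taint.value ?? "");
}

// A pod may land on a node only if every hard taint is tolerated.
// PreferNoSchedule is a soft preference, so it never blocks scheduling.
function canSchedule(tolerations: Toleration[], taints: Taint[]): boolean {
  return taints
    .filter((taint) => taint.effect !== "PreferNoSchedule")
    .every((taint) => tolerations.some((t) => tolerates(t, taint)));
}

// Assumed taint for the GPU pool; the series may use a different key.
const gpuTaint: Taint = { key: "nvidia.com/gpu", value: "true", effect: "NoSchedule" };

const systemPod: Toleration[] = [];
const modelServer: Toleration[] = [{ key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" }];

console.log(canSchedule(systemPod, [gpuTaint]));
console.log(canSchedule(modelServer, [gpuTaint]));
```

Taints only repel pods that lack a toleration. To pull model servers onto GPU nodes, the toleration is normally paired with a node selector or node affinity on a label, which is the other half of "taints and labels" above.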

Platform Add-ons

Published

The cluster services everything else depends on: the AWS Load Balancer Controller, ingress, and Karpenter for fast, cost-aware autoscaling of GPU capacity.

Karpenter · ALB Controller · Ingress

Serving Local Models with vLLM

Coming soon

Deploying the self-hosted inference layer — vLLM model servers, loading weights, and request-based autoscaling so GPU capacity follows demand.

vLLM · KEDA · Helm
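
This part is not published yet, so the following is only a sketch of the arithmetic behind request-based autoscaling, not its implementation. It assumes the scaling signal is the number of in-flight requests and that each replica has a target concurrency; `desiredReplicas` and all of its parameters are hypothetical names.

```typescript
// Hypothetical sketch: replicas needed so that each one handles
// at most `targetPerReplica` concurrent requests, clamped to [min, max].
function desiredReplicas(
  inFlight: number,
  targetPerReplica: number,
  min: number,
  max: number,
): number {
  if (targetPerReplica <= 0) {
    throw new Error("targetPerReplica must be positive");
  }
  const raw = Math.ceil(inFlight / targetPerReplica);
  return Math.min(max, Math.max(min, raw));
}

// 45 in-flight requests at a target of 8 per replica: ceil(45 / 8) = 6.
console.log(desiredReplicas(45, 8, 1, 10));
```

Rounding up errs on the side of spare capacity, and the clamp keeps a burst from requesting more GPU nodes than the budget allows. A `min` of 0 would allow scale-to-zero when no requests are in flight.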

The Hybrid Router

Coming soon

The gateway that makes it hybrid: routing each request to a frontier model for hard reasoning or to a local model for high-volume execution work.

Claude · GPT · Routing
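
The routing decision described above can be sketched as a small pure function. This part is not published yet, so everything here is an assumption made for illustration: the task categories, the `chooseBackend` name, and the 8192-token threshold are all hypothetical, and a real router would likely weigh more signals.

```typescript
type Backend = "frontier" | "local";

// Hypothetical request shape; the task categories are invented for this sketch.
interface RouteRequest {
  task: "reasoning" | "extraction" | "classification" | "summarization";
  promptTokens: number;
}

// Sends hard reasoning, and prompts too long for the local model, to the
// frontier API; everything else stays on the self-hosted model servers.
function chooseBackend(req: RouteRequest, localContextLimit = 8192): Backend {
  if (req.task === "reasoning") return "frontier";
  if (req.promptTokens > localContextLimit) return "frontier";
  return "local";
}

console.log(chooseBackend({ task: "classification", promptTokens: 300 }));
console.log(chooseBackend({ task: "reasoning", promptTokens: 300 }));
```

Keeping the decision in a pure function makes the routing policy easy to unit-test and to replay against logged traffic when evaluating its cost impact.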

Observability & Cost Telemetry

Coming soon

Wiring observability into the platform — OpenTelemetry traces through the router, Prometheus and Grafana for GPU and vLLM metrics, and Langfuse for per-request token and cost telemetry.

OpenTelemetry · Prometheus · Grafana · Langfuse
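
Per-request cost telemetry comes down to simple arithmetic on token counts. The sketch below assumes pricing is quoted per million tokens, with separate input and output rates; the `TokenUsage` and `Pricing` shapes, the `requestCostUsd` name, and the example prices are placeholders, not real rates and not the series' code.

```typescript
interface TokenUsage { inputTokens: number; outputTokens: number; }

// Placeholder pricing shape: USD per one million tokens.
interface Pricing { inputPerMillion: number; outputPerMillion: number; }

// Cost of one request in USD, given its token usage and a price sheet.
function requestCostUsd(usage: TokenUsage, price: Pricing): number {
  const total =
    usage.inputTokens * price.inputPerMillion +
    usage.outputTokens * price.outputPerMillion;
  return total / 1e6;
}

// Made-up prices, for illustration only.
const examplePricing: Pricing = { inputPerMillion: 3, outputPerMillion: 15 };
const cost = requestCostUsd({ inputTokens: 1000, outputTokens: 500 }, examplePricing);
console.log(cost.toFixed(4));
```

For a self-hosted model there is no per-token invoice, so a comparable figure would have to be derived another way, for example from GPU node cost divided by measured token throughput.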

Testing, Load & Examples

Coming soon

Validating the platform end-to-end — load testing the inference layer, sample workloads, and proving the routing economics under real traffic.

Load Testing · Examples

Prefer the high-level version? The companion Hybrid AI Playbook and Self-Hosting LLMs on Kubernetes cover the why behind this build.

Want This Built for Your Team?

We build hybrid LLM platforms like this one for clients — reproducible, cost-aware, and documented so your team can own it. Book a free call and we'll map the fastest path.

Book a Free Call