VoxPlatform

2026 · Go · K8s · LLMs · Observability · GCP · AI

Go Python AI K8s LLMs

VoxPlatform is a self-hosted inference platform for voice AI models, running on GKE and managed through Kubernetes custom resources. Instead of writing Deployments and Services by hand for each model, you write this:

apiVersion: vox.io/v1alpha1
kind: VoiceModel
metadata:
  name: whisper-small
spec:
  model: Systran/faster-whisper-small.en
  replicas: 1
  device: cpu

The operator watches for these resources, creates the Deployment, Service, and monitoring config automatically, and reports status back on the VoiceModel resource. Delete the VoiceModel and everything it owns is cleaned up. Change the model and the Deployment goes through a rolling update.

The full stack includes a Go API gateway with Prometheus metrics and structured logging, a Python client SDK, and an evaluation harness that computes Word Error Rate against ground truth datasets.

Architecture

The request flow is straightforward: Client (Python SDK/CLI) -> Gateway (Go, :8080) -> Whisper Model Server (:8000) -> Transcription Response

The gateway adds the concerns the model server doesn’t care about: request validation, request ID propagation for distributed tracing, Prometheus metrics (latency histograms, request counters, in-flight gauges), and structured JSON logging that Cloud Logging can parse.

The infrastructure underneath is Terraform-provisioned GCP: a custom VPC with VPC-native networking for pod IPs, GKE Standard on the REGULAR release channel with Workload Identity, Artifact Registry for container images, and Cloud NAT for private node outbound access. Everything is modular — each Terraform module can be applied independently.

The Operator: From Deployment to Platform

The CRD design went through a few iterations. The first version had too many fields — I was trying to expose every possible knob. The final version has a small spec surface with sensible defaults:

import corev1 "k8s.io/api/core/v1"

// VoiceModelSpec describes the desired state of a VoiceModel.
type VoiceModelSpec struct {
    Model        string                       // Required: which model to serve
    Replicas     *int32                       // Default: 1
    Device       string                       // "cpu" or "gpu"; default: "cpu"
    Quantization string                       // Default: "int8"
    Resources    *corev1.ResourceRequirements // Default: based on device
}
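As a rough illustration of how those defaults might be filled in before reconciliation, here is a dependency-free sketch. The `applyDefaults` helper and the trimmed-down struct (no `Resources` field) are hypothetical, and `"cpu"` as the default device is an assumption rather than something taken from the operator's code.

```go
package main

import "fmt"

// VoiceModelSpec is a simplified stand-in for the real spec; the
// Resources field is omitted to keep the sketch dependency-free.
type VoiceModelSpec struct {
	Model        string
	Replicas     *int32
	Device       string
	Quantization string
}

// applyDefaults (hypothetical helper) fills in optional fields that
// were left unset. The default device of "cpu" is an assumption.
func applyDefaults(spec *VoiceModelSpec) {
	if spec.Replicas == nil {
		one := int32(1)
		spec.Replicas = &one
	}
	if spec.Device == "" {
		spec.Device = "cpu"
	}
	if spec.Quantization == "" {
		spec.Quantization = "int8"
	}
}

func main() {
	spec := VoiceModelSpec{Model: "Systran/faster-whisper-small.en"}
	applyDefaults(&spec)
	fmt.Println(*spec.Replicas, spec.Device, spec.Quantization)
}
```

Using a pointer for `Replicas` lets the defaulting code tell "not set" apart from an explicit value, which a plain `int32` cannot do.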

The operator uses owner references for garbage collection — when a VoiceModel is deleted, Kubernetes automatically cleans up the owned Deployment and Service. The reconciler is level-triggered: it doesn’t care whether it was invoked because of a create, update, or periodic resync. It just compares desired state against actual state and fixes any drift.
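The level-triggered idea can be sketched without any Kubernetes libraries. The `State` type and `reconcile` function below are hypothetical simplifications: the real reconciler works against Deployments and Services, but the shape of the decision is the same, and nothing in it depends on which event caused the call.

```go
package main

import "fmt"

// State is a simplified view of what the operator manages for one VoiceModel.
type State struct {
	Exists   bool
	Image    string
	Replicas int32
}

// reconcile (hypothetical) compares desired and actual state and returns
// the action needed to converge. It never asks which event triggered it,
// so calling it repeatedly with the same inputs is safe.
func reconcile(desired, actual State) string {
	switch {
	case !actual.Exists:
		return "create"
	case actual.Image != desired.Image || actual.Replicas != desired.Replicas:
		return "update"
	default:
		return "noop"
	}
}

func main() {
	desired := State{Exists: true, Image: "whisper:small", Replicas: 1}
	drifted := State{Exists: true, Image: "whisper:small", Replicas: 3}

	fmt.Println(reconcile(desired, State{})) // create
	fmt.Println(reconcile(desired, drifted)) // update
	fmt.Println(reconcile(desired, desired)) // noop
}
```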

The status tracks lifecycle phases (Pending -> Deploying -> Ready -> Failed) with human-readable messages. When something goes wrong - insufficient CPU, image pull failure, crash loop - the status tells you why without needing to dig through pod events.
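One way to derive the phase and message is a single function over what the reconciler observed. This is a sketch under assumptions: the `Observed` struct, its field names, and the message wording are invented for illustration and are not the operator's actual status code.

```go
package main

import "fmt"

type Phase string

const (
	Pending   Phase = "Pending"
	Deploying Phase = "Deploying"
	Ready     Phase = "Ready"
	Failed    Phase = "Failed"
)

// Observed is a hypothetical summary of what the reconciler saw.
type Observed struct {
	DeploymentExists bool
	DesiredReplicas  int32
	ReadyReplicas    int32
	PodError         string // e.g. "ImagePullBackOff"; empty if none
}

// derivePhase maps observations to a phase and a human-readable message.
func derivePhase(o Observed) (Phase, string) {
	switch {
	case o.PodError != "":
		return Failed, "pod error: " + o.PodError
	case !o.DeploymentExists:
		return Pending, "waiting for Deployment to be created"
	case o.ReadyReplicas < o.DesiredReplicas:
		return Deploying, fmt.Sprintf("%d/%d replicas ready", o.ReadyReplicas, o.DesiredReplicas)
	default:
		return Ready, "all replicas ready"
	}
}

func main() {
	phase, msg := derivePhase(Observed{
		DeploymentExists: true,
		DesiredReplicas:  1,
		PodError:         "ImagePullBackOff",
	})
	fmt.Printf("%s: %s\n", phase, msg) // Failed: pod error: ImagePullBackOff
}
```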

The Gateway: Why Not Just a Reverse Proxy

I considered using a raw httputil.ReverseProxy and decided against it. The gateway needs to:

1. Validate requests before forwarding (file size limits, format checks)
2. Enrich responses with metadata (processing time, request ID)
3. Return structured errors when the backend fails (not raw 502 HTML)
4. Export Prometheus metrics that are meaningful for inference workloads
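Items 1 and 3 from that list might look roughly like this. It is a minimal sketch: the 25 MiB limit, the accepted extensions, the `apiError` field names, and the `invalid_request` code are all assumptions chosen for illustration, not the gateway's real contract.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"path/filepath"
	"strings"
)

const maxUploadBytes = 25 << 20 // assumed limit: 25 MiB

// allowedExt lists the audio formats this sketch accepts (assumed).
var allowedExt = map[string]bool{".wav": true, ".mp3": true, ".flac": true}

// apiError is a hypothetical JSON error body.
type apiError struct {
	Code      string `json:"code"`
	Message   string `json:"message"`
	RequestID string `json:"request_id"`
}

// validateUpload checks size and extension before anything is forwarded.
func validateUpload(filename string, size int64) error {
	if size <= 0 || size > maxUploadBytes {
		return fmt.Errorf("file size %d outside allowed range (1..%d bytes)", size, maxUploadBytes)
	}
	ext := strings.ToLower(filepath.Ext(filename))
	if !allowedExt[ext] {
		return fmt.Errorf("unsupported audio format %q", ext)
	}
	return nil
}

// writeError sends a structured JSON error instead of a raw HTML error page.
func writeError(w http.ResponseWriter, status int, code, msg, requestID string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(apiError{Code: code, Message: msg, RequestID: requestID})
}

func main() {
	rec := httptest.NewRecorder()
	if err := validateUpload("notes.txt", 1024); err != nil {
		writeError(rec, http.StatusBadRequest, "invalid_request", err.Error(), "req-123")
	}
	fmt.Print(rec.Code, " ", rec.Body.String())
}
```

Rejecting bad input at the gateway keeps obviously invalid requests away from the model server, and carrying the request ID in the error body gives the client something to quote when reporting a failure.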

The middleware chain is: request ID injection -> structured logging -> Prometheus metrics -> panic recovery -> handler. Every log line includes the request ID, so when a user reports “my request failed,” I can grep one string and find the complete request lifecycle.
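A standard-library sketch of that ordering is below. The Prometheus step is left out to avoid an external dependency, and the `X-Request-ID` header name, the log field names, and the helper names (`chain`, `demoHandler`, `run`) are assumptions for illustration rather than the gateway's actual code.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"os"
	"time"
)

type ctxKey struct{}

// requestID reuses an incoming X-Request-ID or generates one, stores it
// in the request context, and echoes it on the response.
func requestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get("X-Request-ID")
		if id == "" {
			b := make([]byte, 8)
			_, _ = rand.Read(b)
			id = hex.EncodeToString(b)
		}
		w.Header().Set("X-Request-ID", id)
		next.ServeHTTP(w, r.WithContext(context.WithValue(r.Context(), ctxKey{}, id)))
	})
}

// logging emits one structured log line per request, tagged with the request ID.
func logging(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		id, _ := r.Context().Value(ctxKey{}).(string)
		slog.Info("request", "request_id", id, "method", r.Method,
			"path", r.URL.Path, "duration_ms", time.Since(start).Milliseconds())
	})
}

// recovery turns a handler panic into a JSON 500 response.
func recovery(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if rec := recover(); rec != nil {
				w.Header().Set("Content-Type", "application/json")
				w.WriteHeader(http.StatusInternalServerError)
				w.Write([]byte(`{"code":"internal_error"}`))
			}
		}()
		next.ServeHTTP(w, r)
	})
}

// chain wraps h so that the first middleware listed runs outermost.
func chain(h http.Handler, mws ...func(http.Handler) http.Handler) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// demoHandler panics on /panic and returns 200 otherwise.
func demoHandler(w http.ResponseWriter, r *http.Request) {
	if r.URL.Path == "/panic" {
		panic("boom")
	}
	w.Write([]byte("ok"))
}

func newHandler() http.Handler {
	return chain(http.HandlerFunc(demoHandler), requestID, logging, recovery)
}

// run sends one GET through h and returns the status and echoed request ID.
func run(h http.Handler, path, id string) (int, string) {
	req := httptest.NewRequest("GET", path, nil)
	if id != "" {
		req.Header.Set("X-Request-ID", id)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Header().Get("X-Request-ID")
}

func main() {
	slog.SetDefault(slog.New(slog.NewJSONHandler(os.Stdout, nil)))
	code, id := run(newHandler(), "/panic", "req-abc")
	slog.Info("result", "status", code, "request_id", id)
}
```

Because request ID injection is outermost, the ID is already on the response and in the context by the time logging and recovery run, so even a panicking request produces a log line that can be found by its ID.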

I used the improved ServeMux from Go 1.22 for routing. It supports method-based routing natively now, so there is no need for chi or gorilla/mux, and the HTTP layer has zero external dependencies.

Evaluation: Catching Regressions Automatically

The eval harness is the piece that turns “I deployed a model” into “I have a reliable system.” It loads a dataset (a CSV manifest mapping audio files to ground-truth transcripts), sends each sample through the SDK, computes WER using jiwer, and fails if WER exceeds a threshold.

vox-eval run eval/datasets/librispeech-mini --threshold 0.25

Exit code 0 means the model passed. Exit code 1 means WER exceeded the threshold — block the deployment. The JSON report includes every sample’s ground truth and prediction so you can debug exactly which utterances went wrong.

WER computation includes standard normalization (case folding, punctuation removal, contraction expansion), so “don’t” matches “do not” and “Hello,” matches “hello”. Without this, you get false failures that erode trust in the eval pipeline.
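The harness itself relies on jiwer from Python. Purely to show what the normalization and the metric amount to, here is a small stand-alone sketch in Go: WER is computed as the word-level edit distance divided by the number of reference words. The three-entry contraction table and the function names are assumptions for illustration, and this is not how jiwer is implemented.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// contractions is a tiny illustrative table; a real list would be longer.
var contractions = map[string]string{
	"don't": "do not",
	"can't": "cannot",
	"it's":  "it is",
}

// normalize lowercases the text, trims punctuation from word edges,
// and expands the contractions listed above.
func normalize(s string) []string {
	s = strings.ReplaceAll(strings.ToLower(s), "’", "'")
	var words []string
	for _, w := range strings.Fields(s) {
		w = strings.TrimFunc(w, unicode.IsPunct)
		if exp, ok := contractions[w]; ok {
			words = append(words, strings.Fields(exp)...)
		} else if w != "" {
			words = append(words, w)
		}
	}
	return words
}

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// wer returns the word-level edit distance between the normalized
// reference and hypothesis, divided by the number of reference words.
func wer(ref, hyp string) float64 {
	r, h := normalize(ref), normalize(hyp)
	if len(r) == 0 {
		if len(h) == 0 {
			return 0
		}
		return 1
	}
	prev := make([]int, len(h)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(r); i++ {
		cur := make([]int, len(h)+1)
		cur[0] = i
		for j := 1; j <= len(h); j++ {
			cost := 1
			if r[i-1] == h[j-1] {
				cost = 0
			}
			cur[j] = min3(prev[j]+1, cur[j-1]+1, prev[j-1]+cost)
		}
		prev = cur
	}
	return float64(prev[len(h)]) / float64(len(r))
}

func main() {
	fmt.Println(wer("Don't stop, please.", "do not stop please")) // 0
	fmt.Println(wer("hello world", "hello word"))                 // 0.5
}
```

Without the normalization step, the first pair would count as several word errors even though the transcription is effectively correct.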

The Python SDK: Bridging Go and Python

The platform boundary is clean: Go services talk to Python services over HTTP. They share API contracts, never code. The Python SDK wraps the gateway’s HTTP API with typed responses via Pydantic:

from voxplatform import VoxClient

client = VoxClient("http://localhost:8080")
result = client.transcribe("meeting.wav")
print(result.text)
print(f"Took {result.processing_time:.1f}s")

Errors carry the request ID through, so debugging is possible across the language boundary:

from voxplatform import VoxError  # assumes VoxError is exported at the package root

try:
    result = client.transcribe("audio.wav")
except VoxError as e:
    print(f"Request ID: {e.request_id}")  # Grep this in gateway logs

The SDK has both sync and async APIs — the eval harness uses async for concurrent test execution.

What’s Next

The operator is designed but not yet running on the cluster. The next steps are deploying the full stack end-to-end (monitoring + gateway + operator), building the first Grafana dashboard, and adding InferencePipeline — a second CRD that chains multiple models together (audio -> STT -> diarization -> summarization).

The repo is structured as a monorepo: Go operator and gateway, Python SDK and eval harness, Terraform infrastructure, and Helm charts all in one place. The language boundary is explicit — Go for platform services, Python for ML tooling and evaluation.