A tool that analyzes the load on a Linux instance and recommends the right instance size based on statistical analysis of the collected metrics.
Problem
Cloud infrastructure costs are one of the biggest operational expenses for any engineering team, yet most instances are sized based on gut feeling or worst-case estimates rather than actual observed behavior. The typical workflow is: pick an instance size that feels safe, deploy it, and never revisit it — leading to fleets that are chronically over-provisioned by 40–60%. The tooling gap here is specific: Prometheus and node_exporter give you raw metrics, cAdvisor gives you container-level visibility, but nothing takes that data and tells you what size your machine should actually be. You're left exporting dashboards, eyeballing percentiles, and making manual judgment calls across dozens of instances. Instvisor was built to close that gap — a lightweight agent that observes real workload patterns over time and produces concrete, cloud-provider-aware instance sizing recommendations without any external dependencies.
Architecture
Instvisor follows a two-binary design inspired by cAdvisor's philosophy of running close to the kernel with minimal overhead.
- The agent (instvisor-agent) runs as a long-lived systemd service or Docker container. It reads directly from Linux kernel interfaces (/proc/stat, /proc/meminfo, /proc/diskstats, /proc/net/dev), collecting CPU, memory, disk I/O, and network metrics on a configurable interval (default 30s). Samples are written to a local SQLite database with automatic retention management, keeping the storage footprint small and the deployment dependency-free.
- The analyzer (instvisor-analyze) is a CLI tool that reads from the same SQLite store and runs statistical analysis across the collected window. It computes P50/P90/P95/P99 percentiles per resource, applies workload pattern detection (steady-state vs. bursty vs. scheduled batch), adds configurable headroom, and maps the result to real instance types from AWS, OTC, and Azure flavor catalogs.
    ┌──────────────────────────────┐
    │          Linux Host          │
    │                              │
    │  /proc, /sys ──► agent       │
    │                    │         │
    │                 SQLite       │
    │                    │         │
    │               instvisor-     │
    │                 analyze      │
    │                    │         │
    └────────────────────┼─────────┘
                         ▼
               Instance Sizing Report
        (stdout / JSON / future: Prometheus)
The agent is designed to run at < 50 MB RAM and < 1% CPU so it doesn't influence the workload it's measuring.
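The percentile step the analyzer performs can be illustrated with a minimal sketch. It assumes the nearest-rank definition of a percentile and works on an in-memory slice rather than the SQLite store; the function name `percentile` and the sample values are hypothetical and not taken from instvisor's code.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile p (0 < p <= 100) of samples.
// It returns 0 for an empty slice and does not modify the input.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	// Nearest rank: the smallest value with at least p% of samples at or below it.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical CPU utilisation samples, in percent.
	cpu := []float64{12, 15, 11, 80, 14, 13, 16, 12, 95, 14}
	for _, p := range []float64{50, 90, 95, 99} {
		fmt.Printf("P%.0f = %.1f%%\n", p, percentile(cpu, p))
	}
}
```

With only ten samples the upper percentiles collapse onto the largest values, which is one reason a longer observation window matters before trusting P95 or P99.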
Design Decisions
- SQLite over an embedded time-series DB. I considered using BoltDB or an in-process Prometheus-style TSDB, but SQLite gave me richer querying (percentile calculations in SQL), zero extra dependencies, and battle-tested durability. For the data volumes instvisor generates (one row per metric per interval), SQLite handles years of data without any performance concerns.
- Two binaries instead of one. Separating the always-on agent from the on-demand analyzer keeps the agent minimal and auditable. Users who want to run it in read-only environments can deploy just the analyzer pointed at an existing data file. It also makes the mental model clearer — collection and analysis are distinct responsibilities.
- Reading from /proc directly instead of scraping node_exporter. The original design queried a local Prometheus/node_exporter endpoint, but that creates an external dependency that breaks the "run anywhere" goal. Talking directly to kernel interfaces keeps instvisor fully self-contained and gives lower-latency, higher-fidelity samples.
- Workload pattern classification before recommending. A flat P95 recommendation is misleading for bursty workloads. Instvisor first classifies the pattern (steady-state, bursty, scheduled) and adjusts the headroom formula accordingly — bursty workloads get sized closer to their P99, while steady-state workloads can be sized tighter to P90.
- Cloud flavor catalogs embedded in the binary. Rather than making API calls to cloud providers at analysis time, the instance type catalogs (AWS, OTC, Azure) are compiled into the binary. This keeps recommendations reproducible and offline-capable, at the cost of needing periodic catalog updates in releases.
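The classification and catalog decisions above can be sketched together. This is an illustration under stated assumptions, not instvisor's actual logic: the P99/P50 ratio test, the `burstRatio` threshold, the type and function names, and the four-entry catalog are all hypothetical, and scheduled-batch detection is left out for brevity.

```go
package main

import "fmt"

type Pattern string

const (
	Steady Pattern = "steady-state"
	Bursty Pattern = "bursty"
)

// Percentiles holds observed demand in absolute units (vCPUs or GiB in use).
type Percentiles struct{ P50, P90, P99 float64 }

// burstRatio is an assumed threshold: a P99 at least 3x the median counts as bursty.
const burstRatio = 3.0

// classify is a hypothetical heuristic based only on the spread between P50 and P99.
func classify(p Percentiles) Pattern {
	if p.P99 > 0 && p.P99 >= burstRatio*p.P50 {
		return Bursty
	}
	return Steady
}

// sizingTarget sizes bursty workloads against P99 and steady ones against P90,
// then adds fractional headroom (0.2 means 20%).
func sizingTarget(p Percentiles, headroom float64) float64 {
	base := p.P90
	if classify(p) == Bursty {
		base = p.P99
	}
	return base * (1 + headroom)
}

// Flavor is one entry of a catalog compiled into the binary.
type Flavor struct {
	Provider, Name string
	VCPU, MemGiB   float64
}

// catalog is an illustrative stand-in, ordered from smallest to largest.
var catalog = []Flavor{
	{"aws", "t3.medium", 2, 4},
	{"aws", "t3.large", 2, 8},
	{"aws", "t3.xlarge", 4, 16},
	{"aws", "t3.2xlarge", 8, 32},
}

// recommend returns the first (smallest) flavor that covers both requirements.
func recommend(cpuNeed, memNeed float64) (Flavor, bool) {
	for _, f := range catalog {
		if f.VCPU >= cpuNeed && f.MemGiB >= memNeed {
			return f, true
		}
	}
	return Flavor{}, false
}

func main() {
	cpu := Percentiles{P50: 0.4, P90: 0.9, P99: 2.6} // spiky CPU demand
	mem := Percentiles{P50: 5.0, P90: 5.5, P99: 6.0} // flat memory demand

	f, ok := recommend(sizingTarget(cpu, 0.2), sizingTarget(mem, 0.2))
	fmt.Println(classify(cpu), classify(mem), f.Name, ok)
}
```

Because the catalog is a plain Go value, the same inputs always produce the same recommendation without network access, which is the reproducibility trade-off described in the last bullet.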
What I Learned
- Working close to the Linux kernel taught me how much nuance lives in /proc. CPU usage isn't just user+system — you have to account for iowait, steal time, and softirq separately to avoid misleading recommendations for I/O-bound vs. CPU-bound workloads. Memory is trickier still: MemAvailable is the right field to use (not MemFree), because the kernel reclaims page cache under pressure and MemFree paints a falsely pessimistic picture.
- Building the workload pattern detector showed me that statistical classification needs confidence thresholds. Labeling a workload as "bursty" when you only have 6 hours of data leads to over-provisioned recommendations. The system now requires a minimum observation window before high-confidence classifications are emitted, and displays the confidence score alongside every recommendation.
- On the Go side, I learned that using SQLite through a cgo-based driver complicates cross-compilation and build reproducibility. Switching to a pure-Go SQLite driver resolved the cross-compilation issues in the release pipeline without a meaningful performance cost for this use case.
- Designing for open-source contribution from the start — structured project layout, clear interfaces between collector/storage/analysis packages, and a contributing guide — made me think much more carefully about where the package boundaries should be and how to write code that someone else can extend without understanding the whole system.
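The /proc points in the first bullet can be made concrete with a small sketch. It parses literal sample text instead of opening /proc so that it runs anywhere; the helper names and the sample counter values are hypothetical, and treating "busy" as everything except idle, iowait, and steal is one possible accounting choice rather than a description of instvisor's collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuTimes holds the first eight counters of the aggregate "cpu" line in /proc/stat.
type cpuTimes struct {
	user, nice, system, idle, iowait, irq, softirq, steal uint64
}

func (c cpuTimes) total() uint64 {
	return c.user + c.nice + c.system + c.idle + c.iowait + c.irq + c.softirq + c.steal
}

// parseCPULine parses a line such as "cpu  1000 0 500 8000 300 0 100 100 0 0".
func parseCPULine(line string) (cpuTimes, error) {
	f := strings.Fields(line)
	if len(f) < 9 || f[0] != "cpu" {
		return cpuTimes{}, fmt.Errorf("unexpected cpu line: %q", line)
	}
	var v [8]uint64
	for i := range v {
		n, err := strconv.ParseUint(f[i+1], 10, 64)
		if err != nil {
			return cpuTimes{}, err
		}
		v[i] = n
	}
	return cpuTimes{v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7]}, nil
}

// delta returns cur-prev as a float, clamped at zero if a counter went backwards.
func delta(cur, prev uint64) float64 {
	if cur < prev {
		return 0
	}
	return float64(cur - prev)
}

// cpuShares reports busy, iowait, and steal as fractions of the CPU time
// elapsed between two samples, keeping iowait and steal out of "busy".
func cpuShares(prev, cur cpuTimes) (busy, iowait, steal float64) {
	dt := delta(cur.total(), prev.total())
	if dt == 0 {
		return 0, 0, 0
	}
	dIdle := delta(cur.idle, prev.idle)
	dIOWait := delta(cur.iowait, prev.iowait)
	dSteal := delta(cur.steal, prev.steal)
	return (dt - dIdle - dIOWait - dSteal) / dt, dIOWait / dt, dSteal / dt
}

// memAvailableKiB extracts MemAvailable (in kiB) from /proc/meminfo contents.
func memAvailableKiB(meminfo string) (uint64, error) {
	for _, line := range strings.Split(meminfo, "\n") {
		f := strings.Fields(line)
		if len(f) >= 2 && f[0] == "MemAvailable:" {
			return strconv.ParseUint(f[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("MemAvailable not found")
}

func main() {
	prev, _ := parseCPULine("cpu  1000 0 500 8000 300 0 100 100 0 0")
	cur, _ := parseCPULine("cpu  1400 0 700 8500 350 0 150 150 0 0")
	busy, iowait, steal := cpuShares(prev, cur)
	fmt.Printf("busy=%.2f iowait=%.2f steal=%.2f\n", busy, iowait, steal)

	avail, _ := memAvailableKiB("MemTotal: 8000000 kB\nMemFree: 500000 kB\nMemAvailable: 6000000 kB\n")
	fmt.Println("MemAvailable kiB:", avail)
}
```

In the sample meminfo text, MemFree is far lower than MemAvailable, which shows how sizing from MemFree would overstate memory pressure.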
Key Numbers
| Metric | Value |
|---|---|
| Agent memory footprint | < 50 MB RSS |
| Agent CPU overhead | < 1% on a 2-core machine |
| Collection interval (default) | 30 seconds |
| SQLite storage per day | ~2 MB |
| Default analysis window | 7 days |
| Supported cloud providers | AWS, OTC, Azure |
| Percentiles computed | P50, P90, P95, P99 |
| Potential cloud cost reduction | 30–70% |
| Binary size | ~8 MB |