Lean, mean, poor man's dream

February 21, 2026
Steroid Induced Hyper Productivity 2026 · Infrastructure · Docker · Self-hosting · DevOps

I run about 20 services across 7 cloud VPS nodes, a dedicated mail server, local services including some Raspberry Pis, and an off-site machine connected via VPN. The whole thing costs roughly 50 EUR a month and is managed almost entirely by an AI agent that has never forgotten a lesson it learned. A lean, mean, poor man’s dream. This is what it looks like, why, what didn’t work, lessons learned, and what’s still rough around the edges.

The premise

I like owning my infrastructure. Partly ideological — data sovereignty, open source, no vendor lock-in — but mostly because I genuinely enjoy the control and the learning that comes with it. When a service misbehaves, I want to know why, not open a support ticket and wait. When I need a new capability, I want to spin it up as quickly as possible, give it a go, and rip it out again if it’s not what I’m looking for. Dokploy is ideal for that — it has a huge library of one-click apps, but it’s just as simple to upload a Compose file and deploy something entirely bespoke.

Before all this, the setup was simpler: Coolify on a single 80 EUR/month Hetzner machine running half the services. It worked, but I kept bumping into limits. Coolify is a solid project, but a few things didn’t sit right with me:

| | Coolify | Dokploy |
| --- | --- | --- |
| Compose files | Injects a helper image on every deploy, rewrites config | You add Traefik labels and networks yourself — predictable, transparent |
| Proxy | Traefik + Caddy (two options, Traefik better supported) | Traefik only — one proxy, no ambiguity |
| Multi-node | Swarm support is experimental | Native Docker Swarm with automatic mesh networking |
| Orchestration | Talks to Docker Engine API directly, manual load balancing | Uses Docker Swarm stack deploys, automatic service discovery |
| Network isolation | Flat architecture — databases join destination network, services can see each other across projects | Isolated deployments per compose stack, better default separation |
| Build queue | Built-in with configurable concurrency (but frequently broken in practice) | Single sequential queue — which is why I use external CI/CD instead |
| API | Full REST API (Laravel/Sanctum) | Full REST API (OpenAPI/JWT) — both automation-ready |

The redesign itself was done almost entirely with Claude Code — under my guidance and supervision, but automated. Multiple review passes, a lot of iteration, designing rules and guard rails for the agent to match the intent of the project. The architecture, the wrapper scripts, the config-as-code workflow, the deployment pipeline — all of it came out of that process. Keeping things tight was the hard part. The agent is happy to over-engineer if you let it. Like with every agent workflow, the output quality is directly related to the tightness of your specs and requirements — and to the existing knowledge you bring. Gaps can be filled, and I’ll admit I learned a lot in the process. One of the most important parts is feeding current documentation as a starting point for exploration and treating everything the agent produces critically. I’m happy with the result and especially the timeframe it took. The multiplication effect is real, and it applies to infrastructure work just as much as it does to code.

Requirements

  • Open source only — no proprietary lock-in, no “free tier that’ll cost you later.” Dual-licensed projects are fine with a pragmatic closing of one eye on a case-by-case basis
  • Production-grade — zero-downtime deployments, automatic rollback, health checks, the works
  • Multi-tenant capable — run client SaaS apps with proper isolation alongside personal projects
  • Centralized secret management — no secrets in repos, no secrets in environment files, no secrets in anyone’s context
  • Full observability — centralized logging, uptime monitoring, server metrics
  • Automated CI/CD — git push to production with no manual steps
  • Config-as-code — every piece of infrastructure state tracked in git, reproducible from scratch
  • Auditable — every change documented with what, why, and what it affected
  • Resource-efficient — everything runs on 4GB nodes, so every service earns its RAM budget
  • As small an operation as possible — easy to understand, easy to work through together with an agent

Every service in the stack was chosen against these constraints. When your nodes have 4GB of RAM each, you develop a very healthy respect for resource budgets. Graylog wanting 2.4GB just to start up? That’s over half a node gone for logging alone. VictoriaLogs doing the same job in under a gigabyte? That’s the kind of trade-off that matters.

On the elephant in the room: Yes, the infrastructure is operated by an AI agent from Anthropic — a proprietary, closed-source, US-based cloud service. The irony of building an open-source, data-sovereign stack and then handing operation to a commercial AI is not lost on me. I’m pragmatic — I want to see how far this goes with the best available tools. And out of the big three, Anthropic is currently a compromise I can live with.

The stack at a glance

The infrastructure lives on Hetzner Cloud, in Nuremberg. Hetzner’s pricing is hard to beat for what you get. CPX22 instances (2 vCPU, 4GB RAM) at around 6-7 EUR/month each form the backbone. All nodes sit on a private 10.0.0.0/16 network via Hetzner private network — inter-service communication never touches the public internet.

On top of that: Docker Swarm for orchestration, Dokploy as the management layer, and Traefik for ingress and TLS. That’s the core. Everything else is services running on this foundation.

Node topology

Managed by Dokploy

Docker Swarm cluster

manager (CPX22)
  • Dokploy
  • Traefik ingress
swarm-01 (CPX22)
  • Stateless app containers
swarm-02 (CPX22)
  • Stateless app containers

Static nodes (outside Swarm)

node-01 (CPX22)
  • Plausible Analytics
  • Zot Registry
  • Paperless-ngx
  • Beszel
  • parsedmarc
  • Listmonk
node-02 (CX23)
  • PostgreSQL + PostGIS
  • Meilisearch
node-03 (CPX22)
  • VictoriaLogs + Fluent Bit + Grafana
  • OpenBao

Standalone

mail (CPX31)
  • Mailcow
blackbox (LAN)
  • Plex
  • *arr stack
  • Uptime Kuma
  • Off-site S3 backup
raspberry-pihole (LAN)
  • Pi-hole DNS
raspberry-wireguard (LAN)
  • WireGuard VPN

The placement strategy is simple: stateless workloads go into Docker Swarm for rolling updates and multi-node scheduling. Stateful services (databases, search, logging, secrets) live on static nodes outside Swarm.

Scaling

Scaling has three levers: move services between existing nodes, provision new nodes, or upgrade existing ones to a bigger instance type. It’s a balancing act — performance is directly tied to price — but for now there’s plenty of headroom. The one thing worth knowing is that the low end of Hetzner’s VPS lineup is genuinely fast to hit its ceiling. These are cheap machines, and they behave like cheap machines under pressure. Deliberate swap configuration is in place to prevent Docker from OOM-killing containers — swapping is allowed to keep services up rather than crashing them, and it’s logged and visible in monitoring so you know when a node needs attention or an upgrade.
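The “know when a node needs attention” side of that can be sketched as a tiny check. This is my illustration, not the production monitoring: the meminfo parsing and any threshold you’d alert on are assumptions.

```shell
set -eu

# Percentage of swap in use, given /proc/meminfo-style input on stdin.
# In practice this number would come from the metrics agent on each node.
swap_used_pct() {
  awk '/^SwapTotal:/ { t = $2 } /^SwapFree:/ { f = $2 }
       END { if (t == 0) print 0; else printf "%d\n", (t - f) * 100 / t }'
}

# Example: a node with 2 GiB of swap, 1.5 GiB in use -> 75
printf 'SwapTotal: 2097152 kB\nSwapFree: 524288 kB\n' | swap_used_pct
```

A node sitting at high swap usage for hours is the signal to move a service off it or bump the instance type.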

Why Docker Swarm, not Kubernetes

The answer is boring: Swarm does everything I need with a fraction of the complexity. I’ve used Docker in various forms over the years — Dokku, standalone Compose, Coolify, CapRover — all in production, in various capacities, for small to medium-sized projects. There was always more than enough headroom for scaling. The abyss to cross over to Kubernetes simply never presented itself, and if it does, I’m probably not the right person to make that jump. I feel comfortable right here, and I consider that a strength, not a limitation.

I need rolling deploys with zero downtime. Swarm does that with start-first update ordering — the new container comes up and passes health checks before the old one gets removed. I need automatic rollback on failure. Swarm does that natively. I need multi-node scheduling so my worker nodes share the load. Done.
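In a Swarm stack file, that combination looks roughly like this (service name, image, health endpoint, and timings are illustrative):

```yaml
services:
  app:
    image: registry.example.com/app:1.2.3
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
      retries: 3
    deploy:
      replicas: 2
      update_config:
        order: start-first        # new task must pass health checks before the old one stops
        failure_action: rollback  # automatic rollback on failed updates
        delay: 5s
      rollback_config:
        order: stop-first
```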

What I don’t need: a control plane that requires 3 dedicated nodes, a PhD in YAML to configure networking, and a monitoring stack just to keep the orchestrator itself healthy. For a small operation running 20-ish services, Dokploy’s simplicity-to-capability ratio is exactly right. It’s not fashionable and that’s part of why it works.

The biggest caveat with Swarm isn’t technical — it’s perception. It’s been pronounced dead more times than western democracy, yet here it is, quietly running production workloads. The self-hosted PaaS wave (Dokploy, Coolify, CapRover) is building on top of it, which gives me hope that it’ll get the recognition it deserves. Swarm is open source as part of Moby, but Docker Inc. has been running it in maintenance mode for years — no public roadmap, long-standing issues sitting open, and new features are rare. Mirantis committed to Swarm support through 2030, but that’s enterprise support, not active development. I’m hoping the long-open issues finally get fixed and someone — Docker, Mirantis, the community — decides to actually invest in it again rather than letting it coast on “not deprecated.”

Dokploy sits on top as a management layer — web UI plus API for deploying services, managing environment variables, and handling Docker Compose stacks. It’s the open-source alternative to Vercel or Railway for self-hosting.

The operating model: everything is an API call

The entire infrastructure is built around one principle: every external service — Dokploy, Cloudflare, Hetzner, Mailcow, the secret vault — is accessed exclusively through wrapper scripts that talk to its API. No manual clicking in UIs, no raw curl commands, no SSH-ing into a node to change something by hand. Every interaction is a script call that can be repeated, logged, and — critically — handed to an AI agent.

There are wrapper scripts for every external API, leveraging existing CLI tools but wrapping them in a consistent interface: Dokploy for service management, Cloudflare for DNS, Hetzner for cloud resources, Mailcow for the mail server, and a multi-node SSH tool that can target individual nodes or groups. On top of that, an orchestration script syncs secrets from OpenBao to Dokploy’s environment variables, and two provisioning scripts handle idempotent node setup (SSH hardening, Docker, firewall, swarm join) and Fluent Bit deployment for log shipping.

Each script handles authentication internally — it loads secrets from OpenBao (or falls back to a local .env) through a shared library so that credentials never leak into command arguments, logs, or agent context. Every script also includes usage examples in its header, so the AI agent can pick up any tool and start working with minimal context — no need to load the full documentation into every session.
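As a sketch of that pattern — the script name, secrets-library path, and dry-run behaviour are my illustration, not the actual tooling:

```shell
#!/usr/bin/env sh
# cf-dns.sh: hypothetical wrapper in the pattern described above.
# Usage examples live in the header so an agent can pick up the tool cold:
#   cf-dns.sh list <zone-id>
set -eu

usage() {
  echo "Usage: cf-dns.sh list <zone-id>"
}

# Credentials come from a shared secrets library (OpenBao, with a local
# .env fallback), never from command-line arguments, so they stay out of
# logs and agent context.
load_token() {
  lib="${SECRETS_LIB:-lib/secrets.sh}"
  if [ -f "$lib" ]; then . "$lib"; fi   # expected to set CF_API_TOKEN
  CF_API_TOKEN="${CF_API_TOKEN:-}"
}

list_records() {
  url="https://api.cloudflare.com/client/v4/zones/$1/dns_records"
  if [ -z "$CF_API_TOKEN" ]; then
    echo "GET $url"                     # dry run when no token is loaded
  else
    curl -fsS -H "Authorization: Bearer $CF_API_TOKEN" "$url"
  fi
}

load_token
if [ "$#" -ge 2 ] && [ "$1" = "list" ]; then
  list_records "$2"
fi
```

The point is the shape, not the specifics: consistent interface, internal auth, and enough in the header for an agent to use the tool without loading full documentation.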

Before anything gets pushed to a remote API, it’s validated locally. YAML gets linted, JSON gets validated, Docker Compose files get checked, shell scripts run through ShellCheck. If the config is malformed, the operation fails before it ever touches production. This sounds obvious but it’s the kind of thing you only appreciate after you’ve spent an evening debugging why Traefik stopped routing because of a stray character in a YAML file.
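A sketch of that validation step. The linter mapping mirrors the text (yamllint, jq, compose config, ShellCheck); the dispatcher itself is illustrative:

```shell
set -eu

# Pick the validator that must pass before a file gets near a remote API.
linter_for() {
  case "$1" in
    *compose*.yml|*compose*.yaml) echo "docker compose -f $1 config -q" ;;
    *.yml|*.yaml)                 echo "yamllint $1" ;;
    *.json)                       echo "jq -e . $1" ;;
    *.sh)                         echo "shellcheck $1" ;;
    *)                            echo "none" ;;
  esac
}

validate() {
  cmd="$(linter_for "$1")"
  if [ "$cmd" = "none" ]; then return 0; fi
  $cmd   # fail closed: a malformed config never reaches production
}
```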

This isn’t just a convenience layer. It’s the foundation of the entire automation loop:

Operations loop

1. Query state: wrapper script reads current config via API
2. Lint payload: YAML, JSON, Compose, ShellCheck
3. Make change: wrapper script mutates state via API
4. Verify: confirm the change took effect
5. Export config: current state dumped to local JSON/YAML
6. Git commit: config-as-code stays in sync
7. Record history: what changed, why, and what it affected

Every operation follows this cycle. DNS change? cloudflare-api.sh makes the change, exports the updated zone to config/cloudflare-dns.json, commits. New service deployed? dokploy-api.sh deploys it, topology gets re-exported, history entry gets written. Secret rotated? openbao-dokploy-sync.sh pushes the new value, mapping file gets updated.

This also means there’s an audit trail by default. Every change produces a git commit (config export) and a history entry (human-readable markdown documenting the change, the reasoning, and the affected services). 41 history entries so far, each one traceable back to a specific task.

Every piece of remote state — whether it lives on a SaaS provider like Cloudflare or on a self-hosted service like Dokploy, VictoriaLogs, or OpenBao — is exported to local files and tracked in git. If there’s drift between remote state and local files, the task isn’t done. Disaster recovery becomes “re-apply the config.” Whether I’d actually trust that on day one of a disaster is another question, but knowing the state is captured is already worth a lot.

Why not Terraform or Ansible? The benefits are real — declarative state, drift detection, idempotent convergence — but the time investment to fully internalize those tools for a small operation is hard to justify when shell scripts wrapping existing CLIs cover the same ground. This approach can always be extended, and if agent errors start pointing at the limits of the current model, that’s a useful signal that the concept needs to evolve — and honestly, I’m looking forward to exploring that when the time comes.

Without this API-first design, the rest of the stack is just a bunch of servers. With it, it’s an automatable system.

What’s running

I’m not going to list everything, but here’s the gist:

Client applications: Two full-stack apps (backend + frontend + database + OAuth) deployed as Swarm stacks with staging/production separation, a Listmonk newsletter platform, and 4 legacy Next.js apps. Each has its own isolated Docker network.

Observability: Beszel for server monitoring with Docker integration, VictoriaLogs + Grafana for centralized logging with Fluent Bit shipping logs from every node, Uptime Kuma running off-site for external uptime monitoring. Plausible Analytics for privacy-friendly web analytics.

Data services: PostgreSQL with PostGIS for geospatial queries, Meilisearch for full-text search, a private OCI container registry (Zot), Paperless-ngx for document management with AI classification. Databases are backed up via tiredofit/db-backup sidecars to local volumes, then pushed to Hetzner S3-compatible object storage by offen/docker-volume-backup.

Security: OpenBao for secret management (it’s the open-source fork of HashiCorp Vault), CrowdSec for collaborative threat detection and blocking across all nodes, and a Mailcow mail server with DMARC reporting via parsedmarc.

Personal: This blog and a hub portal for the infrastructure itself.

The pipeline

The pipeline is a reusable template: any git-based project with a Dockerfile can be deployed as a Compose stack in Swarm mode. Add the workflow, point it at the registry and Dokploy, done.
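A hedged sketch of such a workflow — the registry host, secret names, and the Dokploy webhook variable are placeholders, not the real values:

```yaml
name: deploy
on:
  push:
    branches: [main]   # -> staging
    tags: ["v*"]       # -> production
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: zot.example.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASS }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: zot.example.com/app:${{ github.ref_name }}
          cache-from: type=gha        # layer caching across runs
          cache-to: type=gha,mode=max
      - name: Trigger Dokploy deploy
        run: curl -fsS -X POST "${{ secrets.DOKPLOY_DEPLOY_HOOK }}"
```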

Deployment flow

1. git push: version tag → production, main → staging
2. GitHub Actions: lint, tests, multi-stage Docker build with layer caching
3. Zot Registry: private OCI registry
4. Dokploy API: triggers stack deploy
5. Rolling update: zero-downtime swap

A detail worth mentioning: I don’t use Dokploy’s built-in builder. It has a single-queue bottleneck — it can’t parallelize builds. GitHub Actions handles this — builds run in parallel and Docker layer caching just works out of the box. The workflow pushes images to my private Zot registry, then calls the Dokploy API to trigger deployment. Staging environments automatically scale to zero after 4 hours of inactivity — a cron job parses Traefik access logs and handles this without any manual intervention.
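The idle check behind the scale-to-zero job can be sketched like this. The simplified log format (`<epoch> <router>` per line) stands in for the real Traefik access-log parsing, and the threshold matches the text:

```shell
set -eu
IDLE_SECONDS=$((4 * 3600))   # scale to zero after 4h without a request

# Last request time for a router, from a simplified "<epoch> <router>" log.
last_hit() {
  awk -v r="$2" '$2 == r { t = $1 } END { print t + 0 }' "$1"
}

# Succeed if the router has been idle long enough to scale its stack down.
should_scale_down() {
  last="$(last_hit "$1" "$2")"
  [ $(( $3 - last )) -ge "$IDLE_SECONDS" ]
}
```

The real cron job would follow a successful check with a Dokploy API call to scale the staging stack to zero replicas.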

The very first iteration of the stack was built on Forgejo — self-hosted Git with built-in CI. Forgejo is excellent as a source repository, but the build pipeline story isn’t there yet. There’s a pragmatic limit to how much pipeline plumbing I’m willing to maintain, and Forgejo crossed it — for now.

Security: paranoia with a system

Infrastructure secrets live in OpenBao. The automation layer accesses them through a wrapper around the OpenBao CLI that prevents secret values from leaking into command arguments, logs, or agent context. On the Dokploy side, service secrets are managed as environment variables. A sync script bridges the two: it reads from OpenBao’s KV store and pushes values into Dokploy’s environment variable system.
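A minimal sketch of that bridge. `bao kv get` is the real OpenBao CLI; the mapping-file format and everything around it is my assumption:

```shell
set -eu

# Mapping file: one "<kv-path>:<field>:<ENV_NAME>" per line (assumed format).
fetch_secret() {
  bao kv get -field="$2" "$1"   # value goes to stdout, never into argv or logs
}

render_env() {
  while IFS=: read -r path field name; do
    [ -n "$name" ] || continue
    printf '%s=%s\n' "$name" "$(fetch_secret "$path" "$field")"
  done < "$1"
}
# The real sync script pushes render_env's output into Dokploy's
# environment-variable API instead of printing it.
```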

Basic hardening is handled by the provisioning script — SSH on a non-standard port, key-only auth, no passwords. On top of that:

  • Hetzner Cloud Firewalls with dedicated rulesets per node role
  • All inter-service communication over the private network
  • CrowdSec agents on every node reading Traefik and syslog, reporting to a central LAPI on the manager. The Traefik bouncer plugin blocks malicious IPs before they hit any service

The hard parts

These are the pitfalls we encountered or knew about going in. Most of them are essentially bugs that require workarounds — fixable on the vendor side, just not fixed yet.

Docker Swarm

Overlay network fragility

Docker Swarm’s overlay networking is the single biggest source of pain. After node outages, the overlay can break in ways that leave containers unable to communicate. The fix isn’t elegant: restart the Docker daemon on affected nodes, then cycle every service through scale 0 → scale 1 to clear stale VIP routing entries. Multiple GitHub issues document this problem, and they’ve been open for years.
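The recovery sequence above, as a sketch. Node and service names are placeholders, and `DRY_RUN=1` prints the commands instead of executing them:

```shell
set -eu
run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

recover_overlay() {
  node="$1"; svc="$2"
  run ssh "$node" systemctl restart docker   # restart the daemon on the node
  run docker service scale "$svc=0"          # cycle the service through 0 -> 1
  run docker service scale "$svc=1"          # to clear stale VIP routing entries
}
```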

This isn’t a dealbreaker, but Swarm’s networking layer is the weakest link. When it breaks, the symptoms are confusing — services appear healthy but can’t reach each other.

IPv6 advertising

The Swarm cluster was initialized with IPv4, but it still advertises IPv6 addresses. Currently this works, but any significant swarm disruption could trigger overlay failures through IPv6 routing confusion. The permanent fix requires reinitializing the entire swarm — downtime and redeployment of everything.

Single manager, single point of failure

There’s one manager node. No HA. If it goes down, Swarm orchestration stops. Running services keep running — they’re already on the workers — but no new deployments or scaling until the manager is back. Adding manager redundancy means 3 manager nodes minimum, which triples the cost of the management plane. For a small operation, I’ve accepted this trade-off. The static node services are completely unaffected since they don’t depend on Swarm at all.

The first deploy problem

Docker Swarm has had a bug since 2017 where --with-registry-auth doesn’t properly distribute registry credentials to worker nodes on the first deploy of a new image. The workaround: manually docker pull the image on each node before the first deploy. After that, updates work fine.

Dokploy & Traefik

File mount quirk

Learned this the hard way: Dokploy’s API requires both mountPath AND filePath parameters when creating file mounts. If you omit filePath, the content gets stored in Dokploy’s database but never actually written to disk. Docker then creates empty directories where your config files should be.

Traefik YAML escaping

Regex patterns with backslashes in YAML double-quoted strings cause Traefik’s file provider to fail silently. Not the single route — the entire file provider stops loading. All your carefully crafted middleware rules just vanish. The fix: use YAML single quotes or raw strings for anything containing backslashes.
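Concretely — a hypothetical middleware where only the quoting style is the point:

```yaml
http:
  middlewares:
    strip-api:
      replacePathRegex:
        # Single quotes: YAML passes the backslash through untouched.
        # In double quotes, "\d" is an invalid YAML escape, and the file
        # provider silently stops loading the entire file, not just this route.
        regex: '^/api/v\d+/(.*)'
        replacement: '/$1'
```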

Traefik doesn’t survive hard reboots

Dokploy runs Traefik as a standalone container with restart: always. Sounds fine, except after a hard reboot there’s a race condition between Docker and Dokploy — Traefik tries to start before Dokploy has set up the networks it needs. The fix is a cron job: @reboot sleep 30 && docker start dokploy-traefik.

The bottom line

Honestly, this was just fun to build. Taking a handful of cheap cloud servers and turning them into a platform that runs 20+ services — databases, apps, monitoring, backups, the whole thing — scratches an itch that managed services never will.

The bigger lesson was learning to think about it as a platform, not a collection of servers. Every piece — provisioning, secrets, deployments, observability — has to fit together, and the ecosystem of open and self-hostable tools has reached a point where it just works.

Working with an AI agent throughout the process taught me a lot about using them safely: scoping permissions, structuring context so the agent doesn’t drift, building guardrails into the scripts themselves. But the unexpected part was the knowledge transfer — in both directions. You have to explain infrastructure decisions clearly enough for an agent to act on them, and the agent explains back what it’s doing and why. It’s like ELI12 for your own stack. An interesting dynamic, but one you have to actively choose — it doesn’t happen by default.

But the best part is when you stop tinkering and it just runs. Services deploy, backups rotate, metrics flow, alerts fire when they should. Like clockwork. That’s the payoff.