ADR-013 | Accepted | March 10, 2024

Zero-Downtime Deploy Strategy with Atomic Rollouts

Context

The portfolio site and its API are deployed independently (ADR-005) with different deployment mechanisms. The frontend deploys to Vercel via git push, which triggers a build-deploy pipeline. The API deploys via SSH to a dedicated server running pm2 as the process manager. Both deployment flows have a critical requirement: zero user-visible downtime. A visitor loading the site during a deploy should never see a broken page, a 502 error, or a partially loaded state. Vercel's deployment model is inherently atomic — the new build replaces the old one at the CDN edge in a single cut-over. But the API server requires explicit orchestration to ensure that in-flight requests complete before the new process takes over, and that the new process is healthy before receiving traffic. The challenge is not the happy path — it's the failure path: what happens when a deploy introduces a broken build or a crashing API process?

Decision

Implement atomic deployment on both surfaces with distinct strategies.

Frontend (Vercel): rely on Vercel's immutable deployment model. Each deploy creates a new unique URL, and the production alias only switches after the build succeeds and passes health checks. Previous deployments remain accessible via their unique URLs, enabling instant rollback by re-aliasing. No custom configuration is needed; this is Vercel's default behavior, but it is a deliberate architectural dependency.

API (pm2): use pm2's reload command (not restart), which spawns new processes before killing old ones, ensuring zero-gap coverage. The pm2 ecosystem file enables wait_ready, so pm2 waits for a ready signal (process.send('ready')) that the new process emits after connecting to MongoDB and completing initialization. Old processes receive pm2's shutdown signal (SIGINT by default, not SIGTERM) only after new processes are marked ready. A deploy script wraps git pull, npm install, and pm2 reload in a sequence that aborts on any step failure, leaving the running process untouched.
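The ready-signal handshake can be sketched as follows. This is a minimal illustration, not the project's actual configuration: the app name, entry point, instance count, and timeout values are assumptions, and connectToDatabase and listen are hypothetical callbacks standing in for the real initialization code. In a real setup the config object would be the module.exports of ecosystem.config.js; both parts share one snippet here for brevity.

```javascript
// pm2 ecosystem config (sketch). All concrete values below are assumptions.
const ecosystem = {
  apps: [
    {
      name: "portfolio-api",  // hypothetical app name
      script: "./server.js",  // hypothetical entry point
      exec_mode: "cluster",   // pm2 documents zero-downtime reload for cluster mode
      instances: 2,           // assumed instance count
      wait_ready: true,       // wait for process.send('ready') before treating the process as up
      listen_timeout: 10000,  // ms pm2 waits for the ready signal (assumed value)
      kill_timeout: 5000,     // ms between the shutdown signal and SIGKILL (assumed value)
    },
  ],
};

// Startup sequence: finish initialization first, then notify pm2.
// If connectToDatabase or listen rejects, "ready" is never sent.
async function startApi({ connectToDatabase, listen, send }) {
  await connectToDatabase();
  await listen();
  // process.send exists only when the process has an IPC channel (as under pm2).
  const notify = send ?? (process.send ? process.send.bind(process) : null);
  if (notify) notify("ready");
}
```

Sending the signal only after both awaits resolve is what keeps traffic away from a process that has not yet connected to MongoDB.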

Consequences

Positive: Zero downtime verified across 200+ deployments over 20 months. The pm2 ready-signal pattern ensures that no request hits an uninitialized API process: cold connections to MongoDB are resolved before the process receives traffic. Vercel's immutable deploys provide instant rollback: any previous deploy can be promoted to production in under 5 seconds via the dashboard or CLI. The deploy script's abort-on-failure behavior has prevented 3 broken deploys from reaching production (npm install failures, syntax errors caught by build).

Negative: The pm2 reload strategy temporarily doubles memory usage during the overlap window (old + new processes running simultaneously). For a small API this is negligible (~50MB spike), but it would not scale to memory-constrained environments. The deploy script is imperative (bash) rather than declarative (CI/CD pipeline), making it harder to audit and reproduce. There is no automated rollback on the API side: if a deploy passes the ready signal but causes runtime errors, manual intervention is required.
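The abort-on-failure ordering can be illustrated with a short sketch. The real deploy script is bash; this Node.js version is not that script. It exists only to show the control flow in the same language as the other examples. The pm2 app name and the runDeploy helper are hypothetical.

```javascript
const { execFileSync } = require("node:child_process");

// Steps in the order the ADR describes; "portfolio-api" is a hypothetical pm2 app name.
const DEPLOY_STEPS = [
  ["git", ["pull"]],
  ["npm", ["install"]],
  ["pm2", ["reload", "portfolio-api"]],
];

// execFileSync throws when the command exits with a non-zero status.
function defaultRun(cmd, args) {
  execFileSync(cmd, args, { stdio: "inherit" });
}

// Run the steps in order and stop at the first failure.
// Because pm2 reload is last, an earlier failure never touches the running process.
function runDeploy(steps = DEPLOY_STEPS, run = defaultRun) {
  const completed = [];
  for (const [cmd, args] of steps) {
    const label = [cmd, ...args].join(" ");
    try {
      run(cmd, args);
    } catch (err) {
      return { ok: false, failedStep: label, completed };
    }
    completed.push(label);
  }
  return { ok: true, failedStep: null, completed };
}
```

The run parameter is injectable so the ordering can be exercised without invoking git, npm, or pm2.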

Calibrated Uncertainty

Predictions at Decision Time

Expected Vercel's atomic deploy model to handle frontend rollouts without any custom work. Predicted pm2's reload with ready signals would provide zero-gap coverage. Assumed the deploy script would be a temporary solution replaced by a CI/CD pipeline within 6 months. Estimated the memory spike during dual-process overlap would be under 100MB.

Measured Outcomes

Vercel's atomic deploys work exactly as expected — zero frontend downtime across all deployments. pm2 reload with ready signals delivers zero-gap coverage as predicted. The 'temporary' deploy script has been in production for 20+ months with no replacement — it works reliably enough that the CI/CD migration has never been prioritized. Memory spike during API deploys is ~50MB, well under the 100MB estimate. The unexpected outcome: the imperative bash script, despite being architecturally inelegant, has proven more debuggable than expected because each step can be run manually in isolation.

Unknowns at Decision Time

Did not know at decision time whether pm2's reload command would handle MongoDB connection pooling correctly during the overlap window. If the old process held exclusive locks or exhausted connection limits, the new process might fail to connect. In practice, MongoDB's connection pooling handles concurrent connections from both processes without issue. Also unknown: whether Vercel would maintain backward compatibility for its immutable deploy model, which the rollback strategy depends on entirely.
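One way to keep the database side of the overlap short is for the old process to drain requests and close its connection pool when pm2 signals it to stop. The ADR does not say whether the API does this, so the following is a sketch under that assumption. installGracefulShutdown is a hypothetical helper, server is any object with an http.Server-style close(callback), and dbClient is any object with an async close() (a MongoDB driver client has one).

```javascript
// Register a shutdown handler for the outgoing process during a reload.
// proc and exit are injectable so the handler can be exercised without a real process.
function installGracefulShutdown({
  server,
  dbClient,
  signals = ["SIGINT", "SIGTERM"],
  proc = process,
  exit = (code) => process.exit(code),
}) {
  let shuttingDown = false;

  const shutdown = async () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    // Stop accepting new connections; the callback runs once existing ones have ended.
    await new Promise((resolve) => server.close(resolve));
    // Release this process's connection pool so only the new process holds connections.
    await dbClient.close();
    exit(0);
  };

  for (const sig of signals) proc.on(sig, shutdown);
  return shutdown;
}
```

If draining takes longer than pm2's kill_timeout, pm2 ends the process with SIGKILL, so the timeout needs to cover the slowest expected request.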

Reversibility Classification

Two-Way Door

The deploy strategy is entirely configuration. Switching from pm2 reload to pm2 restart (simpler, but with a brief downtime window) is a one-line change. Replacing the bash deploy script with a CI/CD pipeline (GitHub Actions, etc.) is a parallel implementation that can coexist during migration. Vercel's deploy model requires no configuration — it's the platform default. Estimated effort to change any aspect: 1-4 hours depending on the target approach.

Strongest Counter-Argument

A proper CI/CD pipeline (GitHub Actions → SSH deploy → health check → traffic switch) would provide better auditability, automated rollback on health check failure, and a declarative deployment specification. The bash script is a single point of failure: if it is corrupted or the SSH session drops mid-deploy, the state is ambiguous. Docker containerization would move the overlap out of pm2 by running the new container alongside the old one and switching at the load balancer level, although both containers would still occupy memory during the switch, so it would not by itself remove the 'dual process' memory cost. The counter-counter: the bash script deploys in under 10 seconds and has a 100% success rate over 200+ deploys. The engineering effort to replace it would be spent on operational infrastructure rather than product features.

Technical Context

Stack

Vercel, pm2, Node.js, Bash, Nginx

Deploys With Zero Downtime: 200+
Frontend Rollback Time: <5 seconds
API Reload Overlap Window: ~3 seconds
Memory Spike During Deploy: ~50MB

Constraints

pm2 reload requires enough memory for dual processes
API rollback is manual
Deploy script is bash, not CI/CD