feat(resilience): operational hardening (NEXT phase of the audit)

Acts on the audit's NEXT block: operational resilience.

Backups (N1):
- New `backup` compose service (postgres:16-alpine) runs
  scripts/backup-loop.sh: immediate pg_dump on start, then nightly, gzip,
  14-day rotation into ./backups on the host. Configurable via
  BACKUP_RETENTION_DAYS / BACKUP_INTERVAL_SECONDS. (Offsite copy is the
  documented next step.)

Resource limits + healthchecks (N2):
- deploy.resources.limits.memory on postgres (2g), app (1500m),
  nginx (256m), backup (256m) so no container can starve the others
  (the Nginx outage was a reminder).
- Nginx now has a healthcheck hitting a new self-served `/nginx-health`
  endpoint on the default_server (no upstream dependency).

Chat resilience (N3):
- buildSystemPrompt() wraps its 4 Prisma queries in try/catch with safe
  defaults; if Postgres is down the assistant degrades instead of 500-ing.
- Result is cached for 60s (only on healthy builds) so we don't run
  4 queries per message; CMS edits still appear within the TTL.
- POST fails fast with 503 if OPENAI_API_KEY is missing (instead of
  breaking mid-stream after headers are sent).
- streamText gets an onError handler that logs + persists an `error`
  AiEvent.

Idempotent submissions (N4):
- consultation/route.ts and operations.ts now wrap the email-tracking
  UPDATE in try/catch: the lead/signal is already saved, so a telemetry
  hiccup can't 500 the request and trigger a duplicate retry.
  operations.ts also returns emailError.

Performance (N5):
- Index GlobalNode(application, isActive), which backs the case-study
  join on every application page.
  Migration 20260609130000_index_globalnode_application.

Verified: next build compiles (Docker parity, SESSION_SECRET unset),
TypeScript clean, prisma schema valid, golden tests 17/17,
`docker compose config` valid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
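
The chat-resilience item (N3) describes buildSystemPrompt() as catching query failures, falling back to safe defaults, and caching only healthy results for 60s. The actual implementation is not part of this diff, so what follows is a minimal sketch of that pattern under stated assumptions: `PromptParts`, `SAFE_DEFAULTS`, `render`, and the injectable `load` / `now` parameters are hypothetical names, with `load` standing in for the four Prisma queries.

```typescript
// Hypothetical sketch of "cache healthy builds, degrade on failure".
// Only the name buildSystemPrompt and the 60s TTL come from the commit message.
type PromptParts = { services: string[]; faqs: string[] };

const SAFE_DEFAULTS: PromptParts = { services: [], faqs: [] };
const TTL_MS = 60_000;

let cached: { value: string; expiresAt: number } | null = null;

function render(parts: PromptParts): string {
  const services = parts.services.join(", ") || "n/a";
  const faqs = parts.faqs.join(" | ") || "n/a";
  return `Services: ${services}\nFAQs: ${faqs}`;
}

// `load` stands in for the database queries; `now` is injectable so the
// TTL can be exercised without waiting.
async function buildSystemPrompt(
  load: () => Promise<PromptParts>,
  now: () => number = Date.now,
): Promise<string> {
  if (cached && cached.expiresAt > now()) return cached.value;
  try {
    const prompt = render(await load());
    // Cache healthy builds only, so an outage is never pinned for the TTL.
    cached = { value: prompt, expiresAt: now() + TTL_MS };
    return prompt;
  } catch (err) {
    console.error("buildSystemPrompt: falling back to defaults", err);
    return render(SAFE_DEFAULTS); // degraded result, deliberately not cached
  }
}
```

Leaving the degraded result out of the cache means the next message after the database recovers gets a full prompt immediately, at the cost of retrying the queries on every message while the outage lasts.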
2026-06-09 23:07:38 -05:00
parent 18d5ed87c8
commit a81ee50ed8
10 changed files with 208 additions and 31 deletions
@@ -17,6 +17,12 @@ services:
      - pgdata:/var/lib/postgresql/data
    networks:
      - flux-net
+    # Resource caps so no single container can starve the others (the Nginx
+    # outage earlier was a reminder). VPS has ~11 GB; these leave headroom.
+    deploy:
+      resources:
+        limits:
+          memory: 2g
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER} -d ${DB_NAME}"]
      interval: 5s
@@ -81,6 +87,10 @@ services:
      - flux-net
    expose:
      - "3000"
+    deploy:
+      resources:
+        limits:
+          memory: 1500m
    healthcheck:
      test:
        - CMD-SHELL
@@ -114,6 +124,46 @@ services:
      - app
    networks:
      - flux-net
+    deploy:
+      resources:
+        limits:
+          memory: 256m
+    healthcheck:
+      # Nginx self-health (served directly by the default_server, no upstream).
+      test: ["CMD-SHELL", "wget -q -O /dev/null http://127.0.0.1/nginx-health || exit 1"]
+      interval: 30s
+      timeout: 5s
+      retries: 3
+      start_period: 10s
+
+  # ── Automated Postgres backups ──
+  # Nightly pg_dump -> gzip into ./backups on the host, 14-day rotation.
+  # NOTE: this is LOCAL to the VPS. Offsite copy (S3/rsync) is the recommended
+  # next step once the client provides storage credentials.
+  backup:
+    image: postgres:16-alpine
+    restart: always
+    depends_on:
+      postgres:
+        condition: service_healthy
+    environment:
+      DB_USER: ${DB_USER}
+      DB_PASSWORD: ${DB_PASSWORD}
+      DB_NAME: ${DB_NAME}
+      BACKUP_DIR: /backups
+      RETENTION_DAYS: ${BACKUP_RETENTION_DAYS:-14}
+      BACKUP_INTERVAL_SECONDS: ${BACKUP_INTERVAL_SECONDS:-86400}
+    volumes:
+      - ./backups:/backups
+      - ./scripts/db-backup.sh:/usr/local/bin/db-backup.sh:ro
+      - ./scripts/backup-loop.sh:/usr/local/bin/backup-loop.sh:ro
+    entrypoint: ["/bin/sh", "/usr/local/bin/backup-loop.sh"]
+    networks:
+      - flux-net
+    deploy:
+      resources:
+        limits:
+          memory: 256m

 volumes:
  pgdata: