feat(resilience): operational hardening (NEXT phase of the audit)
Deploy to VPS / deploy (push) Has been cancelled
Deploy to VPS / deploy (push) Has been cancelled
Acts on the audit's NEXT block — operational resilience. Backups (N1): - New `backup` compose service (postgres:16-alpine) runs scripts/backup-loop.sh: immediate pg_dump on start, then nightly, gzip, 14-day rotation into ./backups on the host. Configurable via BACKUP_RETENTION_DAYS / BACKUP_INTERVAL_SECONDS. (Offsite copy is the documented next step.) Resource limits + healthchecks (N2): - deploy.resources.limits.memory on postgres (2g), app (1500m), nginx (256m), backup (256m) so no container can starve the others (the Nginx outage was a reminder). - Nginx now has a healthcheck hitting a new self-served `/nginx-health` endpoint on the default_server (no upstream dependency). Chat resilience (N3): - buildSystemPrompt() wraps its 4 Prisma queries in try/catch with safe defaults — if Postgres is down the assistant degrades instead of 500-ing. - Result is cached for 60s (only on healthy builds) so we don't run 4 queries per message; CMS edits still appear within the TTL. - POST fails fast with 503 if OPENAI_API_KEY is missing (instead of breaking mid-stream after headers are sent). - streamText gets an onError handler that logs + persists an `error` AiEvent. Idempotent submissions (N4): - consultation/route.ts and operations.ts now wrap the email-tracking UPDATE in try/catch — the lead/signal is already saved, so a telemetry hiccup can't 500 the request and trigger a duplicate retry. operations.ts also returns emailError. Performance (N5): - Index GlobalNode(application, isActive) — backs the case-study join on every application page. Migration 20260609130000_index_globalnode_application. Verified: next build compiles (Docker parity, SESSION_SECRET unset), TypeScript clean, prisma schema valid, golden tests 17/17, `docker compose config` valid. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Executable
+31
@@ -0,0 +1,31 @@
|
||||
#!/bin/sh
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
# Single Postgres backup: pg_dump -> gzip -> N-day rotation.
|
||||
# Run by scripts/backup-loop.sh inside the `backup` compose service.
|
||||
# Env: DB_USER, DB_PASSWORD, DB_NAME, BACKUP_DIR, RETENTION_DAYS
|
||||
# ─────────────────────────────────────────────────────────────────────────────
|
||||
set -eu
|
||||
|
||||
BACKUP_DIR="${BACKUP_DIR:-/backups}"
|
||||
RETENTION_DAYS="${RETENTION_DAYS:-14}"
|
||||
TS=$(date -u +%Y%m%d_%H%M%S)
|
||||
OUT="${BACKUP_DIR}/flux_db_${TS}.sql.gz"
|
||||
|
||||
mkdir -p "$BACKUP_DIR"
|
||||
export PGPASSWORD="$DB_PASSWORD"
|
||||
|
||||
echo "[backup] $(date -u +%Y-%m-%dT%H:%M:%SZ) starting pg_dump -> ${OUT}"
|
||||
|
||||
# --no-owner/--no-privileges keep the dump portable across roles on restore.
|
||||
if pg_dump -h postgres -U "$DB_USER" -d "$DB_NAME" --no-owner --no-privileges | gzip -9 > "$OUT"; then
|
||||
SIZE=$(du -h "$OUT" | cut -f1)
|
||||
echo "[backup] OK: ${OUT} (${SIZE})"
|
||||
else
|
||||
echo "[backup] FAILED: pg_dump returned non-zero; removing partial file"
|
||||
rm -f "$OUT"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
# Rotation — drop dumps older than RETENTION_DAYS.
|
||||
DELETED=$(find "$BACKUP_DIR" -name 'flux_db_*.sql.gz' -mtime +"$RETENTION_DAYS" -print -delete 2>/dev/null | wc -l || echo 0)
|
||||
echo "[backup] rotation: kept last ${RETENTION_DAYS} days, pruned ${DELETED} old dump(s)"
|
||||
Reference in New Issue
Block a user