feat(resilience): operational hardening (NEXT phase of the audit)
Deploy to VPS / deploy (push) Has been cancelled
Acts on the audit's NEXT block: operational resilience.

Backups (N1):
- New `backup` compose service (postgres:16-alpine) runs scripts/backup-loop.sh: immediate pg_dump on start, then nightly, gzip, 14-day rotation into ./backups on the host. Configurable via BACKUP_RETENTION_DAYS / BACKUP_INTERVAL_SECONDS. (Offsite copy is the documented next step.)

Resource limits + healthchecks (N2):
- deploy.resources.limits.memory on postgres (2g), app (1500m), nginx (256m), backup (256m) so no container can starve the others (the Nginx outage was a reminder).
- Nginx now has a healthcheck hitting a new self-served `/nginx-health` endpoint on the default_server (no upstream dependency).

Chat resilience (N3):
- buildSystemPrompt() wraps its 4 Prisma queries in try/catch with safe defaults: if Postgres is down, the assistant degrades instead of 500-ing.
- Result is cached for 60s (only on healthy builds) so we don't run 4 queries per message; CMS edits still appear within the TTL.
- POST fails fast with 503 if OPENAI_API_KEY is missing (instead of breaking mid-stream after headers are sent).
- streamText gets an onError handler that logs + persists an `error` AiEvent.

Idempotent submissions (N4):
- consultation/route.ts and operations.ts now wrap the email-tracking UPDATE in try/catch. The lead/signal is already saved, so a telemetry hiccup can't 500 the request and trigger a duplicate retry. operations.ts also returns emailError.

Performance (N5):
- Index GlobalNode(application, isActive), which backs the case-study join on every application page. Migration 20260609130000_index_globalnode_application.

Verified: next build compiles (Docker parity, SESSION_SECRET unset), TypeScript clean, prisma schema valid, golden tests 17/17, `docker compose config` valid.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -56,3 +56,4 @@ public/branding/
|
|||||||
# Local Claude Code / MCP config — agent-specific, not project
|
# Local Claude Code / MCP config — agent-specific, not project
|
||||||
.mcp.json
|
.mcp.json
|
||||||
.claude/
|
.claude/
|
||||||
|
backups/
|
||||||
|
|||||||
@@ -17,6 +17,12 @@ services:
|
|||||||
- pgdata:/var/lib/postgresql/data
|
- pgdata:/var/lib/postgresql/data
|
||||||
networks:
|
networks:
|
||||||
- flux-net
|
- flux-net
|
||||||
|
# Resource caps so no single container can starve the others (the Nginx
|
||||||
|
# outage earlier was a reminder). VPS has ~11 GB; these leave headroom.
|
||||||
|
deploy:
|
||||||
|
resources:
|
||||||
|
limits:
|
||||||
|
memory: 2g
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ["CMD-SHELL", "pg_isready -U ${DB_USER} -d ${DB_NAME}"]
|
test: ["CMD-SHELL", "pg_isready -U ${DB_USER} -d ${DB_NAME}"]
|
||||||
interval: 5s
|
interval: 5s
|
||||||
@@ -81,6 +87,10 @@ services:
|
|||||||
- flux-net
|
- flux-net
|
||||||
expose:
|
expose:
|
||||||
- "3000"
|
- "3000"
|
||||||
|
deploy:
|
||||||
|
resources:
|
||||||
|
limits:
|
||||||
|
memory: 1500m
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test:
|
test:
|
||||||
- CMD-SHELL
|
- CMD-SHELL
|
||||||
@@ -114,6 +124,46 @@ services:
|
|||||||
- app
|
- app
|
||||||
networks:
|
networks:
|
||||||
- flux-net
|
- flux-net
|
||||||
|
deploy:
|
||||||
|
resources:
|
||||||
|
limits:
|
||||||
|
memory: 256m
|
||||||
|
healthcheck:
|
||||||
|
# Nginx self-health (served directly by the default_server, no upstream).
|
||||||
|
test: ["CMD-SHELL", "wget -q -O /dev/null http://127.0.0.1/nginx-health || exit 1"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 5s
|
||||||
|
retries: 3
|
||||||
|
start_period: 10s
|
||||||
|
|
||||||
|
# ── Automated Postgres backups ──
|
||||||
|
# Nightly pg_dump -> gzip into ./backups on the host, 14-day rotation.
|
||||||
|
# NOTE: this is LOCAL to the VPS. Offsite copy (S3/rsync) is the recommended
|
||||||
|
# next step once the client provides storage credentials.
|
||||||
|
backup:
|
||||||
|
image: postgres:16-alpine
|
||||||
|
restart: always
|
||||||
|
depends_on:
|
||||||
|
postgres:
|
||||||
|
condition: service_healthy
|
||||||
|
environment:
|
||||||
|
DB_USER: ${DB_USER}
|
||||||
|
DB_PASSWORD: ${DB_PASSWORD}
|
||||||
|
DB_NAME: ${DB_NAME}
|
||||||
|
BACKUP_DIR: /backups
|
||||||
|
RETENTION_DAYS: ${BACKUP_RETENTION_DAYS:-14}
|
||||||
|
BACKUP_INTERVAL_SECONDS: ${BACKUP_INTERVAL_SECONDS:-86400}
|
||||||
|
volumes:
|
||||||
|
- ./backups:/backups
|
||||||
|
- ./scripts/db-backup.sh:/usr/local/bin/db-backup.sh:ro
|
||||||
|
- ./scripts/backup-loop.sh:/usr/local/bin/backup-loop.sh:ro
|
||||||
|
entrypoint: ["/bin/sh", "/usr/local/bin/backup-loop.sh"]
|
||||||
|
networks:
|
||||||
|
- flux-net
|
||||||
|
deploy:
|
||||||
|
resources:
|
||||||
|
limits:
|
||||||
|
memory: 256m
|
||||||
|
|
||||||
volumes:
|
volumes:
|
||||||
pgdata:
|
pgdata:
|
||||||
|
|||||||
@@ -22,6 +22,11 @@ server {
|
|||||||
listen 80 default_server;
|
listen 80 default_server;
|
||||||
server_name _;
|
server_name _;
|
||||||
|
|
||||||
|
# Nginx self-health endpoint (served directly, no upstream) — used by the
|
||||||
|
# docker-compose healthcheck. Reachable on 127.0.0.1 inside the container
|
||||||
|
# (no Host match needed, so it lands here on the default_server).
|
||||||
|
location = /nginx-health { return 200 "ok\n"; access_log off; }
|
||||||
|
|
||||||
# Keep ACME HTTP-01 working so certbot can still renew on any host.
|
# Keep ACME HTTP-01 working so certbot can still renew on any host.
|
||||||
location /.well-known/acme-challenge/ { root /var/www/certbot; }
|
location /.well-known/acme-challenge/ { root /var/www/certbot; }
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,10 @@
|
|||||||
|
-- ─────────────────────────────────────────────────────────────────────────
|
||||||
|
-- ADDITIVE MIGRATION — index GlobalNode(application, isActive).
|
||||||
|
-- The application detail page queries case studies by application slug +
|
||||||
|
-- isActive (the GlobalNode.application -> Application.slug join). Without an
|
||||||
|
-- index this is a full table scan on every application page render.
|
||||||
|
-- Idempotent. Safe for `migrate deploy`.
|
||||||
|
-- ─────────────────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
CREATE INDEX IF NOT EXISTS "GlobalNode_application_isActive_idx"
|
||||||
|
ON "GlobalNode" ("application", "isActive");
|
||||||
@@ -64,6 +64,9 @@ model GlobalNode {
|
|||||||
@@index([isActive])
|
@@index([isActive])
|
||||||
@@index([nodeType])
|
@@index([nodeType])
|
||||||
@@index([nodeType, isActive])
|
@@index([nodeType, isActive])
|
||||||
|
// Case studies on an application page filter by application slug + isActive
|
||||||
|
// (src/app/[locale]/applications/[slug]/page.tsx). Back this join with an index.
|
||||||
|
@@index([application, isActive])
|
||||||
}
|
}
|
||||||
|
|
||||||
// ------------------------------------------------------
|
// ------------------------------------------------------
|
||||||
|
|||||||
Executable
+15
@@ -0,0 +1,15 @@
|
|||||||
|
#!/bin/sh
|
||||||
|
# ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Backup service entrypoint. Runs one backup immediately on start, then loops
|
||||||
|
# every BACKUP_INTERVAL_SECONDS (default 24h). A loop (vs cron) inherits the
|
||||||
|
# container environment cleanly and survives restarts without lost schedules.
|
||||||
|
# ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
set -eu
|
||||||
|
|
||||||
|
INTERVAL="${BACKUP_INTERVAL_SECONDS:-86400}"
|
||||||
|
echo "[backup] service started; interval=${INTERVAL}s, retention=${RETENTION_DAYS:-14}d"
|
||||||
|
|
||||||
|
while true; do
|
||||||
|
/usr/local/bin/db-backup.sh || echo "[backup] cycle failed; will retry next interval"
|
||||||
|
sleep "$INTERVAL"
|
||||||
|
done
|
||||||
Executable
+31
@@ -0,0 +1,31 @@
|
|||||||
|
#!/bin/sh
|
||||||
|
# ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
# Single Postgres backup: pg_dump -> gzip -> N-day rotation.
|
||||||
|
# Run by scripts/backup-loop.sh inside the `backup` compose service.
|
||||||
|
# Env: DB_USER, DB_PASSWORD, DB_NAME, BACKUP_DIR, RETENTION_DAYS
|
||||||
|
# ─────────────────────────────────────────────────────────────────────────────
|
||||||
|
set -eu
|
||||||
|
|
||||||
|
BACKUP_DIR="${BACKUP_DIR:-/backups}"
|
||||||
|
RETENTION_DAYS="${RETENTION_DAYS:-14}"
|
||||||
|
TS=$(date -u +%Y%m%d_%H%M%S)
|
||||||
|
OUT="${BACKUP_DIR}/flux_db_${TS}.sql.gz"
|
||||||
|
|
||||||
|
mkdir -p "$BACKUP_DIR"
|
||||||
|
export PGPASSWORD="$DB_PASSWORD"
|
||||||
|
|
||||||
|
echo "[backup] $(date -u +%Y-%m-%dT%H:%M:%SZ) starting pg_dump -> ${OUT}"
|
||||||
|
|
||||||
|
# --no-owner/--no-privileges keep the dump portable across roles on restore.
|
||||||
|
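# -Z 9 has pg_dump gzip its own output, so the exit status tested here is pg_dump's (a `| gzip` pipe would report gzip's).
if pg_dump -h postgres -U "$DB_USER" -d "$DB_NAME" --no-owner --no-privileges -Z 9 -f "$OUT"; then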
|
||||||
|
SIZE=$(du -h "$OUT" | cut -f1)
|
||||||
|
echo "[backup] OK: ${OUT} (${SIZE})"
|
||||||
|
else
|
||||||
|
echo "[backup] FAILED: pg_dump returned non-zero; removing partial file"
|
||||||
|
rm -f "$OUT"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Rotation — drop dumps older than RETENTION_DAYS.
|
||||||
|
DELETED=$(find "$BACKUP_DIR" -name 'flux_db_*.sql.gz' -mtime +"$RETENTION_DAYS" -print -delete 2>/dev/null | wc -l || echo 0)
|
||||||
|
echo "[backup] rotation: kept last ${RETENTION_DAYS} days, pruned ${DELETED} old dump(s)"
|
||||||
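The rotation step above relies on `find -mtime +"$RETENTION_DAYS"`. As a rough illustration of what that test matches, the sketch below mirrors the rule documented for GNU find: a file's age is counted in whole 24-hour periods with the fractional part discarded, and `+N` means strictly more than N. BusyBox find in the Alpine image is assumed to behave the same way. `Dump`, `ageInDays`, and `dumpsToPrune` are hypothetical names used only for this example and are not part of the commit.

```typescript
// Hypothetical helpers that mimic `find -mtime +N`; for illustration only.
const DAY_MS = 24 * 60 * 60 * 1000;

interface Dump {
  name: string;
  mtimeMs: number; // last-modified time, epoch milliseconds
}

/** Age in whole 24-hour periods; the fractional part is discarded. */
function ageInDays(mtimeMs: number, nowMs: number): number {
  return Math.floor((nowMs - mtimeMs) / DAY_MS);
}

/** Dumps that `-mtime +retentionDays` would match: age strictly greater than N. */
function dumpsToPrune(dumps: Dump[], retentionDays: number, nowMs: number): Dump[] {
  return dumps.filter((d) => ageInDays(d.mtimeMs, nowMs) > retentionDays);
}

// Fixed "now" so the example is deterministic: 2026-06-30 02:00:00 UTC.
const now = Date.UTC(2026, 5, 30, 2, 0, 0);
const dumps: Dump[] = [
  { name: "flux_db_20260629_020000.sql.gz", mtimeMs: now - 1 * DAY_MS },
  { name: "flux_db_20260616_020000.sql.gz", mtimeMs: now - 14 * DAY_MS },
  { name: "flux_db_20260615_080000.sql.gz", mtimeMs: now - 14.75 * DAY_MS },
  { name: "flux_db_20260615_020000.sql.gz", mtimeMs: now - 15 * DAY_MS },
];

// Only the dump that is a full 15 days old is selected for pruning.
const pruned = dumpsToPrune(dumps, 14, now).map((d) => d.name);
console.log(pruned);
```

Under these assumptions, a retention of 14 keeps a dump until it is at least 15 full days old, so the directory can briefly hold one more dump than the retention number suggests.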
@@ -92,7 +92,10 @@ export async function submitOperationsSignal(payload: {
|
|||||||
replyTo: payload.clientEmail,
|
replyTo: payload.clientEmail,
|
||||||
});
|
});
|
||||||
|
|
||||||
// Track email delivery in DB
|
// Track email delivery — best-effort. The signal (lead) is already saved,
|
||||||
|
// so a telemetry-update hiccup must NOT fail the request and make the
|
||||||
|
// client retry into a duplicate.
|
||||||
|
try {
|
||||||
await prisma.operationsSignal.update({
|
await prisma.operationsSignal.update({
|
||||||
where: { id: signal.id },
|
where: { id: signal.id },
|
||||||
data: {
|
data: {
|
||||||
@@ -101,8 +104,11 @@ export async function submitOperationsSignal(payload: {
|
|||||||
emailError: emailResult.error,
|
emailError: emailResult.error,
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
|
} catch (trackErr) {
|
||||||
|
console.warn("[operations] email tracking update failed (lead already saved):", trackErr);
|
||||||
|
}
|
||||||
|
|
||||||
return { success: true, ticketId, emailSent: emailResult.success };
|
return { success: true, ticketId, emailSent: emailResult.success, emailError: emailResult.error };
|
||||||
} catch (error) {
|
} catch (error) {
|
||||||
console.error("Error submitting signal:", error);
|
console.error("Error submitting signal:", error);
|
||||||
return { error: "Failed to submit request. Please try again." };
|
return { error: "Failed to submit request. Please try again." };
|
||||||
|
|||||||
@@ -39,9 +39,24 @@ const COMPARISON_DATA: Record<string, { rf: number; traditional: number; unit: s
|
|||||||
// ─── DYNAMIC SYSTEM PROMPT BUILDER ──────────────────────────────
|
// ─── DYNAMIC SYSTEM PROMPT BUILDER ──────────────────────────────
|
||||||
// Injects real-time database context so the AI knows what exists
|
// Injects real-time database context so the AI knows what exists
|
||||||
|
|
||||||
|
// Cache the built prompt briefly so we don't run 4 DB queries on every single
|
||||||
|
// chat message. CMS changes appear within the TTL. Only healthy builds are
|
||||||
|
// cached, so a transient DB outage retries on the next message.
|
||||||
|
let _promptCache: { value: string; at: number } | null = null;
|
||||||
|
const SYSTEM_PROMPT_TTL_MS = 60_000;
|
||||||
|
|
||||||
async function buildSystemPrompt(): Promise<string> {
|
async function buildSystemPrompt(): Promise<string> {
|
||||||
// Query real data from Prisma
|
if (_promptCache && Date.now() - _promptCache.at < SYSTEM_PROMPT_TTL_MS) {
|
||||||
const [activeApps, installationCount, eventCount, partsCount] = await Promise.all([
|
return _promptCache.value;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Live DB context. If Postgres is unreachable, fall back to safe defaults so
|
||||||
|
// the assistant still answers (degraded) instead of 500-ing the whole chat.
|
||||||
|
let activeApps: Array<{ slug: string; title: string; shortDescription: string; category: string }> = [];
|
||||||
|
let installationCount = 0, eventCount = 0, partsCount = 0;
|
||||||
|
let dbOk = true;
|
||||||
|
try {
|
||||||
|
[activeApps, installationCount, eventCount, partsCount] = await Promise.all([
|
||||||
prisma.application.findMany({
|
prisma.application.findMany({
|
||||||
where: { isActive: true },
|
where: { isActive: true },
|
||||||
select: { slug: true, title: true, shortDescription: true, category: true },
|
select: { slug: true, title: true, shortDescription: true, category: true },
|
||||||
@@ -51,10 +66,16 @@ async function buildSystemPrompt(): Promise<string> {
|
|||||||
prisma.globalNode.count({ where: { nodeType: 'event', isActive: true } }),
|
prisma.globalNode.count({ where: { nodeType: 'event', isActive: true } }),
|
||||||
prisma.sparePart.count({ where: { isActive: true } }),
|
prisma.sparePart.count({ where: { isActive: true } }),
|
||||||
]);
|
]);
|
||||||
|
} catch (e) {
|
||||||
|
dbOk = false;
|
||||||
|
log.warn('chat.system_prompt_db_unavailable', { err: String(e) });
|
||||||
|
}
|
||||||
|
|
||||||
const appList = activeApps.map((a: any) => ` - ${a.title} (slug: "${a.slug}", category: ${a.category})`).join('\n');
|
const appList = activeApps.length
|
||||||
|
? activeApps.map((a) => ` - ${a.title} (slug: "${a.slug}", category: ${a.category})`).join('\n')
|
||||||
|
: ' (live catalog temporarily unavailable — describe FLUX applications from general RF knowledge)';
|
||||||
|
|
||||||
return `You are "FluxAI", the intelligent engineering advisor and sales specialist for FLUX Srl — a world leader in solid-state Radio Frequency (RF), Microwave, and Infrared industrial equipment. Founded by Patrizio Grando with 40+ years of legacy. Headquarters: Romano d'Ezzelino, Vicenza, Italy.
|
const prompt = `You are "FluxAI", the intelligent engineering advisor and sales specialist for FLUX Srl — a world leader in solid-state Radio Frequency (RF), Microwave, and Infrared industrial equipment. Founded by Patrizio Grando with 40+ years of legacy. Headquarters: Romano d'Ezzelino, Vicenza, Italy.
|
||||||
|
|
||||||
PERSONALITY:
|
PERSONALITY:
|
||||||
- Senior RF engineer who also understands business ROI.
|
- Senior RF engineer who also understands business ROI.
|
||||||
@@ -143,6 +164,10 @@ PROACTIVE NEXT STEPS (always suggest the next logical action):
|
|||||||
comparison → "Let me quantify the difference for your specific operation..." → energy_savings_calculator
|
comparison → "Let me quantify the difference for your specific operation..." → energy_savings_calculator
|
||||||
|
|
||||||
LANGUAGE: Respond in the exact same language the user writes in.`;
|
LANGUAGE: Respond in the exact same language the user writes in.`;
|
||||||
|
|
||||||
|
// Only cache a healthy build so a transient DB outage retries next message.
|
||||||
|
if (dbOk) _promptCache = { value: prompt, at: Date.now() };
|
||||||
|
return prompt;
|
||||||
}
|
}
|
||||||
|
|
||||||
// ─── HELPER: Parse JSON safely ──────────────────────────────────
|
// ─── HELPER: Parse JSON safely ──────────────────────────────────
|
||||||
@@ -198,6 +223,17 @@ export async function POST(req: Request) {
|
|||||||
);
|
);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ─── Fail fast if the AI provider isn't configured ─────────────
|
||||||
|
// Without this, a missing/invalid key surfaces mid-stream after headers
|
||||||
|
// are already sent, producing a confusing broken response.
|
||||||
|
if (!process.env.OPENAI_API_KEY) {
|
||||||
|
log.error("chat.openai_key_missing", new Error("OPENAI_API_KEY is not set"));
|
||||||
|
return new Response(
|
||||||
|
JSON.stringify({ error: "The AI assistant is temporarily unavailable. Please try again later." }),
|
||||||
|
{ status: 503, headers: { "Content-Type": "application/json" } },
|
||||||
|
);
|
||||||
|
}
|
||||||
|
|
||||||
const {
|
const {
|
||||||
messages,
|
messages,
|
||||||
context,
|
context,
|
||||||
@@ -287,6 +323,20 @@ export async function POST(req: Request) {
|
|||||||
system: systemPrompt + contextNote,
|
system: systemPrompt + contextNote,
|
||||||
messages: coreMessages,
|
messages: coreMessages,
|
||||||
providerOptions: { openai: { promptCacheKey: 'fluxai-v1' } },
|
providerOptions: { openai: { promptCacheKey: 'fluxai-v1' } },
|
||||||
|
// Surface streaming/provider errors (OpenAI 429/500, bad key) in the logs
|
||||||
|
// and, when possible, persist them to the conversation timeline.
|
||||||
|
onError: ({ error }) => {
|
||||||
|
log.error("chat.stream_error", error, { conversationId: conversationId ?? undefined });
|
||||||
|
if (conversationId) {
|
||||||
|
prisma.aiEvent.create({
|
||||||
|
data: {
|
||||||
|
conversationId,
|
||||||
|
type: "error",
|
||||||
|
payloadJson: JSON.stringify({ message: (error instanceof Error ? error.message : String(error)).slice(0, 2000) }), // truncate the message, not the JSON, so the payload stays parseable
|
||||||
|
},
|
||||||
|
}).catch(() => {});
|
||||||
|
}
|
||||||
|
},
|
||||||
onFinish: async ({ usage, toolCalls, toolResults }) => {
|
onFinish: async ({ usage, toolCalls, toolResults }) => {
|
||||||
if (!conversationId) return;
|
if (!conversationId) return;
|
||||||
try {
|
try {
|
||||||
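The caching rule in buildSystemPrompt() (serve from cache within the TTL, but store only results built from a healthy database) can be looked at in isolation. The following is a minimal sketch under stated assumptions, not the code in this commit: `cacheHealthyOnly`, `BuildResult`, and the injectable `now` clock are hypothetical names introduced so the behaviour can be exercised without Prisma or real time.

```typescript
// Hypothetical generic helper; the commit inlines equivalent logic in buildSystemPrompt().
interface BuildResult<T> {
  value: T;
  healthy: boolean; // false when the value was built from fallback defaults
}

function cacheHealthyOnly<T>(
  build: () => Promise<BuildResult<T>>,
  ttlMs: number,
  now: () => number = Date.now,
): () => Promise<T> {
  let cached: { value: T; at: number } | null = null;
  return async () => {
    // Serve the cached value while it is younger than the TTL.
    if (cached && now() - cached.at < ttlMs) return cached.value;
    const result = await build();
    // A degraded result is returned but never stored, so the next call retries.
    if (result.healthy) cached = { value: result.value, at: now() };
    return result.value;
  };
}

// Simulated outage followed by recovery, driven by a fake clock.
async function demo(): Promise<string[]> {
  let clock = 0;
  let dbUp = false;
  let builds = 0;
  const getPrompt = cacheHealthyOnly(
    async () => {
      builds += 1;
      return dbUp
        ? { value: `live#${builds}`, healthy: true }
        : { value: "fallback", healthy: false };
    },
    60_000,
    () => clock,
  );

  const seen: string[] = [];
  seen.push(await getPrompt()); // DB down: fallback, not cached
  dbUp = true;
  seen.push(await getPrompt()); // rebuilt straight away and cached
  clock = 30_000;
  seen.push(await getPrompt()); // inside the TTL: served from cache
  clock = 60_000;
  seen.push(await getPrompt()); // TTL elapsed: rebuilt
  return seen;
}

demo().then((seen) => console.log(seen));
```

The design choice this illustrates is that a degraded prompt costs one extra rebuild attempt per message during an outage, in exchange for recovering on the first message after the database returns rather than after a full TTL.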
|
|||||||
@@ -145,6 +145,9 @@ export async function POST(request: NextRequest) {
|
|||||||
replyTo: contact.email,
|
replyTo: contact.email,
|
||||||
});
|
});
|
||||||
|
|
||||||
|
// Best-effort email tracking — the lead is already saved; never fail the
|
||||||
|
// request (and risk a client retry / duplicate) over a telemetry update.
|
||||||
|
try {
|
||||||
await prisma.operationsSignal.update({
|
await prisma.operationsSignal.update({
|
||||||
where: { id: signal.id },
|
where: { id: signal.id },
|
||||||
data: {
|
data: {
|
||||||
@@ -153,6 +156,9 @@ export async function POST(request: NextRequest) {
|
|||||||
emailError: emailResult.error,
|
emailError: emailResult.error,
|
||||||
},
|
},
|
||||||
});
|
});
|
||||||
|
} catch (trackErr) {
|
||||||
|
log.warn("consultation.email_tracking_failed", { ticketId, err: String(trackErr) });
|
||||||
|
}
|
||||||
|
|
||||||
log.info("consultation.submitted", { ticketId, emailSent: emailResult.success });
|
log.info("consultation.submitted", { ticketId, emailSent: emailResult.success });
|
||||||
|
|
||||||
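Both submission paths now treat the email-tracking UPDATE as best-effort once the lead is saved. One way to picture the pattern is a small wrapper that runs a secondary write, logs on failure, and never throws. This is a hedged sketch only: `bestEffort`, `submitLead`, and the `TICKET-1` id are hypothetical stand-ins, and the commit itself uses inline try/catch blocks rather than a shared helper.

```typescript
// Hypothetical helper, not part of the commit.
type Warn = (message: string, detail: string) => void;

/** Runs a secondary operation; reports failure via `warn` instead of throwing. */
async function bestEffort(
  label: string,
  op: () => Promise<unknown>,
  warn: Warn = (m, d) => console.warn(m, d),
): Promise<boolean> {
  try {
    await op();
    return true;
  } catch (err) {
    warn(`${label} failed; primary record already saved`, String(err));
    return false;
  }
}

// Stand-in for a route handler: the lead counts as saved before tracking is attempted.
async function submitLead(trackingWorks: boolean) {
  const ticketId = "TICKET-1"; // placeholder id for illustration
  const warnings: string[] = [];
  const tracked = await bestEffort(
    "email tracking update",
    async () => {
      if (!trackingWorks) throw new Error("db timeout");
    },
    (m) => warnings.push(m),
  );
  // The response stays successful either way, so the client has no reason to retry.
  return { success: true, ticketId, tracked, warnings };
}

submitLead(false).then((r) => console.log(r.success, r.tracked, r.warnings.length));
```

Because the wrapper swallows the failure, the caller still returns a success response when only the telemetry write fails, which is what prevents a client retry from creating a duplicate lead.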
|
|||||||