Notification sends were happening inline in the request path. Under load they added 200–800 ms to mutation latency and silently dropped emails when the SMTP pool was exhausted. This PR moves delivery onto the existing pg-boss queue so the API returns immediately and failed sends retry with backoff.
Why
When we added @mentions in comments last quarter, every mention started triggering up to three sends (in-app, email, Slack) inside the same transaction that saved the comment. That was fine at launch. It is not fine now that a single task update can fan out to forty watchers.
Before:
- Sends run inline in the mutation handler
- SMTP timeout = 500 error for the comment
- No retries — a dropped email is gone
- p99 on comments.create: 1.4 s

After:
- Handler enqueues one job per recipient, returns
- Worker retries 3× with exponential backoff
- Dead-letter table for inspection after exhaustion
- p99 on comments.create: 180 ms (staging)
File-by-file
Ordered for reading, not alphabetically. Start at the worker — it's the new thing — then the enqueue call site, then the plumbing.
packages/notify/src/worker.ts new +126
The heart of the PR. A pg-boss subscriber that pulls notify.deliver jobs, resolves the user's channel preferences, and calls the right adapter. Retries are configured per-channel — email gets three attempts, Slack gets one because its API is already idempotent on our side.
```ts
boss.work('notify.deliver', { batchSize: 20 }, async (jobs) => {
  for (const job of jobs) {
    const { userId, event, channel } = job.data;
    const prefs = await getPrefs(userId);
    if (!prefs[channel]) continue; // user muted this channel; don't skip the rest of the batch
    try {
      await adapters[channel].send(userId, event);
    } catch (err) {
      if (job.retryCount >= MAX_RETRY[channel]) {
        await deadLetter(job, err); // don't throw — ack & park
        continue; // keep processing the remaining jobs in the batch
      }
      throw err; // pg-boss reschedules
    }
  }
});
```
packages/api/src/routers/comments.ts mod +14 −62
Where the win shows up. The mutation used to call sendEmail, sendSlack, and createInApp directly. Now it inserts the comment, computes recipients, and enqueues. The try/catch soup is gone.
```ts
const comment = await db.comments.insert(input);
const recipients = await resolveWatchers(input.taskId, input.mentions);

// Removed: the inline sends that blocked the response
// for (const r of recipients) {
//   await sendEmail(r, comment);
//   await sendSlack(r, comment);
// }

// New: one job per recipient × channel, deduped by singleton key
await boss.insert(
  recipients.flatMap((r) =>
    CHANNELS.map((ch) => ({
      name: 'notify.deliver',
      data: { userId: r.id, channel: ch, event: toEvent(comment) },
      singletonKey: `${comment.id}:${r.id}:${ch}`, // idempotent
    })),
  ),
);
return comment;
```
packages/db/migrations/0051_dead_letter.sql new +22
Table for jobs that exhaust their retries. Deliberately not auto-pruned — we want to look at these weekly until we trust the new path. Has the full job payload and the last error string.
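For reviewers who want the shape without opening the migration, a rough sketch of what such a table looks like. This is an illustration only; the actual column names in 0051_dead_letter.sql may differ.

```sql
-- Sketch, not the real migration: columns are assumptions.
create table notify_dead_letter (
  id          uuid primary key default gen_random_uuid(),
  job_id      uuid not null,          -- original pg-boss job id
  payload     jsonb not null,         -- full job data, so a send can be replayed
  last_error  text not null,          -- error string from the final failed attempt
  created_at  timestamptz not null default now()
);
```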
packages/notify/src/adapters/{email,slack,inapp}.ts mod +88 −74
Mostly moves. Each adapter now implements a shared Adapter interface and throws a typed RetryableError or PermanentError so the worker knows whether to retry. The email adapter also drops its internal retry loop — the queue owns retries now, double-retrying was how we got duplicate emails in April.
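The error taxonomy might look roughly like this. A sketch only: `RetryableError`, `PermanentError`, and the `Adapter` interface come from the PR, but the fields, the `smtpSend` helper, and the status-code mapping are assumptions for illustration.

```typescript
// Sketch of the shared adapter contract; smtpSend and the field
// names beyond the class/interface names are hypothetical.
class RetryableError extends Error {
  readonly retryable = true; // worker rethrows, pg-boss reschedules
}
class PermanentError extends Error {
  readonly retryable = false; // worker parks the job immediately
}

interface Adapter {
  send(userId: string, event: unknown): Promise<void>;
}

// Hypothetical SMTP call standing in for the real transport.
async function smtpSend(_user: string, _event: unknown): Promise<number> {
  return 250; // pretend success
}

// Example: an email adapter mapping SMTP outcomes onto the taxonomy.
const email: Adapter = {
  async send(userId, event) {
    const status = await smtpSend(userId, event);
    if (status === 421) throw new RetryableError('SMTP busy');   // transient
    if (status === 550) throw new PermanentError('bad mailbox'); // never retry
  },
};
```

The point of the split is that the worker no longer guesses from error strings: adapters classify failures at the source, and the queue owns what happens next.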
apps/worker/src/index.ts, infra/fly.toml mod +31 −4
Registers the new subscriber in the existing worker process and bumps its concurrency from 5 → 20. No new deploy unit.
packages/notify/src/__tests__/worker.test.ts new +137
Covers the retry boundary, the dead-letter path, channel muting, and the singleton key dedupe. Uses a real pg-boss against the test database — we got burned last quarter when mocked queue tests passed but prod ordering broke.
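On the dedupe question specifically: the singleton key is namespaced by the comment's primary key, so recipient/channel pairs on different comments can't produce the same key unless two comments share an id. A minimal sketch of that property (the `key` helper is illustrative, not code from the PR):

```typescript
// Sketch: keys are fully determined by (commentId, userId, channel),
// and commentId is globally unique, so cross-task collisions require
// two comments with the same id.
const key = (commentId: string, userId: string, channel: string) =>
  `${commentId}:${userId}:${channel}`;

const a = key('c1', 'u7', 'email'); // comment c1 on task A
const b = key('c2', 'u7', 'email'); // comment c2 on task B, same user + channel
// a !== b
```

One genuine edge worth a reviewer's glance: if any id segment could contain `:`, the concatenation becomes ambiguous; UUID ids avoid this.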
Where to focus your review
worker.ts:31–44. I catch, check retryCount, and either park or rethrow. If this logic is wrong we either retry forever or drop messages — the two failure modes this PR exists to fix.

comments.ts:28. The singleton key `${commentId}:${userId}:${channel}` should make re-enqueues idempotent if the API handler retries. Sanity-check that this can't collide across tasks.

Test plan
Rollout
Behind notify_queue_v2. The old inline path stays in the codebase, dead but dormant, for one release in case we need to flip back.