Troubleshooting & Recovery
Quick issues on top, full operational recovery playbooks below — chain breakage, idempotency, re-anchoring missed events, secret rotation rollback, score reconciliation.
Quick issues
[high] 401 Unauthorized on writes
Causes:
- Missing or expired JWT
- Wrong token format (must be Bearer)
Fixes:
- Call POST /api/auth/login to get a fresh token
- Check the Authorization header: `Bearer <token>` (note the space)
- Tokens expire after 7 days by default; refresh on 401
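The three failure modes above (missing header, wrong format, expired token) can be told apart on the client before a write is sent. The sketch below is illustrative only: `checkAuthHeader` is a hypothetical helper, and it assumes the AEVION JWT carries the standard `exp` claim in seconds. It does not verify the signature.

```typescript
import { Buffer } from "node:buffer";

type AuthCheck = "missing" | "bad-format" | "expired" | "ok";

/** Classifies an Authorization header value. Hypothetical helper; no signature check. */
function checkAuthHeader(header: string | undefined, nowSec: number): AuthCheck {
  if (!header) return "missing";
  // Exactly "Bearer", one space, then the token.
  const token = /^Bearer (\S+)$/.exec(header)?.[1];
  if (!token) return "bad-format";
  const segments = token.split(".");
  if (segments.length !== 3) return "bad-format";
  try {
    const payloadJson = Buffer.from(segments[1] ?? "", "base64url").toString("utf8");
    const payload = JSON.parse(payloadJson);
    // Assumption: `exp` is the standard JWT expiry claim (seconds since the epoch).
    if (typeof payload.exp === "number" && payload.exp <= nowSec) return "expired";
    return "ok";
  } catch {
    // Payload segment was not valid base64url-encoded JSON.
    return "bad-format";
  }
}
```

A client could run this check before each write and call POST /api/auth/login when the result is `"expired"` or `"missing"`, instead of waiting for the 401.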
[medium] 409 Conflict / idempotent:true on charge replay
Causes:
- Replaying a request with the same paymentRef
- Stripe webhook redelivery
- Network retry after success
Fixes:
- Treat `idempotent:true` (QMaskCard / QPayNet merchant.charge) as a SUCCESSFUL outcome; the body contains the original transaction id
- For Stripe replays, use event.id as paymentRef
- QMaskCard enforces this via a partial unique index on (maskId, paymentRef); see Playbook B
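A client-side sketch of the rule above: a replay that comes back with `idempotent:true` is a success carrying the original id, not an error. The response fields follow the replay shape shown in Playbook B. `interpretCharge` and `paymentRefForStripeEvent` are hypothetical names, and whether the replay arrives with HTTP 409 or 2xx is treated as unknown here.

```typescript
interface ChargeResponse {
  id: string;
  status: string;
  idempotent?: boolean;
  amountCents: number;
}

type ChargeOutcome =
  | { ok: true; chargeId: string; replayed: boolean }
  | { ok: false; reason: string };

/** Maps an HTTP status and response body to a success or failure outcome. */
function interpretCharge(httpStatus: number, body: Partial<ChargeResponse>): ChargeOutcome {
  // A replay reports idempotent:true and the ORIGINAL charge id: treat it as success.
  if (body.idempotent === true && typeof body.id === "string") {
    return { ok: true, chargeId: body.id, replayed: true };
  }
  if (httpStatus >= 200 && httpStatus < 300 && typeof body.id === "string") {
    return { ok: true, chargeId: body.id, replayed: false };
  }
  return { ok: false, reason: `unexpected status ${httpStatus}` };
}

/** For Stripe-originated charges, reuse event.id so every redelivery maps to one paymentRef. */
function paymentRefForStripeEvent(event: { id: string }): string {
  return event.id;
}
```

Checking the `idempotent` flag before the status code keeps retry loops from treating a replayed charge as a failure and retrying again.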
[high] VeilNetX chain verify fails (verified:false)
Causes:
- JSONB key-order drift across pg versions
- Mid-write Postgres crash
- Restored from out-of-band backup
Fixes:
- Don't auto-correct: chain integrity is the single source of truth
- Run Playbook A (chain breakage recovery) below: diagnose, dry-run rebuild, rebuild, verify
[medium] Webhook not delivered
Causes:
- Endpoint returned non-2xx
- Endpoint timeout >10s
- DNS resolution failed
Fixes:
- Check the delivery log: GET /api/qpaynet/admin/webhook-deliveries?webhookId=<id>
- Verify your endpoint returns 200 within 10s; slow handlers must ack first and do the work asynchronously
- AEVION retries 5x with exponential backoff (immediate, 5m, 30m, 2h, 8h)
- After 5 failures the delivery moves to dead_letter; replay it manually with POST /api/qpaynet/admin/webhook-deliveries/:id/retry
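To reason about how long a receiver can be down before deliveries are dead-lettered, the schedule above can be written out as data. This is a sketch, not AEVION's implementation: it assumes each delay is measured from the previous attempt, and `delayBeforeAttempt`, `totalRetryWindowSec`, and `ackFirst` are hypothetical names.

```typescript
// Delays before each delivery attempt, as listed above: immediate, 5m, 30m, 2h, 8h.
const RETRY_DELAYS_SEC: readonly number[] = [0, 5 * 60, 30 * 60, 2 * 3600, 8 * 3600];

/** Delay before attempt `n` (1-based), or null once the delivery is dead-lettered. */
function delayBeforeAttempt(n: number): number | null {
  if (!Number.isInteger(n) || n < 1) throw new RangeError("attempt numbers start at 1");
  return RETRY_DELAYS_SEC[n - 1] ?? null;
}

/** Seconds between the first and the last attempt, assuming delays are cumulative. */
function totalRetryWindowSec(): number {
  return RETRY_DELAYS_SEC.reduce((sum, delay) => sum + delay, 0);
}

/** Ack-first pattern: hand the raw body to a queue and answer 200 right away. */
function ackFirst(rawBody: string, queue: string[]): number {
  queue.push(rawBody); // slow processing happens later, outside the request path
  return 200;
}
```

Under the cumulative assumption the last attempt happens 10 h 35 min after the first, so an outage longer than that needs the manual retry endpoint.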
[medium] Webhook signature mismatch
Causes:
- Clock drift > 5 min between you and AEVION
- HMAC computed over the parsed body instead of the raw bytes
- Timestamp missing from the payload prefix
- Secret rotated, but your env still holds the old one
Fixes:
- Confirm `Math.abs(now-ts) < 300` (seconds) first; if it fails, sync your server's clock via NTP
- Compute over `${timestamp}.${rawBodyBytes}`, NEVER over JSON.stringify(parsed)
- During rotation, accept BOTH new + old secrets; see Playbook D
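The checks above can be sketched as a single verifier. The signed message `${timestamp}.${rawBody}` and the 300-second window come from the text; HMAC-SHA256 with a hex-encoded signature is an assumption, and `sign` and `verifySignature` are hypothetical names, so adjust them to the real header format.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SEC = 300;

/** HMAC over "<timestamp>." followed by the raw body bytes. Assumed: SHA-256, hex output. */
function sign(secret: string, timestamp: number, rawBody: Buffer): string {
  return createHmac("sha256", secret).update(`${timestamp}.`).update(rawBody).digest("hex");
}

/** Checks the clock window first, then compares signatures in constant time. */
function verifySignature(
  secret: string,
  timestamp: number,
  rawBody: Buffer,
  signatureHex: string,
  nowSec: number,
): "ok" | "stale" | "mismatch" {
  if (Math.abs(nowSec - timestamp) >= TOLERANCE_SEC) return "stale";
  const expected = Buffer.from(sign(secret, timestamp, rawBody), "hex");
  const given = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on unequal lengths, so guard first.
  if (given.length !== expected.length) return "mismatch";
  return timingSafeEqual(given, expected) ? "ok" : "mismatch";
}
```

Passing the body as a `Buffer` makes it harder to sign a re-serialized object by accident: `JSON.stringify(JSON.parse(raw))` can drop whitespace or reorder nothing at all and still produce different bytes.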
[low] 429 Rate limited
Causes:
- Burst on money endpoints (/transfer, /merchant/charge)
- Public reads burst on /stats
Fixes:
- Read the Retry-After header and wait that many seconds
- Implement exponential backoff in your client (Stripe-style: 2s, 4s, 8s, 16s, 32s)
- Contact support@aevion.app for partner-tier higher limits
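A minimal sketch of the two waiting rules above combined: honor `Retry-After` when the server sends it, otherwise fall back to the 2s to 32s schedule. `backoffSec` and `waitSec` are hypothetical names. Only the delta-seconds form of `Retry-After` is handled; an HTTP-date value falls through to local backoff.

```typescript
/** Local backoff for retry number `attempt` (1-based): 2, 4, 8, 16, 32 seconds, then capped. */
function backoffSec(attempt: number): number {
  const step = Math.min(Math.max(Math.floor(attempt), 1), 5);
  return 2 ** step;
}

/** Prefers the server's Retry-After (whole seconds) over the local schedule. */
function waitSec(attempt: number, retryAfter: string | null): number {
  const trimmed = retryAfter?.trim() ?? "";
  if (/^\d+$/.test(trimmed)) return Number(trimmed);
  return backoffSec(attempt);
}
```

For example, `waitSec(1, "7")` waits 7 seconds because the server said so, while `waitSec(3, null)` falls back to 8 seconds.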
[high] QGood donation not anchored to VeilNetX
Causes:
- VeilNetX service down at emit time
- Cross-product ecosystemEvents disabled
- Network blip between ledger and qgood
Fixes:
- Check /api/veilnetx-ledger/health, which should return ok
- Look for `[ecosystemEvents] veilnetx emit failed` in server logs
- Donations still persist on the QGood side; anchoring is fire-and-forget by design
- Re-anchor missing entries via Playbook C
[medium] Z-Tide stats.total_weight != SUM(leaderboard.score)
Causes:
- Manual SQL write bypassed the event log
- Migration script touched ZTideScore without ZTideEvent
- Cache staleness
Fixes:
- Run prod-smoke #25 first to confirm the drift
- Run Playbook E (Z-Tide reconciliation) below
[medium] QChainGov proposal won't execute
Causes:
- Quorum not reached
- Tied vote (50/50 in yes-no mode)
- Admin has not approved
Fixes:
- Check proposal.status, which must be `passed` (not just `closed`)
- Run POST /api/qchaingov/proposals/:id/tally to refresh counts
- Admin must hit POST /api/qchaingov/proposals/:id/execute to fire side-effects
Recovery playbooks
Step-by-step operational runbooks for the five most consequential incidents. Within a playbook, run the steps in order and do not skip the verification step where one is given.
Playbook A: VeilNetX chain breakage recovery
When: GET /chain/verify returns verified:false and you've ruled out client tampering. Most common cause: JSONB canonicalization drift after a pg minor upgrade, or a partially applied DB restore.
- 1. Diagnose: find the exact entry where the chain broke
chain-doctor walks the chain, recomputes each link's hash, and points at the first mismatch with full context.
```bash
npm run veilnetx:doctor
# or directly:
node aevion-globus-backend/scripts/veilnetx-chain-doctor.js
# Output:
# chain length: 6
# walking entries…
# ✗ break at index 1: stored hash a5d6571627f4… expected ce8f01abff32…
# entry id: f3e2-… kind: deposit ts: 2026-05-12T09:41:00Z
# suspect: canonical-JSON key order (meta has {b, a} vs {a, b})
```
- 2. Dry-run rebuild: confirm what would change without writing
Recomputes every hash with the canonical (sorted-keys) serializer. Prints a diff: old hash → new hash per entry. The database is NOT touched.
```bash
npm run veilnetx:rebuild-dry
# or:
node aevion-globus-backend/scripts/rebuild-veilnetx-chain.js --dry-run
# Inspect the diff. If only canonicalization is changing, proceed.
# If amounts or payloads changed, STOP: that indicates tampering, not drift.
```
- 3. Take a Postgres backup before mutating
The rebuild script is idempotent, but always snapshot first.
```bash
# The inner double quotes preserve the mixed-case table name; without them pg_dump folds it to lowercase.
pg_dump "$DATABASE_URL" --table='"VeilNetXEntry"' --data-only -f "veilnetx-backup-$(date +%Y%m%d-%H%M%S).sql"
```
- 4. Apply the rebuild
Runs the same logic as --dry-run but writes the recomputed hash chain back. The new chain is immediately verifiable.
```bash
npm run veilnetx:rebuild
# Then verify:
curl -s $BASE/api/veilnetx-ledger/chain/verify | jq .verified
# expected: true
```
- 5. Record the incident
Open a GitHub issue with the doctor output, rebuild diff, and root-cause hypothesis. Tag `veilnetx-integrity`. Even when the fix is clean, the audit trail matters for future drift investigations.
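To see why key order alone can break verification, here is a self-contained sketch of sorted-key serialization and a hash-chain walk. It is not the VeilNetX implementation: the link formula (SHA-256 over the previous hash followed by the canonical payload) and the names `canonicalize`, `linkHash`, and `firstBreak` are assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

/** Serializes with object keys sorted at every level, so key order cannot change the output. */
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  const entries = Object.entries(value).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  return `{${entries.map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`).join(",")}}`;
}

interface Entry {
  payload: Json;
  hash: string;
}

/** Assumed link formula: SHA-256(previous hash + canonical payload), hex-encoded. */
function linkHash(prevHash: string, payload: Json): string {
  return createHash("sha256").update(prevHash).update(canonicalize(payload)).digest("hex");
}

/** Returns the index of the first entry whose stored hash does not match, or -1 if all match. */
function firstBreak(entries: Entry[], genesis = ""): number {
  let prev = genesis;
  return entries.findIndex((entry) => {
    const expected = linkHash(prev, entry.payload);
    prev = entry.hash; // continue from the stored hash so later links are judged on their own
    return entry.hash !== expected;
  });
}
```

With this serializer `{b, a}` and `{a, b}` hash identically, whereas hashing plain `JSON.stringify` output would make the same data produce two different links.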
Playbook B: QMaskCard suspected double-charge
When: A customer reports two debits for one purchase, or your reconciliation shows two QMaskCardCharge rows for one paymentRef.
- 1. Look up by paymentRef: there should be exactly one row
QMaskCard has a partial unique index on (maskId, paymentRef) WHERE status='authorized'. Two authorized rows for the same pair are impossible at the DB layer.
```sql
SELECT id, "maskId", "paymentRef", "amountCents", status, "createdAt"
FROM "QMaskCardCharge"
WHERE "paymentRef" = $1
ORDER BY "createdAt";
```
- 2. If only one row exists: the replay returned idempotent:true (expected)
Confirm the second customer-side debit was your client retrying. A replay returns `idempotent:true` and the original charge id, NOT a new charge. Expected response shape from a POST /api/qmaskcard/charges replay:
```json
{ "id": "<original-uuid>", "status": "authorized", "idempotent": true, "amountCents": 500 }
```
- 3. If two rows exist: the partial index is missing
This should not happen post-2026-05-12 (fix commit 933c797c). If you see it, the index was never deployed.
```sql
-- Verify the index exists. Match on the definition: the index name does not contain "paymentRef".
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'QMaskCardCharge'
  AND indexdef LIKE '%paymentRef%';
-- Expected:
-- qmaskcard_charge_idempotency_idx
-- CREATE UNIQUE INDEX … ON "QMaskCardCharge" ("maskId", "paymentRef") WHERE status = 'authorized'
```
- 4. Audit historical doubles, then recreate the index if missing
```sql
-- Find historical doubles first: a unique index cannot be built while duplicates exist.
SELECT "maskId", "paymentRef", COUNT(*) AS n, SUM("amountCents") AS total
FROM "QMaskCardCharge"
WHERE status = 'authorized'
GROUP BY 1, 2
HAVING COUNT(*) > 1;
-- Once they are resolved, recreate the index (run outside a transaction block):
CREATE UNIQUE INDEX CONCURRENTLY qmaskcard_charge_idempotency_idx
  ON "QMaskCardCharge" ("maskId", "paymentRef")
  WHERE status = 'authorized';
```
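When reconciliation runs against an exported list of charges rather than the database, the historical-doubles query in step 4 can be mirrored in application code. This is a sketch with a hypothetical `findDoubles` helper; the row fields follow the columns used in the queries above.

```typescript
interface ChargeRow {
  id: string;
  maskId: string;
  paymentRef: string;
  amountCents: number;
  status: string;
}

interface DoubleCharge {
  maskId: string;
  paymentRef: string;
  n: number;
  totalCents: number;
  ids: string[];
}

/** Groups authorized charges by (maskId, paymentRef) and keeps groups with more than one row. */
function findDoubles(rows: ChargeRow[]): DoubleCharge[] {
  const groups = new Map<string, DoubleCharge>();
  for (const row of rows) {
    if (row.status !== "authorized") continue; // the index only covers authorized rows
    const key = JSON.stringify([row.maskId, row.paymentRef]);
    const group =
      groups.get(key) ??
      { maskId: row.maskId, paymentRef: row.paymentRef, n: 0, totalCents: 0, ids: [] };
    group.n += 1;
    group.totalCents += row.amountCents;
    group.ids.push(row.id);
    groups.set(key, group);
  }
  return Array.from(groups.values()).filter((group) => group.n > 1);
}
```

The `ids` list is an addition over the SQL version: it names the rows to inspect when deciding which duplicate to refund or void.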
Playbook C: Re-anchor missed events to VeilNetX
When: A QGood donation, QPayNet transfer, or Z-Tide promotion succeeded in its own module but never landed in VeilNetX (visible as a gap between the module count and the ledger count for that kind).
- 1. Find the gap: count divergence
```sql
-- Example: QGood donations vs VeilNetX-anchored donations
SELECT
  (SELECT COUNT(*) FROM "QGoodDonation") AS qgood_count,
  (SELECT COUNT(*) FROM "VeilNetXEntry" WHERE kind = 'qgood:donation') AS anchored_count;
```
- 2. List the un-anchored ones
```sql
SELECT d.id, d."campaignId", d."amountCents", d."createdAt"
FROM "QGoodDonation" d
LEFT JOIN "VeilNetXEntry" v
  ON v."sourceId" = d.id::text AND v.kind = 'qgood:donation'
WHERE v.id IS NULL
ORDER BY d."createdAt" DESC
LIMIT 100;
```
- 3. Re-emit through the canonical event lib (NOT raw insert)
A direct INSERT into VeilNetXEntry bypasses canonicalization and breaks the chain. Always call emitVeilNetX() so that canonical-JSON serialization and sequential hashing apply.
```ts
// scripts/reanchor-qgood.ts
import { emitVeilNetX } from "../src/lib/ecosystemEvents";

// `missing` holds the un-anchored donations from step 2; the rows must also
// carry currency and donor_email_hash, which the loop below reads.
for (const d of missing) {
  await emitVeilNetX({
    kind: "qgood:donation",
    sourceId: d.id,
    actorEmail: d.donor_email_hash ?? null,
    amountCents: d.amountCents,
    currency: d.currency,
    meta: { campaignId: d.campaignId },
  });
}
```
- 4. Verify the chain is still intact
```bash
curl -s $BASE/api/veilnetx-ledger/chain/verify | jq .verified   # expect true
npm run veilnetx:stats   # confirm count matches qgood total
```
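Before emitting anything, it can help to compute the exact list of events a re-anchor run would send and review it. The sketch below is a hypothetical planning helper (`planReanchor`), not part of the ecosystemEvents library. It skips donations that already have a ledger entry and orders the rest oldest first, which is a design choice made here so that ledger order follows donation order.

```typescript
interface Donation {
  id: string;
  campaignId: string;
  amountCents: number;
  currency: string;
  donor_email_hash?: string | null;
  createdAt: string; // ISO 8601 timestamp
}

interface DonationEvent {
  kind: "qgood:donation";
  sourceId: string;
  actorEmail: string | null;
  amountCents: number;
  currency: string;
  meta: { campaignId: string };
}

/** Builds the events to re-emit: un-anchored donations only, oldest first. */
function planReanchor(donations: Donation[], anchoredSourceIds: ReadonlySet<string>): DonationEvent[] {
  return donations
    .filter((d) => !anchoredSourceIds.has(d.id))
    .sort((a, b) => Date.parse(a.createdAt) - Date.parse(b.createdAt))
    .map((d) => ({
      kind: "qgood:donation" as const,
      sourceId: d.id,
      actorEmail: d.donor_email_hash ?? null,
      amountCents: d.amountCents,
      currency: d.currency,
      meta: { campaignId: d.campaignId },
    }));
}
```

Each planned event has the same fields the step 3 loop passes to emitVeilNetX(), so the plan can be logged for review and then fed to that loop one event at a time.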
Playbook D: Webhook secret rotation, and rollback if the new secret is bad
When: You are rotating an endpoint secret without dropping in-flight deliveries. Or: you rotated, the receiver is now rejecting signatures, and you need to revert.
- 1. Pre-stage the receiver with two secrets (new + old)
Deploy receiver code that accepts both. AEVION's verifier supports this via previousSecrets[]; see /developers/fintech/webhooks#6 secret rotation. Deploy this FIRST, before changing secrets on AEVION's side.
- 2. Change the secret on AEVION
```http
PATCH /api/qpaynet/me/webhook
Authorization: Bearer <jwt>

{ "secret": "<new-32-byte-hex>" }
# AEVION starts signing with NEW immediately; OLD is no longer used outbound.
```
- 3. Watch receiver logs for 24h
If you see `secretIndex > 0` in your verify logs, a delivery still used the old secret: either an in-flight retry, or AEVION hasn't fully propagated the change. Wait another retry window before continuing.
- 4. ROLLBACK: if the new secret is bad (e.g. typo, leaked)
Switch BACK to the old secret on AEVION (still valid in your receiver thanks to step 1). Then mint a fresh new one and restart the rotation.
```http
# Revert outbound secret to the prior value:
PATCH /api/qpaynet/me/webhook
Authorization: Bearer <jwt>

{ "secret": "<the-old-value-still-in-AEVION_WEBHOOK_SECRET_OLD>" }
# Receiver still verifies because both are in previousSecrets.
```
- 5. Cleanup: drop the old secret from the receiver
After 48h with no `secretIndex > 0` hits, remove the old secret from the receiver env and redeploy. Rotation complete.
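A receiver that accepts both secrets can be sketched as a lookup that reports which secret matched, which is what produces the `secretIndex` value watched in step 3. This is not AEVION's verifier: HMAC-SHA256 with a hex signature is an assumption, and `matchSecret` and `safeToDropOld` are hypothetical names.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

/**
 * Returns the index of the secret that produced the signature
 * (0 = current secret, 1+ = a previous secret), or -1 when none match.
 */
function matchSecret(
  secrets: readonly string[],
  timestamp: number,
  rawBody: string | Buffer,
  signatureHex: string,
): number {
  const given = Buffer.from(signatureHex, "hex");
  return secrets.findIndex((secret) => {
    const expected = createHmac("sha256", secret).update(`${timestamp}.`).update(rawBody).digest();
    return expected.length === given.length && timingSafeEqual(expected, given);
  });
}

const QUIET_PERIOD_MS = 48 * 3600 * 1000;

/** True once 48h have passed since both the rotation and the last old-secret hit. */
function safeToDropOld(rotatedAtMs: number, lastOldHitMs: number | null, nowMs: number): boolean {
  const quietSince = Math.max(rotatedAtMs, lastOldHitMs ?? rotatedAtMs);
  return nowMs - quietSince >= QUIET_PERIOD_MS;
}
```

Listing the new secret first keeps index 0 as the steady state, so any log line with a higher index stands out during the 48-hour watch.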
Playbook E: Z-Tide score drift reconciliation
When: Smoke #25 reports `Z-Tide aggregate drift sum=X vs total_weight=Y`. This means ZTideScore rows no longer sum to the recorded event total.
- 1. Confirm drift direction: events vs scores
```sql
SELECT
  (SELECT COALESCE(SUM(weight), 0) FROM "ZTideEvent") AS event_sum,
  (SELECT COALESCE(SUM(score), 0) FROM "ZTideScore") AS score_sum;
```
- 2. Find users whose score doesn't match their event sum
```sql
SELECT s."userId", s.score AS score_now, COALESCE(SUM(e.weight), 0) AS expected
FROM "ZTideScore" s
LEFT JOIN "ZTideEvent" e ON e."userId" = s."userId"
GROUP BY s."userId", s.score
HAVING s.score != COALESCE(SUM(e.weight), 0)
ORDER BY ABS(s.score - COALESCE(SUM(e.weight), 0)) DESC;
```
- 3. Rebuild scores from events (single atomic update)
```sql
BEGIN;
-- A correlated subquery also resets users who have a score row but no events;
-- UPDATE ... FROM (subquery) would silently skip them.
UPDATE "ZTideScore" s
SET score = COALESCE(
      (SELECT SUM(e.weight) FROM "ZTideEvent" e WHERE e."userId" = s."userId"), 0),
    "updatedAt" = NOW();
-- Verify before commit:
SELECT SUM(score) FROM "ZTideScore";
-- Should now equal SUM(weight) from ZTideEvent.
COMMIT;
```
- 4. Re-run prod-smoke #25 to confirm reconciliation
```bash
BASE=$YOUR_BASE node aevion-globus-backend/scripts/fintech-prod-smoke.js | grep "Z-Tide:"
```
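The per-user comparison in step 2 can also be run over exported rows, for example in a test or a monitoring job. This sketch uses a hypothetical `findDrift` helper; a user with a score row but no events is expected to have a score of 0, matching the LEFT JOIN in the query.

```typescript
interface ScoreRow {
  userId: string;
  score: number;
}

interface EventRow {
  userId: string;
  weight: number;
}

interface Drift {
  userId: string;
  scoreNow: number;
  expected: number;
}

/** Lists users whose stored score differs from the sum of their event weights, largest gap first. */
function findDrift(scores: ScoreRow[], events: EventRow[]): Drift[] {
  const expectedByUser = new Map<string, number>();
  for (const event of events) {
    expectedByUser.set(event.userId, (expectedByUser.get(event.userId) ?? 0) + event.weight);
  }
  return scores
    .map((s) => ({ userId: s.userId, scoreNow: s.score, expected: expectedByUser.get(s.userId) ?? 0 }))
    .filter((d) => d.scoreNow !== d.expected)
    .sort((a, b) => Math.abs(b.scoreNow - b.expected) - Math.abs(a.scoreNow - a.expected));
}
```

Like the SQL, this only inspects users that have a ZTideScore row; users with events but no score row would need a separate check.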
Still stuck? File a report tagged `fintech-bug`. Include: timestamp (UTC), affected endpoint, request body (redact secrets), response body, and which playbook step you reached.