Track quality regressions across prompt and agent configuration changes.
Run a fixed set of test cases through your multi-agent pipeline.
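
One way to picture this is a small harness that scores the same cases against two versions of the pipeline and reports the cases whose score dropped. The sketch below is illustrative only: `EvalCase`, `score_cases`, `find_regressions`, the exact-match scoring rule, and the two stand-in pipelines are all hypothetical names and assumptions, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    """One fixed test case (hypothetical structure)."""
    case_id: str
    prompt: str
    expected: str


def score_cases(pipeline: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Run every case through the pipeline; score 1.0 on exact match, else 0.0.

    Exact match is an assumption made for brevity; a real setup might use
    a rubric or a grader model instead.
    """
    return {
        case.case_id: 1.0 if pipeline(case.prompt).strip() == case.expected else 0.0
        for case in cases
    }


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the ids of cases whose score fell by more than `tolerance`."""
    return sorted(
        case_id
        for case_id, base_score in baseline.items()
        if candidate.get(case_id, 0.0) < base_score - tolerance
    )


# The fixed test set stays the same between runs; only the pipeline changes.
CASES = [
    EvalCase("add", "2+2", "4"),
    EvalCase("capital", "capital of France", "Paris"),
]


def pipeline_v1(prompt: str) -> str:
    """Stand-in for the pipeline before a prompt or config change."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def pipeline_v2(prompt: str) -> str:
    """Stand-in for the pipeline after the change; one answer is now wrong."""
    return {"2+2": "4", "capital of France": "Lyon"}.get(prompt, "")


baseline = score_cases(pipeline_v1, CASES)
candidate = score_cases(pipeline_v2, CASES)
regressions = find_regressions(baseline, candidate)
print(regressions)  # ['capital']
```

Keeping the case list fixed is what makes the two score sets comparable: any difference can be attributed to the prompt or config change rather than to a different input set.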