
Nightly tests + automated Claude review

This document covers the automated nightly suite that runs at 04:00 daily on the developer workstation, plus the optional Claude-driven review pass that fires only when something fails.

What gets tested

The runner is test/nightly/Run-NightlyAndReview.ps1. It wraps the existing Run-NightlyLocal.ps1 and adds a post-test review step. Phases:

| Phase | What it does |
|---|---|
| 1 | PowerShell unit tests (Pester) |
| 1b | Verify deleted-function references don't sneak back in |
| 2 | Backend Vitest tests (app/api/) |
| 3 | Frontend Vitest tests (app/ui/) |
| 4a-c | Provision a fresh Docker stack, wait for migrations, verify schema |
| 4d-e | Queue a demo crawler job, verify the data lands |
| 4f | Smoke-test all read endpoints |
| 4g | Entra ID crawler scenarios (Validate-Only, Identity-Only, Users-Groups, Full-Sync, With-Identity-Filter). Skipped when test/test.secrets.json is missing. |
| 4h | LLM / secrets / risk-profile substrate smoke test |
| 5 | Playwright E2E browser tests |
| 6 | API documentation completeness check |
| Review | (Only on failure) Investigate, fix, re-run |

Deep assertions added April 2026

The Full-Sync scenario now does more than count rows. After each Entra crawler run completes successfully, it runs:

  • Assert-MatrixWorks — verifies /api/permissions?userLimit=25 returns rows with the right shape, /api/access-package-groups is reachable, and /api/groups-with-nested returns the expected envelope. This catches the "matrix loads but is empty" class of bug.
  • Assert-BusinessRolesWork — verifies the Business Roles list returns rows with non-zero totalAssignments. This was the April 2026 regression where the route returned rows but with all-zero counts because the SQL filter used lowercase 'delivered' while the column stores 'Delivered'.
  • Assert-SyncLogShape — verifies the sync log has entries, every entry has a numeric DurationSeconds, and (only after a real Entra Full-Sync) there's an EntraID-FullCrawl row written by the crawler script at end-of-run.
  • Assert-PostSyncEndpoints — pings every route that was broken, or still contained leftover T-SQL, after the Postgres rewrite (governance/summary, governance/categories, governance/review-compliance, admin/llm/status, admin/llm/config, admin/history-retention, risk-profiles, risk-classifiers, risk-scoring/runs).
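The Business Roles regression described above comes down to a case-sensitive string comparison. The Python sketch below is not the project's PowerShell helper; the row shape, the `state` field name, and both function names are assumptions made for illustration. It reproduces the bug on in-memory data and shows the kind of check Assert-BusinessRolesWork is described as performing.

```python
from typing import Dict, List

# Hypothetical assignment rows; the real column names may differ.
ASSIGNMENTS: List[Dict[str, str]] = [
    {"role": "Finance Approver", "state": "Delivered"},
    {"role": "Finance Approver", "state": "Delivered"},
    {"role": "HR Viewer", "state": "Delivered"},
]


def count_assignments(rows: List[Dict[str, str]], state: str) -> Dict[str, int]:
    """Count assignments per role using an exact, case-sensitive match on state."""
    totals = {role: 0 for role in sorted({row["role"] for row in rows})}
    for row in rows:
        if row["state"] == state:
            totals[row["role"]] += 1
    return totals


def assert_business_roles_work(totals: Dict[str, int]) -> None:
    """Fail when the list is empty or every role reports zero assignments."""
    if not totals:
        raise AssertionError("Business Roles list returned no rows")
    if all(count == 0 for count in totals.values()):
        raise AssertionError("every role has totalAssignments == 0")


# Lowercase filter: rows come back, but every count is zero (the regression).
buggy = count_assignments(ASSIGNMENTS, "delivered")
# Filter that matches the stored value.
fixed = count_assignments(ASSIGNMENTS, "Delivered")
assert_business_roles_work(fixed)
```

A row-count check alone would pass on `buggy`, which is why the assertion looks at the counts rather than at the number of rows.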

The substrate phase (4h) runs Test-LLMSubstrate.ps1. It validates the LLM config endpoint and the secrets vault round-trip, and checks that the scoring run endpoint returns 412 (Precondition Failed) rather than 500 when no classifier is active.
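The 412-versus-500 distinction matters because 412 tells the caller that the request was understood but a prerequisite is missing, while 500 signals an unhandled server error. A minimal Python sketch of that classification, assuming the status code has already been fetched (the function name is hypothetical and this is not the code in Test-LLMSubstrate.ps1):

```python
from http import HTTPStatus


def check_scoring_without_classifier(status_code: int) -> str:
    """Classify the scoring-run response when no classifier is active."""
    if status_code == HTTPStatus.PRECONDITION_FAILED:  # 412
        return "ok"
    if status_code >= 500:
        return "fail: server error instead of a precondition response"
    return f"fail: unexpected status {status_code}"
```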

Scheduling

# Register the wrapper at 04:00 daily (default)
pwsh -File test\nightly\Register-ReviewSchedule.ps1

# Pick a different time
pwsh -File test\nightly\Register-ReviewSchedule.ps1 -Time '03:30'

# Also remove the old standalone test task — recommended, since the wrapper
# already runs the nightly tests.
pwsh -File test\nightly\Register-ReviewSchedule.ps1 -RemoveOldNightlyTask

# Remove the schedule
pwsh -File test\nightly\Register-ReviewSchedule.ps1 -Unregister

The task runs as the current user with S4U logon — no password prompt, runs whether or not the user is signed in. It does not wake the workstation, because Docker on Windows doesn't always cope with cold-start under power management. Make sure the box stays awake (or wake it via BIOS scheduling if you need to).

Logs land in test/nightly/results/<yyyy-MM-dd_HHmm>/. A one-line summary per run is appended to test/nightly/results/_rolling-summary.log so you can tail it to see the last week of pass/fail status.
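For readers who want to script against this layout, here is a small Python sketch of the folder naming and of reading the tail of the rolling summary. The function names are hypothetical, and the sketch makes no assumption about the format of a summary line beyond "one line per run".

```python
from collections import deque
from datetime import datetime
from pathlib import Path
from typing import List


def run_folder(results_root: Path, started: datetime) -> Path:
    """Return the per-run folder, named <yyyy-MM-dd_HHmm>."""
    return results_root / started.strftime("%Y-%m-%d_%H%M")


def last_runs(summary_log: Path, count: int = 7) -> List[str]:
    """Return the last `count` non-empty lines of the rolling summary."""
    with summary_log.open(encoding="utf-8") as handle:
        lines = (line.rstrip("\n") for line in handle)
        # deque with maxlen keeps only the most recent `count` lines.
        return list(deque((line for line in lines if line.strip()), maxlen=count))
```

With one run per night, `last_runs(path, 7)` corresponds to the "last week" view mentioned above.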

The review pass

When the test suite has zero failures, the review pass is a no-op — it writes a single line to the rolling log and exits. No LLM tokens are spent. This is the design: pay only when there's something to fix.

When there are failures, the wrapper builds a structured prompt with:

  • The list of failed test names and their detail strings
  • The current branch, the HEAD commit, and the output of git log -1 --name-status for the last commit
  • Paths to all log files in the run folder
  • The constraint block (what Claude is and isn't allowed to do)
  • The token budget
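The five ingredients listed above could be assembled roughly as follows. This is a Python sketch only: the real template lives in test/nightly/claude-review-prompt.md, and the section labels, the `Failure` type, and the function signature here are assumptions, not the wrapper's actual format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    name: str    # failed test name
    detail: str  # detail string reported by the test


def build_review_prompt(
    failures: List[Failure],
    branch: str,
    head: str,
    last_commit: str,      # output of: git log -1 --name-status
    log_paths: List[str],
    constraints: str,      # the constraint block, passed through verbatim
    max_tokens: int,
) -> str:
    """Join the prompt ingredients into one plain-text prompt."""
    sections = [
        "Failed tests:",
        *[f"- {f.name}: {f.detail}" for f in failures],
        "Repository state:",
        f"branch: {branch}",
        f"HEAD: {head}",
        last_commit,
        "Log files:",
        *[f"- {path}" for path in log_paths],
        "Constraints:",
        constraints,
        f"Token budget: {max_tokens}",
    ]
    return "\n".join(sections)
```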

Then it picks one of three execution paths in priority order:

Path A — Claude Code in headless fix-it mode (preferred)

If the claude CLI is on PATH (or at $ClaudeCli), the wrapper invokes:

claude -p "<prompt>" --dangerously-skip-permissions --add-dir <repo>

Claude has read, edit, and run permission on the repo: it can rebuild containers, re-run individual tests, and commit fixes on a fresh nightly-review/<date> branch. The prompt forbids pushing (see Safety constraints), so the morning operator reviews the branch and decides.

After Claude finishes, the wrapper re-runs the nightly suite once and uses that exit code as its own. So a successful auto-fix run looks like:

04:00  Run-NightlyAndReview.ps1 starts
04:01  Phase 1-4 run, 1 failure detected in Phase 4f
04:30  Claude invoked, identifies the issue, edits a file, rebuilds web
04:35  Claude commits to nightly-review/2026-04-09 and exits
04:35  Wrapper re-runs the nightly suite
05:05  Re-run completes with 0 failures
05:05  Wrapper writes "FIXED" to the rolling log and exits 0
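The outcome logic implied by this timeline can be summarized in a few lines. In the Python sketch below, the status words PASS, FIXED, and FAIL are the ones written to the rolling log; the function name and the use of exit code 1 for failure are assumptions (the source only says the wrapper reuses the re-run's exit code).

```python
from typing import Optional, Tuple


def classify_run(first_failures: int, rerun_failures: Optional[int]) -> Tuple[str, int]:
    """Return (rolling-log status, wrapper exit code).

    rerun_failures is None when no re-run happened (for example, no fix attempt).
    """
    if first_failures == 0:
        return "PASS", 0       # review pass is a no-op
    if rerun_failures == 0:
        return "FIXED", 0      # the single re-run came back clean
    return "FAIL", 1           # assumed non-zero exit; the real value may differ
```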

Path B — Anthropic API analysis only

If claude isn't installed but ANTHROPIC_API_KEY is set (or test/test.secrets.json has an AnthropicApiKey field), the wrapper makes one API call and writes Claude's analysis to review-analysis.md in the run folder. No fix attempt, no re-run. Token usage is bounded by -MaxTokensPerReview (default 4096 → roughly $0.05 per call).

Path C — No LLM available

If neither path is configured, the wrapper writes the full prompt to claude-prompt.txt in the run folder so you can paste it into Claude Code manually in the morning.
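The priority order across the three paths can be sketched as a single decision function. This is illustrative Python, not the wrapper's PowerShell. The parameter names are hypothetical, and the treatment of `no_fix` with no API key (falling through to Path C) is an assumption.

```python
import shutil
from typing import Callable, Mapping, Optional


def choose_review_path(
    env: Mapping[str, str],
    secrets: Mapping[str, str],
    claude_cli: Optional[str] = None,   # explicit CLI path, like $ClaudeCli
    no_fix: bool = False,               # mirrors the -NoFix switch
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "A", "B", or "C" following the documented priority order."""
    cli = claude_cli or which("claude")
    if cli and not no_fix:
        return "A"  # headless fix-it mode
    if env.get("ANTHROPIC_API_KEY") or secrets.get("AnthropicApiKey"):
        return "B"  # one API call, analysis only
    return "C"      # write claude-prompt.txt for manual use
```

Passing `which` as a parameter keeps the function testable on machines where the CLI is or isn't installed; calling `choose_review_path(os.environ, {})` after `import os` would use the real environment.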

Cost shape

| Outcome | LLM tokens | Cost (rough) |
|---|---|---|
| All tests pass | 0 | $0 |
| Failure, Path A | ~5k-30k | $0.10-$2.00 |
| Failure, Path B | ~2k-4k | $0.02-$0.08 |

If the suite has been green for a week, the review system has cost you nothing. Cost only happens when there's actually something to investigate.
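As a worked example of that arithmetic, the Python sketch below turns the rough per-failure ranges from the table into a range for a given number of failing nights. The figures are the table's estimates, not measured prices, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# Rough per-failure cost ranges from the table above, in US cents.
COST_CENTS: Dict[str, Tuple[int, int]] = {"A": (10, 200), "B": (2, 8)}


def review_cost_range(failing_nights: int, path: str) -> Tuple[float, float]:
    """Return the (low, high) LLM cost in dollars for the given failing nights."""
    low, high = COST_CENTS[path]
    return failing_nights * low / 100, failing_nights * high / 100
```

Three failing nights in a month on Path A therefore land between $0.30 and $6.00; a fully green month costs nothing on either path.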

Safety constraints

The prompt template at test/nightly/claude-review-prompt.md explicitly forbids:

  • git push (anywhere, ever)
  • git reset --hard, git clean -f, git branch -D
  • docker compose down -v
  • Dropping database tables, deleting database rows
  • --no-verify, --no-gpg-sign, or any flag that bypasses commit hooks
  • Modifying CI/CD pipeline files

Claude is told to commit fixes on a fresh nightly-review/<date> branch and stop. The morning operator decides whether to merge.
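These rules are enforced by the prompt, so an operator who wants an independent check could scan a command transcript for the forbidden patterns. The Python sketch below is a hypothetical audit aid, not something the wrapper is documented to do. It covers only the command-shaped items in the list and assumes the commands appear one per line.

```python
import re
from typing import Iterable, List

# Patterns mirroring the command-shaped entries in the forbidden list above.
# Matching is case-sensitive on purpose: "git branch -d" is not forbidden.
FORBIDDEN = [
    r"\bgit\s+push\b",
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+clean\s+-f",
    r"\bgit\s+branch\s+-D\b",
    r"\bdocker\s+compose\s+down\s+-v\b",
    r"--no-verify\b",
    r"--no-gpg-sign\b",
]
_COMPILED = [re.compile(pattern) for pattern in FORBIDDEN]


def find_violations(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain any forbidden command or flag."""
    return [line for line in lines if any(rx.search(line) for rx in _COMPILED)]
```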

Testing the wrapper without scheduling

Run it on demand:

# Full thing — runs the nightly suite, reviews failures, re-runs
pwsh -File test\nightly\Run-NightlyAndReview.ps1

# Skip the fix-it Claude invocation. Useful before you trust the system.
# Will still run the nightly tests + (if there's an API key) produce a
# read-only analysis markdown.
pwsh -File test\nightly\Run-NightlyAndReview.ps1 -NoFix

# Check that the assertions wired up correctly without a full nightly run
pwsh -File test\nightly\Test-LLMSubstrate.ps1
pwsh -File test\nightly\dry-run-assertions.ps1

The dry-run script loads only the new Assert-* helpers from Test-EntraIdCrawler.ps1 and runs them against whatever data is currently in the local stack. It needs the demo dataset (or a real crawler) loaded first; queue a demo job from the UI or via:

curl -X POST -H "Content-Type: application/json" \
  -d '{"jobType":"demo"}' http://localhost:3001/api/admin/crawler-jobs
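If curl is not available on the workstation, the same request can be made from Python's standard library. The sketch below mirrors the curl command above: the URL and JSON body are taken from it, while the function names and the timeout are assumptions. It does not handle any authentication the endpoint may require.

```python
import json
import urllib.request

JOBS_URL = "http://localhost:3001/api/admin/crawler-jobs"  # from the curl command


def build_demo_job_request(url: str = JOBS_URL) -> urllib.request.Request:
    """Build the POST request that queues a demo crawler job."""
    body = json.dumps({"jobType": "demo"}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_demo_job(url: str = JOBS_URL, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_demo_job_request(url), timeout=timeout) as resp:
        return resp.status
```

Splitting request construction from sending keeps the first function usable without a running stack.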

Where to look in the morning

test/nightly/results/_rolling-summary.log    ← one line per nightly run
test/nightly/results/<date>/
  ├── results.json                            ← machine-readable test results
  ├── nightly-output.log                      ← full nightly stdout
  ├── review.log                              ← wrapper's own log
  ├── review-analysis.md                      ← Claude's analysis (Path A or B)
  └── claude-prompt.txt                       ← the prompt, saved for manual use (Path C)

If the rolling log says FIXED, look at the nightly-review/<date> git branch to see what Claude changed.

If it says FAIL, open review-analysis.md to see what Claude found before giving up.

If it says PASS, you have nothing to do.

Limitations

  • The wrapper cannot wake the workstation. If the box was suspended at 04:00 the task runs whenever it next starts. Plan accordingly.
  • The Anthropic API key must live somewhere the wrapper can read it without the Identity Atlas stack. The vault inside Identity Atlas is intentionally NOT used as the primary source — at 4 AM the most likely reason for needing the review is that Identity Atlas itself is broken. Use ANTHROPIC_API_KEY env var or test/test.secrets.json.
  • Path A (fix-it mode) requires the Claude Code CLI on PATH (or at $ClaudeCli). If it isn't installed, the wrapper falls back automatically to Path B when an API key is available, and to Path C otherwise.
  • Re-runs are full nightly runs. They take ~30 minutes. The wrapper does not yet support "re-run only the failed tests" — if you need that, run the individual scenario script directly (e.g. Test-EntraIdCrawler.ps1 -Scenarios @('Identity-Only')).