● Microsoft Edge · Commercial Journeys

Machine Eval + GEPA Iteration

A systematic framework for evaluating and optimizing the Commercial Journeys pipeline through automated evaluation and reflective prompt evolution.

The Six-Step Framework

STEP 1

Define Goals & Metrics

✓ Draft Done
  • Clarify evaluation target
  • Define per-agent metrics + end-to-end metrics
  • Set scoring criteria and boundary definitions
  • Define failure taxonomy (Ship-Blocker / Critical / High / Medium)
STEP 2

Build Eval Dataset

Pending
  • Build input data + ground truth; cover positive / negative / edge cases
  • Split into 2 non-overlapping sets:
    • Training Set (D_train) — GEPA scoring + feedback sampling
    • Hold-out Test Set — formal eval only
  • Hold-out locked first; max 2 uses per optimization round
STEP 3

Generate Outputs

Pending
  • Collect final results + per-agent intermediate outputs
  • Record metadata (latency, tokens, model version, prompt version)
STEP 4

Human Labeling & Calibration

Pending
  • Source 1: Selfhost Manual Eval scores (C.1-O.5) → golden set baseline
  • Source 2: Expert annotation for machine-eval-specific metrics (Source Attribution, Lifecycle, etc.)
  • ≥ 2 independent annotators; golden set = input + output + per-metric scores + rationale
STEP 5

Build Automated Judge

Pending
  • LLM Judge calibration (Cohen's κ ≥ 0.7) + rule-based checks
  • Same Judge = formal eval scorer + GEPA fitness function
STEP 6

Eval + GEPA Optimization

✓ Mechanism Draft
  • 6a Baseline evaluation
  • 6b Per-agent GEPA loop (upstream → downstream)
  • 6c Per-agent convergence + regression
  • 6d Cross-agent GEPA loop + convergence
  • 6e Final validation vs baseline
  • 6f Continuous regression & monitoring

Step 1 — Define Goals & Metrics

Commercial Journeys is a multi-agent pipeline. Evaluation operates at two levels: per-agent output (fault isolation) and end-to-end output (holistic quality).

42 Total Metrics · 8 Eval Scopes · 27 Failure Modes · 4 Severity Levels
Pipeline Architecture
📧 M365 Data + 🌐 Browser History → Grouping Agent ×N (3 metrics) → Merge Agent (16 metrics, 3 passes) → Rank Agent (10 metrics) → 🃏 NTP Cards → ⚡ Copilot Execution (13 metrics)
Evaluation Conventions
N/A Handling

Conditional metrics are marked N/A when their preconditions aren't met. N/A results don't count toward pass/fail; they are tracked as n/a_rate per metric (target: < 40%).

Responsibility Boundary

Every failure is tagged cj_pipeline | m365_copilot | ambiguous. Only cj_pipeline failures count toward the Commercial Journeys pass/fail verdict.
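A minimal sketch of how these two conventions might be encoded during scoring; the MetricResult schema, field names, and helper are illustrative, not the production format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricResult:
    metric: str                           # e.g. "lifecycle_accuracy"
    passed: Optional[bool]                # None = N/A (precondition not met)
    responsibility: str = "cj_pipeline"   # cj_pipeline | m365_copilot | ambiguous

def summarize(results: list[MetricResult], metric: str) -> dict:
    """Compute n/a_rate and pass rate for one metric across eval cases."""
    rows = [r for r in results if r.metric == metric]
    not_applicable = [r for r in rows if r.passed is None]
    scored = [r for r in rows if r.passed is not None]
    # Only cj_pipeline failures count toward the Commercial Journeys verdict.
    fails = [r for r in scored if not r.passed and r.responsibility == "cj_pipeline"]
    return {
        "n/a_rate": len(not_applicable) / len(rows) if rows else 0.0,   # target: < 40%
        "pass_rate": 1 - len(fails) / len(scored) if scored else None,
    }
```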

Part 1: Per-Agent Output Evaluation
📋

Grouping Agent Output

3 metrics

Does each data source's clustering correctly identify task candidates from raw signals? Runs independently per source — Email, Teams, Calendar, Meeting Recaps, Files, Browser History.

Intra-source Grouping Coherence

Are all items in one cluster truly about the same task?

AI reviews each cluster → judges semantic unity. Unrelated items lumped = failure.
Intra-source Grouping Completeness

Are all items for the same task in the same cluster?

AI scans unclustered items → flags items that should belong to existing cluster.
Per-source Coverage

Did sources with task signals produce candidates?

Compare active sources vs. zero-candidate sources → flag missed signals.
✓ Coherence error < 5% · Completeness error < 5%
🔀

Merge Agent Output

16 metrics · 3 passes

Are candidates from different data sources correctly fused into complete, well-defined journeys? The most complex step — 16 metrics split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.

PASS A — Fusion Quality
Cross-source Fusion (Over-merging)

Were distinct tasks incorrectly merged?

Different owners, deadlines, or deliverables = distinct tasks
Cross-source Fusion (Under-merging)

Were same-task candidates left separate?

User would describe as one to-do item = same task
Source Attribution Precision

No wrong source links?

Content directly relevant = correct. Tangential = incorrect.
Source Attribution Recall

Key sources included?

Key = primary trigger, deadline info, manager+ stakeholder.
Source Role Assignment

Correct roles? (Email=trigger, Calendar=deadline, etc.)

N/A when single source.
✓ Over-merge < 3% · Under-merge < 5% · Precision ≥ 95% · Recall ≥ 90%
PASS B — Task Definition
Task Accuracy

Is this a real task the user needs to do?

Explicitly assigned/committed = real. CC'd FYI → inferred action = not real.
Task Granularity

Right scope? Not too broad/narrow?

Score 1-5. "Prepare for work this week" = 1. "Draft reply to VP's email" = 5.
AI Actionability

Does this task benefit from AI?

Too simple (one-click) or beyond AI (physical) = flagged.
Gate Condition Compliance

Passes all 7 gates?

IS_CONCRETE_TASK · IS_WORK_RELATED · HAS_USER_RELEVANCE · IS_NOT_COMPLETED · HAS_MEANINGFUL_COMPLEXITY · REQUIRES_BROWSER · IS_AI_EXECUTABLE
Hallucination Detection

Zero traceable source signals?

No evidence in ANY source = hallucination (critical). Strong multi-source inference ≠ hallucination.
✓ Task Accuracy ≥ 90% · Hallucination = 0%
PASS C — Filtering & Lifecycle
Lifecycle Accuracy

NEW / ONGOING / RECURRING correctly assigned?

N/A on first-run with no history.
Incremental Update Fidelity

Original context preserved when updating?

N/A for NEW_WORK.
Dismissed Topic Suppression

Dismissed journeys stay dismissed?

N/A when no dismiss history.
Privacy & Sensitivity Filtering

Sensitive topics blocked? PII scrubbed?

Any PII = ship-blocker. Personal-but-not-PII = High.
Staleness Rejection

Signals older than 3 days filtered out?

ALL sources expired AND no fresh merge = stale.
Noise Rejection

Trivial signals suppressed?

Auto-generated emails, all-day placeholders, newsletters = noise.
✓ Lifecycle ≥ 90% · Privacy failure = 0% · Noise pass-through < 5%
🏆

Rank Agent Output

10 metrics

Are journeys correctly prioritized with high-quality execution plans?

Priority Signal Alignment

Top-ranked = highest urgency/importance?

Recall@3

How many of the human-labeled ideal Top-3 appear in the system's Top-3?

Time Sensitivity Accuracy

Urgency score matches actual timestamps?

Ranking Stability

No erratic changes without new signals?

Task Category Distribution

Top-N spread across different action types?

Data Source Coverage

Top-N drawing from multiple sources?

Temporal Trigger Diversity

Varied urgency triggers?

Cross-user Consistency

Are ranking issues systemic across users or isolated?

ExecutionPlan Quality

Well-structured, specific, executable? Score 1-5.

ExecutionPlan-Journey Alignment

Plan matches journey's task?

✓ Top-3 inversion < 10% · Recall@3 ≥ 0.6 · No single action_type > 60% · ExecutionPlan avg ≥ 3.5
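As a concrete illustration of the Recall@3 metric above, a minimal sketch; the journey IDs are hypothetical.

```python
def recall_at_k(ideal: list[str], ranked: list[str], k: int = 3) -> float:
    """Fraction of the human-labeled ideal Top-k that appears in the system's Top-k."""
    ideal_top, system_top = set(ideal[:k]), set(ranked[:k])
    return len(ideal_top & system_top) / len(ideal_top) if ideal_top else 0.0

# 2 of the 3 human-labeled journeys made the system's Top-3 -> 0.67 (meets the >= 0.6 target)
recall_at_k(["j_budget", "j_vp_reply", "j_q3_review"],
            ["j_vp_reply", "j_newsletter", "j_q3_review", "j_budget"])
```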
Part 2: Final Output Evaluation
🃏

Card Presentation Quality

4 metrics
Brief Clarity

Specific, actionable, immediately understood?

Label Accuracy

Label reflects actual urgency trigger?

Icon Relevance

Icon matches actual task?

Promise Accuracy

Card accurately represents execution output? Score 1-5.

✓ Clarity ≥ 90% · Promise score ≤ 2 = 0%

Execution & Output Quality

6 metrics

Root-cause tagging required: cj_pipeline | m365_copilot | ambiguous

Execution Completion

Success vs. fail/timeout rate

Timeout = likely m365. Malformed prompt = cj_pipeline.
Output Correctness

Factual claims grounded in user data? ≥ 95%

Bad context = cj. Copilot hallucinated = m365.
Output Completeness

Covers all key threads?

Commercial Journeys omitted context = cj. Copilot ignored = m365.
Effort Reduction

Synthesizes beyond raw data? Score 1-5.

Typically m365_copilot.
Actionability

Concrete next steps? ≥ 1 action?

Prompt didn't request = cj. Requested but not delivered = m365.
Scenario Fit

Output format matches action_type?

Wrong format in prompt = cj. Correct request, wrong delivery = m365.
✓ Completion ≥ 95% · Correctness ≥ 95% (cj_pipeline) · Effort Reduction avg ≥ 3.5
🔗

End-to-End Coherence & Task Resolution

3 metrics
Full-chain Topic Alignment

Primary topic consistent across all stages?

Task Resolution Score

Can user advance task from output alone? Score 1-5.

Recurrence Handling

Correct upcoming instance for RECURRING_WORK?

✓ Topic break = 0% · Resolution avg ≥ 3.5
Failure Mode Taxonomy
27 failure modes by severity: 🔴 Ship-Blocker 3 · 🟠 Critical 7 · 🟡 High 13 · 🟢 Medium 4
🔴 Ship-Blocker — Zero Tolerance
  • Privacy Violation (Merge)
  • PII Leakage (Merge)
  • Resurfaced Dismissed Topic (Merge)
🟠 Critical — < 2% Threshold
  • Hallucinated Journey (Merge)
  • Wrong Source Attribution (Merge)
  • Phantom Task (Merge)
  • Completed Task Surfaced (Merge)
  • Promise-Delivery Gap (Output)
  • Over-Promise (Output)
  • Output Hallucination (Output)
🟡 High — Needs Mitigation
  • Topic Drift (Output)
  • Missed Source Entirely (Grouping)
  • Mis-grouped Items (Grouping)
  • Orphaned Key Item (Grouping)
  • Over-merged Tasks (Merge)
  • Wrong Lifecycle (Merge)
  • Scope Mismatch (Merge)
  • AI Misfit (Merge)
  • Noise Leak (Merge)
  • Stale Journey (Merge)
  • Priority Inversion (Rank)
  • Bad ExecutionPlan (Rank)
  • Missing Key Context (Output)
🟢 Medium — Can Ship with Tracking
  • Wrong Help Type (Output)
  • Under-merged Duplicates (Merge)
  • Monoculture Recommendations (Rank)
  • Vague Card (Output)
Verdict Rules
🚫 Any 🔴 Ship-Blocker → Fail — blocks ship
🚫 🟠 Critical ≥ 2% → Fail — must fix before GA
⚠️ 🟠 < 2% AND ≥ 2 🟡 → Conditional — needs mitigation
✅ 🟠 < 2% AND 0-1 🟡 AND all pass → Ship-ready
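A minimal sketch of the verdict logic, assuming failure counts have already been aggregated per severity. The thresholds mirror the rules above; the unspecified case (0-1 High but not all metrics passing) defaults to Conditional here, which is an assumption.

```python
def verdict(ship_blockers: int, critical_rate: float, high_count: int,
            all_metrics_pass: bool) -> str:
    """Map aggregated failure counts to a ship verdict (illustrative, mirrors the rules above)."""
    if ship_blockers > 0:
        return "Fail — blocks ship"
    if critical_rate >= 0.02:
        return "Fail — must fix before GA"
    if high_count >= 2:
        return "Conditional — needs mitigation"
    if all_metrics_pass:
        return "Ship-ready"
    return "Conditional — needs mitigation"   # assumption: unresolved edge case is conditional
```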
Evaluation Structure Summary
Eval Scope | What | Sub-metrics
Grouping Agent | Per-source clustering quality | 3
Merge Agent — Pass A | Fusion quality | 5
Merge Agent — Pass B | Task definition | 5
Merge Agent — Pass C | Filtering & lifecycle | 6
Rank Agent | Priority, recall, diversity, ExecutionPlan | 10
Card Presentation | Pre-click experience | 4
Execution & Output | Post-click value | 6
E2E Coherence | Full-chain consistency | 3
Total | | 42
Manual Eval Cross-reference
Manual Eval | Machine Eval Mapping | Coverage
C.1 Task Accuracy | Merge Pass B: Task Accuracy | ♻️ Both
C.2 Task Relevance | Merge Pass B: Gate Compliance | ♻️ Both
C.3 AI Actionability | Merge Pass B: AI Actionability | ♻️ Both
C.4 Privacy | Merge Pass C: Privacy Filtering | ♻️ Both
C.5-C.8 Card Display | Card: Clarity, Label, Icon | ♻️ Both
X.1 Top-3 Ranking | Rank: Priority + Recall@3 | ♻️ Both
X.2 Missing Journeys | Rank: Recall@3 + Grouping | ♻️ Both
X.3 Shouldn't Be There | Merge Pass B: Gate Compliance | ♻️ Both
O.1-O.5 Output Quality | Execution & Output (all) | ♻️ Both
U.1-U.4 UX | Manual only |
G.1-G.2 Overall | Manual only |
Consumer Eval Cross-reference
Consumer Dimension | Commercial Mapping | Status
L1 Relevance | Merge Pass B: Task Accuracy + Hallucination |
L1 Groundedness | Merge Pass A: Source Attribution |
L1 Helpfulness | Merge Pass B: AI Actionability + Gates |
L1 Safety & Privacy | Merge Pass C: Privacy Filtering |
L1 Technical Feasibility | Card: Promise Accuracy |
L2 Clarity | Card: Brief Clarity |
L2 Promise Accuracy | Card: Promise Accuracy |
L3 Correctness | Execution: Output Correctness |
L3 Completeness | Execution: Output Completeness |
L3 Effort Reduction | Execution: Effort Reduction |
L3 Actionability | Execution: Actionability |
L3 Scenario Fit | Execution: Scenario Fit |
L4 Self-Sufficiency | E2E: Task Resolution Score |
L4 E2E Coherence | E2E: Full-chain Topic Alignment |
L4 Engagement / Trust | Online-only | ⚪ N/A

Step 2 — Build Eval Dataset

🚧 Coming soon — Dataset construction details will be added here.

Step 3 — Generate Outputs

🚧 Coming soon — Output generation and trace capture details will be added here.

Step 4 — Human Labeling & Calibration

🚧 Coming soon — Annotation workflow and golden set details will be added here.

Step 5 — Build Automated Judge

🚧 Coming soon — LLM-as-Judge calibration and rule-based checks will be added here.

Step 6 — Eval + GEPA Optimization

Overview
What is GEPA?

GEPA (Genetic-Pareto) is a reflective prompt evolution method. Instead of brute-force RL search, it lets a strong model review failures → analyze root causes → surgically rewrite prompts, managed through a genetic evolution framework with Pareto selection to avoid single-dimension overfitting. Children are added to the candidate pool unconditionally — natural selection happens through Pareto frontier recomputation, not explicit accept/reject gates.

Reference: Reflective Prompt Evolution Can Outperform Reinforcement Learning (ICLR 2026)

Two Datasets, Two Roles
Training Set (D_train)

The dual-purpose workhorse. Serves both roles during GEPA optimization:

  • Scoring: All/most cases scored per round to compute Pareto frontiers
  • Feedback sampling: Random batch sampled each round for failure trace collection & reflection
Why not split? Pareto dimensions = per-case scores. If feedback comes from different data, reflection targets won't align with selection criteria.
Hold-out Test Set

The sealed final exam. Completely isolated from optimization. Only used twice: baseline (6a) and final validation (6e).

Never seen during optimization — ensures results generalize beyond training data
End-to-End Flow
6a Baseline → 6b Per-Agent GEPA → 6c Convergence → 6d Cross-Agent GEPA → 6e Final Validation → 6f Continuous Regression
6a. Baseline Evaluation

Run the Hold-out Test Set against all current pipeline prompts. Record baseline scores across all 42 metrics.

  • This is the test set's first use
  • Output: per-agent, per-metric baseline scores — the anchor for final comparison
6b. Per-Agent GEPA Loop
Premise
  • Lock the model — Commercial Journeys runs on Sydney. GEPA optimizes prompts only, no model switching.
  • One agent at a time — freeze other agents' outputs as control variables.
  • Order: Upstream → Downstream — Grouping (×N) → Merge → Rank. Upstream quality sets the ceiling for downstream; optimize inputs first.
Commercial Journeys Agent Configuration
Agent | Seed Prompt | Frozen Context | Eval Metrics
Grouping ×N | Per-source current prompt | Raw data signals | Coherence, Completeness, Coverage
Merge | Current Merge prompt | Optimized Grouping outputs | Pass A (Fusion) + B (Task) + C (Filter) — 16 metrics
Rank | Current Rank prompt | Optimized Merge outputs | Priority, Recall@3, ExecutionPlan Quality & Alignment

Pareto dimensions = each training case's aggregate score (not per-metric). Too many dimensions would make almost every prompt non-dominated, defeating Pareto selection.

The GEPA Cycle (Per-Agent)
Initialize
Seed Prompt
Start with 1 prompt (P₀) — the agent's current production prompt. The candidate pool grows each round as children are unconditionally added: [P₀, P₁, P₂, …]
Step 1 — Score & Select
D_train Scoring → Pareto Frontier → Parent
Score every candidate on all/most cases in D_train using the LLM Judge (from Step 5). Compute the Pareto frontier — prompts that aren't fully dominated by any other.
Prompt | Case 1 | Case 2 | Case 3 | Status
P₀ | 0.9 | 0.4 | 0.7 | Frontier
P₁ | 0.6 | 0.8 | 0.7 | Frontier
P₂ | 0.7 | 0.7 | 0.9 | Frontier
P₃ | 0.5 | 0.5 | 0.5 | Dominated
Select parent from frontier (prefer least-selected). Say we pick P₀.

Note: dominated prompts (like P₃) remain in the pool — they may return to the frontier if future children shift the landscape. Natural selection, not manual pruning.
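A minimal sketch of the dominance check and frontier computation, using the illustrative per-case scores from the table above; dominated candidates are reported, not removed.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """a dominates b if it scores >= b on every case and strictly > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Prompts not dominated by any other candidate; dominated ones stay in the pool."""
    return {p for p, s in scores.items()
            if not any(dominates(o, s) for q, o in scores.items() if q != p)}

scores = {"P0": [0.9, 0.4, 0.7], "P1": [0.6, 0.8, 0.7],
          "P2": [0.7, 0.7, 0.9], "P3": [0.5, 0.5, 0.5]}
pareto_frontier(scores)   # {"P0", "P1", "P2"} — P3 is dominated but not discarded
```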

Step 2 — Expose & Reflect
D_train Sampling → Feedback Function µ_f → Reflection + Ancestry
Randomly sample a batch from the same D_train. Run parent P₀ on these cases. Collect structured failure feedback via the Feedback Function µ_f, then feed to a Reflection LLM along with P₀'s ancestry lessons.
Feedback Function µ_f (distinct from scoring µ)

The scoring metric µ returns a scalar (0–1) for Pareto computation. The feedback function µ_f returns structured natural language — this is what makes reflection surgical rather than guesswork.

µ_f output example (Merge agent):
Case 7 | Score: 0.3 | Failure: Two emails about Q3 review + one calendar event point to the same deliverable, but Merge output treats them as independent journeys. Root cause: fusion rule lacks cross-source deliverable matching.

Reflection quality depends on µ_f richness: scalar scores alone → reflection guesses; structured failure traces → reflection does targeted surgery.
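One possible way to carry µ_f output alongside the scalar score; the FeedbackTrace schema below is illustrative, not a required format.

```python
from dataclasses import dataclass

@dataclass
class FeedbackTrace:
    """One µ_f record: the scalar used for Pareto scoring plus structured failure context."""
    case_id: int
    score: float         # scalar from scoring metric µ
    failure: str         # what went wrong, in natural language
    root_cause: str      # hypothesis the Reflection LLM can act on

trace = FeedbackTrace(
    case_id=7,
    score=0.3,
    failure="Two emails about Q3 review + one calendar event point to the same "
            "deliverable, but Merge output treats them as independent journeys.",
    root_cause="Fusion rule lacks cross-source deliverable matching.",
)
```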
Prompt Ancestry & Lesson Accumulation

Every prompt carries an ancestry chain — who its parent was, and what lesson each mutation learned. This prevents "fix A, break B, fix B, break A" oscillation.

P₀'s ancestry (example):
P₀ (seed) → P₂ (lesson: "added cross-source matching but broke lifecycle filtering") → P₅ (lesson: "restored lifecycle rules, scoped cross-source to same-week window")

When mutating P₅, the Reflection LLM sees all accumulated lessons and avoids repeating past mistakes.
Reflection LLM input: current prompt + µ_f failure traces + ancestry lessons

"P₀ is too conservative in cross-source fusion — frequently under-merges.
  Ancestry note: P₀'s grandparent tried aggressive fusion but broke lifecycle filtering (see lesson from round 3).
  Suggest: 'Merge cross-source items sharing the same deliverable within a 7-day window, preserving lifecycle boundaries.'"

Scoring and feedback come from the same D_train — this ensures reflection targets align with Pareto selection criteria.
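A minimal sketch of how the Reflection LLM input could be assembled from these three ingredients. It reuses the illustrative FeedbackTrace from the µ_f sketch above, and the prompt wording is only an example.

```python
def build_reflection_input(current_prompt: str,
                           traces: list[FeedbackTrace],
                           ancestry_lessons: list[str]) -> str:
    """Assemble the Reflection LLM input: prompt under revision, sampled failures, inherited lessons."""
    failures = "\n".join(f"- Case {t.case_id} (score {t.score}): {t.failure} "
                         f"Root cause: {t.root_cause}" for t in traces)
    lessons = "\n".join(f"- {lesson}" for lesson in ancestry_lessons) or "- (none yet)"
    return (
        "You are revising an agent prompt. Do not repeat past mistakes.\n"
        f"CURRENT PROMPT:\n{current_prompt}\n\n"
        f"FAILURE TRACES (sampled from D_train):\n{failures}\n\n"
        f"ANCESTRY LESSONS:\n{lessons}\n\n"
        "Propose a surgically edited prompt that fixes these failures "
        "without undoing what the lessons preserved."
    )
```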

Step 3 — Mutate & Add to Pool
Targeted Mutation / Semantic Crossover → Unconditional Pool Entry
Two mutation modes, both informed by ancestry lessons:
Targeted Mutation (single parent)
Reflection LLM rewrites parent P₀ based on failure analysis + ancestry lessons → child P₄
Semantic Crossover (two parents)
Select two frontier prompts. LLM identifies each one's complementary strengths (e.g. P₁ excels at fusion, P₂ at lifecycle). Generates a child that semantically fuses both — not module splicing, but understanding why each works and synthesizing.
Crossover example (Merge prompt):
P₁ strength: excellent cross-source fusion (handles email+calendar overlap)
P₂ strength: robust lifecycle filtering (correctly expires completed tasks)
→ Child P₄: LLM synthesizes a prompt that captures both fusion logic and lifecycle rules, resolving conflicts where they overlap
P₄ joins the pool unconditionally → [P₀, P₁, P₂, P₃, P₄]
No accept/reject gate. If P₄ is weak, it'll be naturally dominated in the next round's Pareto computation. But it might have unique strengths on specific cases that preserve diversity.
P₄ inherits ancestry from its parent(s) + adds the current round's lesson.
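A minimal sketch of unconditional pool entry with ancestry inheritance; the Candidate structure and naming scheme are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    prompt: str
    parents: list[str] = field(default_factory=list)
    lessons: list[str] = field(default_factory=list)   # accumulated ancestry lessons

def add_child(pool: list[Candidate], child_prompt: str,
              parents: list[Candidate], lesson: str) -> Candidate:
    """Children join the pool unconditionally; selection happens only in the next Pareto pass."""
    child = Candidate(
        name=f"P{len(pool)}",
        prompt=child_prompt,
        parents=[p.name for p in parents],                       # single parent = mutation, two = crossover
        lessons=[les for p in parents for les in p.lessons] + [lesson],
    )
    pool.append(child)   # no accept/reject gate
    return child
```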
Step 4 — Loop or Converge
Back to Step 1 — until convergence
Next round: re-score all candidates (including new child), recompute Pareto frontier, select new parent… Stop when any convergence signal fires:
  • Iteration cap reached (e.g. 15 rounds)
  • Best score plateaus: < 1% change for N consecutive rounds
  • Frontier stops changing: same prompts on frontier for N rounds
  • All pass criteria already met

Convergence ≠ global optimum. It means "no further gains under current search strategy." A fundamentally different prompt approach might score higher, but GEPA's mutation path didn't explore it. This is why we validate with the Hold-out Test Set later (6e).
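A minimal sketch of the stop-signal check, assuming one best aggregate score and one frontier snapshot are recorded per round; the cap, plateau epsilon, and patience values are illustrative.

```python
def converged(best_scores: list[float], frontiers: list[set[str]],
              round_idx: int, criteria_met: bool = False,
              max_rounds: int = 15, plateau_eps: float = 0.01, patience: int = 3) -> bool:
    """Return True when any stop signal fires: pass criteria met, iteration cap,
    best-score plateau, or an unchanged Pareto frontier."""
    if criteria_met or round_idx >= max_rounds:
        return True
    # Plateau: best score changed by < plateau_eps versus each of the last `patience` rounds.
    if len(best_scores) > patience and all(
            abs(best_scores[-1] - s) < plateau_eps for s in best_scores[-patience - 1:-1]):
        return True
    # Stable frontier: the same prompt set stayed on the frontier for `patience` rounds.
    if len(frontiers) > patience and all(f == frontiers[-1] for f in frontiers[-patience - 1:-1]):
        return True
    return False
```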

6c. Per-Agent Convergence & Regression
Output: Top-N Candidates

Only after convergence (any of the 4 stop signals in 6b Step 4 fires) do we select Top-N from the final Pareto frontier. Rank by average score across all training cases:

Prompt | V1 | V2 | V3 | Avg
P₂ | 0.7 | 0.7 | 0.9 | 0.77
P₅ | 0.8 | 0.6 | 0.8 | 0.73
P₁ | 0.6 | 0.8 | 0.7 | 0.70
Sanity Check: E2E Integration

Since we optimize upstream → downstream and freeze other agents, per-agent regression is unlikely within the same optimization round. However, before moving to the next agent, run a quick E2E integration check to confirm the new prompt works correctly when wired into the full pipeline:

  • Pipeline works as expected → lock this agent's result → move to the next agent
  • Unexpected interaction → investigate and re-adapt if needed → then continue

The real cross-agent validation happens in 6d, where all prompt combinations are evaluated end-to-end.

Commercial Journeys Per-Agent Outputs
Agent | GEPA Output | Sent to 6d
Grouping ×N | Top-1 per source (N independent sources) | N prompts (fixed)
Merge | Pareto frontier → Top-3 by avg score | [M₁, M₂, M₃]
Rank | Pareto frontier → Top-3 by avg score | [R₁, R₂, R₃]
6d. Cross-Agent GEPA Loop + Convergence

Why this step? Per-agent optimization freezes other agents. But in production, all prompts interact. A prompt that's #1 in isolation may not be the best partner for another agent's top pick. We need to find the globally optimal combination.

Step 1 — Assemble Candidates
Collect Per-Agent Top-N
Grouping: Top-1 per source → N prompts (fixed, not combinatorially expanded)
Merge: Top-3 from per-agent Pareto → [M₁, M₂, M₃]
Rank: Top-3 from per-agent Pareto → [R₁, R₂, R₃]

Combination space: 3 × 3 = 9 prompt suites (Grouping is fixed across all combinations)
Step 2 — Full E2E Evaluation
Run All 9 Combinations End-to-End
Each combination runs the full pipeline (Grouping → Merge → Rank → Card → Execution) on D_train. Score all 42 metrics per combination.

Cost: Only 9 E2E eval runs — very manageable.
Combination | E2E Score | Ship-Blockers | Critical Rate
M₁ + R₁ | 0.81 | 0 | 1.2%
M₂ + R₁ | 0.85 | 0 | 0.8%
M₃ + R₁ | 0.79 | 0 | 1.5%
M₁ + R₂ | 0.80 | 0 | 1.1%
Step 3 — Select & Confirm
Global Optimum + Convergence Check
Select the highest-scoring combination: M₂ + R₁ wins (not the per-agent #1s M₁ + R₁).

Convergence check: The global optimum is stable if:
  • Winner is clearly ahead (gap > margin of error)
  • Zero Ship-Blockers in the winning combination
  • Critical failure rate within threshold (< 2%)
  • All per-agent pass criteria still hold in the combined run

If no combination meets all criteria → revisit per-agent GEPA (6b) with adjusted constraints, or expand Top-N to explore more combinations.
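A minimal sketch of the 6d enumeration-and-selection step, assuming a hypothetical run_e2e_eval callable that returns each combination's E2E score, Ship-Blocker count, and Critical rate after scoring all 42 metrics on D_train.

```python
from itertools import product

def select_global_optimum(merge_prompts, rank_prompts, grouping_prompts, run_e2e_eval):
    """Evaluate all Merge x Rank combinations end-to-end (Grouping fixed) and pick the
    best combination with zero Ship-Blockers and a Critical rate under 2%."""
    results = []
    for merge_p, rank_p in product(merge_prompts, rank_prompts):   # 3 x 3 = 9 suites
        report = run_e2e_eval(grouping_prompts, merge_p, rank_p)   # hypothetical E2E scorer
        results.append((merge_p, rank_p, report))
    viable = [r for r in results
              if r[2]["ship_blockers"] == 0 and r[2]["critical_rate"] < 0.02]
    if not viable:
        return None   # revisit per-agent GEPA (6b) or expand Top-N
    return max(viable, key=lambda r: r[2]["e2e_score"])
```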

6e. Final Validation

Run the Hold-out Test Set against the winning prompt suite from 6d. Compare against 6a baseline.

  • Test ≈ D_train scores → no overfitting; results generalize → ship it
  • Test ≪ D_train scores → prompts overfit to training data → revisit dataset or strategy
  • This is the test set's second and final use in this optimization round
  • If another round is needed → supplement ≥ 30% new test cases
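A minimal sketch of the generalization check on the two bullets above; the 5% tolerance is an illustrative choice, not a spec value.

```python
def final_validation(test_score: float, train_score: float, tolerance: float = 0.05) -> str:
    """Compare Hold-out Test Set performance against D_train; a large gap signals overfitting."""
    if train_score - test_score <= tolerance:
        return "Ship it — results generalize beyond training data"
    return "Overfit — revisit dataset or optimization strategy"
```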
6f. Continuous Regression & Monitoring

After shipping the optimized prompt suite, integrate eval into CI/CD. Re-trigger the full eval + GEPA cycle when:

Prompt Change

Any manual edit to a pipeline prompt (Grouping, Merge, or Rank)

Model Update

Sydney model version update or provider-side changes

Pipeline Logic Change

Changes to pipeline orchestration, data ingestion, or gate conditions

Score Regression

CI/CD monitoring detects eval score drops below thresholds

New Data Sources

New signal sources added to the pipeline (e.g. new M365 connector)

Periodic Cadence

Scheduled re-evaluation (e.g. monthly or quarterly) to catch gradual drift