Microsoft Edge · Commercial Journeys

What is Commercial Journeys?

Commercial Journeys is a new AI-powered feature in Microsoft Edge that proactively surfaces work tasks for commercial (AAD) users on the Edge Copilot New Tab Page (NTP), based on the user's browser and M365 activity. It identifies what the user needs to work on and helps them get it done faster by connecting to M365 Copilot Chat for execution.

Machine Eval + GEPA Iteration

A systematic framework for evaluating and optimizing the Commercial Journeys pipeline through automated evaluation and reflective prompt evolution.

The Six-Step Framework

Step 1

Define Goals & Metrics

Draft Done
  • 40 metrics across 3 pipeline agents + end-to-end quality
  • Failure severity: Ship-Blocker (0 tolerance) → Critical (<2%) → High → Medium
  • Every failure attributed to root cause: pipeline vs Copilot vs ambiguous
Step 2

Build Eval Dataset

Coming Soon
  • Real or LLM-simulated M365 and browser signals
  • Split into D_train (GEPA scoring + feedback) and Hold-out Test (sealed, max 2 uses)
  • Each case = input signals + ground truth
Step 3

Generate Pipeline Outputs

Coming Soon
  • Run full pipeline: Grouping → Merge → Rank → Card → Execution
  • Capture per-agent intermediate outputs, not just final cards
  • Log metadata: latency, token count, prompt version, model version
Step 4

Human Annotation & Calibration

Coming Soon
  • Bootstrap from Selfhost Manual Eval scores as golden set baseline
  • Expert annotation for machine-eval metrics: source attribution, lifecycle, fusion
  • Golden set = input + output + per-metric scores + scoring rationale
Step 5

Build Automated Judge

Coming Soon
  • LLM Judge + rule-based checks, calibrated against human labels (Cohen's κ ≥ 0.7)
  • Dual role: formal eval scorer + GEPA fitness function
  • Outputs score (0–1) for ranking + structured failure feedback for reflection
Step 6

Eval + GEPA Optimization

Mechanism Draft
  • Baseline → Per-agent GEPA (Grouping → Merge → Rank) → Cross-agent GEPA
  • Genetic-Pareto evolution: reflect on failures, mutate prompts, Pareto natural selection
  • Final validation on hold-out set; ship if no overfitting

Step 1 — Define Goals & Metrics

Overview

Commercial Journeys is a multi-agent pipeline: M365 Data + Browser → Grouping Agent → Merging Agent → Ranking Agent → M365 Copilot Chat. We evaluate 40 metrics across 2 levels: per-agent output (fault isolation for each sub-agent) and final presentation (card quality + execution delivery + E2E coherence).

Pipeline Architecture
M365 Data
Browser History
Grouping Agent
→ Per-source grouped task information
Merging Agent
→ Cross-source merged journey candidates
Ranking Agent
→ (1) Final journeys with priority ranking
→ (2) Optimized prompt with ExecutionPlan per journey
M365 Copilot Chat
→ Final execution output (email drafts, meeting prep, summaries, etc.)
Evaluation Conventions

N/A Handling: Conditional metrics (e.g., Recurrence Handling, Incremental Update, Dismissed Suppression) are marked N/A when preconditions are not met. N/A judgments do not count toward pass/fail. Tracked as n/a_rate per metric — target: N/A rate < 40%.

End-to-End Ownership: We evaluate the entire feature end-to-end — from signal ingestion through Copilot execution output. Even though the final execution is handled by M365 Copilot Chat, the overall experience is what matters. If the feature delivers poor results for any reason (bad prompt design, poor execution quality, or both), it reflects on the feature as a whole. Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is used for debugging and prioritization, not for excluding failures from the overall verdict.
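These conventions can be made concrete in aggregation code. A minimal sketch (all names hypothetical, not from the production pipeline) of N/A-aware pass rates and root-cause tags:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricJudgment:
    metric: str
    passed: Optional[bool]            # None = N/A (precondition not met)
    root_cause: Optional[str] = None  # "cj_pipeline" | "m365_copilot" | "ambiguous"

def pass_rate(judgments):
    """Pass rate over applicable judgments only -- N/A never counts toward pass/fail."""
    applicable = [j for j in judgments if j.passed is not None]
    return sum(j.passed for j in applicable) / len(applicable) if applicable else None

def na_rate(judgments):
    """Fraction of N/A judgments, tracked per metric (target: < 40%)."""
    return sum(j.passed is None for j in judgments) / len(judgments)
```

Note that root-cause tags ride along with each judgment for debugging, but a failure counts toward the verdict regardless of its tag.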

Part 1: Per-Agent Output Evaluation

Grouping Agent Output

3 metrics

Output: By-source grouped tasks. Each data source independently produces task candidates from raw signals. Sources: Email, Teams, Calendar, Meeting Recaps, Files, Browser History.

Sub-metric | What to measure | How
Intra-source Grouping Coherence | Within a single data source, are all items in one cluster truly about the same task? | AI reviews each cluster's item set → judges whether semantically unified. E.g., two unrelated email threads lumped into one candidate = failure
Intra-source Grouping Completeness | Within a single data source, are all items related to the same task assigned to the same cluster? | AI scans unclustered items → judges whether they should belong to an existing cluster. E.g., a key email about "Q3 Budget" left as orphan while other Q3 Budget emails are clustered = failure
Per-source Coverage | Did the Grouping Agent produce candidates from data sources that clearly contain task signals? | Compare active data sources vs. data sources that produced zero candidates → flag sources with obvious signals that were entirely missed
Pass: Coherence error < 5% · Completeness error < 5%

Merging Agent Output

17 metrics · 3 passes

Output: Cross-source merged journey candidates. Fuses task candidates from different data sources into complete, well-defined journeys. The most complex step — 17 metrics split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.

Pass A — Fusion Quality
Sub-metric | What to measure | How | Boundary definition
Cross-source Fusion Accuracy (Over-merging) | Were candidates from different sources that describe distinct tasks incorrectly merged? | AI reviews merged journey vs. contributing candidates → flags cases where semantically different tasks were collapsed | Two tasks are "distinct" when they have different owners, different deadlines, or produce different deliverables — even if they share a topic umbrella
Cross-source Fusion Accuracy (Under-merging) | Were candidates from different sources that describe the same task left as separate journeys? | AI clusters all output journeys by semantic similarity → flags near-duplicates from different sources | Two journeys are "same task" when the user would describe them as one item on their to-do list
Source Attribution Precision | Are the linked source items correct — no wrong links? | Check each attributed source → precision = correct sources / total attributed sources | A source is "correct" when its content is directly relevant to the journey's task definition. Tangentially related items = incorrect
Source Attribution Recall | Are the most important source signals included? | Check whether high-signal items are attributed → recall = included key sources / total key sources | Key source = (a) primary task trigger, (b) any source with deadline info, (c) any source from task owner's direct manager or above
Source Role Assignment | Are data sources assigned correct roles? (Email=trigger, Calendar=deadline, Browser=execution evidence, Files=context) | Extract source roles → AI validates role logic against specialized prompt definitions | N/A when journey has only 1 contributing source
Data Source Coverage | Are journey candidates drawing from multiple M365 sources, or over-reliant on a single source? | Tag contributing sources per journey → measure source distribution across all qualified journeys | Flag if all journeys originate from a single source type when multiple sources contain task signals
Pass A: Over-merge < 3% · Under-merge < 5% · Precision ≥ 95% · Recall ≥ 90%
Pass B — Task Definition
Sub-metric | What to measure | How | Boundary definition
Task Accuracy | Is this a real task the user actually needs to complete? | AI evaluates: given the contributing source signals, is the generated task something the user would recognize as "yes, I need to do this"? | "Real task" = user was explicitly assigned, committed, or is the clear owner. "Not real" = AI inferred a task from peripheral signals (e.g., CC'd on FYI email → AI assumes user must act)
Task Granularity | Is the task scoped to a specific, actionable item? | AI evaluates scope. Score 1-5. | Score 5 = single clear deliverable. Score 3 = reasonable but could be more specific. Score 1 = "do your job" broad or trivially narrow
AI Actionability | Does this task genuinely benefit from AI help? | AI classifies: (a) Too simple — user can do faster without AI. (b) AI-appropriate. (c) Beyond AI capability. Flag (a) and (c) | "Too simple" = single click, one-line reply. "Beyond AI" = requires physical presence, confidential judgment, or inaccessible systems
Gate Condition Compliance | Does each journey pass the 7 gating conditions? | Programmatically check: IS_CONCRETE_TASK, IS_WORK_RELATED, HAS_USER_RELEVANCE, IS_NOT_COMPLETED, HAS_MEANINGFUL_COMPLEXITY, REQUIRES_BROWSER, IS_AI_EXECUTABLE | IS_NOT_COMPLETED: if user sent "done"/"completed"/"attached" AND no subsequent follow-up → completed. Ambiguous = not completed (benefit of doubt)
Hallucination Detection | Does the journey have zero traceable source signals — entirely fabricated? | Cross-reference journey description against all contributing candidates and raw signals → flag journeys with no traceable source | "Hallucination" = core task description has NO supporting evidence in ANY input source. "Reasonable inference" from strong multi-source signals ≠ hallucination
Pass B: Task Accuracy ≥ 90% · Hallucination rate = 0%
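The gate check itself is a programmatic conjunction over the 7 conditions. A minimal sketch, assuming each journey carries a boolean flag per gate (the flag names come from the list above; the dict structure is hypothetical):

```python
GATES = ("IS_CONCRETE_TASK", "IS_WORK_RELATED", "HAS_USER_RELEVANCE",
         "IS_NOT_COMPLETED", "HAS_MEANINGFUL_COMPLEXITY",
         "REQUIRES_BROWSER", "IS_AI_EXECUTABLE")

def passes_gates(flags: dict) -> bool:
    """A journey qualifies only if all 7 gates hold; missing flags fail closed."""
    return all(flags.get(gate, False) for gate in GATES)
```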
Pass C — Filtering & Lifecycle
Sub-metric | What to measure | How | Boundary definition
Lifecycle Accuracy | Is the lifecycle state (NEW_WORK / ONGOING_WORK / RECURRING_WORK) correctly assigned? | For ONGOING: verify match to existing journey + continued activity. For NEW: verify no existing journey covers this task | N/A for first-run evaluations where no journey history exists
Incremental Update Fidelity | When updating an existing journey, is the original context preserved while new info is correctly incorporated? | Compare updated journey against previous version + new signals → flag lost context or incorrect overwrites | N/A when journey is NEW_WORK with no prior version
Dismissed Topic Suppression | Are manually dismissed journeys correctly excluded from regeneration? | Cross-reference output against dismissed titles list → flag resurfaced dismissed topics | N/A when no dismiss history exists
Privacy & Sensitivity Filtering | Are sensitive topics correctly blocked? Any PII leakage? | PII scan (regex + NER) across all journey titles, summaries, and source references + personal-vs-work classification | Any PII (email addresses in title, SSN, credit card) = ship-blocker. Personal-but-not-PII (e.g., "dentist appointment") = High
Staleness Rejection | Are journeys based on signals older than the 3-day freshness window filtered out? | Check source signal timestamps → flag journeys based entirely on expired signals | "Entirely expired" = ALL contributing sources >3 days old AND no new signals merged in
Noise Rejection | Are low-value / trivial signals correctly suppressed? | Classify signal value → flag low-value signals that produced journeys | "Noise" = auto-generated system emails, all-day calendar placeholders, newsletters. "Not noise" = routine meetings if user is organizer or has action items
Pass C: Lifecycle ≥ 90% · Privacy failure = 0% (ship-blocker) · Noise pass-through < 5%

Ranking Agent Output

5 metrics · 2 passes

The Ranking Agent produces two distinct outputs: (1) final journeys with priority ranking, and (2) an optimized prompt with an ExecutionPlan for each journey, to be sent to M365 Copilot Chat. These are evaluated as two separate passes.

Pass A — Ranking Quality
Sub-metric | What to measure | How
Priority Signal Alignment | Do top-ranked journeys genuinely have the highest urgency/importance? | Extract rule-based + model scores → verify top-N score highest on timeliness, stakeholder importance, ownership strength
Missing or Under-Ranked Detection | Are there important tasks that should be in the Top-3 but are either missing entirely or buried too low? | AI Judge reviews all input signals holistically → identifies high-urgency/high-importance tasks → checks whether they appear in the system's Top-3. Flags tasks that are absent or ranked significantly lower than their urgency warrants
Ranking Stability | Do rankings change erratically between sessions without new signals? | Compare logs across sessions → flag unexplained rank changes
Ranking: Top-3 inversion < 10% · Missing/under-ranked rate < 10%
Pass B — ExecutionPlan Quality
Sub-metric | What to measure | How
ExecutionPlan Quality | Is the generated ExecutionPlan well-structured, specific, and executable by Copilot? | AI evaluates: task clarity, context inclusion, output format, scope. Score 1-5
ExecutionPlan-Journey Alignment | Does the ExecutionPlan match the journey's task definition? | Compare plan objective against journey title/summary → flag mismatches
ExecutionPlan: avg score ≥ 3.5 · Alignment mismatch = 0%
Part 2: Final Presentation Evaluation

Card Presentation Quality

6 metrics

What the user sees on the NTP before clicking. Does the card earn the click — and set honest expectations?

Sub-metric | What to measure | How
Brief Clarity | Is the task brief (title + summary) specific, actionable, and immediately understood? | AI evaluates: specificity, actionability, grammar, length. Binary: "Can you predict what happens if you click?"
Label Accuracy | Does the label (type + name) correctly reflect the actual urgency trigger? | Cross-reference label against source signals. E.g., "Asked by VP Chen" — verify VP Chen actually made the request
Icon Relevance | Does the action_type icon match the actual task? | Map action_type to journey content → flag mismatches
Promise Accuracy | Does the card accurately represent what the Copilot execution will deliver? | Compare card title + summary vs. actual execution output → score promise-delivery alignment 1-5. Score ≤ 2 = critical failure
Task Category Distribution | Are the displayed journey cards spread across different action types, or all clustered in one category? | Classify displayed cards by action_type → flag if any single type exceeds 60%
Trigger Diversity | Are the displayed journey cards varied in urgency trigger types (deadline-driven, people-driven, recurrence-driven)? | Extract label/trigger types across displayed cards → check distribution is not dominated by a single trigger type
Pass: Clarity ≥ 90% · Promise Accuracy score ≤ 2 = 0%
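The two distribution checks (Task Category Distribution, Trigger Diversity) reduce to the same dominance test over displayed cards. A minimal sketch, using the 60% threshold from the table (function name hypothetical):

```python
from collections import Counter

def dominated_by_one_type(values, threshold=0.6):
    """True if any single category (action_type or trigger type) exceeds the
    threshold share of displayed cards -- the monoculture condition."""
    counts = Counter(values)
    return max(counts.values()) / len(values) > threshold
```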

Execution & Delivery

5 metrics

After the user clicks, M365 Copilot Chat executes the prompt generated by Rank Agent. Did the response deliver value? Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is applied for debugging — all failures count toward the overall feature verdict.

Sub-metric | What to measure | How | Root-cause notes
Execution Completion Rate | What % of executions completed successfully vs. failed/timed out? | Parse execution traces → compute success rate | Timeout/crash = likely m365_copilot. Malformed prompt = cj_pipeline
Output Correctness | Are factual claims grounded in the user's actual data? | For each claim, verify against source signals. Claim precision = supported / total. Target ≥ 95% | Wrong facts from bad context in prompt = cj_pipeline. Correct context but Copilot hallucinated = m365_copilot
Output Completeness | Does the output cover all key threads from the journey's source signals? | Compare entities/topics in output vs. key entities in source signals | CJ prompt omitted key context = cj_pipeline. Context provided but Copilot ignored = m365_copilot
Actionability | Are there concrete next steps the user can immediately act on? | Count actionable elements: draft ready to send? Link to right doc? Binary: ≥ 1 concrete action? | Prompt didn't request actionable format = cj_pipeline. Requested but not delivered = m365_copilot
Scenario Fit | Is the output format appropriate for the action_type? | Map action_type → expected format (REPLY_EMAIL → email draft, PREPARE_MEETING → agenda). Flag mismatches | CJ prompt specified wrong format = cj_pipeline. Correct request, wrong delivery = m365_copilot
Pass: Completion ≥ 95% · Correctness ≥ 95% (cj_pipeline only) · Effort Reduction avg ≥ 3.5

End-to-End Coherence & Task Resolution

4 metrics

From raw signals to final execution output, does the entire chain hold together?

Sub-metric | What to measure | How
Journey Coverage (Missing Journeys) | Are there important tasks from the input signals that the pipeline failed to surface as journeys? | AI reviews all input signals holistically → identifies task-worthy signal clusters that produced no journey. Cross-references with Grouping and Merge outputs to locate where the signal was lost
Full-chain Topic Alignment | Does the primary topic stay consistent across all pipeline stages? | Extract primary topic at each stage → flag any stage that introduces a different topic or contradicts another
Task Resolution Score | Could the user meaningfully advance their task from this output alone? | Expert/AI rating 1-5. Score 1 = "useless, would start over." Score 5 = "could complete the task right here." The single offline metric most predictive of real-world satisfaction
Recurrence Handling | For RECURRING_WORK journeys, is the system recommending the correct upcoming instance? | Check recurrence metadata → validate instance timing. N/A for NEW_WORK and ONGOING_WORK
Pass: Topic alignment break = 0% · Task Resolution avg ≥ 3.5
Failure Mode Taxonomy

Critical failures are flagged independently from dimension scores. Any critical failure triggers an overall Fail verdict regardless of other scores.

Severity Definitions
Severity | Definition | Action
Ship-Blocker | Trust-destroying or compliance-violating. Zero tolerance — even one instance blocks release. | Blocks ship. Zero-tolerance threshold.
Critical | Fundamental value proposition broken. User sees wrong, fabricated, or misleading content. | Must fix before GA. Threshold: < 2% of evaluated journeys.
High | Meaningful quality gap. User gets some value but with significant friction. | Must be triaged. Acceptable only with documented mitigation plan.
Medium | Noticeable imperfection. User still gets value but experience is suboptimal. | Should fix. Can ship with known issues tracked.
Failure Mode Catalog
Failure Mode | Agent / Stage | Description | Severity
Privacy Violation | Merging | Sensitive / personal content surfaces in journey | Ship-Blocker
PII Leakage | Merging | PII exposed in journey title, summary, or sources | Ship-Blocker
Resurfaced Dismissed Topic | Merging | A manually dismissed journey topic reappears | Ship-Blocker
Hallucinated Journey | Merging | Journey has no traceable source signal — entirely fabricated | Critical
Wrong Source Attribution | Merging | Journey links to incorrect M365 items | Critical
Phantom Task | Merging | Signals are real but the inferred task is not | Critical
Completed Task Surfaced | Merging | Journey recommends a task the user already finished | Critical
Promise-Delivery Gap | Card Presentation | Card promises something the execution output doesn't deliver | Critical
Over-Promise | Card Presentation | Card implies capability the Copilot execution can't support | Critical
Execution Hallucination | Execution | Factual claims not supported by any source signal | Critical
Topic Drift | E2E | Primary topic changes between pipeline stages | High
Missed Source Entirely | Grouping | A data source with obvious task signals produced zero candidates | High
Mis-grouped Items | Grouping | Unrelated items from the same source lumped into one candidate | High
Orphaned Key Item | Grouping | A high-importance item not assigned to any candidate | High
Over-merged Tasks | Merging | Distinct tasks from different sources collapsed into one journey | High
Wrong Lifecycle | Merging | NEW_WORK tagged as ONGOING_WORK or vice versa | High
Scope Mismatch | Merging | Task is too broad or too narrow | High
AI Misfit | Merging | Task is too simple or beyond AI capability | High
Noise Leak | Merging | Low-value / trivial signals produced a journey | High
Stale Journey | Merging | Journey based on signals older than 3-day window | High
Missing Important Journey | E2E | A clearly important task from input signals is not surfaced as any journey | High
Priority Inversion | Ranking | A clearly more urgent journey ranked below a less important one | High
Bad ExecutionPlan | Ranking | Plan is vague, references wrong sources, or describes a different task | High
Missing Key Context | Execution | Execution output omits a major thread the user clearly cares about | High
Wrong Help Type | Execution | Correct topic but wrong output format | High
No Actionability | Execution | Correct information but no concrete next step | High
Wrong Recurrence Instance | E2E | Recurring task surfaces a past instance instead of upcoming one | High
Under-merged Duplicates | Merging | Same task from different sources left as separate journeys | Medium
Monoculture Recommendations | Card Presentation | All displayed journeys are the same action_type or trigger type | Medium
Vague Card | Card Presentation | Title/summary too generic — user can't predict what they'll get | Medium
Redundant Output | Execution | Execution output just restates source data — no synthesis | Medium
Verdict Rules
Condition | Overall Verdict
Any Ship-Blocker failure | Fail — blocks ship regardless of scores
Critical failure rate ≥ 2% | Fail — must fix before GA
Critical < 2% AND ≥ 2 High failures | Conditional — requires mitigation plan
Critical < 2% AND 0-1 High AND all pass criteria met | Pass — ship-ready
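The verdict rules above can be applied mechanically. A minimal sketch (the table leaves the "pass criteria not met" case unstated; this sketch conservatively treats it as Conditional):

```python
def overall_verdict(ship_blockers: int, critical_rate: float,
                    high_failures: int, all_pass_criteria_met: bool) -> str:
    """Apply the verdict rules: any Ship-Blocker fails outright; Critical rate
    >= 2% fails; 2+ High failures require a mitigation plan."""
    if ship_blockers > 0:
        return "Fail"          # blocks ship regardless of scores
    if critical_rate >= 0.02:
        return "Fail"          # must fix before GA
    if high_failures >= 2:
        return "Conditional"   # requires documented mitigation plan
    return "Pass" if all_pass_criteria_met else "Conditional"
```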

Step 2 — Build Eval Dataset

Coming Soon

Dataset construction details

Step 3 — Generate Outputs

Coming Soon

Output generation and trace capture details

Step 4 — Human Labeling & Calibration

Coming Soon

Annotation workflow and golden set details

Step 5 — Build Automated Judge

Coming Soon

LLM-as-Judge calibration and rule-based checks

Step 6 — Eval + GEPA Optimization

Overview
What is GEPA?

GEPA (Genetic-Pareto) is a reflective prompt evolution method. Instead of brute-force RL search, it lets a strong model review failures → analyze root causes → surgically rewrite prompts, managed through a genetic evolution framework with Pareto selection to avoid single-dimension overfitting. Children are added to the candidate pool unconditionally — natural selection happens through Pareto frontier recomputation, not explicit accept/reject gates.

Reference: Reflective Prompt Evolution Can Outperform Reinforcement Learning (ICLR 2026)

Two Datasets
Training Set (D_train)

The dual-purpose workhorse. Serves both roles during GEPA optimization:

  • Scoring: All/most cases scored per round to compute Pareto frontiers
  • Feedback sampling: Random batch sampled each round for failure trace collection & reflection
Hold-out Test Set

The sealed final exam. Completely isolated from optimization. Only used twice: baseline (6a) and final validation (6e).

Never seen during optimization — ensures results generalize beyond training data
End-to-End Flow
6a
Baseline Evaluation
6b
Per-Agent GEPA Loop
6c
Per-Agent Convergence & Regression
6d
Cross-Agent GEPA Loop + Convergence
6e
Final Validation
6f
Continuous Regression & Monitoring
6a. Baseline Evaluation

Run the Hold-out Test Set against all current pipeline prompts. Record baseline scores across all 40 metrics.

  • This is the test set's first use
  • Output: per-agent, per-metric baseline scores — the anchor for final comparison
6b. Per-Agent GEPA Loop
Premise
  • Lock the model — Commercial Journeys runs on Sydney. GEPA optimizes prompts only, no model switching.
  • One agent at a time — freeze other agents' outputs as control variables.
  • Order: Upstream → Downstream — Grouping → Merging → Ranking. Upstream quality sets the ceiling for downstream; optimize inputs first.
Commercial Journeys Agent Configuration
Agent | Seed Prompt | Frozen Context | Eval Metrics
Grouping | Per-source current prompt | Raw data signals | Coherence, Completeness, Coverage
Merging | Current Merging prompt | Optimized Grouping outputs | Pass A (Fusion) + B (Task) + C (Filter) — 17 metrics
Ranking | Current Ranking prompt | Optimized Merge outputs | Pass A Ranking: Priority Signal Alignment, Missing/Under-Ranked Detection, Ranking Stability · Pass B ExecutionPlan: Quality, Journey Alignment
µ_f covers both passes — the Pareto frontier naturally preserves candidates excelling in either

Pareto dimensions = each training case's aggregate score (not per-metric). Too many dimensions would make almost every prompt non-dominated, defeating Pareto selection.

The GEPA Cycle (Per-Agent)
Initialize
Seed Prompt
Start with 1 prompt (P₀) — the agent's current production prompt. The candidate pool grows each round as children are unconditionally added: [P₀, P₁, P₂, …]
Step 1 — Score & Select
D_train Scoring → Pareto Frontier → Parent
Score every candidate on all/most cases in D_train using the LLM Judge (from Step 5). Compute the Pareto frontier — prompts that aren't fully dominated by any other.
Prompt | Case 1 | Case 2 | Case 3 | Status
P₀ | 0.9 | 0.4 | 0.7 | Frontier
P₁ | 0.6 | 0.8 | 0.7 | Frontier
P₂ | 0.7 | 0.7 | 0.9 | Frontier
P₃ | 0.5 | 0.5 | 0.5 | Dominated
Select parent(s) from frontier — two strategies used in alternation:
Targeted Mutation (1 parent)
Default mode. Pick the least-selected frontier prompt. Use when there are clear failure patterns to fix — the Reflection LLM surgically rewrites based on error analysis.
Semantic Crossover (2 parents)
Triggered when 2+ frontier prompts have complementary strengths on different dimensions. The LLM fuses both parents' advantages into one child — e.g. P₁ excels at fusion, P₂ at lifecycle → child inherits both.

Dominated prompts (like P₃) remain in the pool — they may return to the frontier if future children shift the landscape. Natural selection, not manual pruning.
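Pareto frontier computation over per-case scores is mechanical. A minimal sketch reproducing the example table's outcome (structure hypothetical):

```python
def dominates(a, b):
    """a dominates b when a >= b on every case and strictly > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores):
    """scores: {prompt_id: [per-case scores]} -> ids not dominated by any other."""
    return [p for p, s in scores.items()
            if not any(dominates(t, s) for q, t in scores.items() if q != p)]
```

P₃ is excluded because P₂ beats it on every case; P₀, P₁, and P₂ each win on at least one case, so all three survive.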

Step 2 — Expose & Reflect
D_train Sampling → Feedback Function µ_f → Reflection + Ancestry
Randomly sample a batch from the same D_train. Run the selected parent (P₀ in this example) on these cases. Collect structured failure feedback via the Feedback Function µ_f, then feed it to a Reflection LLM along with P₀'s ancestry lessons.
Feedback Function µ_f (distinct from scoring µ)

The scoring metric µ returns a scalar (0–1) for Pareto computation. The feedback function µ_f returns structured natural language — this is what makes reflection surgical rather than guesswork.

µ_f output example (Merge agent):
Case 7 | Score: 0.3 | Failure: Two emails about Q3 review + one calendar event point to the same deliverable, but Merge output treats them as independent journeys. Root cause: fusion rule lacks cross-source deliverable matching.

Reflection quality depends on µ_f richness: scalar scores alone → reflection guesses; structured failure traces → reflection does targeted surgery.
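A sketch of how a µ_f record might be structured so the scalar score and the failure trace travel together (field names are illustrative, not from the production system):

```python
from dataclasses import dataclass

@dataclass
class FeedbackTrace:
    """One mu_f record: the scalar mu score (for Pareto computation) plus the
    structured failure text the Reflection LLM consumes."""
    case_id: int
    score: float       # scalar mu in [0, 1]
    failure: str       # what went wrong, in natural language
    root_cause: str    # which prompt rule/behavior caused it

    def render(self) -> str:
        """Format one trace line for the Reflection LLM."""
        return (f"Case {self.case_id} | Score: {self.score} | "
                f"Failure: {self.failure} Root cause: {self.root_cause}")
```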
Prompt Ancestry & Lesson Accumulation

Every prompt carries an ancestry chain — who its parent was, and what lesson each mutation learned. This prevents "fix A, break B, fix B, break A" oscillation.

P₅'s ancestry (example):
P₀ (seed) → P₂ (lesson: "added cross-source matching but broke lifecycle filtering") → P₅ (lesson: "restored lifecycle rules, scoped cross-source to same-week window")

When mutating P₅, the Reflection LLM sees all accumulated lessons and avoids repeating past mistakes.
Reflection LLM input: current prompt + µ_f failure traces + ancestry lessons

"P₀ is too conservative in cross-source fusion — frequently under-merges.
  Ancestry note: P₀'s grandparent tried aggressive fusion but broke lifecycle filtering (see lesson from round 3).
  Suggest: 'Merge cross-source items sharing the same deliverable within a 7-day window, preserving lifecycle boundaries.'"

Scoring and feedback come from the same D_train — this ensures reflection targets align with Pareto selection criteria.

Step 3 — Generate Child & Add to Pool
Apply the selected strategy (Mutation or Crossover) → Unconditional Pool Entry
Using the strategy chosen in Step 1:
If Targeted Mutation:
Reflection LLM rewrites parent P₀ based on failure analysis (Step 2) + ancestry lessons → child P₄
If Semantic Crossover:
LLM identifies each parent's complementary strengths, then generates a child that semantically fuses both — understanding why each works and synthesizing a unified prompt
P₄ joins the pool unconditionally → [P₀, P₁, P₂, P₃, P₄]
No accept/reject gate. If P₄ is weak, it'll be naturally dominated in the next round's Pareto computation. But it might have unique strengths on specific cases that preserve diversity.
P₄ inherits ancestry from its parent(s) + adds the current round's lesson.
Step 4 — Loop or Converge
Back to Step 1 — until convergence
Next round: re-score all candidates (including new child), recompute Pareto frontier, select new parent… Stop when any convergence signal fires:
Iteration cap reached (e.g. 15 rounds)
Best score plateaus: < 1% change for N consecutive rounds
Frontier stops changing: same prompts on frontier for N rounds
All pass criteria already met

Convergence ≠ global optimum. It means "no further gains under current search strategy." A fundamentally different prompt approach might score higher, but GEPA's mutation path didn't explore it. This is why we validate with the Hold-out Test Set later (6e).
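The four stop signals can be checked in one place. A sketch with illustrative thresholds (15-round cap, 1% plateau tolerance, 3-round windows):

```python
def converged(round_idx, best_history, frontier_history,
              all_criteria_met, max_rounds=15, window=3, eps=0.01):
    """Fires when any signal holds: iteration cap, all pass criteria met,
    best-score plateau (< eps relative change over `window` rounds), or a
    frontier unchanged for `window` rounds."""
    if round_idx >= max_rounds or all_criteria_met:
        return True
    if len(best_history) > window:
        old = best_history[-window - 1]
        if old > 0 and all(abs(b - old) / old < eps for b in best_history[-window:]):
            return True
    if len(frontier_history) > window:
        ref = set(frontier_history[-1])
        if all(set(f) == ref for f in frontier_history[-window - 1:]):
            return True
    return False
```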

6c. Per-Agent Convergence & Regression
Output: Top-N Candidates

Only after convergence (any of the 4 stop signals in 6b Step 4 fires) do we select Top-N from the final Pareto frontier. Rank by average score across all training cases:

Prompt | V1 | V2 | V3 | Avg
P₂ | 0.7 | 0.7 | 0.9 | 0.77
P₅ | 0.8 | 0.6 | 0.8 | 0.73
P₁ | 0.6 | 0.8 | 0.7 | 0.70
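Top-N selection from the final frontier is a sort by mean training-case score:

```python
def top_n_by_avg(scores, n=3):
    """scores: {prompt_id: [per-case scores]} -> Top-N ids by mean, descending."""
    return sorted(scores, key=lambda p: sum(scores[p]) / len(scores[p]),
                  reverse=True)[:n]
```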
Sanity Check: E2E Integration

Since we optimize upstream → downstream and freeze other agents, per-agent regression is unlikely within the same optimization round. However, before moving to the next agent, run a quick E2E integration check to confirm the new prompt works correctly when wired into the full pipeline:

Pipeline works as expected
→ Lock this agent's result
→ Move to next agent
Unexpected interaction
→ Investigate and re-adapt if needed
→ Then continue

The real cross-agent validation happens in 6d, where all prompt combinations are evaluated end-to-end.

Commercial Journeys Per-Agent Outputs
Agent | GEPA Output | Sent to 6d
Grouping | Top-1 per source (N independent sources) | N prompts (fixed)
Merging | Pareto frontier → Top-3 by avg score | [M₁, M₂, M₃]
Ranking | Pareto frontier → Top-3 by avg score | [R₁, R₂, R₃]
6d. Cross-Agent GEPA Loop + Convergence

Why this step? Per-agent GEPA (6b) optimizes each agent in isolation. But in production all prompts interact — a prompt that's #1 in isolation may not be the best partner for another agent's top pick. Cross-Agent GEPA treats complete prompt suites as candidates and continues evolving them end-to-end.

Per-Agent vs Cross-Agent GEPA
Dimension | Per-Agent (6b) | Cross-Agent (6d)
Candidate | Single prompt | Complete prompt suite (Grouping + Merging + Ranking)
Rollout | Single agent output | E2E full pipeline
Reflection | Only current agent's failures | E2E trace with cross-agent root cause attribution
Mutation | Rewrite current agent's prompt | Rewrite whichever agent's prompt the reflection identifies as the root cause
Value | Isolated optimization, reduced search space | Discovers agent interaction issues, closer to a true global optimum

Per-Agent is the warm start; Cross-Agent is the fine-tune. Not "per-agent does optimization, cross-agent just picks" — cross-agent continues active evolution.

Seed Pool

Assemble initial prompt suites from per-agent Top-N results:

  • Grouping: Top-1 per source → N prompts (fixed across all suites)
  • Merging: Top-3 from per-agent Pareto → [M₁, M₂, M₃]
  • Ranking: Top-3 from per-agent Pareto → [R₁, R₂, R₃]
  • Initial seed pool: 3 × 3 = 9 prompt suites
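Assembling the 3 × 3 = 9 seed suites is a Cartesian product over the per-agent Top-3 lists, with the Grouping prompt set held fixed (suite structure hypothetical):

```python
from itertools import product

def seed_pool(grouping_prompts, merge_top3, rank_top3):
    """Fixed Grouping prompt set x Top-3 Merging x Top-3 Ranking -> 9 suites."""
    return [{"grouping": grouping_prompts, "merging": m, "ranking": r}
            for m, r in product(merge_top3, rank_top3)]
```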
The GEPA Cycle (Cross-Agent)

Same GEPA loop as 6b (Score → Reflect → Mutate → Converge), but with two key differences:

Step 1 — Score & Select
E2E Scoring with Two-Layer Strategy
Each candidate is now a full prompt suite. Run the complete pipeline (Grouping → Merging → Ranking → Card) on D_train.
Layer 1 — Card Metrics (every round)
Evaluate pipeline output up to card presentation. Cheap — no M365 API calls. Drives Pareto selection and daily evolution.
Layer 2 — Execution Output (every N rounds or at convergence)
Send ExecutionPlan to M365 Copilot Chat, evaluate full output quality. Expensive but essential — injects output-level signals back into the GEPA loop so reflection can see Copilot execution issues too.

Layer 2 is not just a "post-hoc validation" — it periodically feeds output-level failure traces back into the loop, enabling reflection to target execution-layer problems.
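The two-layer cadence reduces to a simple schedule (the interval n is illustrative):

```python
def run_layer2(round_idx: int, at_convergence: bool, n: int = 5) -> bool:
    """Layer 1 (card metrics) runs every round; Layer 2 (Copilot execution)
    only every n rounds or at convergence, since it is expensive."""
    return at_convergence or (round_idx > 0 and round_idx % n == 0)
```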

Step 2 — Reflect & Attribute
E2E Failure Trace → Cross-Agent Root Cause Attribution
Unlike per-agent reflection, µ_f now traces failures across the entire pipeline. The Reflection LLM identifies which agent is the root cause:
E2E failure trace example:
Suite [M₂+R₁] | Case 5 | E2E Score: 0.4 | Card looks correct, but Copilot execution output misses key context → Root cause: Ranking Agent's ExecutionPlan omitted meeting notes from source signals

→ Reflection targets Ranking Agent's prompt specifically, even though we're optimizing the full suite.
Step 3 — Mutate the Right Agent
Targeted mutation on the root-cause agent's prompt within the suite
The mutation targets whichever agent the reflection identified. The rest of the suite stays frozen. The child is a new complete suite with one agent's prompt rewritten.

Child suite joins the pool unconditionally, same as per-agent GEPA.
Step 4 — Loop or Converge
Same convergence criteria as 6b — applied to prompt suites
Stop when any convergence signal fires (iteration cap, score plateau, frontier stable, all pass criteria met). On convergence, run a final Layer 2 evaluation on the winning suite.
6e. Final Validation

Run the Hold-out Test Set against the winning prompt suite from 6d. Compare against 6a baseline.

Test ≈ D_train scores
No overfitting. Results generalize.
→ Ship it
Test ≪ D_train scores
Prompts overfit to training data
→ Revisit dataset or strategy
  • This is the test set's second and final use in this optimization round
  • If another optimization round is needed → refresh the hold-out set with ≥ 30% new test cases
6f. Continuous Regression & Monitoring

After shipping the optimized prompt suite, integrate eval into CI/CD. Re-trigger the full eval + GEPA cycle when:

Prompt Change

Any manual edit to a pipeline prompt (Grouping, Merge, or Rank)

Model Update

Sydney model version update or provider-side changes

Pipeline Logic Change

Changes to pipeline orchestration, data ingestion, or gate conditions

Score Regression

CI/CD monitoring detects eval score drops below thresholds

New Data Sources

New signal sources added to the pipeline (e.g. new M365 connector)

Periodic Cadence

Scheduled re-evaluation (e.g. monthly or quarterly) to catch gradual drift