Microsoft Edge · Commercial Journeys

What is Commercial Journeys?

Commercial Journeys is a new AI-powered feature in Microsoft Edge that proactively surfaces work tasks for commercial (AAD) users on the Edge Copilot New Tab Page (NTP), based on the user's browser and M365 activity. It identifies what the user needs to work on and helps them get it done faster by connecting to M365 Copilot Chat for execution.

Machine Eval + GEPA Iteration

A systematic framework for evaluating and optimizing the Commercial Journeys pipeline through automated evaluation and reflective prompt evolution.

The Six-Step Framework

Step 1 — Define Goals & Metrics · Draft Done
  • 42 metrics across 3 pipeline agents + end-to-end quality
  • Failure severity: Ship-Blocker (0 tolerance) → Critical (<2%) → High → Medium
  • Every failure attributed to root cause: pipeline vs Copilot vs ambiguous

Step 2 — Build Eval Dataset · Coming Soon
  • Real or LLM-simulated M365 and browser signals
  • Split into D_train (GEPA scoring + feedback) and Hold-out Test (sealed, max 2 uses)
  • Each case = input signals + ground truth

Step 3 — Generate Pipeline Outputs · Coming Soon
  • Run full pipeline: Grouping → Merge → Rank → Card → Execution
  • Capture per-agent intermediate outputs, not just final cards
  • Log metadata: latency, token count, prompt version, model version

Step 4 — Human Annotation & Calibration · Coming Soon
  • Bootstrap from Selfhost Manual Eval scores as golden set baseline
  • Expert annotation for machine-eval metrics: source attribution, lifecycle, fusion
  • Golden set = input + output + per-metric scores + scoring rationale

Step 5 — Build Automated Judge · Coming Soon
  • LLM Judge + rule-based checks, calibrated against human labels (Cohen's κ ≥ 0.7)
  • Dual role: formal eval scorer + GEPA fitness function
  • Outputs score (0–1) for ranking + structured failure feedback for reflection

Step 6 — Eval + GEPA Optimization · Mechanism Draft
  • Baseline → Per-agent GEPA (Grouping → Merge → Rank) → Cross-agent GEPA
  • Genetic-Pareto evolution: reflect on failures, mutate prompts, Pareto natural selection
  • Final validation on hold-out set; ship if no overfitting

Step 1 — Define Goals & Metrics

Overview

Commercial Journeys is a multi-agent pipeline. Evaluation operates at two levels: per-agent output (fault isolation) and end-to-end output (holistic quality).

At a glance: 42 total metrics · 2 eval levels · 27 failure modes · 8 eval dimensions
Pipeline Architecture

M365 Data + Browser History → Grouping Agent ×N (3 metrics) → Merge Agent (16 metrics, 3 passes) → Rank Agent (10 metrics) → NTP Cards + Copilot Execution
Evaluation Conventions

N/A Handling: Conditional metrics (e.g., Recurrence Handling, Incremental Update, Dismissed Suppression) are marked N/A when preconditions are not met. N/A judgments do not count toward pass/fail. Tracked as n/a_rate per metric — target: N/A rate < 40%.
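To make the convention concrete, here is a minimal Python sketch of the per-metric tally, assuming each judgment is recorded as 'pass', 'fail', or 'n/a' (the encoding is illustrative, not the production schema):

```python
from collections import Counter

def metric_stats(results: list[str]) -> dict:
    """Summarize one metric's judgments ('pass' / 'fail' / 'n/a').

    N/A judgments do not count toward pass/fail; they are tracked
    separately via n/a_rate (target: < 40%).
    """
    counts = Counter(results)
    judged = counts["pass"] + counts["fail"]   # N/A excluded from the verdict
    n_total = len(results)
    return {
        "pass_rate": counts["pass"] / judged if judged else None,
        "n/a_rate": counts["n/a"] / n_total if n_total else 0.0,
        "n/a_alert": n_total > 0 and counts["n/a"] / n_total >= 0.40,
    }

# metric_stats(["pass", "n/a", "pass", "fail"]) -> pass_rate 0.67, n/a_rate 0.25
```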

End-to-End Ownership: We evaluate the entire feature end-to-end — from signal ingestion through Copilot execution output. Even though the final execution is handled by M365 Copilot Chat, the overall experience is what matters. If the feature delivers poor results for any reason (bad prompt design, poor execution quality, or both), it reflects on the feature as a whole. Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is used for debugging and prioritization, not for excluding failures from the overall verdict.

Part 1: Per-Agent Output Evaluation

Grouping Agent Output

3 metrics

Does each data source's clustering correctly identify task candidates from raw signals? Runs independently per source — Email, Teams, Calendar, Meeting Recaps, Files, Browser History.

| Sub-metric | What to measure | How |
| --- | --- | --- |
| Intra-source Grouping Coherence | Within a single data source, are all items in one cluster truly about the same task? | AI reviews each cluster's item set → judges whether semantically unified. E.g., two unrelated email threads lumped into one candidate = failure |
| Intra-source Grouping Completeness | Within a single data source, are all items related to the same task assigned to the same cluster? | AI scans unclustered items → judges whether they should belong to an existing cluster. E.g., a key email about "Q3 Budget" left as an orphan while other Q3 Budget emails are clustered = failure |
| Per-source Coverage | Did the Grouping Agent produce candidates from data sources that clearly contain task signals? | Compare active data sources vs. data sources that produced zero candidates → flag sources with obvious signals that were entirely missed |
Pass: Coherence error < 5% · Completeness error < 5%

Merge Agent Output

16 metrics · 3 passes

Are candidates from different data sources correctly fused into complete, well-defined journeys? The most complex step — 16 metrics split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.

Pass A — Fusion Quality
| Sub-metric | What to measure | How | Boundary definition |
| --- | --- | --- | --- |
| Cross-source Fusion Accuracy (Over-merging) | Were candidates from different sources that describe distinct tasks incorrectly merged? | AI reviews merged journey vs. contributing candidates → flags cases where semantically different tasks were collapsed | Two tasks are "distinct" when they have different owners, different deadlines, or produce different deliverables — even if they share a topic umbrella |
| Cross-source Fusion Accuracy (Under-merging) | Were candidates from different sources that describe the same task left as separate journeys? | AI clusters all output journeys by semantic similarity → flags near-duplicates from different sources | Two journeys are the "same task" when the user would describe them as one item on their to-do list |
| Source Attribution Precision | Are the linked source items correct — no wrong links? | Check each attributed source → precision = correct sources / total attributed sources | A source is "correct" when its content is directly relevant to the journey's task definition. Tangentially related items = incorrect |
| Source Attribution Recall | Are the most important source signals included? | Check whether high-signal items are attributed → recall = included key sources / total key sources | Key source = (a) primary task trigger, (b) any source with deadline info, (c) any source from task owner's direct manager or above |
| Source Role Assignment | Are data sources assigned correct roles? (Email=trigger, Calendar=deadline, Browser=execution evidence, Files=context) | Extract source roles → AI validates role logic against specialized prompt definitions | N/A when journey has only 1 contributing source |
Pass A: Over-merge < 3% · Under-merge < 5% · Precision ≥ 95% · Recall ≥ 90%
Pass B — Task Definition
| Sub-metric | What to measure | How | Boundary definition |
| --- | --- | --- | --- |
| Task Accuracy | Is this a real task the user actually needs to complete? | AI evaluates: given the contributing source signals, is the generated task something the user would recognize as "yes, I need to do this"? | "Real task" = user was explicitly assigned, committed, or is the clear owner. "Not real" = AI inferred a task from peripheral signals (e.g., CC'd on FYI email → AI assumes user must act) |
| Task Granularity | Is the task scoped to a specific, actionable item? | AI evaluates scope. Score 1-5. | Score 5 = single clear deliverable. Score 3 = reasonable but could be more specific. Score 1 = "do your job" broad or trivially narrow |
| AI Actionability | Does this task genuinely benefit from AI help? | AI classifies: (a) Too simple — user can do faster without AI. (b) AI-appropriate. (c) Beyond AI capability. Flag (a) and (c) | "Too simple" = single click, one-line reply. "Beyond AI" = requires physical presence, confidential judgment, or inaccessible systems |
| Gate Condition Compliance | Does each journey pass the 7 gating conditions? | Programmatically check: IS_CONCRETE_TASK, IS_WORK_RELATED, HAS_USER_RELEVANCE, IS_NOT_COMPLETED, HAS_MEANINGFUL_COMPLEXITY, REQUIRES_BROWSER, IS_AI_EXECUTABLE | IS_NOT_COMPLETED: if user sent "done"/"completed"/"attached" AND no subsequent follow-up → completed. Ambiguous = not completed (benefit of doubt) |
| Hallucination Detection | Does the journey have zero traceable source signals — entirely fabricated? | Cross-reference journey description against all contributing candidates and raw signals → flag journeys with no traceable source | "Hallucination" = core task description has NO supporting evidence in ANY input source. "Reasonable inference" from strong multi-source signals ≠ hallucination |
Pass B: Task Accuracy ≥ 90% · Hallucination rate = 0%
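The gate-condition check in Pass B is the one metric that can be enforced programmatically. A minimal sketch, assuming each journey carries boolean flags named after the seven gates (a hypothetical schema):

```python
GATE_CONDITIONS = [
    "IS_CONCRETE_TASK", "IS_WORK_RELATED", "HAS_USER_RELEVANCE",
    "IS_NOT_COMPLETED", "HAS_MEANINGFUL_COMPLEXITY",
    "REQUIRES_BROWSER", "IS_AI_EXECUTABLE",
]

def failed_gates(journey: dict) -> list[str]:
    """Return the gate conditions a journey fails; empty list = compliant.

    Missing flags are treated as failures so that an incomplete journey
    record can never silently pass the gate check.
    """
    return [gate for gate in GATE_CONDITIONS if not journey.get(gate, False)]
```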
Pass C — Filtering & Lifecycle
| Sub-metric | What to measure | How | Boundary definition |
| --- | --- | --- | --- |
| Lifecycle Accuracy | Is the lifecycle state (NEW_WORK / ONGOING_WORK / RECURRING_WORK) correctly assigned? | For ONGOING: verify match to existing journey + continued activity. For NEW: verify no existing journey covers this task | N/A for first-run evaluations where no journey history exists |
| Incremental Update Fidelity | When updating an existing journey, is the original context preserved while new info is correctly incorporated? | Compare updated journey against previous version + new signals → flag lost context or incorrect overwrites | N/A when journey is NEW_WORK with no prior version |
| Dismissed Topic Suppression | Are manually dismissed journeys correctly excluded from regeneration? | Cross-reference output against dismissed titles list → flag resurfaced dismissed topics | N/A when no dismiss history exists |
| Privacy & Sensitivity Filtering | Are sensitive topics correctly blocked? Any PII leakage? | PII scan (regex + NER) across all journey titles, summaries, and source references + personal-vs-work classification | Any PII (email addresses in title, SSN, credit card) = ship-blocker. Personal-but-not-PII (e.g., "dentist appointment") = High |
| Staleness Rejection | Are journeys based on signals older than the 3-day freshness window filtered out? | Check source signal timestamps → flag journeys based entirely on expired signals | "Entirely expired" = ALL contributing sources >3 days old AND no new signals merged in |
| Noise Rejection | Are low-value / trivial signals correctly suppressed? | Classify signal value → flag low-value signals that produced journeys | "Noise" = auto-generated system emails, all-day calendar placeholders, newsletters. "Not noise" = routine meetings if user is organizer or has action items |
Pass C: Lifecycle ≥ 90% · Privacy failure = 0% (ship-blocker) · Noise pass-through < 5%
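The regex layer of the Privacy & Sensitivity check could look like the sketch below; the patterns are illustrative placeholders, and the NER pass and personal-vs-work classifier are omitted:

```python
import re

# Coarse illustrative patterns -- a production scan would pair these with NER.
PII_PATTERNS = {
    "email_address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b"),
}

def scan_pii(journey_text: str) -> list[str]:
    """Return PII categories found in a journey's title, summary, or sources.

    Any non-empty result is a ship-blocker under the severity definitions.
    """
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(journey_text)]
```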

Rank Agent Output

10 metrics

Are journeys correctly prioritized, and do the top-N come with high-quality execution plans? Uses dual scoring: rule-based (40%) + model scoring (60%).

| Sub-metric | What to measure | How |
| --- | --- | --- |
| Priority Signal Alignment | Do top-ranked journeys genuinely have the highest urgency/importance? | Extract rule-based + model scores → verify top-N score highest on timeliness, stakeholder importance, ownership strength |
| Recall@3 (Top-3 Hit Rate) | Among all active journeys, how many of the human-labeled "ideal Top-3" are captured by the system's Top-3? | Golden set includes human-annotated ideal Top-3. Recall@3 = (# human Top-3 in system Top-3) / 3 |
| Time Sensitivity Accuracy | For time-driven journeys, is the urgency score correctly calculated? | Compare urgency score against actual calendar/email timestamps |
| Ranking Stability | Do rankings change erratically between sessions without new signals? | Compare logs across sessions → flag unexplained rank changes |
| Task Category Distribution | Are the top-N spread across different task types or all clustered in one? | Classify by action_type → compute entropy / top-1 concentration ratio |
| Data Source Coverage | Are the top-N drawing from multiple M365 sources or over-reliant on one? | Tag contributing sources per journey → measure source distribution |
| Temporal Trigger Diversity | Are urgency triggers varied (deadline-driven, people-driven, recurrence-driven)? | Extract label types → check distribution across top-N |
| Cross-user Consistency | Is low diversity or ranking quality systemic or isolated? | Aggregate per-user scores → flag if >50% show same pattern |
| ExecutionPlan Quality | Is the generated ExecutionPlan well-structured, specific, and executable by Copilot? | AI evaluates: task clarity, context inclusion, output format, scope. Score 1-5 |
| ExecutionPlan-Journey Alignment | Does the ExecutionPlan match the journey's task definition? | Compare plan objective against journey title/summary → flag mismatches |
Pass: Top-3 inversion < 10% · Recall@3 ≥ 0.6 · No single action_type > 60% · ExecutionPlan avg ≥ 3.5
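Two of these metrics reduce to simple formulas. A sketch of the dual score (weights taken from the 40/60 split above) and Recall@3 as defined in the table:

```python
def priority_score(rule_score: float, model_score: float) -> float:
    """Rank Agent dual scoring: rule-based 40% + model scoring 60%."""
    return 0.4 * rule_score + 0.6 * model_score

def recall_at_3(system_top3: set[str], human_top3: set[str]) -> float:
    """Recall@3 = (# human-labeled ideal Top-3 in system Top-3) / 3."""
    return len(system_top3 & human_top3) / 3
```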
Part 2: Final Output Evaluation

Card Presentation Quality

4 metrics

Does the card earn the click — and set honest expectations for what's behind it?

| Sub-metric | What to measure | How |
| --- | --- | --- |
| Brief Clarity | Is the task brief (title + summary) specific, actionable, and immediately understood? | AI evaluates: specificity, actionability, grammar, length. Binary: "Can you predict what happens if you click?" |
| Label Accuracy | Does the label (type + name) correctly reflect the actual urgency trigger? | Cross-reference label against source signals. E.g., "Asked by VP Chen" — verify VP Chen actually made the request |
| Icon Relevance | Does the action_type icon match the actual task? | Map action_type to journey content → flag mismatches |
| Promise Accuracy | Does the card accurately represent what the Copilot execution will deliver? | Compare card title + summary vs. actual execution output → score promise-delivery alignment 1-5. Score ≤ 2 = critical failure |
Pass: Clarity ≥ 90% · Promise Accuracy score ≤ 2 = 0%

Execution & Output Quality

6 metrics

After the user clicks "Open in Copilot," did the response actually deliver value? Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is applied for debugging — all failures count toward the overall feature verdict.

| Sub-metric | What to measure | How | Root-cause notes |
| --- | --- | --- | --- |
| Execution Completion Rate | What % of executions completed successfully vs. failed/timed out? | Parse execution traces → compute success rate | Timeout/crash = likely m365_copilot. Malformed prompt = cj_pipeline |
| Output Correctness | Are factual claims grounded in the user's actual data? | For each claim, verify against source signals. Claim precision = supported / total. Target ≥ 95% | Wrong facts from bad context in prompt = cj_pipeline. Correct context but Copilot hallucinated = m365_copilot |
| Output Completeness | Does the output cover all key threads from the journey's source signals? | Compare entities/topics in output vs. key entities in source signals | CJ prompt omitted key context = cj_pipeline. Context provided but Copilot ignored = m365_copilot |
| Effort Reduction | Does the output synthesize and organize beyond raw data? | AI rating 1-5. Score 1 = "just restated what I already know." Score 5 = "saves me 20+ min" | Typically m365_copilot |
| Actionability | Are there concrete next steps the user can immediately act on? | Count actionable elements: draft ready to send? Link to right doc? Binary: ≥ 1 concrete action? | Prompt didn't request actionable format = cj_pipeline. Requested but not delivered = m365_copilot |
| Scenario Fit | Is the output format appropriate for the action_type? | Map action_type → expected format (REPLY_EMAIL → email draft, PREPARE_MEETING → agenda). Flag mismatches | CJ prompt specified wrong format = cj_pipeline. Correct request, wrong delivery = m365_copilot |
Pass: Completion ≥ 95% · Correctness ≥ 95% (cj_pipeline only) · Effort Reduction avg ≥ 3.5

End-to-End Coherence & Task Resolution

3 metrics

From raw signals to final output, does the entire chain hold together?

| Sub-metric | What to measure | How |
| --- | --- | --- |
| Full-chain Topic Alignment | Does the primary topic stay consistent across all pipeline stages? | Extract primary topic at each stage → flag any stage that introduces a different topic or contradicts another |
| Task Resolution Score | Could the user meaningfully advance their task from this output alone? | Expert/AI rating 1-5. Score 1 = "useless, would start over." Score 5 = "could complete the task right here." The single offline metric most predictive of real-world satisfaction |
| Recurrence Handling | For RECURRING_WORK journeys, is the system recommending the correct upcoming instance? | Check recurrence metadata → validate instance timing. N/A for NEW_WORK and ONGOING_WORK |
Pass: Topic alignment break = 0% · Task Resolution avg ≥ 3.5
Failure Mode Taxonomy

Critical failures are flagged independently from dimension scores. Any critical failure triggers an overall Fail verdict regardless of other scores.

Severity Definitions
| Severity | Definition | Action |
| --- | --- | --- |
| Ship-Blocker | Trust-destroying or compliance-violating. Zero tolerance — even one instance blocks release. | Blocks ship. Zero-tolerance threshold. |
| Critical | Fundamental value proposition broken. User sees wrong, fabricated, or misleading content. | Must fix before GA. Threshold: < 2% of evaluated journeys. |
| High | Meaningful quality gap. User gets some value but with significant friction. | Must be triaged. Acceptable only with documented mitigation plan. |
| Medium | Noticeable imperfection. User still gets value but experience is suboptimal. | Should fix. Can ship with known issues tracked. |
Failure Mode Catalog
| Failure Mode | Agent / Stage | Description | Severity |
| --- | --- | --- | --- |
| Privacy Violation | Merge | Sensitive / personal content surfaces in journey | Ship-Blocker |
| PII Leakage | Merge | PII exposed in journey title, summary, or sources | Ship-Blocker |
| Resurfaced Dismissed Topic | Merge | A manually dismissed journey topic reappears | Ship-Blocker |
| Hallucinated Journey | Merge | Journey has no traceable source signal — entirely fabricated | Critical |
| Wrong Source Attribution | Merge | Journey links to incorrect M365 items | Critical |
| Phantom Task | Merge | Signals are real but the inferred task is not | Critical |
| Completed Task Surfaced | Merge | Journey recommends a task the user already finished | Critical |
| Promise-Delivery Gap | Output | Card promises something the execution output doesn't deliver | Critical |
| Over-Promise | Output | Card implies capability the Copilot execution can't support | Critical |
| Output Hallucination | Output | Factual claims not supported by any source signal | Critical |
| Topic Drift | Output | Primary topic changes between pipeline stages | High |
| Missed Source Entirely | Grouping | A data source with obvious task signals produced zero candidates | High |
| Mis-grouped Items | Grouping | Unrelated items from the same source lumped into one candidate | High |
| Orphaned Key Item | Grouping | A high-importance item not assigned to any candidate | High |
| Over-merged Tasks | Merge | Distinct tasks from different sources collapsed into one journey | High |
| Wrong Lifecycle | Merge | NEW_WORK tagged as ONGOING_WORK or vice versa | High |
| Scope Mismatch | Merge | Task is too broad or too narrow | High |
| AI Misfit | Merge | Task is too simple or beyond AI capability | High |
| Noise Leak | Merge | Low-value / trivial signals produced a journey | High |
| Stale Journey | Merge | Journey based on signals older than 3-day window | High |
| Priority Inversion | Rank | A clearly more urgent journey ranked below a less important one | High |
| Bad ExecutionPlan | Rank | Plan is vague, references wrong sources, or describes a different task | High |
| Missing Key Context | Output | Output omits a major thread the user clearly cares about | High |
| Wrong Help Type | Output | Correct topic but wrong output format | High |
| No Actionability | Output | Correct information but no concrete next step | High |
| Wrong Recurrence Instance | Output | Recurring task surfaces a past instance instead of upcoming one | High |
| Under-merged Duplicates | Merge | Same task from different sources left as separate journeys | Medium |
| Monoculture Recommendations | Rank | All top-N journeys are the same action_type or trigger type | Medium |
| Vague Card | Output | Title/summary too generic — user can't predict what they'll get | Medium |
| Redundant Output | Output | Output just restates source data — no synthesis | Medium |
Verdict Rules
| Condition | Overall Verdict |
| --- | --- |
| Any Ship-Blocker failure | Fail — blocks ship regardless of scores |
| Any Critical failure rate ≥ 2% | Fail — must fix before GA |
| Critical < 2% AND ≥ 2 High failures | Conditional — requires mitigation plan |
| Critical < 2% AND 0-1 High AND all pass criteria met | Pass — ship-ready |
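The verdict rules are mechanical enough to transcribe directly. A sketch (the fall-through for "0-1 High but pass criteria unmet" is not specified in the table, so it conservatively returns Conditional):

```python
def overall_verdict(ship_blockers: int, critical_rate: float,
                    high_failures: int, pass_criteria_met: bool) -> str:
    """Apply the verdict rules table to one evaluation run."""
    if ship_blockers > 0:
        return "Fail"          # blocks ship regardless of scores
    if critical_rate >= 0.02:
        return "Fail"          # must fix before GA
    if high_failures >= 2:
        return "Conditional"   # requires mitigation plan
    if pass_criteria_met:
        return "Pass"          # ship-ready
    return "Conditional"       # unspecified in the table: assumed to need review
```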
Manual Eval Cross-reference
| Manual Eval | Machine Eval Mapping | Coverage |
| --- | --- | --- |
| C.1 Task Accuracy | Merge Pass B: Task Accuracy | Both |
| C.2 Task Relevance | Merge Pass B: Gate Compliance | Both |
| C.3 AI Actionability | Merge Pass B: AI Actionability | Both |
| C.4 Privacy | Merge Pass C: Privacy Filtering | Both |
| C.5-C.8 Card Display | Card: Clarity, Label, Icon | Both |
| X.1 Top-3 Ranking | Rank: Priority + Recall@3 | Both |
| X.2 Missing Journeys | Rank: Recall@3 + Grouping | Both |
| X.3 Shouldn't Be There | Merge Pass B: Gate Compliance | Both |
| O.1-O.5 Output Quality | Execution & Output (all) | Both |
| U.1-U.4 UX | — | Manual only |
| G.1-G.2 Overall | — | Manual only |

Step 2 — Build Eval Dataset

Coming Soon

Dataset construction details

Step 3 — Generate Outputs

Coming Soon

Output generation and trace capture details

Step 4 — Human Labeling & Calibration

Coming Soon

Annotation workflow and golden set details

Step 5 — Build Automated Judge

Coming Soon

LLM-as-Judge calibration and rule-based checks

Step 6 — Eval + GEPA Optimization

Overview

Step 6 puts the preceding steps to work: the automated judge scores candidate prompts on D_train, GEPA evolves them per agent and then cross-agent, and the hold-out set validates that the winning suite generalizes.

What is GEPA?

GEPA (Genetic-Pareto) is a reflective prompt evolution method. Instead of brute-force RL search, it lets a strong model review failures → analyze root causes → surgically rewrite prompts, managed through a genetic evolution framework with Pareto selection to avoid single-dimension overfitting. Children are added to the candidate pool unconditionally — natural selection happens through Pareto frontier recomputation, not explicit accept/reject gates.

Reference: Reflective Prompt Evolution Can Outperform Reinforcement Learning (ICLR 2026)

Two Datasets, Two Roles
Training Set (D_train)

The dual-purpose workhorse. Serves both roles during GEPA optimization:

  • Scoring: All/most cases scored per round to compute Pareto frontiers
  • Feedback sampling: Random batch sampled each round for failure trace collection & reflection
Why not split? Pareto dimensions = per-case scores. If feedback comes from different data, reflection targets won't align with selection criteria.
Hold-out Test Set

The sealed final exam. Completely isolated from optimization. Only used twice: baseline (6a) and final validation (6e).

Never seen during optimization — ensures results generalize beyond training data
End-to-End Flow

6a Baseline → 6b Per-Agent GEPA → 6c Convergence → 6d Cross-Agent GEPA → 6e Final Validation → 6f Continuous Regression
6a. Baseline Evaluation

Run the Hold-out Test Set against all current pipeline prompts. Record baseline scores across all 42 metrics.

  • This is the test set's first use
  • Output: per-agent, per-metric baseline scores — the anchor for final comparison
6b. Per-Agent GEPA Loop
Premise
  • Lock the model — Commercial Journeys runs on Sydney. GEPA optimizes prompts only, no model switching.
  • One agent at a time — freeze other agents' outputs as control variables.
  • Order: Upstream → Downstream — Grouping (×N) → Merge → Rank. Upstream quality sets the ceiling for downstream; optimize inputs first.
Commercial Journeys Agent Configuration
| Agent | Seed Prompt | Frozen Context | Eval Metrics |
| --- | --- | --- | --- |
| Grouping ×N | Per-source current prompt | Raw data signals | Coherence, Completeness, Coverage |
| Merge | Current Merge prompt | Optimized Grouping outputs | Pass A (Fusion) + B (Task) + C (Filter) — 16 metrics |
| Rank | Current Rank prompt | Optimized Merge outputs | Priority, Recall@3, ExecutionPlan Quality & Alignment |

Pareto dimensions = each training case's aggregate score (not per-metric). Too many dimensions would make almost every prompt non-dominated, defeating Pareto selection.

The GEPA Cycle (Per-Agent)
Initialize
Seed Prompt
Start with 1 prompt (P₀) — the agent's current production prompt. The candidate pool grows each round as children are unconditionally added: [P₀, P₁, P₂, …]
Step 1 — Score & Select
D_train Scoring → Pareto Frontier → Parent
Score every candidate on all/most cases in D_train using the LLM Judge (from Step 5). Compute the Pareto frontier — prompts that aren't fully dominated by any other.
| Prompt | Case 1 | Case 2 | Case 3 | Status |
| --- | --- | --- | --- | --- |
| P₀ | 0.9 | 0.4 | 0.7 | Frontier |
| P₁ | 0.6 | 0.8 | 0.7 | Frontier |
| P₂ | 0.7 | 0.7 | 0.9 | Frontier |
| P₃ | 0.5 | 0.5 | 0.5 | Dominated |
Select parent from frontier (prefer least-selected). Say we pick P₀.

Note: dominated prompts (like P₃) remain in the pool — they may return to the frontier if future children shift the landscape. Natural selection, not manual pruning.
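A minimal sketch of the frontier computation, reproducing the table above from prompt names and per-case score vectors:

```python
def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Return the prompts not fully dominated by any other candidate.

    P dominates Q if P scores >= Q on every case and > Q on at least one.
    """
    def dominates(p: list[float], q: list[float]) -> bool:
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

    return {
        name for name, vec in scores.items()
        if not any(dominates(other, vec)
                   for other_name, other in scores.items() if other_name != name)
    }

scores = {"P0": [0.9, 0.4, 0.7], "P1": [0.6, 0.8, 0.7],
          "P2": [0.7, 0.7, 0.9], "P3": [0.5, 0.5, 0.5]}
print(pareto_frontier(scores))  # {'P0', 'P1', 'P2'} -- P3 is dominated by P2
```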

Step 2 — Expose & Reflect
D_train Sampling → Feedback Function µ_f → Reflection + Ancestry
Randomly sample a batch from the same D_train. Run parent P₀ on these cases. Collect structured failure feedback via the Feedback Function µ_f, then feed it to a Reflection LLM along with P₀'s ancestry lessons.
Feedback Function µ_f (distinct from scoring µ)

The scoring metric µ returns a scalar (0–1) for Pareto computation. The feedback function µ_f returns structured natural language — this is what makes reflection surgical rather than guesswork.

µ_f output example (Merge agent):
Case 7 | Score: 0.3 | Failure: Two emails about Q3 review + one calendar event point to the same deliverable, but Merge output treats them as independent journeys. Root cause: fusion rule lacks cross-source deliverable matching.

Reflection quality depends on µ_f richness: scalar scores alone → reflection guesses; structured failure traces → reflection does targeted surgery.
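In code terms, the two channels might be carried together in one record per case (the schema is illustrative only, not the production one):

```python
from dataclasses import dataclass

@dataclass
class CaseFeedback:
    """One training case's result, carrying both evaluation channels."""
    case_id: int
    score: float       # scalar from scoring metric mu -- feeds Pareto computation
    failure: str       # structured trace from mu_f -- feeds the Reflection LLM
    root_cause: str    # judge's diagnosis of why the case failed

example = CaseFeedback(
    case_id=7, score=0.3,
    failure="Two emails about Q3 review + one calendar event point to the same "
            "deliverable, but Merge output treats them as independent journeys.",
    root_cause="fusion rule lacks cross-source deliverable matching",
)
```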
Prompt Ancestry & Lesson Accumulation

Every prompt carries an ancestry chain — who its parent was, and what lesson each mutation learned. This prevents "fix A, break B, fix B, break A" oscillation.

P₀'s ancestry (example):
P₀ (seed) → P₂ (lesson: "added cross-source matching but broke lifecycle filtering") → P₅ (lesson: "restored lifecycle rules, scoped cross-source to same-week window")

When mutating P₅, the Reflection LLM sees all accumulated lessons and avoids repeating past mistakes.
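A sketch of what an ancestry-carrying pool entry could look like (hypothetical structure; the real candidate representation may differ):

```python
from dataclasses import dataclass, field

@dataclass
class PromptCandidate:
    """A candidate-pool entry that carries its ancestry chain."""
    prompt_id: str
    text: str
    parent_ids: list[str] = field(default_factory=list)  # 1 parent (mutation), 2 (crossover)
    lessons: list[str] = field(default_factory=list)     # accumulated across all ancestors

    def child(self, child_id: str, new_text: str, lesson: str) -> "PromptCandidate":
        """Inherit every ancestor lesson, then append this round's lesson."""
        return PromptCandidate(child_id, new_text,
                               parent_ids=[self.prompt_id],
                               lessons=self.lessons + [lesson])
```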
Reflection LLM input: current prompt + µ_f failure traces + ancestry lessons

"P₀ is too conservative in cross-source fusion — frequently under-merges.
  Ancestry note: P₀'s grandparent tried aggressive fusion but broke lifecycle filtering (see lesson from round 3).
  Suggest: 'Merge cross-source items sharing the same deliverable within a 7-day window, preserving lifecycle boundaries.'"

Scoring and feedback come from the same D_train — this ensures reflection targets align with Pareto selection criteria.

Step 3 — Mutate & Add to Pool
Targeted Mutation / Semantic Crossover → Unconditional Pool Entry
Two mutation modes, both informed by ancestry lessons:
Targeted Mutation (single parent)
Reflection LLM rewrites parent P₀ based on failure analysis + ancestry lessons → child P₄
Semantic Crossover (two parents)
Select two frontier prompts. LLM identifies each one's complementary strengths (e.g. P₁ excels at fusion, P₂ at lifecycle). Generates a child that semantically fuses both — not module splicing, but understanding why each works and synthesizing.
Crossover example (Merge prompt):
P₁ strength: excellent cross-source fusion (handles email+calendar overlap)
P₂ strength: robust lifecycle filtering (correctly expires completed tasks)
→ Child P₄: LLM synthesizes a prompt that captures both fusion logic and lifecycle rules, resolving conflicts where they overlap
P₄ joins the pool unconditionally → [P₀, P₁, P₂, P₃, P₄]
No accept/reject gate. If P₄ is weak, it'll be naturally dominated in the next round's Pareto computation. But it might have unique strengths on specific cases that preserve diversity.
P₄ inherits ancestry from its parent(s) + adds the current round's lesson.
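Both mutation modes reduce to one Reflection-LLM call with different framing. A sketch, where `llm` is a hypothetical prompt-in/completion-out callable (not a real API):

```python
def targeted_mutation(parent_text: str, lessons: list[str],
                      failure_traces: str, llm) -> str:
    """Single-parent rewrite driven by mu_f traces + ancestry lessons."""
    return llm(
        "Rewrite this prompt to fix the failures below. "
        "Do not repeat mistakes recorded in the ancestry lessons.\n"
        f"PROMPT:\n{parent_text}\nFAILURES:\n{failure_traces}\nLESSONS:\n{lessons}"
    )

def semantic_crossover(parent_a_text: str, parent_b_text: str, llm) -> str:
    """Two-parent synthesis: explain why each works, then fuse the strengths."""
    return llm(
        "Identify each prompt's complementary strengths, then write one prompt "
        "that captures both, resolving any conflicting rules.\n"
        f"PROMPT A:\n{parent_a_text}\nPROMPT B:\n{parent_b_text}"
    )
```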
Step 4 — Loop or Converge
Back to Step 1 — until convergence
Next round: re-score all candidates (including new child), recompute Pareto frontier, select new parent… Stop when any convergence signal fires:
  • Iteration cap reached (e.g. 15 rounds)
  • Best score plateaus: < 1% change for N consecutive rounds
  • Frontier stops changing: same prompts on frontier for N rounds
  • All pass criteria already met

Convergence ≠ global optimum. It means "no further gains under current search strategy." A fundamentally different prompt approach might score higher, but GEPA's mutation path didn't explore it. This is why we validate with the Hold-out Test Set later (6e).
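The four stop signals can be checked mechanically each round. A sketch, assuming scores are on a 0–1 scale so "< 1% change" is an absolute delta of 0.01:

```python
def converged(round_no: int,
              best_history: list[float],          # best score after each round
              frontier_history: list[frozenset],  # frontier membership per round
              all_pass_criteria_met: bool,
              max_rounds: int = 15, patience: int = 3) -> bool:
    """True when any of the four convergence signals fires."""
    if round_no >= max_rounds:                                  # iteration cap
        return True
    recent = best_history[-(patience + 1):]
    if len(recent) == patience + 1 and all(
            abs(b - a) < 0.01 for a, b in zip(recent, recent[1:])):
        return True                                             # score plateau
    if len(frontier_history) >= patience and \
            len(set(frontier_history[-patience:])) == 1:
        return True                                             # frontier frozen
    return all_pass_criteria_met                                # criteria all met
```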

6c. Per-Agent Convergence & Regression
Output: Top-N Candidates

Only after convergence (any of the 4 stop signals in 6b Step 4 fires) do we select Top-N from the final Pareto frontier. Rank by average score across all training cases:

| Prompt | V1 | V2 | V3 | Avg |
| --- | --- | --- | --- | --- |
| P₂ | 0.7 | 0.7 | 0.9 | 0.77 |
| P₅ | 0.8 | 0.6 | 0.8 | 0.73 |
| P₁ | 0.6 | 0.8 | 0.7 | 0.70 |
Sanity Check: E2E Integration

Since we optimize upstream → downstream and freeze other agents, per-agent regression is unlikely within the same optimization round. However, before moving to the next agent, run a quick E2E integration check to confirm the new prompt works correctly when wired into the full pipeline:

  • Pipeline works as expected → lock this agent's result → move to next agent
  • Unexpected interaction → investigate and re-adapt if needed → then continue

The real cross-agent validation happens in 6d, where all prompt combinations are evaluated end-to-end.

Commercial Journeys Per-Agent Outputs
| Agent | GEPA Output | Sent to 6d |
| --- | --- | --- |
| Grouping ×N | Top-1 per source (N independent sources) | N prompts (fixed) |
| Merge | Pareto frontier → Top-3 by avg score | [M₁, M₂, M₃] |
| Rank | Pareto frontier → Top-3 by avg score | [R₁, R₂, R₃] |
6d. Cross-Agent GEPA Loop + Convergence

Why this step? Per-agent optimization freezes other agents. But in production, all prompts interact. A prompt that's #1 in isolation may not be the best partner for another agent's top pick. We need to find the globally optimal combination.

Step 1 — Assemble Candidates
Collect Per-Agent Top-N
  • Grouping: Top-1 per source → N prompts (fixed, not combinatorially expanded)
  • Merge: Top-3 from per-agent Pareto → [M₁, M₂, M₃]
  • Rank: Top-3 from per-agent Pareto → [R₁, R₂, R₃]

Combination space: 3 × 3 = 9 prompt suites (Grouping is fixed across all combinations)
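Enumerating and scoring the suite space takes a few lines. `run_e2e` below is a hypothetical callable that runs one (Merge, Rank) pair through the full pipeline and returns its aggregate E2E score:

```python
from itertools import product

merge_candidates = ["M1", "M2", "M3"]  # Top-3 from per-agent GEPA
rank_candidates = ["R1", "R2", "R3"]   # Grouping prompts are fixed, so no expansion

def best_suite(run_e2e):
    """Score all 3 x 3 = 9 prompt suites end-to-end; return the winner + scores."""
    scores = {(m, r): run_e2e(m, r)
              for m, r in product(merge_candidates, rank_candidates)}
    return max(scores, key=scores.get), scores
```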
Step 2 — Full E2E Evaluation
Run All 9 Combinations End-to-End
Each combination runs the full pipeline (Grouping → Merge → Rank → Card → Execution) on D_train. Score all 42 metrics per combination.

Cost: Only 9 E2E eval runs — very manageable.
| Combination | E2E Score | Ship-Blockers | Critical Rate |
| --- | --- | --- | --- |
| M₁ + R₁ | 0.81 | 0 | 1.2% |
| M₂ + R₁ | 0.85 | 0 | 0.8% |
| M₃ + R₁ | 0.79 | 0 | 1.5% |
| M₁ + R₂ | 0.80 | 0 | 1.1% |
Step 3 — Select & Confirm
Global Optimum + Convergence Check
Select the highest-scoring combination: M₂ + R₁ wins (not the per-agent #1s M₁ + R₁).

Convergence check: The global optimum is stable if:
  • Winner is clearly ahead (gap > margin of error)
  • Zero Ship-Blockers in winning combination
  • Critical failure rate within threshold (< 2%)
  • All per-agent pass criteria still hold in the combined run

If no combination meets all criteria → revisit per-agent GEPA (6b) with adjusted constraints, or expand Top-N to explore more combinations.

6e. Final Validation

Run the Hold-out Test Set against the winning prompt suite from 6d. Compare against 6a baseline.

  • Test ≈ D_train scores → no overfitting; results generalize → ship it
  • Test ≪ D_train scores → prompts overfit to training data → revisit dataset or strategy
  • This is the test set's second and final use in this optimization round
  • If another round is needed → supplement ≥ 30% new test cases
6f. Continuous Regression & Monitoring

After shipping the optimized prompt suite, integrate eval into CI/CD. Re-trigger the full eval + GEPA cycle when:

  • Prompt Change — any manual edit to a pipeline prompt (Grouping, Merge, or Rank)
  • Model Update — Sydney model version update or provider-side changes
  • Pipeline Logic Change — changes to pipeline orchestration, data ingestion, or gate conditions
  • Score Regression — CI/CD monitoring detects eval score drops below thresholds
  • New Data Sources — new signal sources added to the pipeline (e.g. new M365 connector)
  • Periodic Cadence — scheduled re-evaluation (e.g. monthly or quarterly) to catch gradual drift