● Microsoft Edge · Commercial Journeys

Machine Eval + GEPA Iteration

A systematic framework for evaluating and optimizing the Commercial Journeys pipeline through automated evaluation and reflective prompt evolution.

The Six-Step Framework

STEP 1

Define Goals & Metrics

✓ Draft Done
  • Clarify evaluation target
  • Define per-agent metrics + end-to-end metrics
  • Set scoring criteria and boundary definitions
  • Define failure taxonomy (Ship-Blocker / Critical / High / Medium)
STEP 2

Build Eval Dataset

Pending
  • Build input data + ground truth; cover positive / negative / edge cases
  • Split into 2 non-overlapping sets:
    • Training Set (D_train) — GEPA scoring + feedback sampling
    • Hold-out Test Set — formal eval only
  • Hold-out locked first; max 2 uses per optimization round
STEP 3

Generate Outputs

Pending
  • Collect final results + per-agent intermediate outputs
  • Record metadata (latency, tokens, model version, prompt version)
STEP 4

Human Labeling & Calibration

Pending
  • Source 1: Selfhost Manual Eval scores (C.1-O.5) → golden set baseline
  • Source 2: Expert annotation for machine-eval-specific metrics (Source Attribution, Lifecycle, etc.)
  • ≥ 2 independent annotators; golden set = input + output + per-metric scores + rationale
STEP 5

Build Automated Judge

Pending
  • LLM Judge calibration (Cohen's κ ≥ 0.7) + rule-based checks
  • Same Judge = formal eval scorer + GEPA fitness function
STEP 6

Eval + GEPA Optimization

✓ Mechanism Draft
  • 6a Baseline evaluation
  • 6b Per-agent GEPA loop (upstream → downstream)
  • 6c Per-agent convergence + regression
  • 6d Cross-agent GEPA loop + convergence
  • 6e Final validation vs baseline
  • 6f Continuous regression & monitoring

Step 1 — Define Goals & Metrics

Commercial Journeys is a multi-agent pipeline. Evaluation operates at two levels: per-agent output (fault isolation) and end-to-end output (holistic quality).

42 Total Metrics · 8 Eval Scopes · 27 Failure Modes · 4 Severity Levels
Pipeline Architecture
📧 M365 Data + 🌐 Browser History → Grouping Agent ×N (3 metrics) → Merge Agent (16 metrics, 3 passes) → Rank Agent (10 metrics) → 🃏 NTP Cards → ⚡ Copilot Execution (13 metrics)
Evaluation Conventions
N/A Handling

Conditional metrics are marked N/A when their preconditions aren't met. N/A results don't count toward pass/fail; they are tracked as n/a_rate per metric (target: < 40%).

Responsibility Boundary

Every failure is tagged cj_pipeline | m365_copilot | ambiguous. Only cj_pipeline failures count toward the Commercial Journeys pass/fail verdict.
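A minimal sketch of how these two conventions might be encoded during scoring; the MetricResult schema, field names, and helper are illustrative, not the production format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricResult:
    metric: str                           # e.g. "lifecycle_accuracy"
    passed: Optional[bool]                # None = N/A (precondition not met)
    responsibility: str = "cj_pipeline"   # cj_pipeline | m365_copilot | ambiguous

def summarize(results: list[MetricResult], metric: str) -> dict:
    """Compute n/a_rate and pass rate for one metric across eval cases."""
    rows = [r for r in results if r.metric == metric]
    not_applicable = [r for r in rows if r.passed is None]
    scored = [r for r in rows if r.passed is not None]
    # Only cj_pipeline failures count toward the Commercial Journeys verdict.
    fails = [r for r in scored if not r.passed and r.responsibility == "cj_pipeline"]
    return {
        "n/a_rate": len(not_applicable) / len(rows) if rows else 0.0,   # target: < 40%
        "pass_rate": 1 - len(fails) / len(scored) if scored else None,
    }
```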

Part 1: Per-Agent Output Evaluation
📋

Grouping Agent Output

3 metrics

Does each data source's clustering correctly identify task candidates from raw signals? Runs independently per source — Email, Teams, Calendar, Meeting Recaps, Files, Browser History.

Intra-source Grouping Coherence

Are all items in one cluster truly about the same task?

AI reviews each cluster → judges semantic unity. Unrelated items lumped = failure.
Intra-source Grouping Completeness

Are all items for the same task in the same cluster?

AI scans unclustered items → flags items that should belong to existing cluster.
Per-source Coverage

Did sources with task signals produce candidates?

Compare active sources vs. zero-candidate sources → flag missed signals.
✓ Coherence error < 5% · Completeness error < 5%
🔀

Merge Agent Output

16 metrics · 3 passes

Are candidates from different data sources correctly fused into complete, well-defined journeys? The most complex step — 16 metrics split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.

PASS A — Fusion Quality
Cross-source Fusion (Over-merging)

Were distinct tasks incorrectly merged?

Different owners, deadlines, or deliverables = distinct tasks
Cross-source Fusion (Under-merging)

Were same-task candidates left separate?

User would describe as one to-do item = same task
Source Attribution Precision

No wrong source links?

Content directly relevant = correct. Tangential = incorrect.
Source Attribution Recall

Key sources included?

Key = primary trigger, deadline info, manager+ stakeholder.
Source Role Assignment

Correct roles? (Email=trigger, Calendar=deadline, etc.)

N/A when single source.
✓ Over-merge < 3% · Under-merge < 5% · Precision ≥ 95% · Recall ≥ 90%
PASS B — Task Definition
Task Accuracy

Is this a real task the user needs to do?

Explicitly assigned/committed = real. CC'd FYI → inferred action = not real.
Task Granularity

Right scope? Not too broad/narrow?

Score 1-5. "Prepare for work this week" = 1. "Draft reply to VP's email" = 5.
AI Actionability

Does this task benefit from AI?

Too simple (one-click) or beyond AI (physical) = flagged.
Gate Condition Compliance

Passes all 7 gates?

IS_CONCRETE_TASK · IS_WORK_RELATED · HAS_USER_RELEVANCE · IS_NOT_COMPLETED · HAS_MEANINGFUL_COMPLEXITY · REQUIRES_BROWSER · IS_AI_EXECUTABLE
Hallucination Detection

Zero traceable source signals?

No evidence in ANY source = hallucination (critical). Strong multi-source inference ≠ hallucination.
✓ Task Accuracy ≥ 90% · Hallucination = 0%
PASS C — Filtering & Lifecycle
Lifecycle Accuracy

NEW / ONGOING / RECURRING correctly assigned?

N/A on first-run with no history.
Incremental Update Fidelity

Original context preserved when updating?

N/A for NEW_WORK.
Dismissed Topic Suppression

Dismissed journeys stay dismissed?

N/A when no dismiss history.
Privacy & Sensitivity Filtering

Sensitive topics blocked? PII scrubbed?

Any PII = ship-blocker. Personal-but-not-PII = High.
Staleness Rejection

Signals older than 3 days filtered out?

ALL sources expired AND no fresh merge = stale.
Noise Rejection

Trivial signals suppressed?

Auto-generated emails, all-day placeholders, newsletters = noise.
✓ Lifecycle ≥ 90% · Privacy failure = 0% · Noise pass-through < 5%
🏆

Rank Agent Output

10 metrics

Are journeys correctly prioritized with high-quality execution plans?

Priority Signal Alignment

Top-ranked = highest urgency/importance?

Recall@3

How many of the human-labeled ideal Top-3 appear in the system's Top-3?

Time Sensitivity Accuracy

Urgency score matches actual timestamps?

Ranking Stability

No erratic changes without new signals?

Task Category Distribution

Top-N spread across different action types?

Data Source Coverage

Top-N drawing from multiple sources?

Temporal Trigger Diversity

Varied urgency triggers?

Cross-user Consistency

Are ranking issues systemic across users or isolated?

ExecutionPlan Quality

Well-structured, specific, executable? Score 1-5.

ExecutionPlan-Journey Alignment

Plan matches journey's task?

✓ Top-3 inversion < 10% · Recall@3 ≥ 0.6 · No single action_type > 60% · ExecutionPlan avg ≥ 3.5
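As a concrete illustration of the Recall@3 metric above, a minimal sketch; the journey IDs are hypothetical.

```python
def recall_at_k(ideal: list[str], ranked: list[str], k: int = 3) -> float:
    """Fraction of the human-labeled ideal Top-k that appears in the system's Top-k."""
    ideal_top, system_top = set(ideal[:k]), set(ranked[:k])
    return len(ideal_top & system_top) / len(ideal_top) if ideal_top else 0.0

# 2 of the 3 human-labeled journeys made the system's Top-3 -> 0.67 (meets the >= 0.6 target)
recall_at_k(["j_budget", "j_vp_reply", "j_q3_review"],
            ["j_vp_reply", "j_newsletter", "j_q3_review", "j_budget"])
```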
Part 2: Final Output Evaluation
🃏

Card Presentation Quality

4 metrics
Brief Clarity

Specific, actionable, immediately understood?

Label Accuracy

Label reflects actual urgency trigger?

Icon Relevance

Icon matches actual task?

Promise Accuracy

Card accurately represents execution output? Score 1-5.

✓ Clarity ≥ 90% · Promise score ≤ 2 = 0%

Execution & Output Quality

6 metrics

Root-cause tagging required: cj_pipeline | m365_copilot | ambiguous

Execution Completion

Success vs. fail/timeout rate

Timeout = likely m365. Malformed prompt = cj_pipeline.
Output Correctness

Factual claims grounded in user data? ≥ 95%

Bad context = cj. Copilot hallucinated = m365.
Output Completeness

Covers all key threads?

Commercial Journeys omitted context = cj. Copilot ignored = m365.
Effort Reduction

Synthesizes beyond raw data? Score 1-5.

Typically m365_copilot.
Actionability

Concrete next steps? ≥ 1 action?

Prompt didn't request = cj. Requested but not delivered = m365.
Scenario Fit

Output format matches action_type?

Wrong format in prompt = cj. Correct request, wrong delivery = m365.
✓ Completion ≥ 95% · Correctness ≥ 95% (cj_pipeline) · Effort Reduction avg ≥ 3.5
🔗

End-to-End Coherence & Task Resolution

3 metrics
Full-chain Topic Alignment

Primary topic consistent across all stages?

Task Resolution Score

Can user advance task from output alone? Score 1-5.

Recurrence Handling

Correct upcoming instance for RECURRING_WORK?

✓ Topic break = 0% · Resolution avg ≥ 3.5
Failure Mode Taxonomy
27 failure modes by severity: 🔴 Ship-Blocker 3 · 🟠 Critical 7 · 🟡 High 13 · 🟢 Medium 4
🔴 Ship-Blocker — Zero Tolerance
  • Privacy Violation (Merge)
  • PII Leakage (Merge)
  • Resurfaced Dismissed Topic (Merge)
🟠 Critical — < 2% Threshold
  • Hallucinated Journey (Merge)
  • Wrong Source Attribution (Merge)
  • Phantom Task (Merge)
  • Completed Task Surfaced (Merge)
  • Promise-Delivery Gap (Output)
  • Over-Promise (Output)
  • Output Hallucination (Output)
🟡 High — Needs Mitigation
  • Topic Drift (Output)
  • Missed Source Entirely (Grouping)
  • Mis-grouped Items (Grouping)
  • Orphaned Key Item (Grouping)
  • Over-merged Tasks (Merge)
  • Wrong Lifecycle (Merge)
  • Scope Mismatch (Merge)
  • AI Misfit (Merge)
  • Noise Leak (Merge)
  • Stale Journey (Merge)
  • Priority Inversion (Rank)
  • Bad ExecutionPlan (Rank)
  • Missing Key Context (Output)
🟢 Medium — Can Ship with Tracking
  • Wrong Help Type (Output)
  • Under-merged Duplicates (Merge)
  • Monoculture Recommendations (Rank)
  • Vague Card (Output)
Verdict Rules
🚫 Any 🔴 Ship-Blocker → Fail — blocks ship
🚫 🟠 Critical ≥ 2% → Fail — must fix before GA
⚠️ 🟠 < 2% AND ≥ 2 🟡 → Conditional — needs mitigation
✅ 🟠 < 2% AND 0-1 🟡 AND all pass → Ship-ready
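A minimal sketch of the verdict logic, assuming failure counts have already been aggregated per severity. The thresholds mirror the rules above; the unspecified case (0-1 High but not all metrics passing) defaults to Conditional here, which is an assumption.

```python
def verdict(ship_blockers: int, critical_rate: float, high_count: int,
            all_metrics_pass: bool) -> str:
    """Map aggregated failure counts to a ship verdict (illustrative, mirrors the rules above)."""
    if ship_blockers > 0:
        return "Fail — blocks ship"
    if critical_rate >= 0.02:
        return "Fail — must fix before GA"
    if high_count >= 2:
        return "Conditional — needs mitigation"
    if all_metrics_pass:
        return "Ship-ready"
    return "Conditional — needs mitigation"   # assumption: unresolved edge case is conditional
```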
Evaluation Structure Summary
Eval Scope | What | Sub-metrics
Grouping Agent | Per-source clustering quality | 3
Merge Agent — Pass A | Fusion quality | 5
Merge Agent — Pass B | Task definition | 5
Merge Agent — Pass C | Filtering & lifecycle | 6
Rank Agent | Priority, recall, diversity, ExecutionPlan | 10
Card Presentation | Pre-click experience | 4
Execution & Output | Post-click value | 6
E2E Coherence | Full-chain consistency | 3
Total | | 42
Manual Eval Cross-reference
Manual Eval | Machine Eval Mapping | Coverage
C.1 Task Accuracy | Merge Pass B: Task Accuracy | ♻️ Both
C.2 Task Relevance | Merge Pass B: Gate Compliance | ♻️ Both
C.3 AI Actionability | Merge Pass B: AI Actionability | ♻️ Both
C.4 Privacy | Merge Pass C: Privacy Filtering | ♻️ Both
C.5-C.8 Card Display | Card: Clarity, Label, Icon | ♻️ Both
X.1 Top-3 Ranking | Rank: Priority + Recall@3 | ♻️ Both
X.2 Missing Journeys | Rank: Recall@3 + Grouping | ♻️ Both
X.3 Shouldn't Be There | Merge Pass B: Gate Compliance | ♻️ Both
O.1-O.5 Output Quality | Execution & Output (all) | ♻️ Both
U.1-U.4 UX | Manual only |
G.1-G.2 Overall | Manual only |
Consumer Eval Cross-reference
Consumer Dimension | Commercial Mapping | Status
L1 Relevance | Merge Pass B: Task Accuracy + Hallucination |
L1 Groundedness | Merge Pass A: Source Attribution |
L1 Helpfulness | Merge Pass B: AI Actionability + Gates |
L1 Safety & Privacy | Merge Pass C: Privacy Filtering |
L1 Technical Feasibility | Card: Promise Accuracy |
L2 Clarity | Card: Brief Clarity |
L2 Promise Accuracy | Card: Promise Accuracy |
L3 Correctness | Execution: Output Correctness |
L3 Completeness | Execution: Output Completeness |
L3 Effort Reduction | Execution: Effort Reduction |
L3 Actionability | Execution: Actionability |
L3 Scenario Fit | Execution: Scenario Fit |
L4 Self-Sufficiency | E2E: Task Resolution Score |
L4 E2E Coherence | E2E: Full-chain Topic Alignment |
L4 Engagement / Trust | Online-only | ⚪ N/A

Step 2 — Build Eval Dataset

🚧 Coming soon — Dataset construction details will be added here.

Step 3 — Generate Outputs

🚧 Coming soon — Output generation and trace capture details will be added here.

Step 4 — Human Labeling & Calibration

🚧 Coming soon — Annotation workflow and golden set details will be added here.

Step 5 — Build Automated Judge

🚧 Coming soon — LLM-as-Judge calibration and rule-based checks will be added here.

Step 6 — Eval + GEPA Optimization

Overview
What is GEPA?

GEPA (Genetic-Pareto) is a reflective prompt evolution method. Instead of brute-force RL search, it lets a strong model review failures → analyze root causes → surgically rewrite prompts, managed through a genetic evolution framework with Pareto selection to avoid single-dimension overfitting. Children are added to the candidate pool unconditionally — natural selection happens through Pareto frontier recomputation, not explicit accept/reject gates.

Reference: Reflective Prompt Evolution Can Outperform Reinforcement Learning (ICLR 2026)

Two Datasets, Two Roles
Training Set (D_train)

The dual-purpose workhorse. Serves both roles during GEPA optimization:

  • Scoring: All/most cases scored per round to compute Pareto frontiers
  • Feedback sampling: Random batch sampled each round for failure trace collection & reflection
Why not split? Pareto dimensions = per-case scores. If feedback comes from different data, reflection targets won't align with selection criteria.
Hold-out Test Set

The sealed final exam. Completely isolated from optimization. Only used twice: baseline (6a) and final validation (6e).

Never seen during optimization — ensures results generalize beyond training data
End-to-End Flow
6a Baseline → 6b Per-Agent GEPA → 6c Convergence → 6d Cross-Agent GEPA → 6e Final Validation → 6f Continuous Regression
6a. Baseline Evaluation

Run the Hold-out Test Set against all current pipeline prompts. Record baseline scores across all 42 metrics.

  • This is the test set's first use
  • Output: per-agent, per-metric baseline scores — the anchor for final comparison
6b. Per-Agent GEPA Loop
Premise
  • Lock the model — Commercial Journeys runs on Sydney. GEPA optimizes prompts only, no model switching.
  • One agent at a time — freeze other agents' outputs as control variables.
  • Order: Upstream → Downstream — Grouping (×N) → Merge → Rank. Upstream quality sets the ceiling for downstream; optimize inputs first.
Commercial Journeys Agent Configuration
Agent | Seed Prompt | Frozen Context | Eval Metrics
Grouping ×N | Per-source current prompt | Raw data signals | Coherence, Completeness, Coverage
Merge | Current Merge prompt | Optimized Grouping outputs | Pass A (Fusion) + B (Task) + C (Filter) — 16 metrics
Rank | Current Rank prompt | Optimized Merge outputs | Priority, Recall@3, ExecutionPlan Quality & Alignment

Pareto dimensions = each training case's aggregate score (not per-metric). Too many dimensions would make almost every prompt non-dominated, defeating Pareto selection.

The GEPA Cycle (Per-Agent)
Initialize
Seed Prompt
Start with 1 prompt (P₀) — the agent's current production prompt. The candidate pool grows each round as children are unconditionally added: [P₀, P₁, P₂, …]
Step 1 — Score & Select
D_train Scoring → Pareto Frontier → Parent
Score every candidate on all/most cases in D_train using the LLM Judge (from Step 5). Compute the Pareto frontier — prompts that aren't fully dominated by any other.
Prompt | Case 1 | Case 2 | Case 3 | Status
P₀ | 0.9 | 0.4 | 0.7 | Frontier
P₁ | 0.6 | 0.8 | 0.7 | Frontier
P₂ | 0.7 | 0.7 | 0.9 | Frontier
P₃ | 0.5 | 0.5 | 0.5 | Dominated
Select parent from frontier (prefer least-selected). Say we pick P₀.

Note: dominated prompts (like P₃) remain in the pool — they may return to the frontier if future children shift the landscape. Natural selection, not manual pruning.
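A minimal sketch of the dominance check and frontier computation, using the illustrative per-case scores from the table above; dominated candidates are reported, not removed.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """a dominates b if it scores >= b on every case and strictly > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(scores: dict[str, list[float]]) -> set[str]:
    """Prompts not dominated by any other candidate; dominated ones stay in the pool."""
    return {p for p, s in scores.items()
            if not any(dominates(o, s) for q, o in scores.items() if q != p)}

scores = {"P0": [0.9, 0.4, 0.7], "P1": [0.6, 0.8, 0.7],
          "P2": [0.7, 0.7, 0.9], "P3": [0.5, 0.5, 0.5]}
pareto_frontier(scores)   # {"P0", "P1", "P2"} — P3 is dominated but not discarded
```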

Step 2 — Expose & Reflect
D_train Sampling → Feedback Function µ_f → Reflection + Ancestry
Randomly sample a batch from the same D_train. Run parent P₀ on these cases. Collect structured failure feedback via the Feedback Function µ_f, then feed to a Reflection LLM along with P₀'s ancestry lessons.
Feedback Function µ_f (distinct from scoring µ)

The scoring metric µ returns a scalar (0–1) for Pareto computation. The feedback function µ_f returns structured natural language — this is what makes reflection surgical rather than guesswork.

µ_f output example (Merge agent):
Case 7 | Score: 0.3 | Failure: Two emails about Q3 review + one calendar event point to the same deliverable, but Merge output treats them as independent journeys. Root cause: fusion rule lacks cross-source deliverable matching.

Reflection quality depends on µ_f richness: scalar scores alone → reflection guesses; structured failure traces → reflection does targeted surgery.
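One possible way to carry µ_f output alongside the scalar score; the FeedbackTrace schema below is illustrative, not a required format.

```python
from dataclasses import dataclass

@dataclass
class FeedbackTrace:
    """One µ_f record: the scalar used for Pareto scoring plus structured failure context."""
    case_id: int
    score: float         # scalar from scoring metric µ
    failure: str         # what went wrong, in natural language
    root_cause: str      # hypothesis the Reflection LLM can act on

trace = FeedbackTrace(
    case_id=7,
    score=0.3,
    failure="Two emails about Q3 review + one calendar event point to the same "
            "deliverable, but Merge output treats them as independent journeys.",
    root_cause="Fusion rule lacks cross-source deliverable matching.",
)
```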
Prompt Ancestry & Lesson Accumulation

Every prompt carries an ancestry chain — who its parent was, and what lesson each mutation learned. This prevents "fix A, break B, fix B, break A" oscillation.

P₀'s ancestry (example):
P₀ (seed) → P₂ (lesson: "added cross-source matching but broke lifecycle filtering") → P₅ (lesson: "restored lifecycle rules, scoped cross-source to same-week window")

When mutating P₅, the Reflection LLM sees all accumulated lessons and avoids repeating past mistakes.
Reflection LLM input: current prompt + µ_f failure traces + ancestry lessons

"P₀ is too conservative in cross-source fusion — frequently under-merges.
  Ancestry note: P₀'s grandparent tried aggressive fusion but broke lifecycle filtering (see lesson from round 3).
  Suggest: 'Merge cross-source items sharing the same deliverable within a 7-day window, preserving lifecycle boundaries.'"

Scoring and feedback come from the same D_train — this ensures reflection targets align with Pareto selection criteria.
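A minimal sketch of how the Reflection LLM input could be assembled from these three ingredients. It reuses the illustrative FeedbackTrace from the µ_f sketch above, and the prompt wording is only an example.

```python
def build_reflection_input(current_prompt: str,
                           traces: list[FeedbackTrace],
                           ancestry_lessons: list[str]) -> str:
    """Assemble the Reflection LLM input: prompt under revision, sampled failures, inherited lessons."""
    failures = "\n".join(f"- Case {t.case_id} (score {t.score}): {t.failure} "
                         f"Root cause: {t.root_cause}" for t in traces)
    lessons = "\n".join(f"- {lesson}" for lesson in ancestry_lessons) or "- (none yet)"
    return (
        "You are revising an agent prompt. Do not repeat past mistakes.\n"
        f"CURRENT PROMPT:\n{current_prompt}\n\n"
        f"FAILURE TRACES (sampled from D_train):\n{failures}\n\n"
        f"ANCESTRY LESSONS:\n{lessons}\n\n"
        "Propose a surgically edited prompt that fixes these failures "
        "without undoing what the lessons preserved."
    )
```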

Step 3 — Mutate & Add to Pool
Targeted Mutation / Semantic Crossover → Unconditional Pool Entry
Two mutation modes, both informed by ancestry lessons:
Targeted Mutation (single parent)
Reflection LLM rewrites parent P₀ based on failure analysis + ancestry lessons → child P₄
Semantic Crossover (two parents)
Select two frontier prompts. LLM identifies each one's complementary strengths (e.g. P₁ excels at fusion, P₂ at lifecycle). Generates a child that semantically fuses both — not module splicing, but understanding why each works and synthesizing.
Crossover example (Merge prompt):
P₁ strength: excellent cross-source fusion (handles email+calendar overlap)
P₂ strength: robust lifecycle filtering (correctly expires completed tasks)
→ Child P₄: LLM synthesizes a prompt that captures both fusion logic and lifecycle rules, resolving conflicts where they overlap
P₄ joins the pool unconditionally → [P₀, P₁, P₂, P₃, P₄]
No accept/reject gate. If P₄ is weak, it'll be naturally dominated in the next round's Pareto computation. But it might have unique strengths on specific cases that preserve diversity.
P₄ inherits ancestry from its parent(s) + adds the current round's lesson.
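A minimal sketch of unconditional pool entry with ancestry inheritance; the Candidate structure and naming scheme are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    name: str
    prompt: str
    parents: list[str] = field(default_factory=list)
    lessons: list[str] = field(default_factory=list)   # accumulated ancestry lessons

def add_child(pool: list[Candidate], child_prompt: str,
              parents: list[Candidate], lesson: str) -> Candidate:
    """Children join the pool unconditionally; selection happens only in the next Pareto pass."""
    child = Candidate(
        name=f"P{len(pool)}",
        prompt=child_prompt,
        parents=[p.name for p in parents],                       # single parent = mutation, two = crossover
        lessons=[les for p in parents for les in p.lessons] + [lesson],
    )
    pool.append(child)   # no accept/reject gate
    return child
```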
Step 4 — Loop or Converge
Back to Step 1 — until convergence
Next round: re-score all candidates (including new child), recompute Pareto frontier, select new parent… Stop when any convergence signal fires:
  • Iteration cap reached (e.g. 15 rounds)
  • Best score plateaus: < 1% change for N consecutive rounds
  • Frontier stops changing: same prompts on frontier for N rounds
  • All pass criteria already met

Convergence ≠ global optimum. It means "no further gains under current search strategy." A fundamentally different prompt approach might score higher, but GEPA's mutation path didn't explore it. This is why we validate with the Hold-out Test Set later (6e).
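A minimal sketch of the stop-signal check, assuming one best aggregate score and one frontier snapshot are recorded per round; the cap, plateau epsilon, and patience values are illustrative.

```python
def converged(best_scores: list[float], frontiers: list[set[str]],
              round_idx: int, criteria_met: bool = False,
              max_rounds: int = 15, plateau_eps: float = 0.01, patience: int = 3) -> bool:
    """Return True when any stop signal fires: pass criteria met, iteration cap,
    best-score plateau, or an unchanged Pareto frontier."""
    if criteria_met or round_idx >= max_rounds:
        return True
    # Plateau: best score changed by < plateau_eps versus each of the last `patience` rounds.
    if len(best_scores) > patience and all(
            abs(best_scores[-1] - s) < plateau_eps for s in best_scores[-patience - 1:-1]):
        return True
    # Stable frontier: the same prompt set stayed on the frontier for `patience` rounds.
    if len(frontiers) > patience and all(f == frontiers[-1] for f in frontiers[-patience - 1:-1]):
        return True
    return False
```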

6c. Per-Agent Convergence & Regression
Output: Top-N Candidates

Only after convergence (any of the 4 stop signals in 6b Step 4 fires) do we select Top-N from the final Pareto frontier. Rank by average score across all training cases:

Prompt | V1 | V2 | V3 | Avg
P₂ | 0.7 | 0.7 | 0.9 | 0.77
P₅ | 0.8 | 0.6 | 0.8 | 0.73
P₁ | 0.6 | 0.8 | 0.7 | 0.70
Sanity Check: E2E Integration

Since we optimize upstream → downstream and freeze other agents, per-agent regression is unlikely within the same optimization round. However, before moving to the next agent, run a quick E2E integration check to confirm the new prompt works correctly when wired into the full pipeline:

  • Pipeline works as expected → lock this agent's result → move to the next agent
  • Unexpected interaction → investigate and re-adapt if needed → then continue

The real cross-agent validation happens in 6d, where all prompt combinations are evaluated end-to-end.

Commercial Journeys Per-Agent Outputs
Agent | GEPA Output | Sent to 6d
Grouping ×N | Top-1 per source (N independent sources) | N prompts (fixed)
Merge | Pareto frontier → Top-3 by avg score | [M₁, M₂, M₃]
Rank | Pareto frontier → Top-3 by avg score | [R₁, R₂, R₃]
6d. Cross-Agent GEPA Loop + Convergence

Why this step? Per-agent optimization freezes other agents. But in production, all prompts interact. A prompt that's #1 in isolation may not be the best partner for another agent's top pick. We need to find the globally optimal combination.

Step 1 — Assemble Candidates
Collect Per-Agent Top-N
Grouping: Top-1 per source → N prompts (fixed, not combinatorially expanded)
Merge: Top-3 from per-agent Pareto → [M₁, M₂, M₃]
Rank: Top-3 from per-agent Pareto → [R₁, R₂, R₃]

Combination space: 3 × 3 = 9 prompt suites (Grouping is fixed across all combinations)
Step 2 — Full E2E Evaluation
Run All 9 Combinations End-to-End
Each combination runs the full pipeline (Grouping → Merge → Rank → Card → Execution) on D_train. Score all 42 metrics per combination.

Cost: Only 9 E2E eval runs — very manageable.
Combination | E2E Score | Ship-Blockers | Critical Rate
M₁ + R₁ | 0.81 | 0 | 1.2%
M₂ + R₁ | 0.85 | 0 | 0.8%
M₃ + R₁ | 0.79 | 0 | 1.5%
M₁ + R₂ | 0.80 | 0 | 1.1%
Step 3 — Select & Confirm
Global Optimum + Convergence Check
Select the highest-scoring combination: M₂ + R₁ wins (not the per-agent #1s M₁ + R₁).

Convergence check: The global optimum is stable if:
  • Winner is clearly ahead (gap > margin of error)
  • Zero Ship-Blockers in the winning combination
  • Critical failure rate within threshold (< 2%)
  • All per-agent pass criteria still hold in the combined run

If no combination meets all criteria → revisit per-agent GEPA (6b) with adjusted constraints, or expand Top-N to explore more combinations.
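A minimal sketch of the 6d enumeration-and-selection step, assuming a hypothetical run_e2e_eval callable that returns each combination's E2E score, Ship-Blocker count, and Critical rate after scoring all 42 metrics on D_train.

```python
from itertools import product

def select_global_optimum(merge_prompts, rank_prompts, grouping_prompts, run_e2e_eval):
    """Evaluate all Merge x Rank combinations end-to-end (Grouping fixed) and pick the
    best combination with zero Ship-Blockers and a Critical rate under 2%."""
    results = []
    for merge_p, rank_p in product(merge_prompts, rank_prompts):   # 3 x 3 = 9 suites
        report = run_e2e_eval(grouping_prompts, merge_p, rank_p)   # hypothetical E2E scorer
        results.append((merge_p, rank_p, report))
    viable = [r for r in results
              if r[2]["ship_blockers"] == 0 and r[2]["critical_rate"] < 0.02]
    if not viable:
        return None   # revisit per-agent GEPA (6b) or expand Top-N
    return max(viable, key=lambda r: r[2]["e2e_score"])
```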

6e. Final Validation

Run the Hold-out Test Set against the winning prompt suite from 6d. Compare against 6a baseline.

  • Test ≈ D_train scores → no overfitting; results generalize → ship it
  • Test ≪ D_train scores → prompts overfit to training data → revisit dataset or strategy
  • This is the test set's second and final use in this optimization round
  • If another round is needed → supplement ≥ 30% new test cases
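A minimal sketch of the generalization check on the two bullets above; the 5% tolerance is an illustrative choice, not a spec value.

```python
def final_validation(test_score: float, train_score: float, tolerance: float = 0.05) -> str:
    """Compare Hold-out Test Set performance against D_train; a large gap signals overfitting."""
    if train_score - test_score <= tolerance:
        return "Ship it — results generalize beyond training data"
    return "Overfit — revisit dataset or optimization strategy"
```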
6f. Continuous Regression & Monitoring

After shipping the optimized prompt suite, integrate eval into CI/CD. Re-trigger the full eval + GEPA cycle when:

Prompt Change

Any manual edit to a pipeline prompt (Grouping, Merge, or Rank)

Model Update

Sydney model version update or provider-side changes

Pipeline Logic Change

Changes to pipeline orchestration, data ingestion, or gate conditions

Score Regression

CI/CD monitoring detects eval score drops below thresholds

New Data Sources

New signal sources added to the pipeline (e.g. new M365 connector)

Periodic Cadence

Scheduled re-evaluation (e.g. monthly or quarterly) to catch gradual drift