● Microsoft Edge · Commercial Journeys

Machine Eval + GEPA Iteration

A systematic framework for evaluating and optimizing the Commercial Journeys pipeline through automated evaluation and reflective prompt evolution.

The Six-Step Framework

STEP 1

Define Goals & Metrics

✓ Draft Done
  • Clarify evaluation target
  • Define per-agent metrics + end-to-end metrics
  • Set scoring criteria and boundary definitions
  • Define failure taxonomy (Ship-Blocker / Critical / High / Medium)
STEP 2

Build Eval Dataset

Pending
  • Build input data + ground truth; cover positive / negative / edge cases
  • Split into 3 non-overlapping sets:
    • Feedback Set (D_feedback) — GEPA samples each round
    • Validation Set (D_pareto) — fixed Pareto scoring
    • Hold-out Test Set — formal eval only
  • Hold-out locked first; max 2 uses per optimization round
STEP 3

Generate Outputs

Pending
  • Collect final results + per-agent intermediate outputs
  • Record metadata (latency, tokens, model version, prompt version)
STEP 4

Human Labeling & Calibration

Pending
  • Source 1: Selfhost Manual Eval scores (C.1-O.5) → golden set baseline
  • Source 2: Expert annotation for machine-eval-specific metrics (Source Attribution, Lifecycle, etc.)
  • ≥ 2 independent annotators; golden set = input + output + per-metric scores + rationale
STEP 5

Build Automated Judge

Pending
  • LLM Judge calibration (Cohen's κ ≥ 0.7) + rule-based checks
  • Same Judge = formal eval scorer + GEPA fitness function
STEP 6

Eval + GEPA Optimization

✓ Mechanism Draft
  • 6a Baseline run on hold-out test set
  • 6b Per-agent GEPA loop (order flexible)
  • 6b-2 Global combination: top-N × top-N → best pairing
  • 6c Final validation vs baseline
  • 6d Continuous regression + CI/CD

Step 1 — Define Goals & Metrics

Commercial Journeys is a multi-agent pipeline. Evaluation operates at two levels: per-agent output (fault isolation) and end-to-end output (holistic quality).

42 Total Metrics · 8 Eval Scopes · 27 Failure Modes · 4 Severity Levels
Pipeline Architecture
📧 M365 Data + 🌐 Browser History → Grouping Agent ×6 (3 metrics) → Merge Agent (16 metrics, 3 passes) → Rank Agent (10 metrics) → 🃏 NTP Cards → ⚡ Copilot Execution (13 metrics)
Evaluation Conventions
N/A Handling

Conditional metrics are marked N/A when their preconditions aren't met; N/A results don't count toward pass/fail. Tracked as n/a_rate per metric (target: < 40%).

Responsibility Boundary

Every failure is tagged cj_pipeline | m365_copilot | ambiguous. Only cj_pipeline failures count toward the CJ pass/fail verdict.
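These two conventions are mechanical enough to encode directly; a minimal sketch, assuming hypothetical per-case fields `verdict` and `owner`:

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    metric: str
    verdict: str  # "pass" | "fail" | "n/a"
    owner: str    # "cj_pipeline" | "m365_copilot" | "ambiguous" (set on failures)

def pass_rate(results: list[MetricResult]) -> float:
    """N/A results are excluded, and only cj_pipeline failures count against CJ."""
    applicable = [r for r in results if r.verdict != "n/a"]
    cj_fails = sum(r.verdict == "fail" and r.owner == "cj_pipeline"
                   for r in applicable)
    return 1 - cj_fails / len(applicable) if applicable else float("nan")

def na_rate(results: list[MetricResult]) -> float:
    """Tracked per metric; target < 40%."""
    return sum(r.verdict == "n/a" for r in results) / len(results)
```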

Part 1: Per-Agent Output Evaluation
📋 Grouping Agent Output · 3 metrics

Does each data source's clustering correctly identify task candidates from raw signals? Runs independently per source — Email, Teams, Calendar, Meeting Recaps, Files, Browser History.

Intra-source Grouping Coherence

Are all items in one cluster truly about the same task?

AI reviews each cluster → judges semantic unity. Unrelated items lumped = failure.
Intra-source Grouping Completeness

Are all items for the same task in the same cluster?

AI scans unclustered items → flags items that should belong to existing cluster.
Per-source Coverage

Did sources with task signals produce candidates?

Compare active sources vs. zero-candidate sources → flag missed signals.
✓ Coherence error < 5% · Completeness error < 5%
🔀 Merge Agent Output · 16 metrics · 3 passes

Are candidates from different data sources correctly fused into complete, well-defined journeys? The most complex step — 16 metrics split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.

PASS A — Fusion Quality
Cross-source Fusion (Over-merging)

Were distinct tasks incorrectly merged?

Different owners, deadlines, or deliverables = distinct tasks
Cross-source Fusion (Under-merging)

Were same-task candidates left separate?

User would describe as one to-do item = same task
Source Attribution Precision

No wrong source links?

Content directly relevant = correct. Tangential = incorrect.
Source Attribution Recall

Key sources included?

Key = primary trigger, deadline info, manager+ stakeholder.
Source Role Assignment

Correct roles? (Email=trigger, Calendar=deadline, etc.)

N/A when single source.
✓ Over-merge < 3% · Under-merge < 5% · Precision ≥ 95% · Recall ≥ 90%
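Source Attribution Precision and Recall reduce to set overlap against the golden annotation; a minimal sketch, assuming each journey's linked source IDs and the annotated relevant/key source IDs are available (all names here are illustrative):

```python
def attribution_precision(linked: set[str], relevant: set[str]) -> float:
    """Share of linked sources that are directly relevant (target >= 95%)."""
    return len(linked & relevant) / len(linked) if linked else 1.0

def attribution_recall(linked: set[str], key: set[str]) -> float:
    """Share of key sources (primary trigger, deadline info, manager+
    stakeholder) that the journey actually links (target >= 90%)."""
    return len(linked & key) / len(key) if key else 1.0
```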
PASS B — Task Definition
Task Accuracy

Is this a real task the user needs to do?

Explicitly assigned/committed = real. CC'd FYI → inferred action = not real.
Task Granularity

Right scope? Not too broad/narrow?

Score 1-5. "Prepare for work this week" = 1. "Draft reply to VP's email" = 5.
AI Actionability

Does this task benefit from AI?

Too simple (one-click) or beyond AI (physical) = flagged.
Gate Condition Compliance

Passes all 7 gates?

IS_CONCRETE_TASK · IS_WORK_RELATED · HAS_USER_RELEVANCE · IS_NOT_COMPLETED · HAS_MEANINGFUL_COMPLEXITY · REQUIRES_BROWSER · IS_AI_EXECUTABLE
Hallucination Detection

Zero traceable source signals?

No evidence in ANY source = hallucination (critical). Strong multi-source inference ≠ hallucination.
✓ Task Accuracy ≥ 90% · Hallucination = 0%
PASS C — Filtering & Lifecycle
Lifecycle Accuracy

NEW / ONGOING / RECURRING correctly assigned?

N/A on first-run with no history.
Incremental Update Fidelity

Original context preserved when updating?

N/A for NEW_WORK.
Dismissed Topic Suppression

Dismissed journeys stay dismissed?

N/A when no dismiss history.
Privacy & Sensitivity Filtering

Sensitive topics blocked? PII scrubbed?

Any PII = ship-blocker. Personal-but-not-PII = High.
Staleness Rejection

Signals > 3 days filtered out?

ALL sources expired AND no fresh merge = stale.
Noise Rejection

Trivial signals suppressed?

Auto-generated emails, all-day placeholders, newsletters = noise.
✓ Lifecycle ≥ 90% · Privacy failure = 0% · Noise pass-through < 5%
🏆 Rank Agent Output · 10 metrics

Are journeys correctly prioritized with high-quality execution plans?

Priority Signal Alignment

Top-ranked = highest urgency/importance?

Recall@3

How many of the human-labeled ideal Top-3 appear in the system's Top-3?

Time Sensitivity Accuracy

Urgency score matches actual timestamps?

Ranking Stability

No erratic changes without new signals?

Task Category Distribution

Top-N spread across different action types?

Data Source Coverage

Top-N drawing from multiple sources?

Temporal Trigger Diversity

Varied urgency triggers?

Cross-user Consistency

Systemic issues vs. isolated?

ExecutionPlan Quality

Well-structured, specific, executable? Score 1-5.

ExecutionPlan-Journey Alignment

Plan matches journey's task?

✓ Top-3 inversion < 10% · Recall@3 ≥ 0.6 · No single action_type > 60% · ExecutionPlan avg ≥ 3.5
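Recall@3 has a closed-form definition; a minimal sketch, assuming human-labeled ideal Top-3 journey IDs per case:

```python
def recall_at_3(system_top3: list[str], ideal_top3: list[str]) -> float:
    """Fraction of the human-labeled ideal Top-3 that appear in the
    system's Top-3 (pass criterion: average across cases >= 0.6)."""
    ideal = set(ideal_top3[:3])
    return len(set(system_top3[:3]) & ideal) / len(ideal) if ideal else 1.0
```

For example, a system Top-3 of [A, D, C] against an ideal Top-3 of [A, B, C] scores 2/3 ≈ 0.67.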
Part 2: Final Output Evaluation
🃏 Card Presentation Quality · 4 metrics
Brief Clarity

Specific, actionable, immediately understood?

Label Accuracy

Label reflects actual urgency trigger?

Icon Relevance

Icon matches actual task?

Promise Accuracy

Card accurately represents execution output? Score 1-5.

✓ Clarity ≥ 90% · Promise score ≤ 2 = 0%

Execution & Output Quality · 6 metrics

Root-cause tagging required: cj_pipeline | m365_copilot | ambiguous

Execution Completion

Success vs. fail/timeout rate

Timeout = likely m365. Malformed prompt = cj_pipeline.
Output Correctness

Factual claims grounded in user data? ≥ 95%

Bad context = cj. Copilot hallucinated = m365.
Output Completeness

Covers all key threads?

CJ omitted context = cj. Copilot ignored = m365.
Effort Reduction

Synthesizes beyond raw data? Score 1-5.

Typically m365_copilot.
Actionability

Concrete next steps? ≥ 1 action?

Prompt didn't request = cj. Requested but not delivered = m365.
Scenario Fit

Output format matches action_type?

Wrong format in prompt = cj. Correct request, wrong delivery = m365.
✓ Completion ≥ 95% · Correctness ≥ 95% (cj_pipeline) · Effort Reduction avg ≥ 3.5
🔗 End-to-End Coherence & Task Resolution · 3 metrics
Full-chain Topic Alignment

Primary topic consistent across all stages?

Task Resolution Score

Can user advance task from output alone? Score 1-5.

Recurrence Handling

Correct upcoming instance for RECURRING_WORK?

✓ Topic break = 0% · Resolution avg ≥ 3.5
Failure Mode Taxonomy
27 failure modes by severity: 🔴 3 · 🟠 7 · 🟡 13 · 🟢 4
🔴 Ship-Blocker — Zero Tolerance
  • Privacy Violation (Merge)
  • PII Leakage (Merge)
  • Resurfaced Dismissed Topic (Merge)
🟠 Critical — < 2% Threshold
  • Hallucinated Journey (Merge)
  • Wrong Source Attribution (Merge)
  • Phantom Task (Merge)
  • Completed Task Surfaced (Merge)
  • Promise-Delivery Gap (Output)
  • Over-Promise (Output)
  • Output Hallucination (Output)
🟡 High — Needs Mitigation
  • Topic Drift (Output)
  • Missed Source Entirely (Grouping)
  • Mis-grouped Items (Grouping)
  • Orphaned Key Item (Grouping)
  • Over-merged Tasks (Merge)
  • Wrong Lifecycle (Merge)
  • Scope Mismatch (Merge)
  • AI Misfit (Merge)
  • Noise Leak (Merge)
  • Stale Journey (Merge)
  • Priority Inversion (Rank)
  • Bad ExecutionPlan (Rank)
  • Missing Key Context (Output)
🟢 Medium — Can Ship with Tracking
  • Wrong Help Type (Output)
  • Under-merged Duplicates (Merge)
  • Monoculture Recommendations (Rank)
  • Vague Card (Output)
Verdict Rules
🚫 Any 🔴 Ship-Blocker → Fail — blocks ship
🚫 🟠 Critical ≥ 2% → Fail — must fix before GA
⚠️ 🟠 < 2% AND ≥ 2 🟡 → Conditional — needs mitigation
✅ 🟠 < 2% AND 0-1 🟡 AND all pass → Ship-ready
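The verdict rules are deterministic, so they can be computed rather than judged; a minimal sketch, assuming failure counts per severity and a critical-rate denominator of evaluated journeys (treating "some metric fails but fewer than two 🟡" as Conditional is an assumption):

```python
def verdict(ship_blockers: int, critical: int, high: int,
            total_journeys: int, all_pass: bool) -> str:
    """Apply the verdict rules to one eval run."""
    if ship_blockers > 0:
        return "Fail: blocks ship"
    if critical / total_journeys >= 0.02:
        return "Fail: must fix before GA"
    if high >= 2 or not all_pass:   # assumption: any failing criterion => Conditional
        return "Conditional: needs mitigation"
    return "Ship-ready"
```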
Evaluation Structure Summary
Eval Scope | What | Sub-metrics
Grouping Agent | Per-source clustering quality | 3
Merge Agent — Pass A | Fusion quality | 5
Merge Agent — Pass B | Task definition | 5
Merge Agent — Pass C | Filtering & lifecycle | 6
Rank Agent | Priority, recall, diversity, ExecutionPlan | 10
Card Presentation | Pre-click experience | 4
Execution & Output | Post-click value | 6
E2E Coherence | Full-chain consistency | 3
Total |  | 42
Manual Eval Cross-reference
Manual Eval | Machine Eval Mapping | Coverage
C.1 Task Accuracy | Merge Pass B: Task Accuracy | ♻️ Both
C.2 Task Relevance | Merge Pass B: Gate Compliance | ♻️ Both
C.3 AI Actionability | Merge Pass B: AI Actionability | ♻️ Both
C.4 Privacy | Merge Pass C: Privacy Filtering | ♻️ Both
C.5-C.8 Card Display | Card: Clarity, Label, Icon | ♻️ Both
X.1 Top-3 Ranking | Rank: Priority + Recall@3 | ♻️ Both
X.2 Missing Journeys | Rank: Recall@3 + Grouping | ♻️ Both
X.3 Shouldn't Be There | Merge Pass B: Gate Compliance | ♻️ Both
O.1-O.5 Output Quality | Execution & Output (all) | ♻️ Both
U.1-U.4 UX |  | Manual only
G.1-G.2 Overall |  | Manual only
Consumer Eval Cross-reference
Consumer Dimension | Commercial Mapping | Status
L1 Relevance | Merge Pass B: Task Accuracy + Hallucination |
L1 Groundedness | Merge Pass A: Source Attribution |
L1 Helpfulness | Merge Pass B: AI Actionability + Gates |
L1 Safety & Privacy | Merge Pass C: Privacy Filtering |
L1 Technical Feasibility | Card: Promise Accuracy |
L2 Clarity | Card: Brief Clarity |
L2 Promise Accuracy | Card: Promise Accuracy |
L3 Correctness | Execution: Output Correctness |
L3 Completeness | Execution: Output Completeness |
L3 Effort Reduction | Execution: Effort Reduction |
L3 Actionability | Execution: Actionability |
L3 Scenario Fit | Execution: Scenario Fit |
L4 Self-Sufficiency | E2E: Task Resolution Score |
L4 E2E Coherence | E2E: Full-chain Topic Alignment |
L4 Engagement / Trust | Online-only | ⚪ N/A

Step 2 — Build Eval Dataset

🚧 Coming soon — Dataset construction details will be added here.

Step 3 — Generate Outputs

🚧 Coming soon — Output generation and trace capture details will be added here.

Step 4 — Human Labeling & Calibration

🚧 Coming soon — Annotation workflow and golden set details will be added here.

Step 5 — Build Automated Judge

🚧 Coming soon — LLM-as-Judge calibration and rule-based checks will be added here.

Step 6 — Eval + GEPA Optimization

The final step combines baseline evaluation with iterative prompt optimization using GEPA (Genetic-Pareto), a reflective prompt-evolution method reported to outperform RL-based optimization (GRPO) while using up to 35× fewer rollouts.

6a. Baseline Evaluation

Run the hold-out test set against all current pipeline prompts. Record baseline scores across all 42 metrics. This is the test set's first use.

6b. GEPA Optimization Loop
Premise: Lock the Model

Commercial Journeys runs on Sydney. GEPA optimizes prompts only — no model switching.

Optimization Order: Per-Agent, Then Global

Optimize one agent at a time (freezing others), then combine the best candidates across agents. The per-agent order isn't critical — what matters is the global combination step at the end.

Option A: Upstream → Downstream
Grouping → Merge → Rank
Better inputs flow down; harder to see E2E impact early
Option B: Downstream → Upstream
Rank → Merge → Grouping
E2E metrics visible early; upstream changes may shift inputs

Either way, the 6b-2 Global Combination step (below) finds the best cross-agent pairing, so per-agent order is not the deciding factor.

Per-Agent GEPA Cycle — Visual Flowchart
Initial State
🌱 Seed Prompt
At the start there is only one prompt (P₀): the prompt currently in use.
After a few rounds of iteration, the candidate pool grows: [P₀, P₁, P₂, P₃, ...]
Step 1
📊 Score on the Validation Set → Pareto Frontier → Select a Parent
The fixed validation set (D_pareto) scores every prompt in the candidate pool. Find the Pareto frontier: a prompt is on the frontier if no other prompt beats it on every case.

Prompt | Case 1 | Case 2 | Case 3 | Status
P₀ | 0.9 | 0.4 | 0.7 | ✅ Frontier
P₁ | 0.6 | 0.8 | 0.7 | ✅ Frontier
P₂ | 0.7 | 0.7 | 0.9 | ✅ Frontier
P₃ | 0.5 | 0.5 | 0.5 | ❌ Dominated

➜ Select a parent from the frontier [P₀, P₁, P₂], favoring prompts picked less often; suppose P₀ is chosen
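Frontier membership is a plain dominance check; a minimal sketch over the score table above (pareto_frontier is a name introduced here, not an existing API):

```python
def pareto_frontier(scores: dict[str, list[float]]) -> list[str]:
    """Keep every prompt that no other prompt beats on all cases.
    A dominates B if A >= B everywhere and A > B somewhere."""
    def dominates(a: list[float], b: list[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [p for p, s in scores.items()
            if not any(dominates(t, s) for q, t in scores.items() if q != p)]

scores = {"P0": [0.9, 0.4, 0.7], "P1": [0.6, 0.8, 0.7],
          "P2": [0.7, 0.7, 0.9], "P3": [0.5, 0.5, 0.5]}
print(pareto_frontier(scores))  # ['P0', 'P1', 'P2']; P3 is dominated by P2
```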
Step 2
🔍 Feedback Set Exposes Failures → Reflection
Randomly sample a batch of cases from the feedback set (D_feedback) and run them through parent P₀. Collect the failure traces and feed them to a Reflection LLM (GPT-5 / Claude Opus):
💭 Reflection LLM analysis:
"P₀ is too conservative during cross-source fusion and frequently under-merges.
Suggest adding an instruction: merge when multiple sources point to the same deliverable."
Step 3
🧬 Targeted Mutation
Based on its analysis, the Reflection LLM rewrites parent P₀ in a targeted way → producing child P₄
P₀ original → targeted rewrite based on the reflection → P₄ new version

Targeted mutation: not a random edit, but a rewrite driven by the specific failure causes.
🔀 Multi-module prompts can also be crossed over (Crossover): combine the best modules from different candidates.
Step 4
🎯 Acceptance
Run child P₄ on the same batch of feedback-set cases and compare it against parent P₀:
P₄ beats P₀ → added to the candidate pool: [P₀, P₁, P₂, P₄]
P₄ is no better than P₀ → discarded; the pool stays unchanged
Step 5
🔄 Return to Step 1 and repeat
The next round re-scores all candidates, recomputes the Pareto frontier, and picks a new parent... The loop stops when any of the following holds:
⏱️ Iteration cap reached (e.g. 15 rounds)
📉 Best score changes < 1% for N consecutive rounds
🚫 Candidate pool no longer grows
🎯 All pass criteria met
📋 Three Datasets, Three Jobs
🔒 Validation set (D_pareto): fixed; scores every prompt. Like a standardized exam paper.
🔀 Feedback set (D_feedback): resampled every round; exposes failures. Like daily homework.
🔐 Test set (Hold-out): fully isolated; used only for the final evaluation. Like the college entrance exam.
💡 One-sentence Summary
Pareto selects the parent → Reflection analyzes the failures → Mutation rewrites with intent → Acceptance keeps or discards → repeat until convergence.
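Put together, one GEPA round is a short loop; a sketch reusing pareto_frontier from the Step 1 sketch, with run_validation, run_feedback, and reflect_and_mutate as hypothetical stubs and the batch size an arbitrary choice:

```python
import random

def gepa_round(pool: list[str], d_pareto: list, d_feedback: list,
               pick_counts: dict[str, int]) -> list[str]:
    """One round of the loop above; mutates and returns the candidate pool."""
    scores = {p: run_validation(p, d_pareto) for p in pool}   # Step 1: score on D_pareto
    frontier = pareto_frontier(scores)                        # Step 1: Pareto frontier
    parent = min(frontier, key=lambda p: pick_counts[p])      # Step 1: favor less-picked parents
    pick_counts[parent] += 1
    batch = random.sample(d_feedback, k=16)                   # Step 2: sample D_feedback (size assumed)
    parent_score, traces = run_feedback(parent, batch)        # Step 2: collect failure traces
    child = reflect_and_mutate(parent, traces)                # Step 3: targeted rewrite
    child_score, _ = run_feedback(child, batch)               # Step 4: same batch, head-to-head
    if child_score > parent_score:                            # Step 4: accept into pool, else discard
        pool.append(child)
        pick_counts[child] = 0
    return pool
```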
On Convergence: Convergence ≠ Global Optimum

Convergence only means that under the current search strategy, further optimization no longer pays off. It does not prove the resulting prompt is the global optimum: a prompt built on a completely different approach could still be better, but GEPA's mutation path never explored it.

Three Convergence Signals
📉 Scores have flatlined

The best prompt's score fluctuates < 1% for N consecutive rounds

Example: rounds 10-13 oscillate between 0.82 and 0.824
🚫 No better children emerge

Every new child is rejected for N consecutive rounds and the candidate pool stops growing

Example: P₁₅, P₁₆, P₁₇ all lose to their parents three rounds in a row
🎯 Good enough already

All pass criteria are met

Example: Hallucination < 2% ✅, Task Accuracy > 90% ✅
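The three signals translate into a stop check; a minimal sketch, assuming per-round best scores and child-acceptance flags are logged, and reading "< 1%" as relative to the current best score:

```python
def should_stop(best_scores: list[float], child_accepted: list[bool],
                criteria_met: bool, n: int = 3) -> bool:
    """True when any convergence signal fires."""
    flat = (len(best_scores) >= n and
            max(best_scores[-n:]) - min(best_scores[-n:])
            < 0.01 * best_scores[-1])                  # scores flatlined (< 1% relative)
    stalled = (len(child_accepted) >= n and
               not any(child_accepted[-n:]))           # no child accepted for n rounds
    return flat or stalled or criteria_met             # or all pass criteria already met
```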
Why the Test Set Is Used for Final Validation (6c)

The point is not to verify that the prompt is optimal, but to verify that the optimized prompt genuinely holds up on cases it has never seen:

✅ Test-set score ≈ validation-set score
No overfitting; the result is trustworthy
→ Ship
⚠️ Test-set score ≪ validation-set score
The prompt has "studied to the test"
→ Adjust the dataset or the strategy
Per-Agent GEPA Configuration
Agent | Seed Prompt | Frozen Context | Eval Metrics | Pareto Dimensions
Rank | Current Rank prompt | Fixed Merge outputs | Priority Alignment, Recall@3, ExecutionPlan Quality & Alignment | Per-case aggregate score across eval metrics
Merge | Current Merge prompt | Fixed Grouping outputs | Pass A + B + C (all 16 metrics) | Per-case aggregate score across eval metrics
Grouping (×6) | Per-source prompts | Raw source signals | Coherence, Completeness, Coverage | Per-case aggregate score across eval metrics

Pareto dimensions = each golden case's aggregate score (not per-metric). This keeps the Pareto frontier meaningful — too many dimensions would make almost every prompt non-dominated.

Top-N Selection from Pareto Frontier

After GEPA converges, the frontier usually has multiple non-dominated prompts. Select top-N by average score across all golden cases:

Prompt | G1 | G2 | G3 | Avg | Rank
P₂ | 0.7 | 0.7 | 0.9 | 0.77 | 🥇
P₅ | 0.8 | 0.6 | 0.8 | 0.73 | 🥈
P₁ | 0.6 | 0.8 | 0.7 | 0.70 | 🥉
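Top-N selection is just an average over golden cases on top of the frontier; a minimal sketch in the same score-table shape as Step 1:

```python
def top_n(frontier_scores: dict[str, list[float]], n: int = 3) -> list[str]:
    """Rank frontier prompts by mean score across golden cases; keep the top n."""
    avg = {p: sum(s) / len(s) for p, s in frontier_scores.items()}
    return sorted(avg, key=avg.get, reverse=True)[:n]

print(top_n({"P2": [0.7, 0.7, 0.9], "P5": [0.8, 0.6, 0.8], "P1": [0.6, 0.8, 0.7]}))
# ['P2', 'P5', 'P1']
```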
Regression Check After Each Agent

After optimizing each agent, run a quick E2E eval to confirm other agents haven't regressed. If regression is significant, re-adapt the affected agent before moving on.

6b-2. Global Combination (Cross-Agent)

Local optima ≠ global optimum. Each agent was optimized with others frozen, but in production all prompts interact. This step finds the best combination.

🧩 Assemble Candidates
Grouping: Top-1 per source (6 sources independent, no cross-combination needed)
Merge: Top-3 from Pareto frontier → [M₁, M₂, M₃]
Rank: Top-3 from Pareto frontier → [R₁, R₂, R₃]
🔄 Combinatorial E2E Eval
3 × 3 = 9 combinations, each evaluated end-to-end on the validation set.
Cost: only 9 E2E eval runs — very manageable.
🏆 Select Global Optimum
The winning combination might be M₂ + R₁ (not M₁ + R₁).
A prompt that's #2 in isolation may pair better with another agent's top pick.
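A sketch of the 3 × 3 sweep; run_e2e_eval is a hypothetical harness returning an aggregate validation-set score:

```python
from itertools import product

def best_combination(merge_top3: list[str], rank_top3: list[str],
                     grouping_top1: dict[str, str]):
    """Try every Merge x Rank pairing end-to-end (3 x 3 = 9 runs);
    Grouping stays fixed at its per-source Top-1 prompts."""
    results = {(m, r): run_e2e_eval(grouping_top1, m, r)  # hypothetical E2E harness
               for m, r in product(merge_top3, rank_top3)}
    return max(results, key=results.get)  # e.g. ('M2', 'R1'), not necessarily ('M1', 'R1')
```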
6c. Final Validation

Run the hold-out test set against the globally optimal prompt combination from 6b-2. Compare against 6a baseline. This is the test set's second and final use within this optimization round.

  • Confirm real improvement (not just validation-set overfitting)
  • If a second GEPA round is needed, supplement ≥ 30% new test cases
6d. Continuous Regression & Calibration
Automated Regression
  • Every prompt change / model version update / pipeline logic change → auto-run eval
  • Integrate into CI/CD — score drops trigger alerts (see the gate sketch below)
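A sketch of such a gate, assuming baseline and current scores are checked in as flat metric-to-score JSON files and a hypothetical one-point tolerance:

```python
import json
import sys
from pathlib import Path

def regression_gate(baseline_path: str, current_path: str, tol: float = 0.01) -> int:
    """Exit non-zero if any metric drops more than tol below its baseline."""
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    drops = {m: b - current.get(m, 0.0)
             for m, b in baseline.items() if b - current.get(m, 0.0) > tol}
    for metric, delta in sorted(drops.items()):
        print(f"REGRESSION {metric}: -{delta:.3f}")
    return 1 if drops else 0

if __name__ == "__main__":
    sys.exit(regression_gate(sys.argv[1], sys.argv[2]))
```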
LLM Judge Calibration
Phase | Signal | Precision | Volume
Selfhost early | Manual eval scores | High | Small
Selfhost mid-late | User behavior (Dismiss, Pin, Thumbs, CTR) | Medium | Large
Post-GA | Large-scale behavior + periodic human audit | Medium→High | Large
Dataset Maintenance
  • Periodically add new cases (new scenarios, edge cases, data sources)
  • Track N/A rates — supplement coverage when needed
  • If test set reused across multiple optimization rounds → refresh ≥ 30%