● Microsoft Edge · Commercial Journeys

Machine Eval + GEPA Iteration

A systematic framework for evaluating and optimizing the Commercial Journeys pipeline through automated evaluation and reflective prompt evolution.

The Six-Step Framework

STEP 1

Define Goals & Metrics

✓ Draft Done
  • Clarify evaluation target
  • Define per-agent metrics + end-to-end metrics
  • Set scoring criteria and boundary definitions
  • Define failure taxonomy (Ship-Blocker / Critical / High / Medium)
STEP 2

Build Eval Dataset

Pending
  • Build input data + ground truth; cover positive / negative / edge cases
  • Split into 3 non-overlapping sets:
    • Feedback Set (D_feedback) — GEPA samples each round
    • Validation Set (D_pareto) — fixed Pareto scoring
    • Hold-out Test Set — formal eval only
  • Hold-out locked first; max 2 uses per optimization round
STEP 3

Generate Outputs

Pending
  • Collect final results + per-agent intermediate outputs
  • Record metadata (latency, tokens, model version, prompt version)
STEP 4

Human Labeling & Calibration

Pending
  • Source 1: Selfhost Manual Eval scores (C.1-O.5) → golden set baseline
  • Source 2: Expert annotation for machine-eval-specific metrics (Source Attribution, Lifecycle, etc.)
  • ≥ 2 independent annotators; golden set = input + output + per-metric scores + rationale
STEP 5

Build Automated Judge

Pending
  • LLM Judge calibration (Cohen's κ ≥ 0.7) + rule-based checks
  • Same Judge = formal eval scorer + GEPA fitness function
STEP 6

Eval + GEPA Optimization

✓ Mechanism Draft
  • 6a Baseline run on hold-out test set
  • 6b Per-agent GEPA loop (order flexible)
  • 6b-2 Global combination: top-N × top-N → best pairing
  • 6c Final validation vs baseline
  • 6d Continuous regression + CI/CD

Step 1 — Define Goals & Metrics

Commercial Journeys is a multi-agent pipeline. Evaluation operates at two levels: per-agent output (fault isolation) and end-to-end output (holistic quality).

42 Total Metrics · 8 Eval Scopes · 27 Failure Modes · 4 Severity Levels
Pipeline Architecture
📧 M365 Data + 🌐 Browser History → Grouping Agent ×6 (3 metrics) → Merge Agent (16 metrics, 3 passes) → Rank Agent (10 metrics) → 🃏 NTP Cards → ⚡ Copilot Execution (13 metrics)
Evaluation Conventions
N/A Handling

Conditional metrics are marked N/A when their preconditions aren't met; N/A results don't count toward pass/fail. Tracked as n/a_rate per metric (target: < 40%).

Responsibility Boundary

Every failure is tagged cj_pipeline | m365_copilot | ambiguous. Only cj_pipeline failures count toward the CJ pass/fail verdict.
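These two conventions are mechanical enough to encode directly; a minimal sketch, assuming hypothetical per-case fields `verdict` and `owner`:

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    metric: str
    verdict: str  # "pass" | "fail" | "n/a"
    owner: str    # "cj_pipeline" | "m365_copilot" | "ambiguous" (set on failures)

def pass_rate(results: list[MetricResult]) -> float:
    """N/A results are excluded, and only cj_pipeline failures count against CJ."""
    applicable = [r for r in results if r.verdict != "n/a"]
    cj_fails = sum(r.verdict == "fail" and r.owner == "cj_pipeline"
                   for r in applicable)
    return 1 - cj_fails / len(applicable) if applicable else float("nan")

def na_rate(results: list[MetricResult]) -> float:
    """Tracked per metric; target < 40%."""
    return sum(r.verdict == "n/a" for r in results) / len(results)
```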

Part 1: Per-Agent Output Evaluation
📋 Grouping Agent Output · 3 metrics

Does each data source's clustering correctly identify task candidates from raw signals? Runs independently per source — Email, Teams, Calendar, Meeting Recaps, Files, Browser History.

Intra-source Grouping Coherence

Are all items in one cluster truly about the same task?

AI reviews each cluster → judges semantic unity. Unrelated items lumped = failure.
Intra-source Grouping Completeness

Are all items for the same task in the same cluster?

AI scans unclustered items → flags items that should belong to existing cluster.
Per-source Coverage

Did sources with task signals produce candidates?

Compare active sources vs. zero-candidate sources → flag missed signals.
✓ Coherence error < 5% · Completeness error < 5%
🔀 Merge Agent Output · 16 metrics · 3 passes

Are candidates from different data sources correctly fused into complete, well-defined journeys? The most complex step — 16 metrics split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.

PASS A — Fusion Quality
Cross-source Fusion (Over-merging)

Were distinct tasks incorrectly merged?

Different owners, deadlines, or deliverables = distinct tasks
Cross-source Fusion (Under-merging)

Were same-task candidates left separate?

User would describe as one to-do item = same task
Source Attribution Precision

No wrong source links?

Content directly relevant = correct. Tangential = incorrect.
Source Attribution Recall

Key sources included?

Key = primary trigger, deadline info, manager+ stakeholder.
Source Role Assignment

Correct roles? (Email=trigger, Calendar=deadline, etc.)

N/A when single source.
✓ Over-merge < 3% · Under-merge < 5% · Precision ≥ 95% · Recall ≥ 90%
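Source Attribution Precision and Recall reduce to set overlap against the golden annotation; a minimal sketch, assuming each journey's linked source IDs and the annotated relevant/key source IDs are available (all names here are illustrative):

```python
def attribution_precision(linked: set[str], relevant: set[str]) -> float:
    """Share of linked sources that are directly relevant (target >= 95%)."""
    return len(linked & relevant) / len(linked) if linked else 1.0

def attribution_recall(linked: set[str], key: set[str]) -> float:
    """Share of key sources (primary trigger, deadline info, manager+
    stakeholder) that the journey actually links (target >= 90%)."""
    return len(linked & key) / len(key) if key else 1.0
```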
PASS B — Task Definition
Task Accuracy

Is this a real task the user needs to do?

Explicitly assigned/committed = real. CC'd FYI → inferred action = not real.
Task Granularity

Right scope? Not too broad/narrow?

Score 1-5. "Prepare for work this week" = 1. "Draft reply to VP's email" = 5.
AI Actionability

Does this task benefit from AI?

Too simple (one-click) or beyond AI (physical) = flagged.
Gate Condition Compliance

Passes all 7 gates?

IS_CONCRETE_TASK · IS_WORK_RELATED · HAS_USER_RELEVANCE · IS_NOT_COMPLETED · HAS_MEANINGFUL_COMPLEXITY · REQUIRES_BROWSER · IS_AI_EXECUTABLE
Hallucination Detection

Zero traceable source signals?

No evidence in ANY source = hallucination (critical). Strong multi-source inference ≠ hallucination.
✓ Task Accuracy ≥ 90% · Hallucination = 0%
PASS C — Filtering & Lifecycle
Lifecycle Accuracy

NEW / ONGOING / RECURRING correctly assigned?

N/A on first-run with no history.
Incremental Update Fidelity

Original context preserved when updating?

N/A for NEW_WORK.
Dismissed Topic Suppression

Dismissed journeys stay dismissed?

N/A when no dismiss history.
Privacy & Sensitivity Filtering

Sensitive topics blocked? PII scrubbed?

Any PII = ship-blocker. Personal-but-not-PII = High.
Staleness Rejection

Signals > 3 days filtered out?

ALL sources expired AND no fresh merge = stale.
Noise Rejection

Trivial signals suppressed?

Auto-generated emails, all-day placeholders, newsletters = noise.
✓ Lifecycle ≥ 90% · Privacy failure = 0% · Noise pass-through < 5%
🏆 Rank Agent Output · 10 metrics

Are journeys correctly prioritized with high-quality execution plans?

Priority Signal Alignment

Top-ranked = highest urgency/importance?

Recall@3

How many of the human-labeled ideal Top-3 appear in the system's Top-3?

Time Sensitivity Accuracy

Urgency score matches actual timestamps?

Ranking Stability

No erratic changes without new signals?

Task Category Distribution

Top-N spread across different action types?

Data Source Coverage

Top-N drawing from multiple sources?

Temporal Trigger Diversity

Varied urgency triggers?

Cross-user Consistency

Systemic issues vs. isolated?

ExecutionPlan Quality

Well-structured, specific, executable? Score 1-5.

ExecutionPlan-Journey Alignment

Plan matches journey's task?

✓ Top-3 inversion < 10% · Recall@3 ≥ 0.6 · No single action_type > 60% · ExecutionPlan avg ≥ 3.5
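Recall@3 has a closed-form definition; a minimal sketch, assuming human-labeled ideal Top-3 journey IDs per case:

```python
def recall_at_3(system_top3: list[str], ideal_top3: list[str]) -> float:
    """Fraction of the human-labeled ideal Top-3 that appear in the
    system's Top-3 (pass criterion: average across cases >= 0.6)."""
    ideal = set(ideal_top3[:3])
    return len(set(system_top3[:3]) & ideal) / len(ideal) if ideal else 1.0
```

For example, a system Top-3 of [A, D, C] against an ideal Top-3 of [A, B, C] scores 2/3 ≈ 0.67.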
Part 2: Final Output Evaluation
🃏 Card Presentation Quality · 4 metrics
Brief Clarity

Specific, actionable, immediately understood?

Label Accuracy

Label reflects actual urgency trigger?

Icon Relevance

Icon matches actual task?

Promise Accuracy

Card accurately represents execution output? Score 1-5.

✓ Clarity ≥ 90% · Promise score ≤ 2 = 0%

Execution & Output Quality · 6 metrics

Root-cause tagging required: cj_pipeline | m365_copilot | ambiguous

Execution Completion

Success vs. fail/timeout rate

Timeout = likely m365. Malformed prompt = cj_pipeline.
Output Correctness

Factual claims grounded in user data? ≥ 95%

Bad context = cj. Copilot hallucinated = m365.
Output Completeness

Covers all key threads?

CJ omitted context = cj. Copilot ignored = m365.
Effort Reduction

Synthesizes beyond raw data? Score 1-5.

Typically m365_copilot.
Actionability

Concrete next steps? ≥ 1 action?

Prompt didn't request = cj. Requested but not delivered = m365.
Scenario Fit

Output format matches action_type?

Wrong format in prompt = cj. Correct request, wrong delivery = m365.
✓ Completion ≥ 95% · Correctness ≥ 95% (cj_pipeline) · Effort Reduction avg ≥ 3.5
🔗 End-to-End Coherence & Task Resolution · 3 metrics
Full-chain Topic Alignment

Primary topic consistent across all stages?

Task Resolution Score

Can user advance task from output alone? Score 1-5.

Recurrence Handling

Correct upcoming instance for RECURRING_WORK?

✓ Topic break = 0% · Resolution avg ≥ 3.5
Failure Mode Taxonomy
27 failure modes by severity: 🔴 3 · 🟠 7 · 🟡 13 · 🟢 4
🔴 Ship-Blocker — Zero Tolerance
  • Privacy Violation (Merge)
  • PII Leakage (Merge)
  • Resurfaced Dismissed Topic (Merge)
🟠 Critical — < 2% Threshold
  • Hallucinated Journey (Merge)
  • Wrong Source Attribution (Merge)
  • Phantom Task (Merge)
  • Completed Task Surfaced (Merge)
  • Promise-Delivery Gap (Output)
  • Over-Promise (Output)
  • Output Hallucination (Output)
🟡 High — Needs Mitigation
  • Topic Drift (Output)
  • Missed Source Entirely (Grouping)
  • Mis-grouped Items (Grouping)
  • Orphaned Key Item (Grouping)
  • Over-merged Tasks (Merge)
  • Wrong Lifecycle (Merge)
  • Scope Mismatch (Merge)
  • AI Misfit (Merge)
  • Noise Leak (Merge)
  • Stale Journey (Merge)
  • Priority Inversion (Rank)
  • Bad ExecutionPlan (Rank)
  • Missing Key Context (Output)
🟢 Medium — Can Ship with Tracking
  • Wrong Help Type (Output)
  • Under-merged Duplicates (Merge)
  • Monoculture Recommendations (Rank)
  • Vague Card (Output)
Verdict Rules
🚫 Any 🔴 Ship-Blocker → Fail — blocks ship
🚫 🟠 Critical ≥ 2% → Fail — must fix before GA
⚠️ 🟠 < 2% AND ≥ 2 🟡 → Conditional — needs mitigation
✅ 🟠 < 2% AND 0-1 🟡 AND all pass → Ship-ready
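The verdict rules are deterministic, so they can be computed rather than judged; a minimal sketch, assuming failure counts per severity and a critical-rate denominator of evaluated journeys (treating "some metric fails but fewer than two 🟡" as Conditional is an assumption):

```python
def verdict(ship_blockers: int, critical: int, high: int,
            total_journeys: int, all_pass: bool) -> str:
    """Apply the verdict rules to one eval run."""
    if ship_blockers > 0:
        return "Fail: blocks ship"
    if critical / total_journeys >= 0.02:
        return "Fail: must fix before GA"
    if high >= 2 or not all_pass:   # assumption: any failing criterion => Conditional
        return "Conditional: needs mitigation"
    return "Ship-ready"
```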
Evaluation Structure Summary
Eval Scope | What | Sub-metrics
Grouping Agent | Per-source clustering quality | 3
Merge Agent — Pass A | Fusion quality | 5
Merge Agent — Pass B | Task definition | 5
Merge Agent — Pass C | Filtering & lifecycle | 6
Rank Agent | Priority, recall, diversity, ExecutionPlan | 10
Card Presentation | Pre-click experience | 4
Execution & Output | Post-click value | 6
E2E Coherence | Full-chain consistency | 3
Total |  | 42
Manual Eval Cross-reference
Manual Eval | Machine Eval Mapping | Coverage
C.1 Task Accuracy | Merge Pass B: Task Accuracy | ♻️ Both
C.2 Task Relevance | Merge Pass B: Gate Compliance | ♻️ Both
C.3 AI Actionability | Merge Pass B: AI Actionability | ♻️ Both
C.4 Privacy | Merge Pass C: Privacy Filtering | ♻️ Both
C.5-C.8 Card Display | Card: Clarity, Label, Icon | ♻️ Both
X.1 Top-3 Ranking | Rank: Priority + Recall@3 | ♻️ Both
X.2 Missing Journeys | Rank: Recall@3 + Grouping | ♻️ Both
X.3 Shouldn't Be There | Merge Pass B: Gate Compliance | ♻️ Both
O.1-O.5 Output Quality | Execution & Output (all) | ♻️ Both
U.1-U.4 UX |  | Manual only
G.1-G.2 Overall |  | Manual only
Consumer Eval Cross-reference
Consumer Dimension | Commercial Mapping | Status
L1 Relevance | Merge Pass B: Task Accuracy + Hallucination |
L1 Groundedness | Merge Pass A: Source Attribution |
L1 Helpfulness | Merge Pass B: AI Actionability + Gates |
L1 Safety & Privacy | Merge Pass C: Privacy Filtering |
L1 Technical Feasibility | Card: Promise Accuracy |
L2 Clarity | Card: Brief Clarity |
L2 Promise Accuracy | Card: Promise Accuracy |
L3 Correctness | Execution: Output Correctness |
L3 Completeness | Execution: Output Completeness |
L3 Effort Reduction | Execution: Effort Reduction |
L3 Actionability | Execution: Actionability |
L3 Scenario Fit | Execution: Scenario Fit |
L4 Self-Sufficiency | E2E: Task Resolution Score |
L4 E2E Coherence | E2E: Full-chain Topic Alignment |
L4 Engagement / Trust | Online-only | ⚪ N/A

Step 2 — Build Eval Dataset

🚧 Coming soon — Dataset construction details will be added here.

Step 3 — Generate Outputs

🚧 Coming soon — Output generation and trace capture details will be added here.

Step 4 — Human Labeling & Calibration

🚧 Coming soon — Annotation workflow and golden set details will be added here.

Step 5 — Build Automated Judge

🚧 Coming soon — LLM-as-Judge calibration and rule-based checks will be added here.

Step 6 — Eval + GEPA Optimization

The final step combines baseline evaluation with iterative prompt optimization using GEPA (Genetic-Pareto), a reflective prompt-evolution method reported to outperform RL-based optimization (GRPO) while using up to 35× fewer rollouts.

6a. Baseline Evaluation

Run the hold-out test set against all current pipeline prompts. Record baseline scores across all 42 metrics. This is the test set's first use.

6b. GEPA Optimization Loop
Premise: Lock the Model

Commercial Journeys runs on Sydney. GEPA optimizes prompts only — no model switching.

Optimization Order: Per-Agent, Then Global

Optimize one agent at a time (freezing others), then combine the best candidates across agents. The per-agent order isn't critical — what matters is the global combination step at the end.

Option A: Upstream → Downstream
Grouping → Merge → Rank
Better inputs flow down; harder to see E2E impact early
Option B: Downstream → Upstream
Rank → Merge → Grouping
E2E metrics visible early; upstream changes may shift inputs

Either way, the 6b-2 Global Combination step (below) finds the best cross-agent pairing, so per-agent order is not the deciding factor.

Per-Agent GEPA Cycle — Visual Flowchart
Initial State
🌱 Seed Prompt
At the start there is only one prompt (P₀): the prompt currently in use.
After a few rounds of iteration, the candidate pool grows: [P₀, P₁, P₂, P₃, ...]
Step 1
📊 Score on the Validation Set → Pareto Frontier → Select a Parent
The fixed validation set (D_pareto) scores every prompt in the candidate pool. Find the Pareto frontier: a prompt is on the frontier if no other prompt beats it on every case.

Prompt | Case 1 | Case 2 | Case 3 | Status
P₀ | 0.9 | 0.4 | 0.7 | ✅ Frontier
P₁ | 0.6 | 0.8 | 0.7 | ✅ Frontier
P₂ | 0.7 | 0.7 | 0.9 | ✅ Frontier
P₃ | 0.5 | 0.5 | 0.5 | ❌ Dominated

➜ Select a parent from the frontier [P₀, P₁, P₂], favoring prompts picked less often; suppose P₀ is chosen
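Frontier membership is a plain dominance check; a minimal sketch over the score table above (pareto_frontier is a name introduced here, not an existing API):

```python
def pareto_frontier(scores: dict[str, list[float]]) -> list[str]:
    """Keep every prompt that no other prompt beats on all cases.
    A dominates B if A >= B everywhere and A > B somewhere."""
    def dominates(a: list[float], b: list[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return [p for p, s in scores.items()
            if not any(dominates(t, s) for q, t in scores.items() if q != p)]

scores = {"P0": [0.9, 0.4, 0.7], "P1": [0.6, 0.8, 0.7],
          "P2": [0.7, 0.7, 0.9], "P3": [0.5, 0.5, 0.5]}
print(pareto_frontier(scores))  # ['P0', 'P1', 'P2']; P3 is dominated by P2
```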
Step 2
🔍 Feedback Set Exposes Failures → Reflection
Randomly sample a batch of cases from the feedback set (D_feedback) and run them through parent P₀. Collect the failure traces and feed them to a Reflection LLM (GPT-5 / Claude Opus):
💭 Reflection LLM analysis:
"P₀ is too conservative during cross-source fusion and frequently under-merges.
Suggest adding an instruction: merge when multiple sources point to the same deliverable."
Step 3
🧬 Targeted Mutation
Based on its analysis, the Reflection LLM rewrites parent P₀ in a targeted way → producing child P₄
P₀ original → targeted rewrite based on the reflection → P₄ new version

Targeted mutation: not a random edit, but a rewrite driven by the specific failure causes.
🔀 Multi-module prompts can also be crossed over (Crossover): combine the best modules from different candidates.
Step 4
🎯 Acceptance
Run child P₄ on the same batch of feedback-set cases and compare it against parent P₀:
P₄ beats P₀ → added to the candidate pool: [P₀, P₁, P₂, P₄]
P₄ is no better than P₀ → discarded; the pool stays unchanged
Step 5
🔄 Return to Step 1 and repeat
The next round re-scores all candidates, recomputes the Pareto frontier, and picks a new parent... The loop stops when any of the following holds:
⏱️ Iteration cap reached (e.g. 15 rounds)
📉 Best score changes < 1% for N consecutive rounds
🚫 Candidate pool no longer grows
🎯 All pass criteria met
📋 Three Datasets, Three Jobs
🔒 Validation set (D_pareto): fixed; scores every prompt. Like a standardized exam paper.
🔀 Feedback set (D_feedback): resampled every round; exposes failures. Like daily homework.
🔐 Test set (Hold-out): fully isolated; used only for the final evaluation. Like the college entrance exam.
💡 One-sentence Summary
Pareto selects the parent → Reflection analyzes the failures → Mutation rewrites with intent → Acceptance keeps or discards → repeat until convergence.
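Put together, one GEPA round is a short loop; a sketch reusing pareto_frontier from the Step 1 sketch, with run_validation, run_feedback, and reflect_and_mutate as hypothetical stubs and the batch size an arbitrary choice:

```python
import random

def gepa_round(pool: list[str], d_pareto: list, d_feedback: list,
               pick_counts: dict[str, int]) -> list[str]:
    """One round of the loop above; mutates and returns the candidate pool."""
    scores = {p: run_validation(p, d_pareto) for p in pool}   # Step 1: score on D_pareto
    frontier = pareto_frontier(scores)                        # Step 1: Pareto frontier
    parent = min(frontier, key=lambda p: pick_counts[p])      # Step 1: favor less-picked parents
    pick_counts[parent] += 1
    batch = random.sample(d_feedback, k=16)                   # Step 2: sample D_feedback (size assumed)
    parent_score, traces = run_feedback(parent, batch)        # Step 2: collect failure traces
    child = reflect_and_mutate(parent, traces)                # Step 3: targeted rewrite
    child_score, _ = run_feedback(child, batch)               # Step 4: same batch, head-to-head
    if child_score > parent_score:                            # Step 4: accept into pool, else discard
        pool.append(child)
        pick_counts[child] = 0
    return pool
```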
On Convergence: Convergence ≠ Global Optimum

Convergence only means that under the current search strategy, further optimization no longer pays off. It does not prove the resulting prompt is the global optimum: a prompt built on a completely different approach could still be better, but GEPA's mutation path never explored it.

Three Convergence Signals
📉 Scores have flatlined

The best prompt's score fluctuates < 1% for N consecutive rounds

Example: rounds 10-13 oscillate between 0.82 and 0.824
🚫 No better children emerge

Every new child is rejected for N consecutive rounds and the candidate pool stops growing

Example: P₁₅, P₁₆, P₁₇ all lose to their parents three rounds in a row
🎯 Good enough already

All pass criteria are met

Example: Hallucination < 2% ✅, Task Accuracy > 90% ✅
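The three signals translate into a stop check; a minimal sketch, assuming per-round best scores and child-acceptance flags are logged, and reading "< 1%" as relative to the current best score:

```python
def should_stop(best_scores: list[float], child_accepted: list[bool],
                criteria_met: bool, n: int = 3) -> bool:
    """True when any convergence signal fires."""
    flat = (len(best_scores) >= n and
            max(best_scores[-n:]) - min(best_scores[-n:])
            < 0.01 * best_scores[-1])                  # scores flatlined (< 1% relative)
    stalled = (len(child_accepted) >= n and
               not any(child_accepted[-n:]))           # no child accepted for n rounds
    return flat or stalled or criteria_met             # or all pass criteria already met
```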
Why the Test Set Is Used for Final Validation (6c)

The point is not to verify that the prompt is optimal, but to verify that the optimized prompt genuinely holds up on cases it has never seen:

✅ Test-set score ≈ validation-set score
No overfitting; the result is trustworthy
→ Ship
⚠️ Test-set score ≪ validation-set score
The prompt has "studied to the test"
→ Adjust the dataset or the strategy
Per-Agent GEPA Configuration
Agent | Seed Prompt | Frozen Context | Eval Metrics | Pareto Dimensions
Rank | Current Rank prompt | Fixed Merge outputs | Priority Alignment, Recall@3, ExecutionPlan Quality & Alignment | Per-case aggregate score across eval metrics
Merge | Current Merge prompt | Fixed Grouping outputs | Pass A + B + C (all 16 metrics) | Per-case aggregate score across eval metrics
Grouping (×6) | Per-source prompts | Raw source signals | Coherence, Completeness, Coverage | Per-case aggregate score across eval metrics

Pareto dimensions = each golden case's aggregate score (not per-metric). This keeps the Pareto frontier meaningful — too many dimensions would make almost every prompt non-dominated.

Top-N Selection from Pareto Frontier

After GEPA converges, the frontier usually has multiple non-dominated prompts. Select top-N by average score across all golden cases:

Prompt | G1 | G2 | G3 | Avg | Rank
P₂ | 0.7 | 0.7 | 0.9 | 0.77 | 🥇
P₅ | 0.8 | 0.6 | 0.8 | 0.73 | 🥈
P₁ | 0.6 | 0.8 | 0.7 | 0.70 | 🥉
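Top-N selection is just an average over golden cases on top of the frontier; a minimal sketch in the same score-table shape as Step 1:

```python
def top_n(frontier_scores: dict[str, list[float]], n: int = 3) -> list[str]:
    """Rank frontier prompts by mean score across golden cases; keep the top n."""
    avg = {p: sum(s) / len(s) for p, s in frontier_scores.items()}
    return sorted(avg, key=avg.get, reverse=True)[:n]

print(top_n({"P2": [0.7, 0.7, 0.9], "P5": [0.8, 0.6, 0.8], "P1": [0.6, 0.8, 0.7]}))
# ['P2', 'P5', 'P1']
```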
Regression Check After Each Agent

After optimizing each agent, run a quick E2E eval to confirm other agents haven't regressed. If regression is significant, re-adapt the affected agent before moving on.

6b-2. Global Combination (Cross-Agent)

Local optima ≠ global optimum. Each agent was optimized with others frozen, but in production all prompts interact. This step finds the best combination.

🧩 Assemble Candidates
Grouping: Top-1 per source (6 sources independent, no cross-combination needed)
Merge: Top-3 from Pareto frontier → [M₁, M₂, M₃]
Rank: Top-3 from Pareto frontier → [R₁, R₂, R₃]
🔄 Combinatorial E2E Eval
3 × 3 = 9 combinations, each evaluated end-to-end on the validation set.
Cost: only 9 E2E eval runs — very manageable.
🏆 Select Global Optimum
The winning combination might be M₂ + R₁ (not M₁ + R₁).
A prompt that's #2 in isolation may pair better with another agent's top pick.
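A sketch of the 3 × 3 sweep; run_e2e_eval is a hypothetical harness returning an aggregate validation-set score:

```python
from itertools import product

def best_combination(merge_top3: list[str], rank_top3: list[str],
                     grouping_top1: dict[str, str]):
    """Try every Merge x Rank pairing end-to-end (3 x 3 = 9 runs);
    Grouping stays fixed at its per-source Top-1 prompts."""
    results = {(m, r): run_e2e_eval(grouping_top1, m, r)  # hypothetical E2E harness
               for m, r in product(merge_top3, rank_top3)}
    return max(results, key=results.get)  # e.g. ('M2', 'R1'), not necessarily ('M1', 'R1')
```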
6c. Final Validation

Run the hold-out test set against the globally optimal prompt combination from 6b-2. Compare against 6a baseline. This is the test set's second and final use within this optimization round.

  • Confirm real improvement (not just validation-set overfitting)
  • If a second GEPA round is needed, supplement ≥ 30% new test cases
6d. Continuous Regression & Calibration
Automated Regression
  • Every prompt change / model version update / pipeline logic change → auto-run eval
  • Integrate into CI/CD — score drops trigger alerts (see the gate sketch below)
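A sketch of such a gate, assuming baseline and current scores are checked in as flat metric-to-score JSON files and a hypothetical one-point tolerance:

```python
import json
import sys
from pathlib import Path

def regression_gate(baseline_path: str, current_path: str, tol: float = 0.01) -> int:
    """Exit non-zero if any metric drops more than tol below its baseline."""
    baseline = json.loads(Path(baseline_path).read_text())
    current = json.loads(Path(current_path).read_text())
    drops = {m: b - current.get(m, 0.0)
             for m, b in baseline.items() if b - current.get(m, 0.0) > tol}
    for metric, delta in sorted(drops.items()):
        print(f"REGRESSION {metric}: -{delta:.3f}")
    return 1 if drops else 0

if __name__ == "__main__":
    sys.exit(regression_gate(sys.argv[1], sys.argv[2]))
```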
LLM Judge Calibration
Phase | Signal | Precision | Volume
Selfhost early | Manual eval scores | High | Small
Selfhost mid-late | User behavior (Dismiss, Pin, Thumbs, CTR) | Medium | Large
Post-GA | Large-scale behavior + periodic human audit | Medium→High | Large
Dataset Maintenance
  • Periodically add new cases (new scenarios, edge cases, data sources)
  • Track N/A rates — supplement coverage when needed
  • If test set reused across multiple optimization rounds → refresh ≥ 30%