Step 1 — Define Goals & Metrics
Commercial Journeys is a multi-agent pipeline. Evaluation operates at two levels: per-agent output (fault isolation) and end-to-end output (holistic quality).
Pipeline stages and metric counts: data sources (including 🌐 Browser History) → Grouping (3 metrics) → Merge (16 metrics, 3 passes) → Rank (10 metrics) → ⚡ Copilot Execution (13 metrics).
N/A Handling
Conditional metrics are marked N/A when their preconditions aren't met; N/A results don't count toward pass/fail. N/A frequency is tracked as n/a_rate per metric (target: < 40%).
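The N/A policy can be made concrete with a small aggregation sketch. The record shape (`MetricResult`, verdict strings) is an assumption for illustration, not the pipeline's real schema:

```python
from dataclasses import dataclass

# Hypothetical per-metric result record; names are illustrative.
@dataclass
class MetricResult:
    metric: str
    verdict: str  # "pass" | "fail" | "n/a"

def summarize(results: list[MetricResult]) -> dict:
    """Exclude N/A from pass/fail; report n/a_rate separately (target < 0.40)."""
    total = len(results)
    na = sum(r.verdict == "n/a" for r in results)
    scored = [r for r in results if r.verdict != "n/a"]
    passed = sum(r.verdict == "pass" for r in scored)
    return {
        "n/a_rate": na / total if total else 0.0,
        # None when every result was N/A: nothing scorable.
        "pass_rate": passed / len(scored) if scored else None,
    }
```

A metric whose n/a_rate exceeds the 40% target is a signal that its precondition fires too rarely to be informative, not that the pipeline is failing.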
Responsibility Boundary
Every failure is tagged with a root cause: cj_pipeline | m365_copilot | ambiguous. Only cj_pipeline failures count toward CJ pass/fail.
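The responsibility boundary reduces to a filter over tagged failures. The tag names come from the spec; the dict record shape is an assumption:

```python
# Root-cause tags from the spec; only cj_pipeline counts against CJ.
FAILURE_TAGS = {"cj_pipeline", "m365_copilot", "ambiguous"}

def cj_failures(failures: list[dict]) -> list[dict]:
    """Keep only failures attributable to the CJ pipeline itself."""
    for f in failures:
        # Guard against untagged or mistyped root causes.
        assert f["root_cause"] in FAILURE_TAGS, f"unknown tag: {f['root_cause']}"
    return [f for f in failures if f["root_cause"] == "cj_pipeline"]
```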
Grouping Agent Output
3 metrics · Does each data source's clustering correctly identify task candidates from raw signals? Runs independently per source — Email, Teams, Calendar, Meeting Recaps, Files, Browser History.
- **Intra-source Grouping Coherence**: Are all items in one cluster truly about the same task?
- **Intra-source Grouping Completeness**: Are all items for the same task in the same cluster?
- **Per-source Coverage**: Did sources with task signals produce candidates?
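Coherence and completeness can be operationalized as pairwise precision and recall against human task labels. This is one possible reference implementation; the production judge is LLM-based, and the item IDs and label encoding here are assumptions:

```python
from itertools import combinations

def pairwise_scores(pred: dict[str, int], gold: dict[str, int]) -> tuple[float, float]:
    """Per-source grouping quality against gold task labels.
    Coherence  ~ pairwise precision: of pairs the system clustered together,
                 the fraction that truly share a task.
    Completeness ~ pairwise recall: of same-task pairs, the fraction the
                 system actually clustered together."""
    items = list(pred)
    same_pred = {(a, b) for a, b in combinations(items, 2) if pred[a] == pred[b]}
    same_gold = {(a, b) for a, b in combinations(items, 2) if gold[a] == gold[b]}
    coherence = len(same_pred & same_gold) / len(same_pred) if same_pred else 1.0
    completeness = len(same_pred & same_gold) / len(same_gold) if same_gold else 1.0
    return coherence, completeness
```

With three items where the system splits one gold task into two clusters, coherence stays perfect while completeness drops, matching the intent of the two metrics.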
Merge Agent Output
16 metrics · 3 passes · Are candidates from different data sources correctly fused into complete, well-defined journeys? The most complex step — its 16 metrics are split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.

Pass A (Fusion quality):
- **Cross-source Fusion (Over-merging)**: Were distinct tasks incorrectly merged?
- **Cross-source Fusion (Under-merging)**: Were same-task candidates left separate?
- **Source Attribution Precision**: Are there any incorrect source links?
- **Source Attribution Recall**: Are all key sources included?
- **Source Role Assignment**: Are roles correct? (Email=trigger, Calendar=deadline, etc.)

Pass B (Task definition):
- **Task Accuracy**: Is this a real task the user needs to do?
- **Task Granularity**: Is the scope right, neither too broad nor too narrow?
- **AI Actionability**: Does this task benefit from AI?
- **Gate Condition Compliance**: Does it pass all 7 gates?
- **Hallucination Detection**: Are there journeys with zero traceable source signals?

Pass C (Filtering & lifecycle):
- **Lifecycle Accuracy**: Is NEW / ONGOING / RECURRING correctly assigned?
- **Incremental Update Fidelity**: Is the original context preserved when updating?
- **Dismissed Topic Suppression**: Do dismissed journeys stay dismissed?
- **Privacy & Sensitivity Filtering**: Are sensitive topics blocked and PII scrubbed?
- **Staleness Rejection**: Are signals older than 3 days filtered out?
- **Noise Rejection**: Are trivial signals suppressed?
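The two fusion metrics can be counted mechanically once candidates carry human task labels. A minimal sketch, assuming candidate IDs mapped to a predicted journey ID and a gold task ID (both illustrative):

```python
from itertools import combinations

def fusion_errors(pred_journeys: dict[str, int], gold_tasks: dict[str, int]) -> tuple[int, int]:
    """Over-merge: a candidate pair fused into one journey despite different
    gold tasks. Under-merge: a same-task pair left in separate journeys."""
    over = under = 0
    for a, b in combinations(pred_journeys, 2):
        fused = pred_journeys[a] == pred_journeys[b]
        same_task = gold_tasks[a] == gold_tasks[b]
        over += fused and not same_task      # distinct tasks merged
        under += same_task and not fused     # same task split apart
    return over, under
```

The qualitative Pass A judgments (attribution, role assignment) still need the LLM judge; this only covers the countable fusion errors.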
Rank Agent Output
10 metrics · Are journeys correctly prioritized with high-quality execution plans?
- **Priority Signal Alignment**: Is the top-ranked journey the highest in urgency/importance?
- **Recall@3**: How many of the human-labeled ideal Top-3 appear in the system's Top-3?
- **Time Sensitivity Accuracy**: Does the urgency score match actual timestamps?
- **Ranking Stability**: Are there erratic rank changes without new signals?
- **Task Category Distribution**: Is the Top-N spread across different action types?
- **Data Source Coverage**: Is the Top-N drawing from multiple sources?
- **Temporal Trigger Diversity**: Are urgency triggers varied?
- **Cross-user Consistency**: Are failures systemic across users or isolated?
- **ExecutionPlan Quality**: Is the plan well-structured, specific, and executable? Score 1-5.
- **ExecutionPlan-Journey Alignment**: Does the plan match the journey's task?
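Recall@3 is the one fully mechanical metric in this group. A minimal sketch, with journey IDs as illustrative placeholders:

```python
def recall_at_k(system_ranked: list[str], ideal: set[str], k: int = 3) -> float:
    """Fraction of the human-labeled ideal Top-k that appear in the
    system's Top-k."""
    if not ideal:
        # No labeled ideal set: vacuously 1.0 here, though the N/A policy
        # above could equally mark this case N/A.
        return 1.0
    top_k = set(system_ranked[:k])
    return len(top_k & ideal) / len(ideal)
```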
Card Presentation Quality
4 metrics
- **Brief Clarity**: Is the brief specific, actionable, and immediately understood?
- **Label Accuracy**: Does the label reflect the actual urgency trigger?
- **Icon Relevance**: Does the icon match the actual task?
- **Promise Accuracy**: Does the card accurately represent the execution output? Score 1-5.
Execution & Output Quality
6 metrics · Root-cause tagging required: cj_pipeline | m365_copilot | ambiguous
- **Execution Completion**: Success vs. fail/timeout rate.
- **Output Correctness**: Are factual claims grounded in user data? Target ≥ 95%.
- **Output Completeness**: Does the output cover all key threads?
- **Effort Reduction**: Does it synthesize beyond raw data? Score 1-5.
- **Actionability**: Are there concrete next steps (≥ 1 action)?
- **Scenario Fit**: Does the output format match the action_type?
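The ≥ 95% correctness threshold aggregates per-claim verdicts. Claim extraction and grounding judgments come from the LLM judge; this sketch (with an assumed boolean-verdict encoding) only applies the threshold:

```python
def correctness_rate(claims: list[bool]) -> tuple[float, bool]:
    """Each entry is one factual claim: True if judged grounded in user data.
    The output passes Output Correctness when >= 95% of claims are grounded."""
    if not claims:
        # No factual claims extracted: nothing to contradict. The N/A policy
        # could also apply here instead of an automatic pass.
        return 1.0, True
    rate = sum(claims) / len(claims)
    return rate, rate >= 0.95
```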
End-to-End Coherence & Task Resolution
3 metrics
- **Full-chain Topic Alignment**: Is the primary topic consistent across all stages?
- **Task Resolution Score**: Can the user advance the task from the output alone? Score 1-5.
- **Recurrence Handling**: Is the correct upcoming instance chosen for RECURRING_WORK?
| Eval Scope | What | Sub-metrics |
|---|---|---|
| Grouping Agent | Per-source clustering quality | 3 |
| Merge Agent — Pass A | Fusion quality | 5 |
| Merge Agent — Pass B | Task definition | 5 |
| Merge Agent — Pass C | Filtering & lifecycle | 6 |
| Rank Agent | Priority, recall, diversity, ExecutionPlan | 10 |
| Card Presentation | Pre-click experience | 4 |
| Execution & Output | Post-click value | 6 |
| E2E Coherence | Full-chain consistency | 3 |
| **Total** | | **42** |
| Manual Eval | Machine Eval Mapping | Coverage |
|---|---|---|
| C.1 Task Accuracy | Merge Pass B: Task Accuracy | ♻️ Both |
| C.2 Task Relevance | Merge Pass B: Gate Compliance | ♻️ Both |
| C.3 AI Actionability | Merge Pass B: AI Actionability | ♻️ Both |
| C.4 Privacy | Merge Pass C: Privacy Filtering | ♻️ Both |
| C.5-C.8 Card Display | Card: Clarity, Label, Icon | ♻️ Both |
| X.1 Top-3 Ranking | Rank: Priority + Recall@3 | ♻️ Both |
| X.2 Missing Journeys | Rank: Recall@3 + Grouping | ♻️ Both |
| X.3 Shouldn't Be There | Merge Pass B: Gate Compliance | ♻️ Both |
| O.1-O.5 Output Quality | Execution & Output (all) | ♻️ Both |
| U.1-U.4 UX | — | Manual only |
| G.1-G.2 Overall | — | Manual only |
| Consumer Dimension | Commercial Mapping | Status |
|---|---|---|
| L1 Relevance | Merge Pass B: Task Accuracy + Hallucination | ✅ |
| L1 Groundedness | Merge Pass A: Source Attribution | ✅ |
| L1 Helpfulness | Merge Pass B: AI Actionability + Gates | ✅ |
| L1 Safety & Privacy | Merge Pass C: Privacy Filtering | ✅ |
| L1 Technical Feasibility | Card: Promise Accuracy | ✅ |
| L2 Clarity | Card: Brief Clarity | ✅ |
| L2 Promise Accuracy | Card: Promise Accuracy | ✅ |
| L3 Correctness | Execution: Output Correctness | ✅ |
| L3 Completeness | Execution: Output Completeness | ✅ |
| L3 Effort Reduction | Execution: Effort Reduction | ✅ |
| L3 Actionability | Execution: Actionability | ✅ |
| L3 Scenario Fit | Execution: Scenario Fit | ✅ |
| L4 Self-Sufficiency | E2E: Task Resolution Score | ✅ |
| L4 E2E Coherence | E2E: Full-chain Topic Alignment | ✅ |
| L4 Engagement / Trust | Online-only | ⚪ N/A |