Step 1 — Define Goals & Metrics
Commercial Journeys is a multi-agent pipeline. Evaluation operates at two levels: per-agent output (fault isolation) and end-to-end output (holistic quality).
[Pipeline overview: raw signals (Email, Teams, Calendar, Meeting Recaps, Files, Browser History) → Grouping Agent (3 metrics) → Merge Agent (16 metrics, 3 passes) → Rank Agent (10 metrics) + Copilot Execution]
N/A Handling: Conditional metrics (e.g., Recurrence Handling, Incremental Update, Dismissed Suppression) are marked N/A when preconditions are not met. N/A judgments do not count toward pass/fail. Tracked as n/a_rate per metric — target: N/A rate < 40%.
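A minimal sketch of this aggregation, assuming per-metric judgments arrive as `pass` / `fail` / `n/a` strings (an illustrative schema, not the pipeline's actual one):

```python
from collections import Counter

NA_RATE_TARGET = 0.40  # target from this section: N/A rate < 40% per metric

def score_metric(judgments: list[str]) -> dict:
    """Aggregate one conditional metric's judgments.

    N/A judgments are excluded from the pass-rate denominator but
    tracked separately as n/a_rate, per the N/A-handling policy above.
    """
    counts = Counter(judgments)
    applicable = counts["pass"] + counts["fail"]
    total = applicable + counts["n/a"]
    na_rate = counts["n/a"] / total if total else 0.0
    return {
        "pass_rate": counts["pass"] / applicable if applicable else None,
        "n/a_rate": na_rate,
        "n/a_rate_ok": na_rate < NA_RATE_TARGET,
    }

# Example: a Recurrence Handling run where many journeys are non-recurring
print(score_metric(["pass", "n/a", "pass", "fail", "n/a"]))
# pass_rate ≈ 0.667, n/a_rate 0.4 (at the threshold, so flagged)
```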
End-to-End Ownership: We evaluate the entire feature end-to-end — from signal ingestion through Copilot execution output. Even though the final execution is handled by M365 Copilot Chat, the overall experience is what matters. If the feature delivers poor results for any reason (bad prompt design, poor execution quality, or both), it reflects on the feature as a whole. Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is used for debugging and prioritization, not for excluding failures from the overall verdict.
Grouping Agent Output
3 metrics. Does each data source's clustering correctly identify task candidates from raw signals? Runs independently per source: Email, Teams, Calendar, Meeting Recaps, Files, Browser History.
| Sub-metric | What to measure | How |
|---|---|---|
| Intra-source Grouping Coherence | Within a single data source, are all items in one cluster truly about the same task? | AI reviews each cluster's item set → judges whether semantically unified. E.g., two unrelated email threads lumped into one candidate = failure |
| Intra-source Grouping Completeness | Within a single data source, are all items related to the same task assigned to the same cluster? | AI scans unclustered items → judges whether they should belong to an existing cluster. E.g., a key email about "Q3 Budget" left as orphan while other Q3 Budget emails are clustered = failure |
| Per-source Coverage | Did the Grouping Agent produce candidates from data sources that clearly contain task signals? | Compare active data sources vs. data sources that produced zero candidates → flag sources with obvious signals that were entirely missed |
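A sketch of the Per-source Coverage check, assuming we know which candidates each source produced and which sources were labeled (by an upstream detector or a human) as containing signals; all names here are illustrative:

```python
ACTIVE_SOURCES = {"email", "teams", "calendar", "meeting_recaps",
                  "files", "browser_history"}

def flag_missed_sources(candidates_by_source: dict[str, list],
                        sources_with_signals: set[str]) -> set[str]:
    """Return active sources that clearly contained task signals
    but produced zero candidates (the 'Missed Source Entirely' case)."""
    produced = {src for src, cands in candidates_by_source.items() if cands}
    return (sources_with_signals & ACTIVE_SOURCES) - produced

# Example: Calendar had obvious deadline signals but yielded no candidates
missed = flag_missed_sources(
    {"email": ["c1", "c2"], "calendar": []},
    sources_with_signals={"email", "calendar"},
)
assert missed == {"calendar"}
```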
Merge Agent Output
16 metrics · 3 passes. Are candidates from different data sources correctly fused into complete, well-defined journeys? This is the most complex step: the 16 metrics are split into 3 focused evaluation passes (Pass A: fusion & attribution; Pass B: task quality & gating; Pass C: lifecycle & filtering) to reduce LLM-as-Judge cognitive load.
Pass A: Cross-source Fusion & Attribution

| Sub-metric | What to measure | How | Boundary definition |
|---|---|---|---|
| Cross-source Fusion Accuracy (Over-merging) | Were candidates from different sources that describe distinct tasks incorrectly merged? | AI reviews merged journey vs. contributing candidates → flags cases where semantically different tasks were collapsed | Two tasks are "distinct" when they have different owners, different deadlines, or produce different deliverables — even if they share a topic umbrella |
| Cross-source Fusion Accuracy (Under-merging) | Were candidates from different sources that describe the same task left as separate journeys? | AI clusters all output journeys by semantic similarity → flags near-duplicates from different sources | Two journeys are "same task" when the user would describe them as one item on their to-do list |
| Source Attribution Precision | Are the linked source items correct — no wrong links? | Check each attributed source → precision = correct sources / total attributed sources | A source is "correct" when its content is directly relevant to the journey's task definition. Tangentially related items = incorrect |
| Source Attribution Recall | Are the most important source signals included? | Check whether high-signal items are attributed → recall = included key sources / total key sources | Key source = (a) primary task trigger, (b) any source with deadline info, (c) any source from task owner's direct manager or above |
| Source Role Assignment | Are data sources assigned correct roles? (Email=trigger, Calendar=deadline, Browser=execution evidence, Files=context) | Extract source roles → AI validates role logic against specialized prompt definitions | N/A when journey has only 1 contributing source |
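The two attribution formulas above reduce to straightforward set arithmetic; a sketch with illustrative source IDs:

```python
def attribution_precision(attributed: set[str], correct: set[str]) -> float:
    """Precision = correct attributed sources / total attributed sources."""
    return len(attributed & correct) / len(attributed) if attributed else 1.0

def attribution_recall(attributed: set[str], key_sources: set[str]) -> float:
    """Recall = attributed key sources / total key sources, where a key
    source is the trigger, a deadline carrier, or a manager-chain item."""
    return len(attributed & key_sources) / len(key_sources) if key_sources else 1.0

# One tangential link drags precision to 2/3; one missed deadline email
# drags recall to 1/2.
attributed = {"email:123", "file:budget.xlsx", "teams:chat9"}
correct = {"email:123", "file:budget.xlsx"}
key = {"email:123", "email:deadline-456"}
print(attribution_precision(attributed, correct))  # ≈ 0.667
print(attribution_recall(attributed, key))         # 0.5
```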
Pass B: Task Quality & Gating

| Sub-metric | What to measure | How | Boundary definition |
|---|---|---|---|
| Task Accuracy | Is this a real task the user actually needs to complete? | AI evaluates: given the contributing source signals, is the generated task something the user would recognize as "yes, I need to do this"? | "Real task" = user was explicitly assigned, committed, or is the clear owner. "Not real" = AI inferred a task from peripheral signals (e.g., CC'd on FYI email → AI assumes user must act) |
| Task Granularity | Is the task scoped to a specific, actionable item? | AI evaluates scope. Score 1-5. | Score 5 = single clear deliverable. Score 3 = reasonable but could be more specific. Score 1 = "do your job" broad or trivially narrow |
| AI Actionability | Does this task genuinely benefit from AI help? | AI classifies: (a) Too simple — user can do faster without AI. (b) AI-appropriate. (c) Beyond AI capability. Flag (a) and (c) | "Too simple" = single click, one-line reply. "Beyond AI" = requires physical presence, confidential judgment, or inaccessible systems |
| Gate Condition Compliance | Does each journey pass the 7 gating conditions? | Programmatically check: IS_CONCRETE_TASK, IS_WORK_RELATED, HAS_USER_RELEVANCE, IS_NOT_COMPLETED, HAS_MEANINGFUL_COMPLEXITY, REQUIRES_BROWSER, IS_AI_EXECUTABLE | IS_NOT_COMPLETED: if user sent "done"/"completed"/"attached" AND no subsequent follow-up → completed. Ambiguous = not completed (benefit of doubt) |
| Hallucination Detection | Does the journey have zero traceable source signals — entirely fabricated? | Cross-reference journey description against all contributing candidates and raw signals → flag journeys with no traceable source | "Hallucination" = core task description has NO supporting evidence in ANY input source. "Reasonable inference" from strong multi-source signals ≠ hallucination |
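Gate Condition Compliance is the one fully programmatic check in this pass; a sketch assuming each journey carries boolean gate flags (the `gates` field name is hypothetical):

```python
# The 7 gating conditions named in the table above.
GATES = (
    "IS_CONCRETE_TASK", "IS_WORK_RELATED", "HAS_USER_RELEVANCE",
    "IS_NOT_COMPLETED", "HAS_MEANINGFUL_COMPLEXITY",
    "REQUIRES_BROWSER", "IS_AI_EXECUTABLE",
)

def failed_gates(journey: dict) -> list[str]:
    """Programmatic gate check: every gate must be explicitly True.
    A missing flag is surfaced for review rather than silently passed."""
    gates = journey.get("gates", {})
    return [g for g in GATES if gates.get(g) is not True]

journey = {"gates": {g: True for g in GATES} | {"REQUIRES_BROWSER": False}}
print(failed_gates(journey))  # ['REQUIRES_BROWSER']
```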
Pass C: Lifecycle & Filtering

| Sub-metric | What to measure | How | Boundary definition |
|---|---|---|---|
| Lifecycle Accuracy | Is the lifecycle state (NEW_WORK / ONGOING_WORK / RECURRING_WORK) correctly assigned? | For ONGOING: verify match to existing journey + continued activity. For NEW: verify no existing journey covers this task | N/A for first-run evaluations where no journey history exists |
| Incremental Update Fidelity | When updating an existing journey, is the original context preserved while new info is correctly incorporated? | Compare updated journey against previous version + new signals → flag lost context or incorrect overwrites | N/A when journey is NEW_WORK with no prior version |
| Dismissed Topic Suppression | Are manually dismissed journeys correctly excluded from regeneration? | Cross-reference output against dismissed titles list → flag resurfaced dismissed topics | N/A when no dismiss history exists |
| Privacy & Sensitivity Filtering | Are sensitive topics correctly blocked? Any PII leakage? | PII scan (regex + NER) across all journey titles, summaries, and source references + personal-vs-work classification | Any PII (email addresses in title, SSN, credit card) = ship-blocker. Personal-but-not-PII (e.g., "dentist appointment") = High |
| Staleness Rejection | Are journeys based on signals older than the 3-day freshness window filtered out? | Check source signal timestamps → flag journeys based entirely on expired signals | "Entirely expired" = ALL contributing sources >3 days old AND no new signals merged in |
| Noise Rejection | Are low-value / trivial signals correctly suppressed? | Classify signal value → flag low-value signals that produced journeys | "Noise" = auto-generated system emails, all-day calendar placeholders, newsletters. "Not noise" = routine meetings if user is organizer or has action items |
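The Staleness Rejection rule is fully mechanical; a sketch of the 3-day window check, assuming per-source timestamps are available:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(days=3)  # the 3-day window from the table above

def is_stale(source_timestamps: list[datetime],
             now: datetime | None = None) -> bool:
    """A journey is stale only when ALL contributing sources fall outside
    the freshness window, i.e., no new signals were merged in."""
    now = now or datetime.now(timezone.utc)
    return all(now - ts > FRESHNESS_WINDOW for ts in source_timestamps)

now = datetime(2025, 6, 10, tzinfo=timezone.utc)
old = datetime(2025, 6, 1, tzinfo=timezone.utc)
fresh = datetime(2025, 6, 9, tzinfo=timezone.utc)
assert is_stale([old, old], now)        # entirely expired -> flag
assert not is_stale([old, fresh], now)  # one fresh signal keeps it alive
```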
Rank Agent Output
10 metrics. Are journeys correctly prioritized, and do the top-N come with high-quality execution plans? Uses dual scoring: rule-based (40%) + model scoring (60%).
| Sub-metric | What to measure | How |
|---|---|---|
| Priority Signal Alignment | Do top-ranked journeys genuinely have the highest urgency/importance? | Extract rule-based + model scores → verify top-N score highest on timeliness, stakeholder importance, ownership strength |
| Recall@3 (Top-3 Hit Rate) | Among all active journeys, how many of the human-labeled "ideal Top-3" are captured by the system's Top-3? | Golden set includes human-annotated ideal Top-3. Recall@3 = (# human Top-3 in system Top-3) / 3 |
| Time Sensitivity Accuracy | For time-driven journeys, is the urgency score correctly calculated? | Compare urgency score against actual calendar/email timestamps |
| Ranking Stability | Do rankings change erratically between sessions without new signals? | Compare logs across sessions → flag unexplained rank changes |
| Task Category Distribution | Are the top-N spread across different task types or all clustered in one? | Classify by action_type → compute entropy / top-1 concentration ratio |
| Data Source Coverage | Are the top-N drawing from multiple M365 sources or over-reliant on one? | Tag contributing sources per journey → measure source distribution |
| Temporal Trigger Diversity | Are urgency triggers varied (deadline-driven, people-driven, recurrence-driven)? | Extract label types → check distribution across top-N |
| Cross-user Consistency | Is poor diversity or ranking quality systemic or isolated to individual users? | Aggregate per-user scores → flag if >50% of users show the same pattern |
| ExecutionPlan Quality | Is the generated ExecutionPlan well-structured, specific, and executable by Copilot? | AI evaluates: task clarity, context inclusion, output format, scope. Score 1-5 |
| ExecutionPlan-Journey Alignment | Does the ExecutionPlan match the journey's task definition? | Compare plan objective against journey title/summary → flag mismatches |
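A sketch of the three mechanical pieces of this step: the 40/60 dual score, Recall@3, and the entropy used for Task Category Distribution (both score inputs are assumed normalized to 0-1):

```python
from collections import Counter
from math import log2

RULE_WEIGHT, MODEL_WEIGHT = 0.40, 0.60  # dual-scoring split stated above

def priority_score(rule_score: float, model_score: float) -> float:
    """Combined ranking score; both inputs assumed normalized to 0-1."""
    return RULE_WEIGHT * rule_score + MODEL_WEIGHT * model_score

def recall_at_3(system_top3: list[str], human_top3: list[str]) -> float:
    """Recall@3 = (# human-labeled ideal Top-3 in system Top-3) / 3."""
    return len(set(system_top3[:3]) & set(human_top3[:3])) / 3

def category_entropy(action_types: list[str]) -> float:
    """Task Category Distribution: Shannon entropy of top-N action types.
    0.0 = monoculture; log2(N) = perfectly even spread."""
    n = len(action_types)
    return 0.0 - sum(c / n * log2(c / n) for c in Counter(action_types).values())

print(priority_score(0.8, 0.5))                             # ≈ 0.62
print(recall_at_3(["j1", "j4", "j2"], ["j1", "j2", "j3"]))  # ≈ 0.667
print(category_entropy(["REPLY_EMAIL"] * 3))                # 0.0 (monoculture)
```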
Card Presentation Quality
4 metrics. Does the card earn the click, and does it set honest expectations for what's behind it?
| Sub-metric | What to measure | How |
|---|---|---|
| Brief Clarity | Is the task brief (title + summary) specific, actionable, and immediately understood? | AI evaluates: specificity, actionability, grammar, length. Binary: "Can you predict what happens if you click?" |
| Label Accuracy | Does the label (type + name) correctly reflect the actual urgency trigger? | Cross-reference label against source signals. E.g., "Asked by VP Chen" — verify VP Chen actually made the request |
| Icon Relevance | Does the action_type icon match the actual task? | Map action_type to journey content → flag mismatches |
| Promise Accuracy | Does the card accurately represent what the Copilot execution will deliver? | Compare card title + summary vs. actual execution output → score promise-delivery alignment 1-5. Score ≤ 2 = critical failure |
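A sketch of how per-card judge outputs could map onto the failure modes cataloged below; the input dict is an illustrative schema, not the pipeline's actual one:

```python
def card_failures(card: dict) -> list[tuple[str, str]]:
    """Map per-card judge outputs to cataloged failure modes."""
    flags = []
    # Promise Accuracy score <= 2 is defined above as a critical failure.
    if card.get("promise_accuracy", 5) <= 2:
        flags.append(("Promise-Delivery Gap", "Critical"))
    # Brief Clarity is binary: can the user predict the click result?
    if card.get("brief_clear") is False:
        flags.append(("Vague Card", "Medium"))
    return flags

print(card_failures({"promise_accuracy": 2, "brief_clear": True}))
# [('Promise-Delivery Gap', 'Critical')]
```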
Execution & Output Quality
6 metrics. After the user clicks "Open in Copilot," did the response actually deliver value? Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is applied for debugging; all failures still count toward the overall feature verdict.
| Sub-metric | What to measure | How | Root-cause notes |
|---|---|---|---|
| Execution Completion Rate | What % of executions completed successfully vs. failed/timed out? | Parse execution traces → compute success rate | Timeout/crash = likely m365_copilot. Malformed prompt = cj_pipeline |
| Output Correctness | Are factual claims grounded in the user's actual data? | For each claim, verify against source signals. Claim precision = supported / total. Target ≥ 95% | Wrong facts from bad context in prompt = cj_pipeline. Correct context but Copilot hallucinated = m365_copilot |
| Output Completeness | Does the output cover all key threads from the journey's source signals? | Compare entities/topics in output vs. key entities in source signals | CJ prompt omitted key context = cj_pipeline. Context provided but Copilot ignored = m365_copilot |
| Effort Reduction | Does the output synthesize and organize beyond raw data? | AI rating 1-5. Score 1 = "just restated what I already know." Score 5 = "saves me 20+ min" | Typically m365_copilot |
| Actionability | Are there concrete next steps the user can immediately act on? | Count actionable elements: draft ready to send? Link to right doc? Binary: ≥ 1 concrete action? | Prompt didn't request actionable format = cj_pipeline. Requested but not delivered = m365_copilot |
| Scenario Fit | Is the output format appropriate for the action_type? | Map action_type → expected format (REPLY_EMAIL → email draft, PREPARE_MEETING → agenda). Flag mismatches | CJ prompt specified wrong format = cj_pipeline. Correct request, wrong delivery = m365_copilot |
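Claim precision and the root-cause split are both mechanical once the grounding judgments exist; a sketch with an illustrative claim/failure schema:

```python
from typing import Literal

RootCause = Literal["cj_pipeline", "m365_copilot", "ambiguous"]

def claim_precision(claims: list[dict]) -> float:
    """Output Correctness: supported claims / total claims (target >= 0.95).
    Each claim is assumed to carry a boolean `supported` field set by the
    grounding check against the user's source signals."""
    return sum(c["supported"] for c in claims) / len(claims) if claims else 1.0

def tag_root_cause(failure: dict) -> RootCause:
    """Heuristic split from the table above, applied to failures only:
    missing/bad context in the CJ prompt -> cj_pipeline; context was
    provided but Copilot hallucinated or ignored it -> m365_copilot."""
    if not failure.get("prompt_had_context", True):
        return "cj_pipeline"
    if failure.get("copilot_ignored_context"):
        return "m365_copilot"
    return "ambiguous"

claims = [{"supported": True}, {"supported": True}, {"supported": False}]
print(claim_precision(claims))  # ≈ 0.667, well below the 0.95 target
```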
End-to-End Coherence & Task Resolution
3 metrics. From raw signals to final output, does the entire chain hold together?
| Sub-metric | What to measure | How |
|---|---|---|
| Full-chain Topic Alignment | Does the primary topic stay consistent across all pipeline stages? | Extract primary topic at each stage → flag any stage that introduces a different topic or contradicts another |
| Task Resolution Score | Could the user meaningfully advance their task from this output alone? | Expert/AI rating 1-5. Score 1 = "useless, would start over." Score 5 = "could complete the task right here." The single offline metric most predictive of real-world satisfaction |
| Recurrence Handling | For RECURRING_WORK journeys, is the system recommending the correct upcoming instance? | Check recurrence metadata → validate instance timing. N/A for NEW_WORK and ONGOING_WORK |
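The Recurrence Handling check reduces to picking the earliest instance at or after the evaluation time; a sketch, assuming recurrence metadata expands to concrete instance timestamps:

```python
from datetime import datetime, timezone

def expected_instance(instances: list[datetime],
                      now: datetime) -> datetime | None:
    """The correct instance to surface is assumed to be the earliest
    occurrence at or after `now`; surfacing a past one is the
    'Wrong Recurrence Instance' failure cataloged below."""
    upcoming = sorted(t for t in instances if t >= now)
    return upcoming[0] if upcoming else None

now = datetime(2025, 6, 10, tzinfo=timezone.utc)
weekly = [datetime(2025, 6, d, tzinfo=timezone.utc) for d in (3, 10, 17)]
assert expected_instance(weekly, now) == datetime(2025, 6, 10,
                                                  tzinfo=timezone.utc)
```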
Failure Severity & Verdicts

Failure severities are flagged independently of dimension scores. Any Ship-Blocker instance, or a Critical failure rate at or above the 2% threshold, triggers an overall Fail verdict regardless of other scores.
| Severity | Definition | Action |
|---|---|---|
| Ship-Blocker | Trust-destroying or compliance-violating. Zero tolerance — even one instance blocks release. | Blocks ship. Zero-tolerance threshold. |
| Critical | Fundamental value proposition broken. User sees wrong, fabricated, or misleading content. | Must fix before GA. Threshold: < 2% of evaluated journeys. |
| High | Meaningful quality gap. User gets some value but with significant friction. | Must be triaged. Acceptable only with documented mitigation plan. |
| Medium | Noticeable imperfection. User still gets value but experience is suboptimal. | Should fix. Can ship with known issues tracked. |
Failure Mode Catalog

| Failure Mode | Agent / Stage | Description | Severity |
|---|---|---|---|
| Privacy Violation | Merge | Sensitive / personal content surfaces in journey | Ship-Blocker |
| PII Leakage | Merge | PII exposed in journey title, summary, or sources | Ship-Blocker |
| Resurfaced Dismissed Topic | Merge | A manually dismissed journey topic reappears | Ship-Blocker |
| Hallucinated Journey | Merge | Journey has no traceable source signal — entirely fabricated | Critical |
| Wrong Source Attribution | Merge | Journey links to incorrect M365 items | Critical |
| Phantom Task | Merge | Signals are real but the inferred task is not | Critical |
| Completed Task Surfaced | Merge | Journey recommends a task the user already finished | Critical |
| Promise-Delivery Gap | Output | Card promises something the execution output doesn't deliver | Critical |
| Over-Promise | Output | Card implies capability the Copilot execution can't support | Critical |
| Output Hallucination | Output | Factual claims not supported by any source signal | Critical |
| Topic Drift | Output | Primary topic changes between pipeline stages | High |
| Missed Source Entirely | Grouping | A data source with obvious task signals produced zero candidates | High |
| Mis-grouped Items | Grouping | Unrelated items from the same source lumped into one candidate | High |
| Orphaned Key Item | Grouping | A high-importance item not assigned to any candidate | High |
| Over-merged Tasks | Merge | Distinct tasks from different sources collapsed into one journey | High |
| Wrong Lifecycle | Merge | NEW_WORK tagged as ONGOING_WORK or vice versa | High |
| Scope Mismatch | Merge | Task is too broad or too narrow | High |
| AI Misfit | Merge | Task is too simple or beyond AI capability | High |
| Noise Leak | Merge | Low-value / trivial signals produced a journey | High |
| Stale Journey | Merge | Journey based on signals older than 3-day window | High |
| Priority Inversion | Rank | A clearly more urgent journey ranked below a less important one | High |
| Bad ExecutionPlan | Rank | Plan is vague, references wrong sources, or describes a different task | High |
| Missing Key Context | Output | Output omits a major thread the user clearly cares about | High |
| Wrong Help Type | Output | Correct topic but wrong output format | High |
| No Actionability | Output | Correct information but no concrete next step | High |
| Wrong Recurrence Instance | Output | Recurring task surfaces a past instance instead of upcoming one | High |
| Under-merged Duplicates | Merge | Same task from different sources left as separate journeys | Medium |
| Monoculture Recommendations | Rank | All top-N journeys are the same action_type or trigger type | Medium |
| Vague Card | Output | Title/summary too generic — user can't predict what they'll get | Medium |
| Redundant Output | Output | Output just restates source data — no synthesis | Medium |
Overall Verdict Rules

| Condition | Overall Verdict |
|---|---|
| Any Ship-Blocker failure | Fail — blocks ship regardless of scores |
| Any Critical failure rate ≥ 2% | Fail — must fix before GA |
| Critical < 2% AND ≥ 2 High failures | Conditional — requires mitigation plan |
| Critical < 2% AND 0-1 High AND all pass criteria met | Pass — ship-ready |
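The verdict table encodes directly; a minimal sketch (the fallback when pass criteria are unmet but no hard failure exists is an assumption, marked in the comment):

```python
def overall_verdict(ship_blockers: int, critical_rate: float,
                    high_failures: int, pass_criteria_met: bool) -> str:
    """Direct encoding of the verdict table above."""
    if ship_blockers > 0:
        return "Fail"         # blocks ship regardless of scores
    if critical_rate >= 0.02:
        return "Fail"         # must fix before GA
    if high_failures >= 2:
        return "Conditional"  # requires documented mitigation plan
    if pass_criteria_met:
        return "Pass"         # ship-ready
    return "Conditional"      # ASSUMPTION: unmet criteria w/o hard failure

print(overall_verdict(0, 0.01, 1, True))  # Pass
```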
Manual ↔ Machine Eval Coverage

| Manual Eval | Machine Eval Mapping | Coverage |
|---|---|---|
| C.1 Task Accuracy | Merge Pass B: Task Accuracy | Both |
| C.2 Task Relevance | Merge Pass B: Gate Compliance | Both |
| C.3 AI Actionability | Merge Pass B: AI Actionability | Both |
| C.4 Privacy | Merge Pass C: Privacy Filtering | Both |
| C.5-C.8 Card Display | Card: Clarity, Label, Icon | Both |
| X.1 Top-3 Ranking | Rank: Priority + Recall@3 | Both |
| X.2 Missing Journeys | Rank: Recall@3 + Grouping | Both |
| X.3 Shouldn't Be There | Merge Pass B: Gate Compliance | Both |
| O.1-O.5 Output Quality | Execution & Output (all) | Both |
| U.1-U.4 UX | — | Manual only |
| G.1-G.2 Overall | — | Manual only |