Step 1 — Define Goals & Metrics
Commercial Journeys is a multi-agent pipeline: M365 Data + Browser → Grouping Agent → Merging Agent → Ranking Agent → M365 Copilot Chat. We evaluate 40 metrics across 2 levels: per-agent output (fault isolation for each sub-agent) and final presentation (card quality + execution delivery + E2E coherence).
M365 Data + Browser History
→ Per-source grouped task information (Grouping Agent)
→ Cross-source merged journey candidates (Merging Agent)
→ (1) Final journeys with priority ranking + (2) Optimized prompt with ExecutionPlan per journey (Ranking Agent)
→ Final execution output from M365 Copilot Chat (email drafts, meeting prep, summaries, etc.)
N/A Handling: Conditional metrics (e.g., Recurrence Handling, Incremental Update, Dismissed Suppression) are marked N/A when preconditions are not met. N/A judgments do not count toward pass/fail. Tracked as n/a_rate per metric — target: N/A rate < 40%.
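The N/A handling above can be sketched as a small scoring helper. This is an illustrative shape, not the production scorer; the `metric_summary` name and the `None`-means-N/A encoding are assumptions.

```python
from typing import Optional

def metric_summary(judgments: list[Optional[bool]]) -> dict:
    """Summarize one metric's judgments, where None marks an N/A case.

    N/A judgments are excluded from pass/fail but tracked as n/a_rate.
    """
    n_total = len(judgments)
    applicable = [j for j in judgments if j is not None]
    na_rate = (n_total - len(applicable)) / n_total if n_total else 0.0
    pass_rate = sum(applicable) / len(applicable) if applicable else None
    return {
        "pass_rate": pass_rate,
        "na_rate": na_rate,
        "na_rate_ok": na_rate < 0.40,  # target: N/A rate < 40%
    }
```

The key property: an all-N/A metric yields `pass_rate=None` rather than a misleading 0% or 100%, so conditional metrics never skew the verdict.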
End-to-End Ownership: We evaluate the entire feature end-to-end — from signal ingestion through Copilot execution output. Even though the final execution is handled by M365 Copilot Chat, the overall experience is what matters. If the feature delivers poor results for any reason (bad prompt design, poor execution quality, or both), it reflects on the feature as a whole. Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is used for debugging and prioritization, not for excluding failures from the overall verdict.
Grouping Agent Output
3 metrics · Output: By-source grouped tasks. Each data source independently produces task candidates from raw signals. Sources: Email, Teams, Calendar, Meeting Recaps, Files, Browser History.
| Sub-metric | What to measure | How |
|---|---|---|
| Intra-source Grouping Coherence | Within a single data source, are all items in one cluster truly about the same task? | AI reviews each cluster's item set → judges whether semantically unified. E.g., two unrelated email threads lumped into one candidate = failure |
| Intra-source Grouping Completeness | Within a single data source, are all items related to the same task assigned to the same cluster? | AI scans unclustered items → judges whether they should belong to an existing cluster. E.g., a key email about "Q3 Budget" left as orphan while other Q3 Budget emails are clustered = failure |
| Per-source Coverage | Did the Grouping Agent produce candidates from data sources that clearly contain task signals? | Compare active data sources vs. data sources that produced zero candidates → flag sources with obvious signals that were entirely missed |
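The Per-source Coverage check above is essentially a set comparison. A minimal sketch, assuming the "obvious task signals" judgment comes in as a per-source flag (e.g. from an AI pre-pass); the function name and input shapes are illustrative.

```python
def flag_missed_sources(active_sources: set[str],
                        candidates_by_source: dict[str, int],
                        has_obvious_signals: dict[str, bool]) -> list[str]:
    """Flag active data sources that clearly contain task signals
    yet produced zero candidates from the Grouping Agent."""
    flagged = []
    for source in active_sources:
        produced = candidates_by_source.get(source, 0)
        if produced == 0 and has_obvious_signals.get(source, False):
            flagged.append(source)
    return sorted(flagged)
```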
Merging Agent Output
17 metrics · 3 passes · Output: Cross-source merged journey candidates. Fuses task candidates from different data sources into complete, well-defined journeys. The most complex step — 17 metrics split into 3 focused evaluation passes to reduce LLM-as-Judge cognitive load.
| Sub-metric | What to measure | How | Boundary definition |
|---|---|---|---|
| Cross-source Fusion Accuracy (Over-merging) | Were candidates from different sources that describe distinct tasks incorrectly merged? | AI reviews merged journey vs. contributing candidates → flags cases where semantically different tasks were collapsed | Two tasks are "distinct" when they have different owners, different deadlines, or produce different deliverables — even if they share a topic umbrella |
| Cross-source Fusion Accuracy (Under-merging) | Were candidates from different sources that describe the same task left as separate journeys? | AI clusters all output journeys by semantic similarity → flags near-duplicates from different sources | Two journeys are "same task" when the user would describe them as one item on their to-do list |
| Source Attribution Precision | Are the linked source items correct — no wrong links? | Check each attributed source → precision = correct sources / total attributed sources | A source is "correct" when its content is directly relevant to the journey's task definition. Tangentially related items = incorrect |
| Source Attribution Recall | Are the most important source signals included? | Check whether high-signal items are attributed → recall = included key sources / total key sources | Key source = (a) primary task trigger, (b) any source with deadline info, (c) any source from task owner's direct manager or above |
| Source Role Assignment | Are data sources assigned correct roles? (Email=trigger, Calendar=deadline, Browser=execution evidence, Files=context) | Extract source roles → AI validates role logic against specialized prompt definitions | N/A when journey has only 1 contributing source |
| Data Source Coverage | Are journey candidates drawing from multiple M365 sources, or over-reliant on a single source? | Tag contributing sources per journey → measure source distribution across all qualified journeys | Flag if all journeys originate from a single source type when multiple sources contain task signals |
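Source Attribution Precision and Recall reduce to standard set arithmetic once the AI judge has labeled each source. A sketch under that assumption; `correct` (judge-approved sources) and `key_sources` (per the boundary definition above) are inputs this snippet takes as given.

```python
def attribution_scores(attributed: set[str],
                       correct: set[str],
                       key_sources: set[str]) -> tuple[float, float]:
    """precision = correct attributed sources / total attributed sources
    recall    = key sources included / total key sources"""
    precision = len(attributed & correct) / len(attributed) if attributed else 1.0
    recall = len(attributed & key_sources) / len(key_sources) if key_sources else 1.0
    return precision, recall
```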
| Sub-metric | What to measure | How | Boundary definition |
|---|---|---|---|
| Task Accuracy | Is this a real task the user actually needs to complete? | AI evaluates: given the contributing source signals, is the generated task something the user would recognize as "yes, I need to do this"? | "Real task" = user was explicitly assigned, committed, or is the clear owner. "Not real" = AI inferred a task from peripheral signals (e.g., CC'd on FYI email → AI assumes user must act) |
| Task Granularity | Is the task scoped to a specific, actionable item? | AI evaluates scope. Score 1-5. | Score 5 = single clear deliverable. Score 3 = reasonable but could be more specific. Score 1 = "do your job" broad or trivially narrow |
| AI Actionability | Does this task genuinely benefit from AI help? | AI classifies: (a) Too simple — user can do faster without AI. (b) AI-appropriate. (c) Beyond AI capability. Flag (a) and (c) | "Too simple" = single click, one-line reply. "Beyond AI" = requires physical presence, confidential judgment, or inaccessible systems |
| Gate Condition Compliance | Does each journey pass the 7 gating conditions? | Programmatically check: IS_CONCRETE_TASK, IS_WORK_RELATED, HAS_USER_RELEVANCE, IS_NOT_COMPLETED, HAS_MEANINGFUL_COMPLEXITY, REQUIRES_BROWSER, IS_AI_EXECUTABLE | IS_NOT_COMPLETED: if user sent "done"/"completed"/"attached" AND no subsequent follow-up → completed. Ambiguous = not completed (benefit of doubt) |
| Hallucination Detection | Does the journey have zero traceable source signals — entirely fabricated? | Cross-reference journey description against all contributing candidates and raw signals → flag journeys with no traceable source | "Hallucination" = core task description has NO supporting evidence in ANY input source. "Reasonable inference" from strong multi-source signals ≠ hallucination |
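Gate Condition Compliance is the one check in this pass that is fully programmatic. A minimal sketch, assuming each gate arrives as a boolean field on the journey record (the exact schema is an assumption); an absent gate is treated as a failure rather than a pass.

```python
GATES = (
    "IS_CONCRETE_TASK", "IS_WORK_RELATED", "HAS_USER_RELEVANCE",
    "IS_NOT_COMPLETED", "HAS_MEANINGFUL_COMPLEXITY",
    "REQUIRES_BROWSER", "IS_AI_EXECUTABLE",
)

def gate_compliance(journey: dict) -> tuple[bool, list[str]]:
    """A journey passes only if all 7 gates hold; return failed gates for triage.
    Missing gates count as failures (no silent passes on schema drift)."""
    failed = [g for g in GATES if not journey.get(g, False)]
    return (not failed, failed)
```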
| Sub-metric | What to measure | How | Boundary definition |
|---|---|---|---|
| Lifecycle Accuracy | Is the lifecycle state (NEW_WORK / ONGOING_WORK / RECURRING_WORK) correctly assigned? | For ONGOING: verify match to existing journey + continued activity. For NEW: verify no existing journey covers this task | N/A for first-run evaluations where no journey history exists |
| Incremental Update Fidelity | When updating an existing journey, is the original context preserved while new info is correctly incorporated? | Compare updated journey against previous version + new signals → flag lost context or incorrect overwrites | N/A when journey is NEW_WORK with no prior version |
| Dismissed Topic Suppression | Are manually dismissed journeys correctly excluded from regeneration? | Cross-reference output against dismissed titles list → flag resurfaced dismissed topics | N/A when no dismiss history exists |
| Privacy & Sensitivity Filtering | Are sensitive topics correctly blocked? Any PII leakage? | PII scan (regex + NER) across all journey titles, summaries, and source references + personal-vs-work classification | Any PII (email addresses in title, SSN, credit card) = ship-blocker. Personal-but-not-PII (e.g., "dentist appointment") = High |
| Staleness Rejection | Are journeys based on signals older than the 3-day freshness window filtered out? | Check source signal timestamps → flag journeys based entirely on expired signals | "Entirely expired" = ALL contributing sources >3 days old AND no new signals merged in |
| Noise Rejection | Are low-value / trivial signals correctly suppressed? | Classify signal value → flag low-value signals that produced journeys | "Noise" = auto-generated system emails, all-day calendar placeholders, newsletters. "Not noise" = routine meetings if user is organizer or has action items |
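The Staleness Rejection boundary ("entirely expired" = ALL sources older than 3 days AND no new signals) can be encoded directly. A sketch; the function signature is illustrative and timestamps are assumed to be timezone-consistent.

```python
from datetime import datetime, timedelta

FRESHNESS_WINDOW = timedelta(days=3)

def is_stale(source_timestamps: list[datetime],
             merged_new_signal: bool,
             now: datetime) -> bool:
    """A journey is 'entirely expired' only when ALL contributing sources
    are older than the 3-day window AND no new signals were merged in."""
    if merged_new_signal or not source_timestamps:
        return False
    return all(now - ts > FRESHNESS_WINDOW for ts in source_timestamps)
```

Note the conjunction: one fresh source, or one newly merged signal, keeps the journey alive.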
Ranking Agent Output
5 metrics · 2 passes · The Ranking Agent produces two distinct outputs: (1) final journeys with ranking, and (2) an optimized prompt with an ExecutionPlan for each journey, to be sent to M365 Copilot Chat. These are evaluated as two separate passes.
| Sub-metric | What to measure | How |
|---|---|---|
| Priority Signal Alignment | Do top-ranked journeys genuinely have the highest urgency/importance? | Extract rule-based + model scores → verify top-N score highest on timeliness, stakeholder importance, ownership strength |
| Missing or Under-Ranked Detection | Are there important tasks that should be in the Top-3 but are either missing entirely or buried too low? | AI Judge reviews all input signals holistically → identifies high-urgency/high-importance tasks → checks whether they appear in the system's Top-3. Flags tasks that are absent or ranked significantly lower than their urgency warrants |
| Ranking Stability | Do rankings change erratically between sessions without new signals? | Compare logs across sessions → flag unexplained rank changes |
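The Ranking Stability comparison above amounts to diffing two session rankings while excusing journeys that received new signals. A sketch under that assumption; identifying which journeys got new signals is taken as an input here.

```python
def unexplained_rank_changes(prev_ranking: list[str],
                             curr_ranking: list[str],
                             new_signal_journeys: set[str]) -> list[str]:
    """Flag journeys whose rank moved between sessions without any new signal."""
    prev_pos = {j: i for i, j in enumerate(prev_ranking)}
    flagged = []
    for i, j in enumerate(curr_ranking):
        if j in prev_pos and prev_pos[j] != i and j not in new_signal_journeys:
            flagged.append(j)
    return flagged
```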
| Sub-metric | What to measure | How |
|---|---|---|
| ExecutionPlan Quality | Is the generated ExecutionPlan well-structured, specific, and executable by Copilot? | AI evaluates: task clarity, context inclusion, output format, scope. Score 1-5 |
| ExecutionPlan-Journey Alignment | Does the ExecutionPlan match the journey's task definition? | Compare plan objective against journey title/summary → flag mismatches |
Card Presentation Quality
6 metrics · What the user sees on the NTP before clicking. Does the card earn the click — and set honest expectations?
| Sub-metric | What to measure | How |
|---|---|---|
| Brief Clarity | Is the task brief (title + summary) specific, actionable, and immediately understood? | AI evaluates: specificity, actionability, grammar, length. Binary: "Can you predict what happens if you click?" |
| Label Accuracy | Does the label (type + name) correctly reflect the actual urgency trigger? | Cross-reference label against source signals. E.g., "Asked by VP Chen" — verify VP Chen actually made the request |
| Icon Relevance | Does the action_type icon match the actual task? | Map action_type to journey content → flag mismatches |
| Promise Accuracy | Does the card accurately represent what the Copilot execution will deliver? | Compare card title + summary vs. actual execution output → score promise-delivery alignment 1-5. Score ≤ 2 = critical failure |
| Task Category Distribution | Are the displayed journey cards spread across different action types, or all clustered in one category? | Classify displayed cards by action_type → flag if any single type exceeds 60% |
| Trigger Diversity | Are the displayed journey cards varied in urgency trigger types (deadline-driven, people-driven, recurrence-driven)? | Extract label/trigger types across displayed cards → check distribution is not dominated by a single trigger type |
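Both diversity checks above (Task Category Distribution and Trigger Diversity) share one shape: count labels across displayed cards and flag over-concentration. A sketch using the 60% cap from the action_type metric; applying the same cap to trigger types is an assumption, since that row states no threshold.

```python
from collections import Counter

MAX_SHARE = 0.60  # flag if any single label exceeds 60% of displayed cards

def distribution_flags(labels: list[str]) -> list[str]:
    """Flag any label (action_type or trigger type) whose share exceeds the cap."""
    counts = Counter(labels)
    total = len(labels)
    return [lbl for lbl, c in counts.items() if total and c / total > MAX_SHARE]
```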
Execution & Delivery
5 metrics · After the user clicks, M365 Copilot Chat executes the prompt generated by the Ranking Agent. Did the response deliver value? Root-cause tagging (cj_pipeline | m365_copilot | ambiguous) is applied for debugging — all failures count toward the overall feature verdict.
| Sub-metric | What to measure | How | Root-cause notes |
|---|---|---|---|
| Execution Completion Rate | What % of executions completed successfully vs. failed/timed out? | Parse execution traces → compute success rate | Timeout/crash = likely m365_copilot. Malformed prompt = cj_pipeline |
| Output Correctness | Are factual claims grounded in the user's actual data? | For each claim, verify against source signals. Claim precision = supported / total. Target ≥ 95% | Wrong facts from bad context in prompt = cj_pipeline. Correct context but Copilot hallucinated = m365_copilot |
| Output Completeness | Does the output cover all key threads from the journey's source signals? | Compare entities/topics in output vs. key entities in source signals | CJ prompt omitted key context = cj_pipeline. Context provided but Copilot ignored = m365_copilot |
| Actionability | Are there concrete next steps the user can immediately act on? | Count actionable elements: draft ready to send? Link to right doc? Binary: ≥ 1 concrete action? | Prompt didn't request actionable format = cj_pipeline. Requested but not delivered = m365_copilot |
| Scenario Fit | Is the output format appropriate for the action_type? | Map action_type → expected format (REPLY_EMAIL → email draft, PREPARE_MEETING → agenda). Flag mismatches | CJ prompt specified wrong format = cj_pipeline. Correct request, wrong delivery = m365_copilot |
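Output Correctness above defines claim precision = supported claims / total claims with a ≥ 95% target. A minimal sketch; the per-claim support verdicts are assumed to come from an upstream grounding check (AI judge or retrieval match), which this snippet takes as given.

```python
def claim_precision(claims: list[str],
                    is_supported: dict[str, bool],
                    target: float = 0.95) -> tuple[float, bool]:
    """claim precision = supported claims / total claims; pass if >= target.
    A claim missing from the verdict map counts as unsupported."""
    if not claims:
        return 1.0, True
    supported = sum(is_supported.get(c, False) for c in claims)
    precision = supported / len(claims)
    return precision, precision >= target
```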
End-to-End Coherence & Task Resolution
4 metrics · From raw signals to final execution output, does the entire chain hold together?
| Sub-metric | What to measure | How |
|---|---|---|
| Journey Coverage (Missing Journeys) | Are there important tasks from the input signals that the pipeline failed to surface as journeys? | AI reviews all input signals holistically → identifies task-worthy signal clusters that produced no journey. Cross-references with Grouping and Merge outputs to locate where the signal was lost |
| Full-chain Topic Alignment | Does the primary topic stay consistent across all pipeline stages? | Extract primary topic at each stage → flag any stage that introduces a different topic or contradicts another |
| Task Resolution Score | Could the user meaningfully advance their task from this output alone? | Expert/AI rating 1-5. Score 1 = "useless, would start over." Score 5 = "could complete the task right here." The single offline metric most predictive of real-world satisfaction |
| Recurrence Handling | For RECURRING_WORK journeys, is the system recommending the correct upcoming instance? | Check recurrence metadata → validate instance timing. N/A for NEW_WORK and ONGOING_WORK |
Critical failures are flagged independently from dimension scores. Any critical failure triggers an overall Fail verdict regardless of other scores.
| Severity | Definition | Action |
|---|---|---|
| Ship-Blocker | Trust-destroying or compliance-violating. Zero tolerance — even one instance blocks release. | Blocks ship. Zero-tolerance threshold. |
| Critical | Fundamental value proposition broken. User sees wrong, fabricated, or misleading content. | Must fix before GA. Threshold: < 2% of evaluated journeys. |
| High | Meaningful quality gap. User gets some value but with significant friction. | Must be triaged. Acceptable only with documented mitigation plan. |
| Medium | Noticeable imperfection. User still gets value but experience is suboptimal. | Should fix. Can ship with known issues tracked. |
| Failure Mode | Agent / Stage | Description | Severity |
|---|---|---|---|
| Privacy Violation | Merging | Sensitive / personal content surfaces in journey | Ship-Blocker |
| PII Leakage | Merging | PII exposed in journey title, summary, or sources | Ship-Blocker |
| Resurfaced Dismissed Topic | Merging | A manually dismissed journey topic reappears | Ship-Blocker |
| Hallucinated Journey | Merging | Journey has no traceable source signal — entirely fabricated | Critical |
| Wrong Source Attribution | Merging | Journey links to incorrect M365 items | Critical |
| Phantom Task | Merging | Signals are real but the inferred task is not | Critical |
| Completed Task Surfaced | Merging | Journey recommends a task the user already finished | Critical |
| Promise-Delivery Gap | Card Presentation | Card promises something the execution output doesn't deliver | Critical |
| Over-Promise | Card Presentation | Card implies capability the Copilot execution can't support | Critical |
| Execution Hallucination | Execution | Factual claims not supported by any source signal | Critical |
| Topic Drift | E2E | Primary topic changes between pipeline stages | High |
| Missed Source Entirely | Grouping | A data source with obvious task signals produced zero candidates | High |
| Mis-grouped Items | Grouping | Unrelated items from the same source lumped into one candidate | High |
| Orphaned Key Item | Grouping | A high-importance item not assigned to any candidate | High |
| Over-merged Tasks | Merging | Distinct tasks from different sources collapsed into one journey | High |
| Wrong Lifecycle | Merging | NEW_WORK tagged as ONGOING_WORK or vice versa | High |
| Scope Mismatch | Merging | Task is too broad or too narrow | High |
| AI Misfit | Merging | Task is too simple or beyond AI capability | High |
| Noise Leak | Merging | Low-value / trivial signals produced a journey | High |
| Stale Journey | Merging | Journey based on signals older than 3-day window | High |
| Missing Important Journey | E2E | A clearly important task from input signals is not surfaced as any journey | High |
| Priority Inversion | Ranking | A clearly more urgent journey ranked below a less important one | High |
| Bad ExecutionPlan | Ranking | Plan is vague, references wrong sources, or describes a different task | High |
| Missing Key Context | Execution | Execution output omits a major thread the user clearly cares about | High |
| Wrong Help Type | Execution | Correct topic but wrong output format | High |
| No Actionability | Execution | Correct information but no concrete next step | High |
| Wrong Recurrence Instance | E2E | Recurring task surfaces a past instance instead of upcoming one | High |
| Under-merged Duplicates | Merging | Same task from different sources left as separate journeys | Medium |
| Monoculture Recommendations | Card Presentation | All displayed journeys are the same action_type or trigger type | Medium |
| Vague Card | Card Presentation | Title/summary too generic — user can't predict what they'll get | Medium |
| Redundant Output | Execution | Execution output just restates source data — no synthesis | Medium |
| Condition | Overall Verdict |
|---|---|
| Any Ship-Blocker failure | Fail — blocks ship regardless of scores |
| Any Critical failure rate ≥ 2% | Fail — must fix before GA |
| Critical < 2% AND ≥ 2 High failures | Conditional — requires mitigation plan |
| Critical < 2% AND 0-1 High AND all pass criteria met | Pass — ship-ready |
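The verdict table can be read as a short decision cascade. A sketch of that logic; the table leaves the case "0-1 High but other pass criteria unmet" unspecified, so defaulting it to Conditional is an assumption made here, not a rule from the table.

```python
def overall_verdict(ship_blockers: int,
                    critical_rate: float,
                    high_failures: int,
                    all_pass_criteria_met: bool) -> str:
    """Map failure tallies to the overall verdict per the table above."""
    if ship_blockers > 0:
        return "Fail"         # blocks ship regardless of scores
    if critical_rate >= 0.02:
        return "Fail"         # must fix before GA
    if high_failures >= 2:
        return "Conditional"  # requires mitigation plan
    if all_pass_criteria_met:
        return "Pass"         # ship-ready
    return "Conditional"      # assumption: unspecified case treated conservatively
```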