When a measure becomes a target, it ceases to be a good measure. -- Goodhart's Law
Over the past few months, I've spoken with hundreds of vendors, buyers, and operator teams using voice artificial intelligence to support customers, sell to them, and sometimes negotiate outcomes that matter (refunds, cancellations, pricing exceptions, payment plans, and freight routes). There's a clear pattern in the teams that win: They keep a tight monitoring system, pick a select few metrics, and hunt for 1 percent improvements every week.
The teams that struggle do the opposite. They track everything, chase whatever's loudest, and end up optimizing for activity instead of outcomes. They'll still move numbers, but the failures show up later in recontacts, escalations, and downstream defects.
This is a field guide to the most common voice AI key performance indicators that teams track, where they fail, and the guardrails that keep them aligned with customer success.
But first, consider why you should track KPIs at all.
KPIs are a small set of quantitative metrics that tell you how healthy your operation is. The health part matters because it's easy to confuse activity with progress.
KPIs force you to confront whether performance is improving or you're just keeping busy.
The following are a few rules of thumb to keep your KPIs honest:
- If a metric is easy to move but doesn't represent value, it's a vanity metric.
- If the metric can go up while customers get worse outcomes, it should not be your primary KPI.
- Your primary KPI should move often enough, ideally weekly. If it only moves quarterly, it's too slow to course-correct.
The KPIs describe your business. Track them closely and hunt for small wins: Any operator on your team, including you, can probably find a 1 percent improvement this week, and those improvements compound.
Common Metrics and Failure Modes
1) Containment Rate (Percentage of Automated Calls)
What it is:
The share of interactions completed without a human.
Why teams choose it:
It's immediate, it trends quickly, and it produces a clean headline: Automation is up.
Where it fails:
Containment can rise because the system refuses to escalate, loops the customer, or confidently completes the wrong action.
Guardrails to add:
- First-contact success (FCS): resolved and no recontact in X days and no critical defect.
- Repeat contact rate (7-14 days): catches brittle resolution, which is subtly different from FCS and might make more sense depending on your use case.
- Handoff success rate: escalations land cleanly (context captured, minimal repetition, time-to-human service level agreement).
- Critical defect rate: wrong action, policy/compliance error, incorrect refund, cancellation, or identity verification (IDV).
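The guardrails above can be sketched as a few lines over your call records. This is a minimal sketch: the field names (resolved_by_ai, recontact_within_14d, critical_defect) are illustrative assumptions, not a real schema.

```python
# Containment rate plus the guardrail metrics that keep it honest,
# computed over a toy list of call records.

calls = [
    {"resolved_by_ai": True,  "recontact_within_14d": False, "critical_defect": False},
    {"resolved_by_ai": True,  "recontact_within_14d": True,  "critical_defect": False},
    {"resolved_by_ai": False, "recontact_within_14d": False, "critical_defect": False},
    {"resolved_by_ai": True,  "recontact_within_14d": False, "critical_defect": True},
]

def rate(pred, population):
    """Share of records in `population` matching `pred`."""
    return sum(1 for c in population if pred(c)) / len(population) if population else 0.0

containment = rate(lambda c: c["resolved_by_ai"], calls)

# First-contact success: resolved AND no recontact AND no critical defect.
fcs = rate(
    lambda c: c["resolved_by_ai"]
    and not c["recontact_within_14d"]
    and not c["critical_defect"],
    calls,
)

repeat_contact = rate(lambda c: c["recontact_within_14d"], calls)
critical_defects = rate(lambda c: c["critical_defect"], calls)
```

Note the gap: containment is 75 percent here, but first-contact success is only 25 percent. That gap is exactly what containment alone hides.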
2) Average Handle Time (AHT)
What it is:
Average time per customer interaction.
Why teams choose it:
It's familiar, it moves quickly, and it's historically how contact centers prove efficiency.
Where it fails:
AHT improves when you rush, transfer early, or end calls prematurely. Voice AI makes this failure worse because it can be confidently wrong fast. Also, when did we decide it's good to talk less to our customers?
Guardrails to add:
- Time-to-outcome on successful calls only: measure speed conditioned on success (FCS).
- Repeat contact rate: fast wrong answers create rework.
- Containment rate and handoff success rate: make sure faster isn't just the bot ending calls or refusing escalations; AHT improvements should not come with inflated containment and worse outcomes.
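Conditioning speed on success is a one-line change in how you aggregate. A minimal sketch, assuming a per-call duration and a precomputed success flag (e.g., FCS):

```python
# Naive AHT vs. time-to-outcome on successful calls only.

calls = [
    {"duration_s": 180, "success": True},
    {"duration_s": 60,  "success": False},  # fast but wrong: excluded below
    {"duration_s": 240, "success": True},
    {"duration_s": 45,  "success": False},  # fast but wrong: excluded below
]

# Naive AHT rewards the fast failures.
naive_aht = sum(c["duration_s"] for c in calls) / len(calls)

# Time-to-outcome measures speed only where the customer actually got an outcome.
successes = [c for c in calls if c["success"]]
time_to_outcome = sum(c["duration_s"] for c in successes) / len(successes)
```

In this toy data, naive AHT (131 seconds) looks much better than time-to-outcome (210 seconds) precisely because the fast calls failed, which is the failure mode described above.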
3) Sentiment
What it is:
Sentiment scoring across the interaction.
Why teams choose it:
It feels closer to the customer experience than operational stats, and it's available at scale.
Where it fails:
Sentiment is not truth; it's a proxy. It varies by customer segment, language, and context. And today's models are still shaky at granular scoring (distinguishing 65 percent from 85 percent); they're usually more reliable at coarse letter-grade buckets (A/B/C or good/neutral/bad). Teams that force precise numeric sentiment often end up with noisy analytics. At its worst, sentiment can look fine while the system commits critical defects (wrong cancellations, incorrect refunds, policy misstatements).
Guardrails to add:
- Sentiment shift: start-of-call vs. end-of-call change (did we make things worse?)
- Escalation triggered by frustration: percentage of escalations preceded by repeated clarifications/interruptions.
- Critical defect rate: sentiment cannot override correctness.
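One way to operationalize the coarse-bucket advice and the sentiment-shift guardrail, sketched below. The thresholds and the 0-1 score range are illustrative assumptions about your sentiment model's output:

```python
# Collapse fine-grained sentiment scores into coarse buckets, then report
# a start-vs-end shift instead of trusting precise numbers.

def bucket(score: float) -> str:
    """Collapse a raw model score in [0, 1] into a coarse grade (thresholds are assumed)."""
    if score >= 0.6:
        return "good"
    if score >= 0.4:
        return "neutral"
    return "bad"

def sentiment_shift(start_score: float, end_score: float) -> str:
    """Did the call make things better, worse, or neither, at bucket granularity?"""
    order = ["bad", "neutral", "good"]
    delta = order.index(bucket(end_score)) - order.index(bucket(start_score))
    if delta > 0:
        return "improved"
    if delta < 0:
        return "worsened"
    return "unchanged"
```

Bucketing first means a noisy 0.65-vs-0.70 wobble doesn't register as a shift, while a genuine good-to-bad slide does.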
4) Customer Satisfaction (CSAT)/Net Promoter Score (NPS)
What it is:
Survey feedback after the interaction.
Why teams choose it:
Leadership recognizes it, and it's easy to report.
Where it fails:
CSAT/NPS are lagging, noisy, and biased. They're useful as check-engine lights, but they don't tell you what to fix this week.
Modern AI call analytics can do much better, surfacing operationally useful signals (repeat contacts, defect clusters, friction patterns) without waiting for surveys.
Guardrails to add:
- Survey response rate and bias by intent/segment.
- Behavioral proxies: repeat contacts, abandonment, complaints.
- Friction indicators: interruptions, "please repeat," loops, long holds.
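The friction indicators above can be scored directly from a call's event log. A minimal sketch; the event names ("interruption", "repeat_request", "loop_detected", "hold") and the hold threshold are assumptions about your transcript/event schema:

```python
# Score behavioral friction from call events, as a fast complement to surveys.

FRICTION_EVENTS = {"interruption", "repeat_request", "loop_detected"}

def friction_score(events, long_hold_s=60):
    """Count friction events; a hold longer than `long_hold_s` counts as one friction event."""
    score = sum(1 for e in events if e["type"] in FRICTION_EVENTS)
    score += sum(1 for e in events if e["type"] == "hold" and e["seconds"] > long_hold_s)
    return score

call_events = [
    {"type": "interruption"},
    {"type": "hold", "seconds": 95},   # long hold: counts
    {"type": "repeat_request"},
    {"type": "hold", "seconds": 20},   # short hold: doesn't count
]
```

Unlike a survey, this score exists for every call, the day it happens, which is what makes it usable for weekly course-correction.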
5) Conversion Rate (sales, retention, collections)
What it is:
The percent of interactions that convert (purchase, upsell, payment, retention, appointment booked).
Why teams choose it:
It's a real business outcome.
Where it fails:
Conversion can be inflated by dirty wins: pushing hallucinated discounts, setting the wrong expectations, or optimizing for quick closes that lead to downstream refunds, cancellations, or disputes.
Guardrails to add:
- Refund/cancellation within X days (post-conversion truth).
- Chargebacks/disputes/complaints.
- Critical defects (disclosures, compliance, incorrect commitments).
- Upsell attempt rate and average order value (AOV) (ensure you're not just closing cheap, low-value sales).
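The post-conversion guardrails reduce to reporting a "net" conversion rate alongside the gross one. A sketch under assumed field names and an assumed 30-day reversal window:

```python
# Gross vs. net conversion: a conversion that refunds, cancels, or disputes
# within the window is a dirty win and doesn't count toward the net number.

REVERSAL_WINDOW_DAYS = 30  # assumed policy window

interactions = [
    {"converted": True,  "reversed_within_window": False},
    {"converted": True,  "reversed_within_window": True},   # refunded: dirty win
    {"converted": False, "reversed_within_window": False},
    {"converted": True,  "reversed_within_window": False},
]

gross_conversion = sum(c["converted"] for c in interactions) / len(interactions)

net_conversion = sum(
    c["converted"] and not c["reversed_within_window"] for c in interactions
) / len(interactions)
```

If gross conversion climbs while net conversion is flat, the bot is likely closing dirty wins, not creating value.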
Conclusion
Voice AI is pushing contact centers into a software cadence: weekly releases, fast iteration, and fast failure. The good news is that AI has also made call analytics at scale practical. With modern speech and large language model tooling, you can measure deeper metrics than calls handled or minutes used, and you can do it across thousands of conversations instead of manually QA-ing a 1 percent sample of calls.
The right structure is simple and includes the following:
- Pick one primary KPI that represents customer success for your business.
- Add a handful of guardrails so you can't game it.
AI is moving from pilots to production, and the teams that win will run it like production software.
Amber Sahdev is an engineer focused on voice AI and call analytics for contact centers.