Evaluating AI conversations across text and voice

  • Insight
  • 10 minute read
  • 01/04/25
Sebastian Ahrens

AI Center of Excellence Leader, PwC Switzerland

Today, artificial intelligence is being woven into almost every facet of customer engagement, from text-based chatbots to voice-powered agents in call centers. The real measure of success isn’t just having AI assist your users; it’s ensuring the AI does so with consistent quality and reliability. That’s where a thorough evaluation framework comes into play.

Why a unified evaluation matters

In many organizations, leaders find themselves having to compare apples to oranges: a text-based agent operating on large language models (LLMs) and a voice-based agent taking phone calls. Both are crucial, yet they differ in their inputs, outputs, and constraints. When it comes down to it, though, these “apples and oranges” have a lot in common. They both rely on the same core conversational abilities—understanding context, following instructions, and delivering coherent answers.

Hence, a unified evaluation where you judge text and voice output by the same content standards can be highly effective. But let’s face it, the voice environment brings unique wrinkles: background noise, speech recognition quirks, and the need to maintain a polished tone under time pressure. By combining a shared core method of assessment (using transcripts, for example) with an additional set of voice-specific tests, you can maintain fairness while still recognizing the distinctive demands of audio interaction.

Tailoring evaluation for voice

Evaluating a voice-based agent requires diving deeper into elements that text interactions don’t typically encounter. Consider the clarity of speech, intonation, and the dreaded lag that can disrupt a caller’s experience. You may find that, even when the AI’s factual correctness is solid, the overall user sentiment plummets if the response is too robotic or if the system’s speech recognition keeps failing.

Some organizations adopt specialized algorithms to measure voice-specific traits directly from audio signals rather than text transcripts. By examining prosody or intonation patterns, you catch nuances that a purely text-based score would miss. It’s a move that might require additional technology investment but can pay huge dividends in customer satisfaction down the road.
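As a rough illustration of the idea, one simple voice-specific signal is pitch variability: a delivery with very little pitch movement tends to sound monotone. The sketch below assumes a pitch contour (voiced-frame f0 estimates in Hz) has already been extracted by some pitch tracker; the threshold value is a hypothetical starting point, not an established standard.

```python
import math
from statistics import mean, pstdev

def is_monotone(f0_hz, threshold_semitones=1.0):
    """Flag a spoken response as monotone if its pitch variation is low.

    f0_hz: pitch estimates (Hz) for voiced frames only, from any pitch
    tracker. The 1-semitone threshold is an illustrative assumption.
    """
    # Convert Hz to semitones relative to the speaker's mean pitch,
    # so the measure is comparable across high and low voices.
    m = mean(f0_hz)
    semitones = [12 * math.log2(f / m) for f in f0_hz]
    return pstdev(semitones) < threshold_semitones
```

A flat contour (near-constant pitch) trips the flag, while natural speech with a few semitones of movement passes; in practice, a score like this would be one input among several, alongside pace and pause measurements.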

Holistic criteria for quality

Regardless of whether you’re dealing with text or speech, the yardstick of “quality” should be transparent and thorough. You’ll want to measure factual accuracy, coherence, context retention, and adherence to corporate policies—like privacy or compliance guidelines. Equally vital is the agent’s tone: the best agent will be polite and empathetic without coming off as unnatural. In a call center environment, for instance, even the most accurate response can feel like a flop if delivered in a monotone.

Think of these criteria like overlapping puzzle pieces. If one piece—say factual accuracy—is missing, the whole picture isn’t complete. A balanced scorecard that encapsulates correctness, manner, policy compliance, and grammar or fluency ensures that nothing slips through the cracks.
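One way to make the “missing puzzle piece” idea concrete is a scorecard that averages the dimensions but also applies hard gates: if a critical dimension such as accuracy or policy compliance falls below a floor, the response fails regardless of its other scores. The dimension names, weights, and gate floor below are illustrative assumptions, not a fixed standard.

```python
def scorecard(scores, weights=None,
              hard_gates=("accuracy", "compliance"), gate_floor=0.5):
    """Combine per-dimension scores (0..1) into one number, but fail
    outright if any gated dimension is below the floor.

    All names and thresholds here are illustrative.
    """
    weights = weights or {k: 1.0 for k in scores}
    total = sum(weights[k] * scores[k] for k in scores) / sum(weights.values())
    passed = all(scores[g] >= gate_floor for g in hard_gates if g in scores)
    return round(total, 3), passed
```

The gate is the point: a response scoring well on tone and fluency still fails overall if it is factually wrong or violates policy, which mirrors how a human reviewer would judge it.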


Comprehensive Evaluation Framework for Effective Communication

Building a ground truth

As the old saying goes, “What gets measured gets managed.” But how do we define what’s correct? That’s where a well-curated set of ground truth data comes in. This is your collection of “ideal” or expected responses. It might consist of transcripts from your best customer service representatives or expertly crafted answers for a variety of typical (and not-so-typical) queries.

Leading organizations often keep expanding this library of test scenarios, incorporating fresh challenges that arise in real interactions. A robust ground truth lets you pinpoint exactly how your AI is performing, from general inquiries to edge cases, and fosters continuous improvement.
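In practice, such a library can start as something very simple: a list of scenario records pairing a query with an expected answer and tags for filtering (typical vs. edge case, topic area, and so on). The structure and field names below are a minimal sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One ground-truth test case; field names are illustrative."""
    query: str
    expected_answer: str
    tags: list = field(default_factory=list)

library = [
    Scenario("How do I reset my password?",
             "Use the 'Forgot password' link on the login page.",
             tags=["account", "typical"]),
]

# As fresh challenges surface in real interactions, they are appended,
# so the test library keeps growing alongside the product.
library.append(Scenario("Can I get a refund after 90 days?",
                        "Refunds are only available within 30 days of purchase.",
                        tags=["billing", "edge-case"]))
```

Even this flat list is enough to drive automated scoring; teams typically graduate to versioned datasets once the library grows into the hundreds of scenarios.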

Harnessing LLMs for automated feedback

The concept of an “LLM jury” has gained traction for one good reason: it’s scalable. Rather than rely solely on human evaluators (who can get tired, overworked, or inconsistent), you can enlist one or more large language models to review the AI’s output. These models can give real-time feedback on correctness, coherence, and tone—offloading some of the grunt work from your team.

It does raise interesting considerations. What if the evaluator model has its own biases or knowledge gaps? A best practice is to calibrate the model by giving it example evaluations, then cross-checking a small sample of its results with your team. By verifying alignment on that sample, you gain confidence that the LLM’s scoring is consistent enough to trust for a broader range of test cases.
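The cross-checking step can itself be quantified. A minimal sketch, assuming the LLM judge and the human reviewers score the same sample on the same scale (say 1 to 5): compute the mean absolute difference and the share of cases where the two agree within a tolerance. The tolerance of one point is an assumption for illustration.

```python
def judge_alignment(llm_scores, human_scores, tolerance=1.0):
    """Measure how closely an LLM judge tracks human raters on a
    shared sample of evaluations (same scoring scale assumed)."""
    diffs = [abs(a - b) for a, b in zip(llm_scores, human_scores)]
    mean_abs_diff = sum(diffs) / len(diffs)
    agreement = sum(d <= tolerance for d in diffs) / len(diffs)
    return mean_abs_diff, agreement
```

If agreement on the calibration sample is high, the LLM’s scoring can be trusted across the broader test set; if not, the judge’s prompt and example evaluations need refinement before scaling up.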

Implementing a formal test suite

A truly sophisticated AI evaluation program is more than just a one-off check. It’s a recurring, systematic process. Leading companies often establish test suites that include both “typical usage” and “stress scenarios.” In the call center world, that might mean a barrage of billing questions, escalations about a refund, or even an angry caller who’s tough to please. Each test scenario is then fed to the agent, and the outputs get scored automatically.

Results often feed into dashboards that highlight pass rates, average scores, and anomalies—like major policy violations or repeated factual errors. Over time, you spot trends: Are there certain question types that consistently trip up your AI? Do specific policy filters need refinement? These insights guide your teams as they tweak prompts, retrain models, or even rewrite sections of knowledge bases to fill in knowledge gaps.
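The aggregation behind such a dashboard can be quite small. The sketch below assumes each scored test run is a record with a category, a pass/fail outcome, and a policy-violation flag; the field names are illustrative.

```python
from collections import defaultdict

def summarize(results):
    """Roll scored test runs up into per-category pass rates and a
    list of anomalies (here: policy violations).

    Each result is a dict like
    {"category": "billing", "passed": True, "policy_violation": False};
    these field names are assumptions for illustration.
    """
    by_cat = defaultdict(lambda: [0, 0])  # category -> [passes, total]
    anomalies = []
    for r in results:
        by_cat[r["category"]][0] += r["passed"]
        by_cat[r["category"]][1] += 1
        if r.get("policy_violation"):
            anomalies.append(r)
    pass_rates = {c: p / n for c, (p, n) in by_cat.items()}
    return pass_rates, anomalies
```

Tracking these numbers per run makes the trend questions answerable: a category whose pass rate sags over several releases points directly at the prompts, models, or knowledge-base sections that need attention.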

One word of caution: in voice scenarios, you must also measure the performance of speech recognition (ASR) and text-to-speech (TTS) systems. Poor transcription can derail even the most brilliant reasoning pipeline. This is why some frameworks measure word error rates or have a separate stream of tests focused on voice fidelity and clarity.
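Word error rate is the standard yardstick here: the number of word substitutions, insertions, and deletions needed to turn the ASR transcript into the reference, divided by the number of reference words. A compact implementation via word-level edit distance:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of reference words, computed with a
    rolling-array word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = list(range(len(hyp) + 1))  # distances for the empty reference
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,            # deletion
                      d[j - 1] + 1,        # insertion
                      prev + (r != h))     # substitution / match
            prev, d[j] = d[j], cur
    return d[len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one reason teams often pair it with separate voice-fidelity checks rather than relying on a single number.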

Challenges and the road ahead

No framework is perfect. An LLM-based evaluator may drift over time, especially if it’s an external model that gets periodic updates. You’ll also contend with the complexity of multi-turn interactions, ensuring that the agent maintains continuity and doesn’t contradict itself five exchanges in.


Navigating LLM Evaluation Challenges

Nevertheless, overcoming these challenges is well worth it. With a disciplined approach to data collection, thorough test suites, and automated scoring, you can sustain a high-performing AI solution that evolves gracefully with your business. As user expectations soar—particularly in customer-facing environments—there’s real competitive advantage in offering fast, accurate, and empathetic AI support across multiple channels.

Finally, remember that real-time user feedback remains a critical piece of the puzzle. Scores and metrics are invaluable, but never discount what direct user surveys or call recordings might reveal. In many respects, the marriage of quantifiable test results with real human sentiment is what keeps your evaluation pipeline honest.

Conclusion

For executives aiming to harness the full potential of AI in both text and voice channels, a sophisticated evaluation strategy is non-negotiable. It empowers you to benchmark performance, identify gaps, and continuously refine your AI’s capabilities—without risking a slip in customer trust or satisfaction. By unifying the core assessment principles and tailoring the details for each modality, you create a robust system that propels your AI initiatives forward responsibly and competitively.

If there’s one takeaway from this blueprint, it’s that strong governance paired with continuous testing is the key to reliable, high-quality AI interactions. In the end, that reliability is what will shape your organization’s reputation—and your bottom line—for years to come.

Contact us

Sebastian Ahrens

AI Center of Excellence Leader, PwC Switzerland

+41 58 792 16 28


Gianfranco Mautone

Partner and Forensic Services and Financial Crime Leader, Zurich, PwC Switzerland

+41 58 792 17 60
