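- ServiceNow has released EVA, a framework for comprehensively evaluating voice agents. EVA takes an end-to-end approach: it simulates realistic conversations and measures both task accuracy (EVA-A) and conversational experience (EVA-X). The framework is described as the first to score task completion and user experience jointly, addressing the limitation of existing methods that evaluate the two separately. The research team built an initial dataset of 50 airline scenarios covering tasks such as flight rebooking, cancellation handling, and voucher use, and plans to expand to more domains.
Benchmarks of 20 cascaded systems and audio-native models show a pronounced tradeoff between accuracy and experience: systems with high task completion rates tend to deliver a poorer user experience, and vice versa. This suggests that optimizing a single metric can harm overall interaction quality. The framework's code, dataset, and evaluation prompts are all open source, with the aim of pushing the industry toward more comprehensive evaluation standards for voice agents.
End-to-end evaluation of voice agent performance
Joint scoring of task accuracy and experience
Reveals an accuracy-experience tradeoff
Open-source framework and airline dataset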
- A New Framework for Evaluation of Voice Agents (EVA)
ServiceNow researchers have introduced EVA, an end-to-end evaluation framework that assesses conversational voice agents by jointly measuring accuracy (EVA-A) and conversational experience (EVA-X). Unlike existing methods that evaluate task success and user experience separately, EVA evaluates complete, multi-turn spoken interactions using a bot-to-bot architecture that simulates realistic user-agent conversations. The framework targets failure modes specific to voice interaction, such as speech recognition errors, response delays, and overly long spoken outputs, any of which can undermine otherwise accurate reasoning. EVA is released with an initial dataset of 50 airline scenarios involving flight rebooking, cancellations, and voucher handling, the first domain in a planned expansion. Benchmark results across 20 systems, including cascaded systems and audio-native large language models, reveal a consistent tradeoff between accuracy and experience: systems that complete tasks well tend to score lower on conversational quality, and vice versa. The framework, including code, dataset, and evaluation prompts, is fully open source to support further research into holistic voice agent assessment.
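One way to read the reported accuracy-versus-experience tradeoff is as a Pareto front over the two scores. The sketch below is illustrative only: the system names and numbers are made up, `SystemScore`, `dominates`, and `pareto_front` are hypothetical helpers rather than part of the EVA codebase, and it assumes each system has already been reduced to a single accuracy score and a single experience score where higher is better.

```python
from typing import NamedTuple


class SystemScore(NamedTuple):
    name: str
    accuracy: float    # hypothetical EVA-A-style score, higher is better
    experience: float  # hypothetical EVA-X-style score, higher is better


def dominates(a: SystemScore, b: SystemScore) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (
        a.accuracy >= b.accuracy
        and a.experience >= b.experience
        and (a.accuracy > b.accuracy or a.experience > b.experience)
    )


def pareto_front(scores: list[SystemScore]) -> list[SystemScore]:
    """Return the systems that no other system dominates, sorted by accuracy."""
    front = [s for s in scores if not any(dominates(o, s) for o in scores)]
    return sorted(front, key=lambda s: s.accuracy, reverse=True)


# Made-up numbers for illustration; these are not EVA benchmark results.
example_scores = [
    SystemScore("system_a", accuracy=0.82, experience=0.41),
    SystemScore("system_b", accuracy=0.64, experience=0.70),
    SystemScore("system_c", accuracy=0.60, experience=0.55),
    SystemScore("system_d", accuracy=0.45, experience=0.78),
]

front_names = [s.name for s in pareto_front(example_scores)]
print(front_names)
```

On these made-up scores, system_c drops out because system_b beats it on both axes, while the remaining three each trade one score for the other. A tradeoff of the kind described would show up as a front with several systems on it and no single system ahead on both measures.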
Key Takeaways:
EVA evaluates voice agents on both accuracy and conversational experience simultaneously
A consistent tradeoff exists between task success and user experience in current systems
The framework includes an open-source airline dataset and benchmark results for 20 models