Shriberg E, Bates R, Stolcke A, Taylor P, Jurafsky D, Ries K, Coccaro N, Martin R, Meteer M, van Ess-Dykema C
SRI International, Menlo Park, CA 94025, USA.
Lang Speech. 1998 Jul-Dec;41 (Pt 3-4):443-92. doi: 10.1177/002383099804100410.
Identifying whether an utterance is a statement, question, greeting, and so forth is integral to effective automatic understanding of natural dialog. Little is known, however, about how such dialog acts (DAs) can be automatically classified in truly natural conversation. This study asks whether current approaches, which use mainly word information, could be improved by adding prosodic information. The study is based on more than 1000 conversations from the Switchboard corpus. DAs were hand-annotated, and prosodic features (duration, pause, F0, energy, and speaking rate) were automatically extracted for each DA. In training, decision trees based on these features were inferred; trees were then applied to unseen test data to evaluate performance. Performance was evaluated for prosody models alone, and after combining the prosody models with word information, either from true words or from the output of an automatic speech recognizer. For an overall classification task, as well as three subtasks, prosody made significant contributions to classification. Feature-specific analyses further revealed that although canonical features (such as F0 for questions) were important, less obvious features could compensate if canonical features were removed. Finally, in each task, integrating the prosodic model with a DA-specific statistical language model improved performance over that of the language model alone, especially for the case of recognized words. Results suggest that DAs are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
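The integration of a prosodic model with a DA-specific language model can be illustrated with a minimal sketch. The abstract does not state the combination rule, so the log-linear form below, the function name `combine_scores`, the `weight` parameter, and all numeric values are assumptions made for illustration only; they are not results or methods taken from the study.

```python
import math

def combine_scores(lm_loglik, prosody_post, weight=1.0):
    """Pick the dialog act (DA) with the highest combined log score.

    lm_loglik:    dict mapping DA label -> log P(words | DA), a hypothetical
                  output of one DA-specific language model per label.
    prosody_post: dict mapping DA label -> P(DA | prosodic features), a
                  hypothetical posterior read off a decision-tree leaf.
    weight:       assumed tuning exponent on the prosody term; weight=0
                  reduces the decision to the language model alone.
    """
    scores = {}
    for label, loglik in lm_loglik.items():
        # Floor the posterior so a zero probability does not produce log(0).
        p = max(prosody_post.get(label, 0.0), 1e-12)
        scores[label] = loglik + weight * math.log(p)
    best = max(scores, key=scores.get)
    return best, scores

# Made-up values: the word evidence slightly favors "statement",
# while the prosodic evidence favors "question".
lm_loglik = {"statement": -12.0, "question": -12.3}
prosody_post = {"statement": 0.3, "question": 0.7}

best, scores = combine_scores(lm_loglik, prosody_post)
lm_only, _ = combine_scores(lm_loglik, prosody_post, weight=0.0)
print(best, lm_only)
```

With these illustrative inputs the language model alone selects "statement", and adding the prosody term changes the decision to "question". This is the kind of case in which prosody could help when word evidence is weak or noisy, as with recognized rather than true words.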