[Information Technology] [2014.12] Diplophonia: Definitions, Models, and Detection

梅花香——苦寒来

Published 2019-05-17 18:58:57

This post presents a doctoral thesis from Graz University of Technology, Austria (author: Dipl.-Ing. Philipp Aichinger), 154 pages in total.

Voice disorders need to be better understood because they may lead to reduced job chances and social isolation. Correct treatment indication and treatment effect measurements are needed to tackle these problems. They must rely on robust outcome measures for clinical intervention studies. Diplophonia is a severe and often misunderstood sign of voice disorders. Depending on its underlying etiology, diplophonic patients typically receive treatment such as logopedic therapy or phonosurgery. In current clinical practice, diplophonia is determined auditively by the medical doctor, which is problematic from the viewpoints of evidence-based medicine and scientific methodology. The aim of this thesis is to work towards objective (i.e., automatic) detection of diplophonia.

A database of 40 euphonic, 40 diplophonic and 40 dysphonic subjects has been acquired. The collected material consists of laryngeal high-speed videos and simultaneous high-quality audio recordings. All material has been annotated for data quality and a non-destructive data pre-selection is applied. Diplophonic vocal fold vibration patterns (i.e., glottal diplophonia) are identified and procedures for automated detection from laryngeal high-speed videos are proposed. Frequency Image Bimodality is based on frequency analysis of pixel intensity time series. It is obtained fully automatically and yields classification accuracies of 78 % for the euphonic negative group and 75 % for the dysphonic negative group. Frequency Plot Bimodality is based on frequency analysis of glottal edge trajectories. It processes spatially segmented videos, which are obtained via manual intervention. Frequency Plot Bimodality obtains slightly higher classification accuracies of 82.9 % for the euphonic negative group and 77.5 % for the dysphonic negative group.

A two-oscillator waveform model for analyzing acoustic and glottal area diplophonic waveforms is proposed and evaluated. The model is used to build a detection algorithm for secondary oscillators in the waveform and to define the physiologically interpretable "Diplophonia Diagram". The Diplophonia Diagram yields a classification accuracy of 87.2 % when distinguishing diplophonia from severely dysphonic voices. In contrast, the performance of conventional hoarseness features is low on this task. Latent class analysis is used to evaluate the ground truth from a probabilistic point of view. The expert annotations used achieve very high sensitivity (96.5 %) and perfect specificity (100 %). The Diplophonia Diagram is the best available automatic method for detecting diplophonic phonation intervals from speech.

The Diplophonia Diagram is based on model structure optimization, audio waveform modeling and analysis-by-synthesis, which enables a more suitable description of diplophonic signals than conventional hoarseness features. Analysis-by-synthesis and waveform modeling had already been carried out in voice research, but systematic investigation of model structure optimization with respect to perceived voice quality is novel. For diplophonia, the switch between one and two oscillators is crucial. Optimal model structure is a qualitative outcome that may be interpreted physiologically, and one may conjecture that model structure optimization is also useful for describing voice phenomena other than diplophonia. The obtained descriptors might be more easily accepted by clinicians than the conventional ones.

Useful definitions of diplophonia focus on the levels of perception, acoustics and glottal vibration. Due to its subjectivity, it is suggested to avoid the sole use of the perceptual definition in clinical voice assessment. The glottal vibration level connects with distal causes, which is of high clinical interest but difficult to assess. The definition at the acoustic level via two-oscillator waveform models is favored and used for in vivo testing. Updating definitions and terminology of voice phenomena with respect to different levels of description is suggested.
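
To make the two-oscillator idea in the abstract more concrete, here is a minimal Python sketch. It is only an illustration and not the model or detector from the thesis: the function names, the 1/k harmonic amplitudes, the sampling rate and the peak-picking rule are all assumptions made for this example. It synthesizes a waveform as the sum of two harmonic oscillators with independent fundamental frequencies and then reads the two strongest spectral peaks back out of the magnitude spectrum.

```python
import numpy as np

def two_oscillator_waveform(f1, f2, fs=8000, duration=0.5, n_harmonics=3, a2=0.6):
    """Sum of two harmonic oscillators with fundamentals f1 and f2 (Hz).

    All parameters are illustrative assumptions, not values from the thesis.
    """
    t = np.arange(int(fs * duration)) / fs

    def oscillator(f0):
        # Harmonic k has amplitude 1/k (an arbitrary choice for this sketch).
        return sum(np.sin(2 * np.pi * k * f0 * t) / k
                   for k in range(1, n_harmonics + 1))

    # The secondary oscillator is scaled by a2 relative to the primary one.
    return t, oscillator(f1) + a2 * oscillator(f2)

def strongest_peaks(signal, fs, n_peaks=2):
    """Frequencies (Hz) of the n_peaks largest local maxima of the spectrum."""
    windowed = signal * np.hanning(len(signal))
    mag = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    # A bin is a local maximum if it exceeds both of its neighbours.
    is_peak = (mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:])
    idx = np.flatnonzero(is_peak) + 1
    top = idx[np.argsort(mag[idx])[-n_peaks:]]
    return np.sort(freqs[top])

# Example: a primary oscillator at 100 Hz and a weaker secondary one at 140 Hz.
t, x = two_oscillator_waveform(100.0, 140.0)
print(strongest_peaks(x, fs=8000))
```

With these settings the two strongest peaks should fall at the two fundamentals, because the secondary fundamental (relative amplitude 0.6) is still stronger than the second harmonic of the primary oscillator (relative amplitude 0.5). Real voice signals are far less clean, which is presumably why the thesis relies on model structure optimization and analysis-by-synthesis rather than simple peak picking.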

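1 Introduction

2 A database of laryngeal high-speed videos with simultaneous high-quality audio recordings

3 Spatial analysis and models of vocal fold vibration

4 Two-oscillator waveform models for analyzing diplophonia

5 Diagnostic tests and their interpretation

6 Discussion, conclusion and outlook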

Download the original English text:

http://page2.dfpan.com/fs/fl7c3jd26211f2e9160/
