Training Target Optimization of Deep Neural Network in Speech Recognition

Authors: Chen Mengzhe (Key Lab. of Speech Acoustics and Content Understanding, Chinese Academy of Sciences); Zhang Qingqing (Key Lab. of Speech Acoustics and Content Understanding, Chinese Academy of Sciences); Pan Jielin (Key Lab. of Speech Acoustics and Content Understanding, Chinese Academy of Sciences); Yan Yonghong (Key Lab. of Speech Acoustics and Content Understanding, Chinese Academy of Sciences)

Received: 2015-04-12    Year, Volume(Issue), Pages: 2016, 48(1): 166-172

Journal Name: Advanced Engineering Sciences (工程科学与技术)

Key words: speech recognition; deep neural network; forward-backward algorithm; target optimization

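Funding: National Natural Science Foundation of China (11161140319; 91120001; 61271426); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030100; XDA06030500); National "863" Program (2012AA012503); Key Deployment Project of the Chinese Academy of Sciences (KGZD-EW-103-2)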

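Abstract (translated from the Chinese)

When deep neural network acoustic models are trained, the targets obtained by forced alignment cannot accurately represent the actual state of the speech. To address this problem, a method is proposed that uses the forward-backward algorithm to obtain targets with a non-0-1 distribution. Forced-alignment targets are unreasonable for two reasons: the model used for forced alignment may not fully match the utterances being processed, and the continuity of pronunciation makes transition boundaries hard to separate. The new targets express the probabilities with which a frame belongs to each of several neighboring states. They describe the transitions between modeling units in more detail, recover more of the original character of the speech, and improve the robustness of the model. At the same time, to balance model robustness against the discriminability of the modeling units, the targets obtained by the algorithm are windowed. Experiments were conducted in the Chinese customer-service question-answering domain. Experiments on a small amount of data verified that the targets have a considerable influence on training and were used to select the window length. Finally, the amount of training data was increased to 60, 80, and 100 h. The results show that models trained with the new target optimization method achieve better recognition performance, with a relative character error rate reduction of 1.10% to 3.65%. Multiple sets of experiments verify that the new target optimization method is effective for model training and remains effective as the amount of training data increases.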

English Abstract

The targets conventionally used to train acoustic models cannot reflect the nature of speech exactly. To improve them, a new kind of target obtained with the forward-backward algorithm is proposed. In the proposed target, a speech frame is aligned to several adjacent states with different probabilities. The new target improves the robustness of the model because it describes the transition boundaries and reflects the nature of speech more exactly. Meanwhile, as a trade-off between model robustness and the distinction among modeling units, the targets obtained with the forward-backward algorithm are windowed. The experiments were carried out on Mandarin conversational speech recognition in the customer service domain. A small set of training data was first used to verify the importance of the targets in training and to determine the window length. The amount of training data was then increased to 60, 80, and 100 hours. The results show that the proposed system achieves consistent improvements, with a relative character error rate reduction ranging from 1.10% to 3.65%. All of the experiments verify the effectiveness of the proposed target.
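The idea of replacing 0-1 forced-alignment targets with per-frame state posteriors can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple left-to-right HMM over the transcript's state sequence with a single hypothetical self-loop probability `p_stay`, and the function names `soft_targets` and `window_targets` are made up for illustration. The abstract does not say how the window is applied, so the rule below (keep states within `half_width` of the most probable state, then renormalize) is an assumption.

```python
import numpy as np

def soft_targets(loglik, p_stay=0.5):
    """Forward-backward state posteriors for a left-to-right HMM.

    loglik: (T, S) array of log-likelihoods of each frame under each of the
            S states of the transcript, in order. p_stay is an assumed
            self-loop probability shared by all states.
    Returns a (T, S) array whose rows sum to 1 (the non-0-1 targets).
    """
    T, S = loglik.shape
    if T < S:
        raise ValueError("need at least as many frames as states")
    log_stay, log_next = np.log(p_stay), np.log(1.0 - p_stay)
    neg_inf = -np.inf

    # Forward pass: the path must start in the first state.
    alpha = np.full((T, S), neg_inf)
    alpha[0, 0] = loglik[0, 0]
    for t in range(1, T):
        stay = alpha[t - 1] + log_stay
        move = np.concatenate(([neg_inf], alpha[t - 1, :-1] + log_next))
        alpha[t] = np.logaddexp(stay, move) + loglik[t]

    # Backward pass: the path must end in the last state.
    beta = np.full((T, S), neg_inf)
    beta[T - 1, S - 1] = 0.0
    for t in range(T - 2, -1, -1):
        nxt = beta[t + 1] + loglik[t + 1]
        stay = nxt + log_stay
        move = np.concatenate((nxt[1:] + log_next, [neg_inf]))
        beta[t] = np.logaddexp(stay, move)

    # Per-frame posterior over states, normalized in the log domain.
    log_gamma = alpha + beta
    log_gamma -= np.logaddexp.reduce(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)

def window_targets(gamma, half_width=1):
    """Assumed windowing rule: zero out states farther than half_width
    from the most probable state of each frame, then renormalize."""
    S = gamma.shape[1]
    center = gamma.argmax(axis=1)
    mask = np.abs(np.arange(S)[None, :] - center[:, None]) <= half_width
    windowed = gamma * mask
    return windowed / windowed.sum(axis=1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_loglik = rng.normal(size=(12, 4))  # synthetic scores, not real data
    print(window_targets(soft_targets(toy_loglik), half_width=1).round(3))
```

Under this assumed rule, `half_width=0` collapses every frame back to a one-hot target, while a larger window keeps more of the posterior mass on neighboring states. That is one way to picture the trade-off between robustness and discriminability that the abstract describes.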
