
JBE, vol. 25, no. 5, pp. 742-749, September 2020


Performance Enhancement of Phoneme and Emotion Recognition by Multi-task Training of Common Neural Network

Jaewon Kim and Hochong Park



This paper proposes a method for recognizing both phonemes and emotions with a common neural network, together with a multi-task training method for that network. The common neural network performs the same function for both recognition tasks, which mirrors the way humans recognize multiple kinds of information with a single auditory system. Multi-task training models features that are applicable to both kinds of information and provides generalized training, which improves performance by reducing the overfitting that occurs when a network is trained individually for each kind of information. A method for further increasing phoneme recognition performance by applying a weight to the phoneme task during multi-task training is also proposed. With the same feature vector and neural network, the proposed common neural network with multi-task training is confirmed to provide higher performance than individual networks trained separately for each task.
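The weighting idea can be illustrated with a minimal sketch. The abstract does not state the form of the loss, so the following assumes a weighted sum of two softmax cross-entropy terms computed on the two outputs of the common network. The function names, the `phoneme_weight` parameter, and the example values are hypothetical and are not taken from the paper.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, given raw class scores and the true class index."""
    # Subtract the maximum score before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def multitask_loss(phoneme_logits, phoneme_label,
                   emotion_logits, emotion_label,
                   phoneme_weight=1.0):
    """Assumed combined loss: phoneme_weight * L_phoneme + L_emotion.

    Both sets of logits are assumed to come from the same (common) network,
    so the gradient of this sum updates the shared layers for both tasks.
    """
    phoneme_loss = softmax_cross_entropy(phoneme_logits, phoneme_label)
    emotion_loss = softmax_cross_entropy(emotion_logits, emotion_label)
    return phoneme_weight * phoneme_loss + emotion_loss


# Example with made-up scores: 3 phoneme classes, 4 emotion classes.
loss = multitask_loss([2.0, 0.5, -1.0], 0, [0.1, 0.3, 1.2, -0.4], 2,
                      phoneme_weight=2.0)
```

Under this assumption, a `phoneme_weight` greater than 1 makes phoneme errors count more heavily in the shared update, which is one plausible reading of "applying a weight to the phoneme task".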

Keywords: deep neural network, common recognition, multi-task training, emotion recognition, phoneme recognition


