
JBE, vol. 22, no. 5, pp.632-642, September, 2017

DOI: https://doi.org/10.5909/JBE.2017.22.5.632

DNN-Based Speech Detection for Media Audio

Inseon Jang, ChungHyun Ahn, Jeongil Seo, and Younseon Jang

Corresponding Author E-mail: jangys@cnu.ac.kr

Abstract:

In this paper, we propose a DNN-based speech detection system that exploits the acoustic characteristics and context information of media audio. Speech detection, which discriminates between the speech and non-speech portions of media audio, is a necessary preprocessing step for effective speech processing. However, because media audio contains many different types of sound sources, conventional signal processing techniques have had difficulty achieving high performance. The proposed method improves detection performance by separating the media audio into harmonic and percussive components and by constructing a DNN input vector that reflects the acoustic characteristics and context information of the audio. To verify the performance of the proposed system, we built a speech detection data set from more than 20 hours of drama and additionally acquired a publicly available 8-hour Hollywood movie data set for the experiments. The experiments show, through cross validation on the two data sets, that the proposed system outperforms the conventional method.
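The abstract names two concrete steps: harmonic/percussive separation of the media audio, and a DNN input vector that carries context information. The sketch below is a minimal illustration of both ideas, not the authors' implementation; it assumes Python with librosa and NumPy, uses median-filtering HPSS, and takes MFCCs and a context width of 5 frames as illustrative stand-ins for the features described in the paper.

import numpy as np
import librosa

def speech_detection_features(path, context=5, sr=16000):
    """Build per-frame DNN input vectors from a media audio file (sketch)."""
    y, _ = librosa.load(path, sr=sr)

    # Harmonic/percussive separation by median filtering of the spectrogram.
    y_harm, y_perc = librosa.effects.hpss(y)

    # Per-frame acoustic features from each component; MFCCs are an
    # illustrative choice, not necessarily the features used in the paper.
    mfcc_h = librosa.feature.mfcc(y=y_harm, sr=sr, n_mfcc=13)
    mfcc_p = librosa.feature.mfcc(y=y_perc, sr=sr, n_mfcc=13)
    frames = np.vstack([mfcc_h, mfcc_p]).T        # shape: (n_frames, 26)

    # Context information: concatenate the +/- `context` neighbouring
    # frames so each DNN input reflects local temporal structure.
    n = len(frames)
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    stacked = np.hstack([padded[i:i + n] for i in range(2 * context + 1)])
    return stacked                                # (n_frames, 26 * (2*context + 1))

Each row of the returned matrix would then be classified by the DNN as speech or non-speech; the harmonic/percussive split matters because music and effects dominate different components than speech does.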



Keywords: Speech Detection, Voice Activity Detection

