
JBE, vol. 24, no. 3, May 2019

DOI: https://doi.org/10.5909/JBE.2019.24.3.387

Audio High-Band Coding based on Autoencoder with Side Information

Hyo-Jin Cho, Seong-Hyeon Shin, Seung Kwon Beack, Taejin Lee, and Hochong Park

Corresponding Author E-mail: hcpark@kw.ac.kr

Abstract:

In this study, a new method of audio high-band coding based on an autoencoder with side information is proposed. The proposed method operates in the MDCT domain and improves coding performance by using additional side information consisting of the low bands of the previous and current frames, unlike a conventional autoencoder, whose input is limited to the information to be encoded. Moreover, because the side information spans a time-frequency region, the high-band coder can exploit the temporal characteristics of the signal. In the proposed method, the encoder transmits a 4-dimensional latent vector computed by the autoencoder and a gain variable, using 12 bits per frame. The decoder reconstructs the high band by feeding the autoencoder the decoded low bands of the previous and current frames together with the transmitted information. Subjective evaluation confirms that the proposed method provides performance equivalent to that of SBR at approximately half the bit rate of SBR.
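For illustration, the sketch below shows how an autoencoder with side information of this kind might be structured in PyTorch. Only the 4-dimensional latent vector and the use of the previous and current low bands as side information come from the abstract; the layer sizes (high_dim, low_dim, hidden), activations, and concatenation scheme are hypothetical assumptions, and the 12-bit quantization of the latent vector and gain variable is omitted.

import torch
import torch.nn as nn

class HighBandAutoencoder(nn.Module):
    """Illustrative sketch: high-band MDCT coding with low-band side information."""

    def __init__(self, high_dim=256, low_dim=256, latent_dim=4, hidden=512):
        super().__init__()
        # Encoder: compresses the current frame's high-band MDCT
        # coefficients into a 4-dimensional latent vector (transmitted).
        self.encoder = nn.Sequential(
            nn.Linear(high_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim),
        )
        # Decoder: reconstructs the high band from the latent vector plus
        # side information (decoded low bands of the previous and current
        # frames), which supplies temporal context.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + 2 * low_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, high_dim),
        )

    def forward(self, high_band, low_prev, low_curr):
        z = self.encoder(high_band)                     # latent vector to transmit
        side = torch.cat([low_prev, low_curr], dim=-1)  # side information
        high_hat = self.decoder(torch.cat([z, side], dim=-1))
        return high_hat, z

In the actual codec, z together with a gain variable would be quantized to 12 bits per frame before transmission, and the decoder would operate on the dequantized latent vector; the quantizer design is not described in the abstract and is therefore not sketched here.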



Keywords: autoencoder, neural network, audio high-band coding, side information

References:
[1] ISO/IEC 11172-3, “Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s - Part 3,” 1993.
[2] M. Dietz, L. Liljeryd, K. Kjörling, and O. Kunz, “Spectral band replication, a novel approach in audio coding,” 112th Conv. Audio Eng. Soc., May 2002.
[3] C. R. Helmrich, et al., “Spectral envelope reconstruction via IGF for audio transform coding,” Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Brisbane, Australia, pp. 389-393, 2015.
[4] L. Jiang, R. Hu, X. Wang, W. Tu, and M. Zhang, “Nonlinear prediction with deep recurrent neural networks for non-blind audio bandwidth extension,” China Communications, vol. 15, no. 1, pp. 72-85, Jan. 2018.
[5] K. Schmidt and B. Edler, “Blind bandwidth extension based on convolutional and recurrent deep neural networks,” Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Calgary, Canada, pp. 5444-5448, 2018.
[6] G. E. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504-507, 2006.
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[8] Y. N. Dauphin, et al., “Language modeling with gated convolutional networks,” Proc. of the 34th Int. Conf. on Machine Learning, vol. 70, Sydney, Australia, pp. 933-941, 2017.
[9] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” Proc. of Int. Conf. on Learning Representations, San Diego, USA, 2015.
[10] C. Veaux, et al., “Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2016.
[11] M. Goto, “Development of the RWC music database,” Proc. of Int. Congress on Acoustics, vol. 1, pp. 553-556, April 2004.
[12] ISO/IEC JTC1/SC29/WG11 N9927, “Workplan for subjective testing of Unified Speech and Audio Coding proposals,” April 2008.
[13] S. Beack, et al., “Single-mode-based Unified Speech and Audio Coding by extending the linear prediction domain coding mode,” ETRI Journal, vol. 39, no. 3, pp. 310-318, 2017.
[14] ITU-R BS.1534-3, “Method for the subjective assessment of intermediate quality level of audio systems,” 2015.
