Deep Image Features in Music Information Retrieval

Grzegorz Gwardys, Daniel Michał Grzywczak


Applications of Convolutional Neural Networks (CNNs) to various
problems have been the subject of a number of recent studies, ranging from image classification and object detection to scene parsing, segmentation of 3D volumetric images, and action recognition in videos.
In this study, CNNs were applied to Music Information Retrieval (MIR),
in particular to musical genre recognition.
The model was trained on ILSVRC-2012 (more than 1 million natural images) to perform image classification and was reused to perform genre classification on spectrogram images. Harmonic/percussive separation was applied because the harmonic and percussive components are characteristic of musical genre.
In the final stage, various strategies for merging Support Vector Machines (SVMs) were evaluated on GTZAN, a dataset well known in the MIR community.
Even though the model was trained on natural images, the results achieved in this study were close to the state of the art.
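
The merging step can be illustrated with a minimal sketch. The abstract does not say which merging strategies were evaluated, so the two rules below (score averaging and majority voting), the function names, the three-genre label set, and the per-class scores are all assumptions made for illustration, not the authors' method. Each dictionary stands in for the output of one hypothetical SVM, trained on features from the original, harmonic, or percussive spectrogram.

```python
from collections import Counter


def fuse_by_mean(score_sets):
    """Average per-class scores across classifiers and return the top label."""
    labels = score_sets[0].keys()
    mean = {g: sum(s[g] for s in score_sets) / len(score_sets) for g in labels}
    return max(mean, key=mean.get)


def fuse_by_vote(score_sets):
    """Let each classifier vote for its top label and return the most common one."""
    votes = Counter(max(s, key=s.get) for s in score_sets)
    return votes.most_common(1)[0][0]


# Hypothetical per-class scores for one audio clip (illustrative values only).
original = {"jazz": 0.40, "rock": 0.50, "metal": 0.10}
harmonic = {"jazz": 0.85, "rock": 0.10, "metal": 0.05}
percussive = {"jazz": 0.35, "rock": 0.45, "metal": 0.20}

scores = [original, harmonic, percussive]
print("mean fusion:", fuse_by_mean(scores))
print("vote fusion:", fuse_by_vote(scores))
```

With these made-up scores the two rules disagree: averaging favors "jazz" because one classifier is very confident, while voting favors "rock" because two of the three classifiers rank it first. Differences of this kind are one reason to compare several merging strategies.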

International Journal of Electronics and Telecommunications
is a periodical of the Electronics and Telecommunications Committee
of the Polish Academy of Sciences

eISSN: 2300-1933