THESIS
2016
xi, 48 pages : illustrations (some color) ; 30 cm
Abstract
Automatic music recommendation requires music information retrieval tasks
ranging from classifying genre and mood, to artist relatedness. In this thesis, we
propose using deep learning models for music classification tasks and provide both
the motivation and analysis of why deep learning models are suitable for these tasks.
Similar to many tasks that require supervised learning, music information retrieval
has traditionally required an elaborate feature engineering procedure that depends
heavily on human labor. Much previous research has focused on discovering the most
"salient" features for music, a quest that has proven challenging. In this thesis, we
investigate neural networks to explore their ability to engineer features automatically,
without human intervention. Convolutional Neural Networks (CNNs) have been shown to
classify images directly from pixels in the image recognition field. It has been shown
that "neurons" in a CNN can be driven by the raw input data to adapt the final model
to a desirable state, in a supervised learning paradigm.
In this thesis, we show for the first time that deep learning models can learn
directly from raw time-domain audio data and bypass both signal processing and
feature engineering steps in music classification tasks. We carry out our experiments
both on audio clips and lyrics. We compare the results from models learned from
raw data to those from feature-engineered data, showing slightly better results with
raw audio input. Moreover, we give a full analysis of the output of each convolutional
layer, showing that the CNN is indeed performing signal processing and feature
extraction automatically. We also show that a CNN with word embeddings, without
lexicon features, can directly classify lyrics from words. Our approach of using a
CNN to classify music from raw time-domain audio data is later applied to a speech
emotion recognition task, demonstrating its generality.
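The core idea of the thesis (a convolutional layer consuming raw time-domain audio and acting as a learned filter bank, followed by a nonlinearity and pooling) can be sketched minimally in NumPy. This is an illustrative sketch only: the filter weights, kernel size, stride, and pooling width below are hypothetical choices, not the architecture used in the thesis, and in a real model the kernel would be learned by backpropagation rather than fixed.

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Valid-mode strided 1-D cross-correlation, as in a CNN layer."""
    k = len(kernel)
    n_out = (len(signal) - k) // stride + 1
    return np.array([np.dot(signal[i * stride:i * stride + k], kernel)
                     for i in range(n_out)])

def relu(x):
    """Pointwise nonlinearity applied after the convolution."""
    return np.maximum(x, 0.0)

def max_pool(x, size):
    """Non-overlapping max pooling to downsample the feature map."""
    n_out = len(x) // size
    return x[:n_out * size].reshape(n_out, size).max(axis=1)

# Toy "raw audio": a 440 Hz sine sampled at 16 kHz, 100 ms long.
sr = 16000
t = np.arange(sr // 10) / sr
audio = np.sin(2 * np.pi * 440 * t)

# One hypothetical filter; a trained CNN would learn many such kernels.
kernel = np.hanning(64)

# conv -> ReLU -> pool: one convolutional "feature extraction" stage.
features = max_pool(relu(conv1d(audio, kernel, stride=16)), size=4)
print(features.shape)
```

Stacking several such stages, each with many learned kernels, is what lets the network subsume the signal-processing and feature-engineering steps that the abstract says are bypassed.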