CS 7641 Project Proposal (Team 04)
Anirudh Prabakaran, Himanshu Mangla, Shaili Dalmia, Vaibhav Bhosale, Vijay Saravana Jaishanker
Proposal Video
Introduction/Background:
“You are what you stream” - Music is often considered one of the best and most enjoyable ways to get to know a person. Spotify, valued at over $26 billion, leads the music streaming industry. Machine learning is at the core of its research, and the company has invested significant engineering effort to answer the ultimate question: “What kind of music are you into?”
There is also an interesting correlation between music and user mood: certain types of music are associated with certain moods. ML researchers can potentially exploit this information to understand more about their users.
The use of machine learning to retrieve such valuable information from music has been well explored in academia. Much of the highly cited prior work [1,2] in this area dates back over two decades, when significant progress was made by pre-processing raw audio tracks for features such as timbre, rhythm, and pitch.
More recent advances in machine learning have created new opportunities in music analysis through visual representations of audio obtained via spectral transformations. This has opened the frontier to applying knowledge developed for image classification to this problem.
Problem Definition:
In this project, we want to compare music analysis using an audio-metadata-based approach against a spectral-image (content-based) approach, using multiple machine learning techniques such as logistic regression and decision trees. We will use the following tasks for our analysis:
- Genre Classification [1,2]
- Mood Prediction [3]
Methods:
For both tasks - genre classification and mood prediction - we want to compare the outcomes of the metadata-based approach and the content-based approach using multiple ML techniques.
For the metadata-based approach, we plan to use the following algorithms (a minimal sketch of this pipeline follows the list):
Supervised: Logistic Regression, Decision Trees, Random Forest, Support Vector Machines, K-Nearest Neighbor
Unsupervised: K-Means, Gaussian Mixture Models, DBSCAN, Hierarchical Clustering
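As a first pass, the metadata pipeline might look like the sketch below, built on scikit-learn. The file name and the feature/label columns (danceability, energy, tempo, valence, genre) are placeholders for whichever metadata features we end up extracting, not final choices:

```python
# Minimal sketch of the metadata-based pipeline (scikit-learn).
# "track_metadata.csv" and the feature/label columns are placeholders.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("track_metadata.csv")  # hypothetical feature dump
X = df[["danceability", "energy", "tempo", "valence"]]
y = df["genre"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Supervised: fit each candidate model and compare held-out accuracy.
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    clf = make_pipeline(StandardScaler(), model)
    clf.fit(X_tr, y_tr)
    print(type(model).__name__, clf.score(X_te, y_te))

# Unsupervised: cluster the same features, then inspect cluster/genre overlap.
clusters = KMeans(n_clusters=10, random_state=42).fit_predict(
    StandardScaler().fit_transform(X))
```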
Similarly, we will utilize both supervised and unsupervised methods for the content-based approach (see the spectrogram sketch after this list). These include:
Supervised: Convolutional Neural Networks, Convolutional Recurrent Neural Network
Unsupervised: Self-Organizing Maps
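A minimal sketch of the content-based pipeline, assuming librosa for spectrogram extraction and Keras for the CNN; the network shape and input dimensions below are placeholders for experimentation, not a final architecture:

```python
# Sketch of the content-based pipeline: audio -> mel-spectrogram -> small CNN.
# The architecture and input shape are placeholders, not final choices.
import librosa
import numpy as np
from tensorflow import keras

def to_mel(path, sr=22050, n_mels=128):
    """Load a clip and return its log-scaled mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr, duration=30.0)  # GTZAN clips are 30 s
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

n_genres = 10  # GTZAN has 10 genres
model = keras.Sequential([
    keras.layers.Input(shape=(128, 1292, 1)),  # (mel bands, frames, channels)
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(2),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(n_genres, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```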
We plan to use the following datasets (and APIs) for our project; a sketch of querying the Spotify API follows the list:
- Million Song Dataset Challenge (Kaggle) [4]
- Music Genre Classification (GTZAN Dataset) [5]
- Spotify APIs [6]
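For reference, a sketch of pulling track-level audio features through the Spotify Web API using the third-party spotipy client; the credentials and track ID below are placeholders:

```python
# Sketch of fetching per-track audio features via the Spotify Web API,
# using the third-party spotipy client. Credentials are placeholders.
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
))

track_ids = ["3n3Ppam7vgaVa1iaRUc9Lp"]   # example track ID
features = sp.audio_features(track_ids)  # danceability, energy, valence, ...
print(features[0])
```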
Potential Results and Discussion:
Using the techniques discussed above, we hope to answer the key question of how a metadata-based approach compares with a content-based one on both tasks (genre classification and mood prediction). We will use standard metrics - precision, recall, and F1 score for each genre/mood - for this comparison.
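Concretely, the per-class report could be generated with scikit-learn; the labels below are toy values shown only to illustrate the output format:

```python
# Per-class precision/recall/F1 via scikit-learn. The labels here are toy
# values; in practice y_true/y_pred come from each model's test predictions.
from sklearn.metrics import classification_report

y_true = ["rock", "jazz", "rock", "pop", "jazz", "pop"]
y_pred = ["rock", "jazz", "pop", "pop", "rock", "pop"]
print(classification_report(y_true, y_pred))
```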
Additionally, we are interested in analyzing the correlation between the genres identified by the model and the moods it predicts.
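One simple way to quantify that association is a genre-by-mood contingency table with Cramér's V; the tiny DataFrame below is illustrative only:

```python
# Quantify the genre/mood association with a contingency table and
# Cramér's V. The tiny DataFrame is illustrative only.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "genre": ["rock", "rock", "jazz", "jazz", "pop", "pop"],
    "mood":  ["angry", "happy", "calm", "calm", "happy", "happy"],
})
table = pd.crosstab(df["genre"], df["mood"])
chi2, p, dof, _ = chi2_contingency(table)
n = table.to_numpy().sum()
cramers_v = (chi2 / (n * (min(table.shape) - 1))) ** 0.5
print(table)
print(f"Cramér's V = {cramers_v:.2f} (p = {p:.3f})")
```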
Proposed Timeline
References
[1] Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Transactions on speech and audio processing, 10(5), 293-302.
[2] Scaringella, N., Zoia, G., & Mlynek, D. (2006). Automatic genre classification of music content: a survey. IEEE Signal Processing Magazine, 23(2), 133-141.
[3] Delbouys, R., Hennequin, R., Piccoli, F., Royo-Letelier, J., & Moussallam, M. (2018). Music mood detection based on audio and lyrics with deep neural net. arXiv preprint arXiv:1809.07276.
[4] Kaggle Million Song Dataset: https://www.kaggle.com/c/msdchallenge
[5] Music Genre Classification | GTZAN Dataset: https://www.kaggle.com/andradaolteanu/gtzan-dataset-music-genre-classification
[6] Spotify APIs: https://developer.spotify.com/discover/
Stretch Goals
If we finish the above analysis early, we would like to explore the following two questions as stretch goals:
- Can we characterize the difference between the metadata-based and content-based approaches in terms of the information each captures?
- Can we use a multimodal approach, combining elements of both the metadata-based and content-based approaches, to develop a more sophisticated model?