This is a subset of the Million Song Dataset from Columbia University, New York:
http://labrosa.ee.columbia.edu/millionsong
a collaboration between LabROSA (Columbia University) and The Echo Nest.
Prepared by T. Bertin-Mahieux
The database can also be downloaded from:
http://www.samyzaf.com/ML/song_year/song_year.zip
It consists of 515,345 records of songs that were composed during the years 1922-2011. Each record consists of 91 features: the first is the year in which the song was composed, and the remaining 90 are various floating-point quantities related to the song audio. More information can be obtained from the links above.
The interesting question raised with regard to this database is: is there a strong relation between the musical features of a song and the year it was composed? In other words: can we design a small neural network that can predict the year from the other 90 musical features? A positive answer to this question would reveal a profound insight into the nature of musical composition, and, more importantly, an insight backed by a quantitative, testable experiment.
According to the database authors, we should respect the following train/test split: the first 463,715 songs for training and the last 51,630 songs for testing.
It avoids the 'producer effect' by making sure no song from a given artist ends up in both the train and test set.
Quote from the README file: "Each song owns 90 attributes, 12 = timbre average, 78 = timbre covariance. The first value is the year (target), ranging from 1922 to 2011. Features extracted from the 'timbre' features from The Echo Nest API. We take the average and covariance over all 'segments', each segment being described by a 12-dimensional timbre vector."
As we are not musical experts, the precise meaning of the 90 features is of no interest to us, so we will simply name them: t1, t2, t3, ..., t90.
We assume you have access to a Python 2.7 environment equipped with all required modules.
The only module which is not yet public is our private kerutils module which can be
downloaded from:
http://www.samyzaf.com/cgi-bin/view_file.py?file=ML/lib/kerutils.py
import pandas as pd
import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix
from matplotlib import rcParams
rcParams['figure.figsize'] = 10,7
rcParams['axes.grid'] = True
# Our deep learning library is Keras
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.models import load_model
from keras.regularizers import l2
from keras.utils import np_utils
import numpy as np
# fixed random seed for reproducibility
np.random.seed(0)
import sys
sys.path.append("c:/ml/lib")
from kerutils import *
%matplotlib inline
# These are css/html styles for good looking ipython notebooks
from IPython.core.display import HTML
css = open('c:/ml/style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
# Column names: 'year' followed by t1,...,t90 for the 90 audio features
features = ['year'] + ['t%d' % i for i in range(1, 91)]
# Note that our classes (which we have to predict from those 90 features), are all
# the years from 1922 to 2011: 1922, 1923, 1924, 1925, ..., 2011
# There are exactly 90 years, so we also have 90 classes:
nb_classes = 90
# The following line loads our database into a Pandas DataFrame object
data = pd.read_csv('YearPredictionMSD.csv', names=features)
This is a huge table (515,345 rows and 91 columns!). To get an idea of how it looks, let's peek at a random part of it: rows 100-110 and columns 0-10.
Here is the Pandas command for doing this:
data.ix[100:110, 0:10]
# How many rows and columns do we have?
data.shape
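As a quick sanity check, we can also confirm that the year column spans the expected range 1922-2011:
# The minimal and maximal composition years in the database
print data.year.min(), data.year.max()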
It is always a good idea to fiddle with the data in order to get acquainted with it before we start grinding it in a neural network. Let's count how many songs we have from each year.
nsongs = {}
for y in range(1922, 2012):
    nsongs[y] = len(data[data.year == y])
    print "Year=%d, nsongs=%d" % (y, nsongs[y])
Actually, it will be more useful to draw a histogram:
years = range(1922,2012)
values = [nsongs[y] for y in years]
plt.bar(years, values, align='center')
plt.xlabel("Year")
plt.ylabel("Number of songs")
Let's get an idea about the type of values the 90 features take. We'll just peek at the first 10 features and see how their sorted values are distributed across the songs:
for t in features[1:11]:
    values = data[t].as_matrix()
    plt.plot(sorted(values), label=t, linewidth=1)
plt.legend(loc='upper center', bbox_to_anchor=(0.4, 1.04), ncol=3, fancybox=True, shadow=True)
Most neural-network practitioners suggest that for a more efficient network we should normalize these values to the unit interval [0,1]. We should also normalize our target feature, the year, to a set of integers starting from 0.
This is easily done with the Numpy package:
X = data.ix[:,1:].as_matrix() # the 90 feature columns, without the year
Y = data.ix[:,0].as_matrix() # the year column
# data normalizations (scaling down all values to the interval [0,1])
# The years 1922-2011 are scaled down to integers [0,1,2,..., 89]
a = X.min()
b = X.max()
X = (X - a) / (b - a) # all values now between 0 and 1 !
Y = Y - Y.min() # The years 1922-2011 are mapped to 0-89
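Note that we scale with a single global minimum and maximum over all 90 features. A common alternative (which we do not use in this notebook) is per-column scaling, which can help when the feature ranges differ widely. A minimal sketch, applied to the raw feature matrix:
# Per-feature scaling (alternative sketch, not used here): each column
# is scaled by its own min/max instead of the global ones
a_col = X.min(axis=0)
b_col = X.max(axis=0)
X_alt = (X - a_col) / (b_col - a_col)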
# Training data set
X_train = X[0:463715]
y_train = Y[0:463715]
# Validation data set
X_test = X[463715:]
y_test = Y[463715:]
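A quick check that the split sizes match the authors' prescription:
# Expect (463715, 90) for training and (51630, 90) for testing
print X_train.shape, X_test.shape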
Let's check that everything is fine by plotting 40 samples of X_train (from item 1600 to item 1639):
for i in range(1600, 1640):
    plt.plot(X_train[i], label='song_' + str(i))
plt.xlabel("Feature")
plt.ylabel("Value")
plt.legend(loc='upper center', bbox_to_anchor=(0.8, 0.9), ncol=5, fancybox=True, shadow=True, fontsize=7)
Indeed, the values seem to be well normalized to the 0.0-1.0 scale. Here is how our classes (years) look. Let's sort them and plot them to verify that they range between 0 and 89:
plt.plot(sorted(y_train))
As we saw in previous examples, Keras does not work with integer classes, and we will have to convert our list of years to one-hot vectors:
0  → [1,0,0,0,...,0]
1  → [0,1,0,0,...,0]
2  → [0,0,1,0,...,0]
...
89 → [0,0,0,0,...,1]
This is easily done with the to_categorical function from the Keras np_utils module:
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
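Here is a small worked example of to_categorical, mapping class 2 out of 5 classes to its one-hot vector:
# Should print [[ 0.  0.  1.  0.  0.]]
print np_utils.to_categorical([2], 5)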
We're now all set with our data, so we can proceed to build a neural network with Keras and find out whether there is any mysteriously hidden connection between the 90 tone/sound attributes and the year in which a song was composed.
Let's start with a simple neural network that consists of an input layer of 90 neurons (one per feature), a hidden layer of 90 neurons, and a softmax output layer of 90 neurons (one per class):
# Our first Keras Model
model1 = Sequential()
model1.add(Dense(90, input_shape=(90,)))
model1.add(Dense(90, init='uniform', activation='softmax'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
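A caveat about the metric: with one-hot targets and loss='binary_crossentropy', Keras averages the accuracy over all 90 output bits, so even a model that outputs mostly zeros scores close to 99%. The conventional multiclass setup uses categorical_crossentropy, which reports the stricter one-of-90 class accuracy. Here is a minimal sketch of that variant (not the configuration used in this notebook):
# Variant of model1 with the conventional multiclass loss; its reported
# accuracy is the fraction of songs whose year class is predicted exactly,
# which is typically much lower than the binary accuracy above
model1c = Sequential()
model1c.add(Dense(90, input_shape=(90,)))
model1c.add(Dense(90, init='uniform', activation='softmax'))
model1c.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])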
To be able to capture and save the best state of this model, we will use the FitMonitor
callback from our kerutils module, which can be downloaded from:
http://www.samyzaf.com/cgi-bin/view_file.py?file=ML/lib/kerutils.py
We will also use the following utility for printing scores and displaying graphs of the training and validation accuracy and loss values. h is the history object returned by the training function (see below).
def show_scores(model, h):
    loss, acc = model.evaluate(X_train, Y_train, verbose=0)
    print "Training: accuracy = %.6f loss = %.6f" % (acc, loss)
    loss, acc = model.evaluate(X_test, Y_test, verbose=0)
    print "Validation: accuracy = %.6f loss = %.6f" % (acc, loss)
    print "Over fitting score = %.6f" % over_fitting_score(h)
    print "Under fitting score = %.6f" % under_fitting_score(h)
    view_acc(h)
    plt.show()
    view_loss(h)
    plt.show()
fmon = FitMonitor(thresh=0.02, minacc=0.90, filename="model1.h5")
h = model1.fit(
X_train,
Y_train,
batch_size=128,
nb_epoch=20,
verbose=0,
validation_data=(X_test, Y_test),
callbacks = [fmon]
)
show_scores(model1, h)
Training and validation accuracy of 98.88% on the first try is quite good! It looks like the best model was captured as early as epoch 0 (this is what the fmon callback reported above), so we did not really need all 20 epochs. From the first graph we see that the training and validation curves are in complete agreement, which means that our training results were fully validated, with respect to the accuracy values, on the testing set (which was separated from the data at the outset and is not supposed to depend on it).
Let's reiterate the significance of this result: with a neural network of 270 neurons we were able to predict the year of a given song, from its 90 audio attributes, at a 98.88% precision level! It has been tested on 51,630 cases, so it cannot be dismissed as a coincidence or a lucky strike. This is serious ... (but keep in mind the caveat about the binary accuracy metric above).
# Evaluating the accuracy and loss on our training set
loss, accuracy = model1.evaluate(X_train, Y_train, verbose=0)
print "Train: accuracy=%f loss=%f" % (accuracy, loss)
# Evaluating the accuracy and loss on our testing set
loss, accuracy = model1.evaluate(X_test, Y_test, verbose=0)
print "Test: accuracy=%f loss=%f" % (accuracy, loss)
# We have already saved our model (to an HDF5 file), but you can always do it manually
# with the 'save' command, so it can later be loaded with the load_model function and used
model1.save('model1_final.h5')
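For later reuse, the saved model can be restored with the load_model function; for example:
# Restore the saved model and predict the year classes of a few test songs
from keras.models import load_model
saved_model = load_model('model1_final.h5')
print saved_model.predict_classes(X_test[0:10], verbose=0)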
In our second attempt, we have doubled the size of the first hidden layer and added two new hidden layers, each with 360 neurons. The activation function of the output layer was changed to 'sigmoid'. In addition, we have inserted three Dropout layers in order to reduce the chance of overfitting. You can read about all these features on the Keras documentation site: https://keras.io However, none of this has improved the accuracy at all.
We leave this challenge to the reader: try to find a neural network whose accuracy exceeds 99.5%. If you fail to improve the accuracy, you may need to look at the histogram above and see what happens in the years 1922 to 1950: there are hardly any songs from this period, so it could be that there is simply not enough data for learning.
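To quantify the remark about the early years, here is a quick count based on the nsongs dictionary computed earlier:
# How many songs were composed before 1950?
early = sum(nsongs[y] for y in range(1922, 1950))
print "Songs from 1922-1949: %d of %d (%.2f%%)" % (early, len(data), 100.0 * early / len(data))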
# Second Keras Model
model2 = Sequential()
model2.add(Dense(180, input_shape=(90,)))
model2.add(Dropout(0.2))
model2.add(Dense(360, init='uniform', activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(360, init='uniform', activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(90, init='uniform', activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
fmon = FitMonitor(thresh=0.02, minacc=0.99, filename="model2.h5")
h = model2.fit(
X_train,
Y_train,
batch_size=128,
nb_epoch=10,
verbose=0,
validation_data=(X_test, Y_test),
callbacks = [fmon]
)
show_scores(model2, h)
# Let's save our model just in case we need it later for testing (or for post-mortem analysis)
model2.save('model2_final.h5')
P = model2.predict_classes(X_train)
# Collect the indices of the songs whose year class was not predicted exactly
Failed = []
for i in range(len(P)):
    if y_train[i] != P[i]:
        Failed.append(i)
loss, acc = model2.evaluate(X_train, Y_train, verbose=0)
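The Failed list now gives us the exact-match failure rate on the training set:
# Fraction of training songs whose year class was not predicted exactly
print "Failed: %d of %d (%.2f%%)" % (len(Failed), len(P), 100.0 * len(Failed) / len(P))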