Year Prediction - Million Song Dataset

This is a subset of the Million Song Dataset from Columbia University, New York:
http://labrosa.ee.columbia.edu/millionsong
a collaboration between LabROSA (Columbia University) and The Echo Nest.
Prepared by T. Bertin-Mahieux.

The database can also be downloaded from:
http://www.samyzaf.com/ML/song_year/song_year.zip

It consists of 515,345 records of songs composed during the years 1922-2011. Each record consists of 91 fields: the first is the year in which the song was composed, and the remaining 90 are floating-point features related to the song audio. More information can be obtained from:

  1. https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD#
  2. http://labrosa.ee.columbia.edu/millionsong
  3. https://en.wikipedia.org/wiki/Timbre

The interesting question raised with regard to this database is: is there a strong relation between the musical features of a song and the year in which it was composed? In other words: can we design a small neural network that predicts the year from the other 90 musical features? A positive answer to this question would reveal a profound insight into the nature of musical composition, and more importantly, this insight would be demonstrated quantitatively on more than half a million songs.

Data Set Information

According to the database authors, we should respect the following train/test split:

  1. Training set: first 463,715 examples
  2. Validation set: last 51,630 examples

This split avoids the 'producer effect' by making sure that no song from a given artist ends up in both the train and test sets.

Attribute Information

Quote from the README file: "Each song owns 90 attributes, 12 = timbre average, 78 = timbre covariance. The first value is the year (target), ranging from 1922 to 2011. Features extracted from the 'timbre' features from The Echo Nest API. We take the average and covariance over all 'segments', each segment being described by a 12-dimensional timbre vector."
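
To see where the number 90 comes from: the 12x12 covariance matrix is symmetric, so it contributes only its 12*13/2 = 78 unique entries, which together with the 12 averages gives 90 features. Here is a small illustration of this bookkeeping (a sketch with a fake song; the dataset ships these values precomputed, and this is not the actual extraction code):

import numpy as np
segments = np.random.rand(200, 12)     # a fake song with 200 segments
means = segments.mean(axis=0)          # 12 timbre averages
cov = np.cov(segments, rowvar=False)   # 12x12 covariance matrix
iu = np.triu_indices(12)               # upper triangle: 12*13/2 = 78 entries
song_features = np.concatenate([means, cov[iu]])
print song_features.shape              # (90,)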

As we are not musical experts, the 90 features are of no interest to us, so we will simply name them: t1, t2, t3, ..., t90.

Loading Keras, Pandas, Numpy, and Matplotlib

We assume you have access to a Python 2.7 environment equipped with all the required modules. The only module that is not yet public is our private kerutils module, which can be downloaded from:
http://www.samyzaf.com/cgi-bin/view_file.py?file=ML/lib/kerutils.py

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix
from matplotlib import rcParams
rcParams['figure.figsize'] = 10,7
rcParams['axes.grid'] = True

# Our deep learning library is Keras
from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.models import load_model
from keras.regularizers import l2
from keras.utils import np_utils
import numpy as np
# fixed random seed for reproducibility
np.random.seed(0)
import sys
sys.path.append("c:/ml/lib")
from kerutils import *
%matplotlib inline
Using Theano backend.
DEBUG: nvcc STDOUT mod.cu
   Creating library C:/Users/samy/AppData/Local/Theano/compiledir_Windows-10-10.0.14393-Intel64_Family_6_Model_94_Stepping_3_GenuineIntel-2.7.12-64/tmpjlmly3/265abc51f7c376c224983485238ff1a5.lib and object C:/Users/samy/AppData/Local/Theano/compiledir_Windows-10-10.0.14393-Intel64_Family_6_Model_94_Stepping_3_GenuineIntel-2.7.12-64/tmpjlmly3/265abc51f7c376c224983485238ff1a5.exp

Using gpu device 0: GeForce GTX 950 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5103)
c:\anaconda2\lib\site-packages\theano\sandbox\cuda\__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
  warnings.warn(warn)
In [2]:
# These are css/html styles for good looking ipython notebooks
from IPython.core.display import HTML
css = open('c:/ml/style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
Out[2]:
In [3]:
features = ['year'] + ['t%d' % i for i in range(1, 91)]   # 'year', 't1', 't2', ..., 't90'

# Note that our classes (which we have to predict from those 90 features) are all
# the years from 1922 to 2011: 1922, 1923, 1924, 1925, ..., 2011
# There are exactly 90 years, so we also have 90 classes:
nb_classes = 90
In [4]:
# The following line loads our database into a Pandas DataFrame object

data = pd.read_csv('YearPredictionMSD.csv', names=features)

This is a huge table (515,345 rows and 91 columns!). To get an idea of how it looks, let's peek at a small part of it: rows 100-110 and the first 10 columns.

Here is the Pandas command for doing this:

In [5]:
data.ix[100:110, 0:10]
Out[5]:
year t1 t2 t3 t4 t5 t6 t7 t8 t9
100 2008 36.57341 40.76354 -1.37479 18.26958 -7.85506 -12.91930 15.95326 13.05346 -10.98357
101 2009 47.57913 44.00216 -20.86854 15.68389 0.28155 -12.98692 8.40916 8.45960 -15.20321
102 2009 50.01514 85.99753 -10.86962 -16.04817 -5.21407 -19.06423 0.60184 4.70044 -3.45756
103 2009 47.46427 38.57579 -16.62851 -8.74569 -17.57472 -20.53645 8.71748 -2.62025 0.75556
104 2008 42.40756 30.87384 -53.31370 75.21764 13.45603 10.56135 4.32810 20.33905 2.81100
105 2008 42.93189 17.33496 -14.89457 30.36314 -30.71208 -14.58057 -0.86303 -6.77871 7.45277
106 2008 47.25722 76.94691 33.35870 22.21808 24.60699 -13.97433 -1.26071 4.90241 -7.64389
107 2008 51.47089 24.49163 10.80634 -5.68605 -2.73621 -23.88744 -3.49063 -1.26504 5.07964
108 2008 45.90578 74.56383 -3.78965 12.39632 23.83092 -26.01724 5.72196 -3.25670 5.24608
109 2008 50.54486 44.66363 21.59233 0.63973 18.81458 -31.82536 11.59790 8.54721 -6.26381
110 2007 44.90167 5.02397 -4.65443 -15.14147 -11.70534 -0.29816 -11.07893 -0.04177 14.06437
In [6]:
# How many rows and columns do we have?

data.shape
Out[6]:
(515345, 91)

It is always a good idea to fiddle with the data in order to get acquainted with it before we start grinding it through a neural network. Let's count how many songs we have from each year.

In [7]:
nsongs = {}
for y in range(1922,2012):
    nsongs[y] = len(data[data.year==y])
    print "Year=%d, nsongs=%d" % (y, nsongs[y])
Year=1922, nsongs=6
Year=1923, nsongs=0
Year=1924, nsongs=5
Year=1925, nsongs=7
Year=1926, nsongs=19
Year=1927, nsongs=42
Year=1928, nsongs=52
Year=1929, nsongs=93
Year=1930, nsongs=40
Year=1931, nsongs=35
Year=1932, nsongs=11
Year=1933, nsongs=6
Year=1934, nsongs=29
Year=1935, nsongs=24
Year=1936, nsongs=25
Year=1937, nsongs=28
Year=1938, nsongs=19
Year=1939, nsongs=35
Year=1940, nsongs=52
Year=1941, nsongs=32
Year=1942, nsongs=24
Year=1943, nsongs=14
Year=1944, nsongs=15
Year=1945, nsongs=30
Year=1946, nsongs=29
Year=1947, nsongs=57
Year=1948, nsongs=43
Year=1949, nsongs=60
Year=1950, nsongs=83
Year=1951, nsongs=74
Year=1952, nsongs=77
Year=1953, nsongs=133
Year=1954, nsongs=123
Year=1955, nsongs=275
Year=1956, nsongs=565
Year=1957, nsongs=597
Year=1958, nsongs=583
Year=1959, nsongs=592
Year=1960, nsongs=424
Year=1961, nsongs=571
Year=1962, nsongs=605
Year=1963, nsongs=902
Year=1964, nsongs=945
Year=1965, nsongs=1120
Year=1966, nsongs=1377
Year=1967, nsongs=1718
Year=1968, nsongs=1867
Year=1969, nsongs=2210
Year=1970, nsongs=2349
Year=1971, nsongs=2131
Year=1972, nsongs=2288
Year=1973, nsongs=2596
Year=1974, nsongs=2184
Year=1975, nsongs=2482
Year=1976, nsongs=2179
Year=1977, nsongs=2502
Year=1978, nsongs=2926
Year=1979, nsongs=3108
Year=1980, nsongs=3101
Year=1981, nsongs=3162
Year=1982, nsongs=3597
Year=1983, nsongs=3386
Year=1984, nsongs=3368
Year=1985, nsongs=3578
Year=1986, nsongs=4219
Year=1987, nsongs=5122
Year=1988, nsongs=5611
Year=1989, nsongs=6670
Year=1990, nsongs=7256
Year=1991, nsongs=8647
Year=1992, nsongs=9543
Year=1993, nsongs=10525
Year=1994, nsongs=12121
Year=1995, nsongs=13257
Year=1996, nsongs=14130
Year=1997, nsongs=15182
Year=1998, nsongs=15814
Year=1999, nsongs=18238
Year=2000, nsongs=19285
Year=2001, nsongs=21590
Year=2002, nsongs=23451
Year=2003, nsongs=27382
Year=2004, nsongs=29607
Year=2005, nsongs=34952
Year=2006, nsongs=37534
Year=2007, nsongs=39404
Year=2008, nsongs=34760
Year=2009, nsongs=31038
Year=2010, nsongs=9396
Year=2011, nsongs=1

Actually, it will be more useful to draw a histogram:

In [8]:
years = range(1922,2012)
values = [nsongs[y] for y in years]
plt.bar(years, values, align='center')
plt.xlabel("Year")
plt.ylabel("Number of songs")
Out[8]:
<matplotlib.text.Text at 0x1d879860>

Let's get an idea of the kind of values the 90 features take. We'll just peek at the first 10 features and see how widely their sorted values are distributed across the songs:

In [9]:
for t in features[1:11]:
    values = data[t].as_matrix()
    plt.plot(sorted(values), label=t, linewidth=1)

plt.legend(loc='upper center', bbox_to_anchor=(0.4, 1.04), ncol=3, fancybox=True, shadow=True)
Out[9]:
<matplotlib.legend.Legend at 0x1d781978>

Most neural-network practitioners suggest that, for more efficient training, we should normalize these values to the unit interval [0,1]. We should also map our target feature, the year, to a set of integers starting from 0.

This is easily done with the Numpy package:

In [10]:
X = data.ix[:,1:].as_matrix()  # this is the 90 columns without the year
Y = data.ix[:,0].as_matrix()   # this is the year column

# data normalizations (scaling down all values to the interval [0,1])
# The years 1922-2011 are scaled down to integers [0,1,2,..., 89] 
a = X.min()
b = X.max()
X = (X - a) / (b - a)  # all values now between 0 and 1 !
Y = Y - Y.min()        # The years 1922-2011 are mapped to 0-89

# Training data set
X_train = X[0:463715]
y_train = Y[0:463715]

# Validation data set
X_test = X[463715:]
y_test = Y[463715:]

Let's check that all is fine by plotting 40 samples of X_train (from item 1600 to item 1639):

In [11]:
for i in range(1600, 1640):
    plt.plot(X_train[i], label='song_' + str(i))
    
plt.xlabel("Feature")
plt.ylabel("Value")
plt.legend(loc='upper center', bbox_to_anchor=(0.8, 0.9), ncol=5, fancybox=True, shadow=True, fontsize=7)
Out[11]:
<matplotlib.legend.Legend at 0x9bf13470>

Indeed, the values seem to be well normalized to the 0.0-1.0 scale. Here is what our classes (years) look like: let's sort them and plot them to verify that they range between 0 and 89:

In [12]:
plt.plot(sorted(y_train))
Out[12]:
[<matplotlib.lines.Line2D at 0x9d5fcc50>]

As we saw in previous examples, Keras does not work with integer classes, and we will have to convert our list of years to one-hot vectors:


    0   →  [1,0,0,0,...,0]
    1   →  [0,1,0,0,...,0]
    2   →  [0,0,1,0,...,0]
    ...
    89  →  [0,0,0,0,...,1]

This is easily done with the to_categorical function from the Keras np_utils module:

In [13]:
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
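
As a quick sanity check (a sketch), every row of Y_train should now be a one-hot vector of length 90:

print Y_train.shape      # (463715, 90)
print Y_train[0].sum()   # 1.0 - exactly one hot entry per row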

Deep Learning with Keras

We're now all set with our data, so we can proceed to build a neural network with Keras and find out whether there is any mysterious hidden connection between the 90 tone/sound attributes and the year in which a song was composed.

Keras Model Definition

Let's start with a simple neural network that consists of

  1. An input layer of 90 neurons (corresponding to our 90 audio features)
  2. A hidden layer of 90 neurons (just a guess)
  3. An output layer of 90 neurons (one for each year from 1922 to 2011)
In [14]:
# Our first Keras Model
model1 = Sequential()
model1.add(Dense(90, input_shape=(90,)))
model1.add(Dense(90, init='uniform', activation='softmax'))

Compiling the model

In [15]:
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
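# Caveat: with 'binary_crossentropy' on one-hot targets, the 'accuracy'
# Keras reports is the element-wise binary accuracy over the 90 output
# units, not the fraction of songs whose year class is predicted exactly.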

To be able to capture and save the best state of this model, we will use the FitMonitor callback from our kerutils module, which can be downloaded from
http://www.samyzaf.com/cgi-bin/view_file.py?file=ML/lib/kerutils.py

We will also use the following utility for printing scores and displaying graphs of the training and validation accuracy and loss values. h is the history object returned by the training function (see below).

In [16]:
def show_scores(model, h):
    loss, acc = model.evaluate(X_train, Y_train, verbose=0)
    print "Training: accuracy   = %.6f loss = %.6f" % (acc, loss)
    loss, acc = model.evaluate(X_test, Y_test, verbose=0)
    print "Validation: accuracy = %.6f loss = %.6f" % (acc, loss)
    print "Over fitting score   = %.6f" % over_fitting_score(h)
    print "Under fitting score  = %.6f" % under_fitting_score(h)
    view_acc(h)
    plt.show()
    view_loss(h)
    plt.show()

Training our Model

In [17]:
fmon = FitMonitor(thresh=0.02, minacc=0.90, filename="model1.h5")

h = model1.fit(
    X_train,
    Y_train,
    batch_size=128,
    nb_epoch=20,
    verbose=0,
    validation_data=(X_test, Y_test),
    callbacks = [fmon]
)

show_scores(model1, h)
Train begin: 2016-11-03 14:37:59
batch_size = 128
do_validation = True
metrics = ['loss', 'acc', 'val_loss', 'val_acc']
nb_epoch = 20
nb_sample = 463715
verbose = 0

Saving model: 0 0.988888859749 0.988888859749
.. 10% epoch=2 acc=0.988889 loss=0.049063 max_acc=0.988889 max_val_acc=0.988889
.. 20% epoch=4 acc=0.988889 loss=0.048966 max_acc=0.988889 max_val_acc=0.988889
.. 30% epoch=6 acc=0.988889 loss=0.048888 max_acc=0.988889 max_val_acc=0.988889
.. 40% epoch=8 acc=0.988889 loss=0.048843 max_acc=0.988889 max_val_acc=0.988889
.. 50% epoch=10 acc=0.988889 loss=0.048810 max_acc=0.988889 max_val_acc=0.988889
.. 60% epoch=12 acc=0.988889 loss=0.048781 max_acc=0.988889 max_val_acc=0.988889
.. 70% epoch=14 acc=0.988889 loss=0.048754 max_acc=0.988889 max_val_acc=0.988889
.. 80% epoch=16 acc=0.988889 loss=0.048732 max_acc=0.988889 max_val_acc=0.988889
.. 90% epoch=18 acc=0.988889 loss=0.048710 max_acc=0.988889 max_val_acc=0.988889
. 95% epoch=19 acc=0.988889 loss=0.048690
Train end: 2016-11-03 14:40:15
Total run time: 135.89 seconds
max_acc = 0.988889  epoc = 0
max_val_acc = 0.988889  epoc = 0
Best model saved in file: model1.h5
Checkpoint: epoch=0, acc=0.988889, val_acc=0.988889
Training: accuracy   = 0.988889 loss = 0.048660
Validation: accuracy = 0.988889 loss = 0.048594
Over fitting score   = 0.000000
Under fitting score  = 0.000000

The reported training and validation accuracy of 98.88% at the first try seems quite good! Note, though, that the best model was already captured at epoch 0 (this is what the fmon callback reported above) and the accuracy never moved afterwards, so the 20 epochs were not really needed. From the first graph we see that the training and validation curves are in complete agreement, which means that the training results carried over fully to the test set (which was separated from the data at the outset and is not supposed to depend on it).

At first glance this is a remarkable result: a neural network of 270 neurons appears to predict the year of a given song, from its 90 audio attributes, at a 98.88% precision level, and it has been tested on 51,630 cases. One word of caution, however: since we compiled the model with the binary_crossentropy loss, Keras reports the element-wise binary accuracy over the 90 output units, not the fraction of songs whose year is predicted exactly. A model that outputs all zeros on all 90 units already scores 89/90 = 0.9889 on this metric, which is exactly the number above, so before celebrating we should measure the exact-year (categorical) accuracy directly.
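
Here is a minimal sketch of that check, using predict_classes (available on Keras 1 Sequential models) to compute the fraction of validation songs whose year class is predicted exactly:

P1 = model1.predict_classes(X_test, verbose=0)
print "Exact-year validation accuracy: %.4f" % np.mean(P1 == y_test)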

In [18]:
# Validating the accuracy and loss of our training set

loss, accuracy = model1.evaluate(X_train, Y_train, verbose=0)
print "Train: accuracy=%f loss=%f" % (accuracy, loss)
Train: accuracy=0.988889 loss=0.048660
In [19]:
# Validating the accuracy and loss of our testing set

loss, accuracy = model1.evaluate(X_test, Y_test, verbose=0)
print "Test: accuracy=%f loss=%f" % (accuracy, loss)
Test: accuracy=0.988889 loss=0.048594
In [20]:
# We have already saved our model (to an hdf5 file), but you can always do it manually
# with the 'save' command, so it can later be loaded with the load_model function and used

model1.save('model1_final.h5')
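
Later, the saved model can be restored in a fresh session and used directly. A minimal sketch (load_model was imported at the top):

model1b = load_model('model1_final.h5')
loss, acc = model1b.evaluate(X_test, Y_test, verbose=0)
print "Restored model: accuracy=%f loss=%f" % (acc, loss)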

In our second attempt, we doubled the first hidden layer to 180 neurons and added two new hidden layers, each with 360 neurons. The activation function of the output layer was changed to 'sigmoid'. In addition, we inserted three Dropout layers in order to reduce the chance of overfitting. You can read about all these features on the Keras documentation site: https://keras.io. However, as the run below shows, this has not improved the accuracy at all.

We leave this challenge to the reader: try to find a neural network whose accuracy exceeds 99.5%. If you fail to improve the accuracy, take another look at the histogram above and note what happens in the years 1922 to 1950: there are hardly any songs from this period, so there may simply not be enough data for learning.
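
To make the data-scarcity point concrete, here is a quick count (a sketch using the nsongs dictionary computed earlier):

early = sum(nsongs[y] for y in range(1922, 1951))
print "Songs from 1922-1950: %d out of %d (%.2f%%)" % (early, len(data), 100.0 * early / len(data))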

In [21]:
# Second Keras Model
model2 = Sequential()
model2.add(Dense(180, input_shape=(90,)))
model2.add(Dropout(0.2))
model2.add(Dense(360, init='uniform', activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(360, init='uniform', activation='relu'))
model2.add(Dropout(0.2))
model2.add(Dense(90, init='uniform', activation='sigmoid'))

model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
fmon = FitMonitor(thresh=0.02, minacc=0.99, filename="model2.h5")
h = model2.fit(
    X_train,
    Y_train,
    batch_size=128,
    nb_epoch=10,
    verbose=0,
    validation_data=(X_test, Y_test),
    callbacks = [fmon]
)

show_scores(model2, h)
Train begin: 2016-11-03 14:40:39
batch_size = 128
do_validation = True
metrics = ['loss', 'acc', 'val_loss', 'val_acc']
nb_epoch = 10
nb_sample = 463715
verbose = 0
. 10% epoch=1 acc=0.988889 loss=0.049329 max_acc=0.988889 max_val_acc=0.988889
. 20% epoch=2 acc=0.988889 loss=0.049231 max_acc=0.988889 max_val_acc=0.988889
. 30% epoch=3 acc=0.988889 loss=0.049173 max_acc=0.988889 max_val_acc=0.988889
. 40% epoch=4 acc=0.988889 loss=0.049135 max_acc=0.988889 max_val_acc=0.988889
. 50% epoch=5 acc=0.988889 loss=0.049108 max_acc=0.988889 max_val_acc=0.988889
. 60% epoch=6 acc=0.988889 loss=0.049089 max_acc=0.988889 max_val_acc=0.988889
. 70% epoch=7 acc=0.988889 loss=0.049055 max_acc=0.988889 max_val_acc=0.988889
. 80% epoch=8 acc=0.988889 loss=0.049043 max_acc=0.988889 max_val_acc=0.988889
. 90% epoch=9 acc=0.988889 loss=0.049006 max_acc=0.988889 max_val_acc=0.988889
 90% epoch=9 acc=0.988889 loss=0.049006
Train end: 2016-11-03 14:42:42
Total run time: 122.89 seconds
max_acc = 0.988889  epoc = 1
max_val_acc = 0.988889  epoc = 0
No checkpoint model found.
Saving the last state: model2.h5
Training: accuracy   = 0.988889 loss = 0.048835
Validation: accuracy = 0.988889 loss = 0.048757
Over fitting score   = 0.000000
Under fitting score  = 0.000023
In [22]:
# Let's save our model in case we need it later for testing (or for post-mortem analysis)
model2.save('model2_final.h5')
In [23]:
P = model2.predict_classes(X_train)
463392/463715 [============================>.] - ETA: 0s
In [24]:
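# Collect the indices of all training samples whose year class was mispredicted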
Failed = []
for i in range(len(P)):
    if y_train[i] != P[i]:
        Failed.append(i)
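
How large is this failure set relative to the whole training set? A quick sketch, using the arrays we just computed:

print "Failed: %d out of %d samples (%.2f%%)" % (len(Failed), len(P), 100.0 * len(Failed) / len(P))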
In [27]:
loss, acc = model2.evaluate(X_train, Y_train)  # returns [loss, accuracy] since we compiled with metrics=['accuracy']
463520/463715 [============================>.] - ETA: 0s
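
And to print the final numbers in the same format as before (a sketch):

print "Train: accuracy=%f loss=%f" % (acc, loss)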