Text Classification for Sentiment Analysis

The Internet Movie Database (IMDb) is a huge repository of image and text data, and an excellent source for data analytics and deep learning practice and research. A quick Google search yields dozens of such examples if needed.

In this tutorial we will use the Large Movie Review Dataset (from Stanford University), which consists of 50,000 movie reviews (50% negative and 50% positive). The set is divided into training and validation datasets, each with 25,000 movie reviews and an equal number of positive and negative reviews.

Our objective is to create a neural network (as a Keras model) that can predict whether a given movie review is positive or negative. This task is also called Sentiment Analysis, since we train our neural network to predict the sentiment of a movie review (negative or positive?).

The seminal research paper on this subject was published by Yoon Kim in 2014. In this paper Yoon Kim laid the foundations for how to model and process text with convolutional neural networks for the purpose of sentiment analysis. He showed that with simple one-dimensional convolutional networks one can build very simple models that reach 90% accuracy very quickly. There is still very active research in this field as we write this tutorial:
https://github.com/alexander-rakhlin/CNN-for-Sentence-Classification-in-Keras/blob/master/docs/1408.5882v2.pdf

To demonstrate Kim's ideas, we will closely follow the code in the imdb_cnn.py example from the Keras GitHub repository:
https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py
We will complement it with more explanations, annotations, and some additional code.

The IMDb datasets are automatically downloaded by the code below, but if needed (for other tests), they can also be downloaded (as pickle files) from:

  1. https://s3.amazonaws.com/text-datasets/imdb_full.pkl
  2. https://s3.amazonaws.com/text-datasets/imdb_word_index.pkl

These datasets are already encoded as integers. The original text datasets can be downloaded from:
http://ai.stanford.edu/~amaas/data/sentiment

In [1]:
# These are css/html styles for good looking ipython notebooks
from IPython.core.display import HTML
css = open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
Out[1]:
In [2]:
import numpy as np

from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Convolution1D, GlobalMaxPooling1D
from keras.datasets import imdb

from matplotlib import rcParams
rcParams['axes.grid'] = True
rcParams['figure.figsize'] = 10,7
%matplotlib inline

# fixed random seed for reproducibility
np.random.seed(1337)
Using Theano backend.
Using gpu device 0: GeForce GTX 950 (CNMeM is enabled with initial size: 80.0% of memory, cuDNN 5103)
c:\anaconda2\lib\site-packages\theano\sandbox\cuda\__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
  warnings.warn(warn)

Parameters

In [3]:
max_features = 5000    # vocabulary size: keep only the 5000 most frequent words
maxlen = 400           # maximal review length: longer reviews are truncated, shorter ones padded
batch_size = 32        # number of reviews per gradient update
embedding_dims = 50    # size of the word embedding vectors
nb_filter = 250        # number of convolution filters
filter_length = 3      # size of the convolution window (in words)
hidden_dims = 250      # size of the hidden dense layer
nb_epoch = 4           # number of training epochs

Loading data ...

In [4]:
print('Loading data ...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
Loading data ...
(25000, 'train sequences')
(25000, 'test sequences')

Let's take a look at item 0 of this data:

In [5]:
print(X_train[0])
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]

What we see is a list of integers. Each integer represents one word in a movie review; that is, the movie review has been converted to a list of integers, where each word in our movie review vocabulary has been indexed by a unique integer. A simple parsing and indexing pass maps each word in the movie review corpus to a unique integer, and the same index table can be used to decode an encoded review back to text, as shown in the sketch below.
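For example, we can decode review 0 back into words using the word index that Keras provides. The sketch below assumes the default offsets used by imdb.load_data (indices shifted by 3, with 1 marking the start of a review and 2 marking out-of-vocabulary words); if you changed these defaults, adjust accordingly.

# A small sketch for decoding review 0 back to text.
# Assumes the defaults index_from=3, start_char=1, oov_char=2 of imdb.load_data.
word_index = imdb.get_word_index()                        # dict: word -> index
index_word = {v + 3: k for k, v in word_index.items()}    # shift by 3 to match load_data
index_word[0] = '<PAD>'
index_word[1] = '<START>'
index_word[2] = '<OOV>'

decoded = ' '.join(index_word.get(i, '<OOV>') for i in X_train[0])
print(decoded[:300])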

It should be noted that only words with a minimal level of frequency enter the vocabulary. Rare words are simply dropped and replaced by a special out-of-vocabulary index (you can see many 2's in the encoded review above), while index 0 is reserved for padding. This process is described in Yoon Kim's paper above and also in this tutorial:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/

Sequence Padding

Obviously, movie reviews are not all of the same length (number of words). However, our neural network expects a fixed-size input vector. We will therefore have to either truncate long reviews or pad short reviews with a special padding word (like a null word such as "<PAD>" that does not carry any meaning and can be encoded by 0). We use the Keras sequence.pad_sequences utility for this task.
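As a quick toy illustration (not part of the original notebook), pad_sequences by default pads short sequences with zeros at the beginning and truncates long ones from the beginning:

# Toy illustration of pad_sequences (defaults: padding='pre', truncating='pre')
print(sequence.pad_sequences([[1, 2, 3]], maxlen=5))           # [[0 0 1 2 3]]
print(sequence.pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))  # [[2 3 4 5 6]]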

In [6]:
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
('X_train shape:', (25000L, 400L))
('X_test shape:', (25000L, 400L))

Create a Keras Neural Network Model

We will define a Keras sequential model (feed-forward neural network) with the following hidden layers: a word embedding layer, a one-dimensional convolution layer, a global max pooling layer, and a dense layer.

In [7]:
model = Sequential()

Word Embedding (word2vec)

After creating the initial model, we need to populate it with the hidden layers that we need. In text classification, the first layer is usually an embedding layer, in which the word indices that come from the input layer are converted to word vectors (word2vec-style embeddings). This is an important conversion which enables more efficient and faster processing of textual data. Each word index is mapped to a dense vector of floats (of length embedding_dims) that captures its syntactic and semantic properties within the movie review corpus. This subject may be covered in a little more depth in the course if time allows. For the moment, more explanation can be found in the TensorFlow and Google code archives and other blogs:

  1. https://code.google.com/archive/p/word2vec
  2. https://www.tensorflow.org/tutorials/word2vec
  3. https://www.youtube.com/watch?v=T8tQZChniMk&list=PLR2RxXcwFe533vpJhgiDOAONyRzj3lzbJ
  4. http://colah.github.io/posts/2014-07-NLP-RNNs-Representations

Here is how this is done in Keras:

In [8]:
model.add(Embedding(max_features, embedding_dims, input_length=maxlen, dropout=0.2))
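To see what this layer does, we can check the model's output shape after adding it (a quick sanity check, not part of the original example): each review of maxlen word indices becomes a maxlen x embedding_dims matrix of floats.

# Each input review (400 word indices) is now a 400 x 50 matrix of floats
print(model.output_shape)   # expected: (None, 400, 50) -- None is the batch dimension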

Adding a Convolution1d Layer

After the embedding layer comes a convolutional layer. Since our word vectors are one-dimensional, we only need 1-dimensional convolutions. Keras provides a built-in layer for doing this elegantly. Note that we need to specify a convolution kernel length (filter_length) and the number of filters to use (nb_filter). More info about parameters and usage at: https://keras.io/layers/convolutional/

In [9]:
model.add(
    Convolution1D(
        nb_filter=nb_filter,          # 250 filters (feature detectors)
        filter_length=filter_length,  # each filter spans 3 consecutive word vectors
        border_mode='valid',          # no padding of the sequence
        activation='relu',
        subsample_length=1,           # stride of 1
    )
)
In [10]:
# take the maximum over time of each of the nb_filter feature maps
model.add(GlobalMaxPooling1D())

Dense Layer

In [11]:
model.add(Dense(hidden_dims))   # fully connected layer with 250 units
model.add(Dropout(0.2))         # dropout for regularization
model.add(Activation('relu'))

Output Layer

Our output layer consists of a single neuron. The sigmoid activation produces a float between 0 and 1. We can round it to 0 or 1 to decide whether the movie review is negative or positive, or we can interpret it as a confidence score for how positive or negative the review is.

In [12]:
model.add(Dense(1))
model.add(Activation('sigmoid'))

Compilation

Now we compile our model

In [13]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
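At this point it can be helpful to print a summary of the model: its layers, their output shapes, and the number of trainable parameters. This is optional and not part of the original example:

# Print the layer-by-layer architecture and parameter counts
model.summary()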

Training

And finally we train the model on our training dataset (X_train and y_train). It turns out that after only three epochs we reach about 89% validation accuracy, and training takes less than 5 minutes on a standard home PC!

In [14]:
h = model.fit(
    X_train,
    y_train,
    batch_size=batch_size,
    nb_epoch=nb_epoch,
    validation_data=(X_test, y_test),
    verbose=1,
)
Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 19s - loss: 0.4241 - acc: 0.7928 - val_loss: 0.3012 - val_acc: 0.8756
Epoch 2/4
25000/25000 [==============================] - 19s - loss: 0.2843 - acc: 0.8809 - val_loss: 0.2914 - val_acc: 0.8760
Epoch 3/4
25000/25000 [==============================] - 19s - loss: 0.2406 - acc: 0.9017 - val_loss: 0.2674 - val_acc: 0.8884
Epoch 4/4
25000/25000 [==============================] - 19s - loss: 0.2022 - acc: 0.9176 - val_loss: 0.2616 - val_acc: 0.8916
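Since we imported matplotlib above, we can also plot the training history returned by model.fit. This is an optional addition; it assumes the Keras 1.x history keys 'acc', 'val_acc', 'loss', and 'val_loss':

import matplotlib.pyplot as plt

# Plot training and validation accuracy per epoch
plt.plot(h.history['acc'], label='training accuracy')
plt.plot(h.history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend(loc='lower right')
plt.show()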
In [15]:
loss, accuracy = model.evaluate(X_train, y_train, verbose=0)
print("Training: accuracy = %f  ;  loss = %f" % (accuracy, loss))
Training: accuracy = 0.975560  ;  loss = 0.090883
In [16]:
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Validation: accuracy1 = %f  ;  loss1 = %f" % (accuracy, loss))
Validation: accuracy1 = 0.891560  ;  loss1 = 0.261638

The training accuracy reached 97.55%, which indicates a bit of overfitting. What counts is the validation accuracy of 89.16%, which is of course not bad considering the minimal effort it took to obtain it.
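To see the sigmoid output in action, we can predict the sentiment of a few validation reviews and round the scores to hard labels. This is a small illustrative sketch, not part of the original notebook:

# Predicted probabilities for the first 5 validation reviews
probs = model.predict(X_test[:5]).flatten()
labels = np.round(probs).astype(int)   # 1 = positive, 0 = negative
for p, label, truth in zip(probs, labels, y_test[:5]):
    print('score=%.3f  predicted=%d  actual=%d' % (p, label, truth))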

Course Project

The following two files consist of 10,662 movie reviews (half positive, half negative). Each line in these files is one movie review (so you'll need to do some parsing).

  1. http://www.samyzaf.com/ML/imdb/pos.txt
  2. http://www.samyzaf.com/ML/imdb/neg.txt

Your tasks:

  1. Create two datasets: one for training, consisting of 8,662 samples (50% positive, 50% negative), and a second dataset for validation with 2,000 samples.
  2. Define a neural network for predicting a review's sentiment, as above.
  3. Try to obtain 95% validation accuracy.
  4. Feel free to experiment with the number of filters and the filter size in your convolutional layer. You can also experiment with several convolutional layers, max pooling layers, etc.
  5. Try also to experiment with different activation functions such as SReLU, leaky ReLU, or PReLU, etc.

A minimal starter sketch for loading and encoding these files is given below.
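Here is a minimal starter sketch for the project, assuming the two files above have been downloaded to the working directory as pos.txt and neg.txt with one review per line; the file names, the 8,662/2,000 split, and the use of the Keras Tokenizer are assumptions, and you may need to handle non-ASCII characters when reading the files. It reuses max_features and maxlen from the parameters above. Note that a random shuffle gives only an approximately balanced training set; adjust if you want an exact 50/50 balance.

from keras.preprocessing.text import Tokenizer

# Read one review per line from the two files (assumed to be in the working directory)
pos_reviews = open('pos.txt').read().splitlines()
neg_reviews = open('neg.txt').read().splitlines()

texts = pos_reviews + neg_reviews
labels = np.array([1] * len(pos_reviews) + [0] * len(neg_reviews))

# Encode each review as a sequence of word indices (top max_features words only)
tokenizer = Tokenizer(nb_words=max_features)
tokenizer.fit_on_texts(texts)
X = sequence.pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=maxlen)

# Shuffle, then split: 8662 training samples, 2000 validation samples
np.random.seed(1337)
perm = np.random.permutation(len(X))
X, labels = X[perm], labels[perm]
X_train, y_train = X[:8662], labels[:8662]
X_val, y_val = X[8662:], labels[8662:]
print(X_train.shape, X_val.shape)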