This database is used as a canonical example for several topics in data science and machine learning. The Pima Indians diabetes dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict whether a patient has diabetes, based on eight simple diagnostic measurements. We will also use this dataset to learn the basics of the Pandas library, which is essential for data science and machine learning.
Let's start by listing the required Python modules for this project. We use the Keras deep learning library, which makes things simple and practical. In a subsequent study unit we will also show how this can be done with TensorFlow.
Also note that we use a dedicated module, kerutils, which contains some useful Keras and NumPy utilities for all our course units. It can be downloaded from: https://github.com/samyzaf/kerutils
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Activation, Flatten
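# Keras 1.x import paths; SReLU and ParametricSoftplus were dropped from core Keras in 2.0
from keras.layers.advanced_activations import PReLU, SReLU, ELU, LeakyReLU, ThresholdedReLU, ParametricSoftplus
from keras.layers.noise import GaussianNoise
# In Keras 2 and later this helper is keras.utils.plot_model
from keras.utils.visualize_util import plot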
import matplotlib.pyplot as plt
import matplotlib.cm
import pandas as pd
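try:
    from pandas.plotting import scatter_matrix  # pandas >= 0.20
except ImportError:
    from pandas.tools.plotting import scatter_matrix  # older pandas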
from kerutils import *
from matplotlib import rcParams
rcParams['axes.grid'] = True
rcParams['figure.figsize'] = 10,7
%matplotlib inline
# fixed random seed for reproducibility
import numpy as np  # explicit import instead of relying on the star import above
np.random.seed(0)
# These are CSS/HTML styles for good-looking IPython notebooks
from IPython.core.display import HTML
css = open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
features = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv('pima.csv', names=features)
# Let's view the first 10 rows of the data set
# See below what these names mean
data.head(10)
More information: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
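The column names used above are abbreviations. As a convenience, the sketch below maps each one to the attribute description given in the dataset documentation linked above. The dictionary name feature_info is our own and is not part of kerutils or Pandas, and the wording of the descriptions should be checked against that page.

```python
# Column abbreviations used in this notebook, mapped to the attribute
# descriptions published with the dataset (verify against the UCI page).
feature_info = {
    'preg': 'Number of times pregnant',
    'plas': 'Plasma glucose concentration at 2 hours in an oral glucose tolerance test',
    'pres': 'Diastolic blood pressure (mm Hg)',
    'skin': 'Triceps skin fold thickness (mm)',
    'test': '2-hour serum insulin (mu U/ml)',
    'mass': 'Body mass index (weight in kg / (height in m)^2)',
    'pedi': 'Diabetes pedigree function',
    'age': 'Age (years)',
    'class': 'Class variable (0 = negative, 1 = positive diabetes check)',
}

# Print a small legend, one feature per line
for name, description in feature_info.items():
    print("%-6s %s" % (name, description))
```

The keys are written in the same order as the features list, so the legend lines up with the columns of data.head().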
# How many rows do we have?
data.index
# Try2: How many rows do we have?
len(data.index)
# How many columns do we have?
len(data.columns)
# How many values (cells) do we have in our data set?
data.size # 9 columns * 768 rows
# You can get both dimensions (rows, columns) in one line!
data.shape
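To make the relation between index, columns, size, and shape concrete, here is a minimal sketch on a tiny made-up frame. The name toy and its values are hypothetical and are not taken from pima.csv.

```python
import pandas as pd

# Toy frame with made-up values: 3 rows, 2 columns
toy = pd.DataFrame({'age': [50, 31, 32], 'class': [1, 0, 1]})

n_rows, n_cols = toy.shape      # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(toy.index))           # number of rows
print(len(toy.columns))         # number of columns
print(toy.size)                 # number of cells = rows * columns
```

The same unpacking works on our data frame: n_rows, n_cols = data.shape.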
# View the last 10 records of our data set
data.tail(10)
# The plot() method plots all features
# This is TMI (too much information)
data.plot()
# Better try one at a time
# Here is the plasma glucose concentration level of each record
# (less is more!)
data.plot(y='plas')
# The graph above does not look very useful.
# Let's sort the plas values
# (.values replaces as_matrix(), which was removed in pandas 1.0)
plt.plot(sorted(data.plas.values))
# This is the age distribution (sorted!)
plt.plot(sorted(data.age.values))
# Age range
print("Min age = %d, Max age = %d" % (data.age.min(), data.age.max()))
# Age average
data.age.mean()
# Age median
data.age.median()
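The minimum, maximum, mean, and median computed above one at a time can also be obtained in a single call with the Pandas describe method. This is a small sketch on a made-up series; the name toy and its values are hypothetical and are not taken from the data set.

```python
import pandas as pd

# Made-up ages, for illustration only
toy = pd.DataFrame({'age': [50, 31, 32, 21, 33]})

# describe() returns count, mean, std, min, quartiles, and max as a Series
summary = toy['age'].describe()
print(summary)

# Individual statistics are looked up by label; '50%' is the median
print(summary['min'], summary['max'], summary['mean'], summary['50%'])
```

On our data frame the equivalent call is data.age.describe(), or data.describe() for all columns at once.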
top = plt.subplot2grid((4,4), (0, 0), rowspan=2, colspan=4)
top.scatter(data['plas'], data['age'])
plt.title('Age against Plasma')
bottom = plt.subplot2grid((4,4), (2,0), rowspan=2, colspan=4)
bottom.bar(data['preg'], data['plas'])
plt.title('Plasma against Pregnancies')
plt.tight_layout()
# Items with class == 0 (negative diabetes check)
data0 = data[data['class'] == 0]
len(data0.index)
# Items with class == 1 (persons with a positive diabetes check)
data1 = data[data['class'] == 1]
len(data1.index)
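Counting the records of each class by filtering, as above, can also be done in one step with value_counts. The following is a minimal sketch on a made-up column; the name toy and its values are hypothetical.

```python
import pandas as pd

# Made-up class labels, for illustration only
toy = pd.DataFrame({'class': [1, 0, 1, 0, 0]})

# Number of records per class value
counts = toy['class'].value_counts()
print(counts)

# The same counts as fractions of the total number of records
fractions = toy['class'].value_counts(normalize=True)
print(fractions)
```

On our data frame, data['class'].value_counts() should agree with len(data0.index) and len(data1.index).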
# The Pandas groupby method splits the records into groups by class value.
# Calling hist() on the groups draws a histogram of every feature per class:
# we see 8 histograms for the records with a negative diabetes check
# and another 8 histograms for the records with a positive diabetes check
data.groupby('class').hist(figsize=(8,8), xlabelsize=7, ylabelsize=7)
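The grouping step can be seen more directly without plotting. Here is a minimal sketch, on a made-up frame, of how groupby splits the rows by class and then applies an aggregate to each group. The name toy and its values are hypothetical.

```python
import pandas as pd

# Made-up plasma values and class labels, for illustration only
toy = pd.DataFrame({
    'plas':  [148, 85, 183, 89, 137],
    'class': [1, 0, 1, 0, 1],
})

# Split the rows by class value, then average plas within each group
means = toy.groupby('class')['plas'].mean()
print(means)

# Each group is an ordinary DataFrame holding the rows of one class
for label, group in toy.groupby('class'):
    print(label, len(group.index))
```

On our data frame, data.groupby('class').mean() gives the per-class average of every feature in the same way.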
sm = scatter_matrix(data, alpha=0.2, figsize=(7.5, 7.5), diagonal='kde')
# Shrink the tick labels on both axes of every subplot
for item in sm.ravel():
    plt.setp(item.yaxis.get_majorticklabels(), 'size', 6)
    plt.setp(item.xaxis.get_majorticklabels(), 'size', 6)
plt.tight_layout(h_pad=0.15, w_pad=0.15)