Artificial Neural Networks — Analysis of Bank Marketing Dataset

Sachchithananthan Thanusan
10 min read · Jun 26, 2021

Analysis to predict whether a client will subscribe to a term deposit…

Use case: The dataset relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The goal is to predict whether a client will subscribe to a term deposit.

Data set: View Data Set

To practice, you can download the data set from the link above.

First, we import some libraries. Most Python libraries provide a set of useful functions that remove the need to write code from scratch, so I have imported the libraries below.

import numpy as np
import pandas as pd
import scipy.stats as stats
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler

Pandas is mainly used for data analysis and reads various file formats such as comma-separated values, JSON, SQL, and Microsoft Excel. It also supports data manipulation operations such as merging, reshaping, selecting, and cleaning. I will explain the other libraries where they are required.

The code below mounts a Google Drive account in Google Colaboratory so that files stored on Drive can be accessed. When you execute it, open the generated URL and get the authorization code to enter here.

from google.colab import drive
drive.mount("/content/gdrive")

After you enter the authorization code, you will get the message ‘Mounted at /content/gdrive’. After that, the code below reads a file from your connected Google Drive.

data = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/data/bank-full.csv',sep=';')

You have now loaded your data into the variable called ‘data’. For modification purposes, I have copied that data into another variable called ‘X’.

X = data.copy() #dataset has been copied to  X

The code below shows the top 10 rows stored in that variable. The pandas method head() returns the top n rows of a data frame or series; by default it returns the top 5 rows.

X.head(10)

To get histograms of the numeric columns:

plt.rcParams["figure.figsize"] = (24, 12)
X.hist()

Step 01 Data Pre-Processing

Data pre-processing is a key step in machine learning: the useful information that can be derived from the data set directly affects model quality, so it is extremely important to do at least the necessary pre-processing before feeding the data into our model.

Step 01 A -> Data Cleansing - Identifying and handling Missing Values and Duplicate Records

The code below counts the missing values in each column of the data:

X.isnull().sum()  #No null values

In the output above, every count is zero, so we can conclude that the data set has no NaN values.

The code below checks for duplicate rows in the data set:

print(X.duplicated().value_counts()) # To check duplicated values

From the output above, we can conclude that there are no duplicated rows in the data set.
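This data set needed no cleaning, but as a rough sketch of what the "handling" part could look like when a data set does contain gaps or repeats, here is a toy example. The frame `toy` and its values are made up for illustration, and filling with the median is only one possible treatment, not something this article applies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names mirror the bank data but the values are invented.
toy = pd.DataFrame({
    "age": [30, 45, np.nan, 45],
    "job": ["services", "management", "technician", "management"],
})

# Count missing values per column and count exact duplicate rows.
missing_per_column = toy.isnull().sum()
duplicate_count = int(toy.duplicated().sum())

# One possible treatment: fill the numeric gap with the median, then drop exact duplicates.
cleaned = (
    toy.fillna({"age": toy["age"].median()})
       .drop_duplicates()
       .reset_index(drop=True)
)

print(missing_per_column)
print("duplicated rows:", duplicate_count)
print(cleaned)
```

Whether to fill, drop, or keep such rows depends on the data, so treat the choice above as an assumption.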

Step 01 B -> Data Cleansing — Identifying and handling the Outliers

An outlier is an object that deviates significantly from the other objects. It can be caused by different types of errors, so we should analyse the outliers in the given data set to get a good-quality model.

The code below lists the numeric column names in the data set:

plt.rcParams["figure.figsize"] = (22, 3)
X.select_dtypes(include='number').columns.tolist()  # public API instead of the private _get_numeric_data()

The code below draws the boxplot for a given column:

temp_df = pd.DataFrame(X, columns=['age'])
temp_df.boxplot(vert=False)

You can repeat the code above for the other columns to check for and identify outliers. For more explanation, you can refer to my machine learning data pre-processing article.
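A boxplot draws its whiskers using the interquartile range (IQR), so the same check can be done numerically. The sketch below is a minimal version of that rule; the helper name `iqr_outlier_mask`, the multiplier of 1.5, and the sample ages are my own assumptions for illustration, not values taken from the bank data.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up ages; only the last value lies far from the rest.
ages = pd.Series([25, 31, 38, 42, 47, 53, 95], name="age")
mask = iqr_outlier_mask(ages)
print(ages[mask])
```

Flagged values are candidates for inspection only; whether to cap, remove, or keep them is a separate decision.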

Step 01 B -> Apply Suitable Feature Coding Techniques

ANN models require all input and output values to be numerical. So if your data set has categorical data, you must encode it as numbers before fitting and evaluating a model. Several methods are available for this, such as One-hot Encoding and Integer (Label) Encoding. Here I have used Integer (Label) Encoding because One-hot Encoding does not handle new categories in the test set automatically.

The code below creates the label encoder:

from sklearn.preprocessing import LabelEncoder
labelencoder =LabelEncoder()

The code below makes a copy of the X variable for the encoding steps, and checks again whether the data set has any null values.

xa_Encode = X.copy()
xa_Encode.isnull().sum()

The code below labels multiple columns in a single loop with values of our own choosing. This kind of encoding lets us follow a suitable order. For example, automatic encoding might assign an arbitrary code such as 0 to ‘jan’, but with the values given in the snippet below we can assign 1 to ‘jan’, which is meaningful, and reserve 0 for unknown values.

Furthermore, we can give education values that follow the level of education. For those reasons I have chosen this way of encoding the features.

features = ['default' ,'housing', 'loan','month', 'y','contact','education','poutcome','marital','job']
feature_label_dict = {
'default':{'no':0,'yes':1},
'housing':{'no':0,'yes':1},
'loan':{'no':0,'yes':1},
'month':{'jan':1,'feb':2,'mar':3,'apr':4,'may':5,'jun':6,'jul':7,'aug':8,'sep':9,'oct':10,'nov':11,'dec':12},
'y':{'no':0,'yes':1},
'contact':{'unknown':0, 'cellular':1, 'telephone':2},
'education':{'unknown':0, 'primary':1, 'secondary':2,'tertiary':3},
'poutcome':{'unknown':0, 'other':1, 'failure':2, 'success':3},
'marital':{ 'divorced':0,'single':1,'married':2},
'job':{'unknown':0,'unemployed':1, 'student':2,'management':3, 'technician':4, 'entrepreneur':5, 'blue-collar':6, 'retired':7, 'admin.':8, 'services':9, 'self-employed':10, 'housemaid':11}
}
for f in features:
    # replace each category string in column f with its chosen integer code
    xa_Encode = xa_Encode.replace({f: feature_label_dict[f]})
    print("Labelled as: ", feature_label_dict[f])

The output above shows how we performed label encoding with our own values. After that, we can also check the data frame with the code below:

xa_Encode.head()
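One thing to watch with a hand-written dictionary is a category that the dictionary does not cover. As a small sketch of how that could be checked, the example below uses `Series.map`, which returns NaN for anything missing from the mapping. The sample values, including the misspelt "sept", are hypothetical.

```python
import pandas as pd

# Hypothetical column with one value ("sept") that the mapping does not cover.
months = pd.Series(["jan", "may", "sept", "dec"], name="month")
month_codes = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
               "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Series.map yields NaN for keys missing from the dictionary,
# whereas replace() would leave the original string in place.
encoded = months.map(month_codes)
unmapped = sorted(months[encoded.isnull()].unique())

print(encoded.tolist())
print("unmapped categories:", unmapped)
```

An empty `unmapped` list is a quick confirmation that every category received a numeric code.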

The code below defines bins in a meaningful way, rather than relying only on the values that happen to be available in the training data set. For example, in the earlier code snippet I gave ordered values for all months; if we had considered only the values present in the data set, we might lose the codes for months that do not appear in it.

bins = [18, 30, 40, 50, 60, 70, 120]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']
xa_Encode['agerange'] = pd.cut(xa_Encode['age'], bins, labels = labels,include_lowest = True)
xa_Encode['age']=labelencoder.fit_transform(xa_Encode['agerange'])
xa_Encode=xa_Encode.drop(['agerange'], axis = 1)
bins = [0, 300, 600, 900,1200, 1500, 1800,2100,2400,2700,3000,3300,3600,3900,4200,4500,4800,5100]
labels = ['0-299', '300-599','600-899','900-1199','1200-1499','1500-1799','1800-2099','2100-2399','2400-2699','2700-2999','3000-3299','3300-3599','3600-3899','3900-4199','4200-4499','4500-4799','4800+']
xa_Encode['durationrange'] = pd.cut(xa_Encode['duration'], bins, labels = labels,include_lowest = True)
xa_Encode['duration']=labelencoder.fit_transform(xa_Encode['durationrange'])
xa_Encode=xa_Encode.drop(['durationrange'], axis = 1)
bins = [-1,0,100,200,300,400,500,600,700,800,900]
labels = ['-1', '0-99','100-199','200-299','300-399','400-499','500-599','600-699','700-799','800+']
xa_Encode['pdaysrange'] = pd.cut(xa_Encode['pdays'], bins, labels = labels,include_lowest = True)
xa_Encode['pdays']=labelencoder.fit_transform(xa_Encode['pdaysrange'])
xa_Encode=xa_Encode.drop(['pdaysrange'], axis = 1)
bins = [0,25,50,75,100,125,150,175,200,225,250,275,300]
labels = ['0-24','25-49','50-74','75-99','100-124','125-149','150-174','175-199','200-224','225-249','250-274','275+']
xa_Encode['previousrange'] = pd.cut(xa_Encode['previous'], bins, labels = labels,include_lowest = True)
xa_Encode['previous']=labelencoder.fit_transform(xa_Encode['previousrange'])
xa_Encode=xa_Encode.drop(['previousrange'], axis = 1)
bins = [0,10,20,30,40,50,60,70]
labels = ['0-9','10-19','20-29','30-39','40-49','50-59','60+']
xa_Encode['campaignrange'] = pd.cut(xa_Encode['campaign'], bins, labels = labels,include_lowest = True)
xa_Encode['campaign']=labelencoder.fit_transform(xa_Encode['campaignrange'])
xa_Encode=xa_Encode.drop(['campaignrange'], axis = 1)
bins = [-10000, 0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000]
labels = ['-10000--1', '0-9999', '10000-19999', '20000-29999', '30000-39999', '40000-49999', '50000-59999', '60000-69999', '70000-79999', '80000-89999', '90000-99999', '100000+']
xa_Encode['balancerange'] = pd.cut(xa_Encode['balance'], bins, labels = labels,include_lowest = True)
xa_Encode['balance']=labelencoder.fit_transform(xa_Encode['balancerange'])
xa_Encode=xa_Encode.drop(['balancerange'], axis = 1)
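Two details of the binning above are worth knowing. By default pd.cut uses right-closed intervals, so an age of exactly 30 lands in the bin labelled '18-29'. LabelEncoder also sorts string labels alphabetically, so '1200-1499' receives a smaller code than '300-599'. The sketch below is one possible alternative, not the code used for the results in this article: it uses left-closed intervals and takes the integer code from the bin position. The helper name `bin_to_ordered_codes` is hypothetical.

```python
import pandas as pd

def bin_to_ordered_codes(values: pd.Series, bins, labels) -> pd.Series:
    """Bin values into left-closed intervals and return each bin's position as an integer."""
    binned = pd.cut(values, bins=bins, labels=labels, right=False)
    # cat.codes follows the order of `labels`; values outside every bin become -1.
    return binned.cat.codes

ages = pd.Series([18, 29, 30, 65, 95])
age_bins = [18, 30, 40, 50, 60, 70, 120]
age_labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']

age_codes = bin_to_ordered_codes(ages, age_bins, age_labels)
print(age_codes.tolist())  # [0, 0, 1, 4, 5]
```

With this variant the codes increase with the value range, which keeps the ordinal meaning that the bins were designed to have.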

Step 01 C -> Checking for class Imbalance

The code below checks for class imbalance:

import seaborn as sns
plt.rcParams["figure.figsize"] = (10, 8)
print(xa_Encode['y'].value_counts())  # print so the counts show even when this is not the last line of the cell
sns.countplot(x='y', data=xa_Encode)
plt.show()

From the plot above you can see that the classes are imbalanced. To handle this we can use SMOTE from the imbalanced-learn library. SMOTE uses a k-nearest neighbour algorithm to create synthetic data for the minority class.

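from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
smote = SMOTE(random_state=0)  # renamed from 'os' to avoid shadowing the standard os module
X_class = xa_Encode.drop(['y'], axis=1)  # encoded features
y_class = xa_Encode['y']  # encoded target
X_class_train, X_class_test, y_class_train, y_class_test = train_test_split(X_class, y_class, test_size=0.3, random_state=0)
columns = X_class_train.columns
# oversample the training split only; fit_resample replaces the older fit_sample
data_X, data_y = smote.fit_resample(X_class_train, y_class_train)
smoted_X = pd.DataFrame(data=data_X, columns=columns)
smoted_y = pd.DataFrame(data=data_y, columns=['y'])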
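To give a feel for what "synthetic data from nearest neighbours" means, here is a simplified NumPy sketch of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation; the function name `smote_like_samples`, the value k=3, and the four sample points are assumptions made only for illustration.

```python
import numpy as np

def smote_like_samples(minority: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between the minority points.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))            # pick a minority sample
        other = neighbours[base, rng.integers(k)]     # pick one of its k nearest neighbours
        gap = rng.random()                            # position along the connecting segment
        synthetic[i] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic

# Made-up minority-class points with two features.
minority_points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

Each new point lies on the line segment between a real minority sample and one of its neighbours, which is why the synthetic rows stay inside the region the minority class already occupies.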

The code below splits the data set into a training set and a test set:

from sklearn.model_selection import train_test_split
Y = xa_Encode['y']  # target variable
X = xa_Encode.drop(['y','duration'], axis = 1)  # features, without the target and 'duration'
X_train, X_test, y_train, y_test = train_test_split(X, Y,random_state=1, test_size=0.2)

The code below scales the data set. The scaler must be fitted on X_train only, to avoid data leakage.

sc_X = StandardScaler()
X_trainscaled = sc_X.fit_transform(X_train)  # fit on the training data only, then scale it
X_testscaled = sc_X.transform(X_test)  # reuse the training statistics for the test data
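As a small illustration of why the test set is only transformed and never fitted, the sketch below uses tiny made-up arrays. The scaler learns its mean and standard deviation from the training rows, and the test row is scaled with those same numbers.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data, only to show where the statistics come from.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler().fit(train)      # statistics come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuses the training mean and standard deviation

print("training mean:", scaler.mean_)
print("scaled test value:", test_scaled.ravel())
```

If the scaler were fitted on the test data as well, information about the test distribution would reach the model before evaluation, which is the leakage mentioned above.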

To turn the scaled training output back into a data frame, we reuse the column names:

columns_value_new=X_train.columns
test_X_Scaled_Except = pd.DataFrame(X_trainscaled, columns=columns_value_new)

The code below helps identify significant and independent features using the correlation matrix without the target variable:

import seaborn as sns
plt.rcParams["figure.figsize"] = (24, 8)
sns.heatmap(test_X_Scaled_Except.corr(),annot=True);

The code below helps identify significant and independent features using the correlation matrix with the target variable. Here we can see that education and age have a considerable amount of correlation, as do housing and age. Education and job are also correlated. Beyond those, poutcome and pdays are more strongly correlated than any other pair of features, so we could remove one of the two. Considering the values of poutcome, we could omit that feature because it has many unknown values.

plt.rcParams["figure.figsize"] = (24, 8)
sns.heatmap(xa_Encode.corr(),annot=True);

Here each feature has a considerable amount of correlation with our target variable y, so we keep all the features.
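Reading pairs off a large heatmap by eye is error-prone, so here is a minimal sketch that lists the strongly correlated pairs directly. The helper name `correlated_pairs`, the threshold of 0.8, and the synthetic columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, correlation) for pairs with |r| >= threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 3)))
    return pairs

# Synthetic data: "b" is almost a multiple of "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

The same function could be called on a frame such as xa_Encode, with the threshold chosen to suit the data.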

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

The code below fits Principal Component Analysis on the scaled training data and applies the same transformation to the test data:

from sklearn.decomposition import PCA
PCA_data_train = X_trainscaled
PCA_data_test = X_testscaled
pca = PCA(n_components=15)
X_pca_train = pca.fit_transform(PCA_data_train)  # fit the components on the training data only
X_pca_test = pca.transform(PCA_data_test)  # project the test data with the same components, without refitting
principalDf_train = pd.DataFrame(data = X_pca_train)
principalDf_test = pd.DataFrame(data = X_pca_test)

The code below shows the values after PCA for the n components:

principalDf_train.head(10)

The code below returns the PCA explained variance ratio:

pca.explained_variance_ratio_

The pca.explained_variance_ratio_ attribute returns a vector with the fraction of variance explained by each dimension; pca.explained_variance_ratio_[i] gives the fraction explained by the (i+1)-th dimension.
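One common way to use this vector is to add the ratios up and see how many components are needed to reach a target share of the variance. The sketch below shows that on synthetic data; the 95% target and the generated feature matrix are assumptions, not values from this article.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for a scaled feature matrix: 5 columns, two of them nearly redundant.
base = rng.normal(size=(300, 3))
features = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=300),
    base[:, 1] + 0.01 * rng.normal(size=300),
])

pca_full = PCA().fit(features)                      # keep every component
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95% (an assumed target).
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)

print(cumulative.round(3))
print("components needed:", n_needed)
```

When the cumulative ratio reaches the target well before the last component, the remaining components can usually be dropped with little loss of information.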

The code below fits our data set to MLPClassifier, which is a multilayer perceptron. Four hidden layers are modelled: the first hidden layer has 256 neurons, the second 128, the third 64, and the last 32.

from sklearn.neural_network import MLPClassifier 
clf = MLPClassifier(hidden_layer_sizes=(256,128,64,32),activation="relu",random_state=1).fit(principalDf_train, y_train)
y_pred=clf.predict(principalDf_test)
print(clf.score(principalDf_test, y_test))

The output above shows the accuracy of the model.
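With imbalanced classes, accuracy is easier to judge next to a simple baseline: the score obtained by always predicting the most frequent class. The sketch below computes that baseline; the helper name `majority_baseline_accuracy` and the example labels are hypothetical and are not the article's actual test split.

```python
import numpy as np

def majority_baseline_accuracy(y) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(np.asarray(y), return_counts=True)
    return counts.max() / counts.sum()

# Hypothetical labels: 88 negatives and 12 positives.
y_example = np.array([0] * 88 + [1] * 12)
print(majority_baseline_accuracy(y_example))  # 0.88
```

Calling the same function on y_test gives the figure that the model's accuracy needs to beat to be informative.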

from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

The code above prints the classification report of the model.

The code below plots the confusion matrix.

from sklearn.metrics import ConfusionMatrixDisplay
plt.rcParams["figure.figsize"] = (8, 10)
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
fig = ConfusionMatrixDisplay.from_estimator(clf, principalDf_test, y_test, display_labels=["0", "1"], cmap=plt.cm.Blues, values_format='.2f')
fig.figure_.suptitle("Confusion Matrix")
plt.show()

The image above and the classification report describe our developed model. Precision for the 1 (yes) class is 0.15, which is low compared with the 0 (no) class. It means that false positives outnumber true positives for the yes class, so the model does not perform well enough. These results were obtained without applying SMOTE, so we could expect better results if we used SMOTE to handle the class imbalance. In summary, our model has 0.52 macro average precision and 0.81 weighted average precision.

Thank you for reading my article. By Thanusan S.
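For readers who want to see how these numbers relate to the confusion matrix, here is a minimal sketch of the formulas. All counts, precisions, and supports below are hypothetical and were chosen only to illustrate the arithmetic; they are not read from the model's actual confusion matrix.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def macro_and_weighted(values, supports):
    """Macro average ignores class sizes; weighted average scales each class by its support."""
    macro = sum(values) / len(values)
    weighted = sum(v * s for v, s in zip(values, supports)) / sum(supports)
    return macro, weighted

# Hypothetical counts for the "yes" class.
p_yes, r_yes = precision_recall(tp=15, fp=85, fn=60)

# Hypothetical per-class precisions (no, yes) and class supports.
macro_p, weighted_p = macro_and_weighted([0.90, 0.15], [880, 120])

print(p_yes, r_yes)
print(macro_p, weighted_p)
```

The weighted average sits close to the majority-class score because that class dominates the support, while the macro average gives both classes equal weight and therefore exposes the weak minority class.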

Written by Sachchithananthan Thanusan

Final year Undergraduate, Faculty of Information Technology, University of Moratuwa.
