Analysis of the Bank Marketing Dataset Using Support Vector Machine (SVM)

Sachchithananthan Thanusan · Published in Nerd For Tech · Jun 20, 2021

An analysis to predict whether a client will subscribe to a term deposit…

Use case: The dataset is related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The goal is to predict whether the client will subscribe to a term deposit. Download Data Set.

Step 01 — Data Pre-Processing

Step 02 — Apply Suitable Data Transformations

Step 03 — Apply Suitable Feature Discretization

Step 04 — Apply Suitable Feature Coding Techniques

Step 05 — Scale and/or standardize the features

Step 06 — Correlation Matrix And Principal Component Analysis ( PCA )

Step 07 — Checking for class Imbalance and Handling class Imbalance

Step 08 — Applying Support Vector Machine (SVM)

First we have to import some libraries. Most Python libraries provide a set of useful functions that eliminates the need to write code from scratch, so I have imported the libraries below.

import numpy as np
import pandas as pd
import scipy.stats as stats
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler

I have added the code below to mount the Google Drive account in Google Colaboratory so the files stored on Drive can be accessed. When we execute it, we have to visit the generated URL and enter the authorization code it provides.

from google.colab import drive
drive.mount('/content/gdrive')

After entering the authorization code, you will get the message 'Mounted at /content/gdrive'. Thereafter, the code below lets you access the files stored on your connected Google Drive.

data = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/data/banking.csv')

The code below checks the number of records available in the dataset.

len(data)

Now you have successfully loaded your data into the variable called 'data'. For modification purposes, I have copied that data into another variable called 'X'.

X = data.copy() #dataset has been copied to  X

The code below shows the top 10 rows stored in that variable. The pandas library provides a method called head(), which is widely used to return the top n rows of a DataFrame or Series; by default it returns the top 5 rows.

X.head(10)

Let's look at each step in detail.

Step 01 — Data Pre-Processing

Data pre-processing is a main step in machine learning, as the useful information that can be derived from a dataset directly affects the model quality. It is therefore extremely important to do at least the necessary preprocessing on our data before feeding it into our model.

Step 01 A -> Data Cleansing — Identifying and handling missing values and duplicate records

The code below identifies the missing values in each column of the data.

X.isnull().sum()  #No null values

In the above output all counts are zero, so we can conclude that there are no NaN values in the dataset.

The code below checks for duplicate rows in the dataset.

print(X.duplicated().value_counts()) # To check duplicated values

From the above output we can see that there are 12 duplicated rows in the dataset, but we can't treat them as true duplicates because there is no unique column such as a customer ID; identical rows may simply belong to different people. So I did not remove those records (see the optional sketch below if you want to inspect them).
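If you want to look at those rows before deciding, a small optional sketch is below. It only displays them; nothing is removed.

# Show every row that has at least one exact duplicate elsewhere in the dataset
duplicate_rows = X[X.duplicated(keep=False)]
print(duplicate_rows.sort_values(by=list(X.columns)))
# If they were true duplicates, they could be dropped with:
# X = X.drop_duplicates().reset_index(drop=True)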

Step 01 B -> Data Cleansing — Identifying and handling the Outliers

The code below lists the numeric column names.

X._get_numeric_data().columns.tolist()

A ) Feature Age Q-Q Plots and Box Plot

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["age"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['age'])
temp_df.boxplot(vert=False)

There are no considerable issues with this feature, so we can use it without any preprocessing.

B ) Feature Duration Q-Q Plots and Box Plot

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["duration"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['duration'])
temp_df.boxplot(vert=False)

This attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

X = X.drop(['duration'], axis = 1)

The code above drops the duration column.

C ) Feature Campaign Q-Q Plots and Box Plot

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["campaign"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['campaign'])
temp_df.boxplot(vert=False)

Campaign represents the number of contacts performed during this campaign for this client. In the above boxplot we can see that only one value is above 50, so I have removed that record.

X=X[X['campaign']<50]
X=X.reset_index(drop=True)

Box plot after removing the outlier,

D ) Feature Pdays Q-Q Plots and Box Plot

Number of days that passed after the client was last contacted in a previous campaign (999 means the client was not previously contacted).

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["pdays"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['pdays'])
temp_df.boxplot(vert=False)
len(X[X['pdays']==999])  # -> 39672

There are no considerable issues with this feature in terms of outliers. However, we have to encode it as a categorical feature (handled in Step 04), because 39,672 of its values are the sentinel 999, which means the client was not previously contacted.
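As an illustration only (this is not applied in the rest of the article, which encodes pdays as a categorical feature in Step 04), the 999 sentinel could also be turned into a simple binary "previously contacted" flag:

# Illustrative only, not assigned back to X:
# 1 = contacted in a previous campaign, 0 = never contacted (pdays == 999)
pdays_flag = (X['pdays'] != 999).astype(int)
print(pdays_flag.value_counts())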

E ) Feature Previous Q-Q Plots and Box Plot

Number of contacts performed before this campaign for this particular client.

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["previous"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['previous'])
temp_df.boxplot(vert=False)
X['previous'].value_counts()

There are no considerable issues with this feature, so we can use it without any preprocessing.

F ) Feature Emp_var_rate Q-Q Plots and Box Plot

Employment variation rate - quarterly indicator

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["emp_var_rate"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['emp_var_rate'])
temp_df.boxplot(vert=False)

There are no considerable issues with this feature, so we can use it without any preprocessing.

G ) Feature Cons_price_idx Q-Q Plots and Box Plot

Consumer price index — monthly indicator

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["cons_price_idx"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['cons_price_idx'])
temp_df.boxplot(vert=False)

There are no considerable issues with this feature, so we can use it without any preprocessing.

H ) Feature Cons_conf_idx Q-Q Plots and Box Plot

Consumer confidence index — monthly indicator

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["cons_conf_idx"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['cons_conf_idx'])
temp_df.boxplot(vert=False)

There are no considerable issues with this feature, so we can use it without any preprocessing.

I ) Feature Euribor3m Q-Q Plots and Box Plot

Euribor 3-month rate — daily indicator

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["euribor3m"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['euribor3m'])
temp_df.boxplot(vert=False)

There are no considerable issues with this feature, so we can use it without any preprocessing.

J ) Feature Nr_employed Q-Q Plots and Box Plot

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["nr_employed"], dist="norm", plot=plt)
plt.show()
plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['nr_employed'])
temp_df.boxplot(vert=False)

There are no considerable issues with this feature, so we can use it without any preprocessing.

Step 02 — Apply Suitable Data Transformations

We generally prefer data that follows a normal distribution when training a model, but features in real datasets usually follow skewed distributions. By applying different types of transformations to these variables according to their skewness, we can map a skewed distribution closer to a normal one.

cols_ForHist= X._get_numeric_data().columns.to_list()
cols_ForHist.remove('y')
cols_ForHist

['age', 'campaign', 'pdays', 'previous', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed']

plt.rcParams["figure.figsize"] = (24, 12)
X[cols_ForHist].hist()
plt.rcParams["figure.figsize"] = (8, 6)
from sklearn.preprocessing import FunctionTransformer

I have treated some features as categorical because they hold day counts or special codes. For example, pdays uses 999 to mean the client was not previously contacted. For those reasons I have considered them categorical.

Left Skewed Distributions

We can use Exponential or power transformation to reduce left-skewed distributions.

A) Transformation of Column - Euribor3m

exponential_transformer = FunctionTransformer(np.exp,validate=True)
exponential_transformer.fit(X[['euribor3m']])
X['euribor3m'] = exponential_transformer.transform(X[['euribor3m']])
X['euribor3m'].hist()

B) Transformation of Column - Emp_var_rate

exponential_transformer = FunctionTransformer(lambda x: x ** 2)
exponential_transformer.fit(X[['emp_var_rate']])
X['emp_var_rate'] = exponential_transformer.transform(X[['emp_var_rate']])
X['emp_var_rate'].hist()

C) Transformation of Column - Nr_employed

power_transformer = FunctionTransformer(lambda x: x ** 2, validate=True)  # squares the values (a power transform, not a log transform)
power_transformer.fit(X[['nr_employed']])
X['nr_employed'] = power_transformer.transform(X[['nr_employed']])
X['nr_employed'].hist()

Step 03 — Apply Suitable Feature Discretization

Feature discretization refers to the process of partitioning or converting continuous feature values into discrete or nominal intervals.

Here I have created bins for the age feature, because we can treat people in a similar age range (for example, 18 to 29) as one category.

bins = [18, 30, 40, 50, 60, 70, 120]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']
X['age'] = pd.cut(X['age'], bins, labels = labels,include_lowest = True)
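As an optional quick check, we can count how many records fall into each age bucket to confirm the binning behaves as expected:

# Count the records in each age bucket created by pd.cut
X['age'].value_counts().sort_index()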

Step 04 — Apply Suitable Feature Coding Techniques

ML models require all input and output values to be numerical. So if your dataset has categorical data, you must encode it into numbers before fitting and evaluating a model. There are several methods available for this task, such as one-hot encoding and integer (label) encoding. Here I have used one-hot encoding.

cols = X.columns
num_cols = X._get_numeric_data().columns
cat_cols = list(set(cols) - set(num_cols))
add_cat = ['pdays', 'previous', 'campaign']
for x in add_cat:
    cat_cols.append(x)
for col in X[cat_cols]:
    print(col, "--->", X[col].unique())
    print("")
for col in X[cat_cols]:
    print(col + '-Values ')
    print(X[col].value_counts())
    print("")

From the above output you can see the unique values in the dataset and their counts.

X = X.drop(['default'], axis=1)
cat_cols.remove('default')

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()  # not used below; pd.get_dummies performs the one-hot encoding

dummies = []
for col in X[cat_cols]:
    temp_dummies = pd.get_dummies(X[col], prefix=col)
    dummies += temp_dummies.columns.to_list()
    X = pd.concat([X, temp_dummies], axis=1)
    X = X.drop(col, axis=1)
print(dummies)

From the output below you can see the dummy columns created in the dataset.

xa_Encode = X.copy()
xa_Encode.isnull().sum()

Using the above code, I have checked for null values in the dataset before scaling.

Step 05 — Scale and/or standardize the features

Data standardization rescales one or more features so that they have a mean of 0 and a standard deviation of 1. Standardization assumes that your data has a Gaussian (bell curve) distribution; this does not strictly have to be true, but the technique is considered more effective when the feature values are approximately Gaussian.

By following below code I have removed categorical variables in the data set to do the scaling.

Remove_columns_values = dummies
Remove_columns_values.append('y')
X_without_Cat=X.drop(Remove_columns_values, axis = 1)
X_without_Cat.head()

Output of the dataset after removing the categorical features,

I have used StandardScaler from sklearn.preprocessing to do the scaling for the above dataset, as shown in the code below.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_without_Cat)
X_Scaled = scaler.transform(X_without_Cat)
columns_value_new=X_without_Cat.columns
X_Scaled_Except = pd.DataFrame(X_Scaled, columns=columns_value_new)
X_Scaled_Except.head(5)

Output of the dataset after scaling,

plt.rcParams["figure.figsize"] = (24, 12)X_Scaled_Except.hist()

I have produced the relevant graphs to show the scaling/standardizing effect using the above code.
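As an additional optional sanity check, each scaled column should now have a mean close to 0 and a standard deviation close to 1:

# Verify the standardization: means should be ~0 and standard deviations ~1
print(X_Scaled_Except.mean().round(3))
print(X_Scaled_Except.std().round(3))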

data_Final = X_Scaled_Except
for f in dummies:
    data_Final = data_Final.join(xa_Encode[f])

Using the above code, I have joined the categorical values back to the scaled dataset.

Step 06 — Correlation Matrix And Principal Component Analysis ( PCA )

The code below generates the correlation matrix for the given dataset.

import seaborn as sns
plt.rcParams["figure.figsize"] = (24, 8)
sns.heatmap(X_Scaled_Except.corr(),annot=True);
# Seems they can be assumed as independent

Looking at the figures above and below, we can say that nr_employed and euribor3m are highly correlated. But they still show some considerable differences, so it is OK to keep both as they are.

X_Scaled_Except.corr()

We can see the correlation with the target variable y using the code below.

X_Scaled_Except['y'] = xa_Encode['y']
plt.rcParams["figure.figsize"] = (24, 8)
sns.heatmap(X_Scaled_Except.corr(), annot=True);

Looking at the correlation matrix above and the table values below, we can conclude that cons_conf_idx has a very low correlation with y compared to the other features. It is the consumer confidence index, a monthly indicator that measures consumers' overall confidence in the economy and in their own financial situation. That is still something worth considering when predicting whether the client will subscribe to a term deposit, so I did not remove it.

X_Scaled_Except.corr()
Y = data_Final['y']
data_Final_without_Y = data_Final.drop('y', axis=1)

Using the above code, I have separated the target variable y and dropped it from the feature set before doing Principal Component Analysis.

Principal Component Analysis — ( PCA )

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

from sklearn.decomposition import PCA

pca = PCA(n_components=30)
pca.fit(data_Final_without_Y)
X_PCA = pca.transform(data_Final_without_Y)
#X_pca_test = pca.transform(X_test)
X_PCA = pd.DataFrame(data = X_PCA)
pca.explained_variance_ratio_[:30].sum()

By reducing the dimensionality to 30 components, I retained a total explained variance of about 0.93, which is enough to proceed with the remaining steps.
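An optional way to support this choice is to plot the cumulative explained variance and check where the curve flattens. A minimal sketch reusing the fitted pca object from above:

import numpy as np
from matplotlib import pyplot as plt

# Cumulative explained variance as components are added
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.rcParams["figure.figsize"] = (8, 6)
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(y=0.93, color='r', linestyle='--')  # variance retained with 30 components
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.show()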

Using the code below, I have divided the dataset into 20% for testing and 80% for training with train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split

X_class_train, X_test, y_class_train, y_test = train_test_split(
    data_Final_without_Y, Y, test_size=0.2, random_state=0)

Step 07 — Checking for class Imbalance and Handling class Imbalance

Most machine learning algorithms assume the classes are roughly equally distributed. When there is a class imbalance in the dataset, the classifier tends to be biased towards the majority class, leading to poor classification of the minority class. So we have to check for and handle this issue.

The code below checks for class imbalance by plotting the counts of the target variable's values.

import seaborn as sns
plt.rcParams["figure.figsize"] = (8, 6)
data_Final['y'].value_counts()
sns.countplot(x='y', data=data_Final)
plt.show()

We can see from the above plot that our dataset has a class imbalance, so I have used SMOTE. Perhaps the most widely used approach for synthesizing new examples is the Synthetic Minority Oversampling Technique, or SMOTE for short. This procedure can create as many synthetic examples of the minority class as are required.

There are two types of sampling techniques: under-sampling, which removes majority-class points, and over-sampling, which creates artificial minority-class points.

from imblearn.over_sampling import SMOTE

os = SMOTE(random_state=0)
columns = X_class_train.columns
data_X, data_y = os.fit_resample(X_class_train, y_class_train)  # fit_sample in older imblearn versions
smoted_X = pd.DataFrame(data=data_X,columns=columns )
smoted_y = pd.DataFrame(data=data_y,columns=['y'])

Using the code below, we can see the output after applying SMOTE to the training dataset only; we should not apply it to the testing dataset. When using any sampling technique, especially a synthetic one like SMOTE, you should split your data first and then apply the synthetic sampling on the training data only. After training, you evaluate on the testing set, which contains only original samples.

sns.countplot(x='y', data=smoted_y)
plt.show()
X_train = smoted_X
y_train = smoted_y
X_test
y_test

Step 08 — Applying Support Vector Machine (SVM)

SVM is a supervised machine learning approach; in our case the target y is categorical and binary. Here I have used the svm module available in the sklearn library for training and testing.

from sklearn import svm
from matplotlib import pyplot as plt
%matplotlib inline

Here the kernel can be 'rbf', 'poly', or 'sigmoid', among others. After trying different kernels, I chose 'rbf' for training this model. The radial basis function (RBF) kernel is a kernel function used in machine learning to find a non-linear classifier or regression line.

C is the SVM regularization parameter. I have chosen 70 for C.

gamma is the kernel coefficient for 'rbf', 'poly' and 'sigmoid'. As gamma increases, the model tries to fit the training dataset more and more exactly. By trial I selected 0.001.
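If you would rather search these hyperparameters systematically instead of by trial, a hedged sketch using GridSearchCV is below; the grid values are illustrative assumptions, not the ones used in this article, and the search can be slow on the full SMOTE-resampled training set. The actual training with the hand-picked values follows right after.

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid; values chosen for illustration only
param_grid = {
    'C': [1, 10, 70, 100],
    'gamma': [0.0001, 0.001, 0.01],
    'kernel': ['rbf'],
}
grid = GridSearchCV(svm.SVC(), param_grid, cv=3, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train.values.ravel())  # pass y as a 1-D array
print(grid.best_params_, grid.best_score_)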

svc = svm.SVC(kernel='rbf', C=70, gamma=0.001).fit(X_train,y_train)
predictionsvm = svc.predict(X_test)
from sklearn.metrics import classification_report
print(classification_report(y_test,predictionsvm))

The above output is the classification report for the model. I explain this report at the end of the article.

predictionsvm = svc.predict(X_test)
percentage = svc.score(X_test,y_test)
percentage #0.8614955086185967

From the above output we can see the overall prediction accuracy of the model. But we can't evaluate the model by looking at overall accuracy alone, so we have to study it alongside the classification report as well.

from sklearn.metrics import plot_confusion_matrix
plt.rcParams["figure.figsize"] = (8, 10)
fig=plot_confusion_matrix(svc, X_test, y_test,display_labels=["0",'1'],cmap=plt.cm.Blues,values_format = '.2f')
fig.figure_.suptitle("Confusion Matrix ")
plt.show()

Using the above code, we can plot the confusion matrix for the results of the developed model.
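In addition to the confusion matrix, a threshold-independent metric such as ROC AUC is worth checking; a minimal sketch, assuming the 0/1 labels used above, follows.

from sklearn.metrics import roc_auc_score

# Use the signed distance to the decision boundary as the ranking score
y_scores = svc.decision_function(X_test)
print("ROC AUC:", roc_auc_score(y_test, y_scores))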

The above image and classification report summarize our developed model. The true positive count is higher than the false positive count. This means our model does not perform particularly well on the 'yes' class, but its performance is still acceptable: precision for the 1 (yes) class is 0.42, which is lower than for the 0 (no) class. These results are after applying SMOTE. In summary, our model achieves a macro-average precision of 0.68 and a weighted-average precision of 0.89.
