Machine Learning - Analysis of Weather in Szeged 2006–2016 To Predict the Apparent Temperature

Sachchithananthan Thanusan
Published in Analytics Vidhya · 20 min read · Jun 26, 2021


Using Linear Regression

Use case: Is there a relationship between humidity and temperature? What about between humidity and apparent temperature? Can you predict the apparent temperature given the humidity?

Data Set: View Data Set

You can download the data set from the above link if you want to practice along.

First we have to import some libraries. Most Python libraries provide a set of useful functions that eliminate the need to write code from scratch, so I have imported the libraries below.

import numpy as np
import pandas as pd
import scipy.stats as stats
from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler

Pandas is mainly used for data analysis and supports various file formats such as comma-separated values, JSON, SQL, and Microsoft Excel. It also provides data manipulation operations such as merging, reshaping, selecting, and cleaning. I will explain the other libraries wherever they are required.

I have added the code below to mount my Google Drive account in Google Colaboratory so that the files stored on Drive can be accessed. When we execute it, we have to open the generated URL and enter the authorization code it gives us.

from google.colab import drive
drive.mount("/content/gdrive")

After entering the authorization code, you will see a message like ‘Mounted at /content/gdrive’. You can then access the files on your connected Google Drive with the code below.

data = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/data/weatherHistory.csv')

Now you have successfully loaded your data into the variable called ‘data’. For modification purposes, I have copied that data into another variable called ‘X’.

X = data.copy() #dataset has been copied to  X

With the code below you can see the top 10 rows stored in that variable. Pandas provides a method called head() that is widely used to return the top n rows of a DataFrame or Series; by default it returns the top 5 rows.

X.head(10)

Step 01 Data Pre-Processing

Data pre-processing is a key step in machine learning, because the useful information that can be derived from the data set directly affects the quality of the model. It is therefore extremely important to do at least the necessary pre-processing before feeding the data into our model.

Step 01 A -> Data Cleansing — Identifying and Handling Missing Values and Duplicate Records

With the code below we can identify the missing values in each column of the data:

X.isnull().any()

In the above output we can see that Precip Type is True while all the other columns are False, so we can conclude that Precip Type has some missing values and the remaining features do not. We have to handle those missing values; first, we check what percentage of the values are missing.

X['Precip Type'].value_counts() # To get the count of each available types
(X['Precip Type'].isnull().sum()/len(X.index) )*100 # To check the null values percentage in available data set

From the above percentage we can see that the proportion of null values in the data is very small, so we can drop those rows and keep the rest for further pre-processing. The code below drops the null values:

X = X.dropna() #drop all nan values in given data set
X.isnull().sum() # For the verification.

In the above output all the values are zero, so we can conclude that there are no NaN values left in the data set.

With the code below we can check for duplicate rows in the data set. In the output you can see True for 24 rows and False for 95912 rows, so we can conclude that the data set has 24 duplicated rows.

print(X.duplicated().value_counts()) # To check duplicated values
print(X[X.duplicated()]) # To check view the duplicated values
X=X.drop_duplicates() # To drop the duplicate values
print(X.duplicated().value_counts()) # To check duplicated values

From the final output we can confirm that no duplicated rows remain in the data set.

Step 01 B -> Data Cleansing — Identifying and handling the Outliers

An outlier is an observation that deviates significantly from the other observations. It can be caused by different types of errors, so we should analyse the outliers in the data set in order to build a good-quality model.

With the code below we can get the names of the continuous (numeric) columns:

plt.rcParams["figure.figsize"] = (22, 3)
X._get_numeric_data().columns.tolist()

The output of the above code is ['Temperature (C)', 'Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)', 'Wind Bearing (degrees)', 'Visibility (km)', 'Loud Cover', 'Pressure (millibars)'], so we can run the analysis on those columns.
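Before looking at the plots, one common way to quantify outliers is the 1.5 × IQR rule: values that fall more than 1.5 times the interquartile range outside the first or third quartile are flagged. The helper below is only an illustrative sketch (the iqr_outlier_count function and the 1.5 threshold are my own additions, not part of the original workflow); the article itself relies on box plots and domain checks.

def iqr_outlier_count(series):
    # Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((series < lower) | (series > upper)).sum()

for col in X._get_numeric_data().columns:
    print(col, iqr_outlier_count(X[col]))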

Box plot and Q-Q Plot of Temperature (C) column

With the code below we can make a box plot of the given column to check for outliers:

temp_df = pd.DataFrame(X, columns=['Temperature (C)'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X["Temperature (C)"], dist="norm", plot=plt)
plt.show()

Box plot and Q-Q Plot of Apparent Temperature (C) column

With the code below we can make a box plot of the given column to check for outliers:

plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['Apparent Temperature (C)'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Apparent Temperature (C)'], dist="norm", plot=plt)
plt.show()

Box plot and Q-Q Plot of Wind Speed (km/h) column

With the code below we can make a box plot of the given column to check for outliers:

plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['Wind Speed (km/h)'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Wind Speed (km/h)'], dist="norm", plot=plt)
plt.show()

Box plot and Q-Q Plot of Wind Bearing (degrees) column

With the code below we can make a box plot of the given column to check for outliers:

plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['Wind Bearing (degrees)'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Wind Bearing (degrees)'], dist="norm", plot=plt)
plt.show()

Box plot and Q-Q Plot of Visibility (km) column

With the code below we can make a box plot of the given column to check for outliers:

plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['Visibility (km)'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Visibility (km)'], dist="norm", plot=plt)
plt.show()

Box plot and Q-Q Plot of Loud Cover column

With the code below we can make a box plot of the given column to check for outliers:

plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['Loud Cover'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Loud Cover'], dist="norm", plot=plt)
plt.show()

We can look at the value counts of ‘Loud Cover’ with the following command:

X['Loud Cover'].value_counts()

The output shows a single value, 0.0, with a count of 95912.

From the above values and the box plot of ‘Loud Cover’ we can conclude that there is no use in keeping that column, because all of its values are the same (zero). So we can drop it from our data set with the code below:

X = X.drop('Loud Cover', axis = 1)
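More generally, constant columns like this can be detected programmatically; the snippet below is a small optional sketch I have added, not one of the original steps.

# List columns that contain only a single unique value (as 'Loud Cover' did before being dropped)
constant_columns = [col for col in X.columns if X[col].nunique(dropna=False) <= 1]
print(constant_columns)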

Box plot and Q-Q Plot of Pressure (millibars) column

With the code below we can make a box plot of the given column to check for outliers:

plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['Pressure (millibars)'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Pressure (millibars)'], dist="norm", plot=plt)
plt.show()

From the above box plot and the counts below we can see some anomalies.

X['Pressure (millibars)'].value_counts()

With the code below we can get descriptive statistics of the given column, such as count, mean, std, min, etc.:

X['Pressure (millibars)'].describe()

With the code below we can plot the values on a graph to check how they change:

plt.rcParams["figure.figsize"] = (500, 8)
plt.plot(X['Pressure (millibars)'].tolist(), label="Pressure")
plt.show()

In the above graph we can see a considerable number of pressure readings that drop to zero. These were probably caused by measurement-equipment errors: the widely accepted boundary where space begins, which is also the point where air pressure is assumed to reach zero, is the Kármán line, located about 100 km (62 mi) up. So there is no realistic possibility of zero or negative values in that column.

With the code below we can calculate the percentage of values in the column that are zero or below:

len(X[(X['Pressure (millibars)']<=0.0) ])* 100/len(X)

1.342897656184836

With the code below we reset the index of the data frame to avoid problems in later steps:

X=X.reset_index(drop=True)

With the code below we assign np.nan to the zero (and negative) pressure values so that suitable values can be imputed later:

X.loc[X.index[X['Pressure (millibars)']<=0.0].tolist(), ['Pressure (millibars)']] =np.nan

We can check whether the values were assigned correctly with the following code:

X['Pressure (millibars)'].isnull().sum() # Verification

The output is 1288.

With the code below we impute suitable values for the np.nan entries in the pressure column:

from sklearn.impute import SimpleImputer
imputer = SimpleImputer()                 # default strategy: replace NaN with the column mean
imputer.fit(X[['Pressure (millibars)']])
X['Pressure (millibars)'] = imputer.transform(X[['Pressure (millibars)']])

We can check whether SimpleImputer filled in the values correctly with the following code:

X['Pressure (millibars)'].isnull().sum() # Verification

0

After using SimpleImputer, we can visualize the column on a graph again:

plt.plot(X['Pressure (millibars)'].tolist(), label="Pressure")
plt.show()

Probability Plot after using Simple Imputer,

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Pressure (millibars)'], dist="norm", plot=plt)
plt.show()

Box plot and Q-Q Plot of Humidity column

With the code below we can make a box plot of the given column to check for outliers:

plt.rcParams["figure.figsize"] = (22, 3)
temp_df = pd.DataFrame(X, columns=['Humidity'])
temp_df.boxplot(vert=False)

With the code below we can draw a probability (Q-Q) plot for the given column to check its distribution pattern, value range, etc.:

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(X['Humidity'], dist="norm", plot=plt)
plt.show()


From the above plot we can see that some humidity values are 0. In a normal environment it is practically impossible for humidity to be zero or below.

plt.rcParams["figure.figsize"] = (500, 8)
plt.plot(X['Humidity'].tolist(), label="Humidity")
plt.show()

The above code plots the values so that we can see how they change.

We can use the following code to check the percentage of zero values over the data set.

len(X[(X['Humidity']<=0.0) ])* 100/len(X)

0.022937692885144717

With the code below we can get descriptive statistics of the given column, such as count, mean, std, min, etc.:

X['Humidity'].describe()

From the above evidence we can say that Humidity has very few outliers, so with the code below we can drop those rows from the data set:

X.drop(X[X['Humidity'] == 0].index, inplace = True)

If we reset the index after each set of dropped rows with the code below, we can avoid errors later when we join columns or split the data set.

X=X.reset_index(drop=True)

Step 01 C -> Data Coding

ML models require all input and output values to be numerical. So if your dataset has categorical data, you must encode it into numbers before fitting and evaluating a model. Several methods are available for this, such as one-hot encoding and integer (label) encoding. Here I have used integer (label) encoding, because one-hot encoding does not automatically handle categories that appear only in the test set.

With the code below we can encode ‘Precip Type’ and ‘Summary’:

from sklearn.preprocessing import LabelEncoder
labelencoder =LabelEncoder()
X['Precip_Types_cat']=labelencoder.fit_transform(X['Precip Type'])
X = X.drop('Precip Type', axis = 1)
X['Summary_cat']=labelencoder.fit_transform(X['Summary'])
X = X.drop('Summary', axis = 1)

After encoding, I dropped the original columns; that code is also included in the snippet above.

Step 01 D -> Feature Discretization

Feature discretization refers to partitioning or converting the continuous values of a feature into discrete or nominal intervals. Here I have used KBinsDiscretizer from sklearn.preprocessing for this task.

You can follow the code below to discretize ‘Wind Bearing (degrees)’, which takes values from 0 to 359 degrees.

Wind directions are usually grouped into 16 compass sectors, as in the picture above, so I have chosen n_bins = 16 for the rest of the work.

from sklearn.preprocessing import KBinsDiscretizer
discretizer = KBinsDiscretizer(n_bins=16, encode='ordinal', strategy='uniform')
discretizer.fit(X[['Wind Bearing (degrees)']])
X['Wind Bearing (degrees)'] = discretizer.transform(X[['Wind Bearing (degrees)']])

Here I have applied label encoding to the resulting bins:

X['Wind_Bearing_cat']=labelencoder.fit_transform(X['Wind Bearing (degrees)'])
X = X.drop('Wind Bearing (degrees)', axis = 1)
X.head() # For the verification

You can see the effect of the feature discretization in the last column of the picture below.

With the code below you can check for null values before continuing:

X.isnull().any()

Step 01 E -> Data Leakage Handling

To handle data leakage, we separate the training data and the test data before applying any transformations. The mean and variance used for normalization (or the statistics for any other transformation) must be computed only on the training data; those values are used to transform the training data itself, and then the same values are used to transform the test data. Including the test set in the computation would allow information to flow from the test data into the training data, and therefore into the model that learns from it, letting the model cheat and introducing a bias.
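Here is a minimal, self-contained sketch of that fit-on-train-only pattern using a toy array (the toy_* names are illustrative and not part of this workflow; the real scaling for our data is done in Step 03):

from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0], [4.0]])
toy_test = np.array([[10.0], [0.5]])

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)                       # statistics (mean, variance) come from the training data only
print(toy_scaler.transform(toy_train).ravel())  # training data transformed with its own statistics
print(toy_scaler.transform(toy_test).ravel())   # test data transformed with the SAME statistics, no refitting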

With the code below we split the data set into two parts. test_size=0.2 means 20 % of the rows are kept for testing and 80 % for training, with random_state=42. random_state can be 0, 1, or any other integer, but it must stay the same if you want to reproduce the same split across multiple executions of the code.

from sklearn.model_selection import train_test_split
train_X, test_X = train_test_split(X, test_size=0.2, random_state=42)

If we reset the index after each set of dropped rows with the code below, we can avoid errors later when we join columns or split the data set.

train_X=train_X.reset_index(drop=True) # Reset the dataframe Index
len(train_X) # look the length for verification

The output is 76712, which is the length of train_X.

test_X.head() # look  for verification

Step 02 Data Transformations

We usually prefer the data to come from a normal distribution when training a model, but features in real datasets typically follow a skewed distribution. By applying different types of transformations to these variables, chosen according to their skewness, we can map a skewed distribution closer to a normal one.
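If you want a numeric view of that skewness before choosing a transformation, pandas provides a skew() method (this check is an optional aside I have added, not a step from the original article): positive values indicate right skew, negative values left skew.

# Optional: numeric skewness of the continuous training columns (positive = right-skewed, negative = left-skewed)
continuous_cols = ['Temperature (C)', 'Apparent Temperature (C)', 'Humidity',
                   'Wind Speed (km/h)', 'Visibility (km)', 'Pressure (millibars)']
print(train_X[continuous_cols].skew())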

With the code below we can generate histograms for the continuous columns in our data set; from them we can see how each feature is distributed.

plt.rcParams["figure.figsize"] = (24, 12)
train_X[['Temperature (C)','Apparent Temperature (C)','Humidity','Wind Speed (km/h)','Visibility (km)','Pressure (millibars)']].hist()

You can see the output of the above code in the image below.

Data Transformation of Humidity

With the code below we can create a single histogram for the given column:

plt.rcParams["figure.figsize"] = (24, 6)
train_X['Humidity'].hist()

From the above histogram we can see that Humidity has a left-skewed distribution, so we can use an exponential or power transformation to reduce the left skew.

from sklearn.preprocessing import FunctionTransformer
exponential_transformer = FunctionTransformer(np.exp, validate=True)
exponential_transformer.fit(train_X[['Humidity']])
train_X['Humidity'] = exponential_transformer.transform(train_X[['Humidity']])
test_X['Humidity'] = exponential_transformer.transform(test_X[['Humidity']])

In the transformation code above we fit only on train_X[['Humidity']] to avoid data leakage.

With the code below we can look at the histogram after the exponential transformation:

train_X['Humidity'].hist()

Data Transformation of Wind Speed (km/h)

With the code below we can create a single histogram for the given column:

train_X['Wind Speed (km/h)'].hist()

From the above histogram we can see that Wind Speed (km/h) has a right-skewed distribution, so we can use a square-root transformation to reduce the right skew.

Square_root_transformer = FunctionTransformer(np.sqrt,validate=True)
Square_root_transformer.fit(train_X[['Wind Speed (km/h)']])
train_X['Wind Speed (km/h)'] = Square_root_transformer.transform(train_X[['Wind Speed (km/h)']])
test_X['Wind Speed (km/h)'] = Square_root_transformer.transform(test_X[['Wind Speed (km/h)']])

In the transformation code above we fit only on train_X[['Wind Speed (km/h)']] to avoid data leakage.

With the code below we can look at the histogram after the square-root transformation:

train_X['Wind Speed (km/h)'].hist()

Data Transformation of Visibility (km)

With the code below we can create a single histogram for the given column:

train_X['Visibility (km)'].hist()

From the above histogram we can see that Visibility (km) has a left-skewed distribution, so we can use an exponential or power transformation to reduce the left skew.

exponential_transformer = FunctionTransformer(np.exp,validate=True)
exponential_transformer.fit(train_X[['Visibility (km)']])
train_X['Visibility (km)'] = exponential_transformer.transform(train_X[['Visibility (km)']])
test_X['Visibility (km)'] = exponential_transformer.transform(test_X[['Visibility (km)']])

In the transformation code above we fit only on train_X[['Visibility (km)']] to avoid data leakage.

With the code below we can look at the histogram after the exponential transformation:

train_X['Visibility (km)'].hist()

With the code below we can drop the ‘Daily Summary’ column (a dependent feature we do not need):

train_X = train_X.drop('Daily Summary', axis = 1) 
test_X = test_X.drop('Daily Summary', axis = 1)# Dependent Feature so we can remove that

Step 03 Data Standardization

Data standardization rescales one or more features so that they have a mean of 0 and a standard deviation of 1. Standardization assumes that your data follows a Gaussian (bell curve) distribution. That does not strictly have to be true, but the technique is considered most effective when the feature distribution is approximately Gaussian.
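As a quick illustration of the formula StandardScaler applies, z = (x − mean) / std, here is a tiny made-up example (the demo array is purely illustrative, not the weather data):

from sklearn.preprocessing import StandardScaler

demo = np.array([[10.0], [20.0], [30.0]])
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)   # population std, the same as StandardScaler uses
print(manual.ravel())                                    # approximately [-1.2247, 0, 1.2247]
print(StandardScaler().fit_transform(demo).ravel())      # identical result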

We have to remove the categorical features (along with the date column and the target, Apparent Temperature) before standardization. The code below removes them:

Remove_columns_values = ['Formatted Date','Precip_Types_cat','Summary_cat','Apparent Temperature (C)','Wind_Bearing_cat']
train_X_without_Cat = train_X.drop(Remove_columns_values, axis = 1)
test_X_without_Cat=test_X.drop(Remove_columns_values, axis = 1)
train_X_without_Cat.head(10) # For the verification

We can see the output after removing those columns:

To verify the index and avoid future errors:

train_X.head()

With the code below we apply standardization by calling StandardScaler:

scaler = StandardScaler() 
train_X_Except =train_X_without_Cat
test_X_Except =test_X_without_Cat
scaler.fit(train_X_Except)
train_X_Scaled = scaler.transform(train_X_Except)
test_X_Scaled = scaler.transform(test_X_Except)

In the above code we fit the scaler only on train_X_Except to avoid data leakage.

With the code below we can put the standardized data back into a DataFrame:

columns_value_new=train_X_without_Cat.columns
train_X_Scaled_Except = pd.DataFrame(train_X_Scaled, columns=columns_value_new)
train_X_Scaled_Except.head(10) # data set after Standardization

Output for the verification,

Same for the test data set.

columns_value_new=test_X_without_Cat.columns
test_X_Scaled_Except = pd.DataFrame(test_X_Scaled, columns=columns_value_new)
test_X_Scaled_Except.head(10) # data set after Standardization

Output for the verification,

Viewing the data set as histograms shows the effect of scaling/standardization:

plt.rcParams["figure.figsize"] = (24, 12)
train_X_Scaled_Except.hist()

Step 04 Correlation Matrix And Principal Component Analysis ( PCA )

Identify significant and independent features

With the code below we can generate the correlation matrix for the data set:

import seaborn as sns
plt.rcParams["figure.figsize"] = (24, 8)
sns.heatmap(train_X_Scaled_Except.corr(),annot=True); # Seems they can be assumed as independent

The picture below shows the correlations between the features.

From the above correlation matrix we can conclude that Humidity and Temperature are more strongly correlated with each other than the other feature pairs are, though not extremely so, so we can keep both features for training the model.

# With the table below (and the heatmap above) we can see the correlation of each feature with the others
train_X_Scaled_Except.corr()

From the above table we can see the correlation values between each pair of features.

Now we have to join Apparent Temperature (C) back to the variable above so we can look at the correlation between Apparent Temperature and the other features. The code below does that:

train_X_Scaled_With_Y = train_X_Scaled_Except.join(train_X['Apparent Temperature (C)'])
train_X_Scaled_With_Y.isnull().sum() # Check for the null values
test_X_Scaled_With_Y = test_X_Scaled_Except.join(test_X['Apparent Temperature (C)'])
test_X_Scaled_With_Y.isnull().sum() # Check for the null values

With the code below we can generate a correlation matrix for all the features:

sns.heatmap(train_X_Scaled_With_Y.corr(),annot=True); # Seems they can be assumed as independent

From the above correlation matrix and the table below we can conclude that Humidity and Temperature have a relatively strong negative correlation compared to the other features, and that Apparent Temperature has a very high positive correlation with Temperature and a negative correlation with Humidity.

# test the significance of the features
train_X_Scaled_With_Y.corr()

The above code generates a table showing the correlation values in detail.

For the training data set, we have to join the categorical variables back before moving on to PCA.

train_X_For_PCA =train_X_Scaled_Except.join(train_X['Precip_Types_cat'])
train_X_For_PCA = train_X_For_PCA.join(train_X['Summary_cat'])
train_X_For_PCA = train_X_For_PCA.join(train_X['Wind_Bearing_cat'])
#train_X_For_PCA.isnull().sum()

For the test data set, we have to join the categorical variables back in the same way.

test_X_For_PCA = test_X_Scaled_Except.join(test_X['Precip_Types_cat'])
test_X_For_PCA = test_X_For_PCA.join(test_X['Summary_cat'])
test_X_For_PCA = test_X_For_PCA.join(test_X['Wind_Bearing_cat'])
#test_X_For_PCA.isnull().sum()

Principal Component Analysis — ( PCA )

Principal Component Analysis (PCA) is a dimensionality-reduction method that is often used on large data sets: it transforms a large set of variables into a smaller one that still contains most of the information in the original set.

from sklearn.decomposition import PCA
train_X_PCA_data = train_X_For_PCA
test_X_PCA_data = test_X_For_PCA
pca = PCA(n_components=7)
pca.fit(train_X_PCA_data)                     # fit only on the training data to avoid leakage
train_X_pca = pca.transform(train_X_PCA_data)
test_X_pca = pca.transform(test_X_PCA_data)   # reuse the components learned from the training data
train_X_principalDf = pd.DataFrame(data = train_X_pca)
test_X_principalDf = pd.DataFrame(data = test_X_pca)
train_X_principalDf.head(10)

Training set X values after performing Principal Component Analysis:

Test set X values after performing Principal Component Analysis:

test_X_principalDf.head(10)

We can check the shape of the training set X values after performing Principal Component Analysis:

print(train_X_principalDf.shape)

(76712, 7)

With the code below we can look at the explained variance ratio values of the Principal Component Analysis:

pca.explained_variance_ratio_

array([0.49043775, 0.40130885, 0.03932747, 0.02414914, 0.01985586,0.01804037, 0.00558572])

The pca.explained_variance_ratio_ attribute returns a vector of the fraction of variance explained by each dimension: pca.explained_variance_ratio_[i] gives the variance explained by the (i+1)-th component.

With the code below you can generate the cumulative explained variance graph:

plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
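If you wanted to choose the number of components automatically instead of fixing n_components=7, the cumulative ratio can be compared against a threshold. This is an optional sketch I have added (the 0.95 threshold is an arbitrary example value); with the ratios printed above it would select 4 components.

# Sketch: smallest number of components whose cumulative explained variance reaches 95 %
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components_95, cumulative[n_components_95 - 1])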

Step 05 Linear Regression

Linear regression is a supervised machine learning algorithm in which the predicted output is continuous and has a constant slope. It is used to predict values within a continuous range rather than to classify them into categories.

from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

We have to create training and testing variables without any data leakage, which is why I split the data before the transformations and scaling:

X_train = train_X_principalDf
y_train =train_X['Apparent Temperature (C)']
X_test =test_X_principalDf
y_test = test_X['Apparent Temperature (C)']
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

With the code below we can fit the model on the training set and generate predictions for the test set:

lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
y_hat = lm.predict(X_test)

With the code below you can put the actual values, the predicted values, and their difference into a DataFrame:

test_df =pd.DataFrame({
'actual':y_test,
'prediction':y_hat,
'diff':(y_test-y_hat)})
test_df.head(10)

Sometimes the row indexes in the above DataFrame are not in order, so we can reset the index with the code below to avoid problems later:

test_df = test_df.reset_index(drop=True)
test_df.head(10) # For verification: after resetting the index you can see that it starts from zero

You can get the weight (W) parameters of the model with the code below:

print(lm.coef_)

The output is [-0.10932573 -0.36122572 6.42501305 -2.51270569 -0.80607153 -4.50456051 -6.51199797]. coef_ gives an array of weights estimated by the linear regression. If the weights have very large values, we should normalize the features and apply regularization to the model to shrink the weights.
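If the weights did need to be shrunk, one option would be ridge regression (an L2 penalty). The sketch below is only an illustration and is not part of the results reported in this article; alpha=1.0 is a placeholder value.

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)            # alpha controls the strength of the L2 penalty (illustrative value)
ridge.fit(X_train, y_train)
print(ridge.coef_)                  # typically smaller in magnitude than the unregularized weights
print(ridge.score(X_test, y_test))  # R^2 on the test set for comparison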

You can get the intercept of the model with the code below:

print(lm.intercept_)

The output is 10.854332003684775. The constant term in linear regression, also known as the y-intercept, is simply the value at which the fitted line crosses the y-axis, i.e. the prediction when all the features are zero.
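To make the role of the weights and the intercept concrete, a linear model's prediction is just the dot product of the features with coef_ plus intercept_. The check below is a small sanity test I have added, not part of the original article:

# Sanity check: X_test @ coef_ + intercept_ should reproduce lm.predict(X_test)
manual_pred = X_test.values @ lm.coef_ + lm.intercept_
print(np.allclose(manual_pred, y_hat))   # expected output: True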

With the code below we can plot a graph of the predictions vs. the actual values:

plt.plot(test_df['prediction'][:500], label = "Pred")  # Plot the first 500 predicted values with label 'Pred'
plt.plot(test_df['actual'][:500], label = "Actual") # Plot the first 500 actual values with label 'Actual'
plt.xlabel('x - axis') # Set the x axis label of the current axis.
plt.ylabel('y - axis') # Set the y axis label of the current axis.
plt.title('Predicitons vs Actual') # Set a title of the current axes.
plt.legend() # show a legend on the plot
plt.show() # Display a figure.

The output plot is shown below.

With the code below we can calculate the root mean squared error (RMSE) of the trained model:

from math import sqrt
from sklearn.metrics import mean_squared_error
rmsq = sqrt(mean_squared_error(y_test, y_hat))
rmsq

The output is 1.2070026749031717

With the code below we can calculate the mean squared error (MSE) of the trained model:

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_hat)

The output is 1.4568554572234116

With the code below we can calculate the proportion of variance explained by the predictions (the R² score on the test set):

print(lm.score(X_test,y_test))

The output is 0.9873428171393527

With the code below we can calculate the R² score of the model on the test predictions. Note that metrics.r2_score(y_test, y_hat) returns the same value as lm.score(X_test, y_test) above; despite the label it is not a cross-validated score.

# Necessary imports: 
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics
accuracy = metrics.r2_score(y_test, y_hat)
print("Cross-Predicted Accuracy:", accuracy)

Cross-Predicted Accuracy: 0.9873428171393528
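For a genuinely cross-validated estimate (which is what the imported cross_val_score and cross_val_predict are intended for), the sketch below runs 5-fold cross-validation on the training data; cv=5 is an arbitrary choice and the scores will differ slightly from the single test-set R² above.

# Sketch: 5-fold cross-validated R^2 on the training data
cv_scores = cross_val_score(lm, X_train, y_train, cv=5, scoring='r2')
print(cv_scores)
print("Mean CV R^2:", cv_scores.mean())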

With the code below you can create a distribution plot, which we use to compare the distributions of the actual and predicted values for our model. A distribution plot can be made with sns.distplot() (deprecated in newer seaborn versions in favour of sns.kdeplot() or sns.histplot()):

plt.rcParams["figure.figsize"] = (24, 8)
sns.distplot(y_test,hist=False,color ="r",label="Test")
sns.distplot(y_hat,hist=False,color ="b",label="HAT")

Thank You For Reading My Article. By Thanusan S.
