 # Logistic Regression

You may refer this literature for mathematical explanation of below implemented algorithm 1) http://web.pdx.edu/~newsomj/da2/ho_logistic.pdf

All codes discussed here can be found at my Github repository

For effective learning I suggest, you to calmly go through the explanation given below, run the same code from Github and then read mathematical explanation from above given links.

Code compatibility : Python 2.7 Only

To get this code running execute main.py file as given in GitHub repository.

Logistic Regression is very similar to the procedure describe in tutorial of Ordinary Least Square Regression (OLSR). in Logistic regression we only change from linear activation function $γ = β_0 + m_1X_1$ to Logistic function $(1/1+e-(β_0 + β_1X_1)))$. We will discuss, what is logistic function in detail.

So far we have seen two variable in data x any y. where x independent variable and value of y (dependent) depends on x. But real life data often remain multivariate. There can be many independent variable $X:[x1, x2, x3]$ and one dependent variable $Y$.

For single variable problem we can represent linear and logistic regression as follow : Figure 1. Difference between Linear regression (OLSR) and logistic Regression.>

To illustrate what is linear regression and logistic regression, I have applied two equation (linear and logistic) on the same set of the data (X1 and X2)and coefficients (Bo,B1 and B2) in Microsoft excel sheet and the resulting graph is as shown below. Figure 2. Practically difference between linear and logistic regression.">

Both the function I have plotted here in below given plots you can clearly see, there is linear increase in linear function and exponential with exponential function. with exponential function we can have chance to perform better on nonlinear data. Figure 3(A) : Linear regression curve"> ​​

Figure 3(B) : Logistic regression curve">

In blog I will use logistic regression on "Pima Indians Diabetes Database". 1. Number of times pregnant 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test 3. Diastolic blood pressure (mm Hg) 4. Triceps skin fold thickness (mm) 5. 2-Hour serum insulin (mu U/ml) 6. Body mass index (weight in kg/(height in m)^2) 7. Diabetes pedigree function 8. Age (years) 9. Class variable (0 or 1) In present case based on variable 1 to 8 it is decided that what is the chance of a person being diabetic (class 1) or not (class 0). I have coded this tutorial in modular way, main algorithm function are kept in logitImplementation.py and other supporting function like reading file, minmax calculator and others kept in in dataUtils.py file. # Step 1. Read the file ion csv and keep it in form of array of floats. we have two function in dataUtils.py 1) loadFromcsv - to load data fro csv file. def loadFromcsv(self,fileName): python """ load a file and conver to 2d python list and return it :param fileName: csv file name with absolute path Example file - pima-indians-diabetes.data Test the script using following code loadDataInstance = loadData() print loadDataInstance.loadFromcsv('pima-indians-diabetes.data') e.g. https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data :return: 2D arrat [list of [list]] e.g. [['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1'], ['1', '85', '66',...]..[]...] """ try: data = list(csv.reader(open(fileName))) return data except: print (traceback.print_exc())  2) convertDataToFloat - convert to data to float python def convertDataToFloat(self,dataset): """ loadFromcsv function returns data as list of list of strings, It must be converted to floats for further processing code can be tested through below given snippet loadDataInstance = loadData() dataset = loadDataInstance.loadFromcsv('pima-indians-diabetes.data') print loadDataInstance.convertDataToFloat(dataset) :param dataset: :return: dataset in floats """ for row in dataset: for i in range(len(row)): row[i] = float(row[i]) return dataset  # Step 2. min-max normalization Min-max makes attribute data is scaled to fit into a specific range. As we can see in the 'pima-indians-diabetes.data' there are 8 variable in different range. For effective learning it is very much required to convert all these variables in one range 0-1. This can be done using min max normalization. To know about min max normalization please refer this tutorial. We have functions in dataUtils.py to accomplish this task. 1) minMaxCalculator - will calculate min and max value for each of the eight variables. def minMaxCalculator(self,dataset): python """ :param dataset: data set is expected to be 2D matrix :return: minMax ==> list of list, e.g. [[12, 435], [13, 545], [5, 13424], [34, 454], [5, 2343], [4, 343]] To run this individual function for testing use below given code minMaxNormalizationInstance = minMaxNormalization() minMaxNormalizationInstance.minMaxCalculator([[12,13,13424,34,2343,343],[435,545,5,454,5,4],[43,56,67,87,89,8]]) """ minMax = list() for columnNo in range(len(dataset)): """ len(dataset) is number of elements present in the row e.g. columns iterating by column e.g. this is the column then, [[12,13,13424,34,2343,343],[435,545,5,454,5,4],[43,56,67,87,89,8]] """ columnvalues = [] for row in dataset: """ going to each row for particular column """ columnvalues.append(row[columnNo]) """ e.g. columnvalues [12, 435, 43] at the end """ minimumInColumn = min(columnvalues) maximumInColumn = max(columnvalues) """ this will be the min and max value for first column 12 435 """ minMax.append([minimumInColumn,maximumInColumn]) """ This will be in minMax list at the end of all iteration on all column where each sublist represent min and max value for that column respectively [[12, 435], [13, 545], [5, 13424], [34, 454], [5, 2343], [4, 343]] """ return minMax  For our case this function would return [[0.0, 17.0], [0.0, 199.0], [0.0, 122.0], [0.0, 99.0], [0.0, 846.0], [0.0, 67.1], [0.078, 2.42], [21.0, 81.0], [0.0, 1.0]], where each sub-array e.g. [0.0, 17.0] represents min and max value for given attribute column. 2) normalizeDatasetUsingMinmax - wil use above information about min and max value in each column and by usibg below given formula it will calculate resize all data between 0 and 1
$$z_i = \frac{x_i - min(x)}{max(x) - min(x)}$$ Where, $X_i$ is any data point in column $x$. $min (x)$ is minimum value in column $x$ as calculated above $max (x)$ is maximum value in column $x$ as calculated above
python def normalizeDatasetUsingMinmax(self,dataset,minMax): """ Actual implementation of min max normalization where it accepts data set and minmax value for each column y= (x-min)/(max-min) :param dataset: dataset in 2d array :param minMax: [[12, 435], [13, 545], [5, 13424], [34, 454], [5, 2343], [4, 343]] :return: will return min max value e.g. [[0.0, 0.0, 1.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0, 0.0, 0.0], [0.07328605200945626, 0.08082706766917293, 0.004620314479469409, 0.1261904761904762, 0.03592814371257485, 0.011799410029498525]] This snippet of code can be tested using following code minMaxNormalizationInstance = minMaxNormalization() dataset = [[12, 13, 13424, 34, 2343, 343], [435, 545, 5, 454, 5, 4], [43, 56, 67, 87, 89, 8]] minmax = minMaxNormalizationInstance.minMaxCalculator(dataset) print minMaxNormalizationInstance.normalizeDatasetUsingMinmax(dataset, minmax) """ for row in dataset: for eachColumnNo in range(len(row)): """ where minMax[eachColumnNo] = max for the given column minMax[eachColumnNo] = min for the given column """ row[eachColumnNo] = float((row[eachColumnNo]-minMax[eachColumnNo]))/float((minMax[eachColumnNo]-minMax[eachColumnNo])) # re-assigning minimized value to array named row return dataset  After application of normalizeDatasetUsingMinmax, each row in data-set would look like this (between 0 and 1) : [0.11764705882352941, 0.37185929648241206, 0.0, 0.0, 0.0, 0.0, 0.010247651579846282, 0.016666666666666666, 0.0] # Step 3. As we have data-set ready, let split it into train and test. 1) basicSplitter - I have designed a basic data splitter, that would devide data into 70%(train) and 30%(test) We have function in dataUtils.py to accomplish this task. python def basicSplitter(self,dataset): """ Just take the dataset and split it in to 70% and 30% ration :param dataset: use following code to run this code snippet loadDataInstance = loadData() dataset = loadDataInstance.loadFromcsv('pima-indians-diabetes.data') dataset = loadDataInstance.convertDataToFloat(dataset) splitToTrainTestInstance = splitToTrainTest() :return: test and train 2d array """ trainDataSize = int(len(dataset)*0.7) testDataSize = int(len(dataset) - len(dataset)*0.7) print ("Train data size : ",trainDataSize," | Test data size : ",testDataSize) train = dataset[:int(len(dataset)*0.7)] test = dataset[int(len(dataset) * 0.7):] return train, test  # Step 4. Actual training Training start by calling these below function repeatedly and updating coefficient as discussed in stochastic gradient tutorial. We have function in logitImplementation.py to accomplish this task. 1) Predict - used to predict on data using regression algorithm as discussed. It takes two inputs. - Data - only X part - Previously learned coefficients and slops and predicts on given data. 2) stochastic_gradient - used for to minimize errors and takes three 3 inputs - Data - whole data X and Y - learning rate (η) - Its is a value greater than 0 and lesser than or equal to 1 [ 0< η >=1] - Epochs - Number of time the same data to be given to the machine learning algorithm so that it can learn. Working of these two function can be nicely explained by below given figure Figure. 2 Working of prediction and stochastic gradient function

python def predict(self, Xrow,coefficients): """ for prediction based on given row and coefficients :param Xrow: [3.55565,4.65656,5.454654,1] where last element in a row remains Y [so called actual value y-actual] :param coefficients: [0.155,-0.2555,0.5456] previously updated coefficients by stochastic_gradient used for prediction :return: Ypredicted coefficient can be actually compared with memory from learning and be applied for further predictions """ Ypredicted = coefficients for i in range(len(Xrow)-1): Ypredicted += Xrow[i]*coefficients[i+1] return 1.0/(1.0+exp(-Ypredicted))  Here it is important to note that we have used logistic function - $(1/1+e-(β_0 + β_1X_1))$. Ypredicted = coefficients represent addition of β0 in the equation. Ypredicted += Xrow[i]*coefficients[i+1] represent addition of $β_1X_1$ to equation. 1.0/(1.0+exp(-Ypredicted)) represent overall equation - $(1/1+e-(β_0 + β_1X1))$. python def stochastic_gradient(self, trainDataset, learningRate, numberOfEpoches): """ :param trainDataset: :param learningRate: :param numberOfEpoches: :return: updated coefficient array """ """ For each column in train dataset we will be having one coefficient if training dataset having 5 column per array than coefficient array will be something like this [0.0, 0.0, 0.0, 0.0, 0.0] """ coefficient = [0.1 for i in range(len(trainDataset))] for epoch in range(numberOfEpoches): """ for each epoch repeat this operations """ squaredError = 0 for row in trainDataset: """ for each row calculate following things where each row will be like this [3.55565,4.65656,5.454654,1] ==> where last element in a row remains Y [so called actual value y-actual] """ Ypredicted = self.predict(row,coefficient) # sending row and coefficient for prediction error = row[-1] - Ypredicted #row[-1] is last elemment of row, can be considered as Yactual; Yactual - Ypredicted gives error """Updating squared error for each iteration"""" squaredError += error**2 """ In order to make learning, we should learn from our error here we will use stochastic gradient as a optimization function Stochastic gradient for each coefficient [b0,b1,b1,.....] can be formalized as coef[i+1] = coef[i+1] + learningRate * error * Ypredicted(1.0 - Ypredicted)* X[i] For a row containing elements [x1, x2, x3, x4, x5], coefficient [bo, b1, b2, b3, b4, b5] where each coefficient belongs to each element in a row e.g. b1 for X1, b2 for x2 and so on.. As coefficient[i] here is equal to bo, e.g. row element independent, we will update it separately. """ coefficient = coefficient+learningRate*error*Ypredicted*(1+Ypredicted) for i in range (len(row)-1): coefficient[i+1] = coefficient[i+1] + learningRate * error * Ypredicted*(1.0 - Ypredicted)* row[i] """ lets print everything as to know whether or not the error is really decreasing or not """ print (">>> Epoch : ", epoch ," | Error : ", squaredError) return coefficient  # Step 5. Testing We have learned things from training step and stored our learning in form of coefficients. We had eight attributes in our data-set so by equation $γ = β_0 + β_1X_1 + β_2X_2 + .. + β_8X_8$ we should have Nine coefficient [ β0 and m1 to m8] as shown below. Coefficient = [-0.40134710602161927, 0.28245197540248934, 0.7001137271400164, -0.1876675370110126, 0.0022099414467335937, 0.11226076950136779, 0.36010329497830973, 0.25824794282244273, 0.1412229577339494] For the test dataset we will use the same equation, here γ is unknown to us so by applying same equation - $(1/1+e-(β_0 + β_1X_1 + β_2X_2 + .. + β_8X_8))$. we will find out y . We have function in logitImplementation.py to accomplish this task. py def predictOnTest(self,testDataset, coefficient): """ predicting on leftover dataset :param testDataset: test dataset Array of array :param coefficient: Array of coefficient :return: actualvalues [Array],predictedlist [Array] """ actual = [] # stores actual value of Y predictedlist = [] # stores predicted value of Y for row in testDataset: actual.append(row[-1]) predicted = coefficient for i in range(len(row)-1): predicted +=row[i]*coefficient[i+1] predictedlist.append(predicted) print ("predicted : ",predicted, " | Actual : ",row[-1] ) return actual,predictedlist  # Step 6. Getting performance matrix If you look into predictedlist returned by predictOnTest function you would find float numbers like this : [0.004066216633807906, 0.23066261620267448, 0.30185797638280926, 0.3264935294499418, 0.25377601414111306, 0.30344480864150913, 0.060468595846753875, 0.037823825442416865, 0.5389956781340269, 0.665507821872288, ,,,]. These are not our classes. To get classes (diabetic /Non-diabetic ) we need to decide certain threshold, if predicted value is above it it will be 1 else 0. In ths way will get classes. Same thing as discussed I have implemented with below given createConfusionMatrix function. If you are unaware about performace measure in binary classification, read Performance Measures part of my earlier blog post. We have functions in dataUtils.py to accomplish this task. python def createConfusionMatrix(self,actual, predicted, threshold): """ will create confusion matrix for given set of actual and predicted array :param actual: Array of Actual sample :param predicted: Array of predicted sample :param threshold: Any number between 0-1 :return: """ fp = 0 fn = 0 tp = 0 tn = 0 for i in range(len(predicted)): if predicted[i] > threshold: predicted[i] = 1 else: predicted[i] = 0 for no in range(0, len(predicted)): if predicted[no] == 1 and actual[no] == 1: tp += 1 elif predicted[no] == 0 and actual[no] == 0: tn += 1 elif predicted[no] == 1 and actual[no] == 0: fn += 1 elif predicted[no] == 0 and actual[no] == 1: fp += 1 ACC = float((tp + tn))/ float((fp + tp + tn + fn)) F1 = float(2*tp)/ float(2*tp + fp + fn) print "False Positive : ",fp,", False Negative : ", fn,", True Positive : ", tp,", True Negative : ", tn,", Accuracy : ",ACC ,", F1 Score : ",F1  Here for this tutorial I have fixed the threshold to 0.30, and I got following performance measures : False Positive : 26 , False Negative : 23 , True Positive : 53 , True Negative : 129 , Accuracy : 0.787878787879 , F1 Score : 0.683870967742 Overall process of Logistic Regression can be summarized as given in below flowchart: Figure 3. Logistic Regression summarized.