Regression is the process of estimating the relationship between a dependent variable and one or more independent variables. It helps us understand the effect of the independent variables on the dependent variable.
All data and code used for this tutorial are available on my GitHub page.
To understand regression, let's start with an example. The data below show the relationship between mortgage interest rates and median home prices. “Mortgage is a legal agreement by which a bank, building society, etc. lends money at interest in exchange for taking title of the debtor’s property, with the condition that the conveyance of title becomes void upon the payment of the debt.”
Here, home price ($X$) is the independent variable and mortgage interest rate ($Y$) is the dependent variable. Our goal is to predict the value of $Y$ for a given value of $X$ by modelling the equation $y = mx + b$.
Figure 1. Overview of data-set.
The graph below shows a negative relationship between mortgage interest rates and median home prices. If we can model this relationship, we can predict the mortgage interest rate for a home of a given price.
Figure 2. Regression Line (in orange) passing through entire data (in blue)
To model this situation, the simplest approach is to draw a straight line through the entire data-set (shown by the orange line). This line is known as the trend-line. Any point on the trend-line gives the estimated mortgage interest rate for a given house price.
The trend-line is described by the equation $y = mx + b$, where $m$ = slope; $y$ = dependent variable (mortgage interest rate); $x$ = independent variable (house price); $b$ = intercept, i.e. the constant regression coefficient.
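We use two parameters to represent the trend-line:

- Regression coefficient ($b$), i.e. the intercept
- Slope ($m$)

Now we will see how to calculate the trend-line. The equations are:

$$m = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$b = \bar{y} - m\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the averages of the $x$ and $y$ values.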
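As a small worked example of this calculation, take three made-up points $(1, 2)$, $(2, 3)$, $(3, 7)$. These are hypothetical values chosen only to keep the arithmetic short; they are not taken from the mortgage data. The averages are $\bar{x} = 2$ and $\bar{y} = 4$, and the usual least-squares slope and intercept formulas give:

$$m = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{(-1)(-2) + (0)(-1) + (1)(3)}{(-1)^2 + 0^2 + 1^2} = \frac{5}{2} = 2.5$$

$$b = \bar{y} - m\bar{x} = 4 - 2.5 \times 2 = -1$$

So the trend-line for these three points is $y = 2.5x - 1$.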
I have calculated the same in Microsoft Excel to facilitate understanding. All equations used in the Excel sheet are as shown above, and we get the same result there too.
Figure 3. A way to calculate regression in an Excel spreadsheet.
Similarly, we can calculate the same in Python using the following code (try it yourself, you will get the same result):

```python
def findTrendline(xArray, yArray):
    """
    Find the least-squares trend line y = m*x + b.
    :param xArray: values of the independent variable
    :param yArray: values of the dependent variable
    :return: tuple (m, b) with slope m and intercept b
    """
    # calculating average for X
    xAvg = float(sum(xArray)) / len(xArray)
    # calculating average for Y
    yAvg = float(sum(yArray)) / len(yArray)

    upperPart = 0.0  # initializing numerator of the slope equation
    lowerPart = 0.0  # initializing denominator of the slope equation

    for i in range(len(xArray)):
        # calculating numerator
        upperPart += (xArray[i] - xAvg) * (yArray[i] - yAvg)
        # calculating denominator
        lowerPart += (xArray[i] - xAvg) ** 2

    # calculating slope
    m = upperPart / lowerPart
    # calculating regression coefficient (intercept)
    b = yAvg - m * xAvg
    return m, b


# Example
x = [183800, 183200, 174900, 173500, 172900, 173200, 173200, 169700,
     174500, 177900, 188100, 203200, 230200, 258200, 309800, 329800]
y = [10.30, 10.30, 10.10, 9.30, 8.40, 7.30, 8.40, 7.90,
     7.60, 7.60, 6.90, 7.40, 8.10, 7.00, 6.50, 5.80]
print(findTrendline(x, y))
```
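Once the slope and intercept are known, the trend-line can be used for the prediction that motivated this tutorial. The following is a minimal sketch, not the original code: the helper names `fit_line` and `predict` and the query price of 200000 are assumptions introduced for illustration, while the data are the same as in the example above.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_avg = sum(xs) / len(xs)
    y_avg = sum(ys) / len(ys)
    numerator = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    denominator = sum((x - x_avg) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_avg - slope * x_avg
    return slope, intercept


def predict(slope, intercept, x):
    """Evaluate the trend-line y = m*x + b at x."""
    return slope * x + intercept


# Median home prices and mortgage interest rates from the example above
home_prices = [183800, 183200, 174900, 173500, 172900, 173200, 173200, 169700,
               174500, 177900, 188100, 203200, 230200, 258200, 309800, 329800]
rates = [10.30, 10.30, 10.10, 9.30, 8.40, 7.30, 8.40, 7.90,
         7.60, 7.60, 6.90, 7.40, 8.10, 7.00, 6.50, 5.80]

m, b = fit_line(home_prices, rates)

# Hypothetical query price, chosen only for illustration
estimate = predict(m, b, 200000)
print(f"slope={m:.3e}, intercept={b:.2f}, estimated rate={estimate:.2f}")
```

On this data the fitted slope is negative, which matches the downward trend seen in the graph. Keep in mind that the estimate is only as good as the straight-line assumption behind it.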
- Very simple technique
- Good for quick-and-dirty estimation
- Provides a good estimate for linear data
- Cannot model non-linear data, and the vast majority of real-world data are non-linear in nature.
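The last limitation can be illustrated with a minimal sketch. The helper `fit_line` and the toy data below are assumptions made for illustration and are not part of the original code; the data follow $y = x^2$, which is clearly non-linear.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    x_avg = sum(xs) / len(xs)
    y_avg = sum(ys) / len(ys)
    numerator = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    denominator = sum((x - x_avg) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_avg - slope * x_avg
    return slope, intercept


# Toy non-linear data: y = x^2 on points symmetric around zero
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

m, b = fit_line(xs, ys)

# Difference between each actual value and the trend-line's value
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
worst_error = max(abs(r) for r in residuals)
print(m, b, worst_error)
```

For this symmetric data the fitted line comes out flat, so it predicts the same value for every $x$ and misses the curve entirely, even though the relationship between $x$ and $y$ is exact.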