Regression fits a line or curve to the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimized.
Linear Regression is a commonly used type of predictive analysis. It is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables, under the assumption that a straight line can approximate this relationship. The goal of linear regression is to identify the line that minimizes the discrepancies between the observed data points and the values the line predicts.
Let’s discuss Simple Linear Regression using the R programming language.
Linear regression is one of the simplest and most popular machine learning algorithms.
A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship: a positive linear relationship, in which the dependent variable increases as the independent variable increases, and a negative linear relationship, in which the dependent variable decreases as the independent variable increases.
Below are some important assumptions of Linear Regression. These are formal checks to perform while building a Linear Regression model, and they help ensure the best possible results from the given dataset:

- A linear relationship between the independent and dependent variables
- Little or no multicollinearity between the independent variables
- Homoscedasticity (constant variance of the residuals)
- Normal distribution of the error terms
- Little or no autocorrelation in the residuals
It is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables. One variable, denoted x, is regarded as the independent variable, and the other, denoted y, is regarded as the dependent variable. It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).
Y = β₀ + β₁X + ε

where Y is the dependent variable, X is the independent variable, β₀ is the intercept, β₁ is the slope, and ε is the random error term. To understand the concept, let’s consider a salary dataset in which the value of the dependent variable (salary) is given for every value of the independent variable (years of experience).
Salary dataset:
For general purposes, we define x as the feature vector (the years of experience) and y as the response vector (the corresponding salaries), each containing n values.
First, we convert these data values into an R data frame.
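A minimal sketch is shown below; the specific x and y values here are illustrative stand-ins, not the original dataset, since any two numeric vectors of equal length will do:

# Illustrative data (assumed): years of experience (x) and salary in
# thousands (y) for ten employees
x <- c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.7, 4.0)
y <- c(39, 46, 37, 43, 40, 56, 60, 54, 57, 63)

# Combine the two vectors into a data frame
salary_data <- data.frame(years_experience = x, salary = y)
head(salary_data)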
Next, we draw a scatter plot of the given dataset.
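A minimal sketch using base R’s plot() function, applied to the salary_data frame created above:

# Scatter plot of salary against years of experience
plot(salary_data$years_experience, salary_data$salary,
     main = "Salary vs. Years of Experience",
     xlab = "Years of experience",
     ylab = "Salary (thousands)",
     pch  = 19, col = "blue")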
Output: a scatter plot of salary against years of experience.
Now, we have to find a line that fits the above scatter plot, through which we can predict the response y for any value of x. The line that fits best is called the regression line.
The equation of the regression line is given by:
y = a + bx

where y is the predicted response value, a is the y-intercept, x is the feature value, and b is the slope.
To create the model, we need to estimate the regression coefficients a and b. Once these coefficients have been estimated, the model can be used to predict the response. Here we are going to use the Least Squares technique.
The principle of least squares is one of the most popular methods for fitting a curve to given data. Say (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) are n observations from an experiment. We are interested in finding a curve

y = f(x) …(1)

closely fitting the given data of size n. At x = x₁, while the observed value of y is y₁, the expected value of y from curve (1) is f(x₁). The residual is then defined as

e₁ = y₁ − f(x₁)

Similarly, the residuals for x₂, x₃, …, xₙ are given by

eᵢ = yᵢ − f(xᵢ), for i = 2, 3, …, n

While evaluating the residuals we find that some are positive and some are negative. We would like the curve to fit the given data such that the residual at any xᵢ is as small as possible. Since some of the residuals are positive and others are negative, and since we would like to give equal importance to all of them, it is desirable to consider the sum of the squares of the residuals. Thus we minimize

E = Σᵢ₌₁ⁿ eᵢ²

and take the curve that minimizes E as the best representative of the data.
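To make this criterion concrete, here is a minimal R sketch (reusing the illustrative salary_data defined earlier) that computes the residuals and the sum of squared residuals E for an arbitrary candidate line:

# E = sum of squared residuals for a candidate line y = a + b*x
sum_sq_residuals <- function(a, b, x, y) {
  e <- y - (a + b * x)   # residuals e_i = y_i - f(x_i)
  sum(e^2)               # E = e_1^2 + e_2^2 + ... + e_n^2
}

# The candidate line with the smaller E fits the data better
sum_sq_residuals(30, 8, salary_data$years_experience, salary_data$salary)
sum_sq_residuals(25, 5, salary_data$years_experience, salary_data$salary)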
Least Square Fit of a Straight Line
Suppose we are given a dataset of n observations (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) from an experiment, and we are interested in fitting a straight line

y = a + bx

to the given data. For each observation, consider the residual

eᵢ = yᵢ − (a + bxᵢ)

and the sum of the squares of the eᵢ:

E = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − a − bxᵢ)²

Note: E is a function of the parameters a and b, and we need to find a and b such that E is minimum. The necessary condition for E to be minimum is:

∂E/∂a = 0 and ∂E/∂b = 0

These conditions yield:

Σᵢ yᵢ = na + b Σᵢ xᵢ
Σᵢ xᵢyᵢ = a Σᵢ xᵢ + b Σᵢ xᵢ²

The above two equations are called the normal equations, which are solved to get the values of a and b. At the minimizing values of a and b, the expression for E can be rewritten as:

E = Σᵢ yᵢ² − a Σᵢ yᵢ − b Σᵢ xᵢyᵢ
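To make the normal equations concrete, the short sketch below (again reusing the illustrative salary_data) solves them in closed form and checks the result against R’s built-in lm():

x <- salary_data$years_experience
y <- salary_data$salary
n <- length(x)

# Solve the normal equations:
#   sum(y)   = n*a + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x^2)
b <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
a <- mean(y) - b * mean(x)
c(intercept = a, slope = b)

# The same coefficients from R's built-in least squares fit
coef(lm(salary ~ years_experience, data = salary_data))

Both approaches give identical coefficients, since lm() also minimizes the sum of squared residuals.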
The basic syntax for regression analysis in R is
lm(Y ~ model)
where Y is the object containing the dependent variable to be predicted and model is the formula for the chosen mathematical model.
The command lm() provides the model’s coefficients but no further statistical information; applying summary() to the fitted model provides those details.
The following R code sketches the complete Simple Linear Regression workflow (the data values are the illustrative ones used throughout):
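# Simple Linear Regression in R (illustrative data, assumed for this sketch)
x <- c(1.1, 1.3, 1.5, 2.0, 2.2, 2.9, 3.0, 3.2, 3.7, 4.0)
y <- c(39, 46, 37, 43, 40, 56, 60, 54, 57, 63)
salary_data <- data.frame(years_experience = x, salary = y)

# Fit the model: salary as a linear function of years of experience
model <- lm(salary ~ years_experience, data = salary_data)

# Regression coefficients a (intercept) and b (slope)
print(coef(model))

# Detailed statistics: standard errors, R-squared, p-values
summary(model)

# Predict the salary for, say, 2.5 years of experience
predict(model, newdata = data.frame(years_experience = 2.5))

# Plot the data and overlay the fitted regression line
plot(salary_data$years_experience, salary_data$salary,
     main = "Simple Linear Regression",
     xlab = "Years of experience", ylab = "Salary (thousands)",
     pch = 19, col = "blue")
abline(model, col = "red", lwd = 2)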