In this post, I'm going to walk you through an elementary single-variable linear regression with Octave (an open-source Matlab alternative).
If you're new to Octave, I'd recommend getting started by going through the linear algebra tutorial first.
If you're already familiar with the basics of linear algebra operations with Octave, you can move on to the linear regression tutorial. In this tutorial, we're going to see if we can predict the temperature from the rate at which crickets chirp. First, download the data from this text file. (Source: calvin.edu)
Create a new Octave file for the linear regression script called linear_regression_with_octave.m.
First, we'll want to load the data:
% Load the data from our text file
data = load('cricket_chirps_versus_temperature.txt');
Next, let's define x and y. The x vector is for the independent variable (rate of cricket chirping), and the y vector is for the dependent variable (temperature). To put it another way, your y vector is what you are trying to predict, and your x vector is the data you are going to use to predict it.
% Define x and y
x = data(:,2);
y = data(:,1);
Let's plot the data to see what it looks like:
% Plot the training data as red crosses
plot(x, y, 'rx', 'MarkerSize', 8);
xlabel('Rate of Cricket Chirping'); % Set the x-axis label
ylabel('Temperature in Degrees Fahrenheit'); % Set the y-axis label
fprintf('Program paused. Press enter to continue.\n');
pause; % Wait for a keypress before running the rest of the script
We're putting in a pause here so that when we generate a new plot later, there's a chronological separation between the two plots. Otherwise the computer will do everything faster than we can process what is happening.
Looking at this chart, there certainly seems to be a linear relationship here. (One of the nice things about a single-variable regression is that you can plot the data on a 2-dimensional chart in order to visualize the relationship.)
Your graph of the data should look like this: [Figure: scatter plot of temperature in degrees Fahrenheit against rate of cricket chirping]
Now, we want to allow a non-zero intercept for our linear equation. That is, we don't want to require that our fitted equation go through the origin. In order to do this, we need to add a column of all ones to our x column.
% Count how many data points we have
m = length(x);
% Add a column of all ones (intercept term) to x
X = [ones(m, 1) x];
Note that we used lowercase x for the initial vector of cricket-chirp rates, but then we used uppercase X for the new two-column matrix. Recall that, by convention, vectors get lowercase variables and matrices get uppercase variables.
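To see why the column of ones gives us an intercept, here is a minimal sketch on three made-up chirp rates. The names x_demo, X_demo and theta_demo, and the values in them, are assumptions chosen for illustration; they are not taken from the data file.

```octave
% Three made-up chirp rates (illustrative values, not from the data file)
x_demo = [15; 16; 17];
m_demo = length(x_demo);

% Same construction as above: a column of ones, then the x values
X_demo = [ones(m_demo, 1) x_demo];

% A hypothetical theta: intercept 25, slope 3
theta_demo = [25; 3];

% Each row of X_demo is [1 x], so X_demo * theta_demo computes 25*1 + 3*x per row
predictions = X_demo * theta_demo;

disp(X_demo);
disp(predictions);
```

Because every row starts with a 1, the first entry of theta is added to every prediction unchanged, which is exactly what an intercept does.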
Now, let's use the normal equation to calculate theta. We are minimizing the sum of the squared errors between our predicted values and the actual y values. This is by far the most widely used error measure for regression. One of the most attractive features of the linear least-squares method is that it has a closed-form solution; that is, theta can be computed directly, with no iterative optimization such as gradient descent. That closed-form solution is called the normal equation. If you want to learn more about the derivation of the normal equation, you can read about it on Wikipedia.
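For readers who want the short version of that derivation, here is a sketch. The symbols J, θ_0 (intercept) and θ_1 (slope) are notation assumed here for illustration; m is the number of data points and X is the two-column matrix built above.

```latex
J(\theta) = \sum_{i=1}^{m} \left(\theta_0 + \theta_1 x_i - y_i\right)^2
          = (X\theta - y)^{T} (X\theta - y)

\nabla_{\theta} J(\theta) = 2\, X^{T} (X\theta - y) = 0
\quad\Longrightarrow\quad
X^{T} X\, \theta = X^{T} y
```

Setting the gradient to zero gives a linear system in theta, and solving that system for theta is all that remains.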
The normal equation is this:
$$\theta = (X^{T} X)^{-1} X^{T} y$$
Putting that into Octave:
% Calculate theta
theta = (pinv(X'*X))*X'*y
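You should get theta = [24.9660; 3.3058]. The first entry is the intercept (it multiplies the column of ones) and the second is the slope (it multiplies the chirp rate), so our fitted equation is as follows: y = 3.3058x + 24.9660.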
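If you'd like a cross-check on the normal equation, Octave's backslash operator and polyfit both solve the same least-squares problem. The sketch below uses made-up data that lies exactly on the line y = 2 + 3x, so all three approaches should recover an intercept of 2 and a slope of 3. The variable names with the _chk suffix are assumptions for illustration only.

```octave
% Synthetic data lying exactly on y = 2 + 3x (made-up values, not the cricket data)
x_chk = [1; 2; 3; 4; 5];
y_chk = 2 + 3 * x_chk;
X_chk = [ones(length(x_chk), 1) x_chk];

% Normal equation, written the same way as in this post
theta_normal = pinv(X_chk' * X_chk) * X_chk' * y_chk;

% Backslash solves the least-squares problem without forming X'*X
theta_backslash = X_chk \ y_chk;

% polyfit returns [slope intercept], so reorder to match theta
p = polyfit(x_chk, y_chk, 1);
theta_polyfit = [p(2); p(1)];

% One column per method: normal equation, backslash, polyfit
disp([theta_normal theta_backslash theta_polyfit]);
```

On real data, small differences in the last few digits between the methods are normal floating-point noise.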
Now, let's plot our fitted equation (prediction) on top of the training data, to see if our fitted equation makes sense.
% Plot the fitted equation we got from the regression
hold on; % this keeps our previous plot of the training data visible
plot(X(:,2), X*theta, '-')
legend('Training data', 'Linear regression')
hold off % Don't put any more plots on this figure
Your plot should look like this: [Figure: the training data with the fitted regression line drawn over it]
That's all there is to it! Now you know how to run a single-variable linear regression with Octave using the normal equation.
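To put the fitted line to use, you can predict the temperature for a new chirp rate by building a row of the form [1 rate] and multiplying it by theta. Here is a minimal sketch using the theta values reported above; the chirp rate of 18 is a hypothetical input, and the variable names are assumptions for illustration.

```octave
% Coefficients reported in this post: [intercept; slope]
theta_reported = [24.9660; 3.3058];

% A hypothetical chirp rate to predict for
chirp_rate = 18;

% Prepend a 1 for the intercept term, then multiply by theta
predicted_temp = [1 chirp_rate] * theta_reported;

printf('Predicted temperature at a chirp rate of %d: %.2f degrees F\n', chirp_rate, predicted_temp);
```

This works out to 24.9660 + 3.3058 * 18, or about 84.47 degrees Fahrenheit. Predictions for chirp rates far outside the range of the training data should be treated with caution.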