Overview of Least Squares Method
Brokk Toggerson
So you just completed your first fit for this lab using the LINEST function in Google Sheets. But what did that function actually do? That’s what we’re going to explore in this section. LINEST does what’s known as an ordinary least squares regression (OLS for short, because the full name gets kind of long to say), which is a method you might have used in a previous course, like Biology Lab or, if you’ve taken it, statistics.
The goal of the ordinary least squares regression is to minimize what’s called the sum of the squared residuals. For each of your data points $(x_i, y_i)$, you look at the value of your data, $y_i$, and then subtract from it the value of your line at that point, $f(x_i)$: that gives $y_i - f(x_i)$. You then square it; that way, negative values and positive values don’t cancel out. Finally, you add them all up. That’s the sum of the squared residuals (the “squared” is there to remind us that we’ve squared everything). This is the quantity that you then want to minimize. Mathematically, for $N$ data points, it looks like

$$\sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2$$
It’s kind of an ugly formula. But hopefully you can see how it kind of fits with what we’ve been talking about.
Running through the least squares process with dummy data to see how it works
We’re going to go through this process with some dummy data that I made up; it’s not about anything in particular.
Point | x | y |
1 | 3.102 | 5.211 |
2 | 6.273 | 8.070 |
3 | 8.738 | 8.686 |
4 | 12.656 | 10.138 |
5 | 15.079 | 13.939 |
6 | 18.382 | 12.642 |
We’ve got six different data points here, numbered one through six. For each one, we’ve got an $x$ value and a $y$ value. Let’s go through this ordinary least squares process, sort of by hand, and see what it’s doing. To start with, place the points on the PhET simulation below.
Now we will guess a line, as you did in a previous part. I am going to guess a specific slope and intercept for my line. Using this guessed function, I can fill out the various steps, as visible in this spreadsheet. You’ll notice that some differences are positive, like the first three, and some of them are negative. Positive means the point is above the line, while negative means the point is below the line. The sum of the squared residuals at the bottom is the quantity we are trying to minimize.
So that’s what the least squares regression is doing: it’s going through and calculating the difference between the data value and the prediction, squaring it, adding them all up, and then trying to minimize that number.
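If you’d like to see those steps outside of a spreadsheet, here is a minimal Python sketch of the same bookkeeping. It assumes the second and third columns of the table above are the x and y values. The guessed slope and intercept are hypothetical numbers picked only for illustration (they are not the guess used in the text), and the function names are made up for this example.

```python
# Dummy data from the table above (assumption: the second column holds the
# x values and the third column holds the y values).
x_values = [3.102, 6.273, 8.738, 12.656, 15.079, 18.382]
y_values = [5.211, 8.070, 8.686, 10.138, 13.939, 12.642]


def line(x, slope, intercept):
    """Value of the guessed line at x."""
    return slope * x + intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up (data - line)^2 over all of the points."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - line(x, slope, intercept)  # positive: point is above the line
        total += residual ** 2                    # squaring stops +/- from cancelling
    return total


# Hypothetical guess, chosen only for illustration (not the guess in the text).
guess_slope = 0.5
guess_intercept = 4.0

# Print the residual for each point, like one row of the spreadsheet.
for x, y in zip(x_values, y_values):
    residual = y - line(x, guess_slope, guess_intercept)
    print(f"x = {x:6.3f}  y = {y:6.3f}  residual = {residual:+.3f}")

print("sum of squared residuals:",
      sum_squared_residuals(x_values, y_values, guess_slope, guess_intercept))
```

Try changing `guess_slope` and `guess_intercept` and watch the final number: the least squares fit is the pair of values that makes it as small as possible.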
One important feature of ordinary least squares that we will use throughout the rest of this lab
The ordinary least squares fit will always go through the point whose coordinates are the average x value and the average y value. We indicate this mathematically as the point $(\bar{x}, \bar{y})$. This is implicit in the mathematical construction of how the least squares line is developed and how it’s minimized. You’ll always go through this point. That probably should make some sort of intuitive sense: your best fit line should certainly include the average of all of your data.
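For the curious, here is a short sketch of where that comes from. It assumes the fit line is written as $f(x) = mx + b$; the symbols $m$, $b$, and $S$ (for the sum of the squared residuals) are labels chosen just for this sketch. At the minimum, the derivative of $S$ with respect to the intercept $b$ has to be zero, and that condition alone forces the line through the averages.

```latex
S(m, b) = \sum_{i=1}^{N} \left( y_i - m x_i - b \right)^2

\frac{\partial S}{\partial b}
  = -2 \sum_{i=1}^{N} \left( y_i - m x_i - b \right) = 0
\;\Longrightarrow\;
\sum_{i=1}^{N} y_i = m \sum_{i=1}^{N} x_i + N b
\;\Longrightarrow\;
\bar{y} = m \bar{x} + b
```

The last line says that plugging $\bar{x}$ into the best fit line gives back $\bar{y}$, no matter what the best fit slope turns out to be.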
The least squares fit will ALWAYS include the average of the data
A least squares fit will always go through the point $(\bar{x}, \bar{y})$.
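You can check this numerically with the dummy data. The sketch below assumes the second and third columns of the table are the x and y values, and it uses the standard closed-form solution for a straight-line least squares fit written in terms of sums. The helper name `fit_line` is made up for this example, and this is not necessarily how LINEST computes things internally.

```python
from statistics import mean

# Dummy data from the table (assumption: second column is x, third column is y).
x_values = [3.102, 6.273, 8.738, 12.656, 15.079, 18.382]
y_values = [5.211, 8.070, 8.686, 10.138, 13.939, 12.642]


def fit_line(xs, ys):
    """Closed-form ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = n * sum_xx - sum_x ** 2
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y * sum_xx - sum_x * sum_xy) / denominator
    return slope, intercept


slope, intercept = fit_line(x_values, y_values)
x_bar = mean(x_values)
y_bar = mean(y_values)

print("slope:", slope, "intercept:", intercept)
print("average x:", x_bar, "average y:", y_bar)
# The value of the fit line at the average x should match the average y,
# up to floating-point rounding.
print("line at average x:", slope * x_bar + intercept)
```

The last printed number should agree with the average y value, which is the statement in the box above.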
Problems with least squares fitting
Now let’s think a little bit about the problems with just basic least squares fitting, the method that you might have seen before in your biology lab or statistics class. It turns out that there are actually some significant problems with this method. First, you’ll notice that in our calculation of the best fit using least squares, we didn’t use the error bars in any way. The PhET simulation does not even have error bars on it! There’s no information about the uncertainty of these data anywhere in this calculation, which is obviously kind of a problem. In fact, neither the uncertainty in $x$ nor the uncertainty in $y$ is included in any way. Our data, of course, include uncertainties, as do most scientific data, and we should incorporate that information.