
Linear Regression by the Median of Slopes

May 19, 2013

I have been trying to come up with a lesson about linear regression that involves more than pushing a few buttons, like on the TI-8ish, or using sliders in Desmos. I searched the web for other people's lessons but could not find what I was looking for. Then I came across a method of finding the line of best fit called the Theil–Sen estimator. Here is the method.

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points (xi, yi) is the median m of the slopes (yj − yi)/(xj − xi) determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two samples have the same x-coordinate. In Sen’s definition, one takes the median of the slopes defined only from pairs of points having distinct x-coordinates.

Once the slope m has been determined, one may determine a line through the sample points by setting the y-intercept b to be the median of the values yi − mxi. As Sen observed, this estimator is the value that makes the Kendall tau rank correlation coefficient comparing the sample data values yi with their estimated values mxi + b become approximately zero.
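To make the definition concrete, here is a minimal Python sketch of the two steps quoted above: the median of the pairwise slopes, then the median of the values y − mx. The function name `theil_sen` and the sample points are my own assumptions for illustration; the points are made up and are not the ticket-sales data.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Return (slope, intercept) of the median-of-slopes line through (x, y) points."""
    # Slopes from every pair of points with distinct x-coordinates (Sen's version).
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    m = median(slopes)
    # Intercept: the median of y - m*x taken over all points.
    b = median(y - m * x for x, y in points)
    return m, b


# Hypothetical data: four points on y = 2x + 1 plus one outlier.
sample = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 30.0)]
m, b = theil_sen(sample)
print(m, b)  # 2.0 1.0
```

In this made-up example the outlier at (4, 30) changes four of the ten slopes, but the median slope and median intercept still recover the line through the other four points.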

I really like this idea because it reinforces a lot of procedures of linear equations.  Here is how I might do the lesson. A link to the entire Desmos graph is here.

First, give the students data and have them plot it with Desmos. This data is the annual gross ticket sales (in hundreds of millions), where x = 0 for 1995. The table feature in Desmos is great for this.

Ticket Sales (100’s millions) VS Years (x=0 for 1995)

Next I would have students find the first-order differences and plot them on the same graph. We would discuss what these values mean and how, because they are approximately constant, a linear model would be a good fit.

Green dots are the first order differences
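For anyone who wants to check the differences outside of Desmos, a small sketch of the calculation follows. The helper name `first_differences` and the y-values are hypothetical, chosen only to show the arithmetic.

```python
def first_differences(values):
    """Return the differences between each value and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical y-values listed in order of x, with x increasing by 1 each time.
y_values = [10, 13, 15, 18, 20]
diffs = first_differences(y_values)
print(diffs)  # [3, 2, 3, 2]
```

Because x goes up by 1 between rows, each difference is also the slope between two consecutive points. Differences that stay close to one value, as these do, are what suggest a linear model.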

Next we would begin finding the median slope. We might begin by asking how many different slopes can be found among 17 points. Obviously, we would not find them all, so we would assign a certain number for each student to find. Then we would gather up all of those slopes and plot them in Desmos. This should be a great visual example for seeing the outliers among the slopes within the data. (For this example, I only found 10 different slopes. Also, note that the first-order differences can be used as slope values. Those slope values are for consecutive points.)

10 different slopes found within the data.
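The counting question can be checked with a few lines of Python. This sketch assumes all 17 points have distinct x-values (true here, since each point is a different year), so every pair of points gives a slope.

```python
from itertools import combinations
from math import comb

n_points = 17  # number of data points in the lesson
n_slopes = comb(n_points, 2)  # pairs of distinct points: 17 * 16 / 2
print(n_slopes)  # 136

# Cross-check by listing every pair of point indices explicitly.
pairs = list(combinations(range(n_points), 2))
# Pairs of neighboring points correspond to the first-order differences.
consecutive = [(i, j) for i, j in pairs if j == i + 1]
print(len(pairs), len(consecutive))  # 136 16
```

So the 16 first-order differences are only a small part of the 136 possible slopes.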

Then we can discuss which “average” (mean, median, mode, or midrange) we should use to find the “average slope.” In Desmos, finding the median slope is easy. Click on the top line, then hide it. Click on the bottom line, then hide it. Click on the new top line, then hide it. Click on the new bottom line, then hide it. Continue doing so, and the lines that remain give the median slope. Here is a picture of the final two.

The two median slopes out of the ten.
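The hide-the-top, hide-the-bottom routine is the same as trimming a sorted list from both ends. Here is a rough sketch of that idea; the function name and the ten slope values are made up for illustration and are not the slopes from my graph.

```python
from statistics import median


def trim_to_middle(values):
    """Drop the largest and smallest value repeatedly until one or two remain."""
    remaining = sorted(values)
    while len(remaining) > 2:
        remaining = remaining[1:-1]  # hide the bottom line and the top line
    return remaining


# Ten hypothetical slopes, including two obvious outliers.
slopes = [0.5, 0.125, 0.375, 0.25, 1.5, 0.0, 0.625, 0.75, 0.875, -1.0]
middle = trim_to_middle(slopes)
print(middle)  # [0.375, 0.5]

# With an even count, the median is the mean of the two values left over.
median_slope = sum(middle) / len(middle)
print(median_slope)  # 0.4375
```

With an odd number of slopes only one value is left, and that value is the median.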

We can also plot that median slope with the first-order differences. This could bring up a good discussion about whether we really need to find other slopes or could just use the first-order differences to find the “median slope.”

Plotting the line y=median slope with the 1st order differences

Next we can go back to the table and find the median y-intercept.  In the Desmos table, we will make a column of values that is the expression y-(median slope)x.  We can also plot those points to show what the y-intercept would be for each data point.  Here is that graph.

Purple dots are the y-intercepts based on the median slope and data point.

Now that we have all of those different y-intercepts, we can use a slider to estimate the median y-intercept value. We could also put the values into a spreadsheet if we wanted, but I think the slider will be good enough. I gave the slider a lower bound of 4 and an upper bound of 5. The b value ended up being 4.554.

Using a slider value to estimate the median y-intercept.

Finally, we are ready to plot the line of “median fit” using the equation y = (median slope)x + (median y-int).

The end result: Line of Median Fit.

For only using ten different slopes, I would say that the line looks pretty good. However, the data did have a strong correlation to begin with. I have not compared the “median line” to the least-squares line because I think that would be a good follow-up. I think this method gets to the heart of regression. Students get to see how many different lines are used to find the best line. Students review stats concepts and how outliers impact different averages. Students are creating a lot of evidence for their model, instead of just relying on the “r-value.”
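For that follow-up, a comparison might be sketched like this. Everything here is an assumption for illustration: the function names are mine, the least-squares line comes from the standard textbook formulas, and the data is made up with one deliberate outlier rather than taken from the ticket-sales table.

```python
from itertools import combinations
from statistics import mean, median


def median_line(xs, ys):
    """Median of the pairwise slopes, then median of y - m*x."""
    pts = list(zip(xs, ys))
    m = median(
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(pts, 2)
        if x1 != x2
    )
    b = median(y - m * x for x, y in pts)
    return m, b


def least_squares_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar


# Hypothetical data: four points on y = 2x + 1 and one outlier at x = 4.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.0, 5.0, 7.0, 30.0]
med_m, med_b = median_line(xs, ys)
ls_m, ls_b = least_squares_line(xs, ys)
print(med_m, med_b)  # 2.0 1.0
print(round(ls_m, 3), round(ls_b, 3))  # 6.2 -3.2
```

On this made-up data the single outlier pulls the least-squares slope from 2 up to about 6.2, while the median line does not move. That contrast could start the discussion about how outliers affect the mean and the median differently.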

One other thought would be to have students create an error region for the model. This might help them understand the ideas of interpolation and extrapolation. Plus, it might allow us to discuss standard deviation, too. In the graph below I graphed {median slope(x) + 1.15(median y-int)} and {median slope(x) + 0.85(median y-int)} to create a region that runs from 15% above to 15% below the median y-intercept. I could instead have found the standard deviation of the b values and gone three standard deviations above and below.

15% above and 15% below the median line.
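A region like this can also be checked point by point. The sketch below assumes the band is built by scaling the median y-intercept to 115% and 85% of its value while keeping the median slope. The intercept 4.554 is the slider value from above; the slope 0.3 and the three test points are hypothetical.

```python
def in_band(x, y, m, b, pct=0.15):
    """Return True if (x, y) lies between the lines m*x + (1 ± pct)*b."""
    edge_high = m * x + (1 + pct) * b
    edge_low = m * x + (1 - pct) * b
    # min/max keeps the test valid even if b is negative.
    return min(edge_low, edge_high) <= y <= max(edge_low, edge_high)


m = 0.3    # hypothetical median slope
b = 4.554  # median y-intercept found with the slider

# Hypothetical points to test against the band.
points = [(0, 4.6), (5, 6.0), (10, 9.0)]
inside = [in_band(x, y, m, b) for x, y in points]
print(inside)  # [True, True, False]
```

A band built this way has the same vertical width, 0.30 times b, at every x, so it does not widen as we extrapolate away from the data. That could be one more thing to discuss with students.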

The more I explore this concept, the more it seems to be turning into a statistical analysis. I need to decide whether that is the route I want to take, since the class I am developing this for is a “Math Modeling” course.

I hope all of this gives you some ideas about linear regression. I have not yet designed the lab sheet that will go with this. I would love to hear feedback if you have any.