## Linear Regression by the Median of Slopes

I have been trying to come up with a lesson about linear regression that involves more than pushing a few buttons, like on the TI-8ish, or using sliders in Desmos. I searched the web for other people's lessons but could not find what I was looking for. Then I came across a method of finding the line of best fit called the Theil–Sen estimator. Here is the method.

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points $(x_i, y_i)$ is the median $m$ of the slopes $(y_j - y_i)/(x_j - x_i)$ determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two samples have the same $x$-coordinate. In Sen’s definition, one takes the median of the slopes defined only from pairs of points having distinct $x$-coordinates.

Once the slope $m$ has been determined, one may determine a line through the sample points by setting the $y$-intercept $b$ to be the median of the values $y_i - m x_i$. As Sen observed, this estimator is the value that makes the Kendall tau rank correlation coefficient comparing the sample data values $y_i$ with their estimated values $m x_i + b$ become approximately zero.
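For anyone who wants to check the definition numerically before trying it in Desmos, here is a minimal Python sketch of the method as quoted above. The function name `theil_sen` and the sample points are my own illustration, not the ticket-sales data used later in this post.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Return (m, b): the median pairwise slope and the median intercept."""
    # Sen's variant: only use pairs whose x-coordinates differ.
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    m = median(slopes)
    # One candidate intercept per point, then take the median of those.
    b = median(y - m * x for x, y in points)
    return m, b


# Hypothetical points: four lie on y = 2x + 1, the last is an outlier.
sample = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 20.0)]
m, b = theil_sen(sample)
print(m, b)  # 2.0 1.0
```

Six of the ten pairwise slopes equal 2, so the median ignores the four slopes that involve the outlier. That is the behavior I would want students to notice.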

I really like this idea because it reinforces a lot of procedures of linear equations. Here is how I might do the lesson. A link to the entire Desmos graph is here.

First, give the students the data and have them plot it in Desmos. The data are annual gross ticket sales (in hundreds of millions), where x = 0 corresponds to 1995. The table feature in Desmos works well for this.

Next I would have students find the first-order differences and plot these on the same graph. We would have a discussion about what these values mean and also talk about how they are approximately constant, so a linear model would be a good fit.
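If students want to check their differences, a short Python sketch like the one below would do it. The helper name `first_differences` and the yearly values are hypothetical, not the data from this lesson.

```python
def first_differences(values):
    """Differences between consecutive values: y[1] - y[0], y[2] - y[1], ..."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical yearly values, one per year, so the x-step is 1.
yearly = [4.5, 5.0, 5.4, 6.0, 6.5]
diffs = first_differences(yearly)
print([round(d, 2) for d in diffs])  # [0.5, 0.4, 0.6, 0.5]
```

Because the x-values here step by exactly 1, each difference is also the slope between two consecutive points.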

Next we would begin finding the median slope. We might begin by asking how many different slopes could be found between 17 points. Obviously, we would not find them all, so we would assign a certain number to each student. Then we would gather up all of those slopes and plot them in Desmos. This should be a great visual way to see the outlier slopes within the data. (For this example, I only found 10 different slopes. Also, note that the first-order differences could be used as slope values. Those are the slopes between consecutive points.)
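For the counting question, each slope comes from choosing 2 of the 17 points, so the count is a combination. Assuming all 17 x-values are distinct (they are different years here), every pair gives a defined slope:

$$\binom{17}{2} = \frac{17 \cdot 16}{2} = 136$$

The 16 first-order differences account for only the consecutive pairs among those 136.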

Then we can discuss which “average” (mean, median, mode, or midrange) we should use to find the “average slope”. In Desmos, finding the median slope is easy. Click on the top line, then hide it. Click on the bottom line, then hide it. Click on the new top line, then hide it. Click on the new bottom line, then hide it. Continue until only one or two lines remain; that is the median slope (or, with two lines left, the median is halfway between them). Here is a picture of the final two.
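The hide-the-top, hide-the-bottom routine is the median computed by hand. Here is a small Python sketch of that same procedure, assuming a non-empty list of slopes. The name `median_by_trimming` and the ten slope values are made up for illustration.

```python
from statistics import median


def median_by_trimming(values):
    """Drop the largest and smallest value until one or two remain."""
    remaining = sorted(values)
    while len(remaining) > 2:
        remaining = remaining[1:-1]  # hide the top line and the bottom line
    # One value left: that is the median. Two left: average them.
    return sum(remaining) / len(remaining)


# Ten hypothetical slopes, including one high and one low outlier.
slopes = [0.21, 0.35, 0.18, 0.40, 0.29, 0.95, 0.31, 0.27, -0.10, 0.33]
print(median_by_trimming(slopes) == median(slopes))  # True
```

With an even number of slopes, two lines are left at the end, which matches the picture of the final two.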

We can also plot that median slope with the first-order differences. This could bring up a good discussion about whether we really need to find other slopes or whether we could just use the first-order differences to find the “median slope”.

Next we can go back to the table and find the median y-intercept. In the Desmos table, we will make a column of values that is the expression y-(median slope)x. We can also plot those points to show what the y-intercept would be for each data point. Here is that graph.
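The extra table column can be sketched in Python as well. Everything in this example (the function name `intercept_column`, the points, and the slope of 0.3) is hypothetical and only stands in for the real table.

```python
from statistics import median


def intercept_column(points, slope):
    """One candidate y-intercept per data point: y - slope * x."""
    return [y - slope * x for x, y in points]


# Hypothetical points and slope, not the values from this post.
points = [(0, 4.4), (1, 4.9), (2, 5.1), (3, 5.6), (4, 5.8)]
slope = 0.3
column = intercept_column(points, slope)
b = median(column)
print([round(v, 2) for v in column])  # [4.4, 4.6, 4.5, 4.7, 4.6]
print(round(b, 2))  # 4.6
```

Each entry answers the question "if the line with this slope went through this point, where would it cross the y-axis?"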

Now that we have all of those different y-intercepts, we can use a slider to estimate the median y-intercept value. We could also throw the values into a spreadsheet if we wanted, but I think the slider will be good enough. I gave the slider a lower bound of 4 and an upper bound of 5. The b value ended up being 4.554.

Finally, we are ready to plot the line of “median fit” using the equation y = (median slope)x + (median y-int).

For only using ten different slopes, I would say that the line looks pretty good. However, the data did have a strong correlation to begin with. I have not compared the “median line” to the least-squares line because I think that would be a good follow-up. I think this method gets to the heart of regression. Students get to see how many different lines are used to find the best line. Students review stats concepts and how outliers impact different averages. Students create a lot of evidence for their model, instead of just relying on the “r-value”.

One other thought would be to have students create an error region for the model. This might help them understand ideas of interpolation and extrapolation. Plus, it might allow us to discuss standard deviation, too. In the graph below I graphed {(median slope)x + 1.15(median y-int)} and {(median slope)x + 0.85(median y-int)} to create a region 15% above and 15% below. I could have found the standard deviation of the b values and gone three standard deviations above and below the median instead.
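Here is a rough Python sketch of that region, assuming the band is made by moving the y-intercept up and down by 15% while keeping the slope fixed. The function name `band` and the slope of 0.3 are hypothetical. The intercept 4.554 is the slider value from above.

```python
def band(x, slope, intercept, fraction=0.15):
    """Return (lower, center, upper) predictions at x.

    Only the intercept is scaled, so the band has the same
    vertical width at every x.
    """
    center = slope * x + intercept
    lower = slope * x + (1 - fraction) * intercept
    upper = slope * x + (1 + fraction) * intercept
    return lower, center, upper


# Hypothetical slope; the intercept is the b value found with the slider.
lower, center, upper = band(10, slope=0.3, intercept=4.554)
print(round(lower, 4), round(center, 4), round(upper, 4))  # 6.8709 7.554 8.2371
```

One thing this makes visible: the region is always 0.30 times the intercept tall (about 1.37 here), no matter how far out we extrapolate. A band based on the spread of the intercepts would behave the same way, just with a different height.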

The more I explore this concept, the more it seems to be turning into a statistical analysis. I need to decide whether that is the route I want to take, since the class I am developing this for is a “Math Modeling” course.

I hope all of this gives you some ideas about linear regression. I have not yet designed the lab sheet that will go with this. I would love to hear feedback if you have any.