Archive for May, 2013

Linear Regression by the Median of Slopes

I have been trying to come up with a lesson about linear regression that involves more than pushing a few buttons, like on the TI-8ish, or using sliders in Desmos. I searched the web for other people's lessons but could not find what I was looking for. Then I came across a method of finding the line of best fit called the Theil–Sen estimator. Here is the method.

As defined by Theil (1950), the Theil–Sen estimator of a set of two-dimensional points (x_i, y_i) is the median m of the slopes (y_j − y_i)/(x_j − x_i) determined by all pairs of sample points. Sen (1968) extended this definition to handle the case in which two samples have the same x-coordinate. In Sen’s definition, one takes the median of the slopes defined only from pairs of points having distinct x-coordinates.

Once the slope m has been determined, one may determine a line through the sample points by setting the y-intercept b to be the median of the values y_i − m·x_i. As Sen observed, this estimator is the value that makes the Kendall tau rank correlation coefficient comparing the sample data values y_i with their estimated values m·x_i + b become approximately zero.
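To make the definition concrete, here is a minimal Python sketch of the two medians described above. The function name `theil_sen` and the sample points are my own illustrative assumptions (the points are made up and are not the ticket-sales data); it is meant as a sketch of the idea, not a polished implementation.

```python
from itertools import combinations
from statistics import median


def theil_sen(points):
    """Return (slope, intercept) for a list of (x, y) points.

    Hypothetical helper: slope is the median of pairwise slopes,
    intercept is the median of y - m*x, as in the quoted definition.
    """
    # Use only pairs with distinct x-coordinates (Sen's refinement).
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x1 != x2
    ]
    m = median(slopes)
    # One candidate intercept per data point, then take their median.
    b = median(y - m * x for x, y in points)
    return m, b


# Made-up points: four lie on y = 2x + 1 and the last one is an outlier.
sample = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 30.0)]
m, b = theil_sen(sample)
print(m, b)  # 2.0 1.0
```

Six of the ten pairwise slopes equal 2, so the median ignores the outlier at x = 4 entirely.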

I really like this idea because it reinforces a lot of procedures for working with linear equations. Here is how I might do the lesson. A link to the entire Desmos graph is here.

First, give the students data and have them plot it with Desmos. This data is the annual gross ticket sales (in hundreds of millions), where x = 0 for 1995. The table feature in Desmos is great for this.

Ticket sales (hundreds of millions) vs. years (x = 0 for 1995)

Next, I would have students find the first-order differences and plot these on the same graph. We would discuss what these values mean and how, because they are approximately constant, a linear model would be a good fit.

Green dots are the first order differences
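For anyone who wants to check the first-order differences outside of Desmos, here is a small sketch. The `sales` values are hypothetical placeholders, not the real ticket-sales numbers, and `first_differences` is a name I made up.

```python
def first_differences(values):
    """Return the differences between consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Hypothetical yearly values, one per year, so x increases by 1 each step.
sales = [5.3, 5.8, 6.2, 6.8, 7.3]
diffs = first_differences(sales)

# Rounded for display because of floating-point noise.
print([round(d, 1) for d in diffs])
```

Because the x-values are one year apart, each difference is also the slope between two consecutive points.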

Next we would begin finding the median slope. We might begin by asking how many different slopes could be found among 17 points. Obviously, no one would find them all alone, so we would assign a certain number for each student to find. Then we would gather up all of those slopes and plot them in Desmos. This should be a great visual for seeing the outliers among the slopes within the data. (For this example, I only found 10 different slopes. Also, note that the first-order differences can be used as slope values: they are the slopes between consecutive points.)

10 different slopes found within the data.
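The counting question can be checked with a short sketch. It assumes every pair of the 17 points has distinct x-values (true for yearly data), and the class size of 8 and the helper `assign_pairs` are hypothetical choices of mine.

```python
from itertools import combinations
from math import comb

n_points = 17
# Each slope needs two points, so count the unordered pairs.
n_slopes = comb(n_points, 2)
print(n_slopes)  # 136


def assign_pairs(n_points, n_students):
    """Deal the index pairs out to students round-robin (hypothetical helper)."""
    pairs = list(combinations(range(n_points), 2))
    return [pairs[i::n_students] for i in range(n_students)]


# With a hypothetical class of 8 students, each gets 17 pairs.
groups = assign_pairs(n_points, 8)
print([len(g) for g in groups])
```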

Then we can discuss which “average” (mean, median, mode, or midrange) we should use to find the “average slope.” In Desmos, finding the median slope is easy. Click on the top line, then hide it. Click on the bottom line, then hide it. Click on the new top line, then hide it. Click on the new bottom line, then hide it. Continue doing so and this will result in the median slope. Here is a picture of the final two.

The two median slopes out of the ten.
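The hide-the-top, hide-the-bottom routine is the same as trimming a sorted list from both ends. Here is a rough sketch; the ten slope values are invented for illustration, and both function names are my own.

```python
def trim_to_middle(values):
    """Drop the largest and smallest value repeatedly until 1 or 2 remain."""
    if not values:
        raise ValueError("need at least one value")
    remaining = sorted(values)
    while len(remaining) > 2:
        remaining = remaining[1:-1]  # hide the bottom and the top
    return remaining


def median_by_hiding(values):
    """Median as the mean of whatever survives the trimming."""
    middle = trim_to_middle(values)
    return sum(middle) / len(middle)


# Ten made-up slopes, including a low and a high outlier.
slopes = [0.12, 0.35, 0.28, 0.31, 0.30, 0.26, 0.45, 0.29, 0.33, 0.27]
print(trim_to_middle(slopes))    # the final two, as in the picture
print(median_by_hiding(slopes))  # their midpoint
```

With an even number of slopes two survive, so the median is the value halfway between them.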

We can also plot that median slope with the first-order differences. This could bring up a good discussion about whether we really need to find other slopes or could just use the first-order differences to find the “median slope.”

Plotting the line y=median slope with the 1st order differences

Next we can go back to the table and find the median y-intercept.  In the Desmos table, we will make a column of values that is the expression y-(median slope)x.  We can also plot those points to show what the y-intercept would be for each data point.  Here is that graph.

Purple dots are the y-intercepts based on the median slope and data point.

Now that we have all of those different y-intercepts, we can use a slider to estimate the median y-intercept value. We could also throw the values into a spreadsheet if we wanted, but I think the slider will be good enough. I made the slider have a lower bound of 4 and an upper bound of 5. The b value ended up being 4.554.

Using a slider value to estimate the median y-intercept.

Finally, we are ready to plot the line of “median fit” using the equation y = (median slope)x + (median y-int).

The end result: Line of Median Fit.

For only using ten different slopes, I would say that the line looks pretty good. However, the data did have a strong correlation to begin with. I have not compared the “median line” to the least-squares line because I think that would be a good follow-up. I think this method goes to the heart of regression. Students get to see how many different lines are used to find the best line. Students review stats concepts and how outliers impact different averages. Students are creating a lot of evidence for their model instead of just relying on the “r-value.”
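As a possible starting point for that follow-up, here is a sketch that computes a least-squares line by hand and compares residuals against a candidate median line. The points are made up (four on y = 2x + 1 plus one outlier), the median line (slope 2, intercept 1) is simply typed in for those points, and all function names are my own assumptions.

```python
from statistics import mean


def least_squares(points):
    """Ordinary least-squares slope and intercept for (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    m = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return m, y_bar - m * x_bar


def sum_abs_residuals(points, m, b):
    """Total vertical distance between the points and the line y = m*x + b."""
    return sum(abs(y - (m * x + b)) for x, y in points)


# Made-up data with one outlier at x = 4.
points = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 30.0)]
ls_m, ls_b = least_squares(points)

print("least squares:", round(ls_m, 2), round(ls_b, 2))
print("abs residuals, least squares:", round(sum_abs_residuals(points, ls_m, ls_b), 2))
print("abs residuals, median line:  ", sum_abs_residuals(points, 2.0, 1.0))
```

On this toy data the single outlier pulls the least-squares slope well above 2, while the median line stays on the four well-behaved points.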

One other thought would be to have students create an error region for the model. This might help them understand ideas of interpolation and extrapolation. Plus, it might allow us to discuss standard deviation, too. In the graph below I graphed {median slope(x) + 1.15(median y-int)} and {median slope(x) – 1.15(median y-int)} to create a region 15% above and 15% below. I could instead have found the standard deviation of the b values and gone three standard deviations above and below the median b.

15% above and 15% below the median line.
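Here is a rough sketch of the standard-deviation version of the error region. It assumes the band is built from the spread of the per-point intercepts y − (median slope)x; the data, the slope of 2, and the name `intercept_band` are all hypothetical.

```python
from statistics import median, stdev


def intercept_band(points, m, k=3):
    """Return (low, b, high): the median intercept b and b -/+ k standard deviations.

    Hypothetical helper; the spread is the sample standard deviation
    of the per-point intercepts y - m*x.
    """
    intercepts = [y - m * x for x, y in points]
    b = median(intercepts)
    spread = stdev(intercepts)
    return b - k * spread, b, b + k * spread


# Made-up points scattered around y = 2x + 1.
points = [(0, 1.0), (1, 3.5), (2, 4.5), (3, 7.0)]
low, b, high = intercept_band(points, m=2)

# The region lies between y = 2x + low and y = 2x + high.
print(round(low, 3), b, round(high, 3))
```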

The more I explore this concept, the more it seems to be turning into a statistical analysis. I need to determine if that is the route I want to take, since the class I am developing this for is a “Math Modeling” course.

I hope all of this gives you some ideas about linear regression. I have not designed the lab sheet that will go with this yet. I would love to hear feedback if you have any.

Summer is here. Yippee for me! This means that I can get back to blogging some of my ideas. So, here is my next idea, The Math of Temple Run.

Have you ever played Temple Run? Probably, yes, given the game’s popularity. If not, here is information about the game. http://www.imangistudios.com The game is free to play. Did you catch that – FREE! But what is better than being free is that Temple Run has a lot of math problems waiting to be explored. Here are some that I will be trying out.

• How fast is the person running?  The game keeps track of how far you run in meters.  This means that you can time how long the person runs and then calculate the speed.
• Is the person’s speed possible?  I won’t spoil it for you but you need to see how fast the person runs.
• The person runs faster as the game progresses. Students can make a chart of distance and time because the game flashes up the distances as you travel. Does the increase in speed follow a quadratic pattern, or is it more piecewise linear?
• Next have students play the game and record 15 rounds of data.  Here is a link to the table of data that I made. https://www.desmos.com/calculator/barud9egsn
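The speed questions in the list above come down to distance divided by time over each stretch of the run. Here is a small sketch; the (seconds, meters) readings are invented, not measured from the game, and the function names are mine.

```python
def interval_speeds(readings):
    """Average speed (m/s) between consecutive (seconds, meters) readings."""
    return [
        (d2 - d1) / (t2 - t1)
        for (t1, d1), (t2, d2) in zip(readings, readings[1:])
    ]


def to_kmh(meters_per_second):
    """Convert m/s to km/h."""
    return meters_per_second * 3.6


# Hypothetical readings taken every 10 seconds.
readings = [(0, 0), (10, 100), (20, 220), (30, 360)]
speeds = interval_speeds(readings)
print(speeds)                       # [10.0, 12.0, 14.0]
print([to_kmh(s) for s in speeds])  # the same speeds in km/h
```

If the speeds over equal time intervals go up by a constant amount, as in this made-up data, the distances follow a quadratic pattern; speeds that stay flat for a while and then jump would point toward a piecewise linear one.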

Blue Dots: Coins vs Distance – Red Dots: Distance vs Score

The data looks to have a strong linear correlation, which allows us to explore rates of change. What does the rate of change mean for the blue dots, and what does the rate of change mean for the red dots? Are certain games better than others? Is a better game based on the distance? Is a better game based on the number of coins? Is a better game related to the number of coins compared to the score?

• In Desmos we can use sliders to create a line of best fit. The ones I made were y=(coins/dist)x and y=(score/dist)x. I will probably discuss with students why we can make the initial value zero. (Actually, one of the goals in the game is to go 1,000 meters and get zero coins.) Next, with the slider values, we can set either the numerator or the denominator equal to 1 and adjust the other slider accordingly. Once they have the line of best fit, we can talk about what it means for data points to be above or below it.
• Remember that each student will be playing the game, hopefully. So, this means we will have many different graphs. This will allow us to talk about how you can look at a graph and say, “That is a good player. This is a so-so player. Etc.” or to be able to look at a graph and point out who has had more experience playing the game.
• How is the score calculated? After students record the number of coins, distance, and score, they will have some data to try to figure out how the score is calculated. The game does not tell you how the score is calculated. All you see is a running total in the upper right corner. Wikipedia does give a formula for the score. http://en.wikipedia.org/wiki/Temple_Run However, that formula was not working for me. I might be making some type of error. But this is good because it gives students the opportunity to figure it out. Plus, the formula opens the door to the floor function and ordinal numbers.
• I might throw in Game Theory at the end.  Not quite sure.
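For the lines through the origin, one way to check a slider value is to compute a rate for every game and take the median of those rates. This is my own assumption borrowed from the median-of-slopes idea, not the slider method itself, and the (distance, coins) data and function names are made up.

```python
from statistics import median


def rate_through_origin(points):
    """Median of y/x over all points with x != 0 (hypothetical helper)."""
    return median(y / x for x, y in points if x != 0)


def side_of_line(points, k):
    """Label each point as above, below, or on the line y = k*x."""
    labels = []
    for x, y in points:
        if y > k * x:
            labels.append("above")
        elif y < k * x:
            labels.append("below")
        else:
            labels.append("on")
    return labels


# Made-up (distance in meters, coins) results from five games.
games = [(400, 40), (800, 100), (1200, 180), (1600, 160), (2000, 300)]
k = rate_through_origin(games)
print(k)                       # coins per meter
print(side_of_line(games, k))  # which games beat the typical rate
```

A game above the line collected more coins than is typical for the distance run, which is one way to start the "good player versus so-so player" conversation.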

Plenty of stuff to work on there. I have not made the worksheets yet, but I will update the post soon. If you could, please play the game 15 times and record your data in this Desmos graph https://www.desmos.com/calculator/b2bycwrenn. Save your work and then post it here in the comments or on Twitter @LukeSelfwalker. I will gather up the data and post it later. Thanks for your help.