Guide to Linear Regression

gpcz · on July 21, 2015

A related really cool thing I learned in grad school is that you can implement Deming regression (https://en.wikipedia.org/wiki/Deming_regression) by storing the moments (outer product sum and point sum) of the training points and then finding the dominant eigenvector of the outer product sum using a singular value decomposition. Since you can approximate the directional part of a 2x2 SVD with atan2(), it effectively becomes a O(1) operation to add or remove training points.

(specifics here: https://april.eecs.umich.edu/courses/eecs568_f12/linefitting... )

jaytaylor · on July 22, 2015

That is very cool. I love how closely the formal math relates to the actual code needed/involved.

alexhwoods · on July 22, 2015

agreed

bluusteel · on July 22, 2015

I thought closed form expressions existed for linear regression. Why is gradient descent needed?

srean · on July 22, 2015

That closed form requires matrix inversion. And that is almost always (but not always) a bad idea. It is numerically unstable/sensitive and more expensive than need be. Sometimes you also need a quick ball-park figure of the answer and the ability to query the answer anytime. These can be done with iterative algorithms (gradient descent is one such iterative algorithm, conjugate gradient descent would be an improvement of that). With the closed form you have to wait till it finishes going through the motions. Its an all or nothing deal. OTOH you can stop an iteration anytime and peek at the current estimate of the answer.

If this comment has even one takeaway, I would like that to be "don't invert, unless you are very sure that is exactly what you need". In some scenarios inverses are indeed required, solving linear equations are almost always not that scenario.

east2west · on July 22, 2015

Some clarifications. Explicitly computing matrix inversion is never a good idea, unless you require an inverted matrix. For solving the normal equation, indeed for solving any linear equation, use appropriate specific algorithms (variants of Gaussian elimination, QR, SVD, ...). For regression, computing the inverse of covariance matrix is actually faster than QR methods ( plural because there are more than one) at the cost of numerical instability. SVD is the more stable but also more expensive method. These are the standard methods you will find in any linear algebra or statistics textbooks. For regular use they are just fine. It is when you have thousands of variables and tens of thousands of measurements when they fall short. Iterative optimization algorithms can be used, or random matrix algorithms. There is no free lunch, because numerical instability is inherent in these algorithms, but you do have some, more relaxed, guarantee of errors or probability of errors.

The reason I said iterative optimization algorithms can be used for regression because linear regression is a quadratic optimization problem. We have analytic solutions because of special linear algebra properties. It can be said that we have a good handle on quadratic optimization problems, hence the multitude of algorithm choices. General optimization algorithms necessarily carry cost in numerical instability, but it may be a cost worth paying, especially for large data sets.

Finally, I want to echo your sentiment that DO NOT invert. Solve linear equations should be by tailored algorithms. In theory they yield identical results, in practice specific algorithms are much better.

santaclaus · on July 22, 2015

Implicit in this discussion is the fact that the matrix is positive definite, no?

shoo · on July 22, 2015

yeah.

and helpfully, in this context, if you're starting from the point of minimising the L2 error between the output of some linear function and a target, then the resulting linear system will have the form $A^T A x = A^T b$, so the matrix in question is $A^T A$, so it'll always be positive semidefinite.

imh · on July 22, 2015

For some models, linear regression included, there may be a closed form solution, but it might just be too expensive to compute. In particular, high dimensions can screw everything and sparsity can allow much cheaper solutions, so sometimes you just revert to gradient descent or SGD.

dlib · on July 22, 2015

For completeness I would include the closed form solution in a discussion of linear regression. In matrix notation is fairly simple and it's easy to follow how it falls apart when the assumptions underlying linear regression are not present in the data you're working with (homoskedasticity, independence of errors etc.)

cstuder · on July 22, 2015

This is a nice guide, I've also enjoyed playing around with the Ipython Notebook here: http://nbviewer.ipython.org/github/justmarkham/DAT4/blob/mas...

The one question which remains: Is there a more intuitive guide which helps you with deciding which features you need to choose in order to get a good regression?

alexhwoods · on July 22, 2015

Hey, one easy way to decide features is to use a correlation matrix. The stronger the correlation coefficient r is, the more of a linear relationship exists.

The code goes like this -

install.packages('corrplot') library(corrplot)

mcor <- cor(crime) # if crime is your dataframe corrplot(mcor)

That's one easy way to start out. Perhaps I'll write a post on feature engineering.

cstuder · on July 22, 2015

Thank you, I will try this. And I would love to see such a post.

kastnerkyle · on July 22, 2015

For anyone who is interested in a python version with inferior plots I have a notebook here:

http://kastnerkyle.github.io/posts/linear-regression/

Closed form is at the bottom, basically invalidating the whole rest of my post.

npx · on July 22, 2015

Gonna use this as the title for my autobiography! :)