This article had a brutal leap from being aimed at someone with barely any understanding of maths (complete with friendly emojis) to the use of cost functions (without any explanation) and associated code. It's a bit like the "how to draw an owl" meme.

A good article on linear regression, in my opinion, would break it down into three steps:

1. Spend a bit of time looking at cost functions. In principle linear regression is finding the "best" line, where by "best" we consider all possible lines (yes, all uncountably infinite of them), compute the cost function for each one, and pick the one where the cost comes out lowest. You want to show a few example lines on top of some example points and label their costs. Start with absolute deviation (i.e. the l1 norm) - let's face it, that's really the most obvious cost function if you don't already know what comes next - then contrast with least squares (i.e. the l2 norm). For example, note that the least squares cost function "cares" more about points that are a particularly long way away (see the rough numpy sketch after this list).

2. Admit that, OK, we do want an algorithm more sensible than "try every possible line, laboriously computing the cost of each one". Now you can talk about gradient descent - and I mean WHY you use gradient descent, not the computation. And now you can mention that least squares is differentiable, so it plays nicely with gradient descent, which is the real reason we tend to prefer it over absolute deviation.

3. Finally, after both of those you can solve the gradient descent equations and show some associated code.
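
To make steps 1 and 2 concrete, here's a rough numpy sketch - the data, the candidate lines and the step size are all made up for illustration, and the gradient descent is the plainest possible version:

    import numpy as np

    # Toy data: roughly y = 2x + 1, with the last point a long way off the trend.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 3.0, 4.9, 7.2, 20.0])

    def l1_cost(m, c):
        return np.sum(np.abs(y - (m * x + c)))    # sum of absolute deviations

    def l2_cost(m, c):
        return np.sum((y - (m * x + c)) ** 2)     # sum of squared deviations

    # Step 1: score a few candidate lines under both costs.
    for m, c in [(2.0, 1.0), (3.0, 0.5), (4.0, 0.0)]:
        print(f"m={m}, c={c}: L1={l1_cost(m, c):.1f}, L2={l2_cost(m, c):.1f}")

    # Steps 2 and 3: rather than trying every line, follow the gradient of the
    # (differentiable) least-squares cost downhill from an arbitrary start.
    m, c, lr = 0.0, 0.0, 0.01
    for _ in range(5000):
        r = y - (m * x + c)           # residuals for the current line
        m += lr * 2 * np.mean(r * x)  # step against dJ/dm
        c += lr * 2 * np.mean(r)      # step against dJ/dc
    print("gradient descent found m =", m, "c =", c)

On this toy data the absolute-deviation cost prefers the candidate line that hugs the bulk of the points, while least squares prefers the one tilted towards the far-away point - exactly the contrast step 1 is meant to show.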




Yeah, the leap from cute emoji variables to partial derivatives really threw me off. The vectorisation was a final punch in the gut (“I hope you can convince yourself, assuming you know matrix multiplication”).


> that's really the most obvious cost function

To me Deming regression is the most obvious. It actually took me a long time to realize that x~y and y~x are in most cases different lines.


Oops, you're absolutely right. I remember back at secondary school a teacher going through exactly the process I described, with him asking us what the best choice of line would be. The first choice someone (not me sadly) suggested was indeed this one.

For others that, like me, don't already know it (at least by name): the Deming cost function is the sum of perpendicular distances to the points from the line, as opposed to measuring only the vertical component.

Edit: Actually it looks like Deming regression is a bit more subtle and statistical than just the sum of perpendicular distances. A treatment of linear regression from a statistical perspective of trying to untangle some random noise is very worthy, but I'd save it for a second lesson, following a first lesson that's just an unmotivated "let's choose the line that's somehow closest to all the points".
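
To make the vertical-vs-perpendicular distinction concrete, a tiny sketch (made-up points and a made-up candidate line): the perpendicular distance to y = m*x + c is just the vertical one divided by sqrt(m^2 + 1).

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.5, 2.4, 3.1, 5.6])
    m, c = 1.5, 0.5                               # some candidate line y = m*x + c

    vertical = np.abs(y - (m * x + c))            # what an OLS-style cost measures
    perpendicular = vertical / np.sqrt(m**2 + 1)  # point-to-line distance

    print("sum of vertical distances:     ", vertical.sum())
    print("sum of perpendicular distances:", perpendicular.sum())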


Deming regression has some serious drawbacks compared to OLS, most importantly that it's not scale/unit-invariant: if you rescale your variables (e.g. use house area in square metres instead of square feet to predict house prices) you get different results.

This may be less of an issue for the machine learner who's only interested in point predictions, but it's a serious concern for us old fashioned statisticians who are more interested in inference.


I would not call it a drawback. In regular regression you make an assumption that x is known exactly, and all errors are in y. In Deming regression you make another assumption, about the ratio of their variances (δ). If you change units, you need to change δ as well, and then there's no difference.
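
A rough numerical check of that claim, using the textbook closed-form Deming slope on made-up data (metres vs feet rather than house areas, so the effect is visible):

    import numpy as np

    def deming_slope(x, y, delta):
        # Closed-form Deming slope; delta = var(errors in y) / var(errors in x)
        sxx = np.var(x, ddof=1)
        syy = np.var(y, ddof=1)
        sxy = np.cov(x, y, ddof=1)[0, 1]
        d = syy - delta * sxx
        return (d + np.sqrt(d ** 2 + 4 * delta * sxy ** 2)) / (2 * sxy)

    rng = np.random.default_rng(0)
    x_m = rng.uniform(0, 10, 200)              # x measured in metres
    y = 2 * x_m + rng.normal(0, 3, 200)
    k = 3.28084                                # metres -> feet
    x_ft = k * x_m

    b_metres = deming_slope(x_m, y, delta=1.0)
    b_naive = deming_slope(x_ft, y, delta=1.0)            # units changed, delta left alone
    b_adjusted = deming_slope(x_ft, y, delta=1.0 / k**2)  # delta rescaled along with x

    # Convert the feet-based slopes back to metre units for comparison.
    print(b_metres, k * b_naive, k * b_adjusted)

The first and last printed slopes agree (the same line expressed in different units); the middle one, where delta was left untouched after rescaling, is a genuinely different fit.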


I wouldn't go so far as to call it a 'serious drawback'. As soon as you add ridge to regression (something very common for statisticians to do), it's also no longer scale/unit-invariant.

Edit: this is not just true for ridge, but for lasso as well.
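
A quick sketch of that point on made-up data (no intercept, fixed penalty, numpy only): rescaling a feature leaves the OLS fitted values unchanged but changes the ridge ones.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    y = X @ np.array([1.0, -2.0]) + rng.normal(0, 0.5, 50)

    def fit(X, y, lam):
        # OLS when lam=0, ridge otherwise (no intercept, to keep it short)
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    S = np.diag([1000.0, 1.0])     # rescale the first feature (e.g. km -> m)
    X_scaled = X @ S

    for lam in [0.0, 10.0]:
        same = np.allclose(X @ fit(X, y, lam), X_scaled @ fit(X_scaled, y, lam))
        print(f"lambda={lam}: fitted values unchanged by rescaling? {same}")

    # Prints True for lambda=0 (OLS) and False for lambda=10 (ridge).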


Personally, I consider the lack of scale-invariance one of the main drawbacks of most common regularizers too. Again, not a big deal if all you're after is y-hat; a bit more concerning if you're interested in beta-hat.


I try not to stick too religiously to coefficient interpretation unless it's a very simple and well-understood problem. One missing variable could change a coefficient's sign.


Depends on whether the missing variable is correlated with the regressors whose beta hats you care about ;-)


This is also called Total Least Squares -- IIRC it was developed for use when you have measurement error in X (least squares assumes no measurement error in X, all error is in Y).


haha - I love the use of the word brutal here! It is a brutal leap.

I wanted to give the ideas in the simplest way with friendly emojis and then go on to the more complex derivation with the cost function etc. I haven't explained the idea of the cost function enough. I think your idea of starting with the cost function (as a concept - not a formula) and absolute error may have been nicer, to be fair. I can see a nice d3 visual where you can slide the gradient and see the total error change!

I could always add the line of best fit approach and more intuition for the cost function after I introduce training data (and before the brutal leap)?


I'm very much not a fan of this introduction.

You use gradient descent, but do not introduce the normal equations. This is problematic for at least two reasons.

Case 1: Design matrix has full rank

Omitting the normal equations obfuscates what is really going on. You are inverting a matrix to solve the first order condition of a strictly convex objective, which therefore has a unique optimum.

Case 2: Design matrix does not have full rank

Omitting the normal equations hides the fact that there are multiple solutions to the first order condition. Gradient descent will find one of them, but you need a principled method for selecting among them. The Moore-Penrose pseudo-inverse method gives you the solution with the smallest L2 norm.

Omitting these details is setting learners up for failure.
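
For anyone following along, a rough numpy illustration of the two cases (the data and coefficients are made up):

    import numpy as np

    rng = np.random.default_rng(0)

    # Case 1: full-rank design matrix -- the normal equations X'X b = X'y
    # have a unique solution.
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(0, 0.1, 100)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)                      # close to [1, 2, 3]

    # Case 2: rank-deficient design (a duplicated column) -- X'X is singular,
    # so there are infinitely many solutions to the first order condition.
    Xd = np.column_stack([X[:, 0], X[:, 0], X[:, 1]])
    yd = Xd @ np.array([1.0, 1.0, 2.0]) + rng.normal(0, 0.1, 100)
    # The Moore-Penrose pseudo-inverse picks the solution with the smallest
    # L2 norm: the weight on the duplicated column gets split evenly.
    beta_min_norm = np.linalg.pinv(Xd) @ yd
    print(beta_min_norm)             # close to [1, 1, 2]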


I agree with this 100%.

You may be amused to learn that "How would you program a solver for a system of linear equations?" was an informal interview question for a top machine learning PhD program, and applicants were not looked upon favorably if they mindlessly gave gradient descent as an answer.


As I mentioned in another comment, your solution derivation doesn't make sense for someone who hasn't already seen/done it before. And for them, the intro seems a bit too cutesy.

It seems to me that you're trying to condense a few weeks of intro linear algebra into a single blog post. In my experience that only works as a refresher, not for someone starting from zero.


Well, since everyone seems to be saying they're not a fan, I'll just add that I think it's great. Then again, I'm familiar with linear algebra, error functions, etc so maybe the writer is writing for a specific type of reader with a specific background. In spite of being familiar with the material, I still quite enjoy reading how other people explain these things.


There's a couple of really good things in this intro that I wanted to point out.

1. I appreciate the rigour that went into the math notation. I need to see equations to start to make sense of something. And if you're a beginner, this at least provides an intuition for starting to grok these visuals.

2. It's visually very clear, and structured clearly. Sounds trivial, but this kind of visual organization helps a lot to break down complex topics.

3. Intro -> Math theory -> Python breakdown. The best way to teach math.

4. No memes. I feel like there's a huge temptation in articles for beginners to litter the post with moving gifs and memes which are distracting and annoying.


Personally, I find linear regression makes more sense from an orthogonality of sub-spaces perspective. That the same solution can be derived as the optimum of some cost function is neat though.
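
For anyone who wants to see that perspective in code, a tiny numpy sketch on made-up data: the fitted values are the orthogonal projection of y onto the column space of X, so the residual is orthogonal to every column.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = rng.normal(size=20)

    P = X @ np.linalg.pinv(X)       # orthogonal projector onto the column space of X
    y_hat = P @ y                   # same fitted values as minimising squared error
    residual = y - y_hat

    print(np.allclose(X.T @ residual, 0))   # residual is orthogonal to every column of X
    print(np.allclose(P @ P, P))            # projecting twice changes nothing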



