This article had a brutal leap from being aimed at someone with barely any understanding of maths (complete with friendly emojis) to using cost functions (without any explanation) and associated code. It's a bit like the "how to draw an owl" meme.
A good article on linear regression, in my opinion, would break it down into three steps:
1. Spend a bit of time looking at cost functions. In principle linear regression is finding the "best" line, where by "best" we consider all possible lines (yes, all uncountably infinite of them), compute the cost function for each one, and pick the one where the cost comes out lowest. You want to show a few example lines on top of some example points and label their costs. Start with absolute deviation (i.e. l1 norm) - let's face it, that's really the most obvious cost function if you don't already know what comes next - then contrast with least squares (i.e. l2 norm). For example, note that the least squares cost function "cares" more about points that are a particularly long way away.
2. Admit that, OK, we do want an algorithm more sensible than "try every possible line, laboriously computing the cost of each one". Now you can talk about gradient descent - and I mean WHY you use gradient descent, not the computation. And now you can mention that least squares is differentiable, so it works nicely with gradient descent, which is the real reason we tend to prefer it over absolute deviation.
3. Finally, after both of those you can solve the gradient descent equations and show some associated code.
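For anyone who wants to poke at those three steps, here is a minimal sketch in plain Python. The data points, the two candidate lines, the grid bounds and the learning rate are all made up for illustration; nothing here comes from the article itself.

```python
# Toy data (made up): four points on y = 2x plus one outlier at x = 4.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 2.0, 4.0, 6.0, 18.0]

def residuals(slope, intercept):
    """Vertical distance from each point to the line y = slope*x + intercept."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def absolute_cost(slope, intercept):
    """Sum of absolute deviations (l1)."""
    return sum(abs(r) for r in residuals(slope, intercept))

def squared_cost(slope, intercept):
    """Sum of squared deviations (l2, least squares)."""
    return sum(r * r for r in residuals(slope, intercept))

# Step 1: label a couple of candidate lines with their costs.
candidates = [("through the four aligned points", (2.0, 0.0)),
              ("pulled towards the outlier", (4.0, -2.0))]
for name, line in candidates:
    print(name, absolute_cost(*line), squared_cost(*line))

# Step 2: "try every possible line", approximated by a coarse grid
# of (slope, intercept) pairs in steps of 0.5.
grid = [(s / 2, b / 2) for s in range(0, 13) for b in range(-10, 11)]
best_absolute = min(grid, key=lambda line: absolute_cost(*line))
best_squared = min(grid, key=lambda line: squared_cost(*line))

# Step 3: gradient descent on the mean squared cost, which is differentiable.
def gradient_descent(lr=0.05, steps=2000):
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        r = residuals(slope, intercept)
        grad_slope = -2.0 / n * sum(ri * x for ri, x in zip(r, xs))
        grad_intercept = -2.0 / n * sum(r)
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return slope, intercept

gd_slope, gd_intercept = gradient_descent()
```

On this toy data the first line costs 10 (absolute) and 100 (squared), while the second costs 12 and 40, so the two cost functions disagree about which line is "best": least squares is the one that chases the far-away point. Gradient descent lands on the same line as the grid search for the squared cost without visiting every candidate.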
Yeah, the leap from cute emoji variables to partial derivatives really threw me off. The vectorisation was a final punch in the gut (“I hope you can convince yourself, assuming you know matrix multiplication”).
Oops, you're absolutely right. I remember back at secondary school a teacher going through exactly the process I described, with him asking us what the best choice of line would be. The first choice someone (not me sadly) suggested was indeed this one.
For others that, like me, don't already know it (at least by name): the Deming cost function is the sum of perpendicular distances to the points from the line, as opposed to measuring only the vertical component.
Edit: Actually it looks like Deming regression is a bit more subtle and statistical than just sum of perpendicular distances. A treatment of linear regression from a statistical perspective of trying to untangle some random noise is very worthy, but I'd save it for a second lesson, following a first lesson that's just an unmotivated "let's choose the line that's somehow closest to all the points".
Deming regression has some serious drawbacks compared to OLS, most importantly that it's not scale/unit-invariant: if you rescale your variables (e.g. use house area in square metres instead of square feet to predict house prices) you get different results.
This may be less of an issue for the machine learner who's only interested in point predictions, but it's a serious concern for us old fashioned statisticians who are more interested in inference.
I would not call it a drawback. In regular regression you make an assumption that x is known exactly, and all errors are in y. In Deming regression you make another assumption, about the ratio of their variances (δ). If you change units, you need to change δ as well, and then there's no difference.
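Both points above can be checked numerically. The sketch below uses the textbook closed-form Deming slope for a single predictor, with δ taken as the ratio of the y-error variance to the x-error variance; the house areas and prices are invented toy numbers, and the function names are just for illustration.

```python
import math

# Toy data (made up): house area in square feet, price in thousands.
area_sqft = [800.0, 1000.0, 1200.0, 1500.0, 2000.0]
price = [150.0, 200.0, 210.0, 280.0, 340.0]

SQFT_TO_SQM = 0.092903
area_sqm = [a * SQFT_TO_SQM for a in area_sqft]

def moments(xs, ys):
    """Centred sums of squares and cross-products."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def ols_slope(xs, ys):
    sxx, _, sxy = moments(xs, ys)
    return sxy / sxx

def deming_slope(xs, ys, delta):
    """Deming slope; delta = var(errors in y) / var(errors in x)."""
    sxx, syy, sxy = moments(xs, ys)
    d = syy - delta * sxx
    return (d + math.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)

# Express every slope as "price per square foot" so they are comparable.
ols_ft = ols_slope(area_sqft, price)
ols_m = ols_slope(area_sqm, price) * SQFT_TO_SQM

# Same delta in both unit systems: the fitted line changes.
dem_ft = deming_slope(area_sqft, price, delta=1.0)
dem_m_same_delta = deming_slope(area_sqm, price, delta=1.0) * SQFT_TO_SQM

# Rescale delta along with x (x-error variance scales by c^2): same line again.
dem_m_rescaled = deming_slope(
    area_sqm, price, delta=1.0 / SQFT_TO_SQM ** 2) * SQFT_TO_SQM

print(ols_ft, ols_m)
print(dem_ft, dem_m_same_delta, dem_m_rescaled)
```

With δ held at 1 in both unit systems the Deming fit differs, which is the complaint; with δ rescaled alongside the units it agrees, which is the reply. OLS agrees either way.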
i wouldn't dare to say it's a 'serious drawback'. as soon as you add ridge to regression (something very common for statisticians to do), it's also no longer scale/unit-invariant.
Edit: this is also not just true for ridge, but lasso as well
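A quick way to see this, as a minimal sketch: one centred predictor, an unpenalised intercept, and the closed-form ridge and lasso slopes for the objectives written in the comments. The data, the penalty value and the metres-to-centimetres rescaling are made up for illustration.

```python
import math

# Toy data (made up); x is a length in metres.
xs_m = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
xs_cm = [100.0 * x for x in xs_m]  # same lengths in centimetres

def centred_sums(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, sxy

def ols_slope(xs, ys):
    sxx, sxy = centred_sums(xs, ys)
    return sxy / sxx

def ridge_slope(xs, ys, lam):
    # Minimises sum((y_c - b * x_c)^2) + lam * b^2 over b.
    sxx, sxy = centred_sums(xs, ys)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    # Minimises sum((y_c - b * x_c)^2) + lam * |b| over b (soft-thresholding).
    sxx, sxy = centred_sums(xs, ys)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx

lam = 10.0
# Multiply the centimetre slopes by 100 to express them per metre.
ols_m, ols_cm = ols_slope(xs_m, ys), 100.0 * ols_slope(xs_cm, ys)
ridge_m, ridge_cm = ridge_slope(xs_m, ys, lam), 100.0 * ridge_slope(xs_cm, ys, lam)
lasso_m, lasso_cm = lasso_slope(xs_m, ys, lam), 100.0 * lasso_slope(xs_cm, ys, lam)

print(ols_m, ols_cm)      # same fit in either unit
print(ridge_m, ridge_cm)  # different fits
print(lasso_m, lasso_cm)  # different fits
```

With the same penalty, measuring x in centimetres makes the coefficient numerically tiny, so the penalty barely bites and the fit is close to OLS; in metres the same penalty shrinks the slope heavily. This is why predictors are usually standardised before applying these regularisers.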
Personally I consider the lack of scale-invariance one of the main drawbacks of most common regularizers too. Again, not a big deal if all you're after are y-hat, a bit more concerning if you're interested in beta-hat.
i try not to stick too religiously to coefficient interpretation unless it's a very simple and known problem.
one missing variable could change the coefficient's sign.
This is also called Total Least Squares -- IIRC it was developed for use when you have measurement error in X (least squares assumes no measurement error in X, all error is in Y).
haha - I love the use of the word brutal here! It is a brutal leap.
I wanted to give the ideas in the simplest way with friendly emojis and then go onto the more complex derivation with the cost function etc. I haven't explained the idea of the cost function enough. I think your idea for starting with the cost function (as a concept - not a formula) and absolute error may have been nicer to be fair. I can see a nice d3 visual where you can slide the gradient and see the total error change!
I could always add the line of best fit approach and more intuition for the cost function after I introduce training data (and before the brutal leap)?
You use gradient descent, but do not introduce the normal equations. This is problematic for at least two reasons.
Case 1: Design matrix has full rank
Omitting the normal equations obfuscates what is really going on. You are inverting a matrix to solve the first order condition of a strictly convex objective, which therefore has a unique optimum.
Case 2: Design matrix does not have full rank
Omitting the normal equations hides the fact that there are multiple solutions to the first order condition. Gradient descent will find one of them, but you need a principled method for selecting among them. The Moore-Penrose pseudo-inverse method gives you the solution with the smallest L2 norm.
Omitting these details is setting learners up for failure.
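Case 2 is easy to reproduce. A minimal sketch, assuming the most degenerate design possible: two identical feature columns, so the design matrix has rank 1. The data, learning rate and starting points are made up for illustration.

```python
# Rank-deficient toy problem: both features are the same column x,
# so the model is y ~ w1*x + w2*x and only the sum w1 + w2 is identified.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # any (w1, w2) with w1 + w2 == 2 fits perfectly

def gradient_descent(w1, w2, lr=0.05, steps=2000):
    """Plain gradient descent on the mean squared error (halved)."""
    n = len(xs)
    for _ in range(steps):
        # Both features equal x, so both partial derivatives are identical.
        g = sum((w1 * x + w2 * x - y) * x for x, y in zip(xs, ys)) / n
        w1 -= lr * g
        w2 -= lr * g
    return w1, w2

def squared_norm(w):
    return w[0] ** 2 + w[1] ** 2

from_zero = gradient_descent(0.0, 0.0)
from_elsewhere = gradient_descent(3.0, 0.0)

print(from_zero, squared_norm(from_zero))
print(from_elsewhere, squared_norm(from_elsewhere))
```

Both runs fit the data exactly, but they return different weights: which solution gradient descent finds depends on where it starts. Starting from zero it happens to reach (1, 1), the minimum-norm solution that the Moore-Penrose pseudo-inverse would give; starting from (3, 0) it reaches (2.5, -0.5), which fits equally well with a larger norm.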
You may be amused to learn that "How would you program a solver for a system of linear equations?" was an informal interview question for a top machine learning PhD program, and applicants were not looked upon favorably if they mindlessly gave gradient descent as an answer.
As I mentioned in another comment, your solution derivation doesn't make sense for someone who hasn't already seen/done it before. And for them, the intro seems a bit too cutesy.
It seems to me that you're trying to condense a few weeks of intro linear algebra into a single blog post. In my experience that only works as a refresher, not for someone starting from zero.
Well, since everyone seems to be saying they're not a fan, I'll just add that I think it's great. Then again, I'm familiar with linear algebra, error functions, etc so maybe the writer is writing for a specific type of reader with a specific background. In spite of being familiar with the material, I still quite enjoy reading how other people explain these things.
There's a couple of really good things in this intro that I wanted to point out.
1. I appreciate the rigour that went into the math notation. I need to see equations to start to make sense of something. And if you're a beginner, this at least provides an intuition for starting to grok these visuals.
2. It's visually very clear, and structured clearly. Sounds trivial, but this kind of visual organization helps a lot to break down complex topics.
3. Intro -> Math theory -> Python breakdown. The best way to teach math.
4. No memes. I feel like there's a huge temptation in articles for beginners to litter the post with moving gifs and memes which are distracting and annoying.
Personally, I find linear regression makes more sense from an orthogonality of sub-spaces perspective. That the same solution can be derived as the optimum of some cost function is neat though.
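The orthogonality view can be checked directly: at the least squares solution the residual vector is perpendicular to every column of the design matrix, here the column of ones and the column of x values. A minimal sketch with made-up data, using the closed-form fit for a single predictor:

```python
# Toy data (made up).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Closed-form least squares fit for y ~ intercept + slope * x.
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Residuals are orthogonal to each column of the design matrix ...
dot_with_ones = dot(residuals, [1.0] * n)
dot_with_x = dot(residuals, xs)
# ... and therefore to the fitted values, which lie in the column space.
dot_with_fitted = dot(residuals, fitted)

print(dot_with_ones, dot_with_x, dot_with_fitted)
```

All three dot products come out as zero up to floating-point rounding: the fitted values are the orthogonal projection of y onto the subspace spanned by the columns, and the residual is what is left over.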