haha - I love the use of the word brutal here! It is a brutal leap.
I wanted to give the ideas in the simplest way with friendly emojis and then go on to the more complex derivation with the cost function etc. I haven't explained the idea of the cost function enough. I think your idea of starting with the cost function (as a concept - not a formula) and absolute error may have been nicer, to be fair. I can see a nice d3 visual where you can slide the gradient and see the total error change!
I could always add the line of best fit approach and more intuition for the cost function after I introduce training data (and before the brutal leap)?
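Something along these lines is what I mean by the slider, sketched in rough Python rather than d3 (the data here is just made up to illustrate the idea):

```python
import numpy as np

# Toy training data (made up purely for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

def total_absolute_error(gradient, intercept=0.0):
    """Sum of absolute errors for the line y = gradient * x + intercept."""
    predictions = gradient * x + intercept
    return np.sum(np.abs(y - predictions))

# "Sliding the gradient": evaluate the total error at a range of slopes
for gradient in np.linspace(0.0, 2.0, 9):
    print(f"gradient = {gradient:.2f} -> total error = {total_absolute_error(gradient):.2f}")
```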
You use gradient descent, but do not introduce the normal equations. This is problematic for at least two reasons.
Case 1: Design matrix has full rank
Omitting the normal equations obfuscates what is really going on. You are inverting a matrix to solve the first order condition of a strictly convex objective, which therefore has a unique optimum.
Case 2: Design matrix does not have full rank
Omitting the normal equations hides the fact that there are multiple solutions to the first order condition. Gradient descent will find one of them, but you need a principled method for selecting among them. The Moore-Penrose pseudo-inverse method gives you the solution with the smallest L2 norm.
Omitting these details is setting learners up for failure.
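To make the two cases concrete, here is a rough NumPy sketch (the numbers are made up purely for illustration, and in practice you'd reach for np.linalg.lstsq rather than forming X^T X explicitly):

```python
import numpy as np

# Made-up design matrix and targets, purely for illustration
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.5])

# Case 1: X has full column rank, so X^T X is invertible and the
# first order condition X^T X w = X^T y has a unique solution.
w_unique = np.linalg.solve(X.T @ X, X.T @ y)

# Case 2: rank-deficient design matrix (third column duplicates the second),
# so the first order condition has infinitely many solutions.
X_deficient = np.hstack([X, X[:, [1]]])

# The Moore-Penrose pseudo-inverse picks out the minimum-L2-norm solution.
w_min_norm = np.linalg.pinv(X_deficient) @ y

print("unique solution:      ", w_unique)
print("minimum-norm solution:", w_min_norm)
```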
You may be amused to learn that "How would you program a solver for a system of linear equations?" was an informal interview question for a top machine learning PhD program, and applicants were not looked upon favorably if they mindlessly gave gradient descent as an answer.
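Presumably a direct method such as Gaussian elimination is closer to what they had in mind. A minimal from-scratch sketch, just to illustrate the idea (not production code):

```python
import numpy as np

def gaussian_elimination(A, b):
    """Solve Ax = b for square, non-singular A via elimination with partial pivoting."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    # Forward elimination with partial pivoting
    for k in range(n - 1):
        pivot = k + np.argmax(np.abs(A[k:, k]))
        A[[k, pivot]] = A[[pivot, k]]
        b[[k, pivot]] = b[[pivot, k]]
        for i in range(k + 1, n):
            factor = A[i, k] / A[k, k]
            A[i, k:] -= factor * A[k, k:]
            b[i] -= factor * b[k]
    # Back substitution
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(gaussian_elimination(A, b))  # should match np.linalg.solve(A, b)
```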
As I mentioned in another comment, your solution derivation doesn't make sense for someone who hasn't already seen/done it before. And for them, the intro seems a bit too cutesy.
It seems to me that you're trying to condense a few weeks of intro linear algebra into a single blog post. In my experience that only works as a refresher, not for someone starting from zero.
Well, since everyone seems to be saying they're not a fan, I'll just add that I think it's great. Then again, I'm familiar with linear algebra, error functions, etc., so maybe the writer is writing for a specific type of reader with a specific background. In spite of being familiar with the material, I still quite enjoy reading how other people explain these things.
There are a couple of really good things in this intro that I wanted to point out.
1. I appreciate the rigour that went into the math notation. I need to see equations to start to make sense of something. And if you're a beginner, this at least provides an intuition for starting to grok these visuals.
2. It's visually very clear, and structured clearly. Sounds trivial, but this kind of visual organization helps a lot to break down complex topics.
3. Intro -> Math theory -> Python breakdown. The best way to teach math.
4. No memes. I feel like there's a huge temptation in articles for beginners to litter the post with moving gifs and memes which are distracting and annoying.
I had never actually seen "Data Science from Scratch" - sounds like a book I should read!
Fair point about gradient descent vs the normal equations closed-form solution. I am planning on working through a few algorithms, so I thought it would be better to introduce gradient descent with something simple before talking about gradient boosted decision trees and neural networks. Also, I would have to explain more complex matrix stuff like invertibility issues and linear dependence, like you said.
I guess I just dodged that bullet and went for gradient descent. Maybe another post for the linear algebra fans! Thanks for reading though!