You use gradient descent, but do not introduce the normal equations. This is problematic for at least two reasons.
Case 1: Design matrix has full rank
Omitting the normal equations obfuscates what is really going on. You are solving the first-order condition of a strictly convex objective, the normal equations X^T X w = X^T y, by inverting the Gram matrix X^T X, and strict convexity is exactly why that solution is the unique optimum.
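For concreteness, here is a minimal NumPy sketch of the full-rank case (the matrices X, y and the true weights are made up purely for illustration):

```python
import numpy as np

# Illustrative data: a full-column-rank design matrix and noisy targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# The Gram matrix X^T X is invertible, so the normal equations
# X^T X w = X^T y have exactly one solution: the unique optimum.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```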
Case 2: Design matrix does not have full rank
Omitting the normal equations hides the fact that there are multiple solutions to the first-order condition. Gradient descent will find one of them, but you need a principled way of selecting among them; the Moore-Penrose pseudo-inverse gives you the solution with the smallest L2 norm.
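A minimal sketch of the rank-deficient case, again with made-up data (the third column duplicates the first, so X^T X is singular and np.linalg.solve would fail):

```python
import numpy as np

# Illustrative rank-deficient design matrix: rank 2, but 3 columns.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))
X = np.column_stack([A[:, 0], A[:, 1], A[:, 0]])
y = X @ np.array([1.0, -2.0, 0.0]) + 0.1 * rng.normal(size=100)

# Infinitely many weight vectors minimize the squared error here.
# The Moore-Penrose pseudo-inverse picks the one with the smallest L2 norm.
w_min_norm = np.linalg.pinv(X) @ y
print(w_min_norm, np.linalg.norm(w_min_norm))
```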
Omitting these details is setting learners up for failure.
You may be amused to learn that "How would you program a solver for a system of linear equations?" was an informal interview question for a top machine learning PhD program, and applicants were not looked upon favorably if they mindlessly gave gradient descent as an answer.