> The simple reason is that that illustration shows how we regularize models conceptually, with hard constraints, not how we actually implement regularization, with soft constraints!
Note that these are equivalent. In particular, assuming the two problems are (for some loss "l" and regularizer "r")
minimize l(θ) + λr(θ),
and
minimize l(θ)
subject to r(θ) ≤ M,
for every λ there exists an M, and for every M there exists a λ, such that the resulting problems are equivalent, in the sense that a solution of one is a solution of the other. (This holds under some fairly general regularity conditions, but those are satisfied in all of the given examples.) I agree that this is often not stated in introductory texts, but the intuitive image is the same.
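If it helps to see this numerically, here's a rough sketch with ridge regression (random toy data, an arbitrary λ, and scipy's general-purpose solvers, so treat it as an illustration rather than a recipe): solve the penalized problem at some λ, set M to the r(θ) it returns, and the constrained problem recovers essentially the same θ.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

    def loss(theta):                     # l(θ): least-squares loss
        return np.sum((X @ theta - y) ** 2)

    def reg(theta):                      # r(θ): squared l2 norm (ridge)
        return np.sum(theta ** 2)

    lam = 5.0                            # arbitrary choice of λ

    # Problem (1): minimize l(θ) + λ r(θ)
    soft = minimize(lambda t: loss(t) + lam * reg(t), np.zeros(3))

    # Problem (2): minimize l(θ) subject to r(θ) ≤ M, with M read off from (1)
    M = reg(soft.x)
    hard = minimize(loss, np.zeros(3), method="SLSQP",
                    constraints=[{"type": "ineq", "fun": lambda t: M - reg(t)}])

    print(soft.x)
    print(hard.x)  # agrees with soft.x up to solver tolerance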
Indeed; Lagrange multipliers are your friend! My problem initially was just the disconnect between the picture and the soft-constraint Lagrange multiplier. I just couldn't figure out how that thresholded like that. :)
> for every λ there exists an M and for every M there exists a λ such that the resulting problems are equivalent
Sorry, this might be a tad off-topic. I have been thinking extensively about this topic over the last month.
As far as I can see, the above is actually not true. If the trade-off between l(θ) and r(θ) is concave (which can happen for a lot of problems), the first approach will not find all of the solutions you can find with the second approach. With the first approach you cannot reach points on the concave parts of the Pareto curve, while with the second approach you can walk the whole Pareto curve. When minimizing a linear combination of losses, you can only find the points at the ends of a concave stretch of the Pareto curve.
Is there a reason the above is not correct? I.e. do you have some more explanation or a source on the "fairly general regularity conditions" you mentioned? It would help me out a lot.
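To show what I mean, here's a made-up toy example (three hand-picked objective values, not tied to any real model), where a Pareto-optimal point is reachable with the constrained formulation but never with the penalized one:

    import numpy as np

    # Three candidate "solutions", given directly by their (l, r) values.
    # C is Pareto optimal (neither A nor B dominates it), but it sits on a
    # concave (non-convex) stretch of the frontier.
    points = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}

    # Approach 1: minimize l + λ·r.  C is never the minimizer, for any λ ≥ 0.
    for lam in np.linspace(0.0, 10.0, 101):
        best = min(points, key=lambda k: points[k][0] + lam * points[k][1])
        assert best != "C"

    # Approach 2: minimize l subject to r ≤ M.  With M = 0.6, C is the answer.
    M = 0.6
    feasible = {k: v for k, v in points.items() if v[1] <= M}
    print(min(feasible, key=lambda k: feasible[k][0]))  # -> C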
Well, I'm not sure what you mean by "concave part of a Pareto curve." (Perhaps you meant convex?) Usually, the Pareto curve is the epigraph of the individual function evaluations at the optimal points, so, unless your problem is highly nonconvex, the Pareto curve is likely to be convex.
More specifically, the tradeoff curve is defined as follows: letting θ(λ) be the optimal solution of the first optimization problem at a fixed λ, the Pareto optimal curve/frontier is the epigraph of the set of all (l(θ(λ)), r(θ(λ))). I.e.,
{(x, y) | x ≥ l(θ(λ)), y ≥ r(θ(λ)), λ ≥ 0}
If l and r are convex functions, then this curve is guaranteed to be convex. A similar curve can be defined for the second problem (over M instead of λ), and both curves are equal (in the sense that they both include the same points).
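For what it's worth, one can trace this curve numerically by sweeping λ and recording the pairs (l(θ(λ)), r(θ(λ))). A rough sketch (toy random data, an l1 regularizer, and a derivative-free solver because the l1 term is nonsmooth; all of these choices are just for illustration):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    X = rng.normal(size=(40, 2))
    y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=40)

    loss = lambda t: np.sum((X @ t - y) ** 2)   # l(θ)
    reg = lambda t: np.sum(np.abs(t))           # r(θ): l1 penalty

    # Sweep λ, solve problem (1), and record the resulting (l, r) pairs;
    # their epigraph is the trade-off curve described above.
    for lam in np.logspace(-2, 2, 15):
        theta = minimize(lambda t: loss(t) + lam * reg(t),
                         np.zeros(2), method="Nelder-Mead").x
        print(f"λ = {lam:7.2f}   l = {loss(theta):8.3f}   r = {reg(theta):6.3f}")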
Note that this equivalence between the two problems doesn't require convexity; for example, lower semi-continuity of l and r, with r bounded from below [0], is a sufficient condition that covers essentially all functions used in ML.
-----
[0] Boundedness is needed here since, otherwise, we could have r(θ) = -∞ for a bunch of points, in which case we would always choose such a θ in problem (1), no matter how large or small λ is (except at λ = 0), whereas it would have a different effect in problem (2).
The world certainly doesn't need yet another article on the mechanics of regularized linear models. What's lacking is a simple and intuitive explanation for what exactly is going on during regularization. The goal of this article is to explain how regularization behaves visually, dispelling some myths and answering important questions along the way.
haha. :) Yeah, I got stuck for SOOooo long trying to reconcile the standard picture from the ESL book with the math. Turns out they don't match! The picture is for a hard constraint, whereas the math penalty term just makes bigger coefficients more expensive. Man, that really threw me.