I spent three semesters in college learning RL only to be massively disappointed...

jmward01 · on May 6, 2024

I modeled part of my company's business problem as a MAB problem and saved my company 10% off their biggest cost and, just as important, showcased an automated truth signal that helped us understand what was, and wasn't, working in several of our features. Like all tools, finding the right place to use RL concepts is a big deal. I think one thing that is often missed in a classroom setting is pushing more real world examples of where powerful ideas can be used. Talking about optimal policies is great, but if you don't help people understand where those ideas can be applied then it is just a bunch of fun math. (which is often a good enough reason on its own :)

smokel · on May 6, 2024

For those not in the know, "MAB" is short for Multi-Armed Bandit [1], which is a decision-making framework that is often discussed in the broader context of reinforcement learning.

In my limited understanding, MAB problems are simpler than those tackled by Deep Reinforcement Learning (DRL), because typically there is no state involved in bandit problems. However, I have no idea about their scale in practical applications, and would love to know more about said business problem.

[1] https://en.wikipedia.org/wiki/Multi-armed_bandit

jmward01 · on May 6, 2024

There are often times when you have n possible providers of service y, each with strengths and weaknesses. If you have some ultimate truth signal (like follow on costs which are linked to quality, which was what I used) then you can model the providers as bandits and use something like UCB1 to choose which to use. If you then apply this to every individual customer what you end up doing is learning the optimal vendor for each customer which gives you a higher efficiency than had you picked just one 'best all around' vendor for all customers. So the pattern here is: If you have n_service_providers and n_customers and a value signal to optimize then maybe MAB is the place to go for some possible quick gains. Of course if you have a huge state space to explore instead of just n_service_providers, for instance you want to model combinations of choices, using something like a NN to learn the state space value function is also a great way to go.

alessiodm · on May 6, 2024

RL can be massively disappointing, indeed. And I agree with you (and with the amazing post I already referenced [1]) that it is hard to get it to work at all. Sorry to hear you have been disappointed so much!

Nonetheless, I would personally recommend even just learning the basics and fundamentals of RL. Beyond supervised, unsupervised, and the most-recent and well-deservedly hyped semi-supervised learning (generative AI, LLMs, and so on), reinforcement learning indeed models the learning problem in a very elegant way: an agent interacting with an environment and getting feedback. Which is, arguably, a very intuitive and natural way of modeling it. You could consider backward error correction / propagation as an implicit reward signal, but that would be a very limited view.

On a positive note, RL has very practical sucessful applications today - even if in niche fields. For example, LLM fine-tuning techniques like RLHF successfully apply RL to modern AI systems, companies like Covariant are working on large robotics models which definitely use RL, and generally as a research field I believe (but I may be proven wrong!) there is so much more to explore. For example, check Nvidia Eureka that combines LLM to RL [2]: pretty cool stuff IMHO!

Far from attempting to convince you on the strength and capabilities of DRL, just recommending folks to not discard it right away and at least give it a chance to learn the basics, even just for an intellectual exercise :) Thanks again!

[1] https://www.alexirpan.com/2018/02/14/rl-hard.html

[2] https://blogs.nvidia.com/blog/eureka-robotics-research/

vineyardlabs · on May 6, 2024

RL seems to be in this weird middle ground right now where nobody knows how to make it work all that well but almost everybody at the top levels of ML research agrees it's a vital component of further advances in AI.