Reinforcement Learning – Bandit Problems (oneraynyday.github.io)
218 points by oneraynyday on May 7, 2018 | 40 comments



I was heavily into reinforcement learning around the turn of the century, and at the time, "Reinforcement Learning - An introduction" (Barto and Sutton) https://mitpress.mit.edu/books/reinforcement-learning was an absolute goldmine for me getting started. I think parts of it are online somewhere including all their pseudocode and solutions.



The complete first edition can be found here: http://incompleteideas.net/book/ebook/the-book.html

If you're interested in some well documented C++ implementations of the algorithms shown in the book, feel free to check out https://github.com/Svalorzen/AI-Toolbox. I started the project because when I was first reading the book I had no reference implementation to compare the book to, and personally I learn better with practical examples, so maybe it can help you too.


If you are going to start in RL, you should really consider reading the second edition even though it is not released yet. I am guessing that Sutton is getting closer to the finishing line as there have been numerous revisions already. The second edition has better notation and benefits from the field having matured a lot since the first book was written. http://incompleteideas.net/book/the-book-2nd.html


It's a fantastic book! The authors have been working on a second edition for a few years, and I think it's finally finished. A draft is generously available here: http://incompleteideas.net/book/the-book-2nd.html


As a self contained, foundational course, Georgia Tech's OMSCS offering [1] is solid. Charles Isbell and Michael Littman are great at building intuition into equations.

[1] https://www.udacity.com/course/reinforcement-learning--ud600


Isbell's course in person was great. And if the exams for the online version are anything like the in person ones, it really does test your understanding of foundational concepts.


Yup, just took the online RL class and the average grade for the final exam was 45 out of 100, high score of 76. The format was true/false with a short explanation for your answer. I never thought I'd be proud about getting a 53 on a true/false exam, but it was an extremely challenging and rewarding class.


I was wondering how you were over 120 years old there, for a moment, then I realized turn of the century doesn't mean that century any more.


Barto & Sutton is excellent.

You can definitely find it online but be sure to find the right version - the latest version has great illustrations and is a lot clearer.

Also, check out the RL jupyter notebook here by my friend Ryan Sweke who does work on RL for quantum computing: https://github.com/R-Sweke/CrashCourseInNeuralNetworksWithKe...


Great suggestion! The blog was based on a large portion of the book. A friend of mine asked for a version of the first chapter that would be digestible for an audience at the high-school to undergraduate level. I wrote this blog with that in mind, while adding my own observations as well. I am planning to write up some Python solutions for the MDP chapter as well. Thanks for reading :)


Thanks for this, I have read a couple books on deep learning but struggled to find anything on Reinforcement Learning. Maybe an Ask HN is in order.


I'm surprised there was no octopus to represent the bandits.

In all seriousness, this post makes sense to me, as someone who does RL research. However, the intuition behind the concepts could be communicated more clearly, and I suspect the piece is less accessible to those with much less knowledge of RL/bandits. Given that it's an introduction, I presume that's your intended reader, though perhaps the writing is also for your own edification. Who's your audience?


Thanks for the reply! My audience is someone early in their undergrad degree, doing math or computer science. Which parts do you think are not communicated clearly? I can make an edit later today :)


Oh cool, I've also done a presentation on bandit methods for early undergrads (see: https://gtagency.github.io/2016/experimentation-with-no-ragr..., it's missing speaker notes so it looks a bit strange, but it outlines the structure fairly well). It was also sort of an intro to the beta distribution and why you should love it, hence the focus on Thompson Sampling.

Some critiques:

- I feel like your justification/explanation for why this is useful is a bit lacking. Personally I find framing it in terms of regret-minimization better than gain-maximization, even though in practice they're the same. It frames the situation so that you go in knowing you will have to pick non-optimal things sometimes, so your job is to learn the underlying distributions as quickly as possible, instead of trying to pick the best things. Interestingly, I think thinking of it as gain-maximization leads you down an epsilon-greedy path, whereas regret-minimization leads you more naturally toward UCB1/Thompson Sampling. Since you pivot to RL instead of just bandits, I can kind of understand it, but see my last point.

- As a general rule, I try to minimize math in undergrad-focused talks/documents. Even as someone who spends a lot of time explaining statistical concepts to people, my eyes glaze over when I see `q*(a) = E[Rt|At = a]`. Obviously you need some and this is just a personal thing. For the most part I actually think you do a decent job of explaining the equations you use. At least until the gradient bandit part :P Then it just feels like a textbook proof excerpt.

- Nit: You don't fully explain that epsilon-greedy is greedy, except epsilon of the time. That caught me up for a second.

- The last thing is that I feel like the motivation and difference between stationary and nonstationary reward distributions isn't well explained. Nonstationary rewards don't really "fit" the mental model behind k-armed bandits a lot of the time. I'm actually curious for a better motivation there, as I can't articulate one myself.


Hey Joshua, thanks so much for the criticism. I do see your point towards epsilon greedy vs. regret minimization. I will add a section about that before presenting UCB1. I also will add more information in the epsilon-greedy strategy section, elaborating on what the epsilon is really for. I'm not 100% sure how to reframe the nonstationary reward situation, because I feel like that adds state dependent on t to the bandit scenario, which then feels more like MDP.
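
For reference, here's a minimal sketch of the "greedy, except epsilon of the time" idea in Python; the `pull` callback and the arm means in the usage line are made up for illustration:

    import random

    def epsilon_greedy(pull, k, steps=1000, epsilon=0.1):
        # Greedy action selection, except with probability epsilon pick a random arm.
        # `pull(a)` is assumed to return a sampled reward for arm `a`.
        counts = [0] * k          # times each arm has been pulled
        estimates = [0.0] * k     # sample-average reward estimate per arm
        total = 0.0
        for _ in range(steps):
            if random.random() < epsilon:
                a = random.randrange(k)                        # explore
            else:
                a = max(range(k), key=lambda i: estimates[i])  # exploit
            r = pull(a)
            counts[a] += 1
            estimates[a] += (r - estimates[a]) / counts[a]     # incremental sample mean
            total += r
        return estimates, total

For example, `epsilon_greedy(lambda a: random.gauss([0.1, 0.5, 0.9][a], 1.0), k=3)` should settle on arm 2 most of the time while still spending roughly epsilon of its pulls exploring.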


>because I feel like that adds state dependent on t to the bandit scenario, which then feels more like MDP.

Yeah this was sort of exactly the issue I was running into. I can't justify it to myself without essentially saying "this is just an MDP in disguise", which maybe is the right way to do it. I'm pretty sure you can define a k-armed bandit as an MDP on a single state, where each action corresponds to a machine, and all actions return you to the single state.

So maybe that is the right motivation. But reversing that "an MDP is just a k-armed bandit problem where sometimes playing a machine breaks it and forces you to play other machines, which can impact how quickly the casino fixes your first machine..." feels forced.

All that said, its a good article :)


One thing that confused me is that you introduce the "preference" for an action, but don't really define it. Some exposition about what that means would have helped.


Ah yes, I didn't want the reader to fuss too much about the idea of a preference, since it's really just an analog of un-normalized pi (modulo the exponentiation part). It doesn't have a strict definition, nor is it a formal term. I will un-bold it and italicize it instead. Thank you :)


It seems like any method for solving this problem could be interpreted in a Bayesian way: At any time, you consider all the different possible distributions each arm could have, and assign each a probability which is how likely you think that distribution is to occur.

The probabilities are initialized to some value (the "prior"), then when you pull the arm, you get some new information, which you use to update the probabilities based on evidence.

It would be interesting to try to see if you could analytically solve this problem for a simple family of distributions. For example, assume each lever produces Gaussian results, but has an unknown mean and SD. Set the prior to be that the means are normally distributed with mean 0 and SD 1, and the SD's are exponentially distributed with mean 1.


Your intuition is spot on. With this line of argument you will end up with Thompson Sampling [0]. TS is an old algorithm from the 1930s. It is disconcertingly effective, even when the underlying Bayesian assumptions are not correct. It is also extremely hard to analyze. This led to a weird state of affairs - TS gave excellent empirical performance but we could not make any theoretical statements about its efficacy. It is only very recently that this has changed.

[0] https://en.wikipedia.org/wiki/Thompson_sampling
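
For the curious, the simplest conjugate version of this (Bernoulli rewards with Beta priors, rather than the Gaussian-with-unknown-SD setup sketched above) fits in a few lines; the `pull` callback here is a made-up interface:

    import random

    def thompson_sampling(pull, k, steps=1000):
        # Beta-Bernoulli Thompson Sampling: keep a Beta(wins + 1, losses + 1)
        # posterior per arm, draw one sample from each, and play the argmax.
        # `pull(a)` is assumed to return a 0/1 reward for arm `a`.
        wins = [0] * k
        losses = [0] * k
        for _ in range(steps):
            draws = [random.betavariate(wins[a] + 1, losses[a] + 1) for a in range(k)]
            a = max(range(k), key=lambda i: draws[i])
            if pull(a):
                wins[a] += 1
            else:
                losses[a] += 1
        return wins, losses

Arms whose posteriors concentrate on low means quickly stop winning the argmax, which is where the exploration/exploitation balance comes from.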


I think that's how you derive UCB, but optimizing cumulative regret rather than finding the probability distribution directly. Pls correct me if I'm wrong


As someone who is doing his bachelor thesis on Reinforcement Learning this is some useful information.

OT (but not really) question: does anyone here use Reinforcement Learning techniques at work? For the thesis I am working on black-box optimization of 2 variable functions with Reinforcement Learning (and comparing it with Bayesian Optimization techniques).

As someone else suggested, the Sutton & Barto book is a really great resource, but I would also like to suggest these lectures (https://www.youtube.com/watch?v=2pWv7GOvuf0) by David Silver (who worked on AlphaGo).


We used it to discover the best picture of a product in terms of click-through rate. We also found that end users really don't want you to change the order of the pictures of their products.


I regularly use reinforcement methods for simple robotics problems. TensorFlow, et al. have made their use practical.


At the beginning of this post, the given definition of value doesn't seem correct to me: I think it should be the expected value of the sum of all the rewards from t to infinity, not of R_t, but I could be wrong. Citing D. Silver: "The agent's job is to maximise cumulative rewards" (emphasis on cumulative). I would suggest David Silver's course for reinforcement learning, http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html. Another interesting post is https://jeremykun.com/tag/bandit-learning/ and the 5 related posts that follow, which compare implementations of the algorithms in Python.

I know I should not discourage people, but you should read the post "Deep Reinforcement Learning Doesn't Work Yet" before drinking all the Kool-Aid: https://www.alexirpan.com/2018/02/14/rl-hard.html

Edited many times.


You are right, it should be the expected value for the sum of all future rewards.

At the same time, given that we're talking about bandits, it doesn't really matter, since there is no state. Thus, the sum over time you'd like to see doesn't change the relative ordering of the actions: the expected value in your definition is simply the expected reward for the action multiplied by the number of times you expect to play. So it doesn't really change anything.
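
To spell that out (a quick sanity check, in the thread's $q_*$ notation): if you commit to arm $a$ for all $T$ pulls, then

    E[ \sum_{t=1}^{T} R_t ] = T * q_*(a)

so the $\arg\max_a$ under the cumulative definition coincides with the $\arg\max_a$ of $q_*(a) = E[R_t | A_t = a]$.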


You are right. Anyway, I think that people playing with a bandit machine are going to keep playing longer if they are winning a lot of money than if they are losing money, so when people are involved in games there is a hidden state: the mental state of the player. But if you decide the number of steps up front and you don't change your strategy depending on your mood, then this formal algorithm works as stated.


Contextual bandits, on the other hand, allow you to put your mood as a context (features) and your strategy depends on the features. You still have that simple expectation maximization (instead of a brutally hard to optimize loss), yet much more flexibility.


Yes, it is as you stated. Because bandits are stateless, there is no state parameter in $q_*(a)$. From where I learned it, this could arguably be an abuse of notation to use $q_*$ in the same context. My newer entry (which is currently WIP) uses $q_*(s,a)$ and the cumulative sum of future rewards (with discount). Thanks for the reply guys :)


This is interesting. It reminds me of the Rescorla-Wagner model (a model of classical conditioning that's probably mentioned in every psych/neuro learning and memory course textbook), which describes the trial-by-trial change in associative strength between a conditioned stimulus (CS) and unconditioned stimulus (US).

Though it has its shortcomings, R-W is quite elegant in its simplicity, while doing a pretty good job modeling some fairly complex behavioral/cognitive changes (and it made an interesting prediction about 'blocking' that turned out to be true).

For more on this...

http://campus.albion.edu/wjwilson/files/2012/03/RWSimplified...

http://www.scholarpedia.org/article/Rescorla-Wagner_model
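
The structural parallel is that both use the same "new estimate = old estimate + step size x prediction error" update. A rough sketch of the R-W rule, simplified to a single learning rate instead of per-stimulus saliences (the trial format here is made up):

    def rescorla_wagner(trials, alpha=0.3, beta=1.0, lam=1.0):
        # Trial-by-trial Rescorla-Wagner update. `trials` is a list of
        # (stimuli_present, us_present) pairs; lam is the asymptote when the US occurs.
        V = {}                                             # associative strength per CS
        history = []
        for stimuli, us_present in trials:
            v_total = sum(V.get(s, 0.0) for s in stimuli)  # summed prediction from all CSs
            error = (lam if us_present else 0.0) - v_total
            for s in stimuli:
                V[s] = V.get(s, 0.0) + alpha * beta * error  # shared error term -> blocking
            history.append(dict(V))
        return history

For example, `rescorla_wagner([({"A"}, True)] * 20 + [({"A", "B"}, True)] * 20)` shows B acquiring almost no strength on the compound trials, which is the blocking effect mentioned above.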


RL can be seen as a generalization of the Rescorla-Wagner model!

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4633133/


Nice writeup! Why do we love bandit problems? Not only for the efficacy with which agents can minimize exploration and maximize rewards in unknown worlds, but also for the emergent philosophical truths that arise concerning causality!

It remains very much an active research topic, with applications ranging from epidemiology to website optimization ;)

CS7792 - Counterfactual Machine Learning, T. Joachims, Cornell University

http://www.cs.cornell.edu/courses/cs7792/2016fa/

Deep Bayesian Bandits Showdown - Google Brain

https://arxiv.org/pdf/1802.09127.pdf


Anyone aware of research on bandits with a growing set of arms? Meaning, every so often when you play an arm, part of your reward is revealing a new arm!

This seems to not fit the criteria anymore (not tabular, not Markov). Is it related to structured bandits?


Hmm, perhaps something like UCB1? The UCB (upper-confidence-bound) family of algorithms might be appropriate—it's not quite what you've described though.

They'll explore the arm with the highest potential payoff i.e. the highest upper-confidence-bound, which often is the arm you know the least about since they're roughly calculated as (average + confidence interval) and early on the confidence bound is large. This style of algorithm means you can add arms as you go through the experiment and they'll be explored/exploited in a reasonable way.

What you're describing sounds more like you're exploring a frontier and "discovering" new options along the way..?
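
To make the "(average + confidence bonus)" point above concrete, here's a minimal UCB1 sketch where the arm list can grow mid-run; the `pull` callback and arm ids are hypothetical:

    import math

    def ucb1(pull, arms, steps=1000):
        # UCB1: play the arm with the highest (sample mean + confidence bonus).
        # `arms` is a mutable list of arm ids; appending a new id mid-run is fine,
        # because an unplayed arm has an infinite bonus and gets tried immediately.
        counts, means = {}, {}
        for t in range(1, steps + 1):
            def score(a):
                if counts.get(a, 0) == 0:
                    return float("inf")                   # untried arms are explored first
                return means[a] + math.sqrt(2 * math.log(t) / counts[a])
            a = max(arms, key=score)
            r = pull(a)                                   # assumed reward in [0, 1]
            counts[a] = counts.get(a, 0) + 1
            means.setdefault(a, 0.0)
            means[a] += (r - means[a]) / counts[a]        # incremental sample mean
        return means

Appending to `arms` between pulls is exactly the sense in which "you can add arms as you go through the experiment."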


Consider any arm not yet pulled to be "a new arm". A growing set of arms is similar to the case where k >> T. This creates an innovate vs. exploit problem: the learner must choose between exploiting what it already knows and searching for better rewards.

Variations of this scheme include exploit vs. copy vs. innovate, used in computational biology. The learning agent can copy what others do, or innovate and try something new.


Adding new arms in a bandit problem doesn't pose a problem for most bandit algorithms. Any of the common algorithms will handle it just fine. Arms disappearing is more interesting, as that affects the explore/exploit tradeoff. It's been a while since I was studying bandit algorithms, but "Mortal multi-armed bandits" is one paper that addresses this.


If the additional arms appear late, my intuition is to use one algorithm for the old arms and one for the new arms, with a weighting factor to decide which model to use. This procedure can be generalized to generational partitions. For example, you group the old arms into group A and the new arms into group B, use a bandit with two arms to decide which group to play, and then inside each group apply your favorite bandit algorithm. That is not quite a divide-and-conquer algorithm, but it has something in common with one. This is a hierarchical model.
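
A rough sketch of that two-level idea, using epsilon-greedy at both levels as a stand-in for "your favorite bandit algorithm" (all names and the `pull` callback are made up):

    import random

    class EpsilonGreedy:
        # Stand-in inner bandit; any algorithm with select()/update() would do.
        def __init__(self, n_arms, epsilon=0.1):
            self.counts = [0] * n_arms
            self.means = [0.0] * n_arms
            self.epsilon = epsilon

        def select(self):
            if random.random() < self.epsilon:
                return random.randrange(len(self.counts))
            return max(range(len(self.counts)), key=lambda i: self.means[i])

        def update(self, arm, reward):
            self.counts[arm] += 1
            self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def hierarchical_step(group_bandit, old_group, new_group, pull):
        # Top-level two-armed bandit picks a group (0 = old arms, 1 = new arms),
        # then that group's own bandit picks an arm within the group.
        g = group_bandit.select()
        inner = old_group if g == 0 else new_group
        arm = inner.select()
        reward = pull(g, arm)
        inner.update(arm, reward)
        group_bandit.update(g, reward)   # the group choice is credited with the same reward
        return g, arm, reward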


I don't work in this area, but I googled for HMM + Indian buffet process and found this page. I'd guess some of the nonparametric Bayesian models address the problem you mentioned.

http://research.cs.rutgers.edu/~cmansley/fall08/cs500.html


What's the difference in problem statement between k-bandits and the expert's problem?


nice blog but you should include citations/references even if it is very informal



