DFS, poker, basketball, Starcraft and strong opinions, weakly held (davidtran.me)
69 points by dtran on Dec 25, 2019 | 12 comments


The principle the author is describing is very similar to the Optimism in the Face of Uncertainty (OFU) principle in online learning, which was proposed by Auer et al. in the context of the multi-armed bandit problem. The basic idea is that when you are sequentially making decisions in the presence of uncertainty, you should make decisions which interpret the data you have in the most optimistic light. The intuition here is that if your optimism turns out to be correct, you can capture most of the upside, but if your optimism is misguided, you can quickly gather information and reassess.
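For illustration, here's a minimal Python sketch of UCB1, the classic instance of this principle (the Bernoulli arm probabilities are made up, and the confidence bonus follows Auer et al.'s original form):

    import math
    import random

    def ucb1(arm_probs, horizon=10000):
        """Minimal UCB1: try every arm once, then always pull the arm
        with the highest optimistic estimate (mean + confidence bonus)."""
        n_arms = len(arm_probs)
        counts = [0] * n_arms   # pulls per arm
        sums = [0.0] * n_arms   # total reward per arm
        for t in range(1, horizon + 1):
            if t <= n_arms:
                arm = t - 1  # initialization: pull each arm once
            else:
                # optimism: rank arms by empirical mean + exploration bonus
                arm = max(range(n_arms),
                          key=lambda a: sums[a] / counts[a]
                          + math.sqrt(2 * math.log(t) / counts[a]))
            reward = 1.0 if random.random() < arm_probs[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
        return counts

    # Hypothetical arms: the learner doesn't know the last one is best.
    print(ucb1([0.3, 0.5, 0.6]))  # pulls should concentrate on arm 2

The bonus shrinks as an arm gets pulled more, so misplaced optimism corrects itself quickly: that's the "gather information and reassess" part.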


OP here. Thanks for making this connection! I hadn't heard of OFU before, but it sounds interesting. I'm actually working on another post about "optimizing around the optimal" and use multi-armed bandit as an example =). As Jeff Atwood mentions in his post on SOWH, writing about this has been a great way for me to think about the topic. If it turns out that smarter/more experienced people have already thought about/written about this subject at length, I'm not at all disappointed that I'm retracing their steps. In fact, it means that I independently found the right path!

I see this Coursera course on OFU: https://www.coursera.org/lecture/practical-rl/optimism-in-fa... and what looks like Auer's original UCRL paper http://papers.nips.cc/paper/3052-logarithmic-online-regret-b... Do you have a preferred source for learning about this? Thanks and Merry Christmas!


Another one you might like is the Amazonism 'disagree and commit': https://www.amazon.com/p/feature/z6o9g6sysxur57t

I ( https://www.gwern.net/Timing#try-try-again-but-less-less ) see it as a kind of bandit as well, but specifically Thompson sampling: you can interpret a group of people with strong but individually inconsistent beliefs as collectively implementing a Bayesian distribution, and then each individual going off and following what seems to them like the most profitable opportunity is equivalent to Thompson sampling: https://people.csail.mit.edu/pkrafft/papers/krafft-thesis-fi...
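A rough Beta-Bernoulli sketch of that reading (arm probabilities are again made up; one posterior draw per round stands in for one individual acting on their own sampled belief):

    import random

    def thompson_bernoulli(arm_probs, horizon=10000):
        """Thompson sampling: draw a plausible payout for each arm from
        its posterior, then pull whichever arm that draw says is best."""
        n_arms = len(arm_probs)
        alpha = [1] * n_arms  # Beta(1, 1) priors: successes + 1 ...
        beta = [1] * n_arms   # ... and failures + 1, per arm
        for _ in range(horizon):
            draws = [random.betavariate(alpha[a], beta[a])
                     for a in range(n_arms)]
            arm = max(range(n_arms), key=lambda a: draws[a])
            if random.random() < arm_probs[arm]:  # observe a reward
                alpha[arm] += 1
            else:
                beta[arm] += 1
        return alpha, beta

    print(thompson_bernoulli([0.3, 0.5, 0.6]))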


I like these notes on multi-armed bandits (https://arxiv.org/pdf/1904.07272.pdf); I think they're pretty accessible. Section 1.3.3 discusses the OFU principle. Happy holidays to you as well!


Initially this seems asymmetric for no reason: why would you interpret the data in the most optimistic and not the most pessimistic light?

One reason for this asymmetry comes to mind after a few minutes. There are two sources of error: known unknowns and unknown unknowns. Known unknowns can be handled with statistics and abstracted away when comparing different plans.

Unknown unknowns, on the other hand, by definition cannot be accounted for in advance. The worst case is always that an unknown factor will make you lose everything no matter what; the best case, on the other hand, is limited by the plans you're making. So it makes sense to go with the plan with the best payout in case there are no unknown unknowns.


> So it makes sense to go with the plan with the best payout in case there are no unknown unknowns.

Yes, this intuition is in the right direction. A key point about online learning is that we usually measure the performance of online algorithms by static regret: the difference between the reward/cost incurred by the online learner and the reward/cost of the best fixed action (for example, the payout of the best single arm in the multi-armed bandit setting).

The goal is usually to design algorithms whose static regret is sublinear in the number of rounds played. The intuition is that if the learner's regret is sublinear, then the average reward/cost incurred by the online learner across rounds converges to the average reward/cost of the best fixed action, so in that sense the online learner's choices converge to the optimal choices.

Any online learner which incurs an additive constant more cost per round (on average) relative to the best fixed action cannot have sublinear regret; hence it is important that the learner aggressively exploit any upside it finds. Put another way, if the learner consistently misses the mark each round, for example by being too conservative, it will not achieve sublinear regret.
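For concreteness, the standard definition (my notation, with mu_a the mean reward of arm a and a_t the arm pulled at round t):

    % Static regret after T rounds, against the best fixed arm:
    R(T) = T \mu^* - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t}\Big],
    \qquad \mu^* = \max_a \mu_a
    % Sublinear regret means R(T)/T \to 0: the learner's average reward
    % converges to the best arm's. For reference, Auer et al.'s UCB1
    % achieves R(T) = O(log T) in the stochastic setting.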


I think there's a really important caveat to OFU here.

Namely, we have your higher-level interpretation in the GGP ("great-grandparent post"):

> The intuition here is that if your optimism turns out to be correct, you can capture most of the upside, but if your optimism is misguided, you can quickly gather information and reassess.

A _really_ important characteristic that lets UCB work is that its environment is stochastic, not adversarial. If you follow OFU in an adversarial environment, you will get penalized for it and won't achieve sublinear regret. So I think there's an important taxonomy here, between an indifferent ("stochastic") and an adversarial nature. These principles get stretched a little bit when you further compare oblivious to adaptive adversaries (or, in the stochastic case, stationary vs. non-stationary environments).

My point here is that the exact contours of the exploration/exploitation tradeoff are delicately dependent on the problem setting, so it's a bit dangerous to make generalized conclusions about OFU as a principle without also acknowledging when it's relevant.
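To make the contrast concrete, here's a rough sketch of EXP3, the standard adversarial-bandit algorithm (the reward_fn callback and the fixed exploration rate gamma are my simplifications; rewards are assumed to be in [0, 1]):

    import math
    import random

    def exp3(reward_fn, n_arms, horizon, gamma=0.1):
        """EXP3: exponential weights over arms with importance-weighted
        reward estimates; no stochastic assumption about the rewards."""
        weights = [1.0] * n_arms
        for t in range(horizon):
            total = sum(weights)
            # mix in uniform exploration so no arm's probability hits zero
            probs = [(1 - gamma) * w / total + gamma / n_arms
                     for w in weights]
            arm = random.choices(range(n_arms), weights=probs)[0]
            reward = reward_fn(t, arm)  # adversary may choose per round
            # dividing by the pull probability keeps the estimate unbiased
            estimate = reward / probs[arm]
            weights[arm] *= math.exp(gamma * estimate / n_arms)
        return weights

Note that EXP3 never commits the way an optimistic learner does: it keeps randomizing over all arms forever, precisely because an adaptive adversary could bait any deterministic "optimism".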

And from the GP:

> Known unknowns can be handled with statistics... so go with the plan with the best payout in case there are no unknown unknowns.

I don't know about that. It's statistics all the way down.


Yes, good catch! As this comment points out, OFU only makes sense when talking about stochastic environments (the OFU algorithm is described in terms of confidence intervals). The adversarial setting is quite different.


As somebody who was top ~2% in Starcraft, I was heavily triggered until this part:

>The key insight from Starcraft is that you need to be constantly testing and adjusting your hypothesis by scouting your opponent, all while continually executing whatever the best strategy you know of at any given point in time.

This is actually a good description of part of what is to be learned from Starcraft. However, the degree to which you can commit to your idea without accurate information is greater than most people suspect, and a lot of work goes into making the unwanted adjustments seem like a non-issue.


I started this post with the Starcraft campaign example, and then as I kept writing, was thinking that I might have to backtrack and get rid of SC completely.

To be fair, the campaign is very contrived, but in most 1v1s, I feel like strong opinions were the way to go, with the "unwanted adjustments" you talked about, like waiting a few seconds longer to take your natural after your opponent's scout leaves. It's also why, after many thousands of games, I sort of lost interest in Starcraft. Based on maps and matchups, and setting aside cheese, most games played out with both players just executing their predetermined builds until one player made a mistake (the game breaks down to balancing economy, army size, tech, and army positioning, so a mistake is a miscalculation along one of those axes). I did enjoy some of the ICCup seasons with new maps, before people figured out the optimal builds, when it wasn't as repetitive. But mostly I stopped playing because of all the long, drawn-out PvTs :)


Hi David, thanks for the article. Maybe poker, basketball, and Starcraft are highly correlated hobbies among software engineers in the Bay Area, but I find it incredible that we share such similar interests. Starcraft is not as much my thing anymore, despite my HN username (though I still play maybe once a month or so).


Thanks for reading! Whoa, do you still play BW once a month, or is that SC2? I never started playing SC2, so I actually had to look up Artosis pylon. I def remember Manner pylon though :)

And hmm, I actually didn't think I'd find ANY readers at the intersection of startups, CS, SC, poker and basketball. Assuming most people on HN are interested in startups and computer science, I'd imagine that the limiting area of intersection would probably be SC ∩ bball (at least more so than SC ∩ poker or poker ∩ bball), so it's pretty awesome that we share all of those!



