Explore-then-commit algorithm

Explore-then-commit (ETC) is an algorithm for the multi-armed bandit problem that handles the trade-off between exploration and exploitation by splitting the game into an exploration phase followed by a commitment phase.

Multi-armed bandit problem

The multi-armed bandit problem is a sequential game in which a player chooses at each turn between $K$ actions (arms). Behind every arm $a$ is an unknown distribution $\nu_a$ that lies in a set $\mathcal{D}$ known to the player (for example, $\mathcal{D}$ can be the set of Gaussian distributions or the set of Bernoulli distributions).

At each turn $t$, the player chooses (pulls) an arm $a_t$ and then receives an observation $X_t$ drawn from the distribution $\nu_{a_t}$.

Regret minimization

The goal is to minimize the regret at time $T$, which is defined as

$$R_T := \sum_{a=1}^{K} \Delta_a \, \mathbb{E}[N_a(T)]$$

where

$\mu_a := \mathbb{E}[\nu_a]$ is the mean of arm $a$
$\mu^* := \max_a \mu_a$ is the highest mean
$\Delta_a := \mu^* - \mu_a$ is the suboptimality gap of arm $a$
$N_a(t)$ is the number of pulls of arm $a$ up to turn $t$
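
As an illustration of the definition above, the following minimal sketch evaluates the regret for a made-up instance. The function name `regret`, the arm means, and the expected pull counts are all hypothetical and chosen only for the example; in a real game the means are unknown to the player.

```python
def regret(means, expected_pulls):
    """Regret R_T = sum over arms of Delta_a * E[N_a(T)], for known arm means."""
    best = max(means)  # mu^*, the highest mean
    return sum((best - mu) * n for mu, n in zip(means, expected_pulls))


# Hypothetical three-armed instance; the first arm is optimal.
means = [0.9, 0.8, 0.5]
# Hypothetical expected pull counts E[N_a(T)] after T = 100 turns.
expected_pulls = [70, 20, 10]

print(regret(means, expected_pulls))
```

Pulls of the optimal arm contribute nothing, since its gap is zero; only the 30 pulls of suboptimal arms add to the regret.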

The player has to find an algorithm that chooses at each turn $t$ which arm to pull, based on the previous actions and observations $(a_s, X_s)_{s<t}$, in order to minimize the regret $R_T$.

This is a trade-off problem between exploration (finding the arm with the highest mean) and exploitation (playing the arm which is perceived to be the best as much as possible).

Algorithm

Figure: Two runs of ETC with the same $M = 10$. In the first run the algorithm identifies the best arm after exploration; in the second run it does not.

The algorithm explores each arm $M$ times. For the rest of the game it exploits what it has learned by playing the arm with the highest empirical mean. If the horizon $T$ is known, the number of explorations $M$ can depend on $T$.

Adaptations of the algorithm exist and can be found in the literature for other settings.

Pseudocode
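
The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `explore_then_commit`, the convention that each arm is a callable taking a random generator, the round-robin exploration order, and breaking ties by lowest index are all choices made for this example.

```python
import random


def explore_then_commit(arms, M, T, rng=None):
    """Pull each arm M times, then commit to the best empirical arm.

    arms: list of callables; arms[a](rng) returns one reward sample
          (an assumed interface for this sketch).
    Returns (committed_arm, pulls), where pulls[a] is N_a(T).
    """
    K = len(arms)
    if M < 1 or M * K > T:
        raise ValueError("need M >= 1 and M * K <= T")
    if rng is None:
        rng = random.Random()
    sums = [0.0] * K
    pulls = [0] * K
    # Exploration phase: round-robin, M pulls per arm.
    for t in range(M * K):
        a = t % K
        sums[a] += arms[a](rng)
        pulls[a] += 1
    # Commit to the arm with the highest empirical mean (ties: lowest index).
    best = max(range(K), key=lambda a: sums[a] / M)
    # Exploitation phase: play the committed arm for the remaining turns.
    for _ in range(T - M * K):
        arms[best](rng)  # reward is observed but no longer affects decisions
        pulls[best] += 1
    return best, pulls


def gaussian_arm(mu):
    """Hypothetical helper: an arm with unit-variance Gaussian rewards."""
    return lambda rng: rng.gauss(mu, 1.0)


arms = [gaussian_arm(0.0), gaussian_arm(0.5)]
best, pulls = explore_then_commit(arms, M=10, T=1000, rng=random.Random(0))
print(best, pulls)
```

Whatever the outcome of the exploration phase, the arm that is not selected ends with exactly `M` pulls, and the committed arm receives all remaining turns.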

Theoretical results

Figure: Trade-off between exploration (large $M$) and exploitation (small $M$) for ETC.

When all arms are $1$-subgaussian and each arm is explored $M$ times, the regret at time $T$ satisfies

$$R_T \leq M \sum_{i=1}^{K} \Delta_i + (T - MK) \sum_{i=1}^{K} \Delta_i \exp\left(-\frac{M \Delta_i^2}{4}\right)$$

The first term is the cost of exploration:

$$M \sum_{i=1}^{K} \Delta_i$$

The second term is the cost of not having explored enough, in which case there is some probability that the arm with the highest empirical mean is not an optimal arm:

$$(T - MK) \sum_{i=1}^{K} \Delta_i \exp\left(-\frac{M \Delta_i^2}{4}\right)$$

Increasing $M$ increases the first term while decreasing the second term. The best possible $M$ must depend on the gaps $(\Delta_i)_i$, which are unknown to the player.
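
To make this trade-off concrete, the sketch below evaluates the upper bound above for every admissible $M$ on a made-up instance and picks the minimizer by brute force. The function name `etc_regret_bound` and the gap values are hypothetical, and the search uses the true gaps, which a player would not have; it only illustrates how the two terms pull in opposite directions.

```python
import math


def etc_regret_bound(M, T, gaps):
    """Upper bound M * sum(gaps) + (T - M*K) * sum(gap * exp(-M * gap^2 / 4))."""
    K = len(gaps)
    exploration = M * sum(gaps)
    commitment = (T - M * K) * sum(d * math.exp(-M * d * d / 4) for d in gaps)
    return exploration + commitment


T = 1000
gaps = [0.0, 0.5]  # hypothetical two-armed instance; the first arm is optimal

# Brute-force search over every M for which exploration fits in the horizon.
candidates = range(1, T // len(gaps) + 1)
best_M = min(candidates, key=lambda M: etc_regret_bound(M, T, gaps))

print(best_M, etc_regret_bound(best_M, T, gaps))
```

At the largest admissible $M$ the second term vanishes and the bound equals the pure exploration cost, while at $M = 1$ it is dominated by the second term; the minimizer lies strictly between the two.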