Post

Successive Halving Algorithm



Successive Halving Algorithm

๐Ÿ“„ Jamieson, Kevin, and Ameet Talwalkar. โ€œNon-stochastic best arm identification and hyperparameter optimization.โ€ Artificial intelligence and statistics. PMLR, 2016.

Non-stochastic Best Arm Identification

Successive Halving Algorithm์€ Bandit ๊ธฐ๋ฐ˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. Bandit์ด๋ผ๋Š” ๋‹จ์–ด๋ฅผ ๋“ค์œผ๋ฉด ๊ฐ€์žฅ ๋จผ์ € ๋– ์˜ค๋ฅด๋Š” ๊ฒƒ์ด A/B ํ…Œ์ŠคํŠธ์— ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” Multi-armed Bandit์ž…๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์Šฌ๋กฏ๋จธ์‹ ์ด ์žˆ๊ณ , ๊ทธ ์Šฌ๋กฏ๋จธ์‹ ์˜ ํŒ” (arm)์„ ๋‹น๊ฒจ์„œ ๋ณด์ƒ (reward)์„ ์–ป๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด ๋ณด์ƒ์€ ์–ด๋–ค ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ๋‚˜ํƒ€๋‚˜๋Š”๋ฐ, ์–ด๋–ค ์ „๋žต์œผ๋กœ ์Šฌ๋กฏ๋จธ์‹ ์„ ๊ณจ๋ผ์•ผ ์ตœ๋Œ€์˜ ๋ณด์ƒ์„ ์–ป๋Š”์ง€์— ๊ด€ํ•œ ๋‚ด์šฉ์ด ๋ฐ”๋กœ MAB์ž…๋‹ˆ๋‹ค. MAB ๋ฌธ์ œ๋ฅผ ํ‘ธ๋Š” ๋ฐฉ๋ฒ•์€ ํฌ๊ฒŒ ๋‘ ๊ฐœ๋กœ ๋‚˜๋ˆŒ ์ˆ˜ ์žˆ๋Š”๋ฐ์š”. ํ•˜๋‚˜๋Š” ๊ฐ€์žฅ ํฐ ๋ณด์ƒ์„ ์ฃผ๋Š” ์ตœ์„ ์˜ ์Šฌ๋กฏ๋จธ์‹ ์˜ ํŒ”์„ ์ฐพ๋Š” ๊ฒƒ์ด๊ณ , ํ•˜๋‚˜๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ๋งŽ์ด ๋‹ค๋ฃจ๋Š” Exploration-Exploitation Trade-off๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ณธ ๊ธ€์—์„œ ๋‹ค๋ฃจ๋Š” Successive Halving Algorithm์€ ์ฒซ ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์ธ Best Arm Identification์„ ๊ธฐ๋ฐ˜์— ๋‘๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Stochastic vs Non-stochastic

Best Arm Identification ๋ฐฉ๋ฒ•๋„ ํฌ๊ฒŒ Stochasticํ•œ ํ™˜๊ฒฝ๊ณผ Non-stochasticํ•œ ํ™˜๊ฒฝ์œผ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋ ‡๊ฒŒ ํ‘œํ˜„ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๐ŸŽฐ Stochastic

๋ชจ๋“  $i \in [n], k \geq 1$์— ๋Œ€ํ•ด์„œ $\mathbb{E}[\ell_{i, k}] = \mu_i$๋ฅผ ๋งŒ์กฑํ•˜๋Š” $\ell_{i,k}$๋ฅผ $[0, 1]$ ๊ตฌ๊ฐ„ ๋‚ด์˜ ํ™•๋ฅ  ๋ถ„ํฌ์—์„œ ๋ฝ‘์€ i.i.d ์ƒ˜ํ”Œ๋กœ ๋‘์ž. ์ด๋•Œ $\sum^n_{i=1} T_i$๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋™์•ˆ $\arg\min_i \mu_i$๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ๋ชฉ์ ์ด๋‹ค.

๐ŸŽฏ Non-stochastic

๋ชจ๋“  $i \in [n], k\geq 1$์— ๋Œ€ํ•ด์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ–‰๋™์— ๋…๋ฆฝ์ ์ธ Loss sequence $\ell_{i, k} \in \mathbb{R}$์„ ์ƒ์„ฑํ•œ๋‹ค๊ณ  ํ•˜์ž. ๋” ๋‚˜์•„๊ฐ€ $\nu_i = \lim_{\tau \to \infty} \ell_{i, \tau}$๊ฐ€ ์กด์žฌํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜์ž. ์ด๋•Œ $\sum^n_{i=1} T_i$๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋™์•ˆ $\arg\min_{i} \nu_i$๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ๋ชฉ์ ์ด๋‹ค.

ํ‘œํ˜„์ด ์–ด๋ ต์ง€๋งŒ ๊ฐ ํ™˜๊ฒฝ์—์„œ ์ค‘์š”ํ•œ ๋ถ€๋ถ„์€ $\ell_{i, k}$๋ฅผ ์–ด๋–ป๊ฒŒ ์ •์˜ํ•˜๋Š๋ƒ์ž…๋‹ˆ๋‹ค. Stochasticํ•œ ๋ฐฉ๋ฒ•์€ $\ell_{i, k}$๋ฅผ ํ™•๋ฅ  ๋ถ„ํฌ์—์„œ ๋ฝ‘์€ i.i.d ์ƒ˜ํ”Œ์„, Non-stochasticํ•œ ๋ฐฉ๋ฒ•์—์„œ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ๋…๋ฆฝ์ ์ธ ์–ด๋–ค ์ˆœ์—ด์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™”๋Š” Non-stochastic ํ™˜๊ฒฝ์—์„œ์˜ Bandit ๋ฌธ์ œ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šฌ๋กฏ๋จธ์‹ ์˜ ํŒ”์€ ํ•˜๋‚˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์œผ๋กœ, $\ell_{i, k}$๋Š” ํ•™์Šตํ•˜๋Š” ๋ชจ๋ธ์˜ ์†์‹ค ํ•จ์ˆ˜ ๊ฐ’์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ non-stochasticํ•˜๊ณ  obliviousํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป๊ฒŒ ๋˜๋Š”๋ฐ์š”. ์†์‹ค ํ•จ์ˆ˜์˜ ๊ฐ’์€ ํ™•๋ฅ ์ ์ด ์•„๋‹Œ ๊ฒฐ์ •๋ก ์ (deterministic)์œผ๋กœ ๋‚˜์˜ค๊ณ , ๊ฐ๊ฐ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •๊ณผ ์†์‹ค ํ•จ์ˆ˜์˜ ๊ฐ’์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ๊ณต์œ ํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๊ด€์ ์—์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋„ Non-stochasticํ•œ Best Arm Identification์œผ๋กœ ํ”„๋ ˆ์ด๋ฐํ•˜์—ฌ ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์•ˆํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Non-stochastic์ด ์‰ฝ์ง€ ์•Š์€ ์ด์œ 

Stochasticํ•œ ํ™˜๊ฒฝ์€ ๋ช‡ ๊ฐ€์ง€ ์„ค์ •์„ ํ†ตํ•ด Non-stochasticํ•œ ํ™˜๊ฒฝ์œผ๋กœ ๋ฐ”๋€” ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Stochasticํ•œ ํ™˜๊ฒฝ์—์„œ์˜ Loss ๊ฐ’์ธ $\ell^\prime_{i, T_i}$ ๋ฅผ ์ด์šฉํ•ด $\ell_{i, T_i} = \frac{1}{T_i} \sum^{T_i}_{k=1} \ell^\prime_{i, T_i}$ ๋กœ ์„ค์ •ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์ด ๋•Œ ํฐ ์ˆ˜์˜ ๋ฒ•์น™์— ์˜ํ•ด ๋‹ค์Œ์ด ์„ฑ๋ฆฝํ•ฉ๋‹ˆ๋‹ค.

\[\lim_{\tau \to \infty} \ell_{i, \tau} = \mathbb{E}[\ell^\prime_{i, 1}].\]

๊ทธ๋ฆฌ๊ณ  $\ell_{i, T_i} = \min \left\{ \ell^\prime_{i, 1}, \cdots, \ell^\prime_{i, T_i} \right \}$ ๋กœ ๋‘”๋‹ค๋ฉด $\ell_{i, t}$ ๋Š” bounded, monotonically decreasing sequence๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ $\ell_{i, t}$ ๋Š” Monotone Convergence Theorem์— ์˜ํ•ด ๋ฐ˜๋“œ์‹œ ๊ทนํ•œ๊ฐ’์„ ๊ฐ–๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Stochasticํ•œ ํ™˜๊ฒฝ๊ณผ Non-stochasticํ•œ ํ™˜๊ฒฝ์˜ ์ฐจ์ด๋Š” ๊ทนํ•œ๊ฐ’์œผ๋กœ ์ˆ˜๋ ดํ•˜๋Š” ์†๋„์—์„œ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค. Stochasticํ•œ ํ™˜๊ฒฝ์€ ์ˆ˜๋ ดํ•˜๋Š” ์†๋„๋ฅผ ์•Œ ์ˆ˜ ์žˆ์ง€๋งŒ Non-stochastic ํ™˜๊ฒฝ์€ ์•Œ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. Stochasticํ•œ ์„ค์ •์—์„œ $\hat{\mu}_{i, T_i} = \frac{1}{T_i} \sum^{T_i}_{k=1} \ell_{i, k}$๋กœ ๋’€์„ ๋•Œ ๋ชจ๋“  $i \in [n], T_i > 0$ ์— ๋Œ€ํ•ด ๋‹ค์Œ์„ ๋งŒ์กฑํ•ฉ๋‹ˆ๋‹ค.

\[\left| \hat{\mu}_{i, T_i} - \mu_i \right| \leq \sqrt{\frac{\log(4nT_i^2)}{2T_i}}.\]

๐Ÿง ์ด ์‹์ด ์„ฑ๋ฆฝํ•˜๋Š” ์ด์œ ๋Š”?

$X_t$๊ฐ€ $[0, 1]$ ์•ˆ์— ์žˆ๊ณ , $n > 1$ ์ผ ๋•Œ ๋‹ค์Œ์ด ์„ฑ๋ฆฝํ•œ๋‹ค.

\[P \left( \bigcup^\infty_{t=1} \left\{ \left| \frac{1}{t} \sum^t_{s=1} X_s - \mathbb{E}[X_s] \right| \geq \sqrt{\frac{\log(4nt^2)}{2t}} \right\} \right) \leq \frac{1}{n}.\]

์ฆ๋ช….

Booleโ€™s Inequality์— ์˜ํ•ด์„œ ๋‹ค์Œ์ด ์„ฑ๋ฆฝํ•ฉ๋‹ˆ๋‹ค.

\[P \left( \bigcup^\infty_{t=1} \left\{ \left| \frac{1}{t} \sum^t_{s=1} X_s - \mathbb{E}[X_s] \right| \geq \sqrt{\frac{\log(4nt^2)}{2t}} \right\} \right) \leq \sum^\infty_{t=1} P \left( \left| \frac{1}{t} \sum^t_{s=1} X_s - \mathbb{E}[X_s] \right| \geq \sqrt{\frac{\log(4nt^2)}{2t}} \right).\]

Hoeffdingโ€™s Inequality์™€ ๋ฌดํ•œ ๊ธ‰์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

\[\begin{aligned} \sum^\infty_{t=1} P \left( \left| \frac{1}{t} \sum^t_{s=1} X_s - \mathbb{E}[X_s] \right| \geq \sqrt{\frac{\log(4nt^2)}{2t}} \right) &\leq \sum^\infty_{t=1} 2e^{-\log(4nt^2)} \\ &\leq \sum^\infty_{t=1} \frac{2}{4nt^2} \\ &\leq \sum^\infty_{t=1} \frac{1}{2nt^2} \\ &\leq \frac{1}{2n} \sum^\infty_{t=1} \frac{1}{t^2} \\ &= \frac{1}{2n} \cdot \frac{\pi^2}{6} \\ &\leq \frac{1}{2n} \cdot 2 = \frac{1}{n} &\blacksquare \end{aligned}\]

Non-stochasticํ•œ ํ™˜๊ฒฝ์—์„œ์˜ ๊ฐ€์ •์€ $\lim_{\tau \to \infty} \ell_{i, \tau}$ ๊ฐ€ ์กด์žฌํ•˜๋ฉด ๋‹ค์Œ์„ ๋งŒ์กฑํ•˜๋Š” ์ฆ๊ฐ€ํ•˜์ง€ ์•Š๋Š” ํ•จ์ˆ˜ $\gamma_i$ ๊ฐ€ ์กด์žฌํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

\[\left| \ell_{i, t} - \lim_{\tau \to \infty} \ell_{i, \tau} \right| \leq \gamma_{i}(t) \to 0 \quad \text{ as} \quad t \to \infty\]

์œ„ ๋‚ด์šฉ์€ ๊ทนํ•œ๊ฐ’์œผ๋กœ์˜ ์ˆ˜๋ ด์€ ๋ณด์žฅํ•ด์ฃผ์ง€๋งŒ ์–ผ๋งˆ๋‚˜ ๋นจ๋ฆฌ $\gamma_i(t)$ ๊ฐ€ 0์œผ๋กœ ๋„๋‹ฌํ•˜๋Š”์ง€๋Š” ์•Œ๋ ค์ฃผ์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด ์‚ฌ์‹ค์€ ๋‹ค์Œ์˜ ๋‘ ๊ฐ€์ง€ ๊ฒฐ๊ณผ๋กœ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  1. ๊ฒฐ๊ตญ Best Arm์ด ์กด์žฌํ•˜๊ธฐ๋Š” ํ•œ๋‹ค.
  2. Best Arm์ธ์ง€ ํ™•์ธํ•  ์ˆ˜ ์—†๊ฑฐ๋‚˜ Best Arm์˜ ์ •ํ™•ํ•œ ๊ฐ’์„ ์•Œ์•„๋‚ผ ์ˆ˜ ์—†๋‹ค.

๋ณธ ๋…ผ๋ฌธ์€ ๊ธฐ์กด์— ์ œ์•ˆ๋˜์—ˆ๋˜ Successive Halving Algorithm์— ๋Œ€ํ•ด ์œ„ ๋‚ด์šฉ์„ ํ†ตํ•ด ์ •ํ•ด์ง„ ์˜ˆ์‚ฐ(budget) ๋‚ด์—์„œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ตœ์ ํ™”ํ•˜๋Š” ํšจ๊ณผ์ ์ด๊ณ  ํšจ์œจ์ ์ธ ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค.

Successive Halving Algorithm

Pseudo-code for SHA Pseudo-code for SHA

Successive Halving Algorithm์€ ๋‹ค์Œ ์ˆœ์„œ๋กœ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

  • ์ž…๋ ฅ๊ฐ’์œผ๋กœ ์˜ˆ์‚ฐ $B$ ์™€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •๊ฐ’ ๊ฐœ์ˆ˜ $n$ ์„ ๊ฐ–์Šต๋‹ˆ๋‹ค.
    • ๋‘ ๊ฐœ์˜ ์ž…๋ ฅ๊ฐ’ ์ค‘ ์˜ˆ์‚ฐ $B$ ๋Š” $B \leftarrow n$, $B \leftarrow 2B$ ๋กœ ์„ค์ •ํ•˜๋Š” ๋”๋ธ”๋ง ํŠธ๋ฆญ (doubling trick)์œผ๋กœ ์—†์•จ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทธ ๋‹ค์Œ ์ตœ์ดˆ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ • ์ง‘ํ•ฉ $S_0$ ์„ ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค.
  • ๋‹ค์Œ์˜ ๊ณผ์ •์„ ํ•˜๋‚˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์ด ๋‚จ์„ ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.
    • $S_k$ ์— ๋‚จ์•„ ์žˆ๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •๋“ค๋กœ ๋ชจ๋ธ์„ $r_k$ Epoch ๋งŒํผ ๋” ํ•™์Šตํ•˜๊ณ  $R_k$ ๋ฅผ ์ง€๊ธˆ๊นŒ์ง€ ํ•™์Šตํ•œ Epoch ์ˆ˜๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
    • ํ•™์Šตํ•œ ๋ชจ๋ธ ์ค‘์—์„œ ์†์‹ค ํ•จ์ˆ˜ ๊ฐ’์ธ $\ell_{i, k}$ ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์€ ์ ˆ๋ฐ˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •๊ฐ’์„ ๋ฒ„๋ฆฝ๋‹ˆ๋‹ค.
    • ๋‚จ์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์„ $S_{k+1}$ ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.
  • ๋งˆ์ง€๋ง‰์œผ๋กœ ๋‚จ์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์„ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

์‹ค์ œ ์˜ˆ์‹œ

Figure from [1] Figure from [1]

์ดํ•ด๋ฅผ ์œ„ํ•ด์„œ ์ง์ ‘ ๊ฐ’์„ ๋Œ€์ž…ํ•˜์—ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š”์ง€ ์•Œ์•„๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์šฐ์„  ์˜ˆ์‚ฐ $B$ ๋ฅผ 32๋กœ, ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •๊ฐ’ ๊ฐœ์ˆ˜ $n$ ์€ 8๋กœ ๊ฐ€์ •ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด ์ตœ์ดˆ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ • ์ง‘ํ•ฉ $S_0$ ์˜ ํฌ๊ธฐ๋Š” ๋‹น์—ฐํžˆ 8์ด๊ณ , ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ฐ˜๋ณต์€ $k=0$ ๋ถ€ํ„ฐ $k = \lfloor \log_2(n) \rfloor -1 = 2$ ๊นŒ์ง€ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

  1. $k=0$
    • $r_0 = \lfloor \frac{32}{ |S_0| \lceil \log_2(8) \rceil} \rfloor = \lfloor \frac{32}{8 \cdot 3} \rfloor = 1$
    • $S_0$ ๋‚ด ๋ชจ๋“  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ 1 Epoch ๋งŒํผ ํ•™์Šตํ•˜๊ณ , ์„ฑ๋Šฅ์ด ๋‚ฎ์€ ์ ˆ๋ฐ˜์„ ๋ฒ„๋ฆฝ๋‹ˆ๋‹ค.
    • ๋”ฐ๋ผ์„œ $|S_1| = 4$ ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
  2. $k=1$
    • $r_1 = \lfloor \frac{32}{|S_1| \lceil \log_2(8) \rceil} \rfloor = \lfloor \frac{32}{4 \cdot 3} \rfloor = 2$
    • $S_1$ ๋‚ด ๋ชจ๋“  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ 2 Epochs ๋งŒํผ ํ•™์Šตํ•˜๊ณ , ์„ฑ๋Šฅ์ด ๋‚ฎ์€ ์ ˆ๋ฐ˜์„ ๋ฒ„๋ฆฝ๋‹ˆ๋‹ค.
    • ๋”ฐ๋ผ์„œ $|S_2| = 2$ ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
  3. $k=2$
    • $r_2 = \lfloor \frac{32}{|S_2| \lceil \log_2(8) \rceil} \rfloor = \lfloor \frac{32}{2 \cdot 3} \rfloor = 5$
    • $S_2$ ๋‚ด ๋ชจ๋“  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ 5 Epochs ๋งŒํผ ํ•™์Šตํ•˜๊ณ , ์„ฑ๋Šฅ์ด ๋‚ฎ์€ ์ ˆ๋ฐ˜์„ ๋ฒ„๋ฆฝ๋‹ˆ๋‹ค.
    • ๋”ฐ๋ผ์„œ $|S_3| = 1$์ด ๋˜๊ณ , $S_3$์— ์กด์žฌํ•˜๋Š” ๋‹จ ํ•˜๋‚˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ SHA์—์„œ ์ฐพ๋Š” ์ตœ์  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์ž…๋‹ˆ๋‹ค.

SHA์˜ ์ˆ˜๋ ด์„ฑ ๋ณด์žฅ

โš ๏ธ ์•„๋ž˜ ๋‚ด์šฉ์€ ์ž์„ธํ•œ ์ดํ•ด๋ฅผ ์›ํ•˜์‹œ๋Š” ๊ฒฝ์šฐ์—๋งŒ ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค. ํ•„์ˆ˜์ ์ธ ๋‚ด์šฉ์€ ์•„๋‹™๋‹ˆ๋‹ค. ๐Ÿ˜€

๋ณด๊ธฐ์—๋Š” ๊ฐ„๋‹จํ•ด๋ณด์ด๋Š”๋ฐ ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ตœ์  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ž˜ ์ฐพ์•„์ค€๋‹ค๋Š” ๋ณด์žฅ, ์ฆ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ˆ˜๋ ด์„ฑ ๋ณด์žฅ์€ ์–ด๋–ป๊ฒŒ ๋˜๋Š”๊ฑธ๊นŒ์š”? ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฌ๋Ÿฌ ์ •๋ฆฌ(Theorem)๋ฅผ ์ด์šฉํ•˜์—ฌ ์ˆ˜๋ ด์„ฑ์— ๋Œ€ํ•œ ๋‚ด์šฉ์„ ์—„๋ฐ€ํ•˜๊ฒŒ ๋‹ค๋ฃจ๊ณ  ์žˆ๋Š”๋ฐ, ๋ณธ ํฌ์ŠคํŠธ์—์„œ๋Š” ์•ฝ์‹์œผ๋กœ ๋‹ค๋ฃจ๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

(1) ์ „์ฒด ์ƒ˜ํ”Œ์˜ ์ˆ˜๊ฐ€ ์˜ˆ์‚ฐ์„ ๋„˜์ง€ ์•Š์Œ

์šฐ์„  ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์ „์ฒด ์ƒ˜ํ”Œ์ด ์˜ˆ์‚ฐ $B$๋ฅผ ๋„˜์ง€ ์•Š๋Š” ๊ฒƒ๋ถ€ํ„ฐ ๋ณด์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋งŒ์•ฝ ๋„˜๊ฒŒ ๋œ๋‹ค๋ฉด ์ˆ˜๋ ด์€ ๋‘˜์งธ์น˜๊ณ  ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์œ ํšจ์„ฑ์ด ์‚ฌ๋ผ์ง€๋‹ˆ๊นŒ์š”. ์ „์ฒด ์ƒ˜ํ”Œ์˜ ์ˆ˜๋Š” ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ง‘ํ•ฉ ๋‚ด ์›์†Œ์˜ ์ˆ˜์™€ ํ•™์Šตํ•  Epoch ์ˆ˜๋ฅผ ๊ณฑํ•œ ๊ฒƒ๊ณผ ๊ฐ™์œผ๋ฏ€๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

\[\text{\#Samples} = \sum_{k=0}^{\lceil \log_2(n) \rceil -1} |S_k| \left\lfloor \frac{B}{|S_k| \lceil \log_2(n) \rceil} \right\rfloor\]

์—ฌ๊ธฐ์„œ floor function์˜ ์„ฑ์งˆ์„ ์ด์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.. ๋ถ„๋ชจ์— ์žˆ๋Š” $|S_k|$ ๋Š” ์ž์—ฐ์ˆ˜์ด๊ธฐ ๋•Œ๋ฌธ์— ์ด ๊ฐ’์„ floor function ๋ฐ–์œผ๋กœ ๋บ์„ ๋•Œ ์†Œ์ˆซ์  ์•„๋ž˜ ์ˆ˜๊ฐ€ ๋ฐœ์ƒํ•ด์„œ $|S_k|$ ๋ฅผ floor function ๋ฐ–์œผ๋กœ ๋บ์„ ๋•Œ๊ฐ€ ๋” ํฐ ์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

\[\begin{aligned} \sum_{k=0}^{\lceil \log_2(n) \rceil -1} |S_k| \left\lfloor \frac{B}{|S_k| \lceil \log_2(n) \rceil} \right\rfloor &\leq \sum_{k=0}^{\lceil \log_2(n) \rceil -1} |S_k| \frac{1}{|S_k|} \left\lfloor \frac{B}{\lceil \log_2(n) \rceil} \right\rfloor \\ &= \sum_{k=0}^{\lceil \log_2(n) \rceil -1} \left\lfloor \frac{B}{\lceil \log_2(n) \rceil} \right\rfloor \\ & \leq \sum_{k=0}^{\lceil \log_2(n) \rceil -1} \frac{B}{\lceil \log_2(n) \rceil} = B \end{aligned}\]

๋”ฐ๋ผ์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์ „์ฒด ์ƒ˜ํ”Œ์˜ ์ˆ˜๋Š” ์˜ˆ์‚ฐ๋ณด๋‹ค ์ž‘๊ฑฐ๋‚˜ ๊ฐ™์Šต๋‹ˆ๋‹ค.

(2) ์ˆ˜๋ ด์„ฑ ๋ณด์žฅ

๋ชจ๋“  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ • $i = 1, 2, \cdots, n$์— ๋Œ€ํ•ด์„œ ๋‹ค์Œ์„ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.

\[\nu_i = \lim_{\tau \to \infty} \ell_{i, \tau}\]

์ด ๊ฐ’์€ ์œ„์—์„œ์˜ Non-stochasticํ•œ ํ™˜๊ฒฝ์—์„œ์˜ ๊ฐ€์ •์— ์˜ํ•ด ๋ฐ˜๋“œ์‹œ ์กด์žฌํ•˜๋Š” ๊ฐ’์ž…๋‹ˆ๋‹ค. ์ผ๋ฐ˜์„ฑ์„ ์žƒ์ง€ ์•Š๊ณ  ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ฐ€์ •ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ๋‹จ์ˆœํžˆ ๊ฐ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์„ ์„ฑ๋Šฅ ์ˆœ์œผ๋กœ ์žฌ์ •๋ ฌํ–ˆ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

\[\nu_1 \leq \nu_2 \leq \cdots \leq \nu_n.\]

๊ทธ๋ฆฌ๊ณ  ๊ฐ๊ฐ์˜ $i = 1, 2, \cdots, n$์— ๋Œ€ํ•ด์„œ ๋‹ค์Œ์„ ๋งŒ์กฑํ•˜๋Š” point-wise smallest, non-increasingํ•œ $t$์— ๋Œ€ํ•œ ํ•จ์ˆ˜ $\gamma_i(t)$ ๋ฅผ ์ •์˜ํ•ฉ๋‹ˆ๋‹ค.

\[\left| \ell_{i, t} - \nu_i \right| \leq \gamma_i(t) \quad \forall t. \tag{*}\]

$\gamma_i(t)$ ๋Š” $\ell_{i, t}$ ์™€ ๊ทธ ๊ทนํ•œ๊ฐ’ $\nu_i$ ์˜ ์ฐจ์ด๋ฅผ boundํ•˜๋Š” ์–ด๋–ค ํ•จ์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ๊นŒ์ง€ ์ •์˜๊ฐ€ ๋˜์—ˆ๋‹ค๋ฉด ์ž„์˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ • $i$ ์˜ ์–ด๋–ค ์‹œ์  $t_i$ ์—์„œ์˜ ์†์‹ค ํ•จ์ˆ˜๊ฐ’ $\ell_{i, t_i}$ ์™€ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹๊ฒŒ ๋‚˜์˜ฌ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์˜ ์–ด๋–ค ์‹œ์  $t_1$ ์—์„œ์˜ ์†์‹ค ํ•จ์ˆ˜๊ฐ’ $\ell_{1, t_1}$ ์˜ ์ฐจ์ด๋ฅผ ๊ณ„์‚ฐํ•ด๋ด…์‹œ๋‹ค.

\[\begin{aligned} \ell_{i, t_i} - \ell_{1, t_1} &= (\ell_{i, t_i} - \nu_i) + (\nu_1 - \ell_{1, t_1}) + 2\left(\frac{\nu_i - \nu_1}{2}\right) \\ & \geq -\gamma_i(t_i) - \gamma_1(t_1) + 2\left(\frac{\nu_i - \nu_1}{2}\right) & \text{by (*)} \end{aligned}\]

์—ฌ๊ธฐ์„œ $\gamma_i(t)$ ์— ๋Œ€ํ•œ ์œ ์‚ฌ ์—ญํ•จ์ˆ˜๋ฅผ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค.

\[\gamma_i^{-1}(\alpha) = \min \{ t \in \mathbb{N} \mid \gamma_i(t) \leq \alpha \} \quad \forall i \in [n]\]

๋งŒ์•ฝ $t_i > \gamma_i^{-1} \left( \frac{\nu_i - \nu_1}{2} \right)$์ด๋ฉด์„œ $t_1 > \gamma_1^{-1} \left( \frac{\nu_i - \nu_1}{2} \right)$ ๋ผ๋ฉด $\gamma_i(t)$ ๋Š” non-increasing function์ด๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์Œ์ด ์„ฑ๋ฆฝํ•ฉ๋‹ˆ๋‹ค.

\[\gamma_i(t_i) < \frac{\nu_i - \nu_1}{2} \quad \text{and} \quad \gamma_1(t_1) < \frac{\nu_i - \nu_1}{2}\]

๋”ฐ๋ผ์„œ ์‹ (12)๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‹ค์‹œ ์“ธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

\[\begin{aligned} \ell_{i, t_i} - \ell_{1, t_1} &= (\ell_{i, t_i} - \nu_i) + (\nu_1 - \ell_{1, t_1}) + 2\left(\frac{\nu_i - \nu_1}{2}\right) \\ & \geq -\gamma_i(t_i) - \gamma_1(t_1) + 2\left(\frac{\nu_i - \nu_1}{2}\right) > 0 \end{aligned}\]

๋ณด๋‹ค ์—„๋ฐ€ํ•˜๊ฒŒ ๋งํ•˜์ž๋ฉด ๋‹ค์Œ์ด ์„ฑ๋ฆฝํ•ฉ๋‹ˆ๋‹ค.

\[\min \{ t_i, t_1 \} > \max \left\{ \gamma_i^{-1} \left( \frac{\nu_i - \nu_1}{2} \right), \gamma_1^{-1} \left( \frac{\nu_i - \nu_1}{2} \right) \right\} \implies \ell_{i, t_i} > \ell_{1, t_i}\]

์ด๋ฅผ ํ†ตํ•ด ์ ๋‹นํ•œ ์ž„์˜์˜ ๋‘ ์‹œ์ , $t_i$์™€ $t_1$์—์„œ $\ell_{i, t_i} > \ell_{1, t_1}$ ๊ฐ€ ์„ฑ๋ฆฝํ•œ๋‹ค๋ฉด ํ•™์Šต ์ค‘๊ฐ„์˜ ์†์‹ค ํ•จ์ˆ˜๋ฅผ ๋น„๊ตํ•˜๋Š” ๊ฒƒ๋งŒ์œผ๋กœ ์ตœ์ข… ์ˆ˜๋ ด๊ฐ’์ธ $\nu_i$์™€ $\nu_1$์˜ ๋Œ€์†Œ ๊ด€๊ณ„๋ฅผ ๋น„๊ตํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ง๊ด€์ ์œผ๋กœ ์ƒ๊ฐํ•ด๋ณธ๋‹ค๋ฉด $\gamma_i(t)$๊ฐ€ ๋‘ ์ˆ˜๋ ด๊ฐ’ $\nu_i$์™€ $\nu_1$์˜ ํ‰๊ท ๋ณด๋‹ค ์ž‘์•„์ง€๊ฒŒ ๋˜๋Š” $t$๋Š” ์ถฉ๋ถ„ํžˆ ํฐ ์ˆ˜๊ฐ€ ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค. $t$๋ฅผ ํฌ๊ฒŒ ์žก๊ธฐ ์œ„ํ•ด์„  ์ด ์˜ˆ์‚ฐ $B$๋ฅผ ์ถฉ๋ถ„ํžˆ ํฌ๊ฒŒ ์žก์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ถฉ๋ถ„ํžˆ ํฐ $B$ ๊ฐ’์„ ๊ฐ–๋Š”๋‹ค๋ฉด ์ด๋Š” ๊ฒฐ๊ตญ SHA๊ฐ€ ์ตœ์ ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์„ ์ฐพ์•„๋‚ด๊ธฐ์— ์ถฉ๋ถ„ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๋ ˆํผ๋Ÿฐ์Šค

[1] Feurer, Matthias, and Frank Hutter. โ€œHyperparameter optimization.โ€ Automated machine learning. Springer, Cham, 2019. 3-33.



This post is licensed under CC BY 4.0 by the author.