Post

BOHB - Robust and Efficient Hyperparameter Optimization at Scale



BOHB - Robust and Efficient Hyperparameter Optimization at Scale

๐Ÿ“„ Falkner, Stefan, Aaron Klein, and Frank Hutter. โ€œBOHB: Robust and efficient hyperparameter optimization at scale.โ€ International Conference on Machine Learning. PMLR, 2018.

๋“ค์–ด๊ฐ€๋ฉฐ

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ตœ์ ํ™” (HPO) ๋ฌธ์ œ๋Š” ๋ชจ๋ธ์˜ ๊ทœ๋ชจ๊ฐ€ ์ปค์ง€๋ฉด ์ปค์งˆ์ˆ˜๋ก ๊ทธ ์ค‘์š”์„ฑ๋„ ์—ญ์‹œ ์ปค์ง‘๋‹ˆ๋‹ค. ์ด์ „์— ํฌ์ŠคํŒ…ํ•œ Bayesian Optimization, Successive Halving Algorithm, Hyperband ๋ชจ๋‘ ๊ทธ ๋งฅ๋ฝ์—์„œ ์ค‘์š”ํ•œ ์—ญํ• ์„ ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์„ ๋‹ค๋ฃจ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์€ ์•ž์„œ ์–ธ๊ธ‰ํ•œ ์„ธ ๊ฐ€์ง€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋ชจ๋‘ ๊ฒฐํ•ฉํ•˜์—ฌ ์ตœ๊ทผ ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๋ฐฉ๋ฒ•๋ก ์ธ BOHB์— ๋Œ€ํ•ด์„œ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

์ €์ž๋“ค์€ HPO ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ๋ฒ•๋“ค์€ ๋‹ค์Œ์˜ ์š”๊ตฌ ์‚ฌํ•ญ์„ ๋งŒ์กฑํ•ด์•ผ ๋งํ•ฉ๋‹ˆ๋‹ค.

  1. ๐Ÿ’ช Strong Anytime Performance (์–ธ์ œ๋“  ๋†’์€ ์„ฑ๋Šฅ)
    • ์ตœ๊ทผ์— ๋“ฑ์žฅํ•˜๋Š” ๋ชจ๋ธ๋“ค์€ ๋ชจ๋‘ ํฐ ๊ทœ๋ชจ๋ฅผ ์ž๋ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ง์ธ์ฆ‰์Šจ ํ•œ ๋ฒˆ์˜ ํ•™์Šต์—๋„ ๊ธด ์‹œ๊ฐ„์ด ํ•„์š”ํ•˜๋‹ค๋Š” ์˜๋ฏธ์ธ๋ฐ์š”. ๋”ฐ๋ผ์„œ ์ ์€ ์‹œ๊ฐ„์—๋„ ๋žœ๋ค ์„œ์น˜๊ฐ™์€ ๋ฐฉ๋ฒ• ์ด์ƒ์˜ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค˜์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  2. ๐Ÿ Strong Final Performance (์ตœ์ข… ์„ฑ๋Šฅ)
    • ์ด๋Ÿฌ๋‚˜์ €๋Ÿฌ๋‚˜ ๊ฒฐ๊ตญ ๋ชจ๋ธ์„ ๋ฐฐํฌํ•  ๋•Œ์˜ ๋งˆ์ง€๋ง‰ ์„ฑ๋Šฅ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ๋žœ๋ค ์„œ์น˜๋Š” ๊ธด ์‹œ๊ฐ„์„ ๋“ค์ด๋”๋ผ๋„ ํฐ ํญ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ์–ด๋ ค์šด ๋งŒํผ ์ข‹์€ HPO ๋ฐฉ๋ฒ•์€ ๋†’์€ ์ตœ์ข… ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  3. ๐Ÿ›ค๏ธ Effective Use of Parallel Resources (ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ)
    • ์ตœ๊ทผ ๋Œ€๋ถ€๋ถ„์˜ ํ™˜๊ฒฝ์ด ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ์ง€์›ํ•˜๋Š” ๋งŒํผ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  4. ๐Ÿšœ Scalability (ํ™•์žฅ์„ฑ)
    • ์ตœ๊ทผ ๋งŽ์€ ๋ชจ๋ธ๋“ค์€ ๋งค์šฐ ๋งŽ์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ–๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋ชจ๋ธ์˜ ๊ตฌ์กฐ, ์ตœ์ ํ™”, ์ •๊ทœํ™” ๋“ฑ ์—ฌ๋Ÿฌ ๋ฒ”์ฃผ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์œ ์šฉํ•œ HPO ๋ฐฉ๋ฒ•์ด๋ผ๋ฉด ์ˆ˜์‹ญ๊ฐœ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‰ฝ๊ฒŒ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  5. ๐Ÿ› ๏ธ Robustness & Flexibility (๊ฐ•๊ฑด์„ฑ๊ณผ ์œ ์—ฐ์„ฑ)
    • ๊ธฐ๋ณธ์ ์œผ๋กœ ๋‹ค์–‘ํ•œ ML ๋ฌธ์ œ์— ์ ์šฉ ๊ฐ€๋Šฅํ•œ ๊ฒƒ์ด ๋‹น์—ฐํ•œ ์ผ์ด์ง€๋งŒ ๊ทธ๋ ‡๊ฒŒ ์‰ฌ์šด ์ผ์€ ์•„๋‹™๋‹ˆ๋‹ค. ์–ด๋–ค ๋ชจ๋ธ์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์— ๋ฏผ๊ฐํ•œ ๋ฐ˜๋ฉด, ์–ด๋–ค ๋ชจ๋ธ์€ ์ผ๋ถ€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋งŒ์ด ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์œ ์šฉํ•œ HPO ๊ธฐ๋ฒ•์ด๋ผ๋ฉด ์–ด๋–ค ์ƒํ™ฉ์—์„œ๋„ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ๋ฒ”์ฃผํ˜•, ์ •์ˆ˜ํ˜•, ์—ฐ์†ํ˜• ๋“ฑ ๋‹ค์–‘ํ•œ ์ข…๋ฅ˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋งŽ์€ HPO ๊ธฐ๋ฒ•๋“ค์€ ๊ฐ๊ฐ์˜ ์žฅ๋‹จ์ ์„ ๊ฐ–๊ณ  ์žˆ์ง€๋งŒ ์œ„ ์š”๊ตฌ ์‚ฌํ•ญ์„ ๋ชจ๋‘ ๋งŒ์กฑํ•˜์ง„ ๋ชปํ–ˆ์Šต๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ ์ด ๋…ผ๋ฌธ์„ ํ†ตํ•ด ์ด๋ฏธ ์•Œ๋ ค์ง„ ๋ช‡๋ช‡ ๊ธฐ๋ฒ•์„ ๊ฒฐํ•ฉํ•ด ์œ„ ์š”๊ตฌ์‚ฌํ•ญ์„ ๋ชจ๋‘ ๋งŒ์กฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ Bayesian Optimization๋ณด๋‹ค ๋น ๋ฅด๊ฒŒ ์ข‹์€ ์†”๋ฃจ์…˜์„ ์ฐพ๊ณ , ๊ทธ ์ˆ˜๋ ด ์†๋„๋Š” Hyperband ๋ณด๋‹ค ๋น ๋ฆ…๋‹ˆ๋‹ค.

Figure 1. Figure 1. ์„œ๋กœ ๋‹ค๋ฅธ HPO ๊ธฐ๋ฒ•์œผ๋กœ ์‹ ๊ฒฝ๋ง์— ์‚ฌ์šฉํ•œ ์—ฌ์„ฏ ๊ฐœ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ตœ์ ํ™”ํ•œ ๊ฒฐ๊ณผ์ž…๋‹ˆ๋‹ค. Hyperband๋Š” ๋Œ€๋ถ€๋ถ„์˜ ์‹œ๊ฐ„๋Œ€์—์„œ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ, ๋งŽ์€ ์‹œ๊ฐ„์ด ์ง€๋‚ฌ์„ ๋•Œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ๋ˆˆ์— ๋„์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋Œ€๋กœ Bayesian optimization์€ ์ฒ˜์Œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๊ธฐ๊นŒ์ง€ ๋งŽ์€ ์‹œ๊ฐ„์ด ๊ฑธ๋ฆฌ์ง€๋งŒ ์ถฉ๋ถ„ํ•œ ์‹œ๊ฐ„์ด ์ฃผ์–ด์กŒ์„ ๋•Œ Hyperband๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒ…๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์ธ BOHB๋Š” ๊ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Model-Based Hyperband

BOHB๋Š” TPE์™€ Hyperband๋ฅผ ๊ฒฐํ•ฉํ•œ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ๊ฐ๊ฐ์˜ ๋ฐฉ๋ฒ•์„ ์ž์„ธํ•˜๊ฒŒ ์„ค๋ช…ํ•˜๊ณ  ์žˆ์ง€๋งŒ, ๋ณธ ํฌ์ŠคํŠธ๋Š” ์ด์ „์— ํฌ์ŠคํŒ…ํ•œ ๋‚ด์šฉ์œผ๋กœ ๊ฐˆ์Œํ•ฉ๋‹ˆ๋‹ค.

Hyperband๋Š” ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋‹ค์„ฏ ๊ฐ€์ง€์˜ ์š”๊ตฌ์‚ฌํ•ญ ์ค‘ ๋Œ€๋ถ€๋ถ„์„ ๋งŒ์กฑํ•ฉ๋‹ˆ๋‹ค. ์ •ํ™•ํ•˜๊ฒŒ๋Š” (1) ์–ธ์ œ๋“  ๋†’์€ ์„ฑ๋Šฅ, (4) ํ™•์žฅ์„ฑ, (5) ๊ฐ•๊ฑด์„ฑ๊ณผ ์œ ์—ฐ์„ฑ์„ ๋งŒ์กฑํ•˜์ฃ . Bayesian Optimization์€ (2) ์ตœ์ข… ์„ฑ๋Šฅ์„ ๋งŒ์กฑํ•ฉ๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ ์—ฌ๊ธฐ์— ํ•˜๋‚˜ ๋‚จ์€ (3) ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๊ฐ€๋ฏธํ•ฉ๋‹ˆ๋‹ค. ๊ฒŒ๋‹ค๊ฐ€ ๋‹ค์Œ์˜ ๋‘ ๊ฐ€์ง€ ํŠน์ง•๋„ ๊ณ๋“ค์˜€์Šต๋‹ˆ๋‹ค.

  1. ๐Ÿš… Simplicity (๋‹จ์ˆœ์„ฑ)
    • ๋‹จ์ˆœํ• ์ˆ˜๋ก ๊ฒ€์ฆํ•˜๊ธฐ ์‰ฝ๊ณ  ๋‹ค์‹œ ๊ตฌํ˜„ํ•˜๊ธฐ๋„ ์‰ฝ๋‹ค๋Š” ์ ์—์„œ ๋‹จ์ˆœ์„ฑ์€ ๊ทธ ์ž์ฒด๋กœ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. Hyperband๋Š” ๋‹จ์ˆœํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด์ง€๋งŒ Gaussian process ๊ธฐ๋ฐ˜์˜ Bayesian optimization์€ ๊ทธ๋ ‡์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ง‘ํ•ฉ์— ๋Œ€ํ•ด ๋ณต์žกํ•œ MCMC ์ƒ˜ํ”Œ๋ง์„ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•˜๊ณ , ๋ฐ์ดํ„ฐ์— ๋งž๋Š” ์ปค๋„ ํ•จ์ˆ˜ ์„ ํƒ์ด ๊นŒ๋‹ค๋กญ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.
  2. ๐Ÿ’ป Computational Efficiency (๊ณ„์‚ฐ ํšจ์œจ์„ฑ)
    • Hyperband๋Š” ์ ์€ ์‹œ๊ฐ„์—๋„ ๋งŽ์€ ํ•จ์ˆ˜ ๊ณ„์‚ฐ์ด ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ ์ผ๋ฐ˜์ ์ธ GP์˜ ๊ณ„์‚ฐ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋ณต์žก๋„๊ฐ€ ์กฐ๊ธˆ ๋” ๋‚ฎ์€ ๊ทผ์‚ฌ GP์— ๋Œ€ํ•œ ๊ณ„์‚ฐ๋„ ๋ฌธ์ œ๊ฐ€ ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
      • ์ผ๋ฐ˜์ ์ธ GP๋Š” ์‹œ๊ฐ„ ๋ณต์žก๋„๊ฐ€ $O(n^3)$์— ๋‹ฌํ•ฉ๋‹ˆ๋‹ค.
    • ๋˜ํ•œ ๋ณต์žกํ•œ ํš๋“ ํ•จ์ˆ˜ (acquisition function)์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•œ ๋ณต์žก๋„ ์—ญ์‹œ ๋ณ‘๋ชฉ ํ˜„์ƒ์„ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ๋˜๋ฉด ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์˜ ํšจ์œจ์„ฑ์ด ๊ธ‰๊ฐํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ด๋Ÿฐ ์ด์œ ์—์„œ BOHB์˜ Bayesian optimization์€ GP๊ฐ€ ์•„๋‹Œ TPE์— ๊ธฐ๋ฐ˜์„ ๋‘๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ƒ์„ธ

BOHB๋Š” ์–ด๋–ค ์ฃผ์–ด์ง„ ์˜ˆ์‚ฐ์— ๋Œ€ํ•ด ๋ช‡ ๊ฐœ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์˜ ์„ฑ๋Šฅ์„ ๊ณ„์‚ฐํ• ์ง€ Hyperband๋ฅผ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ Hyperband์™€ ๋‹ค๋ฅธ ์ ์€ SHA๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์ „ ์ตœ์ดˆ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์„ ๋žœ๋ค ์ƒ˜ํ”Œ๋ง์ด ์•„๋‹ˆ๋ผ ๋ชจ๋ธ ๊ธฐ๋ฐ˜์œผ๋กœ ํƒ์ƒ‰ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋งํ•˜๋Š” ๋ชจ๋ธ์ด ๋ฐ”๋กœ Bayesian optimization์ž…๋‹ˆ๋‹ค.

BOHB์—์„œ์˜ Bayesian optimization์€ TPE์™€ ๋งค์šฐ ์œ ์‚ฌํ•˜์ง€๋งŒ ํ•˜๋‚˜์˜ ์ฐจ์ด์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์›๋ž˜ TPE๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์ด ๋ช‡ ๊ฐœ์ด๋“ ์ง€ 1์ฐจ์› ์ปค๋„ ๋ถ„ํฌ ์ถ”์ • (KDE)์„ ํ•˜์ง€๋งŒ BOHB์—์„œ๋Š” ํ•˜๋‚˜์˜ ๋‹ค์ฐจ์› ์ปค๋„ ๋ถ„ํฌ ์ถ”์ •์„ ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์ฐจ์› ์ปค๋„ ๋ถ„ํฌ ์ถ”์ •์„ ํ•˜๋Š” ์ด์œ ๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ„์˜ ์ƒํ˜ธ ์ž‘์šฉ์„ ๊ณ ๋ คํ•˜๊ธฐ ์œ„ํ•จ์ž…๋‹ˆ๋‹ค.

Algorithm 2. Algorithm 2. Pseudocode for sampling in BOHB

TPE๋Š” ์„ฑ๋Šฅ์ด ์ข‹์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋ถ„ํฌ์™€ ์„ฑ๋Šฅ์ด ๋‚˜์œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋ถ„ํฌ๋ฅผ ์ถ”์ •ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ถ„ํฌ ์ถ”์ •์„ ์œ„ํ•ด ์‚ฌ์šฉํ•  ์ตœ์†Œํ•œ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜ $N_\text{min}$ ์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜ $d$ ์— ๋Œ€ํ•ด $N_\text{min} = d+1$ ๋กœ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ ์˜ˆ์‚ฐ $b$ ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ ๊ด€์ธกํ•  ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ง‘ํ•ฉ์˜ ๊ฐœ์ˆ˜ $N_b = |D_b|$ ์— ๋Œ€ํ•ด์„œ $q \cdot N_b \geq N_\min$ ๋งŒ ๋งŒ์กฑํ•˜๋ฉด ๋ฐ”๋กœ ๋ถ„ํฌ ์ถ”์ •์„ ํ•˜๋„๋ก ์„ค์ •ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ $q$๋Š” ์‚ฌ์ „์— ์ •ํ•œ percentile ๊ฐ’์ž…๋‹ˆ๋‹ค. ์ •๋ฆฌํ•˜์ž๋ฉด ์ข‹์€ ๋ถ„ํฌ์™€ ๋‚˜์œ ๋ถ„ํฌ $\ell(x)$์™€ $g(x)$์˜ ๋ถ„ํฌ๋ฅผ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•œ ์ตœ์†Œํ•œ์˜ ๋ฐ์ดํ„ฐ ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

\[\begin{aligned} N_{b, \ell} &= \max(N_\text{min}, q \cdot N_b) \\ N_{b, g} &= \max(N_\text{min}, N_b - N_{b, \ell}) \end{aligned}\]

Algorithm 2์—์„œ ๋‘ ๋ฒˆ์งธ ์ค„์—์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด ์ตœ์ ํ™” ๊ณผ์ •์—์„œ BOHB๋Š” ํ•ญ์ƒ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์ง‘ํ•ฉ์— ๋Œ€ํ•ด ๊ด€์ธกํ•œ ์„ฑ๋Šฅ ๊ฒฐ๊ณผ๊ฐ€ ์ถฉ๋ถ„ํ•  ๋•Œ์˜ ๊ฐ€์žฅ ํฐ ์˜ˆ์‚ฐ ๊ฐ’์— ๋Œ€ํ•œ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

\[b = \arg \max \{ D_b : |D_b| \geq N_\text{min} + 2 \}.\]

๋‘ ๊ฐœ์˜ ๋ถ„ํฌ์— ๋Œ€ํ•œ ์ถ”์ •์„ ๋งˆ์ณค๋‹ค๋ฉด EI๋ฅผ ์ตœ์ ํ™”ํ•ด์•ผ ํ•˜๋Š”๋ฐ, $\ell^\prime(x)$ ๋ผ๋Š” ๋ถ„ํฌ์—์„œ $N_s$ ๋งŒํผ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ˜ํ”Œ๋งํ•ฉ๋‹ˆ๋‹ค. $\ell^\prime(x)$๋Š” $\ell(x)$์—์„œ bandwidth์— $b_w$๋ฅผ ๊ณฑํ•œ ๋ถ„ํฌ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ถ„ํฌ์˜ smoothness๊ฐ€ ๋†’์•„์ง€๊ณ  ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋” ๋งŽ์€ ํƒ์ƒ‰์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ bandwidth๋ฅผ ๋†’์ž„์œผ๋กœ์จ ์ตœ์ ํ™” ํ›„๋ฐ˜๋ถ€์— ์ˆ˜๋ ด์„ฑ์ด ๋†’์•„์ง€๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿค” Bandwidth์— $b_w$๋ฅผ ๊ณฑํ•˜๋ฉด?

์ปค๋„ $K : \mathbb{R}^n \to \mathbb{R}^+$์™€ bandwidth $h > 0$์— ๋Œ€ํ•ด์„œ kernel density estimator๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

\[\hat{f}_h(x) = \frac{1}{n}\sum^n_{i=1} K_h(x - x_i) = \frac{1}{nh} \sum^n_{i=1} K \left( \frac{x - x_i}{h} \right).\]

์ด๋•Œ $h$ ๊ฐ’์˜ ๋ณ€ํ™”์— ๋”ฐ๋ฅธ ์ถ”์ • ๋ถ„ํฌ์˜ ํ˜•ํƒœ๋Š” ์•„๋ž˜ ์ด๋ฏธ์ง€์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

ํšŒ์ƒ‰์€ ์‹ค์ œ ๋ถ„ํฌ, ๋นจ๊ฐ„์ƒ‰์€ h=0.05์ผ ๋•Œ์˜ KDE,
๊ฒ€์€์ƒ‰์€ h=0.337์ผ ๋•Œ์˜ KDE, ๋งˆ์ง€๋ง‰์œผ๋กœ ์ดˆ๋ก์ƒ‰์€ h=2์ผ ๋•Œ์˜ KDE

์ด๋ฏธ์ง€์—์„œ๋„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค์‹œํ”ผ bandwidth๊ฐ€ ํด ์ˆ˜๋ก ์ถ”์ •๋˜๋Š” ๋ถ„ํฌ๊ฐ€ ๋„“๊ฒŒ ํผ์ง€๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. TPE๋Š” $\ell(x)/g(x)$ ๊ฐ€ ๊ฐ€์žฅ ํฐ ์ ์„ ๋‹ค์Œ ํƒ์ƒ‰ ํ›„๋ณด๋กœ ๋‘๋Š”๋ฐ, bandwidth๊ฐ€ ์ปค์ง์— ๋”ฐ๋ผ $\ell(x)$๊ฐ€ ๋„“๊ฒŒ ํผ์ง€๊ณ  ํƒ์ƒ‰ํ•  ํ›„๋ณด์˜ ํญ์ด ๋„“์–ด์ง‘๋‹ˆ๋‹ค. ์ฆ‰ ๋” ๋งŽ์€ ํƒ์ƒ‰์„ ํ•˜๊ฒŒ ๋˜๋Š”๊ฑฐ์ฃ .

๋งˆ์ง€๋ง‰์œผ๋กœ Hyperband์˜ ์ด๋ก ์  ๋ณด์žฅ์„ ์œ„ํ•ด ํŠน์ • ํ™•๋ฅ  $\rho$๋กœ ์ž„์˜์˜ ๋žœ๋ค ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •์„ ์ƒ˜ํ”Œ๋งํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์ตœ์•…์˜ ๊ฒฝ์šฐ ๋žœ๋ค ์„œ์น˜๋ณด๋‹ค $\rho$๋ฐฐ ๋Š๋ฆด ์ˆ˜ ์žˆ์ง€๋งŒ ๊ฒฐ๊ตญ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ˆ˜๋ ด์„ ๋ณด์žฅํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ๋Š” Hyperband์™€ BOHB๊ฐ€ ๋žœ๋ค ์„œ์น˜๋ณด๋‹ค ํ˜„์ €ํžˆ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ

BOHB๋Š” TPE์™€ Hyperband์˜ ํŠน์„ฑ์„ ์ด์šฉํ•˜์—ฌ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ํ•ฉ๋‹ˆ๋‹ค. TPE๋Š” ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด EI๋ฅผ ์ตœ์ ํ™”ํ•  ๋•Œ ์ƒ˜ํ”Œ ์ˆ˜๋ฅผ ์ œํ•œํ•˜๊ณ , ๋‹ค์–‘ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด์„œ ์™„๋ฒฝํ•˜๊ฒŒ ์ตœ์ ํ™”ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. Hyperband๋Š” ๋™์‹œ์— ๋‹ค๋ฅธ bracket์— ๋Œ€ํ•ด Successive Halving์„ ์ˆ˜ํ–‰ํ•˜๊ณ , ๊ฐ SH๋ฅผ ์ˆ˜ํ–‰ํ•  ๋•Œ ๋™์‹œ์— ๋‹ค๋ฅธ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ‰๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

BOHB๋Š” ์šฐ์„  ์ˆœ์ฐจ์ ์ธ Hyperband์—์„œ ๊ฐ€์žฅ ์ž‘์€ ์˜ˆ์‚ฐ์ด ํ•„์š”ํ•œ SH๋ถ€ํ„ฐ ์ตœ์ ํ™”๋ฅผ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ (a) ๋ชจ๋“  worker๊ฐ€ ์ž‘์—… ์ค‘์ด๊ฑฐ๋‚˜ (b) SH๋กœ๋ถ€ํ„ฐ ์ถฉ๋ถ„ํ•œ ์ˆ˜์˜ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ƒ˜ํ”Œ๋ง ๋  ๋•Œ๊นŒ์ง€ ์œ„ Algorithm 2๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

  • (a)์˜ ๊ฒฝ์šฐ worker์˜ ์ž‘์—…์ด ๋๋‚  ๋•Œ๊นŒ์ง€ ๊ธฐ๋‹ค๋ ค์„œ ์ƒˆ๋กœ์šด ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ƒ˜ํ”Œ๋งํ•ฉ๋‹ˆ๋‹ค.
  • (b)์˜ ๊ฒฝ์šฐ ๋ณ‘๋ ฌ๋กœ ๋‹ค์Œ SH๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ ๊ด€์ธก ๊ฒฐ๊ณผ $D$๋Š” ๋ชจ๋“  SH ์‹œํ–‰์— ๊ณต์œ ๋ฉ๋‹ˆ๋‹ค. BOHB๋Š” ๊ฐ ์‹œ์ ์—์„œ ์ตœ๊ณ ์˜ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๊ตฌ์„ฑ์„ ์ถ”์ ํ•˜๋Š” ์ƒ์‹œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ (anytime algorithm)์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.


This post is licensed under CC BY 4.0 by the author.