Eligibility Traces

n-step TD๋‚˜ TD($$\lambda$$), eligibility trace๋Š” ์ตœ๊ทผ ๊ฐ•ํ™”ํ•™์Šต์—์„œ ๋งŽ์ด ๋‹ค๋ฃจ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์€ ์•„๋‹™๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ตœ๊ทผ Asynchronous methods for deep reinforcement learning ๋…ผ๋ฌธ์—์„œ n-step์„ ์‚ฌ์šฉํ•œ ์ผ์€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ œ๊ฐ€ ์ƒ๊ฐํ•˜๊ธฐ์— ํฌ๊ฒŒ ์ค‘์š”ํ•œ ๋ถ€๋ถ„์€ ์•„๋‹ˆ๊ณ  "์–ด๋– ํ•œ ๋ฌธ์ œ์ ์ด ์žˆ์–ด์„œ ๊ฐœ์„ ์‹œํ‚ค๋ ค๊ณ  ํ–ˆ๋˜ ๊ณผ์ •๋“ค์ด๊ตฌ๋‚˜"๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ํŽธํ•  ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

1. n-step TD

TD learning์€ Monte-Carlo learning์— ๋น„ํ•ด์„œ ๋งค time-step๋งˆ๋‹ค ํ•™์Šต์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์‹œ ํ•œ ๋ฒˆ TD Control์ธ Sarsa๋ฅผ ๋ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜์˜ update์‹์„ ๋ณด๋ฉด ๊ฒฐ๊ตญ updateํ•œ ๋ฒˆ ํ•  ๋•Œ Rํ•˜๋‚˜์˜ ์ •๋ณด๋ฐ–์— ์•Œ ์ˆ˜ ์—†์–ด์„œ ํ•™์Šตํ•˜๋Š”๋ฐ ์˜ค๋ž˜๊ฑธ๋ฆฌ๊ธฐ๋„ ํ•˜๊ณ  bias๊ฐ€ ๋†’๊ธฐ ๋•Œ๋ฌธ์— ์‚ฌ์‹ค์€ TD์™€ MC์˜ ๋‘˜์˜ ์žฅ์ ์„ ๋‹ค ์‚ด๋ฆด ์ˆ˜ ์žˆ๋‹ค๋ฉด ์ข‹์Šต๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ update๋ฅผ ํ•  ๋•Œ one step๋งŒ ๋ณด๊ณ ์„œ update๋ฅผ ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๊ณ  n-step์„ ์›€์ง์ธ ๋‹ค์Œ์— update๋ฅผ ํ•˜์ž๋ผ๊ณ  ์ƒ๊ฐํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋งŒ์•ฝ ์ง€๊ธˆ์ด t๋ผ๊ณ  ํ•˜๋ฉด t+n์ด ๋˜๋ฉด ๊ทธ ๋•Œ๊นŒ์ง€ ๋ชจ์•˜๋˜ reward๋“ค๋กœ์„œ n-step return์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ ์‚ฌ์ด์˜ ๋ฐฉ๋ฌธํ–ˆ๋˜ state๋“ค์˜ value function์„ updateํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๊ทธ๊ฒƒ์ด ๋ฐ”๋กœ n-step TD์ž…๋‹ˆ๋‹ค. ๋ช‡ ๊ฐœ์˜ time-step๋งˆ๋‹ค update๋ฅผ ํ•  ๊ฑด์ง€๋Š” ์ž์‹ ์ด ๊ฒฐ์ •ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์ด n-step์ด terminal state๊นŒ์ง€ ๊ฐ„๋‹ค๋ฉด ๊ทธ๊ฒŒ ๋ฐ”๋กœ Monte-Carlo์ด ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋‘˜์˜ ์žฅ์ ์„ ๋‹ค ์ทจํ•˜๊ธฐ ์œ„ํ•ด์„œ ๊ทธ ์‚ฌ์ด์˜ ์ ๋‹นํ•œ n-step์„ ์„ ํƒํ•ด์ฃผ๋Š” ๊ฒƒ์ด ์ข‹์Šต๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ ์–ด๋–ป๊ฒŒ ์ ๋‹นํ•œ n-step์ด๋ผ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๊ณ  ๊ทธ ๊ธฐ์ค€์ด ๋ฌด์—‡์ผ๊นŒ์š”??

2. Forward-View of TD($$\lambda$$)

Random walker๋ผ๋Š” prediction example์ž…๋‹ˆ๋‹ค. C state์—์„œ ์‹œ์ž‘ํ•ด์„œ ์™ผ์ชฝ ์˜ค๋ฅธ์ชฝ์œผ๋กœ randomํ•˜๊ฒŒ ์›€์ง์ด๋Š” policy๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฌธ์ œ์— ๋Œ€ํ•œ true value function ์•„๋ž˜ ๊ทธ๋ž˜ํ”„์— ๊ทธ๋ ค์ง„ ๊ฒƒ์ฒ˜๋Ÿผ ๋˜์–ด์žˆ๊ณ  ์ € value function์„ experience๋ฅผ ํ†ตํ•ด์„œ ์ฐพ์•„๋‚ด์•ผํ•ฉ๋‹ˆ๋‹ค.

์ด ๋ฌธ์ œ์— n-step TD prediction์„ ์ ์šฉ์‹œ์ผœ๋ณผ ๊ฒฝ์šฐ์— n์œผ๋กœ ์ ๋‹นํ•œ ์ˆซ์ž๊ฐ€ ์–ผ๋งˆ์ธ์ง€ ํŒ๋ณ„ํ•˜๊ธฐ ์‰ฝ์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์•„๋ž˜ ๊ทธ๋ž˜ํ”„๋Š” x์ถ•์€ $$\alpha$$, y์ถ•์€ true value function๊ณผ์˜ error์ž…๋‹ˆ๋‹ค. ์•„๋ž˜ ๊ทธ๋ž˜ํ”„์™€ ๊ฐ™์ด $$\alpha$$์˜ ๊ฐ’์— ๋”ฐ๋ผ์„œ ์–ด๋–ค n-step์ด ํ•™์Šต์— ์ข‹์€์ง€๋Š” ๋‹ฌ๋ผ์ง€๊ธฐ ๋•Œ๋ฌธ์— ์‚ฌ์‹ค์€ ์—ฌ๋Ÿฌ n-step TD๋ฅผ ํ•ฉํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด ๊ฐ n์˜ ์žฅ์ ์„ ๋‹ค ์ทจํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ ์ด ๋ชจ๋“  n-step return์„ ๋ชจ๋‘ ๋”ํ•ด์„œ ์‚ฌ์šฉํ•˜๊ธฐ๋กœ ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋‹จ์ˆœํžˆ ๋”ํ•ด์„œ ํ‰๊ท ์„ ๊ตฌํ•˜๋Š” ๋ฐฉ์‹์ด ์•„๋‹ˆ๊ณ  ์•„๋ž˜์™€ ๊ฐ™์ด $$\lambda$$๋ผ๋Š” weight๋ฅผ ์‚ฌ์šฉํ•ด์„œ geometrically weighted sum์„ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ชจ๋“  n-step์„ ๋‹ค ํฌํ•จํ•˜๋ฉด์„œ ๋‹ค ๋”ํ•˜๋ฉด 1์ด ๋‚˜์˜ต๋‹ˆ๋‹ค. ์ด๊ฒƒ์„ ํ†ตํ•ด์„œ ๊ตฌํ•œ $$\lambda$$-return์„ ์›๋ž˜ MC์˜ return์ž๋ฆฌ์— ๋„ฃ์–ด์ฃผ๋ฉด forward-view TD($$\lambda$$)๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

์ด๊ฒƒ์„ ๊ทธ๋ฆผ์œผ๋กœ ๋‚˜ํƒ€๋‚ด์ž๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๋‹ค์‹œ ์ •๋ฆฌ๋ฅผ ํ•ด๋ณด์ž๋ฉด TD๋Š” time-step๋งˆ๋‹ค ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š” ์žฅ์ ์€ ์žˆ์—ˆ์ง€๋งŒ ๋˜ํ•œ bias๊ฐ€ ๋†’๊ณ  ํ•™์Šต์ •๋ณด๊ฐ€ ๋ณ„๋กœ ์—†๊ธฐ ๋•Œ๋ฌธ์— TD์™€ MC์˜ ์žฅ์ ์„ ๋‘˜ ๋‹ค ์‚ด๋ฆฌ๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ n-step TD๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๊ฐ n-step์ด ์ƒํ™ฉ๋งˆ๋‹ค ๋‹ค๋ฅธ ์žฅ์ ์ด ์žˆ์–ด์„œ ์ด ๋ชจ๋“  ์žฅ์ ์„ ํฌํ•จํ•˜๊ธฐ ์œ„ํ•ด์„œ $$\lambda$$๋ผ๋Š” weight๋ฅผ ๋„์ž…ํ•ด์„œ $$\lambda$$-return์„ ๊ณ„์‚ฐํ•ด์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด forward-view TD($$\lambda$$)์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ๋ฐฉ๋ฒ•์—๋„ ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฐ”๋กœ MC์™€ ๋˜‘๊ฐ™์ด episode๊นŒ ๋๋‚˜์•ผ update๋ฅผ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.(๋ชจ๋“  n-step์„ ํฌํ•จํ•˜๋‹ˆ๊นŒ)

3. Backward-View of TD($$\lambda$$)

๋”ฐ๋ผ์„œ ๋ณธ๋ž˜ TD์˜ ์žฅ์ ์ด์—ˆ๋˜ time-step๋งˆ๋‹ค updateํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์‚ฌ๋ผ์กŒ์Šต๋‹ˆ๋‹ค. MC์˜ ์žฅ์ ์€ ์‚ด๋ฆฌ๋ฉด์„œ๋„ ๋ฐ”๋กœ ๋ฐ”๋กœ updateํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด ์—†์„๊นŒ์š”? ์—ฌ๊ธฐ์„œ ๋ฐ”๋กœ eligibility trace๋ผ๋Š” ๊ฐœ๋…์ด ๋‚˜์˜ต๋‹ˆ๋‹ค. ์•„๋ž˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ๊ณผ๊ฑฐ์— ์žˆ์—ˆ๋˜ ์ผ๋“ค ์ค‘์—์„œ ํ˜„์žฌ ๋‚ด๊ฐ€ ๋ฐ›์€ reward์— ๊ธฐ์—ฌํ•œ ๊ฒƒ์ด ๋ฌด์—‡์ผ๊นŒ?๋ผ๋Š” credit assignment๋ฌธ์ œ์—์„œ "์–ผ๋งˆ๋‚˜ ์ตœ๊ทผ์— ์ผ์–ด๋‚ฌ๋˜ ์ผ์ด์—ˆ๋‚˜"์™€ "์–ผ๋งˆ๋‚˜ ์ž์ฃผ ๋ฐœ์ƒํ–ˆ์—ˆ๋‚˜"๋ผ๋Š” ๊ฒƒ์„ ๊ธฐ์ค€์œผ๋กœ ๊ณผ๊ฑฐ์˜ ์ผ๋“ค์„ ๊ธฐ์–ตํ•ด๋†“๊ณ  ํ˜„์žฌ ๋ฐ›์€ reward๋ฅผ ๊ณผ๊ฑฐ์˜ state๋“ค๋กœ ๋ถ„๋ฐฐํ•ด์ฃผ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ฆ‰ TD(0)์ฒ˜๋Ÿผ ํ˜„์žฌ updaateํ•  $$\delta$$๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉด ํ˜„์žฌ์˜ value function๋งŒ updateํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ณผ๊ฑฐ์— ์ง€๋‚˜์™”๋˜ ๋ชจ๋“  state์— eligibility trace๋ฅผ ๊ธฐ์–ตํ•ด๋‘์—ˆ๋‹ค๊ฐ€ ๊ทธ ๋งŒํผ ์ž์‹ ์„ updateํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ์•„๋ž˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ํ˜„์žฌ์˜ ๊ฒฝํ—˜์„ ํ†ตํ•ด ํ•œ ๋ฒˆ์— ๊ณผ๊ฑฐ์˜ ๋ชจ๋“  state๋“ค์˜ value function์„ updateํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ํ˜„์žฌ์˜ ๊ฒฝํ—˜์ด ๊ณผ๊ฑฐ์˜ value function์— ์–ผ๋งˆ๋‚˜ ์˜ํ–ฅ์„ ์ฃผ๊ณ  ์‹ถ์€๊ฐ€๋Š” $$\lambda$$๋ฅผ ํ†ตํ•ด ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ update ๋ฐฉ์‹์„ backward-view TD($$\lambda$$)๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

4. Sarsa($$\lambda$$)

์œ„์—์„œ๋Š” TD prediction๋งŒ ๋‹ค๋ฃจ์—ˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” TD Control์— ๋Œ€ํ•ด์„œ ๋‹ค๋ฃจ๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. Sarsa์—๋„ n-step Sarsa๊ฐ€ ์žˆ๊ณ  forward-view Sarsa($$\lambda$$)๊ฐ€ ์žˆ๊ณ  backward-view Sarsa($$\lambda$$)๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ๊ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์„ค๋ช…์€ ์ƒ๋žตํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

backward-view Sarsa($$\lambda$$)๋ฅผ algorithm์œผ๋กœ ํ‘œํ˜„ํ•˜์ž๋ฉด ์•„๋ž˜๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๋˜ํ•œ Eligibility trace๋ฅผ ์ด๋ก ์œผ๋กœ๋งŒ ๋ด์„œ ์ดํ•ด๊ฐ€ ์•ˆ๊ฐ€๋Š” ๋ถ„๋“ค์ด ๊ณ„์‹คํ…๋ฐ ์•„๋ž˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด ์ดํ•ด๊ฐ€ ์‰ฝ์Šต๋‹ˆ๋‹ค. Path taken์„ ๋ณด๋ฉด ์ง€๊ธˆ๊นŒ์ง€ ์ง€๋‚˜์™”๋˜ ๊ณณ์— ๋Œ€ํ•ด์„œ ๊ฐ ๊ฐ์˜ eligibility trace๋ฅผ ๊ธฐ์–ตํ•ด๋†จ๋‹ค๊ฐ€ ์–ด๋А ์ง€๊ธˆ์— ์ด๋ฅด๋Ÿฌ์„œ reward๋ฅผ ์–ป์–ด์„œ update๋ฅผ ํ•˜๋ คํ•˜๋ฉด ์ž๋™์œผ๋กœ ๊ณผ๊ฑฐ์— ์ง€๋‚˜์™”๋˜ ๋ชจ๋“  state์˜ value function์ด ํ•จ๊ป˜ update๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

Last updated