Monte-Carlo Prediction

1. Model-Free

์ด์ „ ์ฑ•ํ„ฐ์—์„œ Dynamic Programming์— ๋Œ€ํ•ด์„œ ๋ฐฐ์›Œ๋ณด์•˜์Šต๋‹ˆ๋‹ค. Dynamic programming์€ Bellman Equation์„ ํ†ตํ•ด์„œ optimalํ•œ ํ•ด๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ์„œ MDP์— ๋Œ€ํ•œ ๋ชจ๋“  ์ •๋ณด๋ฅผ ๊ฐ€์ง„ ์ƒํƒœ์—์„œ ๋ฌธ์ œ๋ฅผ ํ’€์–ด๋‚˜๊ฐ€๋Š” ๋ฐฉ๋ฒ•์„ ์ด์•ผ๊ธฐํ•ฉ๋‹ˆ๋‹ค.

ํŠนํžˆ Environment์˜ model์ธ "Reward function"๊ณผ "state transition probabilities"๋ฅผ ์•Œ์•„์•ผํ•˜๊ธฐ ๋•Œ๋ฌธ์— Model-basedํ•œ ๋ฐฉ๋ฒ•์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•์—๋Š” ์•„๋ž˜๊ณผ ๊ฐ™์€ ๋ฌธ์ œ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

  • Full-width backup --> expensive computation

  • Full knowledge about the environment is required

With this approach we cannot solve problems with an enormous number of cases, such as Go, and we cannot apply it to the real world. In fact, even without the academic framing above, we can see that this method is very different from how people learn. As mentioned before, people do not act only after they know everything; they learn bit by bit by touching and stepping. As also said before, learning through trial and error is a defining characteristic of reinforcement learning.

๋”ฐ๋ผ์„œ DP์ฒ˜๋Ÿผ full-width backup์„ ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์‹ค์žฌ๋กœ ๊ฒฝํ—˜ํ•œ ์ •๋ณด๋“ค๋กœ์„œ update๋ฅผ ํ•˜๋Š” sample backup์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ์‹ค์žฌ๋กœ ๊ฒฝํ—˜ํ•œ ์ •๋ณด๋“ค์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์„œ ์ฒ˜์Œ๋ถ€ํ„ฐ environment์— ๋Œ€ํ•ด์„œ ๋ชจ๋“  ๊ฒƒ์„ ์•Œ ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. Environment์˜ model์„ ๋ชจ๋ฅด๊ณ  ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— Model-free๋ผ๋Š” ๋ง์ด ๋ถ™๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

ํ˜„์žฌ์˜ policy๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์›€์ง์—ฌ๋ณด๋ฉด์„œ sampling์„ ํ†ตํ•ด value function์„ updateํ•˜๋Š” ๊ฒƒ์„ model-free prediction์ด๋ผ ํ•˜๊ณ  policy๋ฅผ update๊นŒ์ง€ ํ•˜๊ฒŒ ๋œ๋‹ค๋ฉด model-free control์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

์ด๋ ‡๊ฒŒ Sampling์„ ํ†ตํ•ด์„œ ํ•™์Šตํ•˜๋Š” model-free ๋ฐฉ๋ฒ•์—๋Š” ๋‹ค์Œ ๋‘ ๊ฐ€์ง€๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Monte-Carlo

  • Temporal Difference

Monte-Carlo updates once per episode, while Temporal Difference updates at every time step. This chapter looks at Monte-Carlo learning.

2. Monte-Carlo

Monte-Carlo๋ผ๋Š” ๋ง์— ๋Œ€ํ•ด์„œ Sutton ๊ต์ˆ˜๋‹˜์€ ์ฑ…์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ด์•ผ๊ธฐํ•ฉ๋‹ˆ๋‹ค.

The term "Monte Carlo" is often used more broadly for any estimation method whose operation involves a significant random component. Here we use it specifically for methods based on averaging complete returns

So the term Monte-Carlo itself refers to any estimation method with a significant random component. In reinforcement learning it specifically means "averaging complete returns". What does that mean?

What separates Monte-Carlo from Temporal Difference is how the value function is estimated. The value function is the expected accumulated future reward, v(s) = E[G_t | S_t = s]: the sum of the rewards we expect to receive starting from this state into the future. If not with DP, how can we estimate it?

๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ์ƒ๊ฐ์€ episode๋ฅผ ๋๊นŒ์ง€ ๊ฐ€๋ณธ ํ›„์— ๋ฐ›์€ reward๋“ค๋กœ ๊ฐ state์˜ value function๋“ค์„ ๊ฑฐ๊พธ๋กœ ๊ณ„์‚ฐํ•ด๋ณด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ MC(Monte-Carlo)๋Š” ๋๋‚˜์ง€ ์•Š๋Š” episode์—์„œ๋Š” ์‚ฌ์šฉํ•  ์ˆ˜ ์—†๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. initial state S1์—์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ด์„œ terminal state St๊นŒ์ง€ ํ˜„์žฌ policy๋ฅผ ๋”ฐ๋ผ์„œ ์›€์ง์ด๊ฒŒ ๋œ๋‹ค๋ฉด ํ•œ time step๋งˆ๋‹ค reward๋ฅผ ๋ฐ›๊ฒŒ ๋  ํ…๋ฐ ๊ทธ reward๋“ค์„ ๊ธฐ์–ตํ•ด๋‘์—ˆ๋‹ค๊ฐ€ St๊ฐ€ ๋˜๋ฉด ๋’ค๋Œ์•„๋ณด๋ฉด์„œ ๊ฐ state์˜ value function์„ ๊ณ„์‚ฐํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์•„๋ž˜ recall that the return์ด๋ผ๊ณ  ๋˜์–ด์žˆ๋Š”๋ฐ ์ œ๊ฐ€ ๋งํ•œ ๊ฒƒ๊ณผ ๊ฐ™์€ ๋ง์ž…๋‹ˆ๋‹ค. ์ˆœ๊ฐ„ ์ˆœ๊ฐ„ ๋ฐ›์•˜๋˜ reward๋“ค์„ ์‹œ๊ฐ„ ์ˆœ์„œ๋Œ€๋กœ discount์‹œ์ผœ์„œ sample return์„ ๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

3. First-Visit MC vs Every-Visit MC

์œ„์—์„œ๋Š” ํ•œ ์—ํ”ผ์†Œ๋“œ๊ฐ€ ๋๋‚˜๋ฉด ์–ด๋–ป๊ฒŒ ํ•˜๋Š” ์ง€์— ๋Œ€ํ•ด์„œ ๋งํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ multiple episode๋ฅผ ์ง„ํ–‰ํ•  ๊ฒฝ์šฐ์—๋Š” ํ•œ episode๋งˆ๋‹ค ์–ป์—ˆ๋˜ return์„ ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•ด์•ผํ• ๊นŒ์š”? MC์—์„œ๋Š” ๋‹จ์ˆœํžˆ ํ‰๊ท ์„ ์ทจํ•ด์ค๋‹ˆ๋‹ค. ํ•œ episode์—์„œ ์–ด๋–ค state์— ๋Œ€ํ•ด return์„ ๊ณ„์‚ฐํ•ด๋†จ๋Š”๋ฐ ๋‹ค๋ฅธ episode์—์„œ๋„ ๊ทธ state๋ฅผ ์ง€๋‚˜๊ฐ€์„œ ๋‹ค์‹œ ์ƒˆ๋กœ์šด return์„ ์–ป์—ˆ์„ ๊ฒฝ์šฐ์— ๊ทธ ๋‘๊ฐœ์˜ return์„ ํ‰๊ท ์„ ์ทจํ•ด์ฃผ๋Š” ๊ฒƒ์ด๊ณ  ๊ทธ return๋“ค์ด ์Œ“์ด๋ฉด ์Œ“์ผ์ˆ˜๋ก true value function์— ๊ฐ€๊นŒ์›Œ์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

There is one thing to think about: what should we do if a state is visited twice within a single episode? Depending on how we handle this, there are two variants:

  • First-visit Monte-Carlo Policy evaluation

  • Every-visit Monte-Carlo Policy evaluation

๋ง ๊ทธ๋Œ€๋กœ First-visit์€ ์ฒ˜์Œ ๋ฐฉ๋ฌธํ•œ state๋งŒ ์ธ์ •ํ•˜๋Š” ๊ฒƒ์ด๊ณ (๋‘ ๋ฒˆ์งธ ๊ทธ state ๋ฐฉ๋ฌธ์— ๋Œ€ํ•ด์„œ๋Š” return์„ ๊ณ„์‚ฐํ•˜์ง€ ์•Š๋Š”) every-visit์€ ๋ฐฉ๋ฌธํ•  ๋•Œ๋งˆ๋‹ค ๋”ฐ๋กœ ๋”ฐ๋กœ return์„ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ๋‘ ๋ฐฉ๋ฒ•์€ ๋ชจ๋‘ ๋ฌดํ•œ๋Œ€๋กœ ๊ฐ”์„ ๋•Œ true value function์œผ๋กœ ์ˆ˜๋ ดํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ First-visit์ด ์ข€ ๋” ๋„๋ฆฌ ์˜ค๋žซ๋™์•ˆ ์—ฐ๊ตฌ๋˜์–ด ์™”์œผ๋ฏ€๋กœ ์—ฌ๊ธฐ์„œ๋Š” First-visit MC์— ๋Œ€ํ•ด์„œ ๋‹ค๋ฃจ๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” First-Visit Monte-Carlo Policy Evaluation์— ๋Œ€ํ•œ Silver ๊ต์ˆ˜๋‹˜ ์ˆ˜์—…์˜ ์ž๋ฃŒ์ž…๋‹ˆ๋‹ค.

4. Incremental Mean

์œ„์˜ ํ‰๊ท ์„ ์ทจํ•˜๋Š” ์‹์„ ์ข€ ๋” ๋ฐœ์ „์‹œ์ผœ๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์ €ํฌ๊ฐ€ ํ•™์Šตํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์—ฌ๋Ÿฌ๊ฐœ๋ฅผ ๋ชจ์•„๋†“๊ณ  ํ•œ ๋ฒˆ์— ํ‰๊ท ์„ ์ทจํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๊ณ  ํ•˜๋‚˜ ํ•˜๋‚˜ ๋”ํ•ด๊ฐ€๋ฉฐ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์•„๋ž˜๊ณผ ๊ฐ™์€ Incremental Mean์˜ ์‹์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด Incremental Mean์„ ์œ„์˜ First-visit MC์— ์ ์šฉ์‹œํ‚ค๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๊ฐ™์€ ์‹์„ ๋‹ค๋ฅด๊ฒŒ ํ‘œํ˜„ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด ๋•Œ, ๋ถ„์ˆ˜๋กœ ๊ฐ€์žˆ๋Š” N(St)๊ฐ€ ์ ์  ๋ฌดํ•œ๋Œ€๋กœ ๊ฐ€๊ฒŒ ๋˜๋Š”๋ฐ ์ด๋ฅผ ์•ŒํŒŒ๋กœ ๊ณ ์ •์‹œ์ผœ๋†“์œผ๋ฉด ํšจ๊ณผ์ ์œผ๋กœ ํ‰๊ท ์„ ์ทจํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๋งจ ์ฒ˜์Œ ์ •๋ณด๋“ค์— ๋Œ€ํ•ด์„œ ๊ฐ€์ค‘์น˜๋ฅผ ๋œ ์ฃผ๋Š” ํ˜•ํƒœ๋ผ๊ณ  ๋ณด์‹œ๋ฉด ๋  ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. (Complementary filter์— ๋Œ€ํ•ด์„œ ์•„์‹œ๋Š” ๋ถ„์€ ์ดํ•ด๊ฐ€ ์‰ฌ์šธ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค) ์ด์™€ ๊ฐ™์ด ํ•˜๋Š” ์ด์œ ๋Š” ๊ฐ•ํ™”ํ•™์Šต์ด stationary problem์ด ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๋งค episode๋งˆ๋‹ค ์ƒˆ๋กœ์šด policy๋ฅผ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์—(์•„์ง policy์˜ update์— ๋Œ€ํ•ด์„œ๋Š” ์ด์•ผ๊ธฐํ•˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค) non-stationary problem์ด๋ฏ€๋กœ updateํ•˜๋Š” ์ƒ์ˆ˜๋ฅผ ์ผ์ •ํ•˜๊ฒŒ ๊ณ ์ •ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

5. Backup Diagram

์ด๋Ÿฌํ•œ MC์˜ backup๊ณผ์ •์„ ๊ทธ๋ฆผ์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด ์•„๋ž˜๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

DP์˜ backup diagram์—์„œ๋Š” one step๋งŒ ํ‘œ์‹œํ•œ ๊ฒƒ์— ๋น„ํ•ด์„œ MC์—์„œ๋Š” terminal state๊นŒ์ง€ ์ญ‰ ์ด์–ด์ง‘๋‹ˆ๋‹ค. ๋˜ํ•œ DP์—์„œ๋Š” one-step backup์—์„œ ๊ทธ ๋‹ค์Œ์œผ๋กœ ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  state๋“ค๋กœ ๊ฐ€์ง€๊ฐ€ ๋ป—์—ˆ์—ˆ๋Š”๋ฐ MC์—์„œ๋Š” sampling์„ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•˜๋‚˜์˜ ๊ฐ€์ง€๋กœ terminal state๊นŒ์ง€ ๊ฐ€๊ฒŒ๋ฉ๋‹ˆ๋‹ค.

Monte-Carlo๋Š” ์ฒ˜์Œ์— random process๋ฅผ ํฌํ•จํ•œ ๋ฐฉ๋ฒ•์ด๋ผ๊ณ  ๋งํ–ˆ์—ˆ๋Š”๋ฐ episode๋งˆ๋‹ค updateํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ฒ˜์Œ ์‹œ์ž‘์ด ์–ด๋””์—ˆ๋ƒ์— ๋”ฐ๋ผ์„œ ๋˜ํ•œ ๊ฐ™์€ state์—์„œ ์™ผ์ชฝ์œผ๋กœ ๊ฐ€๋ƒ, ์˜ค๋ฅธ ์ชฝ์œผ๋กœ ๊ฐ€๋ƒ์— ๋”ฐ๋ผ์„œ ์ „ํ˜€ ๋‹ค๋ฅธ experience๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ randomํ•œ ์š”์†Œ๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ์–ด์„œ MC๋Š” variance๊ฐ€ ๋†’์Šต๋‹ˆ๋‹ค. ๋Œ€์‹ ์— random์ธ๋งŒํผ ์–ด๋”˜๊ฐ€์— ์น˜์šฐ์น˜๋Š” ๊ฒฝํ–ฅ์€ ์ ์–ด์„œ bias๋Š” ๋‚ฎ์€ ํŽธ์ž…๋‹ˆ๋‹ค.

6. Example

Silver๊ต์ˆ˜๋‹˜ ๊ฐ•์˜์—์„œ๋Š” Blackjack ๊ฒŒ์ž„์„ Monte-Carlo policy evaluation์˜ ์˜ˆ์ œ๋กœ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค. ์ €์—๊ฒŒ๋Š” ๋ธ”๋ž™์žญ ์ž์ฒด์˜ ๋ฃฐ์„ ๋ชฐ๋ผ์„œ ์ด ์˜ˆ์ œ๊ฐ€ ์–ด๋ ต๊ฒŒ ๋‹ค๊ฐ€์™”๊ธฐ ๋•Œ๋ฌธ์— ์˜ˆ์ œ ์ž์ฒด์— ๋Œ€ํ•ด์„œ๋Š” ์„ค๋ช…ํ•˜์ง€ ์•Š๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ฒŒ์ž„์˜ ์„ค์ •์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

์ด์™€ ๊ฐ™์€ ๊ฒŒ์ž„์—์„œ policy๋ฅผ ์ •ํ•ด๋†“๊ณ  ๊ณ„์† computer๋กœ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์„ ๋Œ๋ฆฌ๋ฉด์„œ ์–ป์€ reward๋“ค๋กœ sample return์„ ๊ตฌํ•˜๊ณ  ๊ทธ return๋“ค์„ incrementally mean์„ ์ทจํ•˜๋ฉด ์•„๋ž˜๊ณผ ๊ฐ™์€ ๊ทธ๋ž˜ํ”„๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Ace์˜ ์„ค์ •์— ๋”ฐ๋ผ์„œ ์œ„ ์•„๋ž˜๋กœ ๋‚˜๋‰˜๋Š” ๋ฐ ์œ„์˜ ๋‘ ๊ฐœ์˜ ๊ทธ๋ž˜ํ”„๋งŒ ๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. policy๋Š” ๊ทธ๋ž˜ํ”„ ์•„๋ž˜์— ์„ค์ •๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. state๋Š” x,y 2์ฐจ์›์œผ๋กœ ํ‘œํ˜„๋˜๋ฉฐ z์ถ•์€ ๊ฐ state์˜ value function์ด ๋ฉ๋‹ˆ๋‹ค. episode๋ฅผ ์ง€๋‚จ์— ๋”ฐ๋ผ ์ ์ฐจ ์–ด๋– ํ•œ ๋ชจ์–‘์œผ๋กœ ์ˆ˜๋ ดํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. 500,000๋ฒˆ์ฏค ๊ฒŒ์ž„์„ playํ–ˆ์„ ๊ฒฝ์šฐ์— ๊ฑฐ์˜ ์ˆ˜๋ ดํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•ž์œผ๋กœ ํ•  Control์—์„œ๋Š” ์ด value function์„ ํ† ๋Œ€๋กœ policy๋ฅผ updateํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์–ด๋–ค ๊ฒƒ์ด ์ข‹์€ policy์ธ์ง€ ๋น„๊ต๊ฐ€ ๊ฐ€๋Šฅํ•ด์ง‘๋‹ˆ๋‹ค.
