Porovnání metod explorace v částečně pozorovatelných stochastických hrách

Jakub Rada

Comparing Exploration Methods in Partially Observable Stochastic Games

Document type

bachelor thesis

Author

Jakub Rada

Supervisor

Bošanský Branislav

Reviewer

Šír Gustav

Field of study

Fundamentals of Artificial Intelligence and Computer Science

Study programme

Open Informatics

Degree-granting institution

Department of Cybernetics

Rights

A university thesis is a work protected by the Copyright Act. Extracts, copies and transcripts of the thesis are allowed for personal use only and at one's own expense. Use of the thesis must comply with the Copyright Act http://www.mkcr.cz/assets/autorske-pravo/01-3982006.pdf and with citation ethics http://knihovny.cvut.cz/vychova/vskp.html

Abstract

Partially observable stochastic games (POSGs) model many real-world situations involving two independent agents. Their one-sided subclass (OS-POSGs) can be approximately solved by the HSVI algorithm, which estimates the optimal value function of the game using two value functions, a lower bound and an upper bound. The approximation is refined iteratively by Bellman-style point-based updates of both bounds at belief points selected by a heuristic. However, this heuristic, which is based on the strategies of both players and the gap between the bounding functions, has not been proven to be the optimal exploration method for searching the space of belief points. In reinforcement learning, multi-armed bandit algorithms are a tool for balancing exploration and exploitation. It is therefore possible to use bandits as an alternative way of exploring the belief-point space and thereby refine the bounds in the HSVI algorithm. Multi-armed bandits can also provide a similar alternative approach to solving stage games in the value iteration algorithm for fully observable stochastic games (SGs). Moreover, using bandits eliminates the need for linear programming, which may cause the poor scalability of the original algorithms. The goals of this thesis were to integrate this novel exploration approach into value iteration and HSVI and to compare a selection of multi-armed bandit algorithms on both SGs and OS-POSGs.