


Christoph F. Eick et al.
COSC 4368 Course Project, Spring 2020
Using Reinforcement Learning to Play UH Leduc Hold'em Poker Intelligently
Group Project (usually 4-5 students per group)
Version 3, last updated March 9, 3p
Deadlines: The project deliverable is due April 11; group project presentations will be on April 13!

In this project we will use reinforcement learning, in particular Q-learning, to develop an intelligent agent that plays a variation of the game of UH Leduc Hold'em Poker, also called UHLPO, a simplified form of poker played by two players. The agents you develop will learn useful game strategies by playing tournaments against each other and against agents that use a predefined strategy, and will use the obtained feedback to become better players. Learning objectives of the group project include:
- Understanding basic reinforcement learning concepts such as utilities, policies, learning rates, discount rates, and their interactions.
- Obtaining experience in designing agent-based systems that explore and learn in an initially unknown environment and that are capable of adapting to changes.
- Learning how to conduct tournaments to train gaming agents.
- Learning how to conduct experiments that evaluate the performance of reinforcement learning systems, and learning to interpret such results.
- Developing path visualization and analysis techniques to summarize the strategy learnt by a particular agent.
- Learning to develop AI software in a team.

Fig. 1: Happy Poker Player

Rules of the UH-Leduc-Hold'em Poker Game: UHLPO is a two-player poker game. The deck used contains multiple copies of eight different cards: aces, kings, queens, and jacks in hearts and spades, and is shuffled prior to playing a hand. At the beginning of a hand, each player pays a one-chip ante to the pot and receives one private card. A round of betting then takes place, starting with player one. After the round of betting, a single public card is revealed from the deck, which both players use to construct their hand. This card is called the flop. Another round of betting occurs after the flop, again starting with player one, and then a showdown takes place. At the showdown, if either player has paired their private card with the public card, they win all the chips in the pot. In the event that neither player pairs, the player with the higher card is declared the winner. The players split the money in the pot if they have the same private card. Moreover, players might fold a "bad" hand, avoiding further losses. Player 1 is called S and player 2 is called T in the remainder of the document.

Fig. 2: The 18-Card UH-Leduc-Hold'em Poker Deck

UH-Leduc Hold'em Deck: This is a "queeny" 18-card deck from which we draw the players' cards and the flop without replacement. The deck contains three copies of the heart and spade queens and two copies of each other card; that is, the deck contains 50% more queens than aces, kings, or jacks. Therefore, your chance of pairing after receiving a queen, although the queen is low in rank, is significantly higher than after receiving an ace, king, or jack.
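The deck composition above translates directly into code. The following is a minimal Python sketch of building the 18-card deck and dealing the two private cards and the flop without replacement; the (rank, suit) encoding and the function names are illustrative choices, not part of the specification.

import random

# Illustrative card encoding: (rank, suit); ranks ordered J < Q < K < A.
RANKS = ["J", "Q", "K", "A"]
SUITS = ["hearts", "spades"]

def build_uh_leduc_deck():
    """18-card UH Leduc deck: 3 copies of each queen, 2 copies of every other card."""
    deck = []
    for rank in RANKS:
        for suit in SUITS:
            copies = 3 if rank == "Q" else 2
            deck.extend([(rank, suit)] * copies)
    assert len(deck) == 18
    return deck

def deal_hand(rng):
    """Deal one private card to S, one to T, and the flop, all without replacement."""
    deck = build_uh_leduc_deck()
    rng.shuffle(deck)
    s_card, t_card, flop = deck[0], deck[1], deck[2]
    return s_card, t_card, flop

# Example: deal_hand(random.Random(42))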
Special UH-Leduc-Hold'em Poker Betting Rules: The ante is $1, and raises are exactly $3. Each player can only check once and raise once; in the case that a player is not allowed to check again and has not bid any money in phase 1, she has either to fold her hand, losing her money, or to raise her bet. Only player 2 can raise a raise.

Implement the following UHLPO agents (agents S and T learn to play the game relying on Q-learning):
- S: The agent your group develops to play position one for UH Leduc Hold'em Poker.
- T: The agent your group develops to play position two for UH Leduc Hold'em Poker.
- SSALLY: Plays position 1. Strategy of Phase 1: checks or folds holding a jack or the heart king; checks, or calls/raises if no check is available, holding a queen or the spade king; and raises (or calls if no raise is available) holding an ace. Strategy of Phase 2: raises (or calls if no raise is available) holding a pair or the spade ace; does not fold and prefers to check or call holding no pair but the heart ace or the spade king; and checks or folds holding no pair and the heart king, a queen, or a jack.
- TSALLY: Plays position 2. Strategy of Phase 1: checks or folds holding a jack or the heart king; checks, or calls if no check is available, holding a queen or the spade king; and raises (or calls if no raise is available) holding an ace. Strategy of Phase 2: raises (or calls if no raise is available) holding a pair or the spade ace; checks or calls but does not fold holding no pair but the heart ace or the spade king; and checks or folds holding no pair and the heart king, a queen, or a jack.

Train your S agent by letting her play against your T agent and against TSALLY, and train your T agent by letting him play against your S agent and SSALLY. As the Q-table contains what was learnt in a run, it is important that the software you develop has the capability to store the "learnt" Q-table in a file and to load Q-tables from files. In the case that there is no prior knowledge, initialize the entries of the Q-table with 0.

In particular, in the COSC 4368 group project you will use Q-learning and SARSA to play UH Leduc Poker intelligently; you will also conduct experiments using different Q-learning variations, different parameters and policies, and summarize and interpret the experimental results. If feasible, we will also try to have a tournament in which each group's agents compete in early April. Moreover, you will develop techniques that try to summarize the playing strategy an agent employs after being trained. In the training you perform and the experiments you conduct, use the learning rate α=0.35 when training agents and α=0.15 when competing in tournaments; the discount rate is assumed to be γ=1.

The following 3 policies will be used in the experiments and tournaments:
- PRANDOM: Choose an applicable operator randomly; every operator has the same chance to be chosen.
- PEGREEDY: Choose the operator with the highest q-value with probability 1-ε (break ties by rolling a dice for operators with the same highest q-value) and choose a different operator with probability ε; each other operator has the same chance to be chosen. If you use this policy, choose ε to be 0.08 when playing tournaments and 0.20 when training agents. This policy is called ε-greedy.
- PGREEDY: Apply the applicable operator with the highest q-value (break ties by rolling a dice for operators with the same q-value).
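The three policies and the Q-table save/load requirement described above can be prototyped with a few helper functions. The sketch below is one possible Python realization; it assumes a Q-table stored as a dictionary keyed by (state, operator) pairs with JSON-serializable states and operators, which is our own assumption since the specification leaves the state encoding and file format open.

import json
import random

def p_random(applicable_ops, rng):
    """PRANDOM: every applicable operator has the same chance of being chosen."""
    return rng.choice(applicable_ops)

def p_greedy(q_table, state, applicable_ops, rng):
    """PGREEDY: pick an operator with the highest q-value; break ties randomly."""
    best_q = max(q_table.get((state, op), 0.0) for op in applicable_ops)
    best_ops = [op for op in applicable_ops if q_table.get((state, op), 0.0) == best_q]
    return rng.choice(best_ops)

def p_egreedy(q_table, state, applicable_ops, rng, epsilon):
    """PEGREEDY: highest-q operator with probability 1-epsilon, otherwise one of the
    remaining operators uniformly at random (epsilon = 0.20 when training, 0.08 in
    tournaments, per the specification)."""
    greedy_op = p_greedy(q_table, state, applicable_ops, rng)
    others = [op for op in applicable_ops if op != greedy_op]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return greedy_op

def save_q_table(q_table, path):
    """Store the learnt Q-table in a file; unseen entries simply default to 0."""
    with open(path, "w") as f:
        json.dump([[list(key), value] for key, value in q_table.items()], f)

def load_q_table(path):
    """Load a previously learnt Q-table from a file."""
    with open(path) as f:
        return {tuple(key): value for key, value in json.load(f)}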
Figure 3: Visualization of an Attractive Path for a Search Problem

Benchmarks: (missing: add links to the Benchmark Files)
- Benchmark0: 8 examples (for initial testing)
- Benchmark1: 800 examples
- Benchmark2: 1600 examples
- Benchmark3: 2800 examples

Experiments to be conducted: (very preliminary draft, subject to change)
1. Let your S agent play against TSALLY using Benchmark2! Use Q-learning to update the Q-table; initialize S's Q-table entries to 0, use the policy PRANDOM for the first 400 games and then switch to the policy PEGREEDY with ε=0.2 for the remaining games; use learning rate α=0.35 and discount rate γ=1. Report the winnings and losses that occurred and analyze whether agent S's performance improved during the run!
2. Same as Experiment 1, but use SARSA instead of Q-learning (a sketch of both update rules appears after the experiment descriptions)! Report the winnings and losses that occurred and analyze whether agent S's performance improved during the run! Compare the results! Interpret the Q-table of the "better" approach (the one whose bank account was higher after the run), and save the Q-table of the "better" S agent for further use. Moreover, compare the two Q-tables obtained in Experiments 1 and 2: are they similar, or are there significant differences?
3. Let your T agent play against SSALLY using Benchmark2. Use either Q-learning or SARSA to update the Q-table; initialize T's Q-table to 0, use the policy PRANDOM for the first 400 games and then switch to the policy PEGREEDY with ε=0.2 for the remaining games; use learning rate α=0.35 and discount rate γ=1. Report the winnings and losses that occurred and analyze whether agent T improved during the run. Save the final Q-table.
4. Let either S play against TSALLY again, or T against SSALLY again, for Benchmark1, but this time load the Q-table you learnt in Experiments 1, 2, and 3 at the beginning of the tournament; that is, Q-tables are not initialized to 0. Use the policy PEGREEDY with ε=0.08 or the policy PGREEDY for the 800 games; use learning rate α=0.15 and discount rate γ=1. Report the winnings and losses that occurred and analyze whether your agent's performance improved in comparison to the earlier experiment.
5. Let your S agent play against your T agent, using a Q-learning strategy of your own choice, for Benchmark2; use the policy PRANDOM for the first 400 games and then switch to the policy PEGREEDY with ε=0.2 for the remaining 1200 games; use learning rate α=0.35 and discount rate γ=1. Report the experimental findings! Compare agent S's Q-table in this experiment with the Q-tables of the first two experiments.

Alternatively, instead of Experiment 5, you could have your S (T) agent play against the T (S) agent of another group using Benchmark1. In this case each group loads its best Q-table before the start of the tournament. Use the policy PEGREEDY with ε=0.08 for the 800 games; use learning rate α=0.15 and discount rate γ=1. The respective agents use either Q-learning or SARSA (your choice!) during the game. Report the winnings and losses that occurred. If you succeed in having your agent play against the agent of another group, you will receive a small amount of extra credit.
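Since Experiments 1 and 2 differ only in the update rule, the following minimal sketch contrasts the Q-learning and SARSA updates with the training parameters given above (α=0.35, γ=1); the dictionary-based Q-table and the reward convention (money won or lost by the agent) are illustrative assumptions, not requirements.

def q_learning_update(q_table, state, op, reward, next_state, next_applicable_ops,
                      alpha=0.35, gamma=1.0):
    """Off-policy update: bootstrap on the best q-value among the operators
    applicable in the next state (terminal states have none, hence default 0)."""
    best_next = max((q_table.get((next_state, a), 0.0) for a in next_applicable_ops),
                    default=0.0)
    old = q_table.get((state, op), 0.0)
    q_table[(state, op)] = old + alpha * (reward + gamma * best_next - old)

def sarsa_update(q_table, state, op, reward, next_state, next_op,
                 alpha=0.35, gamma=1.0):
    """On-policy update: bootstrap on the q-value of the operator the policy
    actually chooses in the next state (None at a terminal state)."""
    next_q = q_table.get((next_state, next_op), 0.0) if next_op is not None else 0.0
    old = q_table.get((state, op), 0.0)
    q_table[(state, op)] = old + alpha * (reward + gamma * next_q - old)

The only difference between the two updates is the bootstrap term: Q-learning uses the maximum q-value over the next state's applicable operators, whereas SARSA uses the q-value of the operator the policy actually selects next.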
Finally, take what you believe is the "best" Q-table and assess to what extent the employed reinforcement learning approach was able to learn how to play UHLPO intelligently! Also assess how its underlying strategy compares with the respective SALLY agents (SSALLY and TSALLY). Across all experiments, assess which experiment obtained the best results. Next, analyze the various Q-tables you created and try to identify the underlying strategies in the obtained Q-tables, if there are any. Moreover, assess in every experiment whether your system gets better during play.

Project Demos: TBD

Moreover:
- Make sure that you use different random-generator seeds in different runs of the same experiment to obtain different results; having identical results for the 2 runs of the same experiment is unacceptable. It is okay to report and interpret the Q-tables for only the better of the two runs of each experiment, but you should report the performance variables for all ten runs.
- Associate a bank account with each agent that is initialized with 0 and updated based on money won/lost in a game (a minimal sketch of seeded runs with bank accounts appears at the end of this specification).
- We recommend the reduced state space recommended in the Leduc World slide show!
- You should use the traditional Q-learning and SARSA algorithms in the project and not any other Q-learning variations or reinforcement learning algorithms!
- Students that provide good methods for visualizing Q-tables and good visualizations for the analysis of attractive paths obtain extra credit.
- Evidence of the running of your system has to be provided (see the demo section).
- Groups that develop a very well designed and visually appealing component that allows watching tournaments between 2 agents will receive a small amount of extra credit.
- Write an 8-12 page report that summarizes the findings of the project. Be aware of the fact that about 20% of the points available for this project are allocated to the interpretation of the experimental results.
- Finally, submit the source code of the software you wrote in addition to your project report and be ready to demo the system you developed.

More detailed project submission guidelines will be added to this specification no later than March 21, 2020.

Project Links
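The following is a minimal, hypothetical sketch of the seed and bank-account requirements listed above: one experiment is run twice with different seeds, and each agent's bank account starts at 0 and is updated by the money won or lost per hand. The play_hand argument (replaced here by a dummy stand-in) would be your group's own hand-playing routine; the seed values shown are arbitrary.

import random

def run_experiment(play_hand, num_games, seed):
    """Run one tournament with its own random seed and per-agent bank accounts.
    play_hand(rng) is assumed to play one hand and return the net amount that
    player S won (negative if S lost money on the hand)."""
    rng = random.Random(seed)
    bank = {"S": 0, "T": 0}
    for _ in range(num_games):
        s_winnings = play_hand(rng)
        bank["S"] += s_winnings
        bank["T"] -= s_winnings
    return bank

if __name__ == "__main__":
    # Placeholder hand function, only to show the seeding; not real UHLPO play.
    dummy_hand = lambda rng: rng.choice([-4, -1, 1, 4])
    for seed in (17, 91):  # the 2 runs of the same experiment need distinct seeds
        print(seed, run_experiment(dummy_hand, num_games=1600, seed=seed))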