Christoph F. Eick et al.
COSC 4368 Course Project Spring 2020
Using Reinforcement Learning to Play UH Leduc Hold’em Poker Intelligently
Individual Project
Version 5
Last Updated: March 22, 3p
Deadlines: Friday, April 17, 11p

In this project we will use reinforcement learning, particularly Q-learning, to develop an intelligent agent to play a variation of the game of UH Leduc Hold’em Poker (also called UHLPO), a simplified form of poker that is played by two players. The agents you develop will learn useful game strategies by playing tournaments against each other and against agents that use a predefined strategy, and will use the feedback obtained to become better players.

Learning objectives of the project include:
- Understanding basic reinforcement learning concepts such as utilities, policies, learning rates, discount rates and their interactions.
- Obtaining experience in designing agent-based systems that explore and learn in an initially unknown environment and that are capable of adapting to changes.
- Learning how to conduct tournaments to train gaming agents.
- Learning how to conduct experiments that evaluate the performance of reinforcement learning systems and learning to interpret such results.
- Developing path visualization and analysis techniques to summarize the strategy that was learnt by a particular agent.
- Learning to develop AI software in a team.

Fig. 1: Happy Poker Player

Rules of the UH-Leduc-Holdem Poker Game: UHLPO is a two-player poker game. The deck used contains multiple copies of eight different cards: aces, kings, queens, and jacks in hearts and spades, and is shuffled prior to playing a hand. At the beginning of a hand, each player pays a one-chip ante to the pot and receives one private card. A round of betting then takes place, starting with player one. After the round of betting, a single public card is revealed from the deck, which both players use to construct their hand. This card is called the flop.
Another round of betting occurs after the flop, again starting with player one, and then a showdown takes place. At a showdown, if either player has paired their private card with the public card, they win all the chips in the pot. In the event neither player pairs, the player with the higher card is declared the winner. The players split the money in the pot if they have the same private card. Moreover, players might fold their “bad” hand, avoiding further losses. Player 1 is called S and player 2 is called T in the remainder of the document.

Fig. 2: The 18 Card UH-Leduc-Hold’em Poker Deck

UH-Leduc Hold’em Deck: This is a “queeny” 18-card deck from which we draw the players’ cards and the flop without replacement. The deck contains three copies of the heart and spade Q and two copies of each other card. That is, the deck contains 50% more queens than aces, kings or jacks. Therefore, although the Q is low in rank, your chance of getting a pair after receiving a queen is significantly higher than when holding an ace, king or jack.

Special UH-Leduc-Hold’em Poker Betting Rules: The ante is $1, and raises are exactly $3. Each player can only check once and raise once; if a player who did not bid any money in phase 1 is not allowed to check again, she has to either fold her hand, losing her money, or raise her bet. Only player 2 can raise a raise.

A subset of the following agents will be implemented in this project.
- S: An agent which uses reinforcement learning to play position one for UH Leduc Hold’em Poker.
- T: An agent which uses reinforcement learning to play position two for UH Leduc Hold’em Poker.
- SSALLY: plays position 1. Strategy of Phase 1: checks or folds holding a J or the heart K, checks or calls holding a Q or the spade K, and raises (or calls if no raise is available) holding an A. Strategy of Phase 2: raises (or calls if no raise is available) holding a pair or the spade ace; does not fold and prefers to check or call holding no pair but the heart A or the spade K; and checks or folds having no pair and the heart king, a queen or a jack.
- TSALLY: plays position 2. Strategy of Phase 1: checks or folds holding a J or the heart K, checks (or calls if no check is available) holding a queen or the spade K, and raises (or calls if no raise is available) holding an A. Strategy of Phase 2: raises (or calls if no raise is available) holding a pair or the spade ace; checks or calls but does not fold holding no pair but the heart A or the spade K; and checks or folds having no pair and the heart king, a queen or a jack.

In the COSC 4368 Course Project you will use Q-learning and SARSA to play UH Leduc Poker intelligently; you will also conduct experiments using different Q-learning variations, different parameters and policies, and summarize and interpret the experimental results. Moreover, you will develop techniques that try to summarize the playing strategy the agent employs after being trained. You will implement 2 of these agents: either S and TSALLY or T and SSALLY (your choice); however, as agent S has a larger state space, a bonus of 5% will be given to students who implement S and TSALLY. As the Q-table contains what you learnt in a run, it is desirable that the software you develop has the capability to store the “learnt” Q-table in a file and to load Q-tables from files.
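The store/load capability mentioned above could be sketched as follows. This is only an illustration under assumptions that are not part of the specification: the Q-table is held as a nested dictionary (state string, then operator name, then q-value), JSON is used as the file format, and the names `new_q_table`, `save_q_table` and `load_q_table` are hypothetical.

```python
import json
from collections import defaultdict


def new_q_table():
    """Nested mapping state -> operator -> q-value; unseen entries read as 0.0."""
    return defaultdict(lambda: defaultdict(float))


def save_q_table(q, path):
    """Write the Q-table to a JSON file (file format is an assumption)."""
    plain = {state: dict(ops) for state, ops in q.items()}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(plain, f, indent=2, sort_keys=True)


def load_q_table(path):
    """Read a Q-table previously written by save_q_table."""
    q = new_q_table()
    with open(path, encoding="utf-8") as f:
        for state, ops in json.load(f).items():
            for op, value in ops.items():
                q[state][op] = float(value)
    return q
```

State keys must be strings for JSON; how a state is encoded (for example private card, phase and betting history) is left to your own state-space design.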
In the case that there is no prior knowledge, initialize the entries of the Q-table with 0. In the training you perform and the experiments you conduct, use a learning rate of α=0.35 when training agents and α=0.15 when competing in tournaments; the discount rate is assumed to be γ=1.

The following 3 policies will be used in the experiments and tournaments:
- PRANDOM: Choose an applicable operator randomly; every operator has the same chance to be chosen.
- PEGREEDY: Choose the operator with the highest q-value with probability 1-ε (break ties by rolling a dice for operators with the same highest q-value) and choose a different operator with probability ε; each other operator has the same chance to be chosen. If you use this policy, choose ε to be 0.08 when playing tournaments and 0.20 when training agents. The policy is called ε-greedy.
- PGREEDY: Apply the applicable operator with the highest q-value (break ties by rolling a dice for operators with the same q-value).

Figure 3: Visualization of an Attractive Path for a Search Problem

Benchmarks (links to files are on the 4368 webpage):
- Benchmark0: 8 examples (for initial testing and demos)
- Benchmark1: 800 examples
- Benchmark2: 1600 examples
- Benchmark3: 2800 examples

Experiments to be conducted:
Experiment 1: Either let your agent S play against TSALLY or let your agent T play against SSALLY using Benchmark2! Use Q-learning to update the Q-table; initialize S’s/T’s Q-table entries to 0, use the policy PRANDOM for the first 400 games and then switch to policy PEGREEDY with ε=0.2 for the remaining games; use a learning rate α=0.35 and a discount rate γ=1. Report the winnings and losses that occurred. Analyze if agent S’s/T’s performance improved during the run!
Experiment 2: Repeat Experiment 1 using Benchmark2; make sure that you use a different seed for your random generator. Evaluate if the obtained results are similar or not.
Experiment 3: Same as Experiment 1 but use SARSA instead of Q-learning!
Report the winnings and losses that occurred and analyze if agent S’s/T’s performance improved during the run! Compare the results! Interpret the Q-table of the “better” approach (the one whose bank account was higher after the run) and save the Q-table of the “better” S/T agent for further analysis. Moreover, compare the two Q-tables obtained in experiments 1 and 2; are they similar or are there significant differences?
Experiment 4: Rerun Experiment 1 or Experiment 3 with a learning rate α=0.2; assess if the results are better or worse.

Demo how your two agents actually play UHLPO by running Benchmark0 after one of the 4 experiments; output the hands played, the two agents’ betting decisions and who won how much for each of the 8 hands.

Finally, take what you believe is the “best” Q-table and assess to what extent the employed reinforcement learning approach was able to learn how to play UHLPO intelligently! Also assess how its underlying strategy compares with the respective SALLY agents (SSALLY and TSALLY). Moreover, assess in every experiment if your player gets better during play. Finally, across all experiments, assess which experiment obtained the best results.

Project Demos: You should be able to demo your system by reading a benchmark file and then running your system, outputting a game summary at the end. Moreover,
- Make sure that you use different random generator seeds in different runs of the same experiment to obtain different results; having identical results for the 2 runs of the same experiment is unacceptable.
- It is okay just to report and interpret the Q-tables for the better of the two runs for each experiment, but you should report the performance variables for all ten runs.
- Associate a bank account with each agent that is initialized with 0 and updated based on money won/lost in a game.
- We recommend the reduced state space recommended in the Leduc World slide show!
- You should use the traditional Q-learning and SARSA algorithms in the project and not any other Q-learning variations or reinforcement learning algorithms!
- Students that provide good methods for visualizing Q-tables and good visualizations for the analysis of attractive paths obtain extra credit.
- Evidence of the running of your system has to be provided; evidence of a partially running system will help you to get partial credit.
- Write a 6-10 page report that summarizes the findings of the project. Be aware of the fact that about 20% of the points available for this project are allocated to the interpretation of the experimental results.
- Finally, submit the source code of the software you wrote in addition to your project report and be ready to demo the system you developed.

More detailed project submission guidelines will be added to this specification no later than March 27, 2020.
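As an illustration of the three policies PRANDOM, PEGREEDY and PGREEDY described in this specification, here is a minimal sketch. It assumes (these choices are not prescribed by the specification) that the Q-table is a dictionary keyed by (state, operator) pairs, that missing entries count as 0, and that a seeded `random.Random` instance is passed in so that different runs can use different seeds. All function names are hypothetical.

```python
import random


def q_value(q, state, op):
    """Look up a q-value; entries never written are treated as 0."""
    return q.get((state, op), 0.0)


def p_random(q, state, ops, rng):
    """PRANDOM: every applicable operator has the same chance."""
    return rng.choice(ops)


def p_greedy(q, state, ops, rng):
    """PGREEDY: highest q-value, ties broken uniformly at random."""
    best = max(q_value(q, state, op) for op in ops)
    return rng.choice([op for op in ops if q_value(q, state, op) == best])


def p_egreedy(q, state, ops, rng, epsilon=0.20):
    """PEGREEDY: greedy choice with probability 1-epsilon; otherwise one of
    the other operators, each with the same chance."""
    greedy = p_greedy(q, state, ops, rng)
    others = [op for op in ops if op != greedy]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return greedy
```

For example, `p_egreedy(q, state, ["check", "raise", "fold"], random.Random(42), epsilon=0.08)` would correspond to the tournament setting; changing the seed changes the sequence of exploratory choices.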

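The difference between the Q-learning and SARSA updates used in the experiments can be sketched as follows. This is a sketch of the standard textbook update rules, using the learning rate α=0.35 and discount rate γ=1 stated in this specification; the (state, operator)-keyed dictionary, the treatment of a finished hand as having no next operators, and the function names are assumptions made for illustration.

```python
ALPHA = 0.35  # learning rate used when training (from the specification)
GAMMA = 1.0   # discount rate (from the specification)


def q_learning_update(q, s, a, reward, s_next, next_ops,
                      alpha=ALPHA, gamma=GAMMA):
    """Off-policy update: bootstrap from the best operator in s_next.
    An empty next_ops list marks the end of a hand (future value 0)."""
    best_next = max((q.get((s_next, b), 0.0) for b in next_ops), default=0.0)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (reward + gamma * best_next - old)


def sarsa_update(q, s, a, reward, s_next, a_next,
                 alpha=ALPHA, gamma=GAMMA):
    """On-policy update: bootstrap from the operator a_next that the policy
    actually chose in s_next; a_next=None marks the end of a hand."""
    next_q = 0.0 if a_next is None else q.get((s_next, a_next), 0.0)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (reward + gamma * next_q - old)
```

With a zero-initialized table, winning 4 chips at the end of a hand after a "call" moves that entry from 0 to 0.35 * 4 = 1.4 under either rule; the two rules only differ when the hand continues and the policy explores a non-greedy operator.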

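The claim that the “queeny” deck makes pairing more likely for a queen can be checked with a small calculation. The sketch below builds the 18-card deck as described in this specification and computes the chance that the flop matches the rank of your private card, before anything is known about the opponent's card. The card encoding (rank letter plus "h" or "s"), the function names, and the reading of "pair" as "same rank, regardless of suit" are assumptions.

```python
from fractions import Fraction

RANKS = "AKQJ"
SUITS = "hs"  # hearts and spades


def build_deck():
    """18 cards: three copies of each queen, two copies of every other card."""
    deck = []
    for rank in RANKS:
        copies = 3 if rank == "Q" else 2
        for suit in SUITS:
            deck.extend([rank + suit] * copies)
    return deck


def pair_probability(private_card):
    """Chance that the flop has the same rank as the private card, given
    only that card (the flop is equally likely to be any of the other 17)."""
    remaining = build_deck()
    remaining.remove(private_card)
    matches = sum(1 for card in remaining if card[0] == private_card[0])
    return Fraction(matches, len(remaining))
```

Under these assumptions, `pair_probability("Qh")` is 5/17 (about 29%), while `pair_probability("As")` is 3/17 (about 18%).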