
CS 2951-F: Learning and Sequential Decision Making

Homework #1

Instructor: Michael L. Littman

(Due: 10/01/19)

Read all the instructions below carefully before you start working on the assignment, and before you make a submission.

• Assignments are due at 11:59 pm on the due date on Gradescope. Please submit a PDF on Gradescope. If you haven't signed up for Gradescope yet, you'll need the code: M8DJV4. Be sure to use your Banner ID as your name and also sign up for the correct course! Your PDF should contain written answers and any relevant images.

There are two parts to this homework assignment: Part 1 introduces the simple_rl framework and value iteration, and Part 2 requires reading two papers, one seminal and one contemporary, with corresponding questions.

Problem 1: Value Iteration in simple_rl

In this problem, you will be using demo code from the simple_rl library to learn more about MDPs and Value Iteration (VI).

If you haven't done this already, view the simple_rl GitHub repository at [simple_rl] and follow the instructions to install it on your machine.

For this assignment, the file you will be working with is at [/examples/value_iteration_update_viz_example.py]. Start by running the code with the -h flag to look at the arguments and understand what you need to pass in. Feel free to look through the code (it should be fairly straightforward!) to understand what it is doing, though we would not recommend you look through all the various imported files such as GridWorldMDP or ValueIterationClass, because these are tightly integrated with the rest of the library and may be difficult to understand (and aren't necessary for this assignment).

Note: Be sure to read the documentation provided with the -h flag, especially the preamble and notes. This demo is extremely finicky with the way arguments must be passed to it, so reading the -h documentation will probably save you (and us!) a lot of trouble with this assignment!
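
For example, from the root of your simple_rl checkout (the exact invocation may differ depending on how and where you installed the library), the help text can be printed with

    python examples/value_iteration_update_viz_example.py -h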

For subproblems (a), (b), and (c), instantiate a 4 × 4 gridworld with one wall and one lava location. Place the goal in the top-right corner and the lava in the cell right below the goal. Have the agent start in the bottom-left cell and place the wall in the cell above the start location.

(a) Provide a screenshot of your gridworld (it's okay if it has the policy arrows/colors in it!).

(b) Run the example visualizer with the following discount-factor values: 0.9, 0.7, 0.1, 0.001 (be sure to keep your delta value constant).

(i) Comment on how the policy changes (i.e., does it always take the shortest path to the goal?). Explain what you observe. What does this tell you about how gamma affects VI?

(ii) Run the visualizer with gamma set to 1. Does Value Iteration converge to a policy (remember, if it doesn't converge within 500 iterations, the program will terminate automatically)? When gamma = 1, Value Iteration is no longer theoretically guaranteed to converge. Thus, if it did converge in this case, explain why that might be. How might you change the gridworld so that VI would not converge?

(c) Run the example visualizer with the following delta values: 1, 0.1, 0.01, 0.001, 0.0001 (be sure to keep your gamma value constant). What happens to the number of iterations taken to reach convergence as you decrease delta? Explain what you observe.
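
For parts (b) and (c), it helps to keep the roles of gamma and delta separate. The demo's exact stopping test may differ slightly, but a standard convergence criterion for VI (given here as a reference formula, not necessarily the one implemented in simple_rl) is

    \max_{s \in S} \left| V_{k+1}(s) - V_k(s) \right| < \delta

That is, gamma discounts future reward inside each backup, while delta sets how little the value function must change between iterations before the algorithm stops.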

(d) Instantiate a 4 × 4 gridworld with the agent starting at the bottom-left corner (location (1,1)). Use walls such that it is impossible for the agent to reach the goal. Specifically, you should put three (and only three) walls in the cells immediately surrounding the agent's initial location. Place the goal in the top-right corner. Do not place any other goals, lava, or walls. Leave all other parameters as their defaults. If you have done this correctly, gamma should have a value of 0.95. The transition function should be such that the agent takes the intended movement action (i.e., up, down, left, or right) with probability 0.95 and, with probability 0.05, moves in a randomly chosen perpendicular direction. The reward of every state that is not a goal or lava state is -0.1, the reward of a lava state is -1.0, and that of a goal state is +1.0.
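
Written out in symbols (assuming the 0.05 slip probability is split evenly between the two perpendicular directions, which is the usual convention for such gridworlds; the demo's documentation is authoritative), the dynamics and rewards above are

    T(s' \mid s, a) = 0.95 for the intended neighbor, 0.025 for each perpendicular neighbor
    R(s) = -0.1 (ordinary state), R(s) = -1.0 (lava), R(s) = +1.0 (goal)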

(i) Is the value function 0 at states that the agent cannot reach? Explain what you observe.

(ii) Does the initial state converge to a specific value? If so, prove that the value you obtained for the initial state is correct. Include a screenshot of your gridworld. Hint for (ii): Recall the Bellman-backup formula for VI. What happens when you apply this formula iteratively to the initial state?
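
For reference, the Bellman backup that VI applies to every state on each iteration (this is the standard update the hint refers to; reward conventions vary, and the state-based rewards of part (d) are a special case of R(s, a, s')) is

    V_{k+1}(s) = \max_{a \in A} \sum_{s' \in S} T(s' \mid s, a) \left[ R(s, a, s') + \gamma \, V_k(s') \right]

starting from some initial value function V_0.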

(e) Instantiate gridworlds of the following sizes: 5 × 5, 10 × 10, 15 × 15, 20 × 20, and 25 × 25. Leave all parameters other than the width and height as their defaults.

(i) What is the relationship between the number of states in the gridworld and the total time taken to run VI for each gridworld? (Reminder: The time for all iterations is printed in the console when you run the program). Include a table with the number of states and the times required for VI to run on each of these.

(ii) Derive the time complexity (big-O) of one iteration of the VI algorithm (pseudocode included below) with respect to the size of the state space, |S|, and the size of the action space, |A|. Assume that V_0 is a constant and that V_{k-1}(s) can be obtained in constant time for any state s.

(iii) What is the relationship you observe between answers (i) and (ii)? Are your answers to both parts the same? Why or why not?
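
As a concrete reference point for parts (e)(ii) and (e)(iii), one sweep of value iteration over an explicitly enumerated MDP can be written as follows. This is a generic sketch in plain Python with hypothetical dictionary-based containers for T and R; it is not the simple_rl implementation, but its loop structure matches the standard pseudocode.

    def vi_sweep(states, actions, T, R, V, gamma):
        """One iteration (sweep) of value iteration.

        states  : iterable of states
        actions : iterable of actions
        T[s][a] : list of (next_state, probability) pairs
        R[s][a] : expected immediate reward for taking action a in state s
        V       : dict mapping each state to its current estimate V_k(s)
        Returns the updated value function V_{k+1} and the largest change.
        """
        new_V = {}
        max_change = 0.0
        for s in states:                          # loop over states
            best_q = float("-inf")
            for a in actions:                     # loop over actions
                q = R[s][a]
                for s_next, prob in T[s][a]:      # loop over possible successors
                    q += gamma * prob * V[s_next]
                best_q = max(best_q, q)
            new_V[s] = best_q
            max_change = max(max_change, abs(best_q - V[s]))
        return new_V, max_change

Repeating this sweep until max_change falls below the tolerance delta yields the full algorithm.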

Problem 2: Paper Write-ups

This course is an introduction to learning and sequential decision making, as well as an introduction to research methods in this field. To understand state-of-the-art results, it is important to read and parse research papers. Throughout the course, there will be a heavy emphasis on reading important and seminal papers that have shaped the discipline.

If this is your first time reading a research paper, this link [Handouts/HowtoReadPaper.pdf] may be helpful; it may even be useful for research veterans. It describes a multi-pass approach to reading papers.

The papers vary in difficulty, and we do not expect anyone to understand all of their content. Nonetheless, it is important to start building your ability to read research, especially research outside your comfort zone. The notation, mathematics, and jargon of a paper matter less than its big ideas and context. We recommend setting aside 4 hours in total for the first and second passes of the two papers; this estimate may grow if you don't have a background in deep learning, particularly for the second paper. Read the papers, take notes, and answer the questions for both papers (see below).

(i) The Theory of Dynamic Programming by Bellman (1954). This classic paper (https://projecteuclid.org/download/pdf_1/euclid.bams/1183519147) introduces the ideas of dynamic programming and optimality. Feel free to ignore Sections 9 and 10.

(ii) Value Iteration Networks by Tamar et al. (2017). This contemporary paper (https://arxiv.org/abs/1602.02867) uses the basic idea of VI to design a fully differentiable neural-network-based planning module. It assumes some background in convolutional neural networks and backpropagation. For a good refresher on backprop, check out Michael Nielsen's excellent book Neural Networks and Deep Learning, specifically Chapters 1 and 2. For an overview of convolutional networks, Andrej Karpathy's excellent CS231n vision course (https://cs231n.github.io/convolutional-networks/) provides sufficient background. This is a challenging paper, and it will take some time to understand the structure and argument. Use your peers, online resources, and the TA staff if you feel overwhelmed.

Questions (Answer the following for both papers, about one page of writing total)

• What are the papers' main contributions? Describe the core idea.

• What was surprising, difficult, or confusing in the paper?

• Were the experiments (if there are any) convincing?

• How can these methods be applied in ways not described in the paper? Have you seen these ideas in other papers? Feel free to cite relevant work.

This paper write-up should be about a page long, in PDF format, submitted on Gradescope. You may use whatever word processor you want, although it might be easier to use LaTeX if you want to type out math; we recommend Overleaf for that purpose.
