Hierarchical Relational Reinforcement Learning

Meg Aycinena

Department of Computer Science

Stanford University

Stanford, California, 94305

aycinena@cs.stanford.edu

Abstract

Experiments with relational reinforcement learning, a combination of reinforcement learning and inductive logic programming, are presented. This technique offers greater expressive power than traditional reinforcement learning. We use it to find general solutions in the blocks world domain. We discuss some difficulties associated with relational reinforcement learning, specifically its decreased effectiveness on more complex problems. Finally, we explore ways in which hierarchical methods and the learning of subroutines may be able to overcome some of these weaknesses.

1. Introduction

Reinforcement learning can be a powerful machine learning technique, but it suffers from an inability to generalize well over large state spaces, or to reuse the results of one learning session on a similar, but slightly different or more complicated, problem. Džeroski et al. [Džeroski et al., 1998], [Džeroski et al., 2001] have developed a way to combine reinforcement learning with inductive logic programming to produce a more flexible, expressive approach that circumvents many of the problems that have plagued traditional reinforcement learning. Specifically, the method modifies the traditional Q-learning technique so that it stores the Q-values of state-action pairs in a top-down induced logical decision tree. This allows for the use of variables, eliminates the need for an explicit, memory-expensive state-action table, and generalizes already learned knowledge to unseen parts of the state space.

Although relational reinforcement learning greatly improves upon traditional reinforcement learning, it tends to work less effectively with more complicated goals. Its problem solving ability in the classic blocks world domain, for example, is impressive with simple goals such as clearing one block or placing one block on top of another. However, when faced with a more complex goal, such as building an ordered tower of blocks, relational reinforcement learning becomes markedly less effective.

A hierarchical learning strategy may harness the effectiveness of relational reinforcement learning on simpler problems in order to solve more complicated goals. Our hypothesis is that by first learning a set of useful subroutines, and then learning to combine these subroutines to achieve a stated goal, an agent may converge more consistently and more quickly upon an optimal solution.

Some work has been done on hierarchical learning in traditional reinforcement learning [Kaelbling et al., 1996], but it has been hindered by the lack of variablized solutions to subproblems and the inability to generalize those solutions to slightly different problems. Because relational reinforcement learning can produce variablized solutions to subproblems, and because it has been shown to perform well on simpler problems, hierarchical learning built on top of it is easier to implement and possibly more powerful than its traditional counterpart. Hierarchical relational reinforcement learning thus promises to exploit the advantages, and overcome some of the weaknesses, of relational reinforcement learning, improving both the learning process itself and the applicability of the learned solution. In this paper, we investigate whether this actually occurs.

The paper is organized as follows. In Section 2 we describe traditional reinforcement learning, Q-learning, and how P-learning extends it. In Section 3 we briefly summarize inductive logic programming, and describe top-down induction of logical decision trees, as implemented in the TILDE system, [Blockeel, 1998]. In Section 4 we describe relational reinforcement learning, and how we have implemented it according to the design of [Džeroski et al., 2001]. Section 5 describes a simple implementation strategy for hierarchical relational reinforcement learning, and Section 6 briefly offers the results of the preliminary experiments we have conducted in order to explore its effectiveness. Section 7 concludes and discusses possible further work.

2. Traditional reinforcement learning, Q-learning, and P-learning

The term “reinforcement learning” and some of the general characteristics of the technique have been borrowed from cognitive psychology, but in recent years it has become increasingly popular within the computer science disciplines of artificial intelligence and machine learning. See [Kaelbling et al., 1996] for a broad survey of the field as it has been applied within computer science. In its simplest form, reinforcement learning describes any machine learning technique in which a system learns a policy to achieve a goal through a series of trial-and-error training sessions, receiving a reward or punishment after each session, and learning from this “reinforcement” in future sessions. (Note: we shall use the terms “training session” and “episode” interchangeably to refer to one cycle of state-action sequences culminating in a goal state.)

Formally, [Kaelbling et al., 1996] describe the reinforcement learning model as consisting of:

• a discrete set of environment states, S,

• a discrete set of agent actions, A, and

• a set of scalar reinforcement signals, typically {0, 1}, or the real numbers.

The goal of the agent is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement. We will denote the optimal policy as π*. Various techniques exist within reinforcement learning for choosing an action to execute in a particular state. Q-learning is one of the more traditional of these techniques.

Q-learning

Q-learning falls within the "model-free" branch of the reinforcement learning family, because it does not require the agent to learn a model of the world or environment (i.e. how actions lead from state to state, and how states give rewards) in order to learn an optimal policy. Instead, the agent interacts with the environment by executing actions and perceiving the results they produce. In Q-learning, each possible state-action pair is assigned a quality value, or "Q-value," and actions are chosen by selecting the action with the highest Q-value in the current state.

The process works as follows. The agent can be in one state of the environment at a time. In a given state s_i ∈ S, the agent selects an action a_i ∈ A to execute according to its policy π. Executing this action puts the agent in a new state s_{i+1}, and it receives the immediate reward r_i = r(s_i, a_i). The policy in Q-learning is given by:

π(s) = argmax_a Q(s, a),

and thus by learning an optimal function Q*(s, a), which maximizes the total (discounted) reward, we can learn the optimal policy π*(s). We update Q(s, a) after taking each action so that it better approximates the optimal function Q*(s, a). We do this with the following update rule:

Q(s_i, a_i) := r_i + γ max_{a'} Q(s_{i+1}, a'),

where a' ranges over the actions available in the new state s_{i+1}, and γ is the discount factor, 0 < γ < 1.
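To make the update concrete, the following minimal sketch shows it for ordinary tabular Q-learning, with states and actions as hashable Python objects; the dictionary-backed table and the helper actions_in are assumptions of this sketch, not part of the relational method described later.

    from collections import defaultdict

    GAMMA = 0.9                      # discount factor gamma
    Q = defaultdict(float)           # Q[(state, action)] -> value, initially 0

    def q_update(s, a, r, s_next, actions_in):
        # Q(s_i, a_i) := r_i + gamma * max over a' of Q(s_{i+1}, a')
        best_next = max((Q[(s_next, a2)] for a2 in actions_in(s_next)), default=0.0)
        Q[(s, a)] = r + GAMMA * best_next

    def greedy_policy(s, actions_in):
        # pi(s) = argmax_a Q(s, a)
        return max(actions_in(s), key=lambda a: Q[(s, a)])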

In the first few training sessions, the agent has not yet learned even a crude approximation of the optimal Q-function, and therefore the initial actions will be random if the initial Q-values are. However, due to this randomness, the danger exists that, although the Q-function may eventually converge to a solution of the problem, it may not find the optimal solution, because the initial path the agent stumbles upon may be a correct but sub-optimal path to the goal. Therefore every reinforcement learning agent needs some policy for exploring its environment, so that, if a better solution exists than the one already found, the agent may stumble upon it.

One popular method (e.g. [Džeroski et al., 2001]) for arbitrating the tradeoff between exploitation (using knowledge already learned) and exploration (acting randomly in the hopes of learning new knowledge) is to choose each action with a probability proportionate to its Q-value. Specifically:

Pr(a_i | s) = T^{-Q(s, a_i)} / Σ_j T^{-Q(s, a_j)}.

The temperature, T, starts at a high value and is decreased as learning goes on. At the beginning, the high value of T ensures that some actions associated with low Q-values are taken – encouraging exploration. As the temperature is decreased, most actions are those associated with high Q-values – encouraging exploitation. This is the method of choosing actions that we will use.
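A minimal sketch of this selection rule, purely for illustration (the mapping Q from state-action pairs to values is an assumption of the sketch):

    import random

    def select_action(s, actions, Q, T):
        # Pr(a_i | s) = T^(-Q(s, a_i)) / sum_j T^(-Q(s, a_j))
        weights = [T ** (-Q[(s, a)]) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]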

P-learning

Like Q-learning, P-learning is a technique for choosing actions in a particular state, presented by [Džeroski et al., 2001]. Indeed, P-learning is almost identical to Q-learning, except that rather than remembering a real-valued scalar to represent the quality of each state-action pair, the agent only remembers a 1 if the action in the given state is optimal, and a 0 if the action is nonoptimal. Formally:

if a ∈ π*(s) then P*(s, a) = 1, else P*(s, a) = 0.

In practice, the P function is computed from the Q function (which must also be learned), but it can be stored more compactly and used later more easily. The P function is expressed in terms of the Q function as follows:

if a ∈ argmax_{a'} Q(s, a') then P(s, a) = 1, else P(s, a) = 0.
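In code, this amounts to marking the actions whose Q-values are maximal in their state. A small sketch, where the tolerance eps is our own implementation detail:

    def p_value(s, a, actions, Q, eps=1e-9):
        # P(s, a) = 1 if a attains the maximal Q-value in s, else 0
        best = max(Q[(s, a2)] for a2 in actions)
        return 1 if Q[(s, a)] >= best - eps else 0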

We can also incorporate exploration into P-learning through an equation analogous to the one used for choosing actions in Q-learning:

Pr(a_i | s) = T^{-P(s, a_i)} / Σ_j T^{-P(s, a_j)}.

We will see that P-learning, although it is built directly on top of Q-learning and may seem rather redundant, produces smaller, simpler, and more efficient representations of the optimal solution and therefore is extremely valuable towards our goal of creating hierarchical programs.

3. Inductive logic programming and top-down induction of logical decision trees

Inductive logic programming (ILP) refers to a subdiscipline of machine learning that deals with learning logic programs. ILP attempts to use reasoning to derive useful and general principles, rules, and relationships from a set of specific data instances, hence the use of the term “inductive”. This contrasts with the well-established field of deductive reasoning, in which a conclusion is inferred from a given set of premises through the application of accepted logical rules. In most cases, ILP has been used to learn concepts so that a set of unseen examples may be classified correctly.

Decision trees

One popular method for performing the classification of examples is the decision tree. A decision tree consists of a set of nodes, each of which is either an internal node (meaning it has a set of other nodes as its children), or a leaf node (in which case it has no children).

Each internal node contains a test that is used to split the possible set of examples into several mutually exclusive groups. The node then passes the current example down the branch corresponding to the result of the test. Often, the tests return Boolean results, so the tree is binary, with each node classifying the example as either passing or failing its particular test. Decision trees with this property are called binary decision trees.

Leaf nodes do not contain tests, but rather hold the classification label for examples that are sorted into that node. Thus an example can be deterministically classified by a decision tree.
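The following sketch of a propositional binary decision tree makes the classification procedure concrete; the class names and the representation of tests as Python functions are our own choices, not part of any particular system:

    class Leaf:
        def __init__(self, label):
            self.label = label

    class Node:
        def __init__(self, test, yes, no):
            self.test = test          # test: example -> bool
            self.yes = yes            # subtree for examples passing the test
            self.no = no              # subtree for examples failing it

    def classify(tree, example):
        # Walk from the root to a leaf, following the outcome of each test.
        while isinstance(tree, Node):
            tree = tree.yes if tree.test(example) else tree.no
        return tree.label

    # e.g. classify(Node(lambda ex: ex["clear_a"], Leaf("yes"), Leaf("no")),
    #               {"clear_a": True}) returns "yes"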

Logical decision trees

According to [Blockeel and De Raedt, 1997], logical decision trees are binary decision trees which fulfill the following two properties:

• The tests in the internal nodes are first-order logical conjunctions of literals.

• A variable that is introduced in a particular node (i.e. one which does not exist higher in the tree), can only occur in the subtree corresponding to successful tests.

The second requirement is necessary because, if a test referring to a newly introduced variable fails, the agent cannot be sure that the variable refers to anything at all, and therefore it does not make sense to continue to refer to it.

We illustrate the concept of logical decision trees with this simple example, taken from [Džeroski et al., 2001]:

[pic]

This logical decision tree determines whether all the blocks are unstacked in the 3-blocks world.


The fact that we are using variables also necessitates more complicated tests than those used in the propositional case. Internal nodes must contain a query that includes both the unique test introduced in the current node and the tests introduced in all of its parent nodes. The following process for associating clauses and queries with nodes is adapted from [Blockeel and De Raedt, 1997]. We assume that examples sorted down the tree are passed to the left subtree if the node's test returns true, and to the right subtree if it returns false. We also assume that N.test returns the unique test (a conjunction of literals) assigned to the node N:

• With the root node T, we associate the empty query and the clause c_0 ← T.test.

• With each internal node N whose associated query is Q, we associate the clause c_i ← Q, N.test, where i is a number unique to this node.

• The left subtree of N is associated with the query Q, N.test (the body of c_i); the right subtree is associated with the query Q, not(c_i).

• A leaf node contains no test; it keeps only the query inherited from its parent.

It is important to note that we associate the query Q, not(c_i) with the right subtree, and not the expected query Q, not(N.test), which would be symmetric with the left subtree. This is necessary because the queries of the left and right subtrees must be complementary: exactly one of the two queries associated with the two children of a node should succeed on any example passing through that node. Because N.test may share variables with Q, the two queries Q, N.test and Q, not(N.test) are not necessarily complementary for all possible substitutions. This distinction does not arise in the propositional case. For a more detailed discussion of the logical inequivalence of the two queries, see [Blockeel and De Raedt, 1997], pp. 4-6.
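As a small illustration (our own, not taken from [Blockeel and De Raedt, 1997]): suppose the root's test is on(X, table) and the test of its left child is clear(X). The root receives the empty query and the clause c_0 ← on(X, table); its left child receives the query on(X, table) and the clause c_1 ← on(X, table), clear(X); its right child receives the query not(c_0). Writing not(c_0) rather than not(on(X, table)) matters precisely because X is bound only when the root's test succeeds.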


Thus logical decision trees allow the classification of examples described in first order (rather than propositional or attribute value) logic, and therefore offer much greater expressiveness and flexibility than traditional decision trees. This characteristic makes them ideal candidates for use in relational reinforcement learning, which seeks to allow variablization and generalization over similar inputs.

Top-down induction of logical decision trees

In order to use logical decision trees for classification, however, we must have an algorithm for inducing them from a set of example instances. [Blockeel and De Raedt, 1997] offer the following algorithm, based on the classic TDIDT (top-down induction of decision trees) algorithm, but modified to induce logical decision trees. It assumes as input a set of labeled examples E, background knowledge B, and a language L specifying what kinds of tests may be used in the decision tree; both B and L are assumed to be available implicitly. The procedure buildtree(T, E, Q) also carries the query Q associated with the node T being built, which is initially empty:

buildtree(T, E, Q):

    if E is homogeneous enough
    then
        K := most_frequent_class(E)
        T := leaf(K)
    else
        Q_b := the best element of ρ(Q), according to some heuristic
        T.test := C, where Q_b is the query Q, C
        E_1 := { e ∈ E | e ∪ B ⊨ Q_b }
        E_2 := { e ∈ E | e ∪ B ⊭ Q_b }
        buildtree(T.left, E_1, Q_b)
        buildtree(T.right, E_2, Q)


The refinement operator ρ determines the language bias (and thus serves the role of the language L). [Blockeel et al., 2001] use a refinement operator that adds one or more literals to the clause, under θ-subsumption. A clause c_1 θ-subsumes another clause c_2 if and only if there is a substitution θ such that c_1θ ⊆ c_2. In other words, a refinement operator under θ-subsumption maps a clause onto a set of clauses, each of which is θ-subsumed by the initial clause. It is from this set of new clauses that the conjunction to be put in a new node is chosen.
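As a small illustration of our own, a refinement operator under θ-subsumption might map the query on(X, Y) onto refinements such as on(X, Y), clear(X) and on(X, Y), on(Y, table). Each refinement is θ-subsumed by the original query (the identity substitution maps the original's literals into the refinement), and whichever refinement scores best under the heuristic supplies the conjunction placed in the new node.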


The TILDE system, which [Blockeel and De Raedt, 1997], [Blockeel, 1998], and [Blockeel et al., 2001] introduce and describe, implements the algorithm above. It can be used to induce both logical classification trees and, via the equivalent regression-tree system TILDE-RT, logical regression trees. (Regression trees are identical in definition to classification trees, except that they contain real-valued numbers in their leaves rather than classifier labels.) We will see how logical decision trees, induced with the TILDE system, can be used to store Q- and P-values, while introducing variables and allowing for generalization in the reinforcement learning process.

4. Relational reinforcement learning

As described in section 2, traditional reinforcement learning stores the learned Q-values in an explicit state-action table, with one value for each possible combination of the state and action. The relational reinforcement learning method introduced by [Džeroski et al., 1998] and [Džeroski et al., 2001] modifies this aspect of reinforcement learning in order to introduce variables and to increase the ability of the agent to generalize learned knowledge to unseen parts of the state space. Rather than using an explicit state-action table, relational reinforcement learning stores the Q-values in a logical regression tree. There are several important advantages to this approach:

• It allows for larger or even infinite state spaces, or state spaces of unknown size.

• By using a relational representation, it takes advantage of the structural aspects of the task.

• It provides a way to apply the learned solution of one problem to other similar or more complicated problems.

Thus relational reinforcement learning offers the potential to overcome many of the serious drawbacks that traditional reinforcement learning has presented in the past.

Modifications to traditional Q-learning

Relational reinforcement learning is essentially a specialization of traditional Q-learning. As in traditional reinforcement learning, the algorithm has available to it:

• a set of possible states, S,

• a set of agent actions, A,

• a transition function which, although it may not be explicitly known by the algorithm, will return the next state given the current state and an action to perform.

• a reward function that returns scalar reinforcement signals given a state.

In relational reinforcement learning, however, more specific rules govern the learning process. States are represented as sets of first-order logical facts, and the algorithm can only see one state at a time. Actions are also represented relationally, as atoms whose arguments refer to the objects of the domain. Because not all actions are possible in all states, the algorithm can only consider actions that are possible given the current state, and it must be able to determine from the state description whether a given action is possible. Also, because of the relational representation of states and actions and the inductive logic programming component of the algorithm, there must exist some body of background knowledge which is generally true for the entire domain, as well as a language bias determining how the learned policy will be represented.
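For example, the 3-blocks state in which block a sits on block b, b sits on the table, and c stands alone on the table is the set of facts clear(a), on(a,b), on(b,table), clear(c), on(c,table), and the action of moving a onto c is the atom move(a,c). The relational representation therefore describes both states and actions in terms of predicates over the objects of the domain, rather than as a fixed vector of attribute values.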

As in traditional reinforcement learning, the agent then seeks to find an optimal policy π*, mapping states to actions, that maximizes some long-run measure of reinforcement.

Perhaps the most important modification, however, is the method of storage of the learned Q-values. In relational reinforcement learning, sets of state-action pairs and their corresponding Q-values are used to induce a logical regression tree after each training session. This tree then provides the Q-values for state-action pairs for the following training session. It also represents the policy which is being learned. In the version of relational reinforcement learning corresponding to P-learning, the tree is induced with sets of state-action pairs and their corresponding label (optimal or nonoptimal), and the resulting tree is a logical classification tree mapping state-action pairs to labels.

The Q relational reinforcement learning algorithm

The following algorithm, developed by [Džeroski et al., 1998] and [Džeroski et al., 2001], is essentially identical to a traditional Q-learning algorithm, except in its method of updating the Q-function Q_e, which is represented as a logical regression tree after each episode e.

Q-RRL:

    initialize Q_0 to assign 0 to all (s, a) pairs
    initialize Examples to the empty set
    e := 0
    do forever
        e := e + 1
        i := 0
        generate a random state s_0
        while not goal(s_i) do
            select an action a_i stochastically, using the Q-exploration
                strategy and the current hypothesis Q_e
            perform action a_i
            receive immediate reward r_i = r(s_i, a_i)
            observe the new state s_{i+1}
            i := i + 1
        endwhile
        for j = i - 1 to 0 do
            generate example x = (s_j, a_j, q_j),
                where q_j := r_j + γ max_{a'} Q_e(s_{j+1}, a')
            if an example (s_j, a_j, q_old) exists in Examples,
                replace it with x, else add x to Examples
        update Q_e using TILDE-RT to produce Q_{e+1}, using Examples

The algorithm uses the TILDE-RT system (described in Section 3) to induce the Q-tree. Because TILDE-RT is not incremental, a set of all examples encountered over all training sessions is maintained, with the most recent Q-value for each one, and the entire tree is re-induced from this set after each episode.
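A sketch of this bookkeeping, assuming states and actions are hashable and with induce_tree standing in for the external call to TILDE-RT, which is not reproduced here:

    examples = {}                        # (state, action) -> most recent Q-value

    def record_example(s, a, q):
        # Overwrites any older Q-value stored for the same state-action pair.
        examples[(s, a)] = q

    def rebuild_q_tree(induce_tree):
        # The tree is re-induced from scratch from the full example set,
        # because TILDE-RT is not incremental.
        return induce_tree([(s, a, q) for (s, a), q in examples.items()])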

Each state-action-Q-value set is represented as an example consisting of a set of relational facts: all the facts of the state itself, the relational form of the action, and the Q-value. State-action-Q-value sets are translated into this format in order to induce the tree, and a state-action pair can likewise be translated into example format in order to retrieve the corresponding Q-value from the tree during a training session. Figure 1 demonstrates a sequence of state-action-Q-value sets encountered during a training session in the blocks world domain, and Figure 2 shows how these sets are then translated into example format for input into the TILDE-RT system. Figure 3 illustrates the Q-tree induced with the given sequence of examples as input. (These examples, as well as Figures 4 and 5, are taken from [Džeroski et al., 1998].) Following the Prolog convention used throughout this paper, uppercase letters denote variables and lowercase letters denote constants.

[pic]


Figure 1: A schematic representation illustrating one possible training session while using relational reinforcement learning in the blocks world to accomplish the ordered stacking goal.

[pic]

Figure 2: The state-action-Q-value sets, from the sequence in Figure 1, given as examples in relational format. This is exactly the input that would be given to TILDE-RT.

[pic]


Figure 3: The Q-tree induced from the examples in Figure 2.

If the state is a goal state, the Q-value is automatically 0 regardless of the action; goal states are absorbing, so all actions that would cause the agent to leave the goal state must be considered nonoptimal (and thus have an absolute quality of 0).

From Figure 3, it is apparent that the logical regression tree used in relational reinforcement learning differs slightly from the definition of logical decision trees given in Section 3, in that it contains a root node holding facts that are guaranteed to be true for all possible examples (in Figure 3, the conjunction of the goal_on facts and numberofblocks(D)). This is necessary in order to bind the variables of the tree to the appropriate constants, and this binding is then propagated to all nodes of the tree. Because we can bind the variables in the root node to any set of appropriate constants, the learned tree can be used as a solution to any problem of the same form, regardless of the actual names of the blocks. This is one of the most significant improvements of relational reinforcement learning over traditional Q-learning.

The P relational reinforcement learning algorithm

Just as traditional P-learning is almost identical to Q-learning, the P-learning counterpart to relational reinforcement learning introduces very few distinctions. Because our definition of the P-function depends upon the Q-function, we must learn both a Q-tree and a P-tree: the Q-tree is learned first, and after each episode it is used to label state-action pairs as optimal or nonoptimal; these labeled examples are then given to TILDE to induce the P-tree. The P-tree will in most cases prove to be simpler, and thus a more efficient method for choosing actions. The Q-tree is only learned in order to define the P-tree; it is not used to select actions.


The algorithm is identical except for three differences: the initial P-tree assigns a label of "optimal" to all state-action pairs; actions are chosen using the P-exploration strategy given in Section 2, rather than the Q-exploration strategy; and the following loop is added after the Q-function update loop, to update the P-function. Again, this pseudocode is taken from [Džeroski et al., 2001]:

    for j = i - 1 to 0 do
        for all actions a_k possible in state s_j do
            if the state-action pair (s_j, a_k) is optimal according to Q_{e+1}
            then generate example (s_j, a_k, c) where c = 1
            else generate example (s_j, a_k, c) where c = 0
    update P_e using TILDE to produce P_{e+1}, using these examples (s_j, a_k, c)

Again, all actions from a goal state are assigned a label of nonoptimal, for the reason discussed above.
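A sketch of how these labeled examples could be generated from the current Q-function; possible_actions, is_goal, and the tolerance eps are assumptions of the sketch rather than part of the published algorithm:

    def p_examples(states_visited, possible_actions, Q, is_goal, eps=1e-9):
        out = []
        for s in states_visited:
            acts = possible_actions(s)
            best = max(Q[(s, a)] for a in acts)
            for a in acts:
                # every action out of a goal state is nonoptimal; otherwise an
                # action is optimal iff its Q-value is maximal in s
                optimal = (not is_goal(s)) and Q[(s, a)] >= best - eps
                out.append((s, a, "optimal" if optimal else "nonoptimal"))
        return out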

The same sequence of state-action-Q-value sets from Figure 1 is given in Figure 4, but in the format that would be given as input to the TILDE system in order to induce the logical classification P-tree. The only difference is that the classifier label “optimal” or “nonoptimal” is given, rather than a Q-value. Figure 5 illustrates the corresponding P-tree.

[pic]

Figure 4: The state-action-P-value sets, from the sequence in Figure 1, given as examples in relational format. It is identical to the input in Figure 2, but with classifier labels rather than Q-values. This is exactly the input that would be given to TILDE.

[pic]


Figure 5: The P-tree induced from the examples in Figure 4.

5. Hierarchical relational reinforcement learning

[Džeroski et al., 2001] describe the method of relational reinforcement learning presented in Section 4 in detail, and they offer a thorough set of experimental results testing its effectiveness at planning in the blocks world domain. One of their most significant findings is that relational reinforcement learning works extremely well on very simple goals, but becomes less effective at solving even slightly more complex problems. For the goals of unstacking all the blocks onto the table, or building all the blocks into a simple tower (in any order), the method gives very impressive convergence rates to an optimal solution. However, it converges more slowly on the goal of placing one specific block on top of another specific block.

Our implementation of relational reinforcement learning seems to support this conclusion. We attempted to learn a more complicated goal: building all the blocks in the world (three blocks, in our experiments) into a single ordered tower. We found it very difficult to produce optimal solutions, or even to improve the accuracy rate during learning. Section 6 of this paper gives the results of these preliminary experiments and offers some discussion.

This relative lack of success in solving more complex problems suggests that the benefits of hierarchical learning in a relational reinforcement learning context could be quite profound. By learning first a set of useful routines for accomplishing subgoals, and then using these subroutines as the available action set to learn the original goal, the agent could significantly reduce the number of training sessions required to converge upon an optimal policy, and in the end could produce a simpler representation of the optimal policy. Figure 6 illustrates schematically how this idea could be approached in the blocks world domain. In order to test this hypothesis, we have developed a simple implementation strategy for hierarchical relational reinforcement learning.

[pic]

Figure 6: The main goal in the ordered stacking task can be accomplished by achieving some combination of subgoals in a particular order. It is this structure of a learning problem that hierarchical learning hopes to exploit.

An implementation of hierarchical relational reinforcement learning

In this simple implementation, the choice of subgoals that will be learned is supplied to the agent as a form of background knowledge. In a more advanced implementation, the agent may learn what subgoals to use, as well as the solutions to the subgoals themselves.

Each subroutine policy is learned independently using relational reinforcement learning, which produces a (presumably optimal) Q or P-tree. The agent that is learning the larger goal then creates instances of each subroutine by instantiating the variablized root nodes of each tree once for each possible binding. In our blocks world domain, for example, if the subroutine is makeClear(A) and there are three blocks in the world named a, b, and c, the agent would create three instances:

makeClear(a)

makeClear(b)

makeClear(c).


(Recall the Prolog convention: uppercase characters denote variables and lowercase characters denote constants.) For a subroutine makeOn(A,B) with three blocks, the agent would create nine instances:

makeOn(a,b)

makeOn(a,c)

makeOn(a,table)

makeOn(b,a)

makeOn(b,c)

makeOn(b,table)

makeOn(c,a)

makeOn(c,b)

makeOn(c,table).

For a subroutine with no arguments, such as unstack, a single instantiation is created.
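A sketch of the instantiation step; the tuple representation of an instance is our own choice, and the rule that the table may appear only as a destination argument follows the listings above:

    def instantiate(name, arity, blocks):
        # Ground a variablized subroutine over the named blocks.
        if arity == 0:                       # e.g. unstack
            return [(name,)]
        if arity == 1:                       # e.g. makeClear(A)
            return [(name, x) for x in blocks]
        # e.g. makeOn(A, B): the destination may be another block or the table
        return [(name, x, y) for x in blocks
                             for y in blocks + ["table"] if x != y]

    # instantiate("makeOn", 2, ["a", "b", "c"]) yields the nine instances listed above.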

Note that the instantiation of subroutines is only possible because relational reinforcement learning produces variablized trees, so one solution can be used to solve many problems. Past attempts at hierarchical reinforcement learning have been hampered by the lack of this feature.

After instantiating all subroutines, the agent then uses the available subroutines as its set of possible actions while learning a policy for achieving the main goal. We define the preconditions of a subroutine to be simply that its subgoal is not already accomplished in the given state. Thus subroutines have the power to be much more effective than simple primitive actions, as they can make a particular subgoal true, in the fewest number of steps, starting in any state.

In the Q-learning process, subroutines, like primitives, are assigned Q values that reflect the quality of taking that action in a particular state. (The P-function is defined as usual using the Q-function). However, the original Q update function proves to be inaccurate when used with subroutines. Consider the example depicted in Figure 7.

[pic]

Figure 7: A possible sequence of states visited by the agent during a training session. The actions explicitly taken by the agent are designated by the bold arrows, while the primitive actions taken by the subroutine are shown with dotted arrows.

Ordinarily the state-action pair (s_i, a) would be assigned a new Q-value as follows:

newQ = r_i + γ max_{a'} Q_e(s_{i+3}, a'),

where max_{a'} Q_e(s_{i+3}, a') in this case is 0.81 and r_i is 0, so newQ would be 0.729 (with a discount factor γ of 0.9). But this is misleading, because it implies that s_i is much closer to the goal than it actually is. In later training sessions, the agent will be unfairly biased towards choosing a path through state s_i.

In order to preserve the accuracy of the Q-function, we remember the number of primitive moves, say n, taken by the subroutine from state s_i to its next state s_{i+3}, and we calculate newQ with the following modified update rule:

newQ = r_i + γ^n max_{a'} Q_e(s_{i+3}, a'),

where n is the number of primitive moves taken by the subroutine just completed. Because an ordinary primitive move trivially consists of a single primitive (n = 1), the modified rule behaves exactly like the original rule for primitive actions.
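For the sequence in Figure 7, for instance, the subroutine carries the agent from s_i to s_{i+3}, which the state indices suggest took n = 3 primitive moves; the modified rule then gives newQ = 0.9^3 × 0.81 ≈ 0.59 rather than 0.729, a value that better reflects the true distance of s_i from the goal.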

Since we assume the subroutines have already been learned to be optimal, this is effectively a shortcut on the learning path. Because the subroutine is, by assumption, the shortest path between states s_i and s_{i+3}, the only way another action out of state s_i would be chosen is if there exists a shorter path to the goal that does not pass through s_{i+3}.

Using this modified Q update function, we can run the relational reinforcement learning algorithm as usual, but using subroutines to improve the efficiency of the learning.

6. Experiments

In order to test the hypothesis that hierarchical relational reinforcement learning overcomes some of the ineffectiveness of relational reinforcement learning on more complicated goals, we conducted a set of experiments aimed at comparing the accuracy over time of the learned regular policies and the learned hierarchical policies.

Experimental setup

We conducted two sets of experiments. In each set we performed 10 runs of 100 training sessions each in the 3-blocks world: 5 runs using very limited background knowledge and 5 runs using a more extended body of background knowledge. In the first set of experiments we did not use hierarchical learning, and in the second set we did. (Some runs take over 10 hours to complete. Due to time constraints, experiments were conducted only in the 3-blocks world; results from the 4- and 5-blocks worlds are forthcoming.)

The current state was described using the following primitive predicates:

• clear(A): true if block A has no other blocks on top of it

• on(A,B): true if block A is directly on top of block B. B may be a block or the table.

The goal task was to stack a set of blocks in an ordered tower. Specifically, the optimal solution for each number of blocks was:

• 3 blocks: on(a,b), on(b,c), on(c,table).

• 4 blocks: on(a,b), on(b,c), on(c,d), on(d,table).

• 5 blocks: on(a,b), on(b,c), on(c,d), on(d,e), on(e,table).

For the non-hierarchical learning experiments, the available action set consisted of all possible forms of the primitive move(A,B), which may only be executed if both A and B are clear. The second argument can be the table.

For the hierarchical learning experiments, we first learned optimal policies for the following subroutines:

• unstack: goal reached if all blocks in the world are on the table

• makeOn(A,B): goal reached if block A is on block B

The subroutine makeClear(A), although it is useful in achieving the goal of an ordered stack, is itself a subroutine of makeOn(A,B), and thus was omitted. Figures 8 and 9 give the optimal learned P-trees used for each subroutine. The instantiations of each of these subroutines then became the possible action set for accomplishing the goal task above.

[pic]

Figure 8: An optimal P-tree for the subroutine unstack, learned after 10 training sessions.

[pic]

Figure 9: An optimal P-tree for the subroutine makeOn(A,B), learned after 51 training sessions.

Our implementation uses the equations and algorithms described in Sections 2-5 of this paper. We used a discount factor γ of 0.9 for the Q update rule and a starting temperature T of 5 for the Q and P exploration functions. We used the following equation as a schedule for temperature decrease, given a factor f and the current training session number e:

T(f, e) = f / (e + f),

where f = (total number of training sessions in the current run) / 10. We also used the TILDE and TILDE-RT systems, as part of the ACE Data Mining System, [Blockeel et al., 2001], provided by Hendrik Blockeel at Katholieke Universiteit Leuven, Belgium.
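Under this schedule, a run of 100 training sessions gives f = 10, so T decreases from 10/11 ≈ 0.91 after the first training session to 10/110 ≈ 0.09 after the hundredth.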

Experiments conducted with a limited body of background knowledge provided to TILDE and TILDE-RT used only the following pieces of information:

• eq(X,Y): returns true if blocks X and Y are the same block

• block(X): a variable is considered a block (as opposed to the table) if and only if it is found as the first argument in some fact on(X,Y).

Experiments conducted with the more extended body of background knowledge used essentially the same information used by [Džeroski et al., 2001]. These included:

• above(X,Y): returns true if block X is above block Y in some stack.

• height(X,N): returns true if block X is at height N above the table.

• numberofblocks(N): returns true if there are N blocks in the world.

• numberofstacks(N): returns true if there are N stacks in the current state.

• diff(X,Y,Z): returns true if X – Y = Z.

Note that this information only affects the learning process in that it is made available to TILDE during the tree induction process. It does not affect the way states are represented or how actions are carried out.
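These predicates are supplied to TILDE as Prolog background knowledge; purely for illustration, the following sketch shows what three of them compute, assuming a state encoded as a Python dict that maps each block to the block (or the table) it rests on:

    def above(on, x, y):
        # true if block x is somewhere above block y in the same stack
        cur = on.get(x)
        while cur is not None and cur != "table":
            if cur == y:
                return True
            cur = on.get(cur)
        return False

    def height(on, x):
        # one convention, assumed here: a block resting directly on the table has height 1
        h, cur = 1, on.get(x)
        while cur is not None and cur != "table":
            h, cur = h + 1, on.get(cur)
        return h

    def numberofstacks(on):
        # one stack per block resting directly on the table
        return sum(1 for support in on.values() if support == "table")

    # e.g. above({"a": "b", "b": "table", "c": "table"}, "a", "b") returns True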

Experimental results

The results of the 3-block runs with limited background knowledge, and extended background knowledge are given in Figures 10 and 11, respectively. The accuracy of the learned P function for hierarchical and non-hierarchical learning is plotted over time. The accuracy is defined as the percentage of state-action pairs that are correctly classified as optimal or nonoptimal using the learned P function. The accuracy of all 5 runs of hierarchical learning is averaged and plotted after each training session. The equivalent representation for non-hierarchical learning is given on the same graph in order to illustrate the contrast.

Figure 10: Hierarchical and non-hierarchical learning in the 3-blocks world, with limited background knowledge.

Figure 11: Hierarchical and non-hierarchical learning in the 3-blocks world, with extended background knowledge.


[NOTE: Again, due to time constraints, only one execution of the hierarchical learning with extended background knowledge was completed. The accuracy of this one run is plotted against the average accuracy of the non-hierarchical case. More statistically sound results are forthcoming.]

The first observation we can make from the results in both Figures 10 and 11 is that relational reinforcement learning without hierarchical learning shows very little increase in accuracy over time. Whether or not extended background knowledge is used, the average accuracy of the learning program seems to hover around 80%. This adds evidence to the conclusion drawn by [Džeroski et al., 2001] that relational reinforcement learning becomes much less effective when used on slightly harder problems.

Second, from Figure 10 we see that hierarchical learning with limited background knowledge, contrary to our hypothesis, in fact yields very little improvement over the non-hierarchical version. The individual runs in both cases do at times hit 100% accuracy, but the averaged results displayed on the graph demonstrate that this does not occur consistently. However, we can give examples of optimal P-trees for both hierarchical and non-hierarchical learning produced on those occasions; these are shown in Figures 12 and 13.

[pic]


Figure 12: An optimal P-tree for ordered stacking, learned after 24 training sessions with non-hierarchical learning and limited background knowledge.

[pic]


Figure 13: An optimal P-tree for ordered stacking, learned after 45 training sessions with hierarchical learning and limited background knowledge.

Perhaps, however, this lack of improvement is due to the limited background knowledge available. We turn to Figure 11, the results of the preliminary experiments conducted with extended background knowledge, to determine if this is true.

While the results of the single hierarchical run illustrated in Figure 11 are by no means statistically sound, they suggest that the results of more thorough tests would be very similar to the hierarchical learning performed with limited background knowledge. In both cases, hierarchical learning does not seem to show significant improvement over non-hierarchical learning.

7. Conclusions and further work

Naturally the first and most important avenue for further work will be to complete the experimental setup described here, first for the remainder of the 3-block hierarchical runs with extended background knowledge, and then by repeating the entire set of experiments in the 4- and 5-blocks worlds. It is possible that as the state space and the complexity of the problem increase, the contrast between hierarchical and non-hierarchical accuracy will become more pronounced. In addition, a very slight but perceptible increase in the average accuracy of all the learning methods can be observed over the course of the 100 training sessions. With enough time and computing power, it would be interesting to learn how the graph would look after 500 or 1000 training sessions.

However, it may be the case that our simple strategy for hierarchical learning is not sophisticated enough to significantly improve the overall performance of the relational reinforcement learning algorithm. This possibility warrants further research into more sophisticated hierarchical methods: using recursion, creating nested hierarchies, taking greater advantage of the structure of the problem, or working more closely with the available background knowledge.

Finally, further work into ways of improving relational reinforcement learning in general would be fascinating. For example, as stated above, the program at times stumbles upon a completely optimal Q or P-tree, and then, due to a high temperature value, wanders away from the optimal solution.

It would be interesting to investigate ways of introducing a type of evaluation function, similar to that used in genetic programming, which would greatly decrease the temperature or even halt the program when an optimal or near-optimal solution is found, even if it is early in the run.

Relational reinforcement learning offers great potential: it combines two extremely powerful machine learning techniques, and uses each one to overcome the other's drawbacks. Similarly, a hierarchical approach to the learning problem appears as though it would exploit the benefits and avoid the weaknesses of relational reinforcement learning. Despite the rather unpromising results of the simple and small-scale experiments conducted here, I believe that relational reinforcement learning warrants a great deal more attention and thought, if only to discover more concretely its specific strengths and weaknesses.

Acknowledgements

This work has been supported by the Computer Science Undergraduate Research Internship program at Stanford University. A special thanks to Professor Nils Nilsson for his boundless support, encouragement, and enthusiasm, and to my research colleagues Mykel Kochenderfer and Praveen Srinivasan for their valuable commentary on my work. Also, grateful acknowledgements are due to Hendrik Blockeel and Kurt Driessens at Katholieke Universiteit Leuven for their patience in answering my questions regarding the TILDE system, and for making the system available to me for this project.

References

[Blockeel, 1998] Blockeel, H. Top-down Induction of First Order Logical Decision Trees. Department of Computer Science, Katholieke Universiteit Leuven, Belgium, December 1998.

[Blockeel et al., 2001] Blockeel, H., Dehaspe, L., and Ramon, J. "The ACE Data Mining System: User's Manual." Department of Computer Science, Katholieke Universiteit Leuven, Belgium, October 2001.

[Blockeel and De Raedt, 1997] Blockeel, H. and De Raedt, L. "Top-down Induction of Logical Decision Trees." Department of Computer Science, Katholieke Universiteit Leuven, Belgium, January 1997.

[De Raedt and Blockeel, 1997] De Raedt, L. and Blockeel, H. "Using Logical Decision Trees for Clustering." Proceedings of the Seventh International Workshop on Inductive Logic Programming, 133-141, 1997.

[Džeroski et al., 1998] Džeroski, S., De Raedt, L., and Blockeel, H. "Relational Reinforcement Learning." Proceedings of the Fifteenth International Conference on Machine Learning, 136-143. Morgan Kaufmann, 1998.

[Džeroski et al., 2001] Džeroski, S., De Raedt, L., and Driessens, K. "Relational Reinforcement Learning." Machine Learning, 43:7-52. The Netherlands: Kluwer Academic Publishers, 2001.

[Finney et al., 2002] Finney, S., Gardiol, N., Kaelbling, L. P., and Oates, T. "Learning with Deictic Representation." Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge, MA, April 2002.

[Kaelbling et al., 1996] Kaelbling, L. P., Littman, M., and Moore, A. "Reinforcement Learning: A Survey." Journal of Artificial Intelligence Research, 4:237-285, 1996.

[Nilsson, 1996] Nilsson, N. J. Introduction to Machine Learning. Department of Computer Science, Stanford University, CA, September 1996. Unpublished.

-----------------------

Text content of the figures:

Figure 3 (Q-tree):

goal_on(A,B), goal_on(B,C), goal_on(C,table), numberofblocks(D)
on(A,B) ?
+--yes: [0.0]
+--no:  clear(A) ?
        +--yes: [1.0]
        +--no:  clear(B) ?
                +--yes: [0.9]
                +--no:  [0.81]

Figure 5 (P-tree):

goal_on(A,B), goal_on(B,C), goal_on(C,table), numberofblocks(D)
on(A,B) ?
+--yes: [nonoptimal]
+--no:  [optimal]

The remaining displaced material corresponds to the blocks-world training sequence of Figure 1 (actions move(c,table), move(b,c), move(a,b), move(a,table), with rewards 0, 0, 1, 0 and Q-values 0.81, 0.9, 1, 0), the relational example listings of Figures 2 and 4, the subgoal structure of Figure 6 (subgoals unstack, on(A,B), and clear(A) beneath the main goal orderedStack(a,b,c)), the state labels of Figure 7, and the learned P-trees of Figures 9, 12, and 13.
