To make our analysis more flexible, for each run of each algorithm we store its computation times $(B_i)_{-1 \le i}$, with $i$ indexing the time step and $B_{-1}$ denoting the offline learning time. A feature function $\phi((B_i)_{-1 \le i})$ is then extracted from this data. This function is used as a metric to characterise and discriminate algorithms based on their time requirements.

In our protocol, which is detailed in the next section, two types of characterisation are used. For one set of experiments, algorithms are classified based on their offline computation time only, i.e. we use $\phi((B_i)_{-1 \le i}) = B_{-1}$. The constraint is then defined as $\phi((B_i)_{-1 \le i}) \le K$, $K > 0$, when only the algorithms with an offline computation time of at most $K$ are to be compared. For another set of experiments, algorithms are separated according to their empirical average online computation time. In this case, $\phi((B_i)_{-1 \le i}) = \frac{1}{n} \sum_{0 \le i < n} B_i$, where $n$ is the number of online time steps. This formalisation could be used for any other computation time characterisation. For example, one could analyse algorithms based on the longest computation time of a trajectory, and define $\phi((B_i)_{-1 \le i}) = \max_{-1 \le i} B_i$.

3 A new Bayesian Reinforcement Learning benchmark protocol

3.1 A comparison criterion for BRL

In this paper, a real Bayesian evaluation is proposed, in the sense that the different algorithms are compared on a large set of problems drawn according to a test probability distribution. This is in contrast with the Bayesian literature [5?], where authors pick a fixed number of MDPs on which they evaluate their algorithm. Our criterion for comparing algorithms is their average reward against a given distribution of MDPs, using another distribution of MDPs as prior knowledge.

In our experimental protocol, an experiment is defined by a prior distribution $p^0_M(\cdot)$ and a test distribution $p_M(\cdot)$. Both are probability distributions over the set of possible MDPs, not stochastic transition functions. To illustrate the difference, let us take an example. Let $(x, u, x')$ be a transition. Given a transition function $f: X \times U \times X \to [0; 1]$, $f(x, u, x')$ is the probability of observing $x'$ if we choose $u$ in $x$. In this paper, this function $f$ is assumed to be the only unknown part of the MDP that the agent faces. Given a certain test case, $f$ corresponds to a unique MDP $M \in \mathcal{M}$. A Bayesian learning problem is then defined by a probability distribution over a set $\mathcal{M}$ of possible MDPs. We call it a test distribution, and denote it $p_M(\cdot)$. Prior knowledge can then be encoded as another distribution over $\mathcal{M}$, denoted $p^0_M(\cdot)$. We call "accurate" a prior which is identical to the test distribution ($p^0_M(\cdot) = p_M(\cdot)$), and we call "inaccurate" a prior which is different ($p^0_M(\cdot) \neq p_M(\cdot)$).

In practice, the "accurate" case is optimistic, in the sense that perfect knowledge of the test distribution is generally a strong assumption. We decided to include a more realistic setting with the "inaccurate" case, by considering a test distribution slightly different from the prior distribution. This will help us to identify which algorithms are more robust to initialisation errors.

More precisely, our protocol can be described as follows. Each algorithm is first trained on the prior distribution. Then, its performance is evaluated by estimating the expectation of the discounted sum of rewards when it faces MDPs drawn from the test distribution. Let $J^{\pi(p^0_M)}_{p_M}$ be this value: $J^{\pi(p^0_M)}_{p_M} = \mathbb{E}_{M \sim p_M(\cdot)} \left[ J^{\pi(p^0_M)}_M \right]$, where $\pi(p^0_M)$ denotes the algorithm $\pi$ trained offline on $p^0_M(\cdot)$ and $J^{\pi(p^0_M)}_M$ is its expected discounted sum of rewards on the MDP $M$.
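As an illustration of the computation-time characterisation, the three feature functions $\phi$ mentioned in the text (offline time, average online time, longest step) can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the list layout with $B_{-1}$ stored at index 0, the function names and the sample timings are all assumptions made for the example.

```python
from typing import Callable, Sequence

# Assumed layout (not from the source): times[0] holds B_{-1}, the offline
# learning time, and times[1:] holds the online times B_0, B_1, ...
Times = Sequence[float]


def phi_offline(times: Times) -> float:
    """phi = B_{-1}: offline computation time only."""
    return times[0]


def phi_mean_online(times: Times) -> float:
    """phi = (1/n) * sum of B_i for 0 <= i < n: average online time."""
    online = times[1:]
    if not online:
        raise ValueError("at least one online time step is required")
    return sum(online) / len(online)


def phi_max(times: Times) -> float:
    """phi = max of B_i for i >= -1: longest single computation time."""
    return max(times)


def satisfies_constraint(times: Times, phi: Callable[[Times], float], k: float) -> bool:
    """Check the constraint phi((B_i)) <= K, with K > 0."""
    if k <= 0:
        raise ValueError("K must be strictly positive")
    return phi(times) <= k


# Hypothetical timings (in seconds) for two runs.
runs = {
    "agent_a": [2.0, 0.5, 0.25, 0.75],
    "agent_b": [10.0, 0.125, 0.125, 0.125],
}

# Keep only the runs whose offline time is at most K = 5 seconds.
kept = [name for name, t in runs.items() if satisfies_constraint(t, phi_offline, 5.0)]
```

Swapping `phi_offline` for `phi_mean_online` or `phi_max` changes the characterisation without touching the filtering logic, which is the flexibility the formalisation is meant to provide.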
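The evaluation protocol (train on the prior, then average discounted returns over MDPs drawn from the test distribution) can be approximated by Monte Carlo sampling. The sketch below is illustrative only and rests on several assumptions that are not in the text: both distributions are products of Dirichlet distributions over the transition function, the reward function is known, and `train_offline` is a hypothetical stand-in that acts greedily on the prior-mean model and does not keep learning online, unlike a real BRL agent.

```python
import random
from typing import Callable, List

Alpha = List[List[List[float]]]  # alpha[x][u][x']: Dirichlet parameters (assumed)
Model = List[List[List[float]]]  # f[x][u][x']: transition probabilities
Reward = Callable[[int, int, int], float]


def sample_mdp(alpha: Alpha, rng: random.Random) -> Model:
    """Draw one transition function f from a product of Dirichlet distributions."""
    model = []
    for per_state in alpha:
        rows = []
        for params in per_state:
            draws = [rng.gammavariate(a, 1.0) for a in params]
            total = sum(draws)
            rows.append([d / total for d in draws])
        model.append(rows)
    return model


def train_offline(prior: Alpha, reward: Reward) -> Callable[[int], int]:
    """Hypothetical offline phase: greedy one-step policy on the prior mean."""
    def expected_reward(x: int, u: int) -> float:
        total = sum(prior[x][u])
        return sum(a / total * reward(x, u, y) for y, a in enumerate(prior[x][u]))

    best = [
        max(range(len(prior[x])), key=lambda u: expected_reward(x, u))
        for x in range(len(prior))
    ]
    return lambda x: best[x]


def discounted_return(f: Model, reward: Reward, policy: Callable[[int], int],
                      gamma: float, horizon: int, rng: random.Random) -> float:
    """Simulate one truncated trajectory from state 0 and sum discounted rewards."""
    x, total, discount = 0, 0.0, 1.0
    for _ in range(horizon):
        u = policy(x)
        y = rng.choices(range(len(f[x][u])), weights=f[x][u])[0]
        total += discount * reward(x, u, y)
        discount *= gamma
        x = y
    return total


def estimate_score(prior: Alpha, test: Alpha, reward: Reward, gamma: float = 0.95,
                   horizon: int = 50, n_mdps: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected return over MDPs drawn from `test`."""
    rng = random.Random(seed)
    policy = train_offline(prior, reward)
    returns = [
        discounted_return(sample_mdp(test, rng), reward, policy, gamma, horizon, rng)
        for _ in range(n_mdps)
    ]
    return sum(returns) / len(returns)


def reward(x: int, u: int, y: int) -> float:
    """Toy reward: 1 when the next state is state 1, else 0."""
    return 1.0 if y == 1 else 0.0


# Two states, two actions; under the prior, action 1 tends to lead to state 1.
prior = [[[8.0, 2.0], [2.0, 8.0]], [[8.0, 2.0], [2.0, 8.0]]]
shifted = [[[5.0, 5.0], [4.0, 6.0]], [[5.0, 5.0], [4.0, 6.0]]]

score_accurate = estimate_score(prior, prior, reward)      # prior equals test
score_inaccurate = estimate_score(prior, shifted, reward)  # prior differs from test
```

Comparing `score_accurate` with `score_inaccurate` for the same agent gives a rough picture of how much a mismatch between prior and test distribution costs in this toy setting. Truncating trajectories at `horizon` steps is a further approximation of the discounted sum.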
