Reinforcement Learning

Outline
- Introduction to Reinforcement Learning (RL)
- Passive RL
- Active RL
- Generalization in RL

Introduction to RL
- RL = learning what to do without labelled examples.
- The only feedback is a reward received for performing actions.
- Example: in chess, we are not told whether an individual move is good, but we do know when we end up with a won game.
- Rewards are part of the percept input, but the agent must explicitly recognize the reward part.
- We consider the MDP model; the goal is to maximize expected total reward.
- The task of reinforcement learning is to use the observed rewards to learn an optimal policy for the environment.
- Three agent design options:
  - Utility-based agent: learns a utility function on states and uses it to select actions that maximize expected utility.
  - Q-learning agent: learns an action-utility function (a Q-function) giving the expected utility of taking a given action in a given state.
  - Reflex agent: learns a policy that maps directly from states to actions.
- Two types of learning:
  - Passive: the agent's policy is fixed; it only learns utilities.
  - Active: the agent must also learn what to do (involves exploration).

Passive RL
- The policy π is fixed: in state s the agent always executes π(s).
- Goal: learn how well the policy performs, i.e. learn the utility function
  U^π(s) = E[ Σ_{t≥0} γ^t R(S_t) ],  with S_0 = s.
- [Figures: an example policy for the 4x3 world and the corresponding utilities of states.]
- The passive learning task resembles the policy evaluation task (part of the policy iteration algorithm).
- Main difference: the transition model P(s' | s, π(s)) and the reward function R(s) are unknown; only the policy π is known. We assume full observability.
- To check the policy's performance, the agent executes a series of trials using π, e.g.
  (1,1) → (1,2) → (1,3) → (1,2) → (1,3) → (2,3) → (3,3) → (4,3)+1,
  receiving reward −0.04 in each non-terminal state.
- In each trial the agent executes the action prescribed by π, perceives the reward and the new state, and repeats until it reaches a terminal state.
- The sequence of states and rewards is used to learn U^π; utility = expected sum of (discounted) rewards.

Direct utility estimation (Widrow and Hoff, 1960)
- Idea: the utility of a state is the expected total reward from that state onwards (the expected reward-to-go); each trial provides a sample of this quantity for each state visited.
- Example of sampled total rewards from the first trial above:
  (1,1) → 0.72, (1,2) → 0.76, (1,3) → 0.80, (1,2) → 0.84, (1,3) → 0.88, …
- The data are mappings from states to utilities, so RL is reduced to supervised learning (a regression task):
  - many known techniques can solve it;
  - a different representation of states can be used to aggregate actual states (more samples per state).
  (A minimal code sketch of this estimator is given below.)
- Problem: it ignores the fact that the utilities of states are not independent, so convergence is slow.
- Utility values are constrained by the Bellman equations
  U^π(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U^π(s'),
  so the hypothesis space can be constrained.
- Example from trial 2: the trial reaches (3,2) for the first time (no utility samples) and then reaches (3,3) for the third time (which already has samples with high values); yet direct estimation cannot estimate the value of (3,2) until the end of the trial, even though it obviously should be high, since U(3,3) is high.

Adaptive Dynamic Programming (ADP)
- Takes advantage of the constraints imposed by the transition model.
- Learns the transition model P(s' | s, π(s)) and the rewards R(s) from the observed transitions, then solves the corresponding MDP with a dynamic programming method.
- For passive RL, plug the learned model and rewards into the Bellman equations to compute the utilities of states.
- The equations are linear (no max operator), so they can be solved with linear algebra; modified policy iteration can also be used. (A code sketch of a passive ADP learner follows the direct-estimation sketch below.)
- [Figure: utility estimates for a selected subset of states as a function of the number of trials. Note the large changes around the 78th trial, the first time the agent falls into the −1 terminal state at (4,2).]
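A minimal sketch of direct utility estimation, assuming trials are represented as lists of (state, reward) pairs ending in a terminal state; the function name and trial format are illustrative, not taken from the slides:

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U^pi(s) as the average observed reward-to-go of s.

    Each trial is a list of (state, reward) pairs ending in a terminal state.
    """
    totals = defaultdict(float)   # sum of reward-to-go samples per state
    counts = defaultdict(int)     # number of samples per state

    for trial in trials:
        # Compute the (discounted) reward-to-go of every position by scanning
        # the trial backwards: G_t = r_t + gamma * G_{t+1}.
        g = 0.0
        rewards_to_go = []
        for _, reward in reversed(trial):
            g = reward + gamma * g
            rewards_to_go.append(g)
        rewards_to_go.reverse()

        for (state, _), g_t in zip(trial, rewards_to_go):
            totals[state] += g_t
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in totals}

# The first trial of the 4x3 world, with reward -0.04 in each non-terminal state.
trial1 = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
          ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimation([trial1]))
# (1,1): 0.72; (1,2): 0.80 (average of 0.76 and 0.84); (1,3): 0.84; (2,3): 0.92; ...
```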
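The passive ADP idea can be sketched in the same trial format: count transitions to get a maximum-likelihood model, then evaluate the fixed policy on that model. The slides solve the linear Bellman system directly (or with modified policy iteration); the sketch below simply iterates the equations, and the function names are illustrative:

```python
from collections import defaultdict

def estimate_model(trials):
    """Maximum-likelihood estimate of P(s' | s) under the fixed policy, plus R(s)."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[s][s'] = #times s -> s'
    rewards = {}                                     # reward observed in each state
    for trial in trials:
        for (s, r), (s2, _) in zip(trial, trial[1:]):
            counts[s][s2] += 1
            rewards[s] = r
        terminal, terminal_reward = trial[-1]
        rewards[terminal] = terminal_reward
    model = {s: {s2: n / sum(successors.values()) for s2, n in successors.items()}
             for s, successors in counts.items()}
    return model, rewards

def evaluate_policy(model, rewards, gamma=1.0, sweeps=100):
    """Policy evaluation on the learned model:
    U(s) = R(s) + gamma * sum_s' P(s' | s) U(s'), a linear system iterated here."""
    U = {s: 0.0 for s in rewards}
    for _ in range(sweeps):
        U = {s: rewards[s] + gamma * sum(p * U[s2]
                                         for s2, p in model.get(s, {}).items())
             for s in rewards}
    return U
```

Calling `evaluate_policy(*estimate_model(trials))` gives utilities that satisfy the learned Bellman system, rather than the raw sample averages of direct estimation.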
- [Figure: the root-mean-square error of the ADP estimate for U(1,1), averaged over 20 runs of 100 trials each.]

Bayesian reinforcement learning
- ADP only computes a maximum-likelihood estimate of the transition model.
- Bayesian reinforcement learning instead assumes a prior P(h) over each hypothesis model h, estimates the posterior P(h | e) by Bayes' rule from the observations e, and then computes the policy that is optimal in expectation over models:
  π* = argmax_π Σ_h P(h | e) u_h^π,  where u_h^π is the expected utility of π in model h.
- Bayesian RL is usually intractable. A more practical solution, based on robust control theory, keeps a set of possible models H and defines an optimal robust policy as one that gives the best outcome in the worst case over H:
  π* = argmax_π min_{h ∈ H} u_h^π.

Temporal-Difference (TD) Learning
- Updates utilities based on the difference between successive states:
  U^π(s) ← U^π(s) + α ( R(s) + γ U^π(s') − U^π(s) ),  where α is the learning rate.
- Does not explicitly solve the MDP and does not keep track of a transition model. (The update is sketched in code after the SARSA slide below.)
- Intuition: suppose the transition (1,3) → (2,3) occurs, and the trials so far give U^π(1,3) = 0.84 and U^π(2,3) = 0.92. If this transition always happened, we should have U^π(1,3) = −0.04 + U^π(2,3) = 0.88, so the current estimate of 0.84 may be too low; the update nudges it upward.
- Note: the learning rate α must be a decreasing function of the number of times a state has been visited to guarantee convergence, e.g. α(n) = 1/n, which satisfies Σ_n α(n) = ∞ and Σ_n α(n)² < ∞.

Active RL
- An active RL agent must decide which actions to take.
- It needs to learn a complete model, with outcome probabilities for all actions, rather than just the model for a fixed policy, and it must take the choice of actions into account.
- Goal: learn the utilities of an optimal policy, i.e. utilities that obey the Bellman equations
  U(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U(s').

Exploration vs. exploitation
- The agent cannot simply choose actions based on the currently learned model (it cannot be a purely greedy agent): this may never yield accurate estimates for certain transitions (biased samples).
- For optimal performance the agent must explore the environment, which creates the exploration vs. exploitation tradeoff.
- [Figures: performance of the greedy agent, and the policy obtained by the greedy approach.]
- What is the optimal tradeoff? It is an open problem in general, although it can be solved in special cases (bandit problems).
- A solution must be greedy in the limit of infinite exploration, or GLIE. Example of a GLIE scheme: with probability 1/t choose a random action; the rest of the time follow the greedy policy.
- An improvement on the simple GLIE scheme: weight actions the agent has not tried very often more highly, while tending to avoid actions believed to be of low utility. Formally, assign a higher (optimistic) estimate U⁺(s) to unexplored state-action pairs:
  U⁺(s) ← R(s) + γ max_a f( Σ_{s'} P(s' | s, a) U⁺(s'), N(s, a) ),
  where f(u, n) is the exploration function, e.g. f(u, n) = R⁺ if n < N_e and u otherwise. (A small code sketch appears after the SARSA slide below.)
- [Figures: performance of the exploratory ADP agent; error rate and policy loss for the exploratory ADP agent.]

Active TD Learning
- Must learn a transition model in order to make action choices; the utility update formula itself can stay the same.
- Can be shown to converge to the same values as ADP in the limit.

Q-Learning
- Similar to a TD agent, but learns an action-utility function (a Q-function), with U(s) = max_a Q(s, a), rather than a utility function.
- Model-free property: a TD agent that learns a Q-function does not need a model of the form P(s' | s, a), either for learning or for action selection.
- Equilibrium condition (the Bellman equation for Q-values):
  Q(s, a) = R(s) + γ Σ_{s'} P(s' | s, a) max_{a'} Q(s', a').
  Direct use of this equation as an update would require the transition model; instead, we can use the TD approach:
  Q(s, a) ← Q(s, a) + α ( R(s) + γ max_{a'} Q(s', a') − Q(s, a) ).

State-Action-Reward-State-Action (SARSA)
- An alternative to Q-learning with a similar update rule:
  Q(s, a) ← Q(s, a) + α ( R(s) + γ Q(s', a') − Q(s, a) ),
  where a' is the action actually taken in s'.
- Q-learning backs up the best Q-value from the state reached in the observed transition; SARSA waits until an action is actually taken and backs up the Q-value for that action. (Both updates are sketched below.)
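A small sketch of an exploration function of the kind described above; R_PLUS (an optimistic bound on the best obtainable reward) and N_E (a visit-count threshold) are assumed constants, not values given in the slides:

```python
R_PLUS = 1.0   # assumed optimistic estimate of the best reward obtainable in any state
N_E = 5        # assumed number of times we want to try each state-action pair

def exploration_f(u, n):
    """Exploration function f(u, n): be optimistic about rarely tried actions.

    Plugged into the backup of the exploratory ADP agent:
        U+(s) <- R(s) + gamma * max_a f( sum_s' P(s'|s,a) * U+(s'), N(s, a) )
    """
    return R_PLUS if n < N_E else u
```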
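The passive TD, Q-learning, and SARSA updates above differ only in their backup target. A minimal sketch with tabular dictionaries (missing entries default to 0) and a fixed learning rate; the function names and default parameters are illustrative:

```python
# r is the reward observed in s; at a terminal s2 the successor value is taken as 0
# because the dictionaries default to 0 for unseen entries.

def td_update(U, s, r, s2, alpha=0.1, gamma=1.0):
    """Passive TD: move U(s) toward the sampled target r + gamma * U(s')."""
    U[s] = U.get(s, 0.0) + alpha * (r + gamma * U.get(s2, 0.0) - U.get(s, 0.0))

def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=1.0):
    """Q-learning (off-policy): back up the best Q-value available in s'."""
    best_next = max(Q.get((s2, a2), 0.0) for a2 in actions)
    Q[s, a] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=1.0):
    """SARSA (on-policy): back up the Q-value of the action a2 actually taken in s'."""
    Q[s, a] = Q.get((s, a), 0.0) + alpha * (r + gamma * Q.get((s2, a2), 0.0)
                                            - Q.get((s, a), 0.0))
```

For a greedy agent that always picks the argmax action, the SARSA backup coincides with the Q-learning backup, which is the starting point of the comparison in the next slide.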
SARSA vs. Q-Learning
- The two updates are the same for a greedy agent that always selects the action with the best Q-value.
- Q-learning does not pay attention to the policy being followed (it is off-policy); SARSA is an on-policy algorithm.
- Because it is off-policy, Q-learning is more flexible than SARSA, in the sense that a Q-learning agent can learn how to behave well even when guided by a random or adversarial exploration policy. On the other hand, SARSA is more realistic: for example, if the overall policy is even partly controlled by other agents, it is better to learn a Q-function for what will actually happen rather than for what the agent would like to happen.

Generalization in RL
- Real-world applications (backgammon, chess) have too many states and actions; storing all U or Q values in a table is intractable.
- Solution: function approximation, i.e. a representation more compact than a lookup table, like the evaluation function used in minimax search.
- A common representation is a weighted linear function of features:
  Û_θ(s) = θ_1 f_1(s) + θ_2 f_2(s) + … + θ_n f_n(s).
- Instead of learning U or Q values directly, RL can learn the parameters θ of the functional approximation:
  - the number of parameters is usually small (e.g. around 20 weights instead of one table entry per state);
  - the state/action space is compressed into a small number of features;
  - the learning agent can generalize from states it has visited to states it has not visited, so we do not have to sample every state.
- Note: the approximating function should be easy to compute and compact, otherwise there is no benefit.
- Simplest case: direct utility estimation becomes supervised learning; for a linear function it is a linear regression problem.
- Example for the 4x3 world with three parameters:
  Û_θ(x, y) = θ_0 + θ_1 x + θ_2 y.
  Given observed reward-to-go samples u_j(s), the values of θ can be found by minimizing the squared error Σ_j ( u_j(s) − Û_θ(s) )².
- Online approach: adjust the parameters after each trial. Define the error for the j-th sample as
  E_j(s) = ( Û_θ(s) − u_j(s) )² / 2.
  The update rule (the Widrow-Hoff or delta rule) is then
  θ_i ← θ_i − α ∂E_j(s)/∂θ_i = θ_i + α ( u_j(s) − Û_θ(s) ) ∂Û_θ(s)/∂θ_i.
  (The 2 in the denominator of the error is there for convenience; it cancels when differentiating.)
- For the three parameters above:
  θ_0 ← θ_0 + α ( u_j(s) − Û_θ(s) )
  θ_1 ← θ_1 + α ( u_j(s) − Û_θ(s) ) x
  θ_2 ← θ_2 + α ( u_j(s) − Û_θ(s) ) y
- TD-learning version:
  θ_i ← θ_i + α ( R(s) + γ Û_θ(s') − Û_θ(s) ) ∂Û_θ(s)/∂θ_i.
  Q-learning version:
  θ_i ← θ_i + α ( R(s) + γ max_{a'} Q̂_θ(s', a') − Q̂_θ(s, a) ) ∂Q̂_θ(s, a)/∂θ_i.
  (A short code sketch of these updates appears at the end of this section.)

Issues
- Convergence: well understood for linear approximators, unclear for non-linear ones; yet it is common to use a neural network, which is non-linear.
- Partial observability: an open problem; a Bayes net can be used if the structure is known.
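A minimal sketch of the linear approximation and updates above for the 4x3 world, using the features (1, x, y); the learning rate and function names are assumptions for illustration:

```python
def u_hat(theta, state):
    """Linear approximation U_theta(x, y) = theta0 + theta1 * x + theta2 * y."""
    x, y = state
    return theta[0] + theta[1] * x + theta[2] * y

def features(state):
    x, y = state
    return (1.0, x, y)   # also the gradient of u_hat with respect to theta

def widrow_hoff_update(theta, state, u_sample, alpha=0.05):
    """Delta rule: theta_i += alpha * (u_j(s) - U_theta(s)) * dU_theta/dtheta_i."""
    error = u_sample - u_hat(theta, state)
    return [t + alpha * error * f for t, f in zip(theta, features(state))]

def td_update_linear(theta, state, reward, next_state, alpha=0.05, gamma=1.0):
    """TD version: the target is r + gamma * U_theta(s') instead of a whole-trial sample."""
    error = reward + gamma * u_hat(theta, next_state) - u_hat(theta, state)
    return [t + alpha * error * f for t, f in zip(theta, features(state))]
```

With these three features the per-parameter updates reduce to the θ_0, θ_1, θ_2 rules listed above.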