
Causal Bandits: Learning Good Interventions via Causal Inference

Finnian Lattimore

Australian National University and Data61/NICTA


Tor Lattimore

University of Alberta


Mark D. Reid

Australian National University and Data61/NICTA


Abstract

We study the problem of using causal models to improve the rate at which good interventions can be learned online in a stochastic environment. Our formalism combines multi-arm bandits and causal inference to model a novel type of bandit feedback that is not exploited by existing approaches. We propose a new algorithm that exploits the causal feedback and prove a bound on its simple regret that is strictly better (in all quantities) than algorithms that do not use the additional causal information.

1 Introduction

Medical drug testing, policy setting, and other scientific processes are commonly framed and analysed in the language of sequential experimental design and, in special cases, as bandit problems (Robbins, 1952; Chernoff, 1959). In this framework, single actions (also referred to as interventions) from a pre-determined set are repeatedly performed in order to evaluate their effectiveness via feedback from a single, real-valued reward signal. We propose a generalisation of the standard model by assuming that, in addition to the reward signal, the learner observes the values of a number of covariates drawn from a probabilistic causal model (Pearl, 2000). Causal models are commonly used in disciplines where explicit experimentation may be difficult, such as social science, demography, and economics: for example, when predicting the effect of changes to childcare subsidies on workforce participation, or of school choice on grades. Results from causal inference relate observational distributions to interventional ones, allowing the outcome of an intervention to be predicted without explicitly performing it. By exploiting the causal information we show, theoretically and empirically, how non-interventional observations can be used to improve the rate at which high-reward actions can be identified.

The type of problem we are concerned with is best illustrated with an example. Consider a farmer wishing to optimise the yield of her crop. She knows that crop yield is only affected by temperature, a particular soil nutrient, and moisture level, but the precise effect of their combination is unknown. In each season the farmer has enough time and money to intervene and control at most one of these variables: deploying shade or heat lamps will set the temperature to be low or high; the nutrient can be added or removed through a choice of fertilizer; and irrigation or rain-proof covers will keep the soil wet or dry. When not intervened upon, the temperature, soil, and moisture vary naturally from season to season due to weather conditions, and these are all observed along with the final crop yield at the end of each season. How might the farmer best experiment to identify the single highest-yielding intervention in a limited number of seasons?
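
To make the setup concrete, the farmer's experiments form the action set of a bandit problem. The short Python sketch below (a minimal illustration; the variable names are ours, not from the paper) enumerates them: fixing one of the three binary variables to one of its two values, or performing the empty intervention do() and letting every variable vary naturally.

    # Enumerate the farmer's action set: do() plus do(V = v) for each of the
    # three binary variables. Names are illustrative, not from the paper.
    variables = ["temperature", "nutrient", "moisture"]

    actions = [("do()", None, None)]   # empty intervention: observe only
    for var in variables:
        for value in (0, 1):           # 0 = low/removed/dry, 1 = high/added/wet
            actions.append((f"do({var}={value})", var, value))

    print(len(actions))                # 7 = 1 + 2 * 3 actions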

arXiv:1606.03203v1 [stat.ML] 10 Jun 2016

Contributions We take the first step towards formalising and solving problems such as the one above. In §2 we formally introduce causal bandit problems, in which interventions are treated as arms in a bandit problem but their influence on the reward, along with any other observations, is assumed to conform to a known causal graph. We show that our causal bandit framework subsumes the classical bandit (no additional observations) and contextual stochastic bandit (observations are revealed before an intervention is chosen) problems before focusing on the case where, as in the above example, observations occur after each intervention is made.

Our focus is on the simple regret, which measures the difference between the return of the optimal action and that of the action chosen by the algorithm after T rounds. In §3 we analyse a specific family of causal bandit problems that we call parallel bandit problems, in which N factors affect the reward independently and there are 2N possible interventions. We propose a simple causal best-arm identification algorithm for this problem and show that, up to logarithmic factors, it enjoys minimax optimal simple regret guarantees of Θ̃(√(m/T)), where m depends on the causal model and may be much smaller than N. In contrast, existing best-arm identification algorithms suffer Ω(√(N/T)) simple regret (Thm. 4 by Audibert and Bubeck, 2010). This shows theoretically the value of our framework over the traditional bandit problem. Experiments in §5 further demonstrate the value of causal models in this framework.
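
Intuitively, the gain in the parallel setting comes from the fact that a single round, even one with no intervention, reveals the values of all N factors at once, and so simultaneously informs estimates of E[Y | do(X_i = x)] for every arm consistent with what was observed. The simulation below is our own illustrative construction (the choices of N, q, and the reward function are assumptions, not the paper's benchmark):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10                                   # independent binary factors (assumed)
    q = rng.uniform(0.1, 0.9, size=N)        # P(X_i = 1) when X_i is not intervened on

    def pull(arm):
        """One round: arm = (i, x) means do(X_i = x); arm = None means do()."""
        X = (rng.random(N) < q).astype(int)  # factors vary naturally
        if arm is not None:
            i, x = arm
            X[i] = x                         # the intervention overrides nature
        # Illustrative reward: a Bernoulli whose mean depends only on X_0.
        Y = int(rng.random() < (0.6 if X[0] == 1 else 0.3))
        return X, Y

    # A single do() round yields evidence about the arm (i, X[i]) for every i:
    X, Y = pull(None)
    print("observed X =", X, "Y =", Y)
    print("arms updated:", [(i, int(X[i])) for i in range(N)])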

In the general causal bandit problem interventions and observations may have a complex relationship. In §4 we propose a new algorithm inspired by importance-sampling that (a) enjoys sub-linear regret equivalent to the optimal rate in the parallel bandit setting and (b) captures many of the intricacies of sharing information in a causal graph in the general case. As in the parallel bandit case, the regret guarantee scales like O(√(m/T)), where m depends on the underlying causal structure, with smaller values corresponding to structures that are easier to learn. The value of m is always less than the number of interventions N, and in the special case of the parallel bandit (where we have lower bounds) the notions are equivalent.
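
The estimator at the heart of such importance-sampling approaches can be sketched generically: rewards collected under one sampling distribution are reweighted by the ratio of target to sampling probabilities, with the weights truncated at a level B to control variance. The snippet below is a generic sketch of this estimator family (the numbers and the choice of B are ours; §4 specifies the algorithm's actual weighting):

    import numpy as np

    def truncated_iw_mean(rewards, weights, B):
        """Estimate a mean under a target distribution from samples drawn under
        another one. `weights` are the ratios P_target / P_sampling; clipping
        them at B trades a small bias for a large reduction in variance."""
        return np.mean(np.minimum(weights, B) * rewards)

    # Toy usage: estimate E[Y | do(X1 = 1)] from observational rounds in which
    # X1 = 1 occurred naturally with probability 0.25 (numbers illustrative).
    rng = np.random.default_rng(1)
    x1 = rng.random(10_000) < 0.25
    y = (rng.random(10_000) < np.where(x1, 0.6, 0.3)).astype(float)
    ratio = np.where(x1, 1 / 0.25, 0.0)        # only rounds with X1 = 1 contribute
    print(truncated_iw_mean(y, ratio, B=5.0))  # near 0.6; smaller B biases down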

Related Work As alluded to above, causal bandit problems can be treated as classical multi-armed bandit problems by simply ignoring the causal model and extra observations and applying an existing best-arm identification algorithm with well understood simple regret guarantees (Jamieson et al., 2014). However, as we show in §3, ignoring the extra information available in the non-intervened variables yields sub-optimal performance.

A well-studied class of bandit problems with side information are “contextual bandits” (Langford and Zhang, 2008; Agarwal et al., 2014). Our framework bears a superficial similarity to contextual bandit problems, since the extra observations on non-intervened variables might be viewed as context for selecting an intervention. However, a crucial difference is that in our model the extra observations are only revealed after selecting an intervention and hence cannot be used as context.

There have been several proposals for bandit problems where extra feedback is received after an action is taken. Most recently, Alon et al. (2015) and Kocák et al. (2014) have considered very general models related to partial monitoring games (Bartók et al., 2014) where rewards on unplayed actions are revealed according to a feedback graph. As we discuss in §6, the parallel bandit problem can be captured in this framework; however, the regret bounds are not optimal in our setting. They also focus on cumulative regret, which cannot be used to guarantee low simple regret (Bubeck et al., 2009). The partial monitoring approach taken by Wu et al. (2015) could be applied (up to modifications for the simple regret) to the parallel bandit, but the resulting strategy would need to know the likelihood of each factor in advance, while our strategy learns this online. Yu and Mannor (2009) utilize extra observations to detect changes in the reward distribution, whereas we assume fixed reward distributions and use extra observations to improve arm selection. Avner et al. (2012) analyse bandit problems where the choice of arm to pull and arm to receive feedback on are decoupled. The main difference from our present work is our focus on simple regret and the more complex information linking rewards for different arms via causal graphs. To the best of our knowledge, our paper is the first to analyse simple regret in bandit problems with extra post-action feedback.

Two pieces of recent work also consider applying ideas from causal inference to bandit problems. Bareinboim et al. (2015) demonstrate that in the presence of confounding variables, the value that a variable would have taken had it not been intervened on can provide important contextual information. Their work differs from ours in many ways: for example, the focus is on cumulative regret, and the context is observed before the action is taken and cannot be controlled by the learning agent.


Ortega and Braun (2014) present an analysis and extension of Thompson sampling assuming actions are causal interventions. Their focus is on causal induction (i.e., learning an unknown causal model) instead of exploiting a known causal model. Combining their handling of causal induction with our analysis is left as future work.

The truncated importance weighted estimators used in §4 have been studied before in a causal framework by Bottou et al. (2013), where the focus is on learning from observational data rather than on controlling the sampling process. They also briefly discuss some of the issues encountered in sequential design, but do not give an algorithm or theoretical results for this case.

2 Problem Setup

We now introduce a novel class of stochastic sequential decision problems which we call causal bandit problems. In these problems, rewards are given for repeated interventions on a fixed causal model (Pearl, 2000). Following the terminology and notation in Koller and Friedman (2009), a causal model is given by a directed acyclic graph G over a set of random variables X = {X_1, . . . , X_N} and a joint distribution P over X that factorises over G. We will assume each variable only takes on a finite number of distinct values. An edge from variable X_i to X_j is interpreted to mean that a change in the value of X_i may directly cause a change to the value of X_j. The parents of a variable X_i, denoted Pa_{X_i}, are the variables X_j such that there is an edge from X_j to X_i in G. An intervention or action (of size n), denoted do(X = x), assigns the values x = {x_1, . . . , x_n} to the corresponding variables X = {X_1, . . . , X_n} ⊂ X, with the empty intervention (where no variable is set) denoted do(). The intervention also “mutilates” the graph G by removing all edges from Pa_{X_i} to X_i for each X_i ∈ X. The resulting graph defines a probability distribution P{X^c | do(X = x)} over X^c := X − X. Details can be found in Chapter 21 of Koller and Friedman (2009).
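
Operationally, sampling from P{X^c | do(X = x)} amounts to ancestral sampling in the mutilated graph: each variable is drawn from its conditional given its parents, except that intervened variables are simply set to their assigned values. The toy sampler below is our own sketch of this semantics (the three-variable graph and its conditional distributions are invented for illustration):

    import random

    # Toy causal model over {X1, X2, Y} with edges X1 -> Y <- X2.
    parents = {"X1": [], "X2": [], "Y": ["X1", "X2"]}
    cpd = {  # conditional samplers, one per variable (illustrative numbers)
        "X1": lambda pa: int(random.random() < 0.5),
        "X2": lambda pa: int(random.random() < 0.7),
        "Y":  lambda pa: int(random.random() < 0.2 + 0.3 * pa["X1"] + 0.4 * pa["X2"]),
    }
    topo_order = ["X1", "X2", "Y"]           # a topological order of the DAG

    def sample(do=None):
        """Draw one joint sample from P{. | do(.)} by ancestral sampling."""
        do = do or {}
        values = {}
        for v in topo_order:
            if v in do:
                values[v] = do[v]            # intervened: incoming edges are cut
            else:
                values[v] = cpd[v]({p: values[p] for p in parents[v]})
        return values

    print(sample())                          # observational sample, i.e. do()
    print(sample(do={"X1": 1}))              # interventional sample, do(X1 = 1)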

A learner for a causal bandit problem is given the causal model’s graph G and a set of allowed actions A. One variable Y ∈ X is designated as the reward variable and takes on values in {0, 1}. We denote the expected reward for the action a = do(X = x) by µ_a := E[Y | do(X = x)] and the optimal expected reward by µ* := max_{a ∈ A} µ_a. The causal bandit game proceeds over T rounds. In round t, the learner intervenes by choosing a_t = do(X_t = x_t) ∈ A based on previous observations. It then observes sampled values for all non-intervened variables X^c_t drawn from P{X^c_t | do(X_t = x_t)}, including the reward Y_t ∈ {0, 1}. After T observations the learner outputs an estimate of the optimal action, â*_T ∈ A, based on its prior observations. The objective of the learner is to minimise the simple regret, µ* − E[µ_{â*_T}], the expected gap between the optimal reward and that of its chosen action.
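
In pseudocode, the protocol is a plain bandit loop whose feedback happens to include the non-intervened variables. The schematic below fixes a uniform-exploration learner and a stand-in environment purely for illustration (both are our assumptions; the algorithms in §3 and §4 choose actions far more carefully):

    import random

    A = ["do()", "do(X1=0)", "do(X1=1)"]                 # allowed actions (toy)
    mu = {"do()": 0.4, "do(X1=0)": 0.3, "do(X1=1)": 0.6} # assumed arm means

    def environment(action):
        """Return (non-intervened observations, reward) for one round."""
        y = int(random.random() < mu[action])
        covariates = {}                                  # stand-in for X^c_t
        return covariates, y

    T = 300
    stats = {a: [0, 0] for a in A}                       # per-arm [pulls, reward sum]
    for t in range(T):
        a = A[t % len(A)]                                # uniform exploration (toy)
        _, y = environment(a)
        stats[a][0] += 1
        stats[a][1] += y

    best_guess = max(A, key=lambda a: stats[a][1] / stats[a][0])
    print("chosen arm:", best_guess)
    print("simple regret:", max(mu.values()) - mu[best_guess])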
