
Causal Bandits: Learning Good Interventions via Causal Inference

Finnian Lattimore

Australian National University and Data61/NICTA


Tor Lattimore

University of Alberta


Mark D. Reid

Australian National University and Data61/NICTA


Abstract

We study the problem of using causal models to improve the rate at which good interventions can be learned online in a stochastic environment. Our formalism combines multi-arm bandits and causal inference to model a novel type of bandit feedback that is not exploited by existing approaches. We propose a new algorithm that exploits the causal feedback and prove a bound on its simple regret that is strictly better (in all quantities) than algorithms that do not use the additional causal information.

1 Introduction

Medical drug testing, policy setting, and other scientific processes are commonly framed and analysed in the language of sequential experimental design and, in special cases, as bandit problems (Robbins, 1952; Chernoff, 1959). In this framework, single actions (also referred to as interventions) from a pre-determined set are repeatedly performed in order to evaluate their effectiveness via feedback from a single, real-valued reward signal. We propose a generalisation of the standard model by assuming that, in addition to the reward signal, the learner observes the values of a number of covariates drawn from a probabilistic causal model (Pearl, 2000). Causal models are commonly used in disciplines where explicit experimentation may be difficult, such as social science, demography, and economics: for example, when predicting the effect of changes to childcare subsidies on workforce participation, or of school choice on grades. Results from causal inference relate observational distributions to interventional ones, allowing the outcome of an intervention to be predicted without explicitly performing it. By exploiting the causal information we show, theoretically and empirically, how non-interventional observations can be used to improve the rate at which high-reward actions can be identified.

The type of problem we are concerned with is best illustrated with an example. Consider a farmer wishing to optimise the yield of her crop. She knows that crop yield is only affected by temperature, a particular soil nutrient, and moisture level, but the precise effect of their combination is unknown. In each season the farmer has enough time and money to intervene and control at most one of these variables: deploying shade or heat lamps will set the temperature to be low or high; the nutrient can be added or removed through a choice of fertilizer; and irrigation or rain-proof covers will keep the soil wet or dry. When not intervened upon, the temperature, soil, and moisture vary naturally from season to season due to weather conditions, and these are all observed along with the final crop yield at the end of each season. How might the farmer best experiment to identify the single highest-yielding intervention in a limited number of seasons?
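
To make the setup concrete, the farmer's experiments form the action set of a bandit problem. The short Python sketch below (a minimal illustration; the variable names are ours, not from the paper) enumerates them: fixing one of the three binary variables to one of its two values, or performing the empty intervention do() and letting every variable vary naturally.

    # Enumerate the farmer's action set: do() plus do(V = v) for each of the
    # three binary variables. Names are illustrative, not from the paper.
    variables = ["temperature", "nutrient", "moisture"]

    actions = [("do()", None, None)]   # empty intervention: observe only
    for var in variables:
        for value in (0, 1):           # 0 = low/removed/dry, 1 = high/added/wet
            actions.append((f"do({var}={value})", var, value))

    print(len(actions))                # 7 = 1 + 2 * 3 actions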

arXiv:1606.03203v1 [stat.ML] 10 Jun 2016

Contributions We take the first step towards formalising and solving problems such as the one above. In §2 we formally introduce causal bandit problems, in which interventions are treated as arms in a bandit problem but their influence on the reward, along with any other observations, is assumed to conform to a known causal graph. We show that our causal bandit framework subsumes the classical bandit (no additional observations) and contextual stochastic bandit (observations are revealed before an intervention is chosen) problems before focusing on the case where, as in the above example, observations occur after each intervention is made.

Our focus is on the simple regret, which measures the difference between the return of the optimal action and that of the action chosen by the algorithm after T rounds. In §3 we analyse a specific family of causal bandit problems that we call parallel bandit problems, in which N factors affect the reward independently and there are 2N possible interventions. We propose a simple causal best-arm identification algorithm for this problem and show that, up to logarithmic factors, it enjoys minimax optimal simple regret guarantees of Θ̃(√(m/T)), where m depends on the causal model and may be much smaller than N. In contrast, existing best-arm identification algorithms suffer Ω(√(N/T)) simple regret (Thm. 4 by Audibert and Bubeck, 2010). This shows theoretically the value of our framework over the traditional bandit problem. Experiments in §5 further demonstrate the value of causal models in this framework.
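
Intuitively, the gain in the parallel setting comes from the fact that a single round, even one with no intervention, reveals the values of all N factors at once, and so simultaneously informs estimates of E[Y | do(X_i = x)] for every arm consistent with what was observed. The simulation below is our own illustrative construction (the choices of N, q, and the reward function are assumptions, not the paper's benchmark):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 10                                   # independent binary factors (assumed)
    q = rng.uniform(0.1, 0.9, size=N)        # P(X_i = 1) when X_i is not intervened on

    def pull(arm):
        """One round: arm = (i, x) means do(X_i = x); arm = None means do()."""
        X = (rng.random(N) < q).astype(int)  # factors vary naturally
        if arm is not None:
            i, x = arm
            X[i] = x                         # the intervention overrides nature
        # Illustrative reward: a Bernoulli whose mean depends only on X_0.
        Y = int(rng.random() < (0.6 if X[0] == 1 else 0.3))
        return X, Y

    # A single do() round yields evidence about the arm (i, X[i]) for every i:
    X, Y = pull(None)
    print("observed X =", X, "Y =", Y)
    print("arms updated:", [(i, int(X[i])) for i in range(N)])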

In the general causal bandit problem interventions and observations may have a complex relationship. In §4 we propose a new algorithm inspired by importance-sampling that (a) enjoys sub-linear regret equivalent to the optimal rate in the parallel bandit setting and (b) captures many of the intricacies of sharing information in a causal graph in the general case. As in the parallel bandit case, the regret guarantee scales like O(√(m/T)), where m depends on the underlying causal structure, with smaller values corresponding to structures that are easier to learn. The value of m is always less than the number of interventions N, and in the special case of the parallel bandit (where we have lower bounds) the notions are equivalent.
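
The estimator at the heart of such importance-sampling approaches can be sketched generically: rewards collected under one sampling distribution are reweighted by the ratio of target to sampling probabilities, with the weights truncated at a level B to control variance. The snippet below is a generic sketch of this estimator family (the numbers and the choice of B are ours; §4 specifies the algorithm's actual weighting):

    import numpy as np

    def truncated_iw_mean(rewards, weights, B):
        """Estimate a mean under a target distribution from samples drawn under
        another one. `weights` are the ratios P_target / P_sampling; clipping
        them at B trades a small bias for a large reduction in variance."""
        return np.mean(np.minimum(weights, B) * rewards)

    # Toy usage: estimate E[Y | do(X1 = 1)] from observational rounds in which
    # X1 = 1 occurred naturally with probability 0.25 (numbers illustrative).
    rng = np.random.default_rng(1)
    x1 = rng.random(10_000) < 0.25
    y = (rng.random(10_000) < np.where(x1, 0.6, 0.3)).astype(float)
    ratio = np.where(x1, 1 / 0.25, 0.0)        # only rounds with X1 = 1 contribute
    print(truncated_iw_mean(y, ratio, B=5.0))  # near 0.6; smaller B biases down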

Related Work As alluded to above, causal bandit problems can be treated as classical multi-armed bandit problems by simply ignoring the causal model and extra observations and applying an existing best-arm identification algorithm with well understood simple regret guarantees (Jamieson et al., 2014). However, as we show in §3, ignoring the extra information available in the non-intervened variables yields sub-optimal performance.

A well-studied class of bandit problems with side information are “contextual bandits” (Langford and Zhang, 2008; Agarwal et al., 2014). Our framework bears a superficial similarity to contextual bandit problems, since the extra observations on non-intervened variables might be viewed as context for selecting an intervention. However, a crucial difference is that in our model the extra observations are only revealed after selecting an intervention and hence cannot be used as context.

There have been several proposals for bandit problems where extra feedback is received after an action is taken. Most recently, Alon et al. (2015) and Kocák et al. (2014) have considered very general models related to partial monitoring games (Bartók et al., 2014) where rewards on unplayed actions are revealed according to a feedback graph. As we discuss in §6, the parallel bandit problem can be captured in this framework; however, the regret bounds are not optimal in our setting. They also focus on cumulative regret, which cannot be used to guarantee low simple regret (Bubeck et al., 2009). The partial monitoring approach taken by Wu et al. (2015) could be applied (up to modifications for the simple regret) to the parallel bandit, but the resulting strategy would need to know the likelihood of each factor in advance, while our strategy learns this online. Yu and Mannor (2009) utilize extra observations to detect changes in the reward distribution, whereas we assume fixed reward distributions and use extra observations to improve arm selection. Avner et al. (2012) analyse bandit problems where the choice of arm to pull and arm to receive feedback on are decoupled. The main difference from our present work is our focus on simple regret and the more complex information linking rewards for different arms via causal graphs. To the best of our knowledge, our paper is the first to analyse simple regret in bandit problems with extra post-action feedback.

Two pieces of recent work also consider applying ideas from causal inference to bandit problems. Bareinboim et al. (2015) demonstrate that in the presence of confounding variables, the value that a variable would have taken had it not been intervened on can provide important contextual information. Their work differs from ours in many ways: for example, the focus is on cumulative regret, and the context is observed before the action is taken and cannot be controlled by the learning agent.


Ortega and Braun (2014) present an analysis and extension of Thompson sampling assuming actions are causal interventions. Their focus is on causal induction (i.e., learning an unknown causal model) instead of exploiting a known causal model. Combining their handling of causal induction with our analysis is left as future work.

The truncated importance weighted estimators used in §4 have been studied before in a causal framework by Bottou et al. (2013), where the focus is on learning from observational data rather than on controlling the sampling process. They also briefly discuss some of the issues encountered in sequential design, but do not give an algorithm or theoretical results for this case.

2 Problem Setup

We now introduce a novel class of stochastic sequential decision problems which we call causal bandit problems. In these problems, rewards are given for repeated interventions on a fixed causal model (Pearl, 2000). Following the terminology and notation in Koller and Friedman (2009), a causal model is given by a directed acyclic graph G over a set of random variables X = {X_1, . . . , X_N} and a joint distribution P over X that factorises over G. We will assume each variable only takes on a finite number of distinct values. An edge from variable X_i to X_j is interpreted to mean that a change in the value of X_i may directly cause a change to the value of X_j. The parents of a variable X_i, denoted Pa_{X_i}, are the variables X_j such that there is an edge from X_j to X_i in G. An intervention or action (of size n), denoted do(X = x), assigns the values x = {x_1, . . . , x_n} to the corresponding variables X = {X_1, . . . , X_n} ⊂ X, with the empty intervention (where no variable is set) denoted do(). The intervention also “mutilates” the graph G by removing all edges from Pa_{X_i} to X_i for each X_i ∈ X. The resulting graph defines a probability distribution P{X^c | do(X = x)} over X^c := X − X. Details can be found in Chapter 21 of Koller and Friedman (2009).
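
Operationally, sampling from P{X^c | do(X = x)} amounts to ancestral sampling in the mutilated graph: each variable is drawn from its conditional given its parents, except that intervened variables are simply set to their assigned values. The toy sampler below is our own sketch of this semantics (the three-variable graph and its conditional distributions are invented for illustration):

    import random

    # Toy causal model over {X1, X2, Y} with edges X1 -> Y <- X2.
    parents = {"X1": [], "X2": [], "Y": ["X1", "X2"]}
    cpd = {  # conditional samplers, one per variable (illustrative numbers)
        "X1": lambda pa: int(random.random() < 0.5),
        "X2": lambda pa: int(random.random() < 0.7),
        "Y":  lambda pa: int(random.random() < 0.2 + 0.3 * pa["X1"] + 0.4 * pa["X2"]),
    }
    topo_order = ["X1", "X2", "Y"]           # a topological order of the DAG

    def sample(do=None):
        """Draw one joint sample from P{. | do(.)} by ancestral sampling."""
        do = do or {}
        values = {}
        for v in topo_order:
            if v in do:
                values[v] = do[v]            # intervened: incoming edges are cut
            else:
                values[v] = cpd[v]({p: values[p] for p in parents[v]})
        return values

    print(sample())                          # observational sample, i.e. do()
    print(sample(do={"X1": 1}))              # interventional sample, do(X1 = 1)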

A learner for a causal bandit problem is given the causal model’s graph G and a set of allowed actions A. One variable Y ∈ X is designated as the reward variable and takes on values in {0, 1}. We denote the expected reward for the action a = do(X = x) by µ_a := E[Y | do(X = x)] and the optimal expected reward by µ* := max_{a ∈ A} µ_a. The causal bandit game proceeds over T rounds. In round t, the learner intervenes by choosing a_t = do(X_t = x_t) ∈ A based on previous observations. It then observes sampled values for all non-intervened variables X^c_t drawn from P{X^c_t | do(X_t = x_t)}, including the reward Y_t ∈ {0, 1}. After T observations the learner outputs an estimate of the optimal action, â*_T ∈ A, based on its prior observations. The objective of the learner is to minimise the simple regret, µ* − E[µ_{â*_T}], the expected gap between the optimal reward and that of its chosen action.
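
In pseudocode, the protocol is a plain bandit loop whose feedback happens to include the non-intervened variables. The schematic below fixes a uniform-exploration learner and a stand-in environment purely for illustration (both are our assumptions; the algorithms in §3 and §4 choose actions far more carefully):

    import random

    A = ["do()", "do(X1=0)", "do(X1=1)"]                 # allowed actions (toy)
    mu = {"do()": 0.4, "do(X1=0)": 0.3, "do(X1=1)": 0.6} # assumed arm means

    def environment(action):
        """Return (non-intervened observations, reward) for one round."""
        y = int(random.random() < mu[action])
        covariates = {}                                  # stand-in for X^c_t
        return covariates, y

    T = 300
    stats = {a: [0, 0] for a in A}                       # per-arm [pulls, reward sum]
    for t in range(T):
        a = A[t % len(A)]                                # uniform exploration (toy)
        _, y = environment(a)
        stats[a][0] += 1
        stats[a][1] += y

    best_guess = max(A, key=lambda a: stats[a][1] / stats[a][0])
    print("chosen arm:", best_guess)
    print("simple regret:", max(mu.values()) - mu[best_guess])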
