Final Project: Reinforcement Learning, CS 301 (001), Fall 2022, Introduction to Data Science. Due Date: Dec. 14, 11:59 PM (EST).

The assignment is in the file provided. As stated in the file, please write a short essay explaining how you approached the questions, along with your answers. The subject of the assignment is Data Science, but there was no Data Science option in the Computer Science section.


Final Project: Reinforcement Learning
CS 301 (001), Fall 2022, Introduction to Data Science
Due Date: Dec. 14, 11:59 PM (EST)

WARNING: This project might be hard for some of you: please start as soon as possible!

Remarks. You are expected to write a short essay that covers in detail your approaches and answers to the questions below. It is highly recommended that you first state your approaches and ideas at a high level and then show how they apply to the two concrete examples given here. Your score on this project will be evaluated against both your answers to the specific questions and your overall writing.

Consider the following game. There is a special die with N sides, where the i-th side shows the number i, for each 1 ≤ i ≤ N. Let [N] := {1, 2, 3, ..., N}, the set of integers ranging from 1 to N. Let p ∈ [0, 1]^N be a vector of length N whose i-th entry, denoted p_i, is the probability of landing on the i-th side (and thus seeing the number i) when the die is rolled once. For example, if N = 4 and p = (0, 1/2, 1/4, 1/4), then a single roll shows the numbers 1, 2, 3, and 4 with probability 0, 1/2, 1/4, and 1/4, respectively. There is another binary vector q ∈ {0, 1}^N whose i-th entry, denoted q_i, indicates whether the i-th side is BAD (q_i = 1) or not (q_i = 0).

Game Rules. At the beginning, you have $0 at hand. Suppose that at some time you have x < k dollars at hand, where k is a parameter known in advance. You have two choices: either "accept" the challenge or "quit". (Case 1) If you choose "quit", the game is over and you walk away with x dollars. (Case 2) If you choose "accept", you roll the die once and see a random number X ∈ [N] drawn with the probabilities specified by p. There are two subcases. (1) If q_X = 1, i.e., the X-th side is bad, you lose all the money at hand. (2) If q_X = 0, i.e., the X-th side is not bad, you get a reward of f(X), where f is a given function; in this case you will have x + f(X) dollars. Here is the tricky part: if x + f(X) ≥ k (bear in mind that k is known in advance), the game is over and you walk away with x + f(X) dollars; otherwise, you continue the game with x + f(X) dollars at hand. Attention: if you accept the challenge, roll the die, and get X such that q_X = 1, you lose all the money at hand, but the game is NOT over: you can still continue to play with $0 at hand. The game is over only when either you choose to quit or you have at least k dollars at hand. Note that the following key components uniquely define the game: (N, p, q, f, k).

(Question 1) Consider a simple case where N = 6 and p = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6); in other words, we have a "normal" die with six sides, each appearing with the same chance on a single roll. Let q = (1, 0, 1, 0, 1, 0), f(x) = max(x^2, 23), and k = 150. You are asked to do the following.

(a) Formulate the above game as a reinforcement learning system. Specify the key components (S, A, P, R), where S is the state space, A is the action space, P is the transition probability matrix, and R is the reward function. For simplicity, you may assume the discount factor γ = 1. Specify clearly the terminal state space (S_T) and the non-terminal state space (S_N).

(b) Compute the optimal value function V* and the optimal policy π*. You can try either the value iteration method or the dynamic programming method. Make sure to state explicitly the values of V*(s) and π*(s) for all s ∈ S_N, where S_N refers to the non-terminal state space. Based on your results, state explicitly the maximum expected total reward you will get in this game when starting with $0. (If you use the value iteration method, try different tolerance parameters ε to make sure your algorithm converges properly.)
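For part (b), a minimal value-iteration sketch in Python with the Question 1 parameters might look as follows. This is one possible implementation, not the official solution: the tolerance tol, the state encoding (money at hand as the state), and the tie-breaking rule for the policy are all assumptions.

    import numpy as np

    # Question 1 parameters (from the prompt).
    N, k = 6, 150
    p = np.full(N, 1 / 6)          # p[i-1]: probability of rolling side i
    q = [1, 0, 1, 0, 1, 0]         # q[i-1] == 1 means side i is BAD

    def f(i):
        return max(i * i, 23)      # payoff for rolling a non-bad side i

    # V[x] approximates the maximum expected final bankroll from the
    # non-terminal state "x dollars at hand" (0 <= x < k). With gamma = 1,
    # the final bankroll equals the total return.
    V = np.zeros(k)
    tol = 1e-10                    # assumed; the prompt says to try several values

    while True:
        V_new = np.empty(k)
        for x in range(k):
            accept = 0.0           # expected final bankroll after one roll
            for i in range(1, N + 1):
                if q[i - 1] == 1:
                    accept += p[i - 1] * V[0]              # bad side: back to $0
                else:
                    nxt = x + f(i)
                    accept += p[i - 1] * (nxt if nxt >= k else V[nxt])
            V_new[x] = max(x, accept)                      # "quit" guarantees x
        diff = np.max(np.abs(V_new - V))
        V = V_new
        if diff < tol:
            break

    # Greedy policy: accept whenever rolling strictly beats quitting with x.
    policy = ["accept" if V[x] > x + 1e-9 else "quit" for x in range(k)]
    print("V*(0) =", V[0], "| pi*(0) =", policy[0])

V*(0) printed at the end is the maximum expected total reward when starting with $0. Since γ = 1 and bad rolls send you back to $0, convergence should be verified empirically by rerunning with several values of tol, as the prompt itself suggests.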
(c) Try the approach of linear programming (LP) to compute the optimal value function V* and the optimal policy π*. You should explicitly specify the following elements of the LP: variables, objective function, and constraints. Again, state explicitly the values of V*(s) and π*(s) for all s ∈ S_N. Based on your results, state explicitly the maximum expected total reward you will get in this game when starting with $0.

(Question 2) Consider a special case where N = 5, p = (1/2, 1/4, 1/8, 1/16, 1/16), q = (0, 1, 0, 1, 0), f(x) = min(5, 2x), and k = 150. Answer the same questions (a), (b), and (c) as in Question 1.
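For part (c), a common LP formulation of an MDP minimizes the sum of state values subject to V(s) being at least the one-step value of every action. Below is a hedged sketch using scipy.optimize.linprog; the constraint layout mirrors the value-iteration sketch above, and Question 2 can be handled by swapping in its parameters (N = 5, p = (1/2, 1/4, 1/8, 1/16, 1/16), q = (0, 1, 0, 1, 0), f(x) = min(5, 2x)).

    import numpy as np
    from scipy.optimize import linprog

    # Question 1 parameters; replace with the Question 2 values to reuse.
    N, k = 6, 150
    p = np.full(N, 1 / 6)
    q = [1, 0, 1, 0, 1, 0]
    f = lambda i: max(i * i, 23)

    # Variables: V(0), ..., V(k-1). For every non-terminal x we require
    #   V(x) >= x                      ("quit")
    #   V(x) >= sum_i p_i * outcome_i  ("accept")
    # rewritten as A_ub @ V <= b_ub for linprog.
    A, b = [], []
    for x in range(k):
        quit_row = np.zeros(k)
        quit_row[x] = -1.0            # -V(x) <= -x
        A.append(quit_row); b.append(-x)

        acc_row = np.zeros(k)
        acc_row[x] = -1.0
        const = 0.0
        for i in range(1, N + 1):
            if q[i - 1] == 1:
                acc_row[0] += p[i - 1]        # bad side: continue from $0
            else:
                nxt = x + f(i)
                if nxt >= k:
                    const += p[i - 1] * nxt   # terminal payoff folded into b
                else:
                    acc_row[nxt] += p[i - 1]
        A.append(acc_row); b.append(-const)

    # Minimizing sum(V) over the feasible set picks out the least feasible
    # point, which is the usual way V* is recovered from this kind of LP.
    res = linprog(c=np.ones(k), A_ub=np.vstack(A), b_ub=np.array(b),
                  bounds=[(None, None)] * k, method="highs")
    V = res.x
    print("V*(0) =", V[0])

The policy can then be read off by comparing, at each state x, the quit value x against the expected accept value under the recovered V, exactly as in the value-iteration sketch.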
you="" are="" asked="" to="" do="" the="" following.="" (a)="" formulate="" the="" above="" game="" as="" a="" reinforcement="" learning="" system.="" please="" specify="" the="" key="" components="" in="" the="" game="" (s,a,p,="" r),="" where="" s="" is="" the="" state="" space,="" a="" is="" the="" action="" space,="" p="" is="" the="" transition="" probability="" matrix,="" r="" is="" the="" reward="" function.="" for="" simplicity,="" you="" can="" assume="" the="" discounted="" factor="" γ="1." please="" specify="" clearly="" the="" terminal="" state="" space="" (st="" )="" and="" the="" non-terminal="" state="" space="" (sn="" ).="" (b)="" compute="" the="" optimal="" value="" function="" v="" ∗="" and="" the="" optimal="" policy="" π∗.="" you="" can="" try="" either="" the="" value="" iteration="" method="" or="" the="" dynamic="" programming="" method.="" please="" make="" sure="" to="" state="" explicitly="" the="" values="" of="" v="" ∗(s)="" and="" π∗(s)="" for="" all="" s="" ∈="" sn="" ,="" where="" sn="" refers="" to="" the="" non-terminal="" state="" space.="" based="" on="" your="" results,="" state="" explicitly="" 1="" the="" maximum="" expected="" total="" rewards="" you="" will="" get="" in="" this="" game="" when="" starting="" with="" $0.="" (if="" you="" use="" the="" value="" iteration="" method,="" please="" try="" different="" tolerance="" parameters="" �="" to="" make="" sure="" your="" algorithm="" converges="" properly.)="" (c)="" please="" try="" the="" approach="" of="" linear="" programming="" (lp)="" to="" compute="" the="" optimal="" value="" function="" v="" ∗="" and="" the="" optimal="" policy="" π∗.="" you="" should="" explicitly="" specify="" the="" following="" elements="" in="" the="" lp:="" variables,="" objective="" function,="" and="" constraints.="" again,="" please="" state="" explicitly="" the="" values="" of="" v="" ∗(s)="" and="" π∗(s)="" for="" all="" s="" ∈="" sn="" .="" based="" on="" your="" results,="" state="" explicitly="" the="" maximum="" expected="" total="" rewards="" you="" will="" get="" in="" this="" game="" when="" starting="" with="" $0.="" (question="" 2)="" consider="" a="" special="" case="" where="" n="5," p="(1/2," 1/4,="" 1/8,="" 1/16,="" 1/16),="" q="(0," 1,="" 0,="" 1,="" 0),="" f(x)="min(5," 2x),="" and="" k="150." answer="" the="" same="" questions="" (a),="" (b),="" and="" (c),="" as="" shown="" in="" question="" 1.="">

Answer To: Final Project: Reinforcement Learning (CS 301, Fall 2022, Introduction to Data Science)

Banasree answered on Nov 26, 2022
1.a)
S = state space: the amount of money at hand, {0, 1, ..., k − 1}, together with the terminal states
A = action space: {accept, quit}
P = transition probability matrix: induced by p, q, and f as described in the game rules
R = reward function: the change in money on each transition
b = behavior policy
γ = discount factor (γ = 1 here)
With respect to the given policy:
Terminal state space (ST) = the states in which the game is over, i.e., the player has quit or holds at least k = 150 dollars
Loop for each episode:
    Initialize and store S0 ≠ terminal
    Select and store A0 ~ b(·|S0)
    T ← ∞
    Loop for t = 0, 1, ...:
        If t < T:
            Take action At
            Observe and store the next reward as Rt+1 and the next state as St+1
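To make the formulation in part (a) concrete, here is a minimal sketch of the game's dynamics, assuming the non-terminal states are the dollar amounts 0, ..., k − 1, a single absorbing terminal state, and rewards equal to the change in money (so with γ = 1 the return is exactly the money you walk away with). The helper name transitions is illustrative, not part of the original answer.

    # Sketch of the game dynamics for Question 1 under the assumptions above.
    N, k = 6, 150
    p = [1 / 6] * N
    q = [1, 0, 1, 0, 1, 0]
    f = lambda i: max(i * i, 23)

    TERMINAL = "T"   # illustrative label for the single absorbing state

    def transitions(x, action):
        """Yield (probability, next_state, reward) for non-terminal state x."""
        if action == "quit":
            yield 1.0, TERMINAL, 0            # keep the x dollars already banked
            return
        for i in range(1, N + 1):             # action == "accept": roll once
            if q[i - 1] == 1:
                yield p[i - 1], 0, -x         # bad side: lose everything, play on
            else:
                nxt = x + f(i)
                if nxt >= k:
                    yield p[i - 1], TERMINAL, f(i)   # reached k: walk away
                else:
                    yield p[i - 1], nxt, f(i)        # keep playing with more cash

    # Example: the distribution over outcomes of accepting with $10 at hand.
    for prob, s2, r in transitions(10, "accept"):
        print(prob, s2, r)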