Submit two things (they can be submitted as a single file):
- Your informative tentative project title.
- Your answers to the Probability Questions. (The questions are in the file provided.)
Winter 2023 MATH 341/345 Project
Deployments of Safety Cars in Formula One in 2010-2019
(Version 1. February 5, 2023)

Introduction:
This project aims at modeling the frequencies of safety car deployments per race in Formula One and the time intervals between safety car deployments in 2010-2019.

A safety car in Formula One is deployed when the “yellow flags” are waved by the marshals and the Race Director decides that it is necessary to remove hazards from the race track or that the racing cars need to slow down due to unfavorable track conditions (e.g., heavy rain). When a safety car is deployed, in addition to the yellow flags, each driver sees “SC” boards on the sides of the track. The same information is also displayed on the steering wheel of each racing car.

Safety cars and yellow flags are important components of Formula One racing that protect drivers’ and marshals’ lives. When the safety car is leading the race, the racing cars need to bunch up and follow the safety car without overtaking any other cars, unless they are allowed to unlap themselves. As the safety car goes around the track at a much slower speed than the normal racing pace, marshals can quickly remove any hazards on the track and improve the track condition without worrying about fast-moving racing cars.

However, even with strict regulations under the yellow flag condition, accidents happen, especially during wet weather races. A notable recent incident happened at the 2014 Japanese Grand Prix, when Jules Bianchi of Marussia, a very promising young French driver, collided with a tractor crane under the “double yellow flag” condition. A “double yellow flag” condition indicates that marshals may be present on the track and that drivers need to be prepared to stop if necessary. Bianchi lost control of the car due to aquaplaning on the wet surface and suffered a fatal injury as a result of the collision with the tractor crane.
The FIA (the governing body of Formula One) took the incident very seriously and implemented a number of safety measures. One of them was the introduction of the “virtual safety car (VSC)”. Under VSC conditions, each driver needs to slow down to the posted speed limit, usually resulting in a 35 to 40% speed reduction. Because it is a “virtual” safety car, the actual safety car is not deployed under VSC; rather, each racing car is equipped with a device that automatically slows it down to the posted speed limit.

Even after the introduction of VSC in 2015, safety cars are still deployed once in a while under severe conditions. Here, an interesting question arises: Did the introduction of VSC change the frequency of safety car deployments? This is an important question for race strategists, as the deployment of a safety car means that each team needs to react quickly to adjust its tire strategy. Each driver is required to make at least one pit stop to change tires during the race, and a pit stop under the safety car condition means that they can save about 20 seconds, possibly gaining several precious positions in the race without overtaking. At the same time, fresh tires typically make the racing car more drivable, increasing the chances of catching and overtaking the cars in front after the pit stop.

Related article: https://www.mclaren.com/racing/2019/canadian-grand-prix/how-make-right-call-safety-car/
Note that the importance of understanding probability is emphasized in this article.

Your Tasks in This Project:
Your main task in this project is to analyze the safety car deployment data in Formula One to determine whether there are any changes in the frequency of safety car deployments between the pre-VSC era (2010-2014) and the post-VSC era (2015-2019).
That involves fitting reasonable distribution(s) to the data for the number of safety car deployments per race and for the time intervals between safety car deployments in these two time periods. Then, by comparing these two distributions, you are asked to conclude whether strategic adjustments were necessary to account for increased or decreased safety car deployments after VSC was introduced in 2015.

The dataset was originally retrieved from Kaggle (https://www.kaggle.com/datasets/jtrotman/formula-1-race-events), but it was further augmented by adding Type, Round, TotalRounds, TotalLaps, and Condition. These additional pieces of information were taken from the Wikipedia entries for the Formula One races.

A thorough and complete analysis of the main task above is sufficient to receive full credit for this project. That is, you are not required to do any additional programming beyond what is given unless you choose to do so. However, you are probably interested in doing a more detailed analysis of the dataset to make your analysis useful and interesting for the participating Formula One teams. To help you analyze the dataset in more detail, the dataset provided (augmented_safety_cars.csv) contains additional information such as the type of circuit (permanent or street) and the track condition (dry, mixed, or wet).

In addition, you will be asked to watch an interesting video titled “What Does An F1 Strategist Do?” (https://youtu.be/4CFkltWIc8o) so that you can see what Formula One strategists actually do before, during, and after each race. At the same time, you will see how they interact with racers, mechanics, race engineers, data analysts, and team principals.

How This Project Works:
This project consists of three parts: Probability Questions, Statistics Questions, and the project write-up. For the Probability and Statistics Questions, you need to answer the questions given below. For the project write-up, you may choose to summarize the results based on the R code given.
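As a rough illustration of the fitting step in the main task (a Poisson distribution for the per-race counts and an exponential distribution for the intervals), a minimal R sketch might look like the following. It is not the course's W23MATH341Project.R: the numbers are invented toy data, not values from augmented_safety_cars.csv, and the names counts, gaps, lambda_hat, and rate_hat are hypothetical.

```r
# Toy data only: these numbers are invented for illustration and are NOT
# taken from augmented_safety_cars.csv.
counts <- c(0, 1, 0, 2, 1, 0, 0, 3, 1, 0)  # hypothetical deployments per race
gaps   <- c(1.03, 1.98, 0.40, 2.75, 0.62)  # hypothetical intervals, in races

# Poisson fit: use the sample mean of the counts as the rate lambda.
lambda_hat <- mean(counts)

# Exponential fit: use the reciprocal of the sample mean of the intervals.
rate_hat <- 1 / mean(gaps)

# Fitted Poisson probabilities for 0 to 4 deployments per race, and fitted
# exponential densities at a few interval lengths, for comparison with
# histograms of the data.
fitted_pmf <- dpois(0:4, lambda = lambda_hat)
fitted_pdf <- dexp(c(0.5, 1, 2), rate = rate_hat)

print(lambda_hat)
print(rate_hat)
```

With the real dataset, the same two estimates would be computed separately for 2010-2014 and 2015-2019 so that the fitted distributions can be compared across the two periods.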
However, to make the project more interesting, you are encouraged to carry out additional analysis. If you find anything interesting, you may choose to write about your finding(s) instead. To make sure that what you decide to write in your write-up is appropriate, please talk to the instructor before you begin. The instructor will be happy to assist you with additional programming if necessary.

Probability Questions (12 Points in Total):
1. Watch “What Does An F1 Strategist Do?” (https://youtu.be/4CFkltWIc8o) and describe how the Formula One strategist position is related to your major(s) in a paragraph or two. Note: Everyone on your team needs to write a separate paragraph or two. (2pts)
2. Suppose that you look at each of $n$ different laps in Formula One races. Why is checking whether or not each of these laps was led by a safety car a binomial experiment? (2pts)
3. Why is it reasonable to assume that the number of safety car deployments in a fixed period of time (e.g., five seasons) follows the Poisson distribution (approximately)? Recall the relationship between the binomial and Poisson distributions, and state what happens to $n$ (the number of laps) and $p$ (the probability that each lap is led by a safety car). (2pts)
4. Why is it reasonable to assume that the time intervals between safety car deployments are (approximately) exponentially distributed? (2pts)
5. Suppose that we consider two time periods of Formula One racing (2010-2014 and 2015-2019). Is it safe to assume that the numbers of safety car deployments in these two time periods are independent of each other? In other words, is it reasonable to say that the number of safety car deployments in 2010-2014 does not significantly influence the number of safety car deployments in 2015-2019? Justify. (2pts)
6. Recall the memoryless property of the exponential distribution, which says that $P(X \ge t_1 + t_2 \mid X \ge t_1) = P(X \ge t_2)$, $t_1 \ge 0$, $t_2 \ge 0$, if and only if $X$ is exponentially distributed. What does this imply regarding the probability that the next safety car deployment is 5 races from now given that it has been 3 races since the last safety car deployment? Comment. (2pts) Note: The above phenomenon is known as “the waiting time paradox”.
7. (Optional) Any questions you have about this project.

Statistics Questions (Read W23MATH341Project.R and run the program to answer these questions. Look for “SQ” in the comments in the R code to identify which part of the code refers to which question.) (20 Points in Total):

Note: The length of each race is set to 1, which is reasonable given that each race has approximately the same race distance. According to the data, the first safety car deployment in 2010 occurred at lap 2 of Round 2 (a 58-lap race) and the second deployment occurred at lap 1 of Round 4 (a 56-lap race). Thus, the first duration is simply (Round 1) + (Deployment in Round 2) = 1 + 2/58 = 1.034483. Then, the second duration, which is the time between these two deployments, is given by (Remaining laps in Round 2) + (Round 3) + (Deployment in Round 4) = (1 – 2/58) + 1 + 1/56 = 1.983374.

1. Look at the histograms of the number of safety car deployments per race. Do these histograms suggest that the data are Poisson distributed (approximately)? Or is there any clear evidence against that? Comment. (2pts)
2. The best-fit Poisson pmfs, as represented by the blue dotted lines, use lambda=mean(first_half) and lambda=mean(second_half) for the first and second half of the 2010s, respectively. Explain why it makes sense to use these values. (2pts)
3. Look at the histograms of the time intervals between two safety car deployments. Do these histograms suggest that the data are exponentially distributed (approximately)? Or is there any clear evidence against that? Comment. (2pts)
4. The best-fit exponential pdfs, as represented by the blue dotted lines, use rate=1/mean(interval1) and rate=1/mean(interval2) for the first and second half of the 2010s, respectively. Explain why it makes sense to use these values. (2pts)
5. Report mean(first_half), mean(interval1), mean(second_half), and mean(interval2). Then, describe how mean(first_half) and mean(interval1), as well as mean(second_half) and mean(interval2), are approximately related to each other. After that, explain why that happens by recalling the distributions you identified for the number of safety car deployments and the time interval between two safety car deployments. (2pts)
6. Running a two-sample t-test to compare means, or constructing a confidence interval for the difference in means, using the time interval data may potentially lead to wrong results. Explain why in terms of normality and independence. (2pts)
7. Explain why the concerns you mentioned in the previous question are actually not concerning for this dataset. (2pts)
8. The t.test() function in R gives the one- and two-sample t-test results for the mean or difference in means, including the confidence intervals and p-values. The parameter var.equal in the t.test() function specifies whether or not a common variance can be assumed (TRUE if yes, FALSE otherwise). For comparing the time intervals, can we assume a common variance? Comment. Recall that the mean and standard deviation are equal to each other for the exponential distribution. (2pts)
9. Report the results of the t.test() function (95% confidence interval, degrees of freedom used, and p-value) for the var.equal=TRUE and var.equal=FALSE cases. (2pts)
10. Based on the results above, discuss whether or not there is any statistically significant change in the distribution of the safety car deployments between these two time periods. (2pts)
11. (Optional) The Kolmogorov-Smirnov test is a one- and two-sample test that directly compares the cumulative distribution function(s) of the data. In the one-sample case, a researcher hypothesizes the underlying distribution and sees how well the cumulative distribution function (cdf) estimated from the data (known as the empirical cdf) matches that of the hypothesized distribution. In the two-sample case, the two empirical cdfs are directly compared. Do the test results show any evidence of deviation from the exponential distribution for the time interval data? Also, are these two datasets significantly different from each other? Justify your conclusion by reporting the p-values and interpreting these p-values. (Extra credit: 1pt)
12. (Optional) The quantile-quantile (Q-Q) plot is a visual tool to see if the dataset of interest follows a certain distribution. Although the Q-Q plot is typically used for the normal distribution, for this project, we use the Q-Q plot for the exponential distribution. If the points on the Q-Q plot follow a straight line, that is an indication that the dataset follows the exponential distribution well. Present the Q-Q plots for the time interval datasets (pre- and post-VSC) and comment. (Extra credit: 1pt)
13. (Optional) Another important aspect of the dataset is the independence of the observations. A common assumption