Assignment description in the file. Programming in R needed.



Assignment 1 Statistics for AI and CS 2020-2021

Please take notice of the following:
• Discussion with fellow students during the preparation of the assignments does stimulate statistical thinking as well as its implementation. However, the very completion of the assignments is strictly personal and any evidence of plagiarism will be taken seriously.
• Write your answer to each exercise in clear English in a modern text editor. Include ONLY key R code (so that we can check) and key R output, such as tables, figures and analysis, necessary to answer the question. NOTE: Any output that is included needs to be accompanied by a written interpretation.
• The number of points is indicated per question within a box; ten are free.
• Hand in your work in PDF format in Nestor's dropbox for Assignment 1 strictly before Monday, September 21, 2020. The deadline is strict!

Questions and exercises:

1. Some general questions:
(a) [5] What is the most likely shape of your distribution if the median is lower than the mean? Explain briefly why.
(b) [5] Describe in your own words the terms: quantile, percentile, distribution, random variable, expectation of a random variable.
(c) [5] In what situation would the (normalized) IQR provide a better estimate of the spread than the standard deviation?

2. Infected Persons. Suppose the probability that a person in a country gets infected on a certain day by a randomly spreading virus is $10^{-4}$ and that there are $10^{5}$ persons in the country.
(a) [10] Which type of distribution is reasonable to assume? What are the mean and the variance of the distribution?
(b) [5] Compute the probability that 20 or fewer persons get infected per day.
(c) [5] Plot the probabilities of the distribution together with its approximation by the normal distribution. Is the approximation precise, given the extremely small probability of getting infected?

3. Auto. Several variables such as miles per gallon, origin, and horsepower were measured for 392 vehicles. The data are available under the name Auto in the library ISLR.
(a) [5] Compute the mean, median, and standard deviation of the miles per gallon. Are the data skewed or more or less symmetric?
(b) [5] Construct the QQ plot of the miles per gallon and give your interpretation of the degree of normality of the variable.
(c) [5] Construct a combined boxplot of the miles per gallon per origin of the car, that is, continent of production. What do you observe?
(d) [5] Use e.g. the aggregate function to compute the mean miles per gallon per origin of the cars. Do the same for the median. Are the data per origin heavily skewed?
(e) [5] Can the differences found in the previous question be explained by differences in car weight? Support your answer with a computation.
(f) [5] Demonstrate how the function order can be used to reorder the complete data frame in decreasing order of the number of miles per gallon. Give the first five rows of the reordered data frame, showing the mpg and the car names.

4. Fashion Industry. The fashion industry is interested in the distribution of the height of Dutch women older than 18 years of age. Assume that the distribution is normal with mean 175 cm and standard deviation 10 cm.
(a) [5] Use the function rnorm to sample 100000 women from the population. What proportion of the simulated women is taller than 185 cm? Compare this with the theoretical probability. Hint: Use µ and σ for the comparison with the population.
(b) [5] What is the proportion of women between two standard deviations s below and two standard deviations above the mean x̄? Compare again with the theoretical probability. Hint: Use µ and σ for the comparison with the population, and x̄ and s with respect to the simulated data.
(c) [5] Use the first 1000 simulated heights to construct a plot with, on the horizontal axis, the sample size n running from 1 to 1000. Add a horizontal red line at the population mean. Use lines to add the line with the mean $\bar{x}_n = \frac{1}{n} \sum_{i=1}^{n} x_i$ depending on the sample size $n = 1, \ldots, 1000$. Compute $SE_n$ as $s_n / \sqrt{n}$, where the sample size n again runs from 1 to 1000. Use lines to add the line $\bar{x}_n + 1.96 \cdot SE_n$ and the line $\bar{x}_n - 1.96 \cdot SE_n$ to the plot. Give your observations. Remark: This will produce an NA (Not Available) for n = 1, but this does no harm to the plotting.

5. Mathematics and Reading. The data set Caschool from the Ecdat library gives data on the levels of Mathematics and Reading.
(a) [2] Give a scatter plot of Mathematics and Reading and give your observations.
(b) [1] What are the units measured in the data matrix/frame?
(c) [2] Compute the correlation between Mathematics and Reading and provide an interpretation.
(d) [5] Compute the mean score of Mathematics and Reading for each of the units and call it MR score. Next compute the mean of the MR score per county. Report the county name with the smallest and the county name with the largest MR score. Hint: For the latter do not give a whole data frame, but compute explicitly.
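
The running mean and standard error described in exercise 4(c) could be computed along the following lines. This is only a minimal sketch, not a model answer: the seed, the variable names (heights, xbar_n, s_n, se_n) and the line styles are assumptions chosen for illustration, and only base R functions are used.

```r
# Sketch for exercise 4(c): running mean with a +/- 1.96 * SE band.
set.seed(1)                      # assumed seed, only for reproducibility
mu <- 175                        # population mean (cm), from exercise 4
sigma <- 10                      # population standard deviation (cm)

heights <- rnorm(100000, mean = mu, sd = sigma)
x <- heights[1:1000]             # first 1000 simulated heights
n <- seq_along(x)                # sample sizes 1, ..., 1000

xbar_n <- cumsum(x) / n          # mean of the first n observations
# sd() of a single value is NA, which gives the NA at n = 1
s_n <- sapply(n, function(k) sd(x[seq_len(k)]))
se_n <- s_n / sqrt(n)            # standard error of the running mean

upper <- xbar_n + 1.96 * se_n
lower <- xbar_n - 1.96 * se_n

# Empty plot frame first, then add the lines one by one
plot(n, xbar_n, type = "n",
     ylim = range(lower, upper, na.rm = TRUE),
     xlab = "sample size n", ylab = "height (cm)")
abline(h = mu, col = "red")      # population mean
lines(n, xbar_n)                 # running sample mean
lines(n, upper, lty = 2)         # upper band
lines(n, lower, lty = 2)         # lower band
```

The NA at n = 1 is skipped silently by lines, which is why the remark in the exercise says it does no harm to the plotting.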
Sep 19, 2021