STAT 603: Final Project
Due: 6:00pm, Friday, May 24th.

General Directions:

0. You may work in groups to discuss ideas and the selection of statistical modeling methods, but the programming of your modeling method(s) and the writing of the final report must be solely your own work. Copying others' work/code, or allowing others to copy your work/code, is considered cheating and plagiarism, and will result in zero points for the whole final project and an F grade for STAT 603. Cheating in any coursework is a serious offense against academic integrity and University rules.

1. Submit a PDF copy of your final report, your R source code, and two label prediction files on Canvas. Name the PDF file "myfinal_report.pdf"; name the R source code file "myfinal_code.R"; name your predictions for the two testing data sets (see "Detailed Instructions 4.") "myfinal_prediction1.txt" and "myfinal_prediction2.txt". Only file types "pdf", "R", and "txt" will be accepted on Canvas. If any of these four files is missing online, we won't grade your report.

2. Submit a hardcopy of the PDF file "myfinal_report.pdf" to the instructor's office, Townsend 228, by 6pm on the due date. You may slide your report under the office door if the instructor is not in. We won't grade your project without a hardcopy.

3. Typing your final report in RMarkdown or LaTeX is encouraged.

Detailed Instructions:

1. Project Goals

In this final project, we intend to evaluate the prediction performance of different statistical modeling methods on the MNIST data sets. From the last homework assignment, we have seen that a somewhat "naive" probabilistic model using the counts can achieve a misclassification rate just a bit above 20%. Can you use other statistical methods to beat the performance of this "naive" benchmark?

2. Methods

You are asked to re-analyze the MNIST data sets with at least one statistical learning method different from the homework assignment. You may form a small study group to try out various methods and present more comprehensive comparative results in your final report. Nevertheless, you must completely "own" at least one method that does not overlap with the coding effort of your groupmates; the Methods/Algorithms/Computation description in the final report, the source code, and the two label prediction files must all be based on the method(s) of your own effort. The final label predictions should reflect only the best results of your own effort, NOT those of your groupmates or any other person in class. The following is a list of possible modeling methods (most have been covered in our class or other stat courses), but you are free to choose others not on the list:

Multinomial Logistic Regression (with Lasso, Ridge, or Elastic Net)
Random Forests (RF)
Gradient Boosting Machine (GBM)
Multi-class Support Vector Machine (SVM)
Multi-class Linear Discriminant Analysis (LDA)
Convolutional Neural Network (CNN)

3. Statistical Modeling

As we have seen in previous homework assignments, for a fair performance comparison we should use only the training data sets to build statistical models; that is, when building models with your methods of interest, you should use only "mnist_train_binary.csv" (with the original binary covariates) or "mnist_train_counts.csv" (with the compressed counts).

Most of the modern methods above have tuning parameters that need to be properly chosen, and their software often has built-in functions to automatically find values of some key tuning parameters. In the report, you are expected to discuss how you chose these tuning parameters. A very popular practical way to tune parameters is cross-validation (CV). Roughly speaking, CV chooses the tuning parameter that minimizes the misclassification rate on the training data, by working out a training-testing scheme inside the training data. For your own interest, you may read a quick document here: http://statweb.stanford.edu/~tibs/sta306bfiles/cvwrong.pdf
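For illustration only, a minimal sketch of CV-based tuning for one of the listed methods (multinomial logistic regression with lasso, via the glmnet package) might look like the following. The training file name is taken from this assignment; the assumptions that the file has a header row and that its first column holds the digit label are ours and should be checked against the actual data files:

# Sketch only: tune the lasso penalty by 5-fold CV with cv.glmnet.
# Assumption: the CSV has a header row and its first column is the digit label.
library(glmnet)
train <- read.csv("mnist_train_counts.csv")
x_train <- as.matrix(train[, -1])   # compressed-count covariates
y_train <- factor(train[, 1])       # digit labels
set.seed(603)                       # make the CV folds reproducible
cvfit <- cv.glmnet(x_train, y_train,
                   family = "multinomial",  # multi-class logistic regression
                   type.measure = "class",  # CV criterion: misclassification rate
                   nfolds = 5)
cvfit$lambda.min                    # penalty value selected by CV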
4. Label Prediction

After your best statistical modeling efforts, you need to make predictions for the testing data sets. The testing data sets are "mnist_test_binary.csv" (with the original binary covariates) or "mnist_test_counts.csv" (with the compressed counts); you should use your predictions and the true labels to report and compare the misclassification rates of your methods. Suppose yhat1 is the vector object that contains your predictions; use the following code to generate the file "myfinal_prediction1.txt":

write.table(yhat1,file="myfinal_prediction1.txt",row.names=FALSE,col.names=FALSE,sep="")

In addition, for the grader to check your reported misclassification rate, you are also given new testing data sets "mnist_test_binary_new.csv" (with the original binary covariates) or "mnist_test_counts_new.csv" (with the compressed counts). These new testing data files have exactly the same format as before, but their true labels are replaced by "-1". You are asked to make predictions for the new testing data sets. Suppose yhat2 is the vector object that contains your predictions; use the following code to generate the file "myfinal_prediction2.txt":

write.table(yhat2,file="myfinal_prediction2.txt",row.names=FALSE,col.names=FALSE,sep="")

Any other format of your prediction files will NOT be graded.

5. Final Report

Write a detailed final report ("myfinal_report.pdf") that completely summarizes your work. If you work in a study group, write down the names of your groupmates and clearly specify the method(s) you "own". Your report should have at least the following components:

(a) Introduction/Goals: describe the MNIST data problem and the motivation/goals of your work.

(b) Methods/Algorithms/Computation: describe in detail your statistical methods and their rationales; give a concise and clear description of the computational algorithms. Other things to consider: why do you think the method may work? How did you choose the tuning parameters in the implemented algorithm? Are there any algorithmic/computational tricks that you used and found interesting in the implementation?

(c) If you implement more than one statistical method, you may start a new section to describe the additional methods and their rationales.

(d) Numerical Results: report the misclassification rate for the testing set with true labels (see the sketch after this list). How does your method compare to alternatives (such as the "naive" benchmark) in terms of prediction accuracy and computing time?

(e) Conclusion: a concluding summary of all your work and efforts, including possible future work.
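As a reference for item (d), and continuing the earlier sketch, the misclassification rate on the labeled testing set could be computed along the following lines (again, the header row and the position of the label column are assumptions about the data files):

# Sketch only: misclassification rate on the testing set with true labels.
test <- read.csv("mnist_test_counts.csv")
y_test <- factor(test[, 1])         # true labels (assumed to be the first column)
yhat1 <- as.vector(predict(cvfit, newx = as.matrix(test[, -1]),
                           s = "lambda.min", type = "class"))
mean(yhat1 != y_test)               # misclassification rate to report
# Wrapping the model-fitting call in system.time() gives the computing time for item (d).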
6. Source Code

All your implementation R source code should be saved in the file "myfinal_code.R". Your submitted code should allow anyone to repeat all your numerical experiments and reproduce your submitted prediction files.
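Continuing the same sketch, the new testing set (whose labels are all "-1") can be handled in the same way; with set.seed() called near the top of the script, re-running "myfinal_code.R" from start to finish should regenerate both prediction files:

# Sketch only: predictions for the new testing set and the required output file.
new_test <- read.csv("mnist_test_counts_new.csv")   # true labels replaced by -1
yhat2 <- as.vector(predict(cvfit, newx = as.matrix(new_test[, -1]),
                           s = "lambda.min", type = "class"))
write.table(yhat2,file="myfinal_prediction2.txt",row.names=FALSE,col.names=FALSE,sep="")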