
The Data Mining & Analytics midterm covers five topics:

  • Data transformation (pre-processing)
  • Clustering (k-means)
  • Classification (decision trees, neural nets)
  • Model evaluation (cross-validation, error metrics)
  • Combining classifiers (ensembles)

DMA_22 Midterm Review.pptx

Kaggle Competition Lab Reflections
Midterm Review
Data Mining & Analytics

Midterm (slide 4 of 29)
• Wednesday, October 19th, in class
• The Midterm will be on bCourses (like a quiz)
• The link to the Midterm will be posted before class
  – 1. All exams must be turned in no later than 5:00pm
  – 2. You will have no more than 2 hours to complete the exam
  – 3. To receive the full 2 hours, you must start between 2pm and 3pm
  – 4. If starting at 2pm, for example, your exam will be due at 4pm
  – 5. If starting at 3:30pm, for example, your exam will be due at 5pm
  – 6. There are 19 questions, worth 53 points total (8 points of extra credit, i.e., +15%)
  – 7. The exam is open book/note
  – 8. No communication is allowed during the test

Topics (slide 5 of 29)
• Data transformation (pre-processing)
• Clustering (k-means)
• Classification (decision trees, neural nets)
• Model evaluation (cross-validation, error metrics)
• Combining classifiers (ensembles)

Data transformation (pre-processing)
Subtopics:
● Feature engineering (pandas)
● Representing data to fit the prediction task
● Normalization, e.g., Z-score: z = (x − μ) / σ (see the pandas sketch below)

Example Exam Question (slide 7 of 29)
[Question shown as an image on the slide; not captured in this transcript.]

Clustering (k-means)
Subtopics:
● Types of clustering methods
● Measures of cluster goodness (SSE, silhouette score)
● The k-means algorithm
● Ways of choosing K (elbow method)

Clustering: SSE, Elbow Method, Silhouette Score
● Within-cluster variance / Sum of Squared Errors:
    SSE = Σ_k Σ_{o ∈ C_k} dist(o, c_k)², where c_k is the centroid of cluster C_k
● Elbow method: plot SSE against K and choose the K at the "elbow," where adding more clusters stops reducing SSE sharply (sketched in code below)
● Silhouette score, for each data point o in C_i:
    a(o) = average distance from o to the other points of C_i
    b(o) = minimum, over clusters C_j ≠ C_i, of the average distance from o to the points of C_j
    s(o) = (b(o) − a(o)) / max{a(o), b(o)}
  The overall silhouette score is the average of s(o) over every data point.

Example Exam Question (slide 12 of 29)
Given a clustering assignment on four 2-dimensional points:

  Pt | Feature 1 | Feature 2 | Cluster
  A  |     2     |     2     |    1
  B  |     0     |     2     |    1
  C  |    -4     |    -1     |    2
  D  |    -3     |    -2     |    2

(a) Calculate the silhouette coefficient for point B. (A worked check appears below.)
(b) If this assignment is obtained right after an iteration of k-means clustering (which may or may not have terminated), do you think the assignment will change in later iterations? Why or why not?

Classification (decision trees)
Subtopics:
● Characterizing purity (Gini/Info)
● Splitting on features to improve purity (trees)
● Improving the generalizability of trained trees (pruning)

Example Exam Question (slide 14 of 29)

  Pt | Feature | Label
  A  |    7    |   0
  B  |   10    |   1
  C  |    4    |   0
  D  |   10    |   0
  E  |   16    |   1
  F  |    9    |   1

(a) Calculate the Gini index of the dataset.
(b) If we split on 8, what is the overall Gini index after splitting?
(c) If we split on 13, what is the overall Gini index after splitting?
(A code sketch of these calculations appears below.)

Classification (neural networks)
Subtopics:
● Feed-forward neural networks: input layer, hidden layer, output layer, weights, bias (forward-pass sketch below)
● Backpropagation (conceptual)
● Activation functions: logistic, ReLU, softmax
● Additional details: epoch, batch size, stopping criteria

Example Exam Question (slide 16 of 29)
[Question shown as an image on the slide; not captured in this transcript.]

Evaluation of models
Subtopics:
● Metrics (confusion-matrix-based & continuous)
● Training, validation, and testing sets
● Cross-validation
● Model selection

Error Metrics: examples from lecture (see the metrics sketch below)

  Classification example:
  predicted | actual
      0     |   0
      0     |   1
      1     |   0
      1     |   1
      1     |   1

  Continuous example:
  predicted | actual
     0.25   |   0
     0.45   |   1
     0.66   |   0
     0.71   |   1

Example Exam Question (slide 19 of 29)
For each of the tasks below, explain which algorithm(s) and error metric(s) you would use and why.
(a) Predict GPA given students' department, credits taken, study hours, etc.
(b) Predict if a Twitter user is liberal or conservative.
(c) Predict lung cancer from chest X-rays.

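To make the pre-processing slide concrete, here is a minimal pandas sketch of Z-score normalization. The DataFrame and its column names are invented for illustration; they are not from the deck.

    import pandas as pd

    # Hypothetical feature table; columns are made up, not from the deck.
    df = pd.DataFrame({"credits": [12, 16, 8, 20],
                       "study_hours": [10, 25, 5, 30]})

    # Z-score each column: z = (x - mean) / std.
    # Note: pandas' .std() is the sample standard deviation (ddof=1);
    # use df.std(ddof=0) for the population version.
    z = (df - df.mean()) / df.std()
    print(z)
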
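For the elbow method and silhouette score, a short scikit-learn sketch (assuming scikit-learn as the toolkit, which the deck does not prescribe); the data points are made up. KMeans exposes SSE as inertia_.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.array([[2, 2], [0, 2], [-4, -1], [-3, -2], [5, 4], [4, 5]], dtype=float)

    # Elbow method: fit k-means for several K and watch where the SSE
    # stops dropping sharply.
    for k in range(1, 5):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, km.inertia_)

    # Silhouette score of one assignment (defined only for K >= 2).
    km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(silhouette_score(X, km2.labels_))
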
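One way to check part (a) of the slide-12 question by hand, applying the s(o) definition above to the four given points (my computation, not the deck's posted solution):

    import math

    pts = {"A": (2, 2), "B": (0, 2), "C": (-4, -1), "D": (-3, -2)}

    def d(p, q):
        return math.dist(pts[p], pts[q])

    a_B = d("B", "A")                      # avg. distance within B's cluster {A, B} = 2.0
    b_B = (d("B", "C") + d("B", "D")) / 2  # avg. distance to cluster {C, D} = 5.0
    s_B = (b_B - a_B) / max(a_B, b_B)      # (5 - 2) / 5 = 0.6
    print(s_B)
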
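A sketch of the slide-14 Gini arithmetic, assuming "split on t" means partitioning into feature < t versus feature ≥ t (the deck doesn't spell out the convention, but no point sits exactly on either threshold, so the answers are unaffected):

    data = [(7, 0), (10, 1), (4, 0), (10, 0), (16, 1), (9, 1)]  # (feature, label)

    def gini(labels):
        # gini(S) = 1 - sum over classes of p_c^2
        n = len(labels)
        return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

    def split_gini(threshold):
        # Size-weighted average of the child-node Gini indices.
        left  = [y for x, y in data if x < threshold]
        right = [y for x, y in data if x >= threshold]
        n = len(data)
        return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

    print(gini([y for _, y in data]))  # (a) whole dataset: 0.5
    print(split_gini(8))               # (b) (2/6)*0.0 + (4/6)*0.375 = 0.25
    print(split_gini(13))              # (c) (5/6)*0.48 + (1/6)*0.0 = 0.4
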
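To make the feed-forward vocabulary concrete, a minimal NumPy forward pass through one hidden layer; the layer sizes, weights, and input are invented. Backpropagation would then adjust the weights and biases along the gradient of a loss, which this sketch omits.

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        e = np.exp(z - z.max())  # subtract the max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    x  = np.array([0.5, -1.0, 2.0])      # input layer: 3 features
    W1 = rng.normal(size=(4, 3)) * 0.1   # hidden layer: 4 units
    b1 = np.zeros(4)
    W2 = rng.normal(size=(2, 4)) * 0.1   # output layer: 2 classes
    b2 = np.zeros(2)

    h = logistic(W1 @ x + b1)  # hidden activations (logistic)
    y = softmax(W2 @ h + b2)   # class probabilities (softmax), sums to 1
    print(y)
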
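A scikit-learn sketch that reproduces the two lecture tables above, confusion-matrix metrics for the binary example and RMSE for the continuous one, plus a toy cross-validation call; the cross-validation data is invented.

    import numpy as np
    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 precision_score, recall_score, mean_squared_error)
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Binary example from the lecture table.
    actual    = [0, 1, 0, 1, 1]
    predicted = [0, 0, 1, 1, 1]
    print(confusion_matrix(actual, predicted))  # rows = actual, columns = predicted
    print(accuracy_score(actual, predicted))    # 3 of 5 correct = 0.6
    print(precision_score(actual, predicted))   # TP / (TP + FP) = 2/3
    print(recall_score(actual, predicted))      # TP / (TP + FN) = 2/3

    # Continuous example from the lecture table: RMSE.
    y_true = [0, 1, 0, 1]
    y_pred = [0.25, 0.45, 0.66, 0.71]
    print(np.sqrt(mean_squared_error(y_true, y_pred)))

    # k-fold cross-validation on a toy dataset: one score per fold.
    X = [[0], [1], [2], [3], [4], [5]]
    y = [0, 0, 0, 1, 1, 1]
    print(cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=3))
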
Combining classifiers (ensembles)
Subtopics:
● Simple combiners
● Bagging (e.g., random forests)
● Boosting (AdaBoost)
● Blending/Stacking
(A scikit-learn sketch of the bagging and boosting entries follows this section.)

Example Exam Question (slide 21 of 29)
Determine True or False for each of the following statements:
(a) Ensemble methods are never the cause of overfitting.
(b) Hyperparameters of random forests include (but are not limited to) the number of trees, the percentage of rows sampled, and the max depth.
(c) Blending uses cross-validation while stacking uses a holdout validation set.

Break-out groups (slide 22 of 29)
• Clustering & Preprocessing
• Prediction/Classification Models
• Cross-validation & Metrics
• Ensembling
Prof. Pardos will stay in the main room for general Q&A (self-select after the break).

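A minimal scikit-learn sketch of the bagging and boosting entries (assuming scikit-learn, which the deck does not prescribe); the toy data is invented. The random-forest arguments mirror the hyperparameters named in the True/False question.

    from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

    # Toy data, invented for illustration.
    X = [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5]]
    y = [0, 0, 0, 1, 1, 1]

    # Bagging: a random forest with the number of trees, the fraction of
    # rows bootstrapped per tree, and the maximum tree depth set explicitly.
    rf = RandomForestClassifier(n_estimators=100, max_samples=0.8,
                                max_depth=3, random_state=0).fit(X, y)

    # Boosting: AdaBoost over decision stumps (the default base estimator).
    ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)

    print(rf.predict([[1.5, 1.5]]), ada.predict([[4.5, 4.5]]))
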
Oct 18, 2022