Data Mining COSC 2111/2110 Assignment 2: Neural Networks and Ensemble Methods

Data Mining with Weka. I want to avoid Javanns; I would rather have the answer using MultilayerPerceptron.


Data Mining COSC 2111/2110
Assignment 2: Neural Networks and Ensemble Methods

Assessment Type: This is an individual assignment, meaning that you must complete this assignment by yourself. Please submit your assignment online via "Canvas → Assignments → Assignment 2". Clarifications/updates may be made via announcements/relevant discussion forums.
Due Date: End of week 11, Monday 10th October 2022, 11:59pm
Marks: 40

1 Overview
In this assignment you are asked to explore the use of neural networks and ensemble methods for classification. You are also asked to carry out a data mining investigation on a real-world data file. You are required to write a report on your findings. Your assignment will be assessed on demonstrated understanding of concepts, algorithms, methodology, analysis of results and conclusions. Please make sure your answers are labelled correctly with the corresponding part and sub-question numbers, to make it easier for the marker to follow. Please stick to the required page limits (a penalty will apply).

2 Learning Outcomes
This assessment relates to the following learning outcomes of the course.
• CLO 1: Demonstrate advanced knowledge of data mining concepts and techniques.
• CLO 2: Apply the techniques of clustering, classification, association finding, feature selection and visualisation on real world data.
• CLO 3: Determine whether a real world problem has a data mining solution.
• CLO 4: Apply data mining software and toolkits in a range of applications.
• CLO 5: Set up a data mining process for an application, including data preparation, modelling and evaluation.

3 Assignment Details
3.1 Part 1: Classification with Neural Networks (12 marks)
This part involves predicting the Class attribute in the following file: hepatitis.arff in the directory:
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/arff/UCI/
The main goal is to achieve the lowest classification error with the lowest amount of overfitting. You are recommended to use 'MultilayerPerceptron' in Weka to complete this task, though alternatively, it is also possible to do this using 'Javanns'. For the neural network training runs build a table with the following headings:

Run No | Architecture | Parameters | Train MSE | Train Error | Epochs | Test MSE | Test Error
1      | ii-hh-oo     | lr=.2      | 0.5       | 30%         | 500    | 0.6      | 40%

1. Describe the data preprocessing tasks (including data encoding) that are required. How many outputs and how many inputs will there be? How do you handle numeric and nominal attributes? What are the normalizations required? How do you deal with missing values (if present)? Include your data preprocessing scripts (e.g., if you choose the Javanns option) as an appendix (not part of the page count).
2. Elaborate the pre-processing procedure (in Weka) to generate the necessary training, validation and test data files. How do you determine when to stop training a neural network? Include your data preparation script (e.g., if you choose Javanns) as an appendix (not part of the page count).
3. Describe how a trained neural network determines an unseen test data instance's class label. If Javanns is chosen, describe how to use the "analyze" program to do this.
4. Assuming that no hidden layer is used, carry out 5 training and test runs for a network. Comment on the limitations of this single-layer "perceptron" network, as opposed to a network where one or more hidden layers are employed.
5. Assuming that one hidden layer is used, carry out 10 training and test runs for a network with different numbers of hidden nodes.
What would be a good strategy to figure out the right number of hidden nodes? From your runs, what seems to be the right number of hidden nodes for this problem? Comment on the variation in the training runs and the degree of overfitting.
6. For a network with 5 hidden nodes, explore at least 5 different combinations of learning rate and momentum. What do you conclude?
7. Compare the classification accuracy of 'MultilayerPerceptron' in Weka (or Javanns) with the classification accuracy of Weka J48. Comment on the pros and cons of employing these two different types of classifiers for classification tasks.
8. [Optional for COSC2110] Experimenting with both Javanns and Weka MultilayerPerceptron, what are the pros and cons of these two different software programs for neural network training? What makes you decide to choose to use either Javanns or Weka? Provide your reasoning.

Report Length: Up to two pages.

3.2 Part 2: Classification with Ensemble Methods (10 marks)
This part involves the following file: hepatitis.arff, in the directory:
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/arff/UCI/
The main goal is to try to predict as accurately as possible if a patient would die. In other words, the task is more concerned with predicting correctly the class label of those instances associated with the label "die". You are expected to use ensemble methods (e.g., Bagging, Random Forest, AdaBoost, Voting and Stacking) for this task.
1. Load the data set, and run ZeroR (i.e., a very basic classifier) and J48 (a typical decision tree) to establish two baselines for comparing your ensemble methods.
2. Discuss whether or not ensemble methods (such as Bagging) would be suitable for handling this specific data set on hepatitis.
3. From Weka, run Bagging (via meta classifiers). Run the Bagging classifier for different numbers of iterations, for example 5, 10, 100, 200, 500, 1000, and build a table of results. What do you observe? Provide your explanation.
4. Repeat the above experiment running the AdaBoostM1 classifier and the RandomForest classifier, and build your table of results. What do you observe for the results across different ensemble methods? Provide your explanation.
5. From Weka, run the "Vote" ensemble method, and try to include several different machine learning (ML) models, such as OneR, J48, NaiveBayes, Neural Network, etc. for the ensemble. The idea is to use as diverse a range of ML models as possible. Run "Vote" with this composition of ML algorithms, and also with the Vote default setting (which uses only ZeroR). Include these results for comparison in a table, and provide your analysis on whether or not there are differences between the two. In addition, is the "Vote" method (with a diverse composition of ML algorithms) better than the previously used ensemble methods? Provide your explanation.
6. Considering that the task is more concerned with predicting as accurately as possible if a patient would die, and we wish to minimize the chance of misdiagnosing anyone who would die, what would be the most appropriate (and meaningful) performance measure to use here? You need to provide a sufficiently detailed explanation to demonstrate your understanding.
7. Comparing with Part 1 (i.e., using neural networks), what differences in the results can you observe? Discuss the issue of using just classification error (or accuracy) as a performance measure.
8. From the above experimental runs and result analysis, explain whether (or not) ensemble methods should be considered as effective data mining methods.

Report Length: Up to two pages.

3.3 Part 3: Data Mining (15 marks)
This part of the assignment is concerned with the data file IMDB-movie-data.csv, which is in the directory:
/KDrive/SEH/SCSIT/Students/Courses/COSC2111/DataMining/data/other/
Your task is to analyse this data with appropriate classification, clustering, association finding, attribute selection and visualisation techniques selected from the Weka menus and identify any "golden nuggets" in the data. If you don't use any of the above techniques, you need to say why. You need to provide a report for this analysis, focusing on the following two aspects:
1. Describe the strategies you have adopted, your methodologies, the runs you performed, any "golden nuggets" you found and your conclusions.
2. Discuss the advantages and disadvantages of each of your chosen data mining methods. Make sure you provide a rationale for your choices, and why each worked well (or not well) for discovering "golden nuggets".
Please follow the provided guidelines on how to write this part of the report. They are provided as a separate document listed under the Assignment 2 option on Canvas.

Report Length: Up to two pages.

3.4 Part 4: Self-reflection (3 marks)
In this task, you will need to provide a recorded video presentation (3 or 4 minutes, with no more than 5 presentation slides) of your reflection on what you have learnt from this course on Data Mining. In particular, you should focus on answering the following questions:
• Have you gained a much improved understanding of key data mining concepts and major techniques? What is your reflection on the journey (considering now that you have completed your two assignments)?
• What is your knowledge and understanding now in determining whether there is a data mining solution for a real-world problem?
• What have you learned from doing both assignments 1 and 2, in terms of helping you extract meaningful patterns (i.e., "golden nuggets") for a real-world data mining problem?
You will need to record the presentation in MP4 format. Both the recorded video presentation and the presentation slides (PDF format) should be submitted through Canvas.

4 Alternative for this assignment
It is possible for you to choose to work on some other real-world data sets from the Kaggle Competition website: https://kaggle.com. You still need to complete all four parts (parts 1, 2, 3 and 4 as described in Section 3), with the only difference being the data set you choose to use in Part 3. For this Part 3, you are allowed to use deep learning techniques and the Python programming language. You need to consult the lecturer about this request individually to get approval before going ahead with it.

5 Submission Instructions
You need to submit the following 3 files via Canvas:
• one PDF file for the report covering Part 1 - Part 3 (note that each part has a 2-page limit, not counting the appendix where you could include your scripts).
• one MP4 video file for Part 4.
• one PDF file for your presentation slides for Part 4.

5.1 Late submission penalty
After the due date, you will have 5 business days to submit your assignment as a late submission. Late submissions will incur a penalty of 10% per day. After these five days, Canvas will be closed and you will lose ALL the assignment marks.
Assessment declaration: When you submit work electronically, you agree to the assessment declaration - https://www.rmit.edu.au/students/student-essentials/assessment-and-exams/assessment/assessment-declaration

6 Academic integrity and plagiarism (standard warning)
Academic integrity is about honest presentation of your academic work. It means acknowledging the work of others while developing your own insights, knowledge and ideas. You should take extreme care that you have:
• Acknowledged words, data, diagrams, models, frameworks and/or ideas of others you have quoted (i.e. directly copied), summarised, paraphrased, discussed or mentioned in your assessment through the appropriate referencing methods
• Provided a reference list of the publication details so your reader can locate the source if necessary. This includes material taken

Answer To: Data Mining COSC 2111/2110 Assignment 2 (Neural Networks and Ensemble Methods)

Amar Kumar answered on Oct 06 2022
1.
Data cleaning: many elements of the data may be missing or useless. Data cleaning addresses this by dealing with missing values, noisy values, or both.
a) Missing data: some attribute values are absent from the data set. This can be handled in several ways, for example by removing the affected instances, by filling in the attribute mean (numeric) or mode (nominal), or by leaving the handling to the learning algorithm.
b) Noisy data: values that are meaningless or corrupted and cannot be interpreted by the learning algorithm. Noise can be caused by data entry errors, inaccurate data gathering, or other problems in the collection process.
Data transformation: this step converts the data into forms appropriate for the mining process. It can be accomplished in the following ways:
a) Normalization: scaling the data values into a predefined range (e.g., from -1.0 to 1.0 or from 0.0 to 1.0).
b) Attribute selection/construction: new attributes are derived from the supplied set of attributes to aid the mining process.
c) Discretization: replacing the raw values of numeric attributes with interval or conceptual levels.
d) Concept hierarchy generation: attributes at a lower level of a hierarchy are generalised to a higher level; for example, "city" may be generalised to "country".
Data reduction: data mining typically has to handle large amounts of data, which makes analysis slow and expensive. Data reduction techniques produce a smaller representation of the data that preserves its analytical value, lowering analysis and storage costs while improving efficiency.
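One possible way to carry out these preprocessing steps for hepatitis.arff is through Weka's Java API. The sketch below is only an illustration: it assumes Weka 3.8 on the classpath, and the class name and file location are illustrative choices, not part of the assignment.

    // Sketch: basic preprocessing of hepatitis.arff with Weka filters.
    // Assumes Weka 3.8.x on the classpath; the file path is illustrative.
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;
    import weka.filters.unsupervised.attribute.NominalToBinary;
    import weka.filters.unsupervised.attribute.Normalize;

    public class PreprocessHepatitis {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("hepatitis.arff");
            data.setClassIndex(data.numAttributes() - 1);   // Class is the last attribute

            // 1. Missing values: replace with mean (numeric) / mode (nominal)
            ReplaceMissingValues rm = new ReplaceMissingValues();
            rm.setInputFormat(data);
            data = Filter.useFilter(data, rm);

            // 2. Encoding: turn nominal inputs into binary indicator attributes
            NominalToBinary ntb = new NominalToBinary();
            ntb.setInputFormat(data);
            data = Filter.useFilter(data, ntb);

            // 3. Normalisation: scale numeric inputs to [0, 1]
            Normalize norm = new Normalize();
            norm.setInputFormat(data);
            data = Filter.useFilter(data, norm);

            System.out.println("Inputs after encoding: " + (data.numAttributes() - 1));
            System.out.println("Classes (outputs): " + data.numClasses());
        }
    }

The same filters can equally be applied interactively from the Preprocess tab in the Weka Explorer and the result saved as a new ARFF file.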
2.
Weka's preprocessing tools are known as "filters". The Preprocess panel loads the data from a file, a SQL database, or a URL. Because the entire dataset is kept in main memory, subsampling may be necessary for very large datasets. Once loaded, the data can be processed with any of Weka's preprocessing filters. The Preprocess tab shows a histogram and summary statistics for the currently selected attribute, and histograms for all attributes can be viewed.
A neural network stops training when the error (the gap between the target output and the predicted output) drops below a chosen threshold, when a fixed number of iterations or epochs is reached, or, if a validation set is used, when the validation error stops improving (early stopping).
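As one possible recipe (assuming Weka 3.8), the sketch below randomises the data, holds out 30% as a test set with the RemovePercentage filter, and lets MultilayerPerceptron carve a validation subset out of the training data so that training stops early when the validation error keeps worsening. The split percentage, epochs, learning rate and seed are illustrative values, not prescribed ones.

    // Sketch: train/test split plus MLP early stopping via a validation set.
    // Assumes Weka 3.8.x; percentages and parameter values are illustrative.
    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.instance.RemovePercentage;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.MultilayerPerceptron;

    public class SplitAndTrain {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("hepatitis.arff");
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));

            // 70% training set: remove 30% of the instances
            RemovePercentage keepTrain = new RemovePercentage();
            keepTrain.setPercentage(30);
            keepTrain.setInputFormat(data);
            Instances train = Filter.useFilter(data, keepTrain);

            // 30% test set: keep the removed 30% instead
            RemovePercentage keepTest = new RemovePercentage();
            keepTest.setPercentage(30);
            keepTest.setInvertSelection(true);
            keepTest.setInputFormat(data);
            Instances test = Filter.useFilter(data, keepTest);

            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.setHiddenLayers("5");          // one hidden layer with 5 nodes
            mlp.setLearningRate(0.2);
            mlp.setMomentum(0.2);
            mlp.setTrainingTime(500);          // maximum number of epochs
            mlp.setValidationSetSize(20);      // 20% of train held out for validation
            mlp.setValidationThreshold(20);    // stop after 20 consecutive worse epochs
            mlp.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(mlp, test);
            System.out.printf("Test error = %.2f%%, test RMSE = %.4f%n",
                    eval.pctIncorrect(), eval.rootMeanSquaredError());
        }
    }

The same train/validation/test files could instead be produced in the Explorer's Preprocess tab (randomize, then RemovePercentage) and saved as separate ARFF files.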
3.
If the dataset is small, it is advisable to run a neural network model several times on the same dataset and report the mean performance over the repeats. This is necessary because the learning procedure is stochastic: different random weight initialisations can produce different results. It is also a good idea to use k-fold cross-validation rather than a single train/test split when estimating how the model will perform on fresh data, as this gives a more reliable assessment of model performance. Provided the dataset is not excessively large, this procedure can still be completed in a reasonable amount of time.
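To make the prediction step concrete: a trained MultilayerPerceptron produces one output per class for a test instance, and the class with the highest output (probability) is taken as the predicted label. The sketch below, assuming Weka 3.8, shows this alongside repeated 10-fold cross-validation over different seeds; the instance used as "unseen" and the number of repeats are placeholders.

    // Sketch: labelling an instance with a trained MLP and estimating error
    // with repeated 10-fold cross-validation. Assumes Weka 3.8.x.
    import java.util.Random;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.MultilayerPerceptron;

    public class PredictAndCrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("hepatitis.arff");
            data.setClassIndex(data.numAttributes() - 1);

            MultilayerPerceptron mlp = new MultilayerPerceptron();
            mlp.buildClassifier(data);

            // The network outputs one value per class; the largest one wins.
            Instance unseen = data.instance(0);   // stand-in for a genuinely new instance
            double[] dist = mlp.distributionForInstance(unseen);
            int predicted = (int) mlp.classifyInstance(unseen);
            System.out.println("Predicted class: " + data.classAttribute().value(predicted)
                    + ", distribution: " + java.util.Arrays.toString(dist));

            // Repeated 10-fold cross-validation with different seeds, mean error reported.
            double sumError = 0;
            int repeats = 5;
            for (int seed = 1; seed <= repeats; seed++) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(new MultilayerPerceptron(), data, 10, new Random(seed));
                sumError += eval.pctIncorrect();
            }
            System.out.printf("Mean 10-fold CV error over %d runs: %.2f%%%n",
                    repeats, sumError / repeats);
        }
    }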
4.
Perceptron networks have important limitations. First, because of the hard-limit transfer function, the output of a perceptron can only take one of two values (0 or 1). Second, perceptrons can only classify sets of input vectors that are linearly separable, that is, sets whose categories can be divided by a straight line or a plane. If the vectors are not linearly separable, training will never reach a point where all vectors are classified correctly. It has been shown, however, that if the vectors are linearly separable, a perceptron trained adaptively will always find a solution in a finite amount of time. Real problems often involve vectors that are not linearly separable, which is exactly where a single perceptron fails.
Networks with more than one layer of neurons can address these harder tasks. For example, suppose you have four input vectors that you want to divide into four groups separable by two lines: a two-neuron network forming two decision boundaries can split the inputs into the four categories, whereas a single perceptron cannot.
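In Weka this comparison can be run directly, since setting the hidden-layer string to "0" gives a network with no hidden layer (a single-layer perceptron-style network). The sketch below, assuming Weka 3.8, evaluates a no-hidden-layer network against a 5-hidden-node network over five seeds; the seeds, epoch limit and layer size are illustrative.

    // Sketch: no hidden layer ("0") versus one hidden layer of 5 nodes,
    // five runs each with different seeds. Assumes Weka 3.8.x.
    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.MultilayerPerceptron;

    public class PerceptronVsHidden {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("hepatitis.arff");
            data.setClassIndex(data.numAttributes() - 1);

            for (String layers : new String[] {"0", "5"}) {   // "0" = no hidden layer
                for (int seed = 1; seed <= 5; seed++) {
                    MultilayerPerceptron mlp = new MultilayerPerceptron();
                    mlp.setHiddenLayers(layers);
                    mlp.setTrainingTime(500);
                    mlp.setSeed(seed);
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(mlp, data, 10, new Random(seed));
                    System.out.printf("hidden=%s seed=%d test error=%.2f%%%n",
                            layers, seed, eval.pctIncorrect());
                }
            }
        }
    }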
5.
Cross-validation is needed to obtain a reliable estimate of test-set accuracy. If you have a large number of training instances you can afford many hidden units, but with limited data a network with only two hidden units can sometimes perform best. There is no rule that says the number of hidden units must be some multiple of the number of inputs; the ideal number of hidden units may well be smaller than the number of inputs. However, current research into the design of deep neural networks shows that complicated...
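A practical strategy is therefore to sweep a range of hidden-layer sizes, compare the cross-validated error of each, and prefer the smallest network whose error is competitive (watching train versus test error for signs of overfitting). A minimal sketch of such a sweep follows, assuming Weka 3.8; the candidate sizes and parameter values are arbitrary choices for illustration.

    // Sketch: simple search over hidden-layer sizes using 10-fold CV.
    // Assumes Weka 3.8.x; the candidate sizes are illustrative.
    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.classifiers.Evaluation;
    import weka.classifiers.functions.MultilayerPerceptron;

    public class HiddenNodeSearch {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("hepatitis.arff");
            data.setClassIndex(data.numAttributes() - 1);

            for (int hidden : new int[] {1, 2, 3, 5, 8, 10, 15, 20}) {
                MultilayerPerceptron mlp = new MultilayerPerceptron();
                mlp.setHiddenLayers(Integer.toString(hidden));
                mlp.setLearningRate(0.2);
                mlp.setMomentum(0.2);
                mlp.setTrainingTime(500);
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(mlp, data, 10, new Random(1));
                System.out.printf("hidden=%2d  CV error=%.2f%%  RMSE=%.4f%n",
                        hidden, eval.pctIncorrect(), eval.rootMeanSquaredError());
            }
        }
    }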