Assignment specifications You are required to perform an Investigation to determine appropriate methods for a data analysis task, which is the authorship attribution tasks described above. This...

Subject name: Social Media Data Analytics. You have to select three models, describe them, write their advantages and disadvantages, and compare them. The Investigation Assignment file is an example of the assignment; the answer has to be better than that. You can select R vs SAS vs RapidMiner.



Assignment specifications
You are required to perform an Investigation to determine appropriate methods for a data analysis task, which is the authorship attribution task described above. This investigation will form the basis for your second assignment, the Data Analysis assignment.
The task is to select three different, complementary methods to perform an authorship attribution task. The methods should be comparable somehow, for example, you can compare n-gram methods with binary, characters or whole words; alternatively, it could be Rapid Miner vs R vs SAS; or maybe you could compare different machine learning methods such as SVM and random forest.
When you select your preferred method, you must justify your choice wholly by evidence from the literature. You will not in Assignment 1 (Investigation) be implementing anything, as that is what you will do in Assignment 2 (Data Analysis); your assessment of methods needs to be supported by academic literature sources or other reliable sources. You must have at least one source for each method, but will gain best marks for having at least two sources for each method.
Structure for your investigation
· Describe the three methods you are comparing (9 marks total, 3 marks for each method)
· For each method, name (and cite) its advantages and disadvantages (6 marks total, 2 marks for each method)
· Make a comparison table for each method (rows) versus advantages/disadvantages (columns). Use the following table structure (6 marks total, 2 for each method):
Method | Advantages | Disadvantages
Method1 | |
Method2 | |
Method3 | |
· Decide which method will achieve the best outcome, explaining the rationale behind your choice (4 marks total).
Marking
There are up to 25 marks available for the Investigation assessment. Marks are allocated according to the Structure (see above).
Submission
The Investigation submission date is given on the main course webpage and will be the last day of week 6.
A single PDF or doc/docx should be submitted by each person. Multiple files will not be marked.
Penalties
There is a 2.5-mark penalty per day or part thereof for late submission. Submission will close completely 48 hours after the deadline. A 1-mark penalty will apply if the submission format above is not observed.
Last modified: Thursday, 20 February 2020, 10:09 AM
Investigation Assignment of Data Analysis
Introduction
Data analysis is a task undertaken especially in the data science and data mining fields. Studies report that many methods are available to analyse data, such as classification, descriptive analysis, regression, random forest, neural networks and decision trees. Here, only three methods are examined: classification, decision tree, and random forest. The methods are compared to find the best one for data analysis.
Description of Methods for Data Analysis
Classification Method
Classification is a machine learning or data mining approach that is widely used for data analysis. It can take organised or unorganised data. Classification assigns data to a set of predefined classes (categories). The basic aim of classification is to determine the class or category into which new data falls. There are several types of classification, but only binary classification is discussed here.
Binary Classification: a classification task with two possible outcomes, e.g. gender classification (female or male).
The following steps are needed to build a classification model:
· Initialize the classifier to be used.
· Train the classifier: all classifiers in scikit-learn use a fit(X, y) method to fit the model (training) for the given training data X and training labels y.
· Predict the target: given an unlabeled observation X, predict(X) returns the predicted label y.
· Evaluate the classifier model.
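The four classification steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the `NearestCentroidClassifier` class, the `accuracy` helper and the toy data are all hypothetical, and the class merely mirrors the scikit-learn style of a training method that takes (X, y) and a `predict(X)` method; it is not scikit-learn itself.

```python
from collections import defaultdict


class NearestCentroidClassifier:
    """Toy binary classifier following the fit(X, y) / predict(X) convention."""

    def fit(self, X, y):
        # Training: average the feature vectors of each class into a centroid.
        sums = {}
        counts = defaultdict(int)
        for features, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(features)
            sums[label] = [s + f for s, f in zip(sums[label], features)]
            counts[label] += 1
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, X):
        # Prediction: assign each sample to the class with the closest centroid.
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda lab: sq_dist(x, self.centroids_[lab]))
            for x in X
        ]


def accuracy(y_true, y_pred):
    # Evaluation: fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented toy data: two features per sample, two classes.
X_train = [[1.0, 1.2], [0.8, 1.0], [4.0, 4.2], [4.2, 3.9]]
y_train = ["A", "A", "B", "B"]
X_test = [[1.1, 0.9], [3.8, 4.1]]
y_test = ["A", "B"]

clf = NearestCentroidClassifier()   # Step 1: initialize the classifier
clf.fit(X_train, y_train)           # Step 2: train on labelled data
y_pred = clf.predict(X_test)        # Step 3: predict labels for new data
score = accuracy(y_test, y_pred)    # Step 4: evaluate the classifier
print(y_pred, score)
```

Any other classifier exposing the same two methods could be dropped into the last four lines without changing the workflow.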
Decision Tree Method
The second method is the decision tree, a tree-shaped diagram used in classification or regression models to analyse data. It breaks the data into successively smaller subsets according to the values of their attributes. Branches represent how the data is divided, and together they form the tree. A tree is made up of the following elements:
· Every internal node represents a test that decides which node comes next, e.g. whether a crow drinks water or not.
· Each leaf node holds a class label, the decision reached after evaluating the tested features.
· Branches connect the nodes to one another.
· Classification rules are the paths from the root to a leaf node.
[Figure: example dataset for account management]
Random Forest Method
Random forest, also known as random decision forest, is another method to analyse data; it is made up of multiple decision trees for classification or regression. It builds several decision trees during training and outputs the mode of the trees' predictions (for classification) or their mean (for regression). The idea behind random forest is simple but powerful: many individual trees are combined to produce a single result. It rests mainly on two underlying principles:
· Uncorrelated trees work together as a group that performs better than any individual underlying tree.
· Trees with low correlation cover for one another's errors and flaws, so the group as a whole moves in the right direction.
Hence, the output of a random forest is generally accurate across many scenarios.
[Figure: illustration of how the random forest method works]
Pros and Cons of Described Methods for Data Analysis
Pros and Cons of Classification Method
The classification method has the following advantages and disadvantages.
Pros
· Training takes little time, as the method is easy to understand and implement.
· It works fast on noisy data.
· It gives good results with multiple classes or models.
· Interpretation, evaluation, and explanation take little effort.
Cons
· It can be sensitive to the local structure of the data.
· It can suffer from limited memory, complex operation, and slow speed in lazy algorithms.
· In a big data set, classes may be duplicated or mixed up.
Pros and Cons of Decision Tree Method
The advantages and disadvantages of the decision tree method are well covered in the literature.
Pros
· Decision trees need less pre-processing effort than other methods, as they always follow a set of rules.
· Normalization or scaling of the data is not required.
· Missing values do not considerably affect the tree-building process, and a tree is easy to build and easy for any layperson to understand.
Cons
· Many different class labels make the task complex.
· A minor change anywhere in the data set may disturb the whole structure of the tree.
· It is not well suited to predicting continuous values in regression models.
Pros and Cons of Random Forest Method
The advantages and disadvantages of the random forest method are listed below.
Pros
· It handles large amounts of data effectively and efficiently because it combines many inputs into a single result.
· It is a preferable method for missing data, as it maintains accuracy where data is missing.
· It is suitable where the data is imbalanced, as it can balance the data.
Cons
· It works better for classification than for regression, as it cannot give precise predictions of a continuous nature, and it struggles with noisy data.
· It offers little control over how the forest works, because it follows its own specific rules.
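As a rough illustration of the random forest idea described earlier (several trees are trained and the result is the mode of their outputs), the sketch below trains one-split trees ("stumps") on bootstrap samples and takes a majority vote. It is a simplified bagging-style sketch, not a full random forest: real implementations also sample a random subset of features at each split, which is omitted here. The function names and the one-feature data are invented for the example.

```python
import random
from collections import Counter


def train_stump(samples):
    """Fit a one-split tree on (value, label) pairs: pick the best threshold."""
    best = None
    for threshold in sorted({v for v, _ in samples}):
        left = [lab for v, lab in samples if v <= threshold]
        right = [lab for v, lab in samples if v > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label for that side.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(lab != left_label for lab in left) + sum(
            lab != right_label for lab in right
        )
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda v: left_label if v <= threshold else right_label


def train_forest(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        train_stump([rng.choice(samples) for _ in samples])
        for _ in range(n_trees)
    ]


def forest_predict(forest, value):
    """Classification output: the mode (majority vote) of the trees' answers."""
    votes = Counter(tree(value) for tree in forest)
    return votes.most_common(1)[0][0]


# Invented one-feature data: small values are "low", large values are "high".
data = [(1, "low"), (2, "low"), (3, "low"), (7, "high"), (8, "high"), (9, "high")]
forest = train_forest(data)
print(forest_predict(forest, 0.5), forest_predict(forest, 9.5))
```

An individual stump can be wrong when its bootstrap sample happens to contain only one class, but the vote across many stumps is what the final prediction relies on.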
Comparison of Described Methods for Data Analysis
Method | Advantages | Disadvantages
Classification | It takes little effort to understand, implement, interpret, evaluate, and explain because of its classifier models or classes. | Its main problems are shortage of memory, inability to handle big data, and slow algorithm execution caused by low memory.
Decision Tree | It works better than the classification method: it handles big data sets, does not require normalization of data, and works effectively with missing values. | It also has flaws: many classes or nodes make it complex, a small change in the data can affect the entire tree, and it is preferable only for classification, not for regression or prediction models.
Random Forest | It is the most effective of the three methods. It handles big data sets efficiently by taking multiple inputs and returning a single output, it balances imbalanced data, and it works with missing data. | It has some gaps: it suits classification rather than regression, and it offers little control over how the forest is built.
The Finest or Best Method from All
This investigation looks at methods suitable for analysing data in data mining and related tasks. Three methods (classification, decision tree, and random forest) are discussed above with their advantages and disadvantages to evaluate which method is best and why. Every method is good in its own way, because each follows a set of rules and fits specific datasets and circumstances, but every method also has limitations that must be respected for a successful implementation.
As discussed before, every method has pros and cons; no single method suits all types of data and circumstances. The goal is to find the method that offers the most advantages. Here, random forest is the best of the three methods because it deals with small and large data sets, missing data, and imbalanced data sets, and it can cope with complex datasets and circumstances as well. It is described in more detail under its main heading above. It is the most recent of the three methodologies; it takes multiple decision trees as input and produces a single result.
Reference
https://www.tutorialspoint.com/data_mining/dm_classification_prediction.htm
https://www.researchgate.net/figure/Advantages-And-Disadvantages-Of-Classification-Techniques_tbl1_291019467
https://towardsdatascience.com/decision-tree-in-machine-learning-e380942a4c96
https://medium.com/@dhiraj8899/top-5-advantages-and-disadvantages-of-decision-tree-algorithm-428ebd199d9a
https://en.wikipedia.org/wiki/Random_forest
https://www.quora.com/What-are-the-advantages-and-disadvantages-for-a-random-forest-algorithm
Answered Same Day, Apr 08, 2021


Deepti answered on Apr 10 2021
136 Votes
Data Analytics Techniques
Introduction
Authorship attribution is the science of identifying characteristics of an author by analyzing the characteristics of the text written by that author. The task focuses on defining and delineating the content that captures the author's writing style. Owing to the high level of difficulty of this identification task, several classification methods are used in the process. Techniques applied to analyze the data include the Decision Tree method, Support Vector Machine (SVM) predictive analytics, K-Nearest Neighbors (KNN), the Random Forest method and Delta. Authorship attribution through computational and statistical methods has been advantageous in fields like information retrieval, machine learning and NLP. Problem-solving methods in data analytics combine certain tools with a step-by-step process. The following sections describe in detail three main techniques used for data analysis: KNN, Delta and SVM.
K-Nearest Neighbors
KNN is a simple supervised machine learning algorithm used to address classification and regression problems (Yang, 2006). A classification problem has discrete output values, while a regression problem has real numbers as output values. The KNN algorithm works on the assumption that similar things exist in close proximity. The idea of similarity, measured as the distance between two points, is what makes KNN useful in analyzing data. The KNN algorithm follows the steps described below:
Step 1: The data is loaded.
Step 2: The variable K is set to the chosen number of neighbors.
Step 3: For each instance in the data,
· The distance between the current instance and the query example is calculated.
· The index of the instance and the calculated distance are added to an ordered collection.
Step 4: The ordered collection is sorted in ascending order by the calculated distances.
Step 5: The first K entries are selected from the sorted collection, and their labels are retrieved.
Step 6: In case of classification, the mode of the K labels is returned.
Step 7: In case of regression, the mean of the K labels is returned.
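The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `knn`, its `task` parameter and the toy data are assumptions made for the example, and Euclidean distance is assumed as the distance measure. Note that the sketch sorts the collection by distance, which is what selecting the K nearest entries requires.

```python
import math
from collections import Counter
from statistics import mean


def knn(data, query, k, task="classification"):
    """data: list of (features, label) pairs; query: list of feature values."""
    # Step 3: record (distance, index) for every instance in the data.
    distances_and_indices = []
    for index, (features, _) in enumerate(data):
        distance = math.dist(features, query)  # Euclidean distance (assumed)
        distances_and_indices.append((distance, index))

    # Step 4: sort ascending by distance (ties are broken by index).
    distances_and_indices.sort()

    # Step 5: take the first K entries and look up their labels.
    k_labels = [data[index][1] for _, index in distances_and_indices[:k]]

    # Steps 6 and 7: mode for classification, mean for regression.
    if task == "classification":
        return Counter(k_labels).most_common(1)[0][0]
    return mean(k_labels)


# Steps 1 and 2: load (invented) data and choose K.
class_data = [
    ([1.0, 1.0], "x"), ([1.2, 0.8], "x"), ([0.9, 1.1], "x"),
    ([5.0, 5.0], "y"), ([5.2, 4.9], "y"),
]
reg_data = [([1.0], 10.0), ([2.0], 20.0), ([3.0], 30.0), ([10.0], 100.0)]

predicted_class = knn(class_data, [1.1, 1.0], k=3)
predicted_value = knn(reg_data, [2.0], k=3, task="regression")
print(predicted_class, predicted_value)
```

Every prediction scans the whole data set, which is why the cost of a query grows with the amount of stored data.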
It is very important to choose a value of K that is appropriate to the data (Zhang, 2016). The main advantages of the KNN algorithm are:
· It is a versatile algorithm, as it can be used for different problems like regression, search and classification.
· It does not require building a model, tuning several parameters or making additional assumptions beyond the one basic assumption of proximity.
· It is easy to implement.
The disadvantages include:
· Data normalization is an issue: with unnormalized data, the calculated distance may be biased towards a particular dimension.
· It assumes that the dimensions are independent and does not account for relationships between them.
· The KNN algorithm becomes significantly slower as the number of independent variables increases, to the extent that it is an impractical choice when fast predictions are required.
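The normalization issue in the first point can be made concrete with a small sketch. The data, the helper names `min_max_scale` and `nearest_index`, and the choice of min-max scaling are all assumptions for illustration; other scaling schemes would serve equally well.

```python
import math


def min_max_scale(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


def nearest_index(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))


# Invented data: (age in years, income in currency units).
people = [
    [25, 50_000],  # 0: the query
    [26, 58_000],  # 1: similar age, somewhat different income
    [60, 51_000],  # 2: very different age, almost the same income
    [40, 90_000],  # 3: widens the income range
]

raw_neighbor = nearest_index(people, 0)
scaled_neighbor = nearest_index(min_max_scale(people), 0)
print(raw_neighbor, scaled_neighbor)
```

On the raw values the income column dominates the distance, so the nearest neighbor is chosen almost entirely by income; after scaling, both columns contribute on a comparable footing and a different neighbor is selected.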
Delta
Delta Method (Alex Deng,...