every instruction in pdfs



Assignment 4

So far you have learnt preprocessing data and applying various classification and regression techniques. This assignment is divided into 3 parts.

1. Part 1 – CLUSTER ANALYSIS.
1.1. The purpose of clustering and classification algorithms is to make sense of and extract value from large sets of structured and unstructured data. If you're working with huge volumes of unstructured data, it only makes sense to try to partition the data into some sort of logical groupings before attempting to analyze it.
1.2. You have to perform KMeans and hierarchical analysis on the IMDB dataset (refer to resource 4.1 below). The goal is to put all the movies that share some common characteristics in one cluster.
1.3. For KMeans you will first have to find the optimum number of clusters by plotting the SSE vs. the number of clusters (elbow method) and then proceed with applying KMeans (a hedged sketch follows the assignment text below).
1.4. For hierarchical clustering, apply single, complete and average linkage and display the dendrogram (the plot that visualizes the hierarchy).

2. Part 2 – TEXT MINING.
2.1. Text mining is the process of analyzing collections of textual materials in order to capture key concepts and themes and uncover hidden relationships and trends without requiring that you know the precise words or terms that authors have used to express those concepts.
2.2. To make sense of the text and make it useful for various ML techniques, there has to be a numeric way to express the text.
2.3. Your task is to create a count vector and a tf-idf vector on the given data (refer to resource 4.2 below; a sketch follows the assignment text below).
2.4. Display the count vector and tf-idf vector and explain the usage of tf-idf.

3. Part 3 – ARTIFICIAL NEURAL NETWORK (ANN).
3.1. An ANN is (supposed to be) modeled on a human brain, simulated in software. Like a brain, an ANN consists of layers of "neurons" that work together to produce the output.
3.2. Your task is to apply an ANN to the admission dataset used in Assignment 2, using the code given in Tutorial 6. You will have to make the necessary changes to the code to make it work for the admission dataset. Hint: encode the target attribute values to binary values (a sketch follows the assignment text below).

4. Resources
4.1. Find the IMDB dataset ("imdb_dataset.csv") under Files -> Labs -> data and find the description of the dataset here: https://www.kaggle.com/iarunava/imdb-movie-reviews-dataset
4.2. Dataset for text mining. Please note this is a Python list:
['Now for manners use has company believe parlors.', 'Least nor party who wrote while did. Excuse formed as is agreed admire so on result parish.', 'Put use set uncommonly announcing and travelling. Allowance sweetness direction to as necessary.', 'Principle oh explained excellent do my suspected conveying in.', 'Excellent you did therefore perfectly supposing described. ', 'Its had resolving otherwise she contented therefore.', 'Afford relied warmth out sir hearts sister use garden.', 'Men day warmth formed admire former simple.', 'Humanity declared vicinity continue supplied no an. He hastened am no property exercise of. ', 'Dissimilar comparison no terminated devonshire no literature on. Say most yet head room such just easy. ']

For clustering using TF-IDF for sentiment analysis, this may perhaps be a useful relevant article with Python code.
You can partition the data for training and validation, do k-means with TF-IDF after removing the labels (unsupervised), then test with the test partition and compare against the true labels in the test data. See "K-means clustering" by Daniel Foley: https://link.medium.com/Ezc6zqqcW5

Please read this article: 6.2. Feature extraction — scikit-learn 1.0.2 documentation (https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

Summary of the article above:
1. Find all unique useful words in all documents.
2. For each document, find the count vector for all the words.
3. Then find the tf-idf vector by using the inverse document frequency formula. This gives a relative score of frequencies.
4. Then, within the vector, normalize the data to lie between 0 and 1.
5. Each numerical vector represents a document. Use this numerical vector to find the distance between vectors of documents in k-means.
6. This will give you clusters (groups) of documents. The groups, discovered in an unsupervised manner, may then automatically contain related documents such as biology or chemistry and so on, depending on the context.

Since ANN concerns the classification topic from the last assignment, you are required to use the same admissions dataset as in Project 3 with binary classes.

Create a report on all three areas. The rubric for this team project is:
1. Clustering: K-Means 15% (plot SSE vs. # of clusters 5%, K-means algorithm 10%); Hierarchical 15% (single, complete and average link 9%, plot 6%). Subtotal: 30%
2. Text Mining: Create count vector and tf-idf vector (normalized vector) 20%; explain usage 10%. Subtotal: 30%
3. ANN: Attribute value binarization 10%; ANN 10%; accuracy comparison with other classification models 10%. Subtotal: 30%
4. Report with summarized findings (please submit in PDF document format): 10%
Total: 100%

You may use code from tutorials 6, 7 and 8 and write any additional code to accomplish the above goals.
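Neither the assignment nor the posted answer includes code for Part 1, so here is a minimal sketch of the elbow plot, KMeans, and the three linkage dendrograms. The file name imdb_dataset.csv and the choice to cluster on the numeric columns are assumptions; adapt them to the actual dataset and features your team selects.

```python
# Hedged sketch for Part 1: elbow plot, KMeans, and hierarchical dendrograms.
# Feature selection here (all numeric columns) is a placeholder assumption.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, dendrogram

df = pd.read_csv("imdb_dataset.csv")                      # assumed file name/location
X = StandardScaler().fit_transform(
    df.select_dtypes("number").dropna(axis=1))            # assumed: cluster on numeric columns

# Elbow method: SSE (inertia) vs. number of clusters.
ks = range(1, 11)
sse = [KMeans(n_clusters=k, random_state=0, n_init=10).fit(X).inertia_ for k in ks]
plt.plot(ks, sse, marker="o")
plt.xlabel("# of clusters"); plt.ylabel("SSE"); plt.show()

# Apply KMeans with the k read off the elbow plot (3 is only an example).
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(X)

# Hierarchical clustering: single, complete and average linkage dendrograms.
for method in ("single", "complete", "average"):
    dendrogram(linkage(X, method=method))
    plt.title(f"{method} linkage"); plt.show()
```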
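For Part 2, a minimal sketch of the count vector and tf-idf vector using scikit-learn's text feature extraction (the approach the linked documentation describes). The variable docs stands for the Python list given in resource 4.2; only the first two sentences are repeated here.

```python
# Hedged sketch for Part 2: count vector and tf-idf vector for the given list.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['Now for manners use has company believe parlors.',
        'Least nor party who wrote while did. Excuse formed as is agreed admire so on result parish.']
        # ... plus the remaining sentences from resource 4.2

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)        # raw term counts per document
print(count_vec.get_feature_names_out())
print(counts.toarray())

tfidf_vec = TfidfVectorizer()                 # tf-idf, L2-normalized by default
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray())
```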
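For Part 3, a hedged sketch of binarizing the target and training a small neural network. The Tutorial 6 code is not reproduced in this post, so scikit-learn's MLPClassifier is used here instead; the file name, target column, and 0.5 threshold are all assumptions to be matched to the admission dataset from Assignment 2 / Project 3.

```python
# Hedged sketch for Part 3: binarize the target and fit a small ANN.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

adm = pd.read_csv("admission.csv")                        # assumed file name
y = (adm["Chance of Admit"] >= 0.5).astype(int)           # assumed target column; encode to 0/1
X = StandardScaler().fit_transform(adm.drop(columns=["Chance of Admit"]))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
ann = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=0)
ann.fit(X_tr, y_tr)
print("ANN accuracy:", accuracy_score(y_te, ann.predict(X_te)))
# Compare this accuracy against the classifiers built in the previous assignment.
```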
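The six steps summarized above could be wired together roughly as follows; this is only a sketch, with a placeholder document list and an arbitrary cluster count.

```python
# Hedged sketch: tf-idf vectors fed into k-means, following steps 1-6 above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["first document text ...", "second document text ..."]   # any document collection
tfidf = TfidfVectorizer(stop_words="english")   # steps 1-4: vocabulary -> counts -> normalized tf-idf
X = tfidf.fit_transform(docs)

km = KMeans(n_clusters=2, random_state=0, n_init=10)   # step 5: distances between document vectors
groups = km.fit_predict(X)                             # step 6: one cluster label per document
print(groups)
```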
Answered 3 days after May 06, 2022

Answer To: every instruction in pdfs

Uhanya answered on May 08 2022
98 Votes
Part 1: Cluster Analysis:
K-Means Clustering Algorithm
K-means clustering is one of the most popular unsupervised learning algorithms. It is used for clustering problems in machine learning.
What is the K-Means Algorithm?
The K-means clustering algorithm is used to group unlabelled data into clusters. The number of clusters is specified by the value of K: if K=2, the data is grouped into two clusters; if K=3, into three (a minimal illustration follows below).
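A minimal illustration, on toy data not taken from the original answer, of how K fixes the number of clusters in scikit-learn:

```python
# Hedged illustration: n_clusters (K) determines how many groups KMeans produces.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [20, 1], [21, 2]])
print(KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(points))  # two groups
print(KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(points))  # three groups
```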
Findings:
The IMDB movie ratings dataset is used for cluster analysis. The attributes are listed below:
['Unnamed: 0', 'title', 'title_type', 'genre', 'runtime', 'mpaa_rating', 'studio', 'thtr_rel_year',...