CISC 5790: Data Mining Prof. Yijun Zhao Fordham University, Spring 2023 Course Project Due: May 8 1 Introduction This project requires you to explore classification algorithms on a real world...

This project has to be done in Python. One important reminder: sklearn (scikit-learn) and any other built-in classification libraries must not be used. All the code has to be written out in detail. There should also be a PowerPoint presentation explaining the steps and outcome of the project.


CISC 5790: Data Mining
Prof. Yijun Zhao
Fordham University, Spring 2023
Course Project
Due: May 8

1 Introduction

This project requires you to explore classification algorithms on a real world dataset, and write a report explaining your experimental results. The language of implementation is up to you; the only requirement is that your program be able to interpret the data format specified below, and be able to classify instances and produce interesting statistics such as accuracy, false positive rate, false negative rate, etc. You are free to construct whatever user interface for your program, but you must fully document your interface.

2 Algorithm

• Your algorithm should be based on the classification algorithms learned during the course. Usually a straight forward implementation of one method will not lead to satisfactory performance. Your algorithm can be a combination of methods and should incorporate one or more data mining techniques when the situation arises. These techniques include (and certainly not limited to):
  – Handling imbalanced dataset
  – Proper imputation methods for missing values
  – Different treatment of various type of features: continuous, discrete, categorical, etc.

3 Data

You’ll be examining the behavior of your model on a dataset from the UCI machine learning lab. The dataset is represented in a standard format, consisting of 3 files. The first file, census-income.names, describes the categories and features of the dataset. It also has some empirical results for your reference. The other two files are census-income.data and census-income.test, containing the actual data instances, formatted at one instance per line, as follows:

F_1^1, F_1^2, ..., F_1^k, label_1
F_2^1, F_2^2, ..., F_2^k, label_2
...
F_n^1, F_n^2, ..., F_n^k, label_n

where F_i^j, label_i (i = 1, ..., n, j = 1, ..., k) represent the value of the jth feature and class category for the ith instance respectively.
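As an illustration of the one-instance-per-line format described above, here is a minimal Python sketch of a reader that uses only the standard library. The function name `parse_instances`, the `missing_marker` default of "?", and the sample lines are assumptions made for this example; they are not taken from the assignment or from the data files, so check census-income.names for the actual conventions.

```python
import csv
import io


def parse_instances(lines, missing_marker="?"):
    """Parse comma-separated lines into (features, label) pairs.

    Assumption: the last field on each line is the class label, and
    missing feature values are written as `missing_marker` (hypothetical
    default; verify against census-income.names).
    """
    instances = []
    for row in csv.reader(lines, skipinitialspace=True):
        fields = [field.strip() for field in row]
        if len(fields) < 2:
            # Skip blank lines and lines that carry no feature/label pair.
            continue
        *features, label = fields
        # Represent missing values as None so they can be imputed later.
        features = [None if f == missing_marker else f for f in features]
        instances.append((features, label))
    return instances


# Made-up toy lines, not rows from the real dataset.
sample = io.StringIO(
    "41, Private, Bachelors, <=50K\n"
    "52, ?, Masters, >50K\n"
    "\n"
)
rows = parse_instances(sample)
for features, label in rows:
    print(features, label)
```

For the real files, the same function could be called on an open file handle, for example `parse_instances(open("census-income.data", newline=""))`. All values come back as strings, so continuous features would still need converting to numbers.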
The data you will be examining was extracted from the census bureau database. Each instance contains an individual’s educational, demographic and family information. Prediction task is to determine whether a person makes over 50K a year. You should use census-income.data to train your classifier and use census-income.test to evaluate the performance of your learning algorithm.

4 Your Mission...

Deliverables for this project are:
• Code to implement the classification algorithm for the data file formats given above
• A README file, with simple, clear instructions on how to compile and run your code
• Testing statistics for the application of your learning algorithm. At a minimum you should provide training set accuracy, test set accuracy
• A discussion of data mining techniques employed in your algorithm
• A report analyzing the behavior of your algorithm on the dataset, including any unusual or anomalous (in your opinion) behavior

5 How to turn in your code

• Create a README file, with simple, clear instructions on how to compile and run your project. If the TA cannot run your program by following the instructions, you will receive 50% of programing score.
• Zip all your files (code, README, written report, etc.) in a zip file named {firstname} {lastname} CS5790 project.zip and upload it to Blackboard
• Only one person in your group needs to turn in the code and the report. Make sure every team member’s name is listed on the cover of the report
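The testing statistics named in the assignment (accuracy, false positive rate, false negative rate) can be computed without any classification library. The sketch below is one possible way to do it in plain Python. The function name `binary_metrics`, the choice of ">50K" as the positive class, and the toy label lists are assumptions for this example only.

```python
def binary_metrics(y_true, y_pred, positive=">50K"):
    """Return accuracy, false positive rate and false negative rate.

    Assumption: `positive` is the label treated as the positive class
    (hypothetical default; use whatever label string the data contains).
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # FPR = FP / (FP + TN): share of true negatives flagged positive.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # FNR = FN / (FN + TP): share of true positives that were missed.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up labels, used only to exercise the function.
y_true = [">50K", ">50K", "<=50K", "<=50K", "<=50K"]
y_pred = [">50K", "<=50K", "<=50K", ">50K", "<=50K"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Running the same function on predictions for census-income.data and census-income.test would give the training-set and test-set accuracy that the deliverables ask for. On an imbalanced dataset the two error rates are usually more informative than accuracy alone.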
Answered 14 days After Mar 27, 2023


Mukesh answered on Apr 01 2023
PowerPoint Presentation
Income Classification
About Dataset
An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc.
This is a widely cited KNN dataset. I encountered it during my course, and I wish to share it here because it is a good starter example for data pre-processing and machine learning practices.
Fields
The dataset contains 16 columns
Target field: Income
-- The income is divided into two classes: <=50K and >50K
Number of attributes: 14
-- These are the demographics and other features to describe a person
We can explore the possibility of predicting income level based on the individual’s personal information.
Independent features
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales,...
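The feature list above mixes continuous fields (such as age) with categorical ones (such as workclass), and a distance-based method like KNN needs both turned into comparable numbers. The following is a minimal Python sketch of one common treatment: z-score standardization for continuous values and one-hot encoding for categorical values. The helper names `standardize` and `one_hot` and the sample ages are assumptions for this example; it is not the method used in the answer.

```python
# Category values copied from the workclass entry in the list above.
WORKCLASS = [
    "Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
    "Local-gov", "State-gov", "Without-pay", "Never-worked",
]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over `categories`.

    A value not in `categories` (for example a missing one) becomes
    an all-zero vector.
    """
    return [1.0 if value == category else 0.0 for category in categories]


def standardize(values):
    """Rescale continuous values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # Fall back to 1.0 when all values are identical to avoid dividing by 0.
    std = variance ** 0.5 or 1.0
    return [(v - mean) / std for v in values]


# Made-up ages, used only to exercise the helpers.
ages = [20.0, 40.0, 60.0]
scaled_ages = standardize(ages)
encoded = one_hot("Local-gov", WORKCLASS)
print(scaled_ages)
print(encoded)
```

In practice the mean and standard deviation would be computed on the training file only and then reused on the test file, so that no test-set information leaks into training.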