The following data is a sample from a loan history database of a Japanese bank Assignment 6 Mining Real Data This assignment is about applying data mining techniques to real data. A data set from a...

I need your help with this assignment


Assignment 6: Mining Real Data

This assignment is about applying data mining techniques to real data. A data set from a real medical domain, created by physicians, is provided. It contains the medical records of 310 patients who contracted Hepatitis C. The objective is to analyze the data, do some preprocessing if necessary, and generate classification models that may later be applied to new patients. Here are the major steps you have to follow in order to complete the assignment:

I Do the following experiments:

1. Read “Hepatitis Data Description” and inspect the data file hepatitis.arff. Try to understand the meaning of the data (as much as you can without being a doctor) and find any attribute relationships (e.g., between DOB and Age) that may be used later in the experiments.
2. Use the Weka preprocessing tools to analyze the data and find any errors, omissions or inconsistencies.
   • Use Preprocess to check the statistical properties of the data.
   • Use Visualize to analyze relationships and possibly find errors (e.g., see how DOB and Age are correlated).
3. Use the Weka classifiers to create and evaluate data models. For evaluation use holdout and ten-fold cross validation. Try at least the following classifiers and vary their parameters to get the best accuracy:
   • ZeroR, to get some idea about the class distribution and the lower bound of accuracy.
   • OneR
   • IBk
   • Naive Bayes
   • Decision Tree (J48)
   • Prism
     o needs discretization, or omission of numeric attributes
     o doesn’t tolerate missing values (use the ReplaceMissingValues filter).
4. You may get some insights by running the above classifiers with the training set test option (however, don't use this for evaluating models). For example, IBk with k = 1 will produce an accuracy less than 100%. When tested on the training data, the nearest neighbor of every instance is obviously the instance itself. How is it then possible to have errors?
Use Visualize classifier errors to investigate the reasons for these errors.
5. Change the data and try the classifiers again, and see the changes in accuracy. The objective here is to find the model which maximizes accuracy evaluated by ten-fold cross validation. Try:
   • changing attributes by applying discretization, attribute selection, removing dependent attributes, or other filters (e.g., replace missing values).
   • changing instances: removing duplicates, or inconsistent instances (same instances with different class values, see item 4).
   • sampling: use instance filters to select a subset of data.
   • transforming the task into a two-class problem (e.g., use the filter MergeManyValues to merge two values for the class attribute; set ignoreClass = True in the MergeManyValues parameters).

II Write a report:

1. (20 p.) Comment on Problem 2.
2. (30 p.) Comment on pieces of models or evaluation results to show trends and improvements (Problem 3). DO NOT include classifiers’ output files. Use tables.
3. (20 p.) Describe at least one change in data (see Problem 5) that improves the model in terms of accuracy measured with ten-fold cross validation. Describe what you changed and why, and how the accuracy improved.
4. (5 p.) Comment on the compactness and comprehensibility of the models.
5. (5 p.) Any other insights, thoughts, interesting results and suggestions on how to create a good model.

What to turn in:
- The report.
- A zip file containing the data sets that you have created and used.

Overview of the hepatitis data [1]

Currently there are 5 identifiable forms of viral hepatitis, namely A, B, C, D and E. All of these viruses are hepatotropic (i.e. the liver is the primary site of infection). Approximately 4 million people in the U.S. and 100 million people worldwide are infected with the Hepatitis C (HepC) virus.
Approximately 85% of persons with acute HepC develop chronic hepatitis, as determined by persistently abnormal serum enzymes and/or viremia (HepC virus RNA). Both the acute and the chronic illnesses are predominantly asymptomatic. For this reason, and because the chronic illness runs an extremely protracted course, it has been difficult to accurately define the frequency and the rate of progression to symptomatic or end-stage liver disease and death. The response rate to the current treatment for the Hepatitis C virus is only about 30-40%, and the treatment has many side effects and is expensive.

Liver biopsy is usually the most specific test to assess the nature and severity of the liver disease, so the medical literature proposes treating liver disease that biopsy shows to be stage III or IV. The liver has a rich vascular supply, so liver biopsies carry some complications. Approximately 1-3% of patients require hospitalization for complications after liver biopsy. Complications include transient, localized discomfort at the biopsy site; pain requiring analgesia; and mild transient hypotension. Approximately one quarter of patients have pain in the right upper quadrant or right shoulder after liver biopsy. Clinically significant intraperitoneal hemorrhage, although very rare, is the most serious bleeding complication of liver biopsy.

Our study therefore aims to predict the stage of the disease (I, II, III, IV) using the data mining software WEKA, thus avoiding liver biopsy and its complications and allowing the severity of the disease to be predicted early, so that treatment (medical or surgical, i.e. liver transplant) can be started early.

In the medical literature, different modes of disease transmission have been documented.
The most common modes of transmission are IVDA (Intravenous Drug Abuse), use of nasal cocaine, blood transfusion (Tx), needle stick (N) in an occupational setting (for example, accidental needle sticks in the workplace among nurses, doctors, emergency medical technicians and other health care professionals), the presence of tattoo marks, and sexual transmission. Some people have been documented to have none of the above risk factors; they fall into the category No Risk Factors (NRF). Co-infection with the Hepatitis B virus (HBV) or with the Human Immunodeficiency Virus (HIV) is also an important consideration. Alcohol use (ETOH), obesity, and co-infection with HBV and HIV make liver disease progress faster and become worse.

The current treatment options for patients infected with the HepC virus are Interferon and Ribavirin. There are 6 genotypes (GT) of the virus: 1, 2, 3, 4, 5, and 6. About 70% of the patients in the US have genotype 1. Only 50% of genotype 1 patients respond to the above treatment; genotypes other than 1 have a response rate of about 70%.

The Liver Function Test (LFT) is used as an indicator of the severity of the liver disease. It is recorded as negative if the test result is within the normal lab limits, and as positive if it is greater than 1.5 times the normal value. Duration, the number of years for which the patient has had the disease, is also important in determining the future progression of the disease.

[1] The material presented here is copied from MINING MEDICAL DATA: Predicting the stage of Hepatitis-C Using the WEKA 3.2 Data Mining System, capstone project of PADMA TATAVARTHY, SHWETHA TIPPA, KARUNASRI SEELA, CIT, CCSU, Fall 2002.

Liver biopsy is performed to stage the severity of the disease (I, II, III, IV) and also to determine whether treatment is indicated. For example, if a 50-year-old male who acquired HepC at the age of 20 through IVDA has a normal LFT and a liver biopsy (Bx) showing stage I disease, no treatment is indicated.
Generally, if the biopsy shows stage I or II, no treatment is indicated; if it shows stage III or IV, the patient is given a regimen of Interferon and Ribavirin.

The table below gives the attributes, their values and descriptions.

No  Attribute  Type         Values                           Description
--  ---------  -----------  -------------------------------  -----------
1   Sex        Categorical  M, F                             Gender
2   DOB        Numeric      Date                             Date of birth
3   DOT        Numeric      Date                             Date of transmission of the disease
4   Route      Categorical  Coc, IV, Tx, N, NRF, Tatt, Sex   The route through which the disease was transmitted
5   IV         Categorical  +, -                             Intravenous
6   Tx         Categorical  +, -                             Blood transfusion
7   Coc        Categorical  +, -                             Usage of cocaine
8   Tatt       Categorical  +, -                             Presence of a tattoo on the body of the patient
9   HBV        Categorical  +, -                             Presence of Hepatitis B virus in the patient
10  HIV        Categorical  +, -                             Presence of HIV infection
11  EtOH       Categorical  +, -                             Alcohol usage by the patient
12  Obes       Categorical  +, -                             Whether the patient is obese or not
13  Rx         Categorical  +, -                             Treatment: whether the patient has been treated
14  Tox        Categorical  +, -                             Presence of any toxic elements
15  CLD        Categorical  +, -                             Whether the patient has Chronic Liver Disease
16  GT         Categorical  -, I, II, III                    Genotype of the patient
17  LFT        Categorical  +, -                             Whether or not the Liver Function Test was done
18  Duration   Numeric                                       Number of years for which the patient had the disease. It is basically the difference between DOB and DOT.
19  Age        Numeric                                       Current age of the patient
20  Bx         Categorical  I, II, III, IV                   Biopsy result, which specifies the stage of the HepC
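Items 4 and 5 of the assignment mention inconsistent instances: rows that agree on every attribute but carry different class values. Such rows are one way a 1-nearest-neighbour classifier can misclassify its own training data. The following is a minimal Python sketch of how they could be detected outside Weka. The function name `find_inconsistent` and the toy rows are hypothetical and are not taken from hepatitis.arff; only the class attribute name `Bx` comes from the table above.

```python
from collections import defaultdict


def find_inconsistent(rows, class_attr="Bx"):
    """Return groups of identical attribute values that carry more than one class label.

    rows is a list of dicts mapping attribute names to values (an assumed
    in-memory layout, not the ARFF file format itself).
    """
    labels_by_key = defaultdict(set)
    for row in rows:
        # The key is every attribute except the class, in a stable order.
        key = tuple(sorted((name, value) for name, value in row.items()
                           if name != class_attr))
        labels_by_key[key].add(row[class_attr])
    return {key: labels for key, labels in labels_by_key.items()
            if len(labels) > 1}


# Hypothetical toy rows: the first two agree on all attributes but differ in Bx.
rows = [
    {"Sex": "M", "Age": 50, "LFT": "-", "Bx": "I"},
    {"Sex": "M", "Age": 50, "LFT": "-", "Bx": "II"},
    {"Sex": "F", "Age": 42, "LFT": "+", "Bx": "III"},
]

conflicts = find_inconsistent(rows)
for key, labels in conflicts.items():
    print(dict(key), "->", sorted(labels))
```

Whether the real data set contains such rows has to be checked on the file itself, for example with Visualize classifier errors as the assignment suggests.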
Answered 3 days after Mar 26, 2022


Mohd answered on Mar 29 2022
=== Run information ===
Scheme: weka.classifiers.rules.ZeroR
Relation: MedicalData-weka.filters.unsupervised.attribute.Remove-R2
Instances: 310
Attributes: 18
Sex
DOT
Route
IV
Tx
Coc
Tatt
HBV
HIV
EtOH
Obes
Rx
Tox
CLD
LFT
Duration
Age
Bx
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: II
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 126 40.6452 %
Incorrectly Classified Instances 184 59.3548 %
Kappa statistic 0
Mean absolute error 0.261
Root mean squared error 0.3605
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 310
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 ? 0.000 ? ? 0.196 0.013 0
0.000 0.000 ? 0.000 ? ? 0.489 0.376 I
1.000 1.000 0.406 1.000 0.578 ? 0.484 0.399 II
0.000 0.000 ? 0.000 ? ? 0.484 0.195 III
? 0.000 ? ? ? ? ? ? IV
Weighted Avg. 0.406 0.406 ? 0.406 ? ? 0.482 0.344
=== Confusion Matrix ===
a b c d e <-- classified as
0 0 4 0 0 | a = 0
0 0 118 0 0 | b = I
0 0 126 0 0 | c = II
0 0 62 0 0 | d = III
0 0 0 0 0 | e = IV
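The accuracy and Kappa statistic in the summary can be recomputed from the confusion matrix. The sketch below uses the standard definitions (accuracy is the diagonal sum over the total; Cohen's kappa compares it with the agreement expected from the row and column totals). The function name `accuracy_and_kappa` is an assumption made for illustration and is not part of Weka.

```python
def accuracy_and_kappa(matrix):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    Assumes the expected agreement is below 1, so the division is defined.
    """
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# ZeroR confusion matrix copied from the output above (classes 0, I, II, III, IV).
zero_r = [
    [0, 0, 4, 0, 0],
    [0, 0, 118, 0, 0],
    [0, 0, 126, 0, 0],
    [0, 0, 62, 0, 0],
    [0, 0, 0, 0, 0],
]

acc, kappa = accuracy_and_kappa(zero_r)
print(f"accuracy = {acc:.6f}, kappa = {kappa:.4f}")
```

Because ZeroR always predicts the majority class II, its accuracy equals the share of class II (126 of 310) and its kappa is 0, which is why it serves as the lower bound for the other classifiers.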
=== Run information ===
Scheme: weka.classifiers.rules.OneR -B 6
Relation: MedicalData-weka.filters.unsupervised.attribute.Remove-R2
Instances: 310
Attributes: 18
Sex
DOT
Route
IV
Tx
Coc
Tatt
HBV
HIV
EtOH
Obes
Rx
Tox
CLD
LFT
Duration
Age
Bx
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Age:
    < 46.5    -> I
    < 47.5    -> II
    < 49.5    -> I
    >= 49.5    -> II
    ?    -> I
(152/310 instances correct)
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 112 36.129 %
Incorrectly Classified Instances 198 63.871 %
Kappa statistic -0.0364
Mean absolute error 0.2555
Root mean squared error 0.5055
Relative absolute error 97.8901 %
Root relative squared error 140.1901 %
Total Number of Instances 310
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 ? 0.000 ? ? 0.500 0.013 0
0.568 0.615 0.362 0.568 0.442 -0.046 0.477 0.370 I
0.349 0.380 0.386 0.349 0.367 -0.032 0.484 0.399 II
0.016 0.040 0.091 0.016 0.027 -0.052 0.488 0.198 III
? 0.000 ? ? ? ? ? ? IV
Weighted Avg. 0.361 0.397 ? 0.361 ? ? 0.482 0.343
=== Confusion Matrix ===
a b c d e <-- classified as
0 3 1 0 0 | a = 0
0 67 46 5 0 | b = I
0 77 44 5 0 | c = II
0 38 23 1 0 | d = III
0 0 0 0 0 | e = IV
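The OneR model above is a short list of thresholds on the single attribute Age. As an illustration only, the rule list can be written as a small Python function; the thresholds and the class for missing values are copied from the output, while the function name `one_r_predict` and the use of `None` for a missing age are assumptions.

```python
def one_r_predict(age):
    """Apply the Age rule list printed by OneR; None stands for a missing value."""
    if age is None:
        return "I"   # "?    -> I"
    if age < 46.5:
        return "I"
    if age < 47.5:
        return "II"
    if age < 49.5:
        return "I"
    return "II"      # ">= 49.5 -> II"


for age in (40, 47, 48, 60, None):
    print(age, "->", one_r_predict(age))
```

On the full training set this rule gets 152 of 310 instances right, while ten-fold cross-validation gives 112 of 310, below the ZeroR baseline. The narrow intervals around ages 47 and 49 suggest the rule may be fitting noise in the training data rather than a real trend.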
=== Run information ===
Scheme: weka.classifiers.lazy.IBk -K 1 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: MedicalData-weka.filters.unsupervised.attribute.Remove-R2
Instances: 310
Attributes: 18
Sex
DOT
Route
IV
Tx
Coc
Tatt
HBV
HIV
EtOH
Obes
Rx
Tox
CLD
LFT
Duration
Age
Bx
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
IB1 instance-based classifier
using 1 nearest neighbour(s) for classification
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 125 40.3226 %
Incorrectly Classified Instances 185 59.6774 %
Kappa statistic 0.0666
Mean absolute error 0.2408
Root mean squared error 0.484
Relative absolute error 92.2546 %
Root relative squared error 134.243 %
Total Number of Instances 310
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 ? 0.000 ? ? 0.505 0.013 0
0.475 0.323 0.475 0.475 0.475 0.152 0.573 0.424 I
0.444 0.446 0.406 0.444 0.424 -0.001 0.497 0.406 II
0.210 0.165 0.241 0.210 0.224 0.047 0.521 0.209 III
? 0.000 ? ? ? ? ? ? IV
Weighted Avg. 0.403 0.337 ? 0.403 ? ? 0.531 0.369
=== Confusion Matrix ===
a b c d e <-- classified as
0 4 0 0 0 | a = 0
0 56 48 14 0 | b = I
0 43 56 27 0 | c = II
0 15 34 13 0 | d = III
0 0 0 0 0 | e = IV
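The per-class Precision and Recall columns follow directly from the confusion matrix: recall divides the diagonal entry by its row total, precision by its column total. The sketch below recomputes them for the IBk (k = 1) matrix above. The function name `per_class_metrics` is an assumption, and `None` is used here where Weka prints `?` for an undefined ratio.

```python
def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall)}; None marks a ratio with a zero denominator."""
    col_sums = [sum(col) for col in zip(*matrix)]
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        row_sum = sum(matrix[i])            # actual instances of this class
        recall = true_positives / row_sum if row_sum else None
        precision = true_positives / col_sums[i] if col_sums[i] else None
        metrics[label] = (precision, recall)
    return metrics


labels = ["0", "I", "II", "III", "IV"]
# IBk (k = 1) confusion matrix copied from the output above.
ibk_1 = [
    [0, 4, 0, 0, 0],
    [0, 56, 48, 14, 0],
    [0, 43, 56, 27, 0],
    [0, 15, 34, 13, 0],
    [0, 0, 0, 0, 0],
]

metrics = per_class_metrics(ibk_1, labels)
for label, (precision, recall) in metrics.items():
    print(label, precision, recall)
```

Class IV has no instances and class 0 is never predicted, which explains the `?` entries in the table: the corresponding row or column total is zero.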
=== Run information ===
Scheme: weka.classifiers.lazy.IBk -K 5 -W 0 -A "weka.core.neighboursearch.LinearNNSearch -A \"weka.core.EuclideanDistance -R first-last\""
Relation: MedicalData-weka.filters.unsupervised.attribute.Remove-R2
Instances: 310
Attributes: 18
Sex
DOT
Route
IV
Tx
Coc
Tatt
HBV
HIV
EtOH
Obes
Rx
Tox
CLD
LFT
Duration
Age
Bx
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
IB1 instance-based classifier
using 5 nearest neighbour(s) for classification
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 128 41.2903 %
Incorrectly Classified Instances 182 58.7097 %
Kappa statistic 0.0456
Mean absolute error 0.2498
Root mean squared error 0.3905
Relative absolute error 95.6956 %
Root relative squared error 108.2989 %
Total Number of Instances 310
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.003 0.000 0.000 0.000 -0.007 0.510 0.013 0
0.542 0.479 0.410 0.542 0.467 0.061 0.550 0.438 I
0.492 0.435 0.437 0.492 0.463 0.056 0.538 0.445 II
0.032 0.036 0.182 0.032 0.055 -0.009 0.467 0.190 III
?...