I need your help with this assignment: The following data is a sample from a loan history database of...

Question

I need your help with this assignment. The following data is a sample from a loan history database of a Japanese bank.

Assignment 6: Mining Real Data

This assignment is about applying data mining techniques to real data. A data set from a real medical domain, created by physicians, is provided. It contains the medical records of 310 patients who contracted Hepatitis C. The objective is to analyze the data, do some preprocessing if necessary, and generate classification models that may later be applied to new patients. Here are the major steps you have to follow to complete the assignment:

I. Do the following experiments:

1. Read “Hepatitis Data Description” and inspect the data file hepatitis.arff. Try to understand the meaning of the data (as much as you can without being a doctor) and find any attribute relationships (e.g., between DOB and Age) that may be used later in the experiments.
2. Use the Weka preprocessing tools to analyze the data and find any errors, omissions or inconsistencies.
   • Use Preprocess to check the statistical properties of the data.
   • Use Visualize to analyze relationships and possibly find errors (e.g., see how DOB and Age are correlated).
3. Use the Weka classifiers to create and evaluate data models. For evaluation use holdout and ten-fold cross validation. Try at least the following classifiers and vary their parameters to get the best accuracy:
   • ZeroR, to get some idea about the class distribution and the lower bound of accuracy.
   • OneR
   • IBk
   • Naive Bayes
   • Decision Tree (J48)
   • Prism
     o needs discretization, or omission of numeric attributes
     o doesn’t tolerate missing values (use the ReplaceMissingValues filter).
4. You may get some insights by running the above classifiers with the training set test option (however, don't use this for evaluating models). For example, IBk with k=1 will produce an accuracy less than 100%.
   When tested on the training data, the nearest neighbor of every instance is obviously the instance itself. How is it then possible to have errors? Use Visualize classifier errors to investigate the reasons for these errors.
5. Change the data and try the classifiers again, and see the changes in accuracy. The objective here is to find the model which maximizes accuracy evaluated by ten-fold cross validation. Try:
   • changing attributes: applying discretization, attribute selection, removing dependent attributes, or other filters (e.g., replace missing values).
   • changing instances: removing duplicates, or inconsistent instances (same instances with different class values, see item 4).
   • sampling: use instance filters to select a subset of data.
   • transforming the task into a two-class problem (e.g., use the filter MergeManyValues to merge two values for the class attribute; set ignoreClass = True in the MergeManyValues parameters).

II. Write a report:

1. (20 p.) Comment on Problem 2.
2. (30 p.) Comment on pieces of models or evaluation results to show trends and improvements (Problem 3). DO NOT include classifiers’ output files. Use tables.
3. (20 p.) Describe at least one change in data (see Problem 5) that improves the model in terms of accuracy measured with ten-fold cross validation. Describe what you changed and why, and how the accuracy improved.
4. (5 p.) Comment on the compactness and comprehensibility of the models.
5. (5 p.) Any other insights, thoughts, interesting results and suggestions on how to create a good model.

What to turn in:
- The report.
- A zip file containing the data sets that you have created and used.

Overview of the hepatitis data [1]

Currently there are 5 identifiable forms of viral hepatitis, namely A, B, C, D and E. All of these viruses are hepatotropic (i.e. the liver is the primary site of infection). Approximately 4 million people in the U.S. and 100 million people worldwide are infected with the Hepatitis C (HepC) virus. Approximately 85% of persons with acute HepC develop chronic hepatitis, as determined by persistently abnormal serum enzymes and/or viremia (HepC virus RNA). Both the acute and the chronic illnesses are predominantly asymptomatic. For this reason, and because the chronic illness runs an extremely protracted course, it has been difficult to accurately define the frequency and the rate of progression to symptomatic or end stage liver disease and death.

The response rate to the current treatment for the Hepatitis C virus is only about 30-40%; the treatment also has many side effects and is expensive. Liver biopsy is usually the most specific test to assess the nature and severity of the liver disease. The medical literature therefore proposes treating liver disease that is stage III or IV on liver biopsy. The liver has a rich vascular supply, so there are some complications associated with liver biopsies. Approximately 1-3% of patients require hospitalization for complications after liver biopsy. Complications include transient, localized discomfort at the biopsy site; pain requiring analgesia; and mild transient hypotension. Approximately one quarter of the patients have pain in the right upper quadrant or right shoulder after liver biopsy. Although very rare, clinically significant intraperitoneal hemorrhage is the most serious bleeding complication of liver biopsy.

Therefore, in order to avoid the complications associated with liver biopsy and to predict the severity of the disease early, so that treatment (medical or surgical, i.e. liver transplant) can be started early, our study aims to predict the stage of the disease (I, II, III, IV) using the data mining software WEKA, thus avoiding liver biopsy.

In the medical literature, different modes of disease transmission have been documented.
The most common modes of transmission are IVDA (Intravenous Drug Abuse), use of nasal cocaine, blood transfusion (Tx), needle stick (N) at work (for example, accidental needle sticks among nurses, doctors, emergency medical technicians and other health care professionals), the presence of tattoo marks, and sexual transmission. It has been documented that some people do not have any of the above risk factors and fall into the category of No Risk Factors (NRF). Co-infection with the Hepatitis B virus (HBV) or with Human Immunodeficiency Virus (HIV) is also an important consideration.

Alcohol use (ETOH), obesity, and co-infection with HBV and HIV make liver disease progression faster and worse. The current treatment options for patients infected with the HepC virus are Interferon and Ribavirin. There are 6 genotypes (GT) of the Hepatitis virus: 1, 2, 3, 4, 5, and 6. About 70% of the patients in the US have genotype 1. Only 50% of genotype 1 patients respond to the above treatment. Genotypes other than 1 have a response rate of about 70%.

The Liver Function Test (LFT) is used as an indicator of the severity of the liver disease. It is recorded as negative if the test result is within the normal lab limits, and as positive if it is greater than 1.5 times the normal value. Duration, the number of years for which the patient has had the disease, is also important in determining the future progression of the disease.

[1] The material presented here is copied from MINING MEDICAL DATA: Predicting the stage of Hepatitis-C Using the WEKA 3.2 Data Mining System, capstone project of PADMA TATAVARTHY, SHWETHA TIPPA, KARUNASRI SEELA, CIT, CCSU, Fall 2002.

Liver biopsy is performed to stage the severity of the disease (I, II, III, IV) and also to determine whether treatment is indicated.
For example, if a 50-year-old male who acquired HepCV at the age of 20 by IVDA has a normal LFT and the liver biopsy (Bx) shows stage I disease, no treatment is indicated.

Generally, if the biopsy shows stage I or II, no treatment is indicated; if it shows stage III or IV, the patient is given a regimen of Interferon and Ribavirin.

The table shown below gives the attributes, their values and descriptions.

No  Attribute  Type of Data  Values                          Description
1   Sex        Categorical   M, F                            Gender
2   DOB        Numeric       Date                            Date of birth
3   DOT        Numeric       Date                            Date of transmission of the disease
4   Route      Categorical   Coc, IV, Tx, N, NRF, Tatt, Sex  The route through which the disease was transmitted.
5   IV         Categorical   +, -                            Intravenous
6   Tx         Categorical   +, -                            Blood transfusion
7   Coc        Categorical   +, -                            Usage of cocaine
8   Tatt       Categorical   +, -                            Presence of a tattoo on the body of the patient
9   HBV        Categorical   +, -                            Presence of Hepatitis B virus in the patient.
10  HIV        Categorical   +, -                            Presence of HIV infection
11  EtOH       Categorical   +, -                            Alcohol usage by the patient.
12  Obes       Categorical   +, -                            Whether the patient is obese or not.
13  Rx         Categorical   +, -                            Treatment: whether the patient has been treated.
14  Tox        Categorical   +, -                            Presence of any toxic elements.
15  CLD        Categorical   +, -                            Whether the patient has Chronic Liver Disease.
16  GT         Categorical   -, I, II, III                   Genotype of the patient.
17  LFT        Categorical   +, -                            Whether or not the Liver Function Test was done.
18  Duration   Numeric                                       Number of years for which the patient had the disease. It is basically the difference between DOB and DOT.
19  Age        Numeric                                       Current age of the patient.
20  Bx         Categorical   I, II, III, IV                  Biopsy result, which specifies the stage of the HepC.
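
Item 4 of the assignment asks how a 1-nearest-neighbor classifier can make errors on its own training data. One possible cause, which item 5 also hints at, is inconsistent instances: records with identical attribute values but different class values. The Python sketch below illustrates the effect on a small made-up data set. The records, the Hamming distance, and the first-match tie-breaking are all assumptions chosen for illustration; they are not taken from hepatitis.arff, and Weka's IBk may break ties differently.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of attribute positions where two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def one_nn_training_accuracy(instances, labels):
    """Classify each training instance by its nearest neighbour in the same set.

    Ties (including distance 0) go to the earliest instance in the list.
    """
    correct = 0
    for x, y in zip(instances, labels):
        nearest = min(range(len(instances)),
                      key=lambda i: hamming(x, instances[i]))
        correct += labels[nearest] == y
    return correct / len(instances)

def inconsistent_groups(instances, labels):
    """Return attribute vectors that occur with more than one class value."""
    seen = defaultdict(set)
    for x, y in zip(instances, labels):
        seen[tuple(x)].add(y)
    return {x: classes for x, classes in seen.items() if len(classes) > 1}

# Hypothetical toy records (Sex, IV, LFT) -> Bx stage; not real patient data.
X = [("M", "+", "+"), ("M", "+", "+"), ("F", "-", "-"), ("F", "+", "-")]
y = ["II", "III", "I", "II"]

print(one_nn_training_accuracy(X, y))  # 0.75: the second record is misclassified
print(inconsistent_groups(X, y))       # the duplicated record with two stages
```

The first two records are identical but carry different stages, so whichever one is found first decides the prediction for both, and one of them is necessarily wrong. Removing or reconciling such groups is one of the instance changes suggested in item 5.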

Mohd · Accepted Answer

=== Run information ===
Scheme:       weka.classifiers.rules.ZeroR 
Relation:     MedicalData-weka.filters.unsupervised.attribute.Remove-R2
Instances:    310
Attributes:   18
              Sex
              DOT
              Route
              IV
              Tx
              Coc
              Tatt
              HBV
              HIV
              EtOH
              Obes
              Rx
              Tox
              CLD
              LFT
              Duration
              Age
              Bx
Test mode:    10-fold cross-validation
=== Classifier model (full training set) ===
ZeroR predicts class value: II
Time taken to build model: 0 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances         126               40.6452 %
Incorrectly Classified Instances       184               59.3548 %
Kappa statistic                          0     
Mean absolute error                      0.261 
Root mean squared error                  0.3605
Relative absolute error                100      %
Root relative squared error            100      %
Total Number of Instances              310     
=== Detailed Accuracy By Class ===
                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.000    0.000    ?          0.000    ?          ?        0.196     0.013     0
                 0.000    0.000    ?          0.000    ?          ?        0.489     0.376     I
                 1.000    1.000    0.406      1.000    0.578      ?        0.484     0.399     II
                 0.000    0.000    ?          0.000    ?          ?        0.484     0.195     III
                 ?        0.000    ?          ?        ?          ?        ?         ?         IV
Weighted Avg.    0.406    0.406    ?
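
The ZeroR numbers above can be checked by hand. ZeroR always predicts the majority class (II here), so its accuracy is simply the share of stage II instances, and its kappa is 0 because a constant prediction agrees with the true classes no more often than chance. The Python sketch below recomputes both values. It is an illustration, not Weka's implementation, and it assumes the majority class is II in every cross-validation fold, which is consistent with the 126 correct predictions reported.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def kappa(actual, predicted):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(actual)
    p_o = accuracy(actual, predicted)
    actual_counts, predicted_counts = Counter(actual), Counter(predicted)
    # Chance agreement: sum over classes of P(actual = c) * P(predicted = c).
    p_e = sum(actual_counts[c] * predicted_counts[c]
              for c in actual_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# 126 of the 310 instances are stage II (from the output above). How the other
# 184 split across the remaining stages is not shown, so they are lumped under
# a placeholder label; this does not change either number computed here.
actual = ["II"] * 126 + ["other"] * 184
predicted = ["II"] * len(actual)  # "ZeroR predicts class value: II"

print(f"{accuracy(actual, predicted):.4%}")  # 40.6452%
print(kappa(actual, predicted))              # 0.0
```

Any classifier tried later should beat this 40.6452 % baseline in ten-fold cross validation to be worth reporting, and a kappa noticeably above 0 is a sign that it has learned more than the class distribution.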
