Wisconsin Diagnostic Breast Cancer Data (WDBC)



Biostatistics – Midterm Exam (Part 2) (65 points)

1) Wisconsin Diagnostic Breast Cancer Data (WDBC)

Researchers who created these data:

Dr. William H. Wolberg, General Surgery Dept., University of Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu

W. Nick Street, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619

Olvi L. Mangasarian, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu

Medical literature citations:

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters, 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17, No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792-796, 1995.

Data Description:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
They describe characteristics of the cell nuclei present in the image. A sample image is shown above.

Response: Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

Radius = radius (mean of distances from center to points on the perimeter)
Texture = texture (standard deviation of gray-scale values)
Perimeter = perimeter of the cell nucleus
Area = area of the cell nucleus
Smoothness = smoothness (local variation in radius lengths)
Compactness = compactness (perimeter^2 / area - 1.0)
Concavity = concavity (severity of concave portions of the contour)
Concavepts = concave points (number of concave portions of the contour)
Symmetry = symmetry (measure of symmetry of the cell nucleus)
FracDim = fractal dimension ("coastline approximation" - 1)

Several of the papers listed above contain detailed descriptions of how these features are computed.

Questions and Tasks:

a) Develop a logistic regression model for diagnosis in R. Use the transformation guidelines we went through in class. Include in your output the plots that you used to choose appropriate terms for your model. Examine case diagnostics, the plots used to assess model adequacy (mmps), and an ROC curve for your “final” model. Summarize your model development process and your findings. You do not need to address OR interpretations for this model! (20 pts.)

To obtain an ROC curve in R, you will need to install the package epicalc from CRAN and use the function lroc to plot the ROC curve and find the area under the curve (AUC). Use the R commands below:

> roc.mymodel = lroc(mymodel)    draws curve and saves results
> roc.mymodel$auc                gives the area under the curve (AUC)

b) What does the ROC curve tell you about the predictive abilities of your model? (3 pts.)

c) Fit your final model from R in JMP using the data file: BreastDiag.JMP. Save the fitted probabilities into your spreadsheet and cross-classify Most Likely Diagnosis (X) with actual Diagnosis (Y).
What is the apparent error rate (AER) when your model is used to classify the tumor as malignant or benign? (5 pts.)

Note: In logistic regression, if classification is one goal of the analysis, we can use the following rule to perform the actual classification:

If the fitted probability that Y = 1 exceeds a chosen cutoff probability, then classify as Y = 1.

The cutoff probability is usually taken to be .50 for obvious reasons, but other values can be used.

The ROC curve is constructed by looking at a sequence of values for the cutoff probability described above. For each cutoff value, we can easily compute the sensitivity and specificity, so the ROC curve and the area beneath it can be found.

2) Right Heart Catheterization Study

The effectiveness of right heart catheterization in the initial care of critically ill patients. SUPPORT Investigators. Connors AF, et al.
Department of Medicine, Case Western Reserve University at Metro Health Medical Center, Cleveland, Ohio, USA.

OBJECTIVE: To examine the association between the use of right heart catheterization (RHC) during the first 24 hours of care in the intensive care unit (ICU) and subsequent survival, length of stay, intensity of care, and cost of care.

DESIGN: Prospective cohort study.

SETTING: Five US teaching hospitals between 1989 and 1994.

SUBJECTS: A total of 5735 critically ill adult patients receiving care in an ICU for 1 of 9 prespecified disease categories.
MAIN OUTCOME MEASURES: Survival time, cost of care, intensity of care, and length of stay in the ICU and hospital, determined from the clinical record and from the National Death Index.

Variable name    Variable Definition
Age              Age
Sex              Sex
Race             Race: white, black, other
Edu              Years of education
Income           Income (under 11k, 11-25k, 26-50k, > 50k)
Ninsclas         Medical insurance status: No insurance, Medicare, Medicaid, Medicaid & Medicare, Private & Medicare, Private
Cat1             Primary disease category: MOSF w/sepsis, MOSF w/malignancy, lung cancer, COPD, coma, colon cancer, cirrhosis, CHF, ARF

Categories of admission diagnosis:
Resp             Respiratory Diagnosis (yes or no)
Card             Cardiovascular Diagnosis (yes or no)
Neuro            Neurological Diagnosis (yes or no)
Gastr            Gastrointestinal Diagnosis (yes or no)
Renal            Renal Diagnosis (yes or no)
Meta             Metabolic Diagnosis (yes or no)
Hema             Hematologic Diagnosis (yes or no)
Seps             Sepsis Diagnosis (yes or no)
Trauma           Trauma Diagnosis (yes or no)
Ortho            Orthopedic Diagnosis (yes or no)

Das2d3pc         DASI (Duke Activity Status Index)
Dnr1             DNR status on day 1 (yes or no)
Ca               Cancer (3 levels = yes, no, or metastatic)
Surv2md1         SUPPORT model estimate of the probability of surviving 2 months
Aps1             APACHE score
Wtkilo1          Weight
Temp1            Temperature
Meanbp1          Mean blood pressure
Resp1            Respiratory rate
Hrt1             Heart rate
Pafi1            PaO2/FIO2 ratio
Paco21           PaCO2
Ph1              pH
Wblc1            WBC
Hema1            Hematocrit
Sod1             Sodium
Pot1             Potassium
Crea1            Creatinine
Bili1            Bilirubin
Alb1             Albumin

Categories of comorbid illness:
These are all coded as: 0 = No, 1 = Yes
Cardiohx         Acute MI, Peripheral Vascular Disease, Severe Cardiovascular Symptoms (NYHA-Class III), Very Severe Cardiovascular Symptoms (NYHA-Class IV)
Chfhx            Congestive Heart Failure
Dementhx         Dementia, Stroke or Cerebral Infarct, Parkinson’s Disease
Psychhx          Psychiatric History, Active Psychosis or Severe Depression
Chrpulhx         Chronic Pulmonary Disease, Severe Pulmonary Disease, Very Severe Pulmonary Disease
Renalhx          Chronic Renal Disease, Chronic Hemodialysis or Peritoneal Dialysis
Liverhx          Cirrhosis, Hepatic Failure
Gibledhx         Upper GI Bleeding
Malighx          Solid Tumor, Metastatic Disease, Chronic Leukemia/Myeloma, Acute Leukemia, Lymphoma
Immunhx          Immunosuppression, Organ Transplant, HIV Positivity, Diabetes Mellitus Without End Organ Damage, Diabetes Mellitus With End Organ Damage, Connective Tissue Disease
Transhx          Transfer (> 24 Hours) from Another Hospital
Amihx            Definite Myocardial Infarction

More Important Variables
Swang1           Right Heart Catheterization (RHC) (yes or no)
Death            Death at any time up to 180 Days (yes or no)
Dth30            Death at any time up to 30 Days (yes or no)   RESPONSE

The researchers found that heart attack patients who had a right heart catheter (Swan-Ganz line) put in had a 24% higher risk of 30-day mortality than patients who did not have the procedure performed. Given that this procedure is generally used when doctors are in some sense perplexed about what course of treatment to follow, one could argue that this result is expected, because patients for whom the course of treatment is not obvious may be more severely ill.
In a study such as this, we can try to eliminate this potential confounding by using information about the physiological state of the patient at the time of treatment. All of the additional variables can be used to accomplish this goal. Build a logistic regression model for 30-day mortality using these data. Keep in mind that quantification of the risk associated with right heart catheterization (swang1) is of primary interest, so no matter what, don’t take it out of the model. Carefully and thoroughly summarize the following:

a) The model development process. You are going to have to take a more hands-on approach here. Examine diagnostics, model adequacy plots, and the ROC curve for your “final” model. (25 pts.)

b) Create a table giving the OR and associated CI for each term in your final model. For continuous predictors, pick an increment (e.g. c = SD(x)) and give the OR and CI for that increment value. (20 pts.)

c) The risk associated with the Swan-Ganz line procedure. Does the 24% increase in risk sound reasonable given your model? Use your OR and CI from part (b) to answer this question. (3 pts.)

All the necessary data files are on the course website in the Data Sets list.
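
The increment-based odds ratio asked for in part (b) can be sketched as follows. This is a minimal illustration on simulated stand-in data, not the RHC data file. The variable names treat and score, the helper or_table, and the simulated coefficients are all hypothetical. For an increment c, the sketch uses OR = exp(c * b), with a Wald interval exp(c * (b ± z * SE)). The Wald interval is one common choice and is not necessarily the method expected in the course.

```r
# Simulated stand-in data (assumption): NOT the RHC data set.
set.seed(1)
n     <- 500
treat <- rbinom(n, 1, 0.4)              # binary exposure, plays the role of swang1
score <- rnorm(n, mean = 50, sd = 10)   # hypothetical continuous severity score
y     <- rbinom(n, 1, plogis(-3 + 0.3 * treat + 0.05 * score))
dat   <- data.frame(y, treat, score)

# Logistic regression with the exposure kept in the model
fit <- glm(y ~ treat + score, family = binomial, data = dat)

# Hypothetical helper: OR and Wald 95% CI for a change of incr units in each
# named predictor. incr is a named vector of positive increments.
or_table <- function(fit, incr, level = 0.95) {
  b  <- coef(fit)[names(incr)]
  se <- sqrt(diag(vcov(fit)))[names(incr)]
  z  <- qnorm(1 - (1 - level) / 2)
  data.frame(
    increment = incr,
    OR        = exp(incr * b),
    lower     = exp(incr * (b - z * se)),
    upper     = exp(incr * (b + z * se))
  )
}

# One unit for the binary exposure, one standard deviation for the continuous term
incr <- c(treat = 1, score = sd(dat$score))
ors  <- or_table(fit, incr)
print(ors)
```

Scaling the coefficient and its standard error by the same increment before exponentiating keeps the interval on the same scale as the reported OR. An interval for the exposure term that excludes 1 would indicate an association that remains after adjustment.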