


Table of Contents to Introduction to ROC5 (08/26/08)

1. Introduction to ROC5 (08/26/08)
   1.1 Files included with download (at mirecc.Stanford.edu)
   1.2 What is this Program Good for?
      1.2.1 Producing a “Decision Tree”
      1.2.2 Weighing the Importance of False Positives versus False Negatives
   1.3 Who Owns this Program?
   1.4 Where Does the Theory Behind the Program Come From?
2. Overview of Programming Strategy
3. Data Preparation
   3.1 The Gold Standard versus Predictors
   3.2 Details of Data Preparation
      3.2.1 Missing data
      3.2.2 ID Numbers
      3.2.3 Note on Data Recoding
4. Running the ROC Program
   4.1 How do you Run the Program?
      4.1.1 Batch Files Basics
      4.1.2 Batch Files Quirks
   4.2 What does the ROC Output Mean? (and how to read it)
   4.3 How to Change Emphasis on Sensitivity versus Specificity
   4.4 How to Get Results for Plots, i.e. ROC Curves
      4.4.1 How to Actually Get a ROC Plot out of the Data
      4.4.2 How do I get SAS or Better yet What is SAS?
5. Run it Again, Sam? More on Decision Trees
6. FAQ (Frequently Asked Questions)
Appendix 1: Note on Memory Allocation and Run Time
Appendix 2: Note on Data Recoding
   IF
   Even more important is the operator AND in Excel
   Even more important is the operator = in Excel
   Rotate.exe
Appendix 3: Formulae
Appendix 4: Example SAS Program for Graphics

1. Introduction to ROC5 (08/26/08)

This READ_ME covers all aspects of our program, which performs a number of signal detection functions.

1.1 Files included with download (at mirecc.Stanford.edu)

Download the file ROC54.ZIP. It can be unzipped by programs such as WinZip. ROC54.ZIP contains the files:

ROC4.xx.exe The old Version 4 program (100 variables x 15,000 cases max); the xx in the file name carries the version number.

The current Version 4 release as of 12/29/2003 is ROC4.18.exe.

Demo.txt Demo dataset

runDemoData.bat The batch file that does all the housekeeping and runs the program on the right dataset with the right settings.

outrunDemoData.doc Output from the run

ROC4yymmddhhhh.doc A Word file of the actual C code for Version 4. It is dated by yy (year), mm (month), dd (day) and sometimes hhhh (hour). Change the .doc extension to .c for a C compiler.

READ_ME.yymmdd.doc An explanation of all this (what you are reading)

ROC5.xx.exe The new version for single processor

ROC5.xx.dual.exe The new version for dual processors

ROC5.xx.quad.exe The new version for quad processors

Source code for ROC5

1.2 What is this Program Good for?

This program is designed to help the average clinician/researcher with a PC evaluate clinical databases and discover the characteristics of patients, including genetic ones, that best predict a binary outcome. That outcome may be any binary outcome, such as:

❑ Whether or not the patient has a certain disorder (medical test evaluation)

❑ Whether or not the patient is likely to develop a certain disorder (risk factor evaluation)

❑ Whether or not the patient is likely to respond to a certain treatment (evaluation of treatment moderators)

When the predictors considered are themselves all binary (e.g., male/female; inpatient/outpatient; symptom present/absent), the program identifies the optimal predictor. When one or more of the predictors are ordinal (e.g., age, severity of symptoms), it identifies the optimal cutpoint for each of the ordinal predictors, as well as the overall optimal predictor.

1.2.1 Producing a “Decision Tree”

The program can be applied to different subsets of the same dataset, thus producing a "decision tree", which combines various predictors with "and/or" rules to best predict the binary outcome. The “bottom line” of the output is one of these trees. This is a schematized example from a hypothetical study predicting conversion to Alzheimer’s Disease using age and the Mini-Mental State Exam (MMSE) as potential predictors:
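Schematically (a rough text sketch; the program's actual printout may look somewhat different):

                        All subjects
                             |
              +--------------+---------------+
              |                              |
          Age < 75                       Age >= 75
      (10% convert)                          |
                              +--------------+---------------+
                              |                              |
                         MMSE < 27                      MMSE >= 27
                      (20% convert)                   (40% convert)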

In this example, subjects who are less than 75 years old have a 10% conversion rate. Those who are at least 75 AND have an MMSE score less than 27 have a 20% conversion rate. Finally, subjects who are at least 75 AND have an MMSE score of at least 27 have a 40% conversion rate. These cutpoints are significant at the p=.01 level.

1.2.2 Weighing the Importance of False Positives versus False Negatives

This program (a type of recursive partitioning) differs from other such programs in that the criterion for splitting is based on a CLINICAL judgment of the relative clinical or policy importance of false positive versus false negative identifications via weights called r. The program automatically considers three possibilities:

❑ Optimal Sensitivity: Here r=1, and the total emphasis is placed on avoiding false negatives. This would be appropriate, for example, for self-examination for breast or testicular lumps.

❑ Optimal Efficiency: Here r=1/2, and equal emphasis is placed on both types of errors. This would be appropriate, for example, for mammography.

❑ Optimal Specificity: Here r=0, and total emphasis is placed on avoiding false positives. This would be appropriate, for example, for a frozen tissue biopsy done during breast surgery to decide whether or not a mastectomy should be done.

When the user does not have reason to favor either false positives or false negatives, use of r=1/2 is advised.

It is also possible that a user might want to choose a weight of, say, 0.70 to put more emphasis on avoiding false negatives, but not total emphasis. The program has an option for the user to input the value of r (between 0 and 1) to obtain the optimal predictor for that cutpoint. How you do this is described below in Section 4.3: How to Change Emphasis on Sensitivity versus Specificity.
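As a concrete, purely hypothetical illustration of the two error types being weighed: suppose 100 of 1,000 screened subjects truly have the disorder and a test flags 90 of them, plus 45 of the 900 who do not. Then sensitivity = 90/100 = 0.90 (10 false negatives) and specificity = 855/900 = 0.95 (45 false positives). Choosing r near 1 tells the program to prefer cutpoints that shrink the 10 false negatives; choosing r near 0 prefers cutpoints that shrink the 45 false positives.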

1.3 Who Owns this Program?

It is in the public domain. The work that went into this was mostly paid for by the Department of Veterans Affairs and the National Institute on Aging of the United States of America.

1.4 Where Does the Theory Behind the Program Come From?

From HC Kraemer, Evaluating Medical Tests, Sage Publications, Newbury Park, CA, 1992. The formulae for the calculations are taken from page “X” of the book and are presented in Appendix 3.

2. Overview of Programming Strategy

The ROC4 program is designed to perform basic signal detection computations in a Windows environment. The program is written in C++ (Microsoft Visual C++ version 6.0); the original “Mark 4” version was written circa October 2001. It can likely be recompiled on other platforms that use C++ or C, such as Sun, SGI or other UNIX workstations, and maybe the Macintosh. For details on the capacity of the program see Appendix 1, but basically it has been tested on datasets of 50 variables and 8,000 cases on a Dell Inspiron 5000 laptop. ROC5 is the “industrial strength” version designed to perform similar work on very large datasets, such as those from whole genome scans containing 600,000 variables and 200 subjects. It has been tested on a Dell Xeon Precision Workstation with 32GB of RAM and a 64-bit processor for the whole genome analyses. This size of workstation appears to be the physical and programming limitation, but luckily it also seems to be within the size limitations of typical large-scale genetic analyses. A “big” version, designed for huge datasets, which runs slower because it uses virtual memory, is available on request.

To get the full benefit from this program (ROC4 and ROC5) you MUST be able to use Excel a bit and, if you want sophisticated graphics, SAS. It would be a waste of time to recreate the editing and statistical capabilities of Excel and SAS, especially the latter for plotting ROC curves and the former for creating a clean dataset. Dealing with genetic datasets is a whole ‘nother level of complexity.

So, the basic idea is that however you prepare your data, it should one way or another get into Excel and be output as a tab-delimited text dataset. Then, after going through ROC4, you get another tab-delimited text dataset, this time readable by SAS (SAS Institute Inc., Cary NC) for sophisticated plots and graphics; or, more likely, you just look at the results that come out of ROC.

So, the basic idea is:

|Dataprep (Excel) -> signal detection calculations (ROC) -> graphics (SAS) |

The ROC output is an ASCII file that can be read into MS Word or Excel.

3. Data Preparation

3.1 The Gold Standard versus Predictors

The ROC program reads in data via a text tab-delimited format. The last column is a set of 0’s and 1’s representing the “gold standard”. This is the criterion for “success”. The other columns are the “predictors”. This can all be arranged in an Excel file and then output to a tab-delimited .txt file.
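For illustration only (made-up numbers, not the enclosed Demo.txt), a small dataset with three predictors (say age, MMSE and gender) might look like this once exported from Excel, with tabs between columns and the 0/1 gold standard in the last column:

72	29	1	0

81	26	0	1

77	28	1	1

69	30	0	0

Check the enclosed Demo.txt to see the exact layout the program expects, including whether a header row of variable names is present.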

3.2 Details of Data Preparation

3.2.1 Missing data

Represent missing data only with a –9999.99. If you have blanks, edit it in Word first and do a global replace of ^t^t (two tabs) with ^t-9999.99^t. Version 5 also takes the integer -9999 as missing.
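For example, the replacement turns a row such as

34.2^t^t1

into

34.2^t-9999.99^t1

Two caveats about Word’s Replace All: two adjacent empty cells (three tabs in a row) are not both caught in a single pass, so repeat the replacement until Word reports zero changes; and an empty first or last column is not matched by ^t^t at all, so check the ends of your rows by hand.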

3.2.2 ID Numbers

Remove any columns of data that will not be analyzed (e.g. ID numbers).

3.2.3 Note on Data Recoding

This should be done in Excel before submitting the data to ROC4. See Appendix 2 for information on recoding. A demonstration dataset (Demo.txt) is also enclosed as part of the Zip package. A separate program is required if the dataset needs to be rotated: we want the columns to be the variables, so if subjects are the columns, the dataset must be rotated (transposed) before it can be analyzed.

4. Running the ROC Program

4.1 How do you Run the Program?

4.1.1 Batch Files Basics

It is easiest to run the program out of a batch file (.bat), i.e. you tap on the icon. This is like a UNIX script and basically is a place that keeps all your files and commands straight. runDemoData.bat is a simple one-liner edited as a text file in Word:

roc4 Demo.txt 50> runDemoData.doc

This tells ROC4 to use Demo.txt as the data file and output it (the “>”) to runDemoData.doc as a Word (.doc) file. The “50” is explained in Section 4.3.

Note on Versions. As newer versions come out, you may have to modify the script to carry the right version number.

Version 5 Notes: Several additional command line arguments are included:

• (SENS/SPEC) 50 = equal trade-off of sensitivity and specificity (see Section 4.3); default is 50

• (PLOT) creates a dataset for plotting in SAS; default is OFF (no plot data); turn on with PLOT

• (NO_PRINT) suppresses printing of the intermediate values (there are lots of them); printing is ON by default; NO_PRINT is suggested if you have more than 10-15 variables

• (DE_BUG) prints various debugging information (tons of it); default is OFF; turn on with DE_BUG

• (P CRITERION VALUE) 05, 01 or 001; default is 01 (no decimals on the command line, please)

EXAMPLE: roc5 pgen.txt 50 NO_PRINT > run_pgen.out.doc
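Another hypothetical example combining several of the arguments (file names are invented, and you should double-check against your version that the options can be combined in this order):

roc5 mydata.txt 70 PLOT NO_PRINT 05 > run_mydata.out.doc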

4.1.2 Batch Files Quirks

Batch files seem a bit quirky in Windows. We have found that if you write one from scratch you may have to run it first in the DOS (Windows 98) or Command Prompt (Windows 2000, XP or Vista) window. After that it seems to run if you just tap on it. In Windows 2000 the prompt is found under Start: Programs: Accessories. If you have one that works, you can just edit it in Word. It is saved as a .bat file in text format.
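For example, to run the demo batch file from the Command Prompt (folder name hypothetical):

cd C:\ROC

runDemoData.bat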

Note Well: The batch file will not run if either the data.txt or the output.doc files are open…. So do not waste a lot of time debugging the .bat file if these are open.

How do I know it is running? Look in the folder to which the output file is directed. Under the View menu hit REFRESH, note the file size, hit REFRESH again. If it is larger, take heart that it is working and go get some coffee (or a good night’s sleep if you have a slow processor). To get a rough idea of how long it may take to run your ROC program please see Appendix 1.

4.2 What does the ROC Output Mean? (and how to read it)

The output should be readable in Word. It is designed to be read in and printed using the following format:

Font: Courier New size 7 or 6 with good eyes

Move margin all the way to the right.

Under File/Page Set-up and Paper Size Tab, set to Landscape

There are six segments to the output:

1) The output starts with a general description of the data you put in, hopefully with the same n for each variable that you started with; this is here as a data check.

2) You then get a listing of the various signal detection results for each variable and for each value of each variable in your dataset.

3) You then get a summary of the results for the five highest kappa (50) values for each variable. In general, the best cutpoint to separate successes from failures will be the value of the variable with the highest kappa over all the variables.

4) The program will then do a series of “iterations”, basically taking the best cutpoint identified above and rerunning the analysis on the data above and below that cutpoint. This step is repeated until all cutpoints (up to three-way interactions) are identified. If you would like to identify interactions beyond three-way, see Section 5.

5) A summary of the results

6) The Decision tree (a simplified version was presented in Section 1.2.1)

4.3 How to Change Emphasis on Sensitivity versus Specificity

The program has an option for the user to input the value of r (between 0 and 1) to obtain the optimal predictor for that cutpoint. Why you might want to do this is described in Section 1.2.2 Weighing the Importance of False Positives versus False Negatives. Note how the script is changed to accomplish the change in emphasis:

roc4 Demo.txt 70> runDemoData.doc

This version of the script has a little 70 added; this will calculate a 70/30 split for kappa, emphasizing sensitivity (70%) versus specificity (30%). The default is 50/50. You can use any proportion as long as it is a multiple of 10, i.e. 0, 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 are acceptable. Note that optimal sensitivity and specificity (100 and 0, respectively) are automatically calculated by the program, regardless of what is chosen here.

4.4 How to Get Results for Plots, i.e. ROC Curves

4.4.1 How to Actually Get a ROC Plot out of the Data

Several programs such as Excel or SAS can simply read in the ROC output.doc file. The output from ROC4 has some means at the top of the file and headers at the top of each variable, which need to be stripped off (easily done in Excel, saving the result as a tab-delimited text file) before performing graphics. They are there as a data check that you have the right N and means. Note that in Windows 2000, if you right-click on the output.doc file, you have the option of opening the file directly in Excel.

Although there are many programs that do graphics, programs such as Excel may only allow relatively simple plots. The SAS program supplied in Appendix 4 will read in the data and create classic ROC plots, after a couple of lines in the supplied code are modified. See Appendix 4 for the SAS program and further details.

4.4.2 How do I get SAS or Better yet What is SAS?

SAS (Statistical Analysis System) is the most widely used professional statistical language in the world. It used to run only on mainframes and UNIX workstations but now runs on PCs. Most universities have contracts to use it. Pharmaceutical company clinical trial data are submitted to the FDA in SAS datasets.

If you do not have SAS and still want a graph, maybe some other program will do it. The output can be read by a number of programs. I have failed to get Excel to do a nice graph of the results. DeltaGraph, StatView or other programs might work. Whatever you try, the basic output from the ROC4 program is a tab-delimited text file, which any worthwhile statistical/graphics program should be able to import.

Finally, to facilitate direct input into SAS, unnecessary lines of the output file are preceded by “**”. This allows these lines to be ignored by SAS.

5. Run it Again, Sam? More on Decision Trees

Once the first optimal cut is made, yielding the first positive and negative test groups, the program runs again on those with a positive first test, and those with a negative first test, to find the optimal cuts within those subpopulations. Thus if the optimal predictor at the first stage is gender, one might want to consider males and females separately in the next two runs of the program. Then, of course, when the second cuts are made (yielding 4 groups), one could run the program again on these four subpopulations. Doing this produces a "decision tree", a series of "and/or" rules that identify subgroups at different risks of the binary outcome.

Since the program will not run when the sample size is inadequate (10 or fewer in each of the marginal positions), this can only be done when the sample sizes are large enough. Moreover, a "stopping rule" should be set in advance. For example, one might stop when the optimal predictor produces a result that is not statistically significant at the 1% level (which is what this program does). Note: in the printout *, **, ***, mean p < .05, .01 and .001. This "significance" is the result of multiple testing and should not be regarded as a true significance level, but is useful as a stopping rule. This in fact is how the program works. It will result in a maximum of three levels of cuts and a final maximum of eight (2**3) subgroups. The reason it stops at this point is two-fold: 1) programmer exhaustion after having created 2,422 lines of C++ code and 2) we are in fact generating three-way interactions of predictors.

Note: In Version 4.21, by popular demand, I added the option of making a larger decision tree by over-riding the 0.01 criterion for continuing the tree. If you end the command line arguments with 0.05, it will use 0.05 as the criterion. Note well that this gives you a lot more output. Note also that this only works for 0.05, not 0.04 or other criteria; getting all of the printing out correctly for multiple criteria would be a monumental programming task, so here at least you have the choice of 0.05 or 0.01. Please note that this had not been fully debugged as of 4/1/06; thus we consider 4.21 “Betaware” and 4.20 the definitive version. The modification to get to 4.21 required about 200 additional lines of code, and you never know, so watch the output carefully. It seems to work in our hands. This modification is carried forward into Version 5.

If you would like a four-way (or higher) interaction, divide the initial dataset on the cutpoints identified by the three-way interaction and rerun the program. We have successfully identified up to a ten-way interaction using this method and program.
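For example (file names hypothetical), save the cases falling in each terminal subgroup of the three-way tree as their own tab-delimited files in Excel and run the program on each one:

roc5 subgroup1.txt 50 NO_PRINT > run_subgroup1.out.doc

roc5 subgroup2.txt 50 NO_PRINT > run_subgroup2.out.doc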

More details on all aspects of ROC, including decision trees, can be found in the RealPresenter show on the mirecc.stanford.edu web site.



If you do not have this capability, we can send the actual PowerPoint slides but this is a very large file.

6. FAQ (Frequently Asked Questions)

Q: How is this program different from CART, SPSS Answer Tree, etc.?

A: This program is more concise, as it uses kappa to minimize false positives and false negatives. We believe other programs produce too many ‘branches’ (and thus sometimes require ‘pruning’ afterwards), and/or use the odds ratio, which we do not favor.

Q: In the SAS graphing program in Appendix 4, is there a specific reason why only the top 25 points (sorted by descending k0_50) are plotted?

A: We arbitrarily decided to plot the top 25 points because we felt plotting more would clutter the graph. Of course, it is easy to change the program to plot more or fewer points if you desire.

Q: Does the ROC program use the empirical ROC curve rather than a fitted curve?

A: Yes, that is correct. The fitted curve assumptions are usually not true.

Q: I am curious how the weighting is done when you select differing values for the sensitivity/specificity emphasis. I have been taught that the bias parameter often represents the ratio of the ordinate of the S distribution over the ordinate of the SN distribution (I may have these backwards). Does ‘emphasis’ in the ROC program map onto that concept in some way?

A: No. The ROC program makes no distributional assumptions. It is non-parametric.

Appendix 1: Note on Memory Allocation and Run Time

Historical Information (circa 2001): The Version 4 program is designed to make maximum use of the memory capacity of your machine. This is industrial-design programming, rather than convenience-design programming. As written it uses about 12MB of memory to run a matrix of 25 variables by 35,000 cases (the largest test so far). In this situation it may take the program over an hour to run on a 650 MHz Intel processor, but at least it runs in the background, though it takes up 90% of my processor. The maximum initialization is for 1,000 variables and 50,000 cases. It does not initialize to a larger number because a larger initial data matrix may crash on a machine with small memory capacity, or at least make the program use virtual memory (disk instead of RAM). If you have a large dataset, a big machine and want to edit the program, just change the values of NCOL and NROW to the number of variables and cases you need and recompile, or ask me and I can do it. The standard program, however, allocates memory only as needed and will run much faster on small datasets, such as those with 10 variables and 500 cases (a few seconds). Slowness appears to increase exponentially with N because computations involve not only each case, but every other case too, etc.

To see how all this memory allocation is done, look for the function malloc() in the program. Clever programmers with time on their hands could do this more efficiently and free memory once it is no longer used. The way it is written, you use about 6 bytes/case times the number of variables. Stingier allocation of memory would get it down from 5MB. FYI, on my machine Windows 2000 toys seem to use about 40MB, Microsoft Word 20 MB and Outlook 15MB. Bottom line, forget running this on a machine with 64MB main storage (it will have to use slow virtual or paging memory). I may compile a larger capacity version and leave it on the web site – for those who want to deal with US census data and have the machine to complete the task.
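For the curious, the allocation pattern is roughly the following (a minimal sketch only, with invented names, not the actual ROC4/ROC5 source):

#include <stdlib.h>

/* Allocate one column of floats per variable, only as large as the
   dataset actually read in, rather than a full NCOL x NROW matrix. */
float **alloc_data(int ncases, int nvars)
{
    float **data = (float **) malloc(nvars * sizeof(float *));
    int j;
    for (j = 0; j < nvars; j++)
        data[j] = (float *) malloc(ncases * sizeof(float));
    return data;   /* error checking omitted for brevity */
}

A plain float column like this costs 4 bytes per case per variable; the roughly 6 bytes per case quoted above presumably reflects additional bookkeeping such as sort indices.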

How long might the program take to run?

We have successfully run the ROC program with about 20 variables (half of which were binary) and 35,000 observations. This took 2 hours on a Dell M60 laptop with Windows XP, 512MB of memory and a 1.7 GHz P4m processor, BUT only by using the new NO_PRINT option. Adding this to the command line will suppress the initial intermediate values. The sorting of these intermediate values could take literally DAYS in large datasets. The suggestion is, if it runs slowly on a large dataset, use this option.

Recent Information (circa late 2007-8): Version 5 was designed to read large datasets such as those from whole genome scans, i.e. 600,000 variables. This is not easy, but you can also use Version 5 on smaller datasets and it will run much faster due to many improvements in the programming. A dataset of 600,000 SNPs on 210 patients takes less than a day on a Dell Xeon Precision Workstation.

To run Version 5 with genetic data you need a big computer like that Dell workstation: a 64-bit processor, 6 if not 8 or more GB of RAM, and a fast processor does not hurt. We have precompiled a couple of versions of the program, but the code differs depending on your processor, so the precompiled code may not work on your machine; ask us for help or get somebody to compile it. We have also produced an installer to deal with the multiple processor versions running under Windows. This version was compiled using Microsoft Visual Studio 2005 for the 32-bit versions and the Visual Studio 2008 Beta for the 64-bit versions. It has also been compiled and run on a 32-bit Linux workstation running Fedora 6.

The big computer is necessary because you simply cannot do all the computations this program does unless all the data are in memory (RAM). Furthermore, a 32-bit processor limits a program to 2GB; 64-bit programs can be bigger. An unpleasant nuance is that with the 64-bit processor you may think you have a lot of RAM if you have 6GB, but it turns out that floating point numbers and even integers require 4 bytes to represent one number (32-bit computing takes just 1). So the 6GB 64-bit processor actually gives you fewer floating point numbers than a 32-bit processor with 2GB! Buyer beware. We now have a 64-bit Dell with 32GB of RAM… and we will need it all.

Thanks to the folks at Dell and Microsoft who helped me on this project, especially Scot Brennecke of Microsoft who got the Version 4 program to compile without errors the first time on a 64bit processor.

Appendix 2: Note on Data Recoding

I think it is best to do this in Excel for two reasons:

By the time you indicate how you should recode data in command line arguments, they get too big, confusing and prone to error. The same goes for gold standard computations. It is better to get the dataset fixed up nicely in Excel before running the program. You can just delete any offending columns, and you can compute any gold standard you want using conditional statements, i.e. IFs. Here is an example. Say you want to consider an infant feeding program a success if children are over 1800g birthweight; then (assuming birthweight is in column C) just put:

=IF(C2>1800,1,0)

This returns 1 (success) when the birthweight in C2 is over 1800 and 0 otherwise; copy the formula down the column.

Suppose you want to assign letter grades to numbers referenced by the name AverageScore. See the following table.

|If AverageScore is |Then return |

|Greater than 89 |A |

|From 80 to 89 |B |

|From 70 to 79 |C |

|From 60 to 69 |D |

|Less than 60 |F |

You can use the following nested IF function:

IF(AverageScore>89,"A",IF(AverageScore>79,"B",

IF(AverageScore>69,"C",IF(AverageScore>59,"D","F"))))

In the preceding example, the second IF statement is also the value_if_false argument to the first IF statement. Similarly, the third IF statement is the value_if_false argument to the second IF statement. For example, if the first logical_test (Average>89) is TRUE, "A" is returned. If the first logical_test is FALSE, the second IF statement is evaluated, and so on.
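For instance, if AverageScore is 84, the first logical_test (84>89) is FALSE, the second (84>79) is TRUE, and the formula returns "B".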

Even more important is the operator AND in Excel:

IF(AND(1 ................