


Applications of Machine Learning and High Dimensional Visualization in Cancer Diagnosis and Detection

John F. McCarthy*, Kenneth A. Marx, Patrick Hoffman, Alex Gee, Philip O’Neil, M.L. Ujwal, and John Hotchkiss

AnVil, Inc.

25 Corporate Drive

Burlington, MA 01803

*corresponding author

jmccarthy@;

(781) 828-4230

Abstract

{John M will provide}

Introduction

{John M. will provide}

0. Data Analysis by Machine Learning

Overview of Machine Learning and Visualization

Machine learning is the application of statistical techniques to derive general knowledge from specific data sets by searching through possible hypotheses exemplified in the data. The goal is typically to build predictive or descriptive models that distinguish the useful attributes of a dataset to allow the use of those attributes to draw conclusions from other similar datasets [0.1]. In cancer diagnosis and detection, machine learning helps identify significant factors in high dimensional data sets of genomic, proteomic, or clinical data that can be used to understand the disease state in patients. Machine learning techniques serve as tools for finding the needle in the haystack of possible hypotheses formulated by studying the correlation of protein or genomic expression with the presence or absence of disease.

In the process of analyzing the efficacy and correctness of the generalization of concepts from a data set, high dimensional visualizations give the researcher time-saving tools for analyzing the significance, biases, and strength of a possible hypothesis. In dealing with a potentially discontinuous high dimensional concept space, the researcher’s intuition benefits greatly from a visual validation of statistical correctness of the result. The visualizations can also reveal sensitivity to variance, non-obvious correlations, and unusual higher order effects that are scientifically important, but would require time consuming mathematical analysis to discover without a mechanism to picture the application of the discovered hypothesis to the dataset.

Cancer diagnosis and detection involves a group of techniques in machine learning called classification techniques. Classification can be done by supervised learning, where the classes of the objects are already known and are used to train the system to learn the attributes that most effectively disambiguate and describe the members of each class. For example, given a set of gene expression data for samples with known diseases, a supervised learning algorithm might learn to classify disease states based on patterns of gene expression. In unsupervised learning, there either are no predetermined classes, or the class assignments are ignored, and data objects are grouped together by cluster analysis based on some relationship between the objects. In both supervised and unsupervised classification (the latter also known simply as clustering), an explicit or implicit model is created from the data to help predict future data instances or understand the physical process behind the data. Creating these models can be a compute-intensive task, such as training a neural network, and the models are prone to "overfitting": generalizing so closely from a particular data set that the model becomes less valid when applied to other data sets of the same type. Feature selection and reduction techniques help with both the compute time and the overfitting problems by reducing the data attributes used in creating a data model to those that are most important in characterizing the hypothesis. This process can reduce analysis time and create simpler and (sometimes) more accurate models.
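To make the two modes concrete, the following sketch (ours, not drawn from the examples below; it assumes Python with NumPy and scikit-learn) trains a supervised nearest-neighbor classifier on labeled synthetic expression-like data, then clusters the same data with the labels withheld:

```python
# Sketch: supervised classification vs. unsupervised clustering on
# synthetic "expression" data (60 samples x 200 attributes; illustrative only).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
y = np.repeat([0, 1], 30)          # two known classes
X[y == 1, :10] += 1.5              # only the first 10 attributes are informative

# Supervised: the known labels train the classifier.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("supervised test accuracy:", clf.score(X_te, y_te))

# Unsupervised: labels withheld; groups recovered from structure alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```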

In the three cancer examples presented, both supervised and unsupervised classification and feature reduction are used and will be described. In addition, we will discuss the use of high dimensional visualization in conjunction with these analytical techniques. One particular visualization, RadViz™, incorporates machine learning techniques in an intuitive, interactive display. Two other high dimensional visualizations, Parallel Coordinates and PatchGrid (similar to HeatMaps), are also used to analyze and display results.

Below we summarize the classification, feature reduction, validation, and visualization techniques we use in the examples, with particular emphasis on explaining RadViz and Principal Uncorrelated Record Selection (PURS), two techniques developed by the authors.

Machine Learning Techniques:

Classification techniques range from the statistically simple, such as testing dimensions for the statistical significance of their contribution to the classification, to sophisticated probabilistic modeling. The supervised learning techniques used in the examples include Naïve Bayes, support vector machines, instance-based learning (K-nearest neighbor), logistic regression, and neural networks. Much of the work in the following examples is supervised learning, but it also includes some unsupervised hierarchical clustering using Pearson correlations. There are many texts giving detailed descriptions of the implementations and use of these techniques; the authors particularly like [0.1].

Feature Reduction Techniques:

There are a number of statistical approaches to feature reduction that are quite useful. These include the application of pairwise t-statistics and of F-statistics computed from the class labels to select the most important dimensions.

A more sophisticated approach is one we call Principal Uncorrelated Record Selection, or PURS. PURS involves selecting some initial seed attributes (or dimensions) based, for example, on a high t or F statistic. We then repeatedly delete attributes that correlate highly with the seed attributes; if an attribute does not correlate highly with any member of the seed set, it is added to the seed set. We repeat this process, reducing the correlation threshold, until the seed dimensions are reduced to an optimal or desired number.
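A minimal sketch of PURS as just described (in Python; the two-class t-statistic seeding, the starting threshold, and the step size are our illustrative assumptions):

```python
# Sketch of PURS: seed with the highest-|t| attribute, add attributes that do
# not correlate highly with any current seed, and lower the correlation
# threshold until the seed set shrinks to the desired size.
import numpy as np
from scipy import stats

def purs(X, y, n_features=20, corr_start=0.9, corr_step=0.05):
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1])    # per-attribute t statistic
    order = np.argsort(-np.abs(t))                  # attributes ranked by |t|
    corr = np.abs(np.corrcoef(X, rowvar=False))     # attribute-attribute correlations
    thresh = corr_start
    while True:
        seeds = [order[0]]
        for j in order[1:]:
            # Keep j only if it is not highly correlated with any seed.
            if all(corr[j, s] < thresh for s in seeds):
                seeds.append(j)
        if len(seeds) <= n_features:
            return seeds
        thresh -= corr_step                          # tighten and repeat

rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 500)), np.repeat([0, 1], 30)
print(purs(X, y)[:10])
```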

We also use random feature selection to train and test classifiers. This technique is used to validate the predictive power of a more carefully selected feature set culled from a sparse data set.

Validation Techniques:

Perhaps the most significant challenge in the application of machine learning to biological data is the problem of validation: the task of determining the expected error rate of a classifier when applied to a new dataset. The data used to create a model cannot be used to predict the performance of that model on other datasets. The attributes, or dimensions, selected as important for classification must be tested against a set of data that was not used in any way in the creation of the classifier. The easy solution is to divide the data into a training set and a test set, or even a training set, a tuning set to tune the classifier, and a test set to determine the error rate. The problem, of course, is that biological data is usually expensive to acquire, and sets large enough to allow this subdivision while still retaining the statistical power to generalize knowledge from the training set are hard to find. So, in addition to the training set/test set approach, we use a common machine learning technique called 10-fold cross-validation. This approach divides the data into 10 groups, creates the model using 9 of the groups, and tests it on the remaining group. We then repeat this, using each group as the test group once. The ten error estimates are then averaged to give an overall sense of the predictive power of the classification technique on that data set.

Another technique we use to help predict performance on limited data sets is an extension of the 10-fold validation idea called leave-one-out validation. In this technique, one data point is left out of each iteration of model creation and is used to test the model; this is repeated until every data point in the set has been used once as the test data. This approach is nicely deterministic, as compared with 10-fold cross-validation, which requires the careful random stratification of the ten groups, but it does not give as useful a characterization of the accuracy of the model for some distributions of classes within the datasets.
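Both validation schemes are easy to express in code; a sketch assuming Python with scikit-learn (not the tooling used in the examples below):

```python
# Sketch: 10-fold cross-validation vs. leave-one-out on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold, LeaveOneOut
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=60, n_features=50, n_informative=5,
                           random_state=0)
clf = GaussianNB()

# 10-fold: ten error estimates, averaged into one figure of merit;
# stratification keeps class proportions balanced across the ten groups.
ten_fold = cross_val_score(clf, X, y,
                           cv=StratifiedKFold(10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", ten_fold.mean())

# Leave-one-out: each sample serves as the test set exactly once (deterministic).
loo = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo.mean())
```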

High Dimensional Visualization

Although there are a number of conventional visualizations that can help in understanding the correlation of a small number of dimensions with an attribute, high dimensional visualizations have been difficult to understand and use because of the potential loss of information when projecting high-dimensional data down to two- or three-dimensional representations.

There are numerous visualizations and a good number of valuable taxonomies of visual techniques [0.2]. As a group we make use of many of these techniques in the analysis of biological data, especially: matrices of scatterplots [0.3], heat maps [0.3], parallel coordinates [0.4], RadViz™ [1.13], and principal component analysis [0.5]. Of these, we find RadViz uniquely capable of dealing with ultra-high-dimensional (>10,000 dimensions) datasets, and very useful when used interactively in conjunction with specific machine learning and statistical techniques to explore critical attributes for classification.

RadViz™ is a visualization, classification, and clustering tool that uses a spring analogy for the placement of data points and incorporates machine learning feature reduction techniques as selectable algorithms [0.6, 0.7, 0.8]. The "force" that any feature exerts on a sample point is determined by Hooke's law, F = kd. The spring constant k, ranging from 0.0 to 1.0, is the scaled value of the feature for that sample, and d is the distance between the sample point and the perimeter point on the RadViz™ circle assigned to that feature (see Figure 0.1). The placement of a sample point, as described in Figure 0.1A, is determined by the point at which the total force, summed vectorially over all features, is zero.

The RadViz display combines the n data dimensions into a single point for the purpose of clustering, but it also integrates embedded analytic algorithms in order to intelligently select and radially arrange the dimensional axes. This arrangement is performed through Autolayout, a set of algorithmic features, based upon the dimensions' significance statistics, that optimizes clustering by maximizing the distance separating clusters of points. The default arrangement places all features equally spaced around the perimeter of the circle; the feature reduction and class discrimination algorithms instead arrange the features unevenly in order to increase the separation of different classes of sample points. The feature reduction technique used in all figures in the present work is based on the t statistic with Bonferroni correction for multiple tests. The circle is divided into n equal sectors, or "pie slices," one for each class. Features assigned to each class are spaced evenly within the sector for that class, counterclockwise in order of significance (as determined by the t statistic comparing samples in the class with all other samples). As an example (see Figure 1.2), in a 3-class problem, features are assigned to class 1 based on the t statistic comparing class 1 samples with the combined class 2 and class 3 samples; class 2 features are assigned based on the t statistic comparing class 2 values with the combined class 1 and class 3 values; and class 3 features are assigned based on the t statistic comparing class 3 values with the combined class 1 and class 2 values.

Occasionally, when large portions of the perimeter of the circle have no features assigned to them, the data points all cluster on one side of the circle, pulled by the unbalanced force of the features present in the other sectors. In this case, a variation of the spring force calculation is used, in which the features present are effectively divided into qualitatively different forces comprising high and low k value classes. This is done by allowing k to range from -1.0 to 1.0. The net effect is to make some features "pull" (high, or +k, values) and others "push" (low, or -k, values) the points, spreading them across the display space while maintaining the relative point separations.

It should be noted that one can simply perform feature reduction by choosing the top features by t-statistic significance and then apply those features to a standard classification algorithm; the t statistic is a standard method for feature reduction in machine learning approaches, independent of RadViz. RadViz embeds this machine learning feature, and it is responsible for the selections carried out here. The advantage of RadViz is that one immediately sees a "visual" clustering of the results of the t-statistic selection.
Generally, the amount of visual class separation correlates with the accuracy of any classifier built from the reduced features. An additional advantage of this visualization is that subclusters, outliers, and misclassified points can quickly be seen in the graphical layout. One of the standard techniques for visualizing clusters or class labels is to perform a principal component analysis (PCA) and show the points in a 2D or 3D scatter plot using the first few principal components as axes. Often this display shows clear class separation, but the features contributing most to the PCA are not easily seen. RadViz is a "visual" classifier that can help one understand the important features and how those features are related.

The RadViz Layout:

An example of the RadViz layout is illustrated in Figure 0.1A. There are 16 variables, or dimensions, associated with the single point plotted. Sixteen imaginary springs are anchored to points on the circumference and attached to the data point. The data point is plotted where the sum of the forces is zero, according to Hooke's law (F = kd), in which the force is proportional to the distance d to the anchor point. The value k for each spring is the value of the corresponding variable for the data point. In this example the spring constants (or dimensional values) are higher for the darker springs and lower for the lighter springs. Normally, many points are plotted without showing the spring lines. Generally, the dimensions (variables) are normalized to values between 0 and 1 so that all dimensions have "equal" weights. This spring paradigm layout has some interesting features.

For example, if all dimensions have the same normalized value, the data point will lie exactly in the center of the circle. If the point is a unit vector, it will lie exactly at the fixed point on the edge of the circle (where the spring for that dimension is anchored). Many points can map to the same position. This represents a non-linear transformation of the data that preserves certain symmetries and produces an intuitive display. Some features of this visualization, with a computational sketch following the list, include:

it is intuitive: higher dimension values "pull" the data point closer to that dimension's position on the circumference

points with approximately equal dimension values lie close to the center

points with similar values whose dimensions lie opposite each other on the circle lie near the center

points with one or two dimension values greater than the others lie closer to those dimensions

the relative locations of the dimension anchor points can drastically affect the layout (the idea behind the "class discrimination layout" algorithm)

an n-dimensional line maps to a line (or a single point) in RadViz

convex sets in n-space map to convex sets in RadViz
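Setting the net spring force to zero, Σ k_i (a_i - u) = 0, gives the plotted position u = (Σ k_i a_i) / (Σ k_i), where the a_i are the anchor points on the circle. A minimal sketch of this layout computation (ours, in Python; equally spaced anchors assumed):

```python
# Sketch: RadViz placement as the zero-net-force point of the spring layout.
import numpy as np

def radviz_positions(X):
    """X: samples x features, each feature scaled to [0, 1]."""
    n_features = X.shape[1]
    theta = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])  # circle anchors a_i
    weights = X / X.sum(axis=1, keepdims=True)  # spring constants, normalized per sample
    return weights @ anchors                    # u = sum(k_i * a_i) / sum(k_i)

# A point with equal values in all 16 dimensions lands at the center:
print(radviz_positions(np.array([[0.5] * 16])))  # ~ [0, 0]
```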

Applications

1. Data Mining the Public Domain NCI-60 Cancer Cell Line Compound GI50 Data Set

Introduction to the Cheminformatics Problem.

Important objectives in the overall process of molecular design for drug discovery are: 1) the ability to represent and identify the important structural features of any small molecule, and 2) the ability to select useful molecular structures for further study, usually by simple partitioning of the structures in n-dimensional space using linear QSAR models. To date, partitioning using non-linear QSAR models has not been widespread, but the complexity and high dimensionality of typical data sets require it. The machine learning and visualization techniques that we describe and utilize here represent an ideal set of methodologies with which to represent the structural features of small molecules and then select molecules by constructing and applying non-linear QSAR models. QSAR models might typically use calculated chemical descriptors of compounds along with computed or experimentally determined physical properties and interaction parameters (ΔG, Ka, kf, kr, LD50, GI50, etc.) with other large molecules or whole cells. Thermodynamic and kinetic parameters are usually generated in silico (ΔG) or via high throughput screening of compound libraries against appropriate receptors or important signaling pathway macromolecules (Ka, kf, kr), whereas LD50 or GI50 values are typically generated using whole cells suited to the disease model being investigated. Once the data have been generated, the application of machine learning can take place. We provide a sample illustration of this process below.

The National Cancer Institute's Developmental Therapeutics Program maintains a compound data set (>700,000 compounds) that is currently being systematically tested for cytotoxicity (generating 50% growth inhibition, GI50, values) against a panel of 60 cancer cell lines representing 9 tissue types. This dataset therefore contains a wealth of valuable information concerning potential cancer drug pharmacophores. In a data mining study of the 8 largest public domain chemical structure databases, it was observed that the NCI compound data set contained by far the largest number of unique compounds of all the databases (1.1). The application of sophisticated machine learning techniques to this unique NCI compound dataset represents an important open problem that motivated the investigation we present in this report. Previously, this data set has been mined by supervised learning techniques such as cluster correlation, principal component analysis, and various neural networks, as well as statistical techniques (1.2, 1.3). These approaches have identified distinct subsets within a variety of different classes of chemical compounds (1.4, 1.5, 1.6, 1.7). More recently, gene expression analysis has been added to the data mining activity of the NCI compound data set (1.8) to predict chemosensitivity, using the GI50 test data for each compound, for a few-hundred-compound subset of the NCI data set (1.9). After we completed our initial data mining analysis using the GI50 values (1.10), gene expression data on the 60 cancer cell lines was combined with the NCI compound GI50 data and also with a database of 27,000 chemical features computed for the NCI compounds (Molconz features, with Perl programs used to organize the data).

In this study, we use microarray based gene expression data to first establish a number of 'functional' classes of the 60 cancer cell lines via a hierarchical clustering technique. These functional classes are then used to supervise a 3-class learning problem, using a small but complete subset of 1400 of the NCI compounds' GI50 values as the input to the class discrimination layout algorithm in the RadViz™ program.

Specific Methods Used.

For the ~4% missing values found in the 1400 compound data set, we tried and compared two approaches to missing value replacement: 1) record average replacement, and 2) multiple imputation using Schafer's NORM software (1.11). Since applying either missing value replacement method to our data had little impact on the final results of our analysis, we chose the record average replacement method for all subsequent analyses.
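For reference, record average replacement amounts to the following (a sketch in Python; the records-by-measurements layout is an assumption):

```python
# Sketch: fill each record's missing values with that record's observed mean.
import numpy as np

def record_average_replace(X):
    """X: records x measurements, with np.nan marking missing values."""
    X = X.copy()
    row_means = np.nanmean(X, axis=1)       # per-record mean over observed values
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = row_means[rows]
    return X

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
print(record_average_replace(X))            # NaNs become 2.0 and 4.5
```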

Clustering of cell lines was done with the R-Project software using the hierarchical clustering algorithm, with the "average" linkage method specified and a dissimilarity matrix computed as [1 - the Pearson correlation] of the gene expression data. AnVil Corporation's RadViz™ software was used for feature reduction and initial classification of the cell lines based on the compound GI50 data. The selected features were validated using several classifiers as implemented in the Weka (Waikato Environment for Knowledge Analysis, University of Waikato, New Zealand) software application. The classifiers used were IB1 (nearest neighbor), IB3 (3-nearest neighbor), logistic regression, the Naïve Bayes classifier, a support vector machine, and a neural network with back propagation. Both ChemOffice 6.0 (CambridgeSoft Corp.) and the NCI website were used to identify compound structures via their NSC numbers. Substructure searching to identify quinone compounds in the larger data set was carried out using ChemFinder (CambridgeSoft).
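The clustering step translates directly to other environments; a sketch in Python/scipy (rather than the R actually used), with random placeholder data standing in for the expression matrix:

```python
# Sketch: average-linkage hierarchical clustering on a (1 - Pearson) dissimilarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
expr = rng.normal(size=(60, 1376))     # placeholder: 60 cell lines x 1376 genes

corr = np.corrcoef(expr)               # cell-line by cell-line Pearson correlations
dist = 1.0 - corr                      # the dissimilarity used in the text
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
classes = fcluster(Z, t=5, criterion="maxclust")   # cut the dendrogram into 5 clusters
print(np.bincount(classes)[1:])        # cluster sizes
```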

Results and Discussion

Identifying functional cancer cell line classes using gene expression data.

Based upon gene expression data, we identified cancer cell line classes that we could use in a subsequent supervised learning approach. In Figure 1.1 we present a hierarchical clustering dendrogram using the [1 - Pearson] distances calculated from the T-matrix (the term used by the original authors) of 1376 gene expression values determined for the 60 NCI cancer cell lines (1.12). Five well defined clusters are observed in this figure. Clusters 2-5 represent pure renal, leukemia, ovarian, and colorectal cancer cell lines, respectively. Only Cluster 1, the melanoma class, contains members of another clinical tumor type: the two breast cancer cell lines MDA-MB-435 and MDA-N. These two breast cancer cell lines behave functionally as melanoma cells and appear to be related to the melanoma cell lines via a shared neuroendocrine origin (1.12). The remaining cell lines in this dendrogram, those not found in any of the five functional classes, are defined as a sixth class: the non-melanoma, non-leukemia, non-renal, non-ovarian, non-colorectal class. In the supervised learning analyses that follow, we treat these six computationally derived functional clusters as ground truth.

3-Class Cancer Cell Classifications and Validation of Selected Compounds.

High class number classification problems are difficult to implement in cases where the data are not clearly separable into distinct classes; thus, we could not successfully carry out a 6-class classification of cancer cell lines based upon the starting GI50 compound data. Instead, we implemented a 3-class supervised learning classification using RadViz™ (1.13, 1.14, 1.15). Starting with the small data set of 1400 compounds whose GI50 values were complete for all 60 cell lines, we selected those compounds that were effective in carrying out a 3-way class discrimination at the p < .01 significance level (Bonferroni-corrected t statistic). A RadViz visual classifier for the melanoma, leukemia, and non-melanoma/non-leukemia classes is shown in Figure 1.2. A clear and accurate class separation of the 60 cancer cell lines can be seen. Fourteen compounds were selected as most effective against melanoma cells, and 30 compounds were selected as most effective against leukemia cells. Similar classification results were obtained for the two separate 2-class problems: melanoma vs. non-melanoma and leukemia vs. non-leukemia. For all other possible 2-class problems, we found that few to no compounds could be selected at the significance level we had previously set.
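The selection step can be sketched as a one-vs-rest t test with a Bonferroni correction over the compound set (Python; the data shapes and class proportions below are placeholders, not the NCI data):

```python
# Sketch: select compounds discriminating one class from all others at a
# Bonferroni-corrected significance level.
import numpy as np
from scipy import stats

def select_compounds(gi50, labels, target_class, alpha=0.01):
    """gi50: cell lines x compounds; labels: one class name per cell line."""
    in_class = labels == target_class
    t, p = stats.ttest_ind(gi50[in_class], gi50[~in_class])  # per-compound t test
    p_corr = np.minimum(p * gi50.shape[1], 1.0)              # Bonferroni correction
    keep = np.where(p_corr < alpha)[0]
    return keep[np.argsort(p_corr[keep])]                    # most significant first

rng = np.random.default_rng(0)
gi50 = rng.normal(size=(60, 1400))                           # placeholder GI50 matrix
labels = np.array(["melanoma"] * 10 + ["leukemia"] * 6 + ["other"] * 44)
print(select_compounds(gi50, labels, "melanoma")[:10])
```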

In order to validate our list of computationally selected compounds, we applied six additional analytical classification techniques, as previously described, to the original GI50 data set, using the same set of chemical predictors and a leave-one-out cross-validation strategy. Using these selected compounds resulted in a greater than 6-fold lower level of error compared with equivalent numbers of randomly selected compounds, thus validating our selection methodology.

Quinone Compound Subtypes

Upon examining the chemical identity of the compounds selected as most effective against melanoma and leukemia, an interesting observation was made. Of the 14 compounds selected as most effective against melanoma, 11 were p-quinones, all with the internal ring quinone structure shown in Figure 1.3A. Of the 30 compounds selected as most effective against leukemia, 8 contain p-quinones; in contrast to the internal ring quinones of the melanoma class, however, 6 of these 8 were external ring quinones, as shown in Figure 1.3B. To ascertain the uniqueness of the two quinone subsets, we first determined the extent of occurrence of p-quinones of all types in our starting data set via substructure searching using the ChemFinder 6.0 software. The internal and external quinone subtypes represent a significant fraction: 25% (10/41) of all the internal quinones and 40% (6/15) of all the external quinones in the entire data set (1.10).

Conclusion.

With this cheminformatics example we have demonstrated that the machine learning approach described above, utilizing RadViz™, produced two novel discoveries. First, a small group of chemical compounds, enriched in quinones, was found to effectively discriminate among melanoma, leukemia, and non-melanoma/non-leukemia cell lines on the basis of experimentally measured GI50 values. Second, two quinone subtypes were identified that possess clearly different and specific toxicity to the leukemia and melanoma cancer cell types. We believe this example illustrates the potential of sophisticated machine learning approaches for uncovering new and valuable relationships in complex, high dimensional chemical compound data sets.

2. Distinguishing lung tumor types using microarray gene expression data

Introduction to the high-throughput gene expression problem

Completion of the Human Genome Project has made possible the study of the gene expression levels of over 30,000 genes [2.1, 2.2]. Major technological advances have made possible the use of DNA microarrays to speed up this analysis. Even though the first microarray experiment was published only in 1995, by October 2002 a PubMed query of the microarray literature yielded more than 2300 hits, indicating explosive growth in the use of this powerful technique [2.3]. DNA microarrays take advantage of the convergence of a number of technologies and developments, including: robotics and the miniaturization of features to the micron scale (currently 20-200 µm surface feature sizes for spotting/printing and immobilizing sequences for hybridization experiments), DNA amplification by PCR, automated and efficient oligonucleotide synthesis and labeling chemistries, and sophisticated bioinformatics approaches.

An important application of microarray technology is the identification and differentiation of tissue types using differential gene expression, either between normal and cancerous cells or among tumor subclasses. The specific aim of this project was to explore the potential of machine learning and high dimensional visualization for building a classifier that could differentiate normal lung tissue from the various subclasses of non-small cell lung cancer, using microarray based differential expression patterns. We have previously reported using such techniques to successfully construct classifiers that solve the more general two-class problem of differentiating non-small cell lung cancer from normal tissue with accuracies greater than 95%. However, the three-class problem of distinguishing normal lung tissue from the two subclasses of non-small cell lung carcinoma (adenocarcinoma and squamous cell carcinoma) was not directly addressed. Our ultimate aim was the creation of gene sets with small numbers of genes that might serve as the basis for developing a clinically useful diagnostic tool.

In collaboration with the NCI, we examined two data sets of patients with and without various lung cancers. The first data set was provided directly by the NCI and included 75 patient samples [2.4]. This set contained 17 normal samples, 30 adenocarcinomas (6 doubles), and 28 squamous cell carcinomas (2 doubles). Doubles represent replicate samples prepared at different times, using different equipment, but derived from the same tissue sample. A second patient set of 157 samples was obtained from a publicly available data repository [2.5]. This set included 17 normal samples, 139 adenocarcinomas (127 of these with supporting information), and 21 squamous cell carcinomas. Both data sets included gene expression data from tissue samples using Affymetrix's Human Genome U95 Set [2.6]; only the first of the five oligonucleotide based GeneChip® arrays (Chip A) was used in this experiment. Chip A of the HG U95 array set contains roughly 12,000 full-length genes and a number of controls. Because we were dealing with two data sets from different sources, with microarray measurements taken at multiple times, we needed to consider a normalization procedure. For this particular analysis we kept to a simple mean of 200 for each sample. This resulted in a set of 9918 expressed genes, of which approximately 2000 were found to be statistically significant.

When one considers the number of possible seven-feature subsets of the spectrum (>10^25), you begin to see that the chance of finding a set of seven "good" features is small. At a minimum, features should show a statistically significant difference between the two classes. Of the seven features given on the website, two are not even marginally significant before correcting for multiple tests; these contribute mostly noise to the classifier. Two or three more features would fail our strict p < .001 standard after a Bonferroni correction. This is an arbitrary standard, but since it still leaves more than 400 "good" features there is no reason to relax it.
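The size of that search space is easy to confirm. Assuming a spectrum of roughly 15,000 M/Z values (our illustrative assumption), the count of seven-feature subsets is:

```python
# Sketch: the number of distinct 7-feature subsets of an assumed 15,000 M/Z values.
import math

n_mz = 15_000                          # assumed number of M/Z values per spectrum
subsets = math.comb(n_mz, 7)
print(f"{float(subsets):.2e}")         # ~3.4e+25
print(subsets > 10**25)                # True
```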

Figure 3.4 shows parallel coordinates displays of the two feature sets. The display on the left shows the data for the seven features given on the website; the display on the right shows the data for the six features we selected. Five of the seven features on the left in Figure 3.4 have very low intensities. We eliminated these in the first step of feature reduction because they fail to reach the 17.5 intensity threshold.
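A parallel coordinates display of this kind can be produced with standard tooling; a sketch assuming Python/pandas, with hypothetical feature names in place of the actual M/Z ratios:

```python
# Sketch: parallel coordinates plot of selected features, colored by class.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(0)
features = [f"mz_{i}" for i in range(6)]              # hypothetical feature names
df = pd.DataFrame(rng.normal(size=(50, 6)), columns=features)
df["class"] = np.repeat(["cancer", "control"], 25)
df.loc[df["class"] == "cancer", features[:3]] += 1.5  # separation in three features

parallel_coordinates(df, "class", color=["#333333", "#bbbbbb"])
plt.ylabel("intensity (scaled)")
plt.show()
```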

Conclusions.

It is clear that there are significant differences between the serum proteins of ovarian cancer patients and controls, and that mass spectroscopy is potentially a useful diagnostic tool. Because of differences in machines and instrumentation, the applicability of our models to a new data set is an open question. However, by applying intelligent feature reduction to mass spectroscopy data using high dimensional visualization prior to classification, it should be possible to develop clinically accurate and useful diagnostic models from proteomic data.

Conclusions

{John M will provide}

Acknowledgements

AnVil and the authors gratefully acknowledge support from two SBIR Phase I grants, R43 CA94429-01 and R43 CA096179-01, from the National Cancer Institute. Also, support is acknowledged from ………..X Y Z {John M needs to complete}

References

Section 0:

0.1 I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.

0.2 B. Shneiderman, “The Eyes Have It: A Task by Data Type Taxonomy of Information Visualization,” presented at IEEE Symposium on Visual Languages '96, Boulder, CO, 1996.

0.3 J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.

0.4 A. Inselberg, “The Plane with Parallel Coordinates,” Special Issue on Computational Geometry: The Visual Computer, vol. 1, pp. 69-91, 1985.

0.5 H. Hotelling, “Analysis of a Complex of Statistical Variables into Principal Components,” Journal of Educational Psychology, vol. 24, pp. 417-441, 498-520, 1933.

0.6 Ramaswamy, S., Ross, K.N., Lander, E.S. and Golub, T.R. A molecular signature of metastasis in primary solid tumors. Science, 22, 1-5.

0.7 Chaussabel, D. and Sher, A. Mining microarray expression data by literature profiling. Genome Biology, 3, 1-16.

0.8 Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.) Advances in knowledge discovery and data mining, AAAI/MIT Press, 1996.

Section 1

1.1 Voigt, K. and Bruggeman, R. (1995) Toxicology Databases in the Metadatabank of Online Databases, Toxicology, 100, 225-240.

1.2 Weinstein, J.N., et al. (1997) An information-intensive approach to the molecular pharmacology of cancer, Science, 275, 343-349.

1.3 Shi, L.M., Fan, Y., Lee, J.K., Waltham, M., Andrews, D.T., Scherf, U., Paul, K.D., and Weinstein, J.N. (2000) J. Chem. Inf. Comput. Sci., 40, 367-379.

1.4 Bai, R.L., Paul, K.D., Herald, C.L., Malspeis, L., Pettit, G.R., and Hamel, E. (1991) Halichondrin B and homohalichondrin B, marine natural products binding in the vinca domain of tubulin: discovery of tubulin-based mechanism of action by analysis of differential cytotoxicity data, J. Biol. Chem., 266, 15882-15889.

1.5 Cleveland, E.S., Monks, A., Vaigro-Wolff, A., Zaharevitz, D.W., Paul, K., Ardalan, K., Cooney, D.A., and Ford, H. Jr. (1995) Site of action of two novel pyrimidine biosynthesis inhibitors accurately predicted by the COMPARE program, Biochem. Pharmacol., 49, 947-954.

1.6 Gupta, M., Abdel-Megeed, M., Hoki, Y., Kohlhagen, G., Paul, K., and Pommier, Y. (1995) Eukaryotic DNA topoisomerase-mediated DNA cleavage induced by a new inhibitor, NSC 665517, Mol. Pharmacol., 48, 658-665.

1.7 Shi, L.M., Myers, T.G., Fan, Y., O'Connors, P.M., Paul, K.D., Friend, S.H., and Weinstein, J.N. (1998) Mining the National Cancer Institute Anticancer Drug Discovery Database: cluster analysis of ellipticine analogs with p53-inverse and central nervous system-selective patterns of activity, Mol. Pharmacol., 53, 241-251.

1.8 Ross, D.T., et al. (2000) Systematic variation of gene expression patterns in human cancer cell lines, Nat. Genet., 24, 227-235.

1.9 Staunton, J.E.; Slonim, D.K.; Coller, H.A.; Tamayo, P.; Angelo, M.P.; Park, J.; Scherf, U.; Lee, J.K.; Reinhold, W.O.; Weinstein, J.N.; Mesirov, J.P.; Lander, E.S.; Golub, T.R. Chemosensitivity prediction by transcriptional profiling, Proc. Natl. Acad. Sci., 2001, 98, 10787-10792.

1.10 Marx, K.A., O’Neil, P., Hoffman, P.; Ujwal, M.L. Data Mining the NCI Cancer Cell Line Compound GI50 Values: Identifying Quinone Subtypes Effective Against Melanoma and Leukemia Cell Classes, J. Chem. Inf. Comput. Sci., 2003, in press.

1.11 Schafer, J.L. Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability 72, Chapman & Hall/CRC, 1997.

1.12 Scherf, U.; Ross, D.T.; Waltham, M.; Smith, L.H.; Lee, J.K.; Tanabe, L.; Kohn, K.W.; Reinhold, W.C.; Myers, T.G.; Andrews, D.T.; Scudiero, D.A.; Eisen, M.B.; Sausville, E.A.; Pommier, Y.; Botstein, D.; Brown, P.O.; Weinstein, J.N. A gene expression database for the molecular pharmacology of cancer, Nat. Genet., 2000, 24, 236-247.

1.13 P. Hoffman and G. Grinstein, "Dimensional Anchors: A Graphic Primitive for Multidimensional Multivariate Information Visualizations," presented at NPIV '99 (Workshop on New Paradigms in Information Visualization and Manipulation), 1999.

1.14 Hoffman, P.; Grinstein, G.; Marx, K.; Grosse, I.; Stanley, E. DNA visual and analytical data mining, IEEE Visualization 1997 Proceedings, pp. 437-441, Phoenix

1.15 Hoffman, P.; Grinstein, G. Multidimensional information visualization for data mining with application for machine learning classifiers, Information Visualization in Data Mining and Knowledge Discovery, Morgan-Kaufmann, San Francisco, 2000.

Section 2:

2.1 Venter, J.C., et al., The Sequence of the Human Genome. Science, 291, 1303-1351 (2001).

2.2 Lander, E.S., et al., Initial Sequencing and Analysis of the Human Genome. Nature, 409, 860-921 (2001).

2.3 Stoeckert, C.J., et al., Microarray databases: standards and ontologies. Nat. Genet., 32 (Suppl), 469-473 (2002).

2.4 Jin Jen, M.D., Ph.D., Laboratory of Population Genetics, Center for Cancer Research, National Cancer Institute.

2.5 Matthew Meyerson Lab, Dana-Farber Cancer Institute, at

2.6 Affymetrix, Inc. at .

2.7 Weka (Waikato Environment for Knowledge Analysis), The University of Waikato, .

2.8 Dracheva, T., Shih, J., Jen, J., Gee, A., McCarthy, J.F., and Metrogenix; “Distinguishing lung tumors based on small number of genes using flow-through-chips” (In preparation).

Section 3:

3.1 Petricoin, E. F., A. M. Ardekani, B. A. Hitt, P. J. Levine, V. A. Fusaro, S. M. Steinberg, G. B. Mills, C. Simone, D. A. Fishman, E. C. Kohn, L. A. Liotta, Use of proteomic patterns in serum to identify ovarian cancer, Lancet, 2002, 359:572-77.

3.2 NCI Clinical Proteomics Website at

Figure Captions

Figure 0.1. The basic RadViz layout. One point with 16 dimensions (variables, features, or attributes) in RadViz. Spring lines (not usually shown) are colored by the value (k in Hooke's law) of the corresponding variable (dark is higher, light is lower). The point is plotted where the sum of the forces is zero. Since variables 11, 12, and 13 have higher values (scaled to between 0 and 1), the point is pulled closer to those variables.

Figure 1.1. Cancer cell line functional class definition using a hierarchical clustering (1-Pearson coefficient) dendrogram for 60 cancer cell lines based upon gene expression data. Five well defined clusters are shown highlighted. We treat the highlighted cell line clusters as the truth for the purpose of carrying out studies to identify which chemical compounds are highly significant in their classifying ability.

Figure 1.2. RadViz™ result for the 3-class classification of melanoma, leukemia, and non-melanoma/non-leukemia cancer cell types at the p < .01 criterion. Cell lines are symbol coded as described in the figure. A total of 14 compounds (bottom of layout) were most effective against melanoma and are laid out in the melanoma sector (counterclockwise from most to least effective). For leukemia, 30 compounds were identified as most effective and are laid out in that sector. Eight compounds were found to be most effective against non-melanoma/non-leukemia cell lines and are laid out in that sector.

Figure 1.3. Internal vs external quinone figure. {ML please supply caption}

Figure 2.1. Classification results for the NCI data set showing the size of the gene sets compared to their associated best percent correct. Notice how the genes selected by the RadViz algorithm (black) generally perform better than either the top F-statistic genes (gray) or the randomly selected genes (white). As the gene set sizes increased from one to about twenty genes, there was a sharp increase in classification accuracy. In addition, as more random genes are selected, their associated performance increases.

Figure 2.2. A RadViz display showing an example of a selected set of 15 genes from the Meyerson data set, defined by a balanced layout for the three classes: normal (gray squares), adenocarcinoma (black circles), and squamous cell carcinoma (white triangles). Ideally, each patient sample, displayed by its representative glyph, should fall within its respective region; however, some samples clearly fall into other regions and are thus visually misclassified. This particular gene set performs very well, with about 6 visual misclassifications; after applying our collection of classification algorithms, this gene set performed with 8 misclassifications.

Figure 2.3 {Alex: Please provide appropriate caption}

Figure 3.1. This is a parallel coordinates display of a peak at 417.732 and its closest neighbors. The intensities at nearby M/Z values are very similar. The correlation between this peak and its two nearest neighbors is about .97, and the correlation with the two next neighbors is about .91, illustrating one source of redundancy in the data.

Figure 3.2. The top graph shows the portion of the spectrum from M/Z of 5300 to 10600, while the bottom graph shows the portion from 2650 to 5300; thus the range at the bottom is exactly half the range at the top. Notice that all peaks in the top graph are repeated in the bottom graph. This is consistent with molecules of the same mass carrying twice the charge, suggesting production of doubly ionized forms of the original protein fragments. This is a second source of redundancy in the data.

Figure 3.3. This is a RadViz class discrimination display of the samples using the six selected features. Ovarian cancer samples are shown in white. Controls are shown in black. The two classes are reasonably well separated by these features indicating that the features can be used for classification. A neural network model based on these six features classified both the training set and the test set perfectly.

Figure 3.4. On the left are the seven M/Z ratios selected by Petricoin et al. On the right are the six features selected by the present authors. Data from ovarian cancer patients are displayed in white. Data from controls are displayed in black. Several of the features on the left have very low intensities. Some, but not all, of the features on the left show visible differences between the two groups of samples. There is a significant difference between the two groups for all of the features on the right.

Table 1: Gene ID crossreference with indication of literature support {ML}

Section 0 Figures

Figure 0.1

[pic]

Section 1 Figures

Figure 1.1

[pic]

Figure 1.2

[pic]

Figure 1.3

{ML’s simple quinone figure}

Section 2 Figures

|[pic] |[pic] |

|Figure 2.1 |Figure 2.2 |

[pic]

Figure 2.3

Section 3 Figures

Figure 3.1

[pic]

Figure 3.2

[pic]

Figure 3.3

[pic]

Figure 3.4

[pic]
