IRoot Series of Tutorials



IRootLab Tutorial
Data analysis with Difference-between-mean spectra and Cross-calculated LDA
Julio Trevisan – juliotrevisan@
Updated on 1st/Dec/2012

Contents
Introduction
Loading and checking the dataset
Differences between mean spectra
Running the LDAs
    Standardizing dataset
    Direct LDA
    Cross-calculated LDA
Discussion

Introduction

This tutorial shows how to:
1) Use differences between mean spectra as a simple way to check for biochemical alterations
2) Use leave-one-out cross-validation to calculate LDA scores (“cross-calculation” of scores)

Difference between means consists of choosing one class to be the reference, and subtracting the mean spectrum of this class from all the spectra in the dataset. This allows one to find which classes have higher or lower absorption at each wavenumber, when compared with the reference class.

As for the cross-calculated LDA [1], the principle is:
1) Use a dataset containing all samples except one to train the LDA model
2) Use the model to calculate the scores for the left-out sample
3) Repeat steps 1) and 2) for every sample until all scores are calculated.

Sample = patient, slide etc. (depending on the experiment).
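The three steps above can be sketched in a few lines of Python. This is only an illustration of the leave-one-sample-out loop, not IRootLab code. The function names, the toy spectra and the patient codes are all hypothetical. The "model" is a deliberately simplified stand-in (projection onto the difference between two class means), not a real LDA.

```python
def train_mean_difference(spectra, classes):
    """Stand-in for LDA training (NOT real LDA, two classes only).

    Returns a function that scores a spectrum by projecting it onto the
    difference between the two class means, centred on their midpoint.
    """
    labels = sorted(set(classes))
    if len(labels) != 2:
        raise ValueError("this stand-in needs exactly two classes in the training set")
    n_vars = len(spectra[0])
    means = []
    for label in labels:
        rows = [s for s, c in zip(spectra, classes) if c == label]
        means.append([sum(r[j] for r in rows) / len(rows) for j in range(n_vars)])
    direction = [b - a for a, b in zip(means[0], means[1])]
    midpoint = [(a + b) / 2 for a, b in zip(means[0], means[1])]

    def score(spectrum):
        return sum((x - m) * d for x, m, d in zip(spectrum, midpoint, direction))

    return score


def cross_calculate(spectra, classes, groups, train):
    """Calculate one score per spectrum, leaving out one sample (group) at a time."""
    scores = [None] * len(spectra)
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        # Step 1: train on every sample except the held-out one
        model = train([spectra[i] for i in train_idx],
                      [classes[i] for i in train_idx])
        # Step 2: score the held-out spectra with that model
        for i in test_idx:
            scores[i] = model(spectra[i])
    # Step 3: the loop repeats until every sample has been held out once
    return scores


# Toy data (hypothetical): 4 patients, 2 spectra each, 2 variables per spectrum
spectra = [
    [1.0, 0.0], [1.2, 0.1],    # p1
    [0.9, -0.1], [1.1, 0.0],   # p2
    [3.0, 1.0], [3.2, 1.1],    # p3
    [2.9, 0.9], [3.1, 1.0],    # p4
]
classes = ["Normal"] * 4 + ["Tumour"] * 4
groups = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]

scores = cross_calculate(spectra, classes, groups, train_mean_difference)
for g, c, s in zip(groups, classes, scores):
    print(g, c, round(s, 3))
```

Passing the training routine in as an argument keeps the loop independent of the model, so the same loop would work with a real LDA implementation in place of the stand-in.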
All spectra from one sample need to be kept together.

Loading and checking the dataset

This tutorial uses Ketan’s Brain data [2], which is shipped with IRootLab.

1) At the MATLAB command line, enter browse_demos
2) Click on “LOAD_DATA_KETAN_BRAIN_ATR”
3) Click on “objtool” to launch objtool

The next step will generate a report on the dataset.

4) Click on Apply new blocks/more actions
5) Click on vis
6) Click on Default report
7) Click on Create, train & use

A window should open displaying the following.
As seen, the “Normal” class, which will be the reference class, is the first class (it has index 1).

Differences between mean spectra

8) Click on pre
9) Click on Subtract mean of a reference class
10) Click on Create, train & use
11) Accept the value 1 (it refers to the first class, “Normal”)
12) Click on ds01_refmean01
13) Click on Class means
14) Click on Create, train & use

The following figure should appear.

Note – For example, the average absorbance of the “Astrocytoma” samples is higher than “Normal” between 1600 and 1300 cm⁻¹, peaking at 1500 cm⁻¹, and is lower than “Normal” between 1300 and 900 cm⁻¹.

The “Normal” curve is a flat line at zero as a consequence of the Subtract mean of a reference class operation.

Running the LDAs

Standardizing dataset

15) Click on ds01
16) Click on pre
17) Click on Standardization
18) Click on Create, train & use

Note – The dataset must be either mean-centered or standardized before cross-calculated LDA. Standardization provides more numerical stability. Standardization [3] is mean-centering followed by scaling of each variable so that its standard deviation becomes 1.

Direct LDA

First we will apply LDA directly, to compare later with the cross-calculated LDA.

19) Click on ds01_std01
20) Click on fcon
21) Click on Linear Discriminant Analysis
22) Click on Create, train & use
23) Click on OK

The next step will generate a scores plot.

24) Click on ds01_std01_lda01
25) Click on vis
26) Click on 2D Scatterplot
27) Click on Create, train & use
28) Click on OK

The following figure should appear:

But are the classes so well separated because LDA overfits the data? We will do the cross-calculated LDA to find out.

Cross-calculated LDA

29) Click on ds01_std01
30) Click on AS
31) Click on Cross-calculate
32) Click on Create, train & use
33) Click on OK (a new Log will be created; this may take a few seconds)

Note – The SGS can be left blank because the cross-calculation block (which we are creating now) will automatically create one. The default SGS is a leave-one-out (LOO) cross-validation that keeps together all the spectra from the same group (patient in this case).

In our case, LOO cross-validation is equivalent to 22-fold cross-validation because the dataset has 22 groups (check the report generated in step 7).
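The equivalence between group-wise LOO and 22-fold cross-validation can be checked with a short sketch. This is an illustration in plain Python, not IRootLab's SGS. The function name and the patient codes (with an arbitrary number of spectra per patient) are made up for the example.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one fold per distinct group."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical group codes: 22 patients, each with a different number of spectra
groups = []
for p in range(1, 23):
    groups += ["patient%02d" % p] * (5 + p % 3)

folds = list(leave_one_group_out(groups))

# Number of distinct groups available for training in each fold
n_train_groups = [len({groups[i] for i in train_idx}) for _, train_idx, _ in folds]

print("folds:", len(folds))
print("training groups per fold:", sorted(set(n_train_groups)))
```

The number of folds depends only on the number of distinct groups, not on how many spectra each group contributes.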
With 22 groups, the scores for each group are calculated by an LDA block that is trained on the other 21 groups and then applied to the spectra from that group.

34) Click on Log
35) Click on log_as_crossc_crossc01
36) Click on extract dataset
37) Click on Execute (will create a dataset)
38) Click on Dataset
39) Click on irdata_crossc01
40) Click on Existing blocks (we are going to re-use the 2D Scatterplot block)
41) Click on vis_scatter2d01
42) Click on Use

The following figure should appear:

Discussion

The classes are not as well separated as with direct LDA. The segregation seen before with direct LDA was unrealistic because of overfitting.

Nevertheless, the results from cross-calculated LDA are still excellent, because the classes are nearly completely separated; only a small overlap is seen between “Normal” and “Glioblastoma”.

References

[1] M. J. Riding, F. L. Martin, J. Trevisan, V. Llabjani, I. I. Patel, K. C. Jones, and K. T. Semple, “Concentration-dependent effects of carbon nanoparticles in gram-negative bacteria determined by infrared spectroscopy with multivariate analysis,” Environ. Poll., vol. 163C, pp. 226–234, Jan. 2012.
[2] K. Gajjar, L. Heppenstall, W. Pang, K. M. Ashton, J. Trevisan, I. I. Patel, V. Llabjani, H. F. Stringfellow, P. L. Martin-Hirsch, T. Dawson, and F. L. Martin, “Diagnostic segregation of human brain tumours using Fourier-transform infrared and/or Raman spectroscopy coupled with discriminant analysis,” Analytical Methods, vol. 44, no. 0, pp. 2–41, 2012.
[3] T. Hastie, J. H. Friedman, and R. Tibshirani, The Elements of Statistical Learning, 2nd ed. New York: Springer, 2007.