Iowa State University



SETDiR User Manual

for version 1.7.02

Copyrights:

Prof. Baskar Ganapathysubramanian

Department of Mechanical Engineering,

Iowa State University (ISU).

baskarg@iastate.edu

Documentation:

Sai Kiranmayee Samudrala,

Department of Mechanical Engineering,

Iowa State University (ISU).

saikiran@iastate.edu

Prof. Baskar Ganapathysubramanian

Developers:

Baskar Ganapathysubramanian.

Sai Kiranmayee Samudrala.

The work was partly supported by the National Science Foundation under contracts CCF-0917202 and NSF PHY-0941576.

Disclaimer:

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor Iowa State University, nor any of their employees or officers, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Iowa State University, United States Government or any agency thereof.

Availability of This Report

This report is available here.

Abstract

Significant advances in high-throughput experiments and computational capability have ushered in an era of data-driven analysis and discovery. Recent years have seen the development of various statistical methods to reduce the noise, redundancy and dimensionality of this huge amount of data and make analysis more tractable. These range from traditional linear PCA-based techniques to methods for nonlinear model reduction such as IsoMap. There is a need for a comprehensive software suite that can perform such analysis in a scalable, parallel way. This need resulted in the development of SETDiR (Scalable Extensible Toolkit for Dimensionality Reduction).

SETDiR is a software tool developed to perform dimensionality reduction on a given high-dimensional set of data. The emphasis of the software is on methods and techniques appropriate for a wide variety of high-dimensional data sets. These techniques can be used to efficiently unravel possible (non)linear structures in the data. In addition, techniques to estimate the optimal dimensionality of the low-dimensional representation are included. The techniques are packaged into a modular, computationally scalable software framework with a graphical user interface. This interface helps separate the mathematical and computational aspects from the applications, thus significantly enhancing utility to the scientific community.

This manual describes how to use SETDiR (Scalable Extensible Toolkit for Dimensionality Reduction). It answers questions such as “How do I install SETDiR?”, “How do I run SETDiR?”, “Where is the output stored?” and “Where should the input files be stored?”, along with representative examples.

Since SETDiR is under continued development, small changes in usage and calling sequences of routines may occur. SETDiR is supported; see the web site for information on contacting support.

Acknowledgments

We thank SETDiR users for their many suggestions and bug reports.

SETDiR uses routines from

• PETSc – Portable, Extensible Toolkit for Scientific Computation

• SLEPc – Scalable Library for Eigenvalue Problem Computations

Contents

1 Introduction

1.1 Scope

1.2 SETDiR Features

2 Users

3 Dimensionality Reduction Methods

3.1 PCA

3.2 IsoMap

3.3 LLE (Locally Linear Embedding)

3.4 LE (Laplacian Eigenmaps) – To be tested

3.5 Dimensionality Estimators

3.5.1 Dimensionality Estimation using GMST (Geodesic Minimum Spanning Tree)

3.5.2 Correlation Dimension

3.6 Comparison of DR methods

3.7 Time Complexity Analysis

4 Software Details

4.1 Overview of SETDiR

4.2 Software Requirements

5 How to use?

5.1 Folder Structure

5.2 How to get it running?

Input Details

5.2.1 Basic

5.2.2 Advanced

5.3 Choose Output

5.4 Output Plot

5.5 Output Report

6 References

7 Appendices

7.1 Example

7.2 External Libraries

Introduction

SETDiR (Scalable Extensible Toolkit for Dimensionality Reduction) is a dimensionality reduction tool that accepts high-dimensional data and reduces it to the lowest possible dimensionality while ensuring minimal loss of information.

1 Scope

The scope of this document is to help the user make the best possible use of the SETDiR software. It is intended both for readers who are familiar with the underlying mathematics and for those who are not. Section 2 describes the roles of users. Section 3 describes the dimensionality reduction methods implemented in the tool. Section 4 gives an overview of the software and describes the system requirements that must be satisfied before it can be run successfully. Section 5 describes the details of the software and the significance of each field present in the tool. The Appendix describes the use of SETDiR with example data sets.

2 SETDiR Features

1. Provides a user-friendly interface for performing dimensionality reduction on high-dimensional data

2. Displays the output (low-dimensional data, eigenvalues) in graphical form

3. Allows the user to perform the dimensionality reduction by choosing different options

4. Allows easy post-processing of the data by giving the user the option to save the graphs in JPEG format.

Users

This tool is applicable in various fields, and hence the user can be anyone, with interests ranging from materials science to psychology. For ease of use, we broadly classify users as:

1. Basic User – with little knowledge of dimensionality reduction

2. Advanced User – with expertise in dimensionality reduction techniques.

This is implemented in the software by providing two different tabs:

1. Basic

Basic settings involve the minimum information required to run dimensionality reduction on a given set of data.

2. Advanced

Advanced settings give the user more flexibility to change the way the DR method is applied.

Dimensionality Reduction Methods

Functionalities of a DR (Dimensionality Reduction) method:

• Estimate the number of latent variables

• Embed data to reduce the dimensionality

• Retrieve the latent variables

It is often difficult to find all three functionalities in a single method, so a combination of several methods is packaged together to achieve these objectives. DR methods can be classified based on the data they are usually applied to:

• Linear DR methods – give accurate results only for data lying on a linear manifold

• Nonlinear DR methods – give accurate results for data lying on a nonlinear manifold as well

Each DR method tries to achieve dimensionality reduction by preserving one specific property of the data: for example, pair-wise distances for PCA, the geodesic distance matrix for IsoMap, and topology (neighborhood relations) for LLE (Locally Linear Embedding).

1 PCA

Principal Component Analysis is a linear DR method that finds a set of linear combinations of the original variables (the principal components), ordered so that each successive component describes a decreasing percentage of the variance of the data.

Assumptions:

1. Data lies on a linear manifold

2. For the given set of high-dimensional input vectors X, there exists a low-dimensional mapping Y.

3. Data is centered.

Algorithm:

1. Read the input data X

2. Perform centering (and normalization) of the input data X

3. Form the matrix D of pair-wise distances between each of the n objects (high-dimensional data points)

4. Evaluate the eigenpairs of the pair-wise distance matrix: D = V Λ V^T

5. Compute the low-dimensional points Y = V^T X
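The steps above can be illustrated with a minimal Python/NumPy sketch of standard PCA (centering followed by an eigendecomposition of the covariance matrix). This is not the SETDiR implementation, which is a parallel C++ code built on PETSc/SLEPc; the function name pca_embed and the toy data are purely illustrative.

    import numpy as np

    def pca_embed(X, n_components=2):
        """Project the rows of X (n_samples x n_features) onto the leading
        principal components. Illustrative sketch only."""
        # Step 2: center the data (columns could also be normalized here)
        Xc = X - X.mean(axis=0)

        # Covariance matrix of the centered data
        C = (Xc.T @ Xc) / (Xc.shape[0] - 1)

        # Eigendecomposition C = V Lambda V^T (eigh handles the symmetric case)
        eigvals, V = np.linalg.eigh(C)

        # Order the eigenpairs by decreasing eigenvalue (decreasing explained variance)
        order = np.argsort(eigvals)[::-1]
        eigvals, V = eigvals[order], V[:, order]

        # Low-dimensional points: projection onto the leading eigenvectors
        Y = Xc @ V[:, :n_components]
        return Y, eigvals

    # Example usage on toy 3-D data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    Y, eigvals = pca_embed(X, n_components=2)

The eigenvalue plot produced by SETDiR (the scree plot) corresponds to the sorted eigvals returned above.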

Advantages

1. Simple, easy to understand and implement

2. Gives an estimate of dimensionality with the help of eigenvalue-plot.

Limitations

1. Cannot give accurate results for data lying on a non-linear manifold

2 IsoMap

IsoMap is similar to PCA in many respects, except that it uses the geodesic distance metric instead of the Euclidean distance.

Assumptions:

1. The given data lies on a convex surface (and the surface has no holes).

Algorithm:

1. Find the k nearest neighbors of each data point.

2. Construct the geodesic distance matrix G for the N points in the high-dimensional space by running Floyd’s algorithm on the neighborhood graph.

3. Calculate the optimal dimensionality (d) using an external dimensionality estimator.

4. Construct a matrix A of size N x N such that the elements of A are squares of the elements of the geodesic distance matrix, G.

5. Find the dissimilarity matrix (S) by double centering the matrix A.

6. Solve for the eigenvalues and eigenvectors of S, where S = U Λ U^T

7. Find the low-dimensional points X = I_{P×N} Λ^{1/2} U^T (i.e., keep the P leading rows of Λ^{1/2} U^T)
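A minimal illustrative Python sketch of these steps is shown below. SciPy's shortest_path routine (with the Floyd-Warshall method) stands in for the Floyd's-algorithm step; the function name isomap_embed and the default parameters are assumptions, not part of SETDiR.

    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from scipy.spatial.distance import pdist, squareform

    def isomap_embed(X, k=8, p_dims=2):
        """Illustrative IsoMap sketch: k-NN graph -> geodesic distances -> classical MDS.
        Assumes k is large enough that the neighborhood graph is connected."""
        n = X.shape[0]
        D = squareform(pdist(X))                 # pair-wise Euclidean distances

        # k-nearest-neighbor graph; zero entries are treated as missing edges
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(D[i])[1:k + 1]
            W[i, nbrs] = D[i, nbrs]
        W = np.maximum(W, W.T)                   # symmetrize the neighborhood graph

        # Geodesic distance matrix G via all-pairs shortest paths (Floyd-Warshall)
        G = shortest_path(W, method="FW", directed=False)

        # A holds squared geodesic distances; S is the double-centered dissimilarity matrix
        A = G ** 2
        J = np.eye(n) - np.ones((n, n)) / n
        S = -0.5 * J @ A @ J

        # Eigendecomposition S = U Lambda U^T; keep the P largest eigenpairs
        eigvals, U = np.linalg.eigh(S)
        order = np.argsort(eigvals)[::-1][:p_dims]
        return U[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))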

Advantages:

1. Can be successfully applied to a convex set of points lying on a nonlinear manifold without paying a large computational price.

2. A Landmark IsoMap variant exists, which works well on a small selected subset of points.

Limitations

1. Cannot give accurate results for manifolds with holes.

2. Results depend upon neighborhood size.

3. Sensitive to noise.

3 LLE (Locally Linear Embedding)

LLE is a topology-preserving method. In other words, it preserves the neighborhood relationships between data points.

Algorithm:

1. For each datum y(i), compute:

a. The K nearest neighbors of y(i).

b. The local covariance matrix, or Gram matrix, G(i), computed using:

g_{r,s}(i) = (y(i) − ν(r))^T (y(i) − ν(s)),

where ν(r) and ν(s) are neighbors of y(i)

c. The weights w(i), computed by solving the linear system:

G(i) · w(i) = 1

(where 1 is a vector of ones) and rescaling w(i) so that its entries sum to one

2. Knowing the vectors w(i), build the sparse matrices W and M, with M = (I − W)^T (I − W)

3. Compute the EVD of M; the estimated coordinates are given by the eigenvectors associated with the second to (1+P)th smallest eigenvalues.
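A minimal dense Python sketch of these steps follows. The actual SETDiR implementation uses sparse matrices and PETSc/SLEPc solvers; the helper lle_embed, the regularization constant reg, and the default parameters are illustrative assumptions.

    import numpy as np

    def lle_embed(Y, k=10, p_dims=2, reg=1e-3):
        """Illustrative LLE sketch: local weights -> M = (I - W)^T (I - W) -> bottom eigenvectors."""
        n = Y.shape[0]
        D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
        W = np.zeros((n, n))

        for i in range(n):
            nbrs = np.argsort(D[i])[1:k + 1]       # K nearest neighbors of y(i)
            Z = Y[nbrs] - Y[i]                     # neighbors centered on y(i)
            G = Z @ Z.T                            # local Gram matrix G(i)
            G += reg * np.trace(G) * np.eye(k)     # regularization (the "tolerance" parameter)
            w = np.linalg.solve(G, np.ones(k))     # solve G(i) w(i) = 1
            W[i, nbrs] = w / w.sum()               # rescale so the weights sum to one

        I = np.eye(n)
        M = (I - W).T @ (I - W)

        # EVD of M: skip the smallest (constant) eigenvector, keep the next P
        eigvals, U = np.linalg.eigh(M)
        return U[:, 1:p_dims + 1]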

Advantages:

1. Sound theoretical foundation

2. Simple method

3. The global EVD operation couples the individual pieces of information coming from the local neighborhood relations of the graph.

Limitations:

1. Since it has to solve an N x N EVD problem, the algorithm becomes computationally intractable when N becomes large.

2. Parameter tuning is not easy. It needs two parameters as inputs: the neighborhood size and the tolerance used to regularize the Gram matrix.

4 LE (Laplacian Eigenmaps) – To be tested

LE is similar in approach to LLE: both take a local approach to the problem of nonlinear dimensionality reduction. Like LLE, LE tries to preserve neighborhood relationships and thus achieves topology preservation. Please note that the LE algorithm implemented in this tool is still under testing.

Algorithm:

1. Compute the pair-wise distances and the neighborhood matrix for K neighbors.

2. Build the graph and its adjacency matrix A such that a_{i,j} = 1 if i and j are neighbors and a_{i,j} = 0 otherwise.

3. Compute the weight matrix W from A, typically using a heat kernel such as w_{i,j} = a_{i,j} · exp(−‖y(i) − y(j)‖² / (2T²)), where T is the kernel width parameter (the “T for Laplacian Eigenmaps” field in the Advanced settings).

4. Compute the sum of each row of W to build the diagonal degree matrix D.

5. Compute L = (W − D).

6. Normalize the Laplacian matrix: L̃ = D^{-1/2} L D^{-1/2}.

7. Compute the EVD of L̃: L̃ = U Λ U^T.

8. A low-dimensional embedding is finally obtained by multiplying the eigenvectors by D^{1/2}, transposing them, and keeping those associated with the P smallest nonzero eigenvalues (the trivial eigenvector with eigenvalue zero is discarded).
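As an illustration, here is a minimal Python sketch based on the standard Belkin-Niyogi formulation of Laplacian Eigenmaps, which solves the equivalent generalized eigenproblem L f = λ D f directly. The heat-kernel weight form, the sign convention L = D − W, and the function name laplacian_eigenmaps_embed are assumptions rather than a description of the SETDiR code.

    import numpy as np
    from scipy.linalg import eigh

    def laplacian_eigenmaps_embed(Y, k=10, T=100.0, p_dims=2):
        """Illustrative Laplacian Eigenmaps sketch: heat-kernel weights on a
        k-NN graph, then the generalized EVD L f = lambda D f."""
        n = Y.shape[0]
        dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)

        # Adjacency matrix: a_ij = 1 if j is among the k nearest neighbors of i (symmetrized)
        A = np.zeros((n, n))
        for i in range(n):
            A[i, np.argsort(dist[i])[1:k + 1]] = 1.0
        A = np.maximum(A, A.T)

        # Heat-kernel weights; T plays the role of the external parameter in the Advanced tab
        W = A * np.exp(-dist ** 2 / (2.0 * T ** 2))

        D = np.diag(W.sum(axis=1))   # diagonal degree matrix
        L = D - W                    # graph Laplacian (the manual writes W - D, the opposite sign)

        # Generalized EVD; eigenvalues ascending, the first (zero) one gives the constant vector
        eigvals, F = eigh(L, D)
        return F[:, 1:p_dims + 1]    # discard the trivial eigenvector, keep the next P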

6 Dimensionality Estimators

1 Dimensionality Estimation using GMST (Geodesic Minimum Spanning Tree)

This method is based on the BHH (Beardwood-Halton-Hammersley) theorem, which implies that the length of the geodesic minimum spanning tree over n sample points grows as n^((d−1)/d), where d is the intrinsic dimensionality; d can therefore be recovered from the slope of the plot of log(GMST length) against log(n). The following graphs show how this slope varies with the neighborhood size.

[pic][pic]

Figure 3.1 (a) and (b): Figures show the change in the estimated (GMST) dimensionality with respect to a change in neighborhood size; (a), on the left, is for the helix and (b) is for the Swiss roll.
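The slope-fitting procedure described above can be sketched in Python as follows. This is an illustrative implementation only: the helper gmst_dimension, the subset sizes, and the resampling scheme are assumptions, and the subset sizes must not exceed the number of available points.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
    from scipy.spatial.distance import pdist, squareform

    def gmst_dimension(X, k=8, subset_sizes=(50, 75, 100, 150, 200), trials=5, seed=0):
        """Illustrative GMST dimensionality estimate: fit the slope a of
        log(GMST length) vs. log(n) and return d = 1 / (1 - a)."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]

        # Geodesic distances over a k-NN graph (same construction as in IsoMap)
        D = squareform(pdist(X))
        W = np.zeros((n, n))
        for i in range(n):
            nbrs = np.argsort(D[i])[1:k + 1]
            W[i, nbrs] = D[i, nbrs]
        G = shortest_path(np.maximum(W, W.T), directed=False)

        log_n, log_len = [], []
        for m in subset_sizes:
            lengths = []
            for _ in range(trials):
                idx = rng.choice(n, size=m, replace=False)
                # Length of the minimum spanning tree over the geodesic distances
                mst = minimum_spanning_tree(G[np.ix_(idx, idx)])
                lengths.append(mst.sum())
            log_n.append(np.log(m))
            log_len.append(np.log(np.mean(lengths)))

        slope = np.polyfit(log_n, log_len, 1)[0]   # slope a = (d - 1) / d
        return 1.0 / (1.0 - slope)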

2 Correlation Dimension

Correlation Dimension is a special case of q-dimension given by:

D_q = \lim_{\epsilon \to 0} \frac{1}{q-1} \, \frac{\log \int \mu\left(B_\epsilon(y)\right)^{q-1} \, d\mu(y)}{\log \epsilon},

where B_ε(y) is the closed ball of radius ε centered on y and μ is a Borel probability measure on the given metric space. When q = 0 this is the capacity dimension, when q = 1 the information dimension, and when q = 2 the correlation dimension.

The correlation dimension estimator was tested on the helix and Swiss roll data. It proved far more robust to the choice of neighborhood than the GMST estimator. The following graphs show the change in the slope of the log(ε) vs. log(correlation sum) plot with a change in neighborhood:

[pic][pic]

Figure 3.2 (a) and (b): Figures show the change in the estimated (correlation) dimensionality with respect to a change in neighborhood size; (a), on the left, is for the helix and (b) is for the Swiss roll.
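For q = 2 the estimate reduces to fitting the slope of log C(ε) against log ε, where C(ε) is the fraction of point pairs closer than ε (the correlation sum). A minimal Python sketch follows; the helper name correlation_dimension and the particular choice of radii are assumptions.

    import numpy as np
    from scipy.spatial.distance import pdist

    def correlation_dimension(Y, n_radii=10):
        """Illustrative correlation-dimension estimate (q = 2): slope of
        log C(eps) against log(eps), where C(eps) is the correlation sum."""
        d = pdist(Y)                           # all pair-wise distances

        # Radii spread between small and moderate percentiles of the distances
        eps_values = np.geomspace(np.percentile(d, 2), np.percentile(d, 50), n_radii)

        # Correlation sum C(eps): fraction of point pairs closer than eps
        C = np.array([(d < eps).mean() for eps in eps_values])

        # The slope of the log-log fit estimates the correlation dimension
        return np.polyfit(np.log(eps_values), np.log(C), 1)[0]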

6 Comparison of DR methods

The following table compares the algorithmic steps of the four DR methods.

|Isomap |PCA |LLE |Laplacian Eigenmaps |

|Read Input Files |Read Input Files |Read Input Files |Read Input Files |

|Distance Matrix |Distance Matrix |Distance Matrix |Distance Matrix |

|Sorting |Sorting |Sorting |Sorting |

|Neighborhood Matrix |Neighborhood Matrix |Neighborhood Matrix |Neighborhood Matrix |

|Adjacency Matrix |Adjacency Matrix |Linear Solver |Adjacency Matrix |

|Floyd's Algorithm |Eigenvalue Solver |Eigenvalue Solver |Eigenvalue Solver |

|Eigenvalue Solver |Original Eigenvectors |Compute Low-dimensional points |Compute Low-dimensional points |

|Original Eigenvectors |Compute Low-dimensional points | | |

|Compute Low-dimensional points | | | |

|Normalization of Input Vectors |

|K-Means Clustering |

|Correlation Dimensionality Estimator |

|Dimensionality Estimator using Geodesic Minimum Spanning Tree Length |

7 Time Complexity Analysis

|S No. |Steps |Complexity |

|1 |Pair-wise Distance Calculation |O(n²) |

|2 |Sorting |O(n² log n) |

|3 |Floyd’s algorithm |O(n³) |

|4 |Prim’s algorithm |O(k log n) |

Software Details

SETDiR consists of a Java-based GUI that interacts with an executable written in C++.

1 Overview of SETDiR

The flow of functionality from the GUI to the executable is as follows:

1. GUI receives the inputs from the user

2. GUI calls the C++ executable

3. Executable:

a. Reads the input files

b. Computes the dimensionality

c. Performs Dimensionality Reduction

d. Clusters the low-dimensional points

e. Writes the output files to a Common Workspace

4. GUI now reads the output files from the Common Workspace

5. GUI displays the output to the user
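The sketch below mimics this call-and-read-back flow in Python for illustration only; the actual GUI is written in Java, and the executable name setdir_dr, its command-line argument, and the exact location of Summary.dat are hypothetical.

    import subprocess
    from pathlib import Path

    # Hypothetical names for illustration only: the manual does not specify the
    # executable's name or its command-line interface.
    WORKSPACE = Path("Workspace")
    DR_EXECUTABLE = "./setdir_dr"      # hypothetical C++ executable

    def run_and_report(config_file: str) -> str:
        """Mimic the GUI flow: launch the executable, then read the results
        back from the common workspace."""
        # Steps 2-3: call the executable, which reads its inputs from the workspace
        # and writes its outputs (plots, matrices, Summary.dat) back to it
        subprocess.run([DR_EXECUTABLE, config_file], check=True)

        # Steps 4-5: read the generated summary report for display to the user
        # (assumed location of Summary.dat inside the workspace's output folder)
        return (WORKSPACE / "output" / "Summary.dat").read_text()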

The following diagram describes the interaction between the C++ executable and the Java GUI.

[pic]

2 Software Requirements

SETDiR needs a JRE (Java Runtime Environment) to be installed on the system; instructions for installing it are given in Section 5.2. The current version of the GUI can only be run on a Windows operating system. We are currently working to make the GUI platform independent (web-based). However, note that the dimensionality reduction methods are written in a platform-independent framework.

How to use?

1 Folder Structure

This section describes how to use the software. The first sub-section (5.1) shows a hierarchical diagram of the folder structure.

[pic]

|S. No. |Folder |Purpose |

|1 |SETDiR |Contains all the files related to the software |

|2 |Dist |Folder containing the executable “View.jar”. |

|3 |View.jar |Double click this executable to get the application running. |

|4 |Workspace |Folder containing the input files, output files, and dependency files. |

|5 |Input |The user should place all the input files in this folder. |

|6 |output |Stores all the generated output files: eigenvalue plots and low-dimensional plots. |

|7 |compute |This sub-folder of the “output” folder contains the pair-wise distance, geodesic distance, and neighborhood matrix files. |

2 How to get it running?

To run the software executable:

1. To check whether you already have a JRE, or to install one if you do not, visit the Java download website and click on the hyperlink that says “Do I have Java?”. This link verifies whether Java is installed on your system and makes suggestions accordingly.

2. If you do not have a JRE installed, you can download it from the Download Java Now button on the same page.

3. Ensure you have the SETDiR software on your system.

4. Go to the dist folder and double-click on View.jar.

5. This should open an About dialog box.

This section gives the reader step-by-step instructions on how to use the software. Notice that there are four major modules of the software:

1. Input Details

2. Choose Output

3. Output Plot

4. Output Report

[pic]

Input Details

This module takes inputs from the user that answer the following questions: Where is the data located? What is the current (higher) dimensionality of the data? How many points are there in the high-dimensional space? These details are grouped into Basic and Advanced settings based on the amount of technical knowledge required to manipulate them.

1 Basic

The Basic tab in the application consists of the following fields:

|S No. |Field Name |Description |Remarks |

|1 |Number of Objects |Accepts integers. | |

|2 |Higher Dimensionality |Accepts integers. | |

|3 |Distance Norm |Accepts an integer describing the p value in the p-norm. | |

|4 |Normalization of Input Vectors |Normalization allows the proper scaling of the data, which allows the comparison of variables with different magnitudes and units. | |

|5 |Method of DR |A dropdown that lets the user choose the dimensionality reduction method to be applied to the data. |Methods: PCA, IsoMap, LLE (Locally Linear Embedding), Laplacian Eigenmaps |

|6 |Number of Clusters |Accepts integers. |Input k for the K-means clustering method. |

|7 |Dimensionality Estimator |A dropdown that lets the user choose the method used to estimate the dimensionality of the data. | |

|8 |Upload of Input File(s) |Captures the input file location. | |

|To perform this task |Follow these steps |

|To run DR on a set of data |Enter the details of the input data in the Input Details pane. Click on the Run Dimensionality Reduction button in the Input Details (Basic) pane. |

|To view a plot |Enter the details in the Choose Output pane. Click on the Display button in the Choose Output pane. |

|To read multiple files |Choose File > Multiple File Input. The file chooser that opens when the user clicks the Browse button next to the Upload Files option then reads the name of the folder containing the list of files to be uploaded. |

[pic]

2 Advanced

|S No. |Field Name |Description |Remarks |

|1 |Neighborhood Size |Accepts integers. Number of neighbors for each point. | |

|2 |EPS Chosen |A dropdown allowing the user to choose the EPS (Eigenvalue Problem Solver). |Krylov-Schur, Arnoldi, Lanczos, Power, Subspace Iteration |

|3 |Range of Eigenvalues |Smallest or largest eigenvalues. | |

|4 |Maximum Iterations |Integer indicating the maximum number of iterations allowed for the iterative EPS process. | |

|5 |Tolerance |A floating-point number representing the amount of error allowed before the eigenvalues are considered to have converged to the correct solution. | |

|6 |Shift-Invert |Shift-invert is a technique that allows the EPS to focus on a specific range of eigenvalues. This field accepts a ‘Yes’ or a ‘No’. | |

|7 |Shift |The value around which the eigenvalues are to be zoomed in. | |

|8 |KSP Solver Chosen |The KSP (Krylov Subspace) solver is used to solve a set of linear equations. |Applicable only for Locally Linear Embedding. |

|9 |Tolerance |A floating-point value representing the error allowed (in the solution) before the iterative KSP process converges to a solution. |Applicable only for Locally Linear Embedding. |

|10 |Maximum Iterations |Integer indicating the maximum number of iterations allowed for the iterative KSP process. |Applicable only for Locally Linear Embedding. |

|11 |Pre-conditioning Type |Chooses the type of pre-conditioning applied to the input matrix before solving the linear system A·x = b. |Applicable only for Locally Linear Embedding. The default value is “None”. |

|12 |T for Laplacian Eigenmaps |T is an external parameter required to run the Laplacian Eigenmaps DR method. |Applicable only for Laplacian Eigenmaps. The default value is 100. |

The default values for each of these variables depend on the dimensionality reduction method chosen. Note that each time the selected DR method changes, these variables are reset to their respective default values.

Not all of the variables in the Advanced tab are applicable to every DR method. The following table shows which variables apply to which methods, along with their default values.

|Property |PCA |IsoMap |LLE |LLE Default |

|linear |0 |1 |2 |2 |

|EigenRange |Largest |Largest |Smallest |Smallest |

|EPS Max Iterations |Default |Default |300 to 1000 |500 |

|EPS Tolerance |Default |Default |Close to smallest eigenvalue |1.00E-12 |

|Shift-Type |N/A |N/A |Shift-Invert |Shift-Invert |

|Shift Value |N/A |N/A |Zero |1.00E-25 |

|KSP Solver Type |N/A |N/A |cg, richardson, chebychev, gmres, lsqr |Conjugate Gradient |

|Pre-Conditioning |N/A |N/A |None |None |

|KSP Tolerance |N/A |N/A |Varies according to the problem |1.00E-32 |

|KSP Max Iterations |N/A |N/A |Varies according to the problem |500 |

[pic]

3 Choose Output

|S No. |Field Name |Description |Remarks |

|1 |Scree Plot |On selection, displays the plot of the eigenvalues against the index of the eigenvalues (i.e., the scree plot). | |

|2 |Normalized Scree Plot |Replaces the eigenvalues in the scree plot with the ratio of each eigenvalue to the maximum eigenvalue. | |

|3 |Low-dimensional Points |Plots the low-dimensional coordinates of the points with respect to the two selected principal components. | |

|4 |Clustering |Shows the clustering in the low-dimensional space, with each color representing one cluster. |Uses the K-means clustering method. |

|5 |PC X-axis |Accepts an integer representing the index of the dimension to be plotted on the X (horizontal) axis. | |

|6 |PC Y-axis |Accepts an integer representing the index of the dimension to be plotted on the Y (vertical) axis. | |

[pic]

After entering the required fields and pressing the Display button, the selected plot is displayed in the Output Plot pane.

4 Output Plot

The settings of the output plot can be controlled using the options in the Choose Output pane (Section 5.3). A sample plot is shown in the figure below.

[pic]

5 Output Report

Each time a data set is input and run successfully, SETDiR generates a summary file, Summary.dat, which contains a report of the input details and a summary of the output. The Report section of the output panel displays the contents of this file in a text area.

[pic]

References

1. S. Samudrala, B. Ganapathysubramanian, K. Rajan, "Nonlinear Dimensionality Reduction toolkit to accelerate structure-property-process investigations", Computational Materials Science, preprint.

2. John A. Lee, Michel Verleysen, Nonlinear Dimensionality Reduction, Springer, 2008.

Appendices

1 Example

Consider an example of running SETDiR on the data file “Arc200.dat”, which contains 200 points on an arc in three dimensions.

STEP 1: Double-click the “View.jar” icon to start the application, then click the Proceed button.

[pic]

STEP 2: Enter the basic details required in the Input Details pane and click Run Dimensionality Reduction button.

[pic]

STEP 3: Once the application has run successfully, a summary of the results will be shown in the Report pane. If any error occurs, an error message will be displayed at the bottom left corner of the application window.

[pic]

STEP 4: To view the results in the form of a plot, choose the output options from the Choose Output pane and click on the Display button.

[pic]

STEP 5: The results will be displayed in the Plot pane. PC1 and PC2 represent the indices of the principal components considered. Since PCA simply rotates the axes, the plot shows the low-dimensional points of the arc in a coordinate system whose axes capture the maximum variability.

[pic]

2 External Libraries

Each of the dimensionality reduction methods used here is spectral; that is, each involves the computation of eigenpairs of matrices. For this purpose, two external libraries are used:

1. PETSc – Portable, Extensible Toolkit for Scientific Computation

2. SLEPc – Scalable Library for Eigenvalue Problem Computations

Together, these two libraries enable the tool to solve the eigenvalue problems and linear systems of equations. More information can be found on the PETSc and SLEPc project websites.
