


Evolutionary Molecular Structure Determination Using Grid-Enabled Data Mining on a Computational & Data Grid

Mark L. Green1 and Russ Miller1,2

1Center for Computational Research, University at Buffalo

2Department of Computer Science and Engineering

University at Buffalo, State University of New York, Buffalo, NY 14260

mlgreen@ccr.buffalo.edu, miller@buffalo.edu

Abstract

A new computational framework is developed for the evolutionary determination of molecular crystal structures using the Shake-and-Bake methodology. Genetic algorithms are performed on the SnB results of known structures in order to optimize critical parameters of the program. The determination of efficient SnB input parameters can significantly reduce the time required to solve unknown molecular structures. Further, the grid-enabled data mining approach that we introduce is able to exploit computational cycles that would otherwise go unused.

1. Introduction

The ACDC-Grid [1,2,3,4,5] is a proof-of-concept grid that has been implemented in Western New York. The driving application, which comes from structural biology, provides a cost-effective solution to the problem of determining molecular structures from X-ray crystallographic data via the Shake-and-Bake direct methods procedure. SnB [6], a computer program based on the Shake-and-Bake method [7,8], is the program of choice for structure determination in numerous laboratories [9,10,11]. This computationally intensive procedure can exploit the grid's ability to present the user with a computational infrastructure that allows for the processing of a large number of related molecular trial structures [12,13].

The SnB program uses a dual-space direct-methods procedure for determining crystal structures from X-ray diffraction data. This program has been used in a routine fashion to solve difficult atomic resolution structures, containing as many as 1000 unique non-Hydrogen atoms, which could not be solved by traditional reciprocal-space routines. Recently, the focus of the Shake-and-Bake research team has been on the application of SnB to solve heavy-atom and anomalous-scattering substructures of much larger proteins, provided that 3-4Å diffraction data can be measured. In fact, while direct methods had been applied successfully to substructures containing on the order of a dozen selenium sites, SnB has been used to determine as many as 180 selenium sites. Such solutions have led to the determination of complete structures containing hundreds of thousands of atoms.

The Shake-and-Bake procedure consists of generating structure invariants and coordinates for random-atom trial structures. Each such trial structure is subjected to a cyclical automated procedure that includes computing a Fourier transform to determine phase values from the proposed set of atoms (initially random), determining a figure-of-merit [14] associated with these phases, refining the phases to locally optimize the figure-of-merit, computing a Fourier transform to produce an electron density map, and employing a peak-picking routine to examine the map and find the maxima. These peaks (maxima) are then considered to be atoms, and the cyclical process is repeated for a user-determined number of cycles.
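
The sketch below illustrates only the shape of this dual-space cycle; the figure-of-merit, refinement, and peak-picking routines are simple placeholders rather than SnB's actual numerical procedures, and the grid and atom counts are arbitrary.

```python
# Hedged sketch of one Shake-and-Bake trial: only the structure of the
# dual-space cycle is shown; the stubs are NOT SnB's real algorithms.
import numpy as np

rng = np.random.default_rng(0)

def figure_of_merit(phases):
    # Placeholder: SnB evaluates the minimal function of the phase invariants.
    return float(np.mean(np.cos(phases)))

def refine_phases(phases):
    # Placeholder for parameter-shift or tangent-formula phase refinement.
    return phases - 0.1 * np.sin(phases)

def run_trial(n_cycles=10, n_atoms=50, grid=(16, 16, 16)):
    # Random-atom trial structure represented as a point-atom density map.
    density = np.zeros(grid)
    density.flat[rng.choice(density.size, n_atoms, replace=False)] = 1.0
    fom = 0.0
    for _ in range(n_cycles):
        structure_factors = np.fft.fftn(density)            # real -> reciprocal space
        amplitudes = np.abs(structure_factors)
        phases = refine_phases(np.angle(structure_factors))  # optimize the FOM
        fom = figure_of_merit(phases)
        e_map = np.real(np.fft.ifftn(amplitudes * np.exp(1j * phases)))
        # Peak picking: the n_atoms largest map values become the new atoms.
        peaks = np.argsort(e_map, axis=None)[-n_atoms:]
        density = np.zeros(grid)
        density.flat[peaks] = 1.0
    return fom

print("final figure of merit:", run_trial())
```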

The running time of SnB varies widely as a function of the size of the structure, the quality of the data, the space group, and choices of critical input parameters, including the size of the Fourier grid, the number of reflections, the number and type of invariants, the number of cycles of the procedure used per trial structure, and the real-space and reciprocal-space refinement methods, to name a few. Therefore, the running time of the procedure can range from seconds or minutes on a PC to weeks or months on a supercomputer. Trial structures are continually and simultaneously processed, with the final figure-of-merit value of each structure stored in a file. The user can review a dynamic histogram during the processing of the trials in order to determine whether or not a solution is likely present in the set of completed trial structures. A bimodal distribution with significant separation is a typical indication that solutions are present, whereas a unimodal, bell-shaped distribution typically indicates a set comprised entirely of nonsolutions.

2. Genetic algorithms

Genetic Algorithms (GAs) were developed by Holland [15] and are based on natural selection and population genetics. Traditional optimization methods focus on developing a solution from a single trial, whereas genetic algorithms operate with a population of candidate solutions. This approach is philosophically similar to the Shake-and-Bake methodology in that a figure-of-merit of initially random atoms is evaluated during a cyclical process in an effort to determine the correct molecular structure.

We propose to use a GA to determine an efficient set of SnB input parameters in an effort to reduce the time-to-solution for determining a molecular crystal structure from X-ray diffraction data. We use a population of candidate SnB input parameter sets. Each member of the population is represented as a string, and a fitness function is used to assign a fitness (quality) value to each member. The members of the population obtain their fitness values by executing the SnB program with the input parameter values represented by their strings. Using "survival-of-the-fittest" selection, strings from the old population are used to create a new population based on their fitness values. The selected member strings can recombine using crossover and/or mutation operators. A crossover operator creates a new member by exchanging substrings between two candidate members, whereas a mutation operator randomly modifies a piece of an existing member string to form a new member. This procedure of combining and randomly perturbing member strings has, in many cases, been shown to produce stronger (i.e., more fit) populations as a function of time (i.e., number of generations).
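
A minimal sketch of such a GA loop, under stated assumptions, is shown below; the parameter names are drawn from the SnB input parameters discussed later, but the ranges and the dummy fitness function are purely illustrative stand-ins for an actual SnB run on a known structure.

```python
# Hedged sketch of the GA described above. The fitness stub stands in for
# launching SnB with the candidate's parameters; ranges are illustrative.
import random

PARAM_RANGES = {            # hypothetical subset of the SnB input parameters
    "NUM_CYCLE": (10, 500),
    "NUM_REF":   (500, 10000),
    "NUM_INV":   (5000, 100000),
}

def random_member():
    return {k: random.randint(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def fitness(member):
    # Placeholder objective: in production this would run SnB on a known
    # structure with these parameters and score, e.g., solutions per CPU hour.
    return -sum(member.values())

def crossover(a, b):
    # Exchange "substrings" (here, individual parameter values) between parents.
    return {k: random.choice((a[k], b[k])) for k in PARAM_RANGES}

def mutate(member, rate=0.1):
    out = dict(member)
    for k, (lo, hi) in PARAM_RANGES.items():
        if random.random() < rate:
            out[k] = random.randint(lo, hi)   # randomly perturb one piece
    return out

def evolve(pop_size=20, generations=10):
    population = [random_member() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]          # survival of the fittest
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

print(evolve())
```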

Sugal [16] (sequential execution) and PGAPack [17,18] (parallel and sequential execution) genetic algorithm libraries were used in our work. The Sugal library provides a sequential GA and has additional capabilities, such as restart functionality, that proved to be very important when determining fitness values for optimizing large molecular structures. The PGAPack library provides a parallel master/slave MPICH/MPI implementation that proved very efficient on distributed- and shared-memory ACDC-Grid compute platforms. Other key features include C and Fortran interfaces; binary-, integer-, real-, and character-valued native data types; object-oriented design; and multiple choices for GA operators and parameters. In addition, PGAPack is easily extensible. The PGAPack library was extended by Green to include restart functionality and is currently the only library used for the ACDC-Grid production work.

3. SnB computer program input parameters

The SnB computer program has approximately 100 input parameters, though not all of them can be optimized. For the purpose of this study, 17 parameters were identified for participation in the optimization procedure. The parameter names and brief descriptions follow; a hypothetical encoding of one such parameter set is sketched after the list.

1. NUM_REF: Number of reflections used for invariant generation and phase determination

2. RESO_MAX: Minimum data resolution

3. E_SIG_CUT: E/Sigma(E) > Cut

4. NUM_INV: Number of three-phase invariants to generate and use during the Shake-and-Bake procedure

5. NUM_CYCLE: Number of Shake-and-Bake cycles performed on every trial structure

6. PH_REFINE_METHOD: fast parameter shift, slow parameter shift, tangent formula method

7. PS_INIT_SHIFT: parameter shift angle in degrees

8. PS_NUM_SHIFT: maximum number of angular shift steps

9. PS_NUM_ITER: maximum number of parameter shift passes through phase list

10. TAN_NUM_ITER: maximum number of passes through phase list when PH_REFINE_METHOD is set to tangent formula method

11. MIN_MAP_RESO: Fourier grid map resolution

12. NUM_PEAKS_TO_OMIT: number of peaks to omit

13. INTERPOLATE: a Boolean value that specifies whether or not to interpolate the density map

14. C1: cycle 1 start

15. C2: cycle 2 end

16. P1: number of peaks to pick

17. P2: number of heavy atom peaks to pick
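
For illustration only, one GA population member covering these 17 parameters might be represented as below; the specific values and the flat string encoding are hypothetical and are not the settings used in our optimization runs.

```python
# Hypothetical encoding of one population member over the 17 SnB parameters
# listed above; the values are made up for the example.
member = {
    "NUM_REF": 4000, "RESO_MAX": 1.1, "E_SIG_CUT": 3.0, "NUM_INV": 40000,
    "NUM_CYCLE": 200, "PH_REFINE_METHOD": "fast_parameter_shift",
    "PS_INIT_SHIFT": 90.0, "PS_NUM_SHIFT": 2, "PS_NUM_ITER": 3,
    "TAN_NUM_ITER": 3, "MIN_MAP_RESO": 1.0, "NUM_PEAKS_TO_OMIT": 0,
    "INTERPOLATE": False, "C1": 1, "C2": 200, "P1": 100, "P2": 12,
}

# Flatten to the kind of string representation a GA library operates on.
encoded = ",".join(f"{k}={v}" for k, v in member.items())
print(encoded)
```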

Eight known molecular structures were initially used to evaluate the performance of the genetic algorithm evolutionary molecular structure determination framework. These structures are 96016c [19], 96064c [20], crambin [21,22], Gramicidin A [23], isoleucinomycin [24], pr435 [25], triclinic lysozyme [26], and triclinic vancomycin [27].

In order to efficiently utilize the computational resources of the ACDC-Grid, an accurate estimate must be made of the resource requirements for the SnB jobs that are necessary for the GA optimization. This includes runs with varying parameter sets over the complete set of eight known structures from our initial database.

This is accomplished as follows. First, a small number of jobs are run in order to determine the running time of each of the necessary jobs. Typically, this consists of running a single trial of each job in order to predict the time required for the full number of trials of the job under consideration.

Approximately 25,000 population members were evaluated for the eight known molecular structures and stored in a MySQL database table, as shown in Fig. 1.


Figure 1: MySQL database table for SnB trial results.

From these trial results, the mean (\mu_j) and standard deviation (\sigma_j) are calculated for each input parameter j and used to determine the standard score (z_{ij}) for each trial i,

z_{ij} = \frac{p_{ij} - \mu_j}{\sigma_j},

for all i and j, where p_{ij} is the value of parameter j in trial i. Fig. 2 shows the standard scores of the parameters under consideration.


Figure 2: Standard scores for Pearson product-moment correlation coefficient calculations.

The Pearson product-moment correlation coefficient (r_{jk}) is calculated for input parameter j and molecular structure k by

r_{jk} = \frac{1}{N} \sum_{i=1}^{N} z_{ij} \, z^{t}_{ik},

for all j and k, where N denotes the degrees of freedom and z^{t}_{ik} represents the standard score of the GA trial run time; see Fig. 3.


Figure 3: Pearson product-moment correlation coefficient database table.
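
A small self-contained sketch of these two computations, using synthetic trial data for a single structure in place of the MySQL trial-result table, might look as follows.

```python
# Hedged sketch of the standard-score and Pearson-correlation computation
# described above, for one molecular structure, on synthetic trial data.
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_params = 1000, 17
params = rng.random((n_trials, n_params))            # p_ij: parameter values
run_time = params @ rng.random(n_params) + rng.normal(0, 0.1, n_trials)

# z_ij = (p_ij - mu_j) / sigma_j for the parameters and for the run time
z_params = (params - params.mean(axis=0)) / params.std(axis=0)
z_time = (run_time - run_time.mean()) / run_time.std()

# Pearson coefficient of each parameter with the observed trial run time.
r = z_params.T @ z_time / n_trials
print(np.round(r, 3))
```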

The input parameters that have the largest absolute Pearson product-moment correlation coefficients with respect to the observed trial run times are selected and used to form a predictive run-time function that is fit using a linear least-squares routine,

\hat{t}_{ik} = \sum_{j \in S} c_{j} \, r_{jk} \, p_{ij},

where the observed trial run time t_{ik} is fit over the selected subset S of input parameters, p_{ij} denotes the input parameter value, r_{jk} denotes the Pearson product-moment correlation coefficient for the respective molecular structure k, and c_{j} denotes the linear least-squares fit coefficient for input parameter j. We use this function within the grid-enabled data mining infrastructure to estimate the maximum number of SnB GA generations and the maximum population size that can run on a given computational resource within a specified time frame.
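
Continuing the synthetic-data sketch above, the least-squares fit over the most strongly correlated parameters could be carried out as follows; the choice of five selected parameters is an assumption made for illustration.

```python
# Hedged sketch of the predictive run-time fit: select the parameters with
# the largest |r| and fit run time to them by linear least squares.
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_params = 1000, 17
params = rng.random((n_trials, n_params))            # synthetic trial data
run_time = params @ rng.random(n_params) + rng.normal(0, 0.1, n_trials)
z = (params - params.mean(axis=0)) / params.std(axis=0)
r = z.T @ ((run_time - run_time.mean()) / run_time.std()) / n_trials

selected = np.argsort(np.abs(r))[-5:]                # largest-|r| parameters
A = np.column_stack([params[:, selected], np.ones(n_trials)])
coeffs, *_ = np.linalg.lstsq(A, run_time, rcond=None)

def predict_run_time(p):
    """Estimate the run time of a trial with parameter vector p."""
    return float(np.dot(p[selected], coeffs[:-1]) + coeffs[-1])

print(predict_run_time(params[0]), run_time[0])
```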

The ACDC-Grid infrastructure automatically updates the correlation coefficients based on the availability of new trial data appearing in the SnB trial result table. Thus, run time estimates for any given structure continually evolve throughout the GA optimization process.

For example, if there are 50 processors available for 150 minutes on ACDC-Grid compute platform "A", we are interested in determining the maximum number of GA generations and the largest population that can run on "A" and complete within 150 minutes. Based on this information, the data mining algorithms can make intelligent choices not only of which structures to evaluate, but of the complete definition of the SnB GA job that should be executed. This type of run-time prediction is an essential component of our system for providing a level of quality of service. Further, in our experience, this type of parameter-based run-time prediction is almost always necessary when queue-managed computational resources are employed.
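
The scheduling arithmetic for this example can be sketched as follows; the per-trial time, candidate population sizes, and generation cap are illustrative assumptions, with the per-trial estimate assumed to come from the fitted run-time function described above.

```python
# Hedged sketch: choose the largest GA job that fits 50 processors for
# 150 minutes, given a predicted per-trial run time (assumed value here).
def max_ga_job(n_procs=50, window_min=150, predicted_trial_min=12.0,
               pop_sizes=(16, 32, 64, 128), max_generations=100):
    budget = n_procs * window_min                 # total CPU-minutes available
    best = None
    for pop in pop_sizes:
        gen_cost = pop * predicted_trial_min      # CPU-minutes per generation
        generations = min(max_generations, int(budget // gen_cost))
        if generations >= 1:
            best = (generations, pop)             # prefer larger populations
    return best

print(max_ga_job())   # (generations, population) that fits the time window
```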

The focus of this research is on the design, analysis, and implementation of a critical program in structural biology on two computational and data grids. The first is the ACDC-Grid, which has been implemented in Buffalo, NY, using facilities at SUNY-Buffalo and affiliated research institutions. The second is the grid established for Grid2003 at SC2003. In the remainder of this paper, we present an overview of the ACDC-Grid, Grid2003, and the implementation of the SnB program on both of these platforms.


The current premise is that the computing framework for the Shake-and-Bake procedure need not be restricted to local computing resources. Therefore, a grid-based implementation of the Shake-and-Bake methodology can afford scientists with limited local computing capabilities the opportunity to solve structures that would otherwise be beyond their means.

4. Grid Introduction

The Computational Grid is a rapidly emerging and expanding technology that allows geographically distributed resources (CPU cycles, data storage, sensors, visualization devices, and a wide variety of Internet-ready instruments), which are under distinct control, to be transparently linked together. The concept of "the grid" is, in many ways, analogous to that of the electrical power grid, where the end user cares only that the computational resource (electricity) is available, and not how or where it is produced and delivered. The power of the Grid lies in its ease of use and, perhaps most importantly, in the total aggregate computing power, data storage, and network bandwidth that can readily be brought to bear on a particular problem. The Grid will support remote collaborative common operational sessions to coordinate homeland security activities throughout the US. The Grid environment will enable secure interactions between geographically distributed groups, providing each group with a common operational picture and access to national and local databases. Since resources in a grid are pooled from many different domains, each with its own security protocol, ensuring the security of each system on the Grid is of paramount importance.

Grids have recently become a cost-effective mode of computation for a variety of reasons, including the following: (a) The Internet is reasonably mature and able to serve as basic infrastructure. (b) Network bandwidth has increased to the point of being able to provide reliable services. (c) Storage capacity is now at a commodity level, where one can purchase a terabyte of disk for roughly the same price as a high-end PC. (d) Many instruments hitting the market are Internet aware. (e) Clusters, supercomputers, storage, and visualization devices are becoming more easily accessible to many critical communities. (f) Applications have been parallelized. (g) Collaborative environments are moving out of the alpha phase of implementation.

For these and other reasons, grids are starting to move out of the research laboratory and into early-adopter production systems. The focus, however, continues to be on the critical middleware that can be used to build relatively simple or extremely complex mission-critical workflows. Workflows are comprised of processing elements, such as signal-processing algorithms, database queries, or visualization rendering, and data elements, such as files, databases, or data streams; many other options are also available. Using a system of Grid services with well-defined standards and protocols for communicating, one can build dynamic, scalable, fault-tolerant, and efficient workflows that can meet the needs of somewhat unpredictable computational challenges.

The Grid community has advanced the efficient usage of grids by defining metrics to measure the performance of grid applications and architectures and to rate the functionality and efficiency of grid architectures. These metrics will facilitate good engineering practices by allowing alternative implementations to be compared quantitatively. They also provide grid users with information about system capabilities so that they can develop and tune their applications toward informed objectives. Therefore, we propose a set of tasks that assess grid performance at the level of user applications. The tasks will demonstrate the ability to:

· deliver well-designed Grid services that can self-replicate in the event of failure,

· scale the amount of Grid resources utilized to meet quality-of-service constraints,

· simultaneously execute multiple workflows on independent Grid services to provide fault tolerance and mission-critical redundancy, and

· provide efficient solution times while reclaiming unused computational resources within the organization and potential partner institutions.

Many kinds of computational tasks are naturally suited to grid environments, including data-intensive applications. Therefore, we can characterize existing applications alongside emerging grid applications to understand and capture their computational needs and data usage patterns. Research and development activities relating to the Grid have generally focused on applications where data is stored in files. However, in many scientific and commercial domains, database management systems have a central role in data storage, access, organization, and authorization for numerous applications. Part of this research effort is targeted not at developing new data storage systems, but rather at making such systems more readily accessible, individually or collectively, within a grid framework.

As computers become ubiquitous, ideas for the implementation and use of Grid computing are developing rapidly and gaining prominence. However, issues of interoperability, security, performance, management, and privacy are very important. Many of these issues relate to infrastructure, where no value will be gained from proprietary solutions. The Security (SEC) Area is concerned with various issues relating to authentication and authorization in Grid environments to ensure application and data integrity. It is also generating best-practice scheduling and resource management documents, protocols, and API specifications to enable interoperability. Several layers of security, data encryption, and certificate authorities already exist in grid-enabling toolkits such as the Globus Toolkit 3.


5. Advanced Computational Data Center (ACDC) Grid Development

Grid Portal Development

Globus Integration:

· Globus version 2.2.4 is installed and in production.

· Globus Toolkit 3 (technology preview 4) is being evaluated.

· All Metacomputing Directory Service (MDS) information is stored in the portal database, eliminating the need for LDAP.

· Condor and Condor-G are used for resource management and Grid job submissions.

Meta-scheduler Integration:

· The Grid Portal scheduler incorporates platform statistics obtained on an hourly time scale, including load, running/queued jobs, backfill availability, queue schedule, and production rate.

· All statistics are stored in the portal database and can be charted for any historical period.

· The SnB standard user interface can be used to create and post-process any Grid job submission.

· A secure upload and download facility is provided for staging user jobs on and off the Grid.

· Current and historical Grid job status is provided by a simple portal database query interface.

· User usage profiles are charted up to the minute.

· Current Computational Grid statistics can be charted over user-defined intervals.

New LDAP Group "cluster" Information:

· Compute node processor information (types of processors, memory, scratch space, etc.).

· Queue information: username, nodes, queue, runtime, showbf, showq.

· Platform scratch space information (location, size, availability, etc.) is to be moved into this script.

· A common benchmark is run on all compute nodes to assess compute power across platforms (in-core and out-of-core numbers).

· The script runs on an hourly time scale on all platforms.

· A historical record is kept for all platforms on an hourly time scale.

Metascheduler/Grid Status Information:

· Node availability is needed for every platform (number of nodes and time available; nodes online/down/active/total; etc.).

· Any available information that can be used to provide Grid statistics on platforms, users, and the entire center is needed.

Grid Portal File Management:

· All grid user files are located in a local directory structure (with the grid username used as the root).

· Quotas are enforced.

· Secure upload/download/storage of files is provided.

· All file management is performed through the Grid Portal File Management Interface.

· All Grid-based jobs are submitted through the Grid Portal Job Management Interface.

· All Grid-based jobs are staged to the grid user's Grid Portal local directory prior to execution/submission.

· All grid users also have access to Grid Portal local scratch space for exchanging files and/or data with other grid users.

· No grid user applications are executed on the Grid Portal, and no individual user accounts are created for the grid user.

In an increasing number of scientific disciplines, large data collections are emerging as important community resources. Data Grids have a significant role to play that inherently complements the Computational Grids that manipulate these collections. A data grid denotes a large network of distributed storage resources, such as archival systems, caches, and databases, which are linked logically to create a sense of global persistence.

The goals of the ACDC Data Grid are to:

· design and implement transparent management of data distributed across heterogeneous resources, such that the data is accessible via a uniform web interface;

· enable the transparent migration of data between various resources while preserving uniform access for the user;

· maintain metadata information about each file and its location in a global database table (currently MySQL tables);

· periodically migrate files between machines for more optimal usage of resources; and

· implement basic file management functions accessible via a platform-independent web interface.

Features include:

· a user-friendly menu interface;

· file upload/download to and from the Data Grid Portal;

· a simple web-based file editor;

· an efficient search utility;

· a logical display of files for a given user in three divisions (user/group/public), with both hierarchical and list-based views; and

· sorting capability based on file metadata, e.g., filename, size, modification time, etc.

Additional goals are to:

· support multiple access to files in the data grid;

· implement basic locking and synchronization primitives for version control;

· integrate security into the data grid;

· implement basic authentication and authorization of users;

· decide and enforce policies for data access and publishing; and

· gather and display statistical information that is particularly useful to administrators for optimizing the usage of resources.

Migration Algorithm

File migration depends upon a number of factors:

· user access time,

· network capacity at the time of migration,

· the user profile, and

· user disk quotas on the various resources.

We need to mine log files in order to determine:

· how much data to migrate in one migration cycle,

· an appropriate migration cycle length,

· a user's access pattern for files, and

· the overall access pattern for particular files.

Global File Aging vs. Local File Aging

User aging attribute:

· Indicative of a user's access across their own files.

· An attribute of the user's profile.

· At migration time, this attribute determines which users' files should be migrated off of the grid portal onto a remote resource.

· A function of file age, global file aging, and resource usage.

File aging attribute:

· Indicative of the overall access to, and migration activity of, a particular file.

· An attribute in the file_management table.

· Scale: 0 to 1, the probability of whether or not to migrate the file; file_aging_local_param is initialized to 1.

· At migration time, after a user has been chosen, this attribute helps determine which of the user's files to migrate (e.g., migrate at most the top 5% of a user's files in any one cycle).

For a given user, the average of the file_aging_local_param attributes over all files should be close to 1, with an operating tolerance of 0.9 to 1.1 before action is taken. In this way, the user's file_aging_global_param can be a function of this average:

· If the average file_aging_local_param attribute is greater than 1.1, then the user's files are being held too long before being migrated, and the file_aging_global_param value should be decreased.

· If the average file_aging_local_param attribute is less than 0.9, then the user's files are being accessed at a higher frequency than the file_aging_global_param value reflects, and that value should be increased.

A sketch of this feedback rule follows.
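
The sketch below is a minimal illustration of the feedback rule just described; the adjustment step size is an assumption, as the amount by which the global parameter is moved is not specified above.

```python
# Hedged sketch of the file-aging feedback rule: keep the average per-file
# aging parameter near 1.0 by nudging the user's global aging parameter
# whenever the average drifts outside the 0.9-1.1 operating tolerance.
def adjust_global_aging(local_params, global_param, step=0.05):
    """local_params: file_aging_local_param values for one user's files."""
    avg = sum(local_params) / len(local_params)
    if avg > 1.1:        # files held too long before migration
        return global_param - step
    if avg < 0.9:        # files accessed more often than the global rate
        return global_param + step
    return global_param  # within operating tolerance: no change

print(adjust_global_aging([1.2, 1.15, 1.3], global_param=1.0))
```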

Issues to consider:

· What is the effect of publishing on file/user aging?

· What is the format of a user profile?

· When do we update the user's file_aging_global_param attribute?

· What is the relationship between the aging attributes of two users in the same group? In different groups?

The Data Grid algorithms are continually evolving to minimize network traffic and maximize disk space utilization on a per user basis by data mining user usage and disk space requirements.

A user can be granted access to specific or all ACDC-Grid Portal resources, software, and web pages.

6. Shake-and-Bake Grid-Enabled Data Mining

The SnB grid-enabled data mining application utilizes the ACDC-Grid infrastructure and web portal, as shown in Fig. 4.



Figure 4: The ACDC-Grid web portal user interface.

A typical SnB job uses the Grid Portal to supply the molecular structure parameter sets to optimize, the data file metadata, the grid-enabled SnB mode of operation (dedicated or backfill), and the SnB termination criteria. This information can be provided via the point-and-click web portal interface or by specifying a batch script, as shown in Fig. 5.


Figure 5: The ACDC-Grid web portal database job interface.

The database job script can accept command-line arguments and can be activated or de-activated at any time by adjusting the database job grid portal parameters. A fully configurable time interval is used by the grid portal to execute some or all of the configured database jobs (normally this time interval is set to 10 minutes).
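
A schematic of this periodic runner is sketched below; the job-table fields, script names, and single-pass loop are hypothetical and do not reflect the portal's actual schema or implementation.

```python
# Hedged sketch of the portal's periodic database-job runner: every interval
# it executes whichever configured database jobs are currently activated.
import time

database_jobs = [                                    # illustrative job table
    {"name": "snb_ga_backfill", "active": True,  "command": "run_snb_ga.sh"},
    {"name": "nightly_cleanup", "active": False, "command": "cleanup.sh"},
]

def run_active_jobs(jobs):
    for job in jobs:
        if job["active"]:
            print("executing", job["name"], "->", job["command"])  # placeholder

def portal_loop(interval_minutes=10, cycles=1):
    for _ in range(cycles):          # the real portal would loop indefinitely
        run_active_jobs(database_jobs)
        time.sleep(0)                # sleep(interval_minutes * 60) in production

portal_loop()
```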

The Grid Portal then assembles the required SnB application data, supporting files, execution scripts, and database tables, and submits jobs for parameter optimization based on the current database statistics. ACDC-Grid job management automatically determines the appropriate execution times, number of trials, and number of processors for each available resource, as well as the logging and status of all concurrently executing resource jobs. In addition, it automatically incorporates the SnB trial results into the molecular structure database and initiates post-processing of the updated database for subsequent job submissions. Fig. 6 shows the logical relationships for the SnB grid-enabled data mining routine described.


Figure 6: ACDC-Grid grid-enabled data mining diagram.

For example, starting September 8, 2003, a backfill data mining SnB job was activated at the Center for Computational Research using the ACDC-Grid computational and data grid resources. The ACDC-Grid historical job-monitoring infrastructure was used to obtain the jobs completed for the period of September 8, 2003 to January 10, 2004, as shown in Fig. 7.


Figure 7: ACDC-Grid job monitoring information for all resources and users.

Problem Statement

Use all available resources in the ACDC-Grid to execute a data mining genetic algorithm optimization of SnB parameters for molecular structures having the same space group.

Grid Enabling Criteria

· All heterogeneous resources in the ACDC-Grid are capable of executing the SnB application.

· All job results obtained from the ACDC-Grid resources are stored in the corresponding molecular structure databases.

· There are two modes of operation and two sets of stopping criteria. Data mining jobs can be submitted in a dedicated mode (time critical), where jobs are queued on ACDC-Grid resources, or in a backfill mode (non-time critical), where jobs are submitted to ACDC-Grid resources that have unused cycles available.

The activated data mining SnB job template was run by user mlgreen. By hovering over the bar in the chart shown in Fig. 8, one can see the job statistics for this user's activated data mining job. Notice that 3118 jobs were completed on the ACDC-Grid resources over this time period. The ACDC-Grid job monitoring also dynamically reports job statistics for the data mining jobs (Fig. 8). The total number of jobs completed by all users on all resources is 19,868, of which the data mining jobs represent 15.69%. The average number of processes for a data mining job was 19.65, and the total number of processors used over this period was 433,552, of which the data mining jobs accounted for 16.85%. The data mining jobs consumed 291,987 CPU hours, which was 19.54% of the total CPU hours consumed (1,494,352 CPU hours).


Figure 8: ACDC-Grid job monitoring statistics for user mlgreen.
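
The percentages quoted above can be checked directly from the reported totals:

```python
# Quick check of the job-monitoring percentages quoted in the text.
data_mining_jobs, total_jobs = 3118, 19868
dm_cpu_hours, total_cpu_hours = 291987, 1494352
print(round(100 * data_mining_jobs / total_jobs, 2))    # ~15.69 %
print(round(100 * dm_cpu_hours / total_cpu_hours, 2))   # ~19.54 %
```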

A subsequent mouse click on the bar chart drills down further, describing the jobs completed by user mlgreen. Here, we see the five computational resources that processed the 3118 data mining jobs. The statistics for the Joplin compute platform are shown in Fig. 9. Note that all statistics are based only on the jobs completed by the mlgreen user. There were 869 jobs processed by the Joplin compute platform, representing 27.87% of the 3118 data mining jobs.


Figure 9: ACDC-Grid job monitoring statistics for user mlgreen.


Clicking on the bar chart drills down into a full description of all jobs processed by the Joplin compute platform, as shown in Fig. 10. The information presented includes job ID, username, group name, queue name, node count, processes per node, queue wait time, wall time used, wall time requested, wall time efficiency, CPU time, physical memory used, virtual memory used, and job completion time/date.


Figure 10: ACDC-Grid job monitoring tabular accounting of completed job statistics.

The ACDC-Grid data mining backfill mode of operation only uses computational resources that are not currently scheduled for use by the native queue scheduler. These resources are commonly called "backfill", as users can run jobs on the associated nodes without affecting the queued jobs. Many queues and schedulers expose this information in the form of X number of nodes available for Y amount of time. The ACDC-Grid infrastructure monitors this information for all of the computational resources and stores it in a MySQL database table, as shown in Fig. 11.

Fig. 11 also shows the number of processors and the wall time available for each resource. Note that a value of -1 for the available wall time represents an unlimited amount of time (no currently queued jobs require the use of these processors). The activated data mining template can obtain the number of processors and wall time available for a given compute platform and then check the status of the platform before determining the actual GA SnB data mining job parameters; see Fig. 12.


Figure 11: ACDC-Grid backfill information for all resources.



Figure 12: ACDC-Grid computational resource status monitor.
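
A sketch of how the template might size a backfill job from this table is shown below; the resource rows, the per-trial time estimate, and the sizing rule are illustrative assumptions, while the -1 convention for unlimited wall time follows the description above.

```python
# Hedged sketch: size a GA SnB backfill job from the backfill table.
backfill = [
    # (resource, processors available, wall time available in minutes; -1 = unlimited)
    ("joplin", 64, 90),
    ("nash",   16, -1),
]

def size_backfill_job(procs_avail, wall_avail, predicted_trial_min=12.0, pop=32):
    if wall_avail == -1:                          # unlimited wall time available
        return {"population": pop, "generations": 100}
    budget = procs_avail * wall_avail             # CPU-minutes of backfill
    generations = int(budget // (pop * predicted_trial_min))
    return {"population": pop, "generations": max(0, min(generations, 100))}

for resource, procs, wall in backfill:
    print(resource, size_backfill_job(procs, wall))
```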

Using the Pearson product-moment fit function derived earlier, the run time of the new data mining job is estimated based on the current ACDC-Grid SnB molecular structure database information. The data mining job template is then executed, leading to the migration and submission of the designed data mining job to the respective ACDC-Grid computational resource.

The activated data mining template has two sets of stopping criteria, as follows.

1. Continue submitting SnB data mining application jobs until the optimal parameters have been found based on pre-determined criteria.

2. Continue indefinitely (the data mining template is manually de-activated by the user when optimal parameters are found).

This illustrative example summarizes the evolutionary molecular structure determination optimization of the Shake-and-Bake method as instantiated in the SnB computer program.


7. Grid2003 Experience

7.1. Goals of the Project

The Grid2003 project was defined and planned by stakeholder representatives in an effort to align iVDGL project goals with the Software and Computing projects of the LHC experiments. The planning process converged during the iVDGL Steering Committee meeting at Argonne National Laboratory, June 8-9, 2003, with a set of agreed-upon principles stating that the Project must:

· Provide the next phase of the iVDGL Laboratory

· Provide the infrastructure and services needed to demonstrate LHC production and analysis applications running at scale in a common grid environment

· Provide a platform for computer science technology demonstrators

· Provide a common grid environment for LIGO and SDSS applications.

Planning details were iteratively defined and are available in the iVDGL document server (cf. Plan V21). The goals of the project included meeting a set of performance targets using the metrics listed in the planning document. The central project milestone can be summarized as the delivery of a shared, multi-VO, multi-application grid laboratory in which performance targets were pursued through the deployment and execution of application demonstrations before, during, and following the SC2003 conference in Phoenix (November 16-19). The Project was organized as a broad, evolving team including the application groups, site administrators, middleware and core service providers, and operations. The Project was able to call on additional effort through the stakeholder organizations.

The design and configuration of the Grid were driven by the requirements of the applications. The Project included those responsible for installing and running the applications; the system managers responsible for each of the processing and storage sites (including the U.S. LHC Tier1 Centers, the iVDGL-funded prototype Tier2 Centers, resources from physics departments, and leveraged facilities from large-scale computing centers); and the groups responsible for delivery and support of the grid system services and operations. The overall approach of the project was "end-to-end" in terms of giving equal attention to the application, organization, site infrastructure, and system services needed to achieve science applications running on a shared grid.

The applications running on Grid3 included official releases corresponding to the production environments that the experiments will use to run production and analysis over the next year. Applications from the computer science research groups (GridFTP, Exerciser, Netlogger) were used to explore the performance of different aspects of the Grid.

The project plan included basic principles that contributed to making life simpler and more flexible. In particular, the decisions to have dynamic installation of applications, to not presume the installation and configuration of the "worker" processing nodes, and to use existing facilities and batch systems without reinstallation of the software all contributed to the success of the project.

The active period of the project, during which the people involved were expected to contribute and be responsive to the needs of the project, was the five months from July through November 2003. Subsequently, Grid3 remains in operation, with many applications running, but with reduced expectations as to response time to problems and the attention of the members of the team. The collaborative organization of the project allowed us to address problems as they arose and to focus our efforts in response to unanticipated issues. The team made decisions and was flexible enough to accept additional sites and applications into the project as it evolved. Identifying people with coordination roles helped the project scale in size and complexity. These roles were filled by representatives from their respective projects, including: Sites (iVDGL operations team), Applications (iVDGL applications coordinator, with liaisons from each VO's Software and Computing project), Monitoring (Grid Telemetry), Operations (iVDGL operations), and Troubleshooting (VDT). The Project was coordinated by the iVDGL and PPDG project coordinators.

As stated, the Grid2003 Project was organized to meet several strategic project goals, including building the "next phase" of the iVDGL Laboratory, according to the stated goals and mandate of the NSF-funded iVDGL ITR Project. The iVDGL Project previously had two deployments (both in 2002): a small testbed consisting of the VDT deployed on a small number of U.S. sites (iVDGL-1), followed by a second, joint deployment with the EU DataTAG project (iVDGL-2), which coincided with the IST 2002 and SC2002 conferences (the "WorldGrid" Project). Grid3 (originally proposed as iVDGL-3) is the third phase of the iVDGL Laboratory.

The Grid2003 Project deployed, integrated, and operated Grid3 with ~25 operational processing sites comprising, at peak, ~2800 CPUs for more than 3 weeks leading up to, during, and after the SC2003 conference on November 16, 2003. Other iVDGL proposal themes in which progress was made include the following.

· Multiple-VO grid: six different virtual organizations participated, with 10 applications deployed and successfully run. All applications were able to run on sites that were not owned by the organization to which the application belonged. The applications were all able to run on non-dedicated resources.

· Multi-disciplinary grid: during the project, two new applications, one from biology and the other from chemical informatics, were run across Grid3. The fact that these could be installed and run on a Grid infrastructure designed and installed for particle physics and astrophysics experiments gives us added confidence that this infrastructure is general and can be adapted to other applications as needed.

· Use of shared resources: many of the resources brought into the Grid3 environment were leveraged facilities in use by other VOs. Examples include the successful incorporation of such sites.

· Grid operations and establishment of the iGOC: resources from the Indiana University-based Abilene NOC were leveraged to provide a number of operations services, including VOMS services for the iVDGL VO participants (CS, Biology, Chemistry); the MonALISA Grid3 database, which served double duty for online resource display and as archival storage for the Metrics Data Viewer (MDViewer) used for analysis of Grid3 metrics; the top-level GIIS information service; development and support of the iVDGL:Grid3 Pacman cache; coordination, development, and hosting of site status scripts and displays; and the creation and support of Ganglia Pacman caches and hosting of the top-level Ganglia collector and web server.

· Dynamic resource allocation: the University at Buffalo CCR was able to configure its local schedulers to bring resources into and out of Grid3 nightly according to local policies, satisfying both local requirements and (external) Grid3 users.

· International connectivity: though one site was located abroad (Kyungpook National University, Korea), international operations were not a primary focus, in contrast to last year's WorldGrid VDT-EDG interoperability demonstration project, which focused on transatlantic grids.

· VDT installation and configuration: improvements included enhanced support for post-install configurations at sites with native Condor installations. Pacman3 development was concurrent with most of the project and was not used for deployment. However, initial tests of Pacman3 with iVDGL:Grid3 have demonstrated backwards compatibility of the new tool.

· VDT testing and robustification: the Troubleshooting team, led by the VDT group, oversaw a number of VDT improvements and patches in response to bugs uncovered by site administrators and by application users and developers. These included, most importantly, patches required for the job managers and provisions for the MDS 2.4 upgrade.

The Grid2003 project met the metrics from the planning document, as listed below. The "status" numbers fluctuated over the course of several weeks around SC2003.

1. Number of CPUs. Target: 400; status: 2163. More than 60% of the available CPU resources are non-dedicated facilities. The Grid3 environment effectively shares resources not directly owned by the participating experiments (ref: list of sites).

2. Number of users. Target: 10; status: 102. About 10% of the users are application administrators who do the majority of the job submissions. However, more than 102 users are authorized to use the resources through their respective VOMS services.

3. Number of applications. Target: >4; status: 10. Seven scientific applications, including at least one from each of the five GriPhyN-iVDGL-PPDG participating experiments, ran and continue to run on Grid3. In addition, three computer science demonstrators (instrumented GridFTP, a multi-site I/O generator, and grid health monitors) are run periodically (ref: applications page).

4. Number of sites running concurrent applications. Target: >10; status: 17. This number is related to the number of Computational Service (CS, CSE) sites defined on the catalog page and varies with application.

5. Data transfers per day. Target: 2-3 TB; status: 4 TB. This metric was met with the aid of the GridFTP demo, which runs concurrently with the scientific applications (ref: GridFTP statistics page).

6. Percentage of resources used. Target: 90%; status: 40-70%. The maximum number of CPUs on Grid3 exceeded 2500. On November 20, 2003, there were sustained periods when over 1100 jobs ran simultaneously (the metrics plots are averages over specific time bins, which can report less than the peak depending on the chosen bin size). Each time we upgraded a component of the grid, there was a significant length of time before stable operation was regained. In the latter part of the project, most of the upgrades were for the monitoring systems, which did not prevent applications from running.

7. Efficiency of job completion. Target: up to 75%; status: varies. This varies depending on the application and on the definition of failure. Generally speaking, for well-run Grid3 sites and stable applications this figure exceeds 90%. We have not had the time to explore why individual jobs fail.

8. Peak number of concurrent jobs. Target: up to 1000; status: 1100, achieved on 11/20/03.

9. Rate of faults/crashes. Target: …
