Technical Report GriPhyN-2002-10



GriPhyN Technical Report

June 30 2004



GriPhyN Annual Report for 2003 – 2004

The GriPhyN Collaboration

[pic]

NSF Grant 0086044

Table of Contents

1 Overview

2 Computer Science Research Activities

2.1 Virtual Data Language and Processing System

2.2 Grid Workflow Language

2.3 Grid Workflow Planning

2.4 Research on Data placement planning

2.5 Policy-based resource sharing and allocation mechanisms

2.6 Workflow Execution and Data Movement

2.7 Peer-to-Peer Usage Patterns in Data Intensive Experiments

2.8 Resource Discovery

2.9 Grid Fault Tolerance and Availability

2.10 Prediction

2.11 GridDB: A Data-Centric Overlay for Scientific Grids

3 VDT – The Virtual Data Toolkit

3.1 VDT Development, Deployment, and Support Highlights

3.2 The VDT Build and Test Process

3.3 VDT Plans

4 GriPhyN Activities in ATLAS

4.1 Application Packaging and Configuration with Pacman

4.2 Grappa: a Grid Portal for ATLAS Applications

4.3 Virtual Data Methods for Monte Carlo Production

5 GriPhyN Activities in CMS

5.1 Data Challenge 04 and Production of Monte Carlo Simulated CMS Data

5.2 Grid Enabled Analysis of CMS Data

5.3 Future Plans for CMS Activities in GriPhyN

6 GriPhyN Activities in LIGO

6.1 Building a Production Quality Analysis Pipeline for LIGO's Pulsar Search Using Metadata-driven Pegasus

6.2 Lightweight Data Replicator

7 GriPhyN Activities in SDSS

8 Education and Outreach

8.1 New UT Brownsville Beowulf System

8.2 VDT software installation and Grid Demos

8.3 Portal Development

8.4 Grid applications in gravitational wave data analysis

8.5 Teacher Assessment Workshop

8.6 Grid Summer Workshop

8.7 QuarkNet – GriPhyN Collaboration: Applying Virtual Data to Science Education

8.8 A New Initiative: Understanding the Universe Education and Outreach (UUEO)

8.9 Milestones for year 2004

9 Publications and Documents Resulting from GriPhyN Research

10 References

Overview[1]

GriPhyN is a collaboration of computer science researchers and experimental physicists who aim to provide the IT advances required to enable Petabyte-scale data intensive science in the 21st century. The goal of GriPhyN is to increase the scientific productivity of large-scale data intensive scientific experiments through a two-pronged Grid-based approach:

• Apply a methodical, organized, and disciplined approach to scientific data management using the concept of virtual data to enable more precisely automated data production and reproduction.

• Bring the power of the grid to bear on the scale and productivity of science data processing and management by developing automated request scheduling mechanisms that make large grids as easy to use as a workstation.

Virtual data is to the Grid what object orientation is to design and programming. In the same way that object orientation binds methods to data, the virtual data paradigm binds data products closely to the transformation or derivation tools that produce them. We expect that the virtual data paradigm will bring to the processing tasks of data intensive science the same rigor that the scientific method brings to the core science processes: a highly structured, finely controlled, precisely tracked mechanism that is cost effective and practical to use.

Similarly for the Grid paradigm: we see grids as the network OS for large-scale IT projects. Just as today the four GriPhyN experiments use an off-the-shelf operating system (Linux, mainly), in the future, our goal is that similar projects will use the GriPhyN VDT to implement their Grid. The road to this level of popularization and de facto standardization of GriPhyN results needs to take place outside of, and continue beyond, the time frame of GriPhyN; hence the partnerships that we will create between GriPhyN and other worldwide Grid projects are of vital importance.

One of GriPhyN’s most important challenges is to strike the right balance in our plans between research – inventing cool “stuff” – and the daunting task of deploying that “stuff” into some of the most complex scientific and engineering endeavors being undertaken today. While one of our goals is to create tools so compelling that our customers will beat down our doors to integrate the technology themselves (as they do with UNIX, C, Perl, Java, Python, etc), success will also require that we work closely with our experiments to ensure that our results do make the difference we seek. Our metric for success is to move GriPhyN research results into the four experiments to make a difference for them, and demonstrate the ability of science to deal with high data volumes efficiently.

Achieving these goals requires diligent and painstaking analysis of highly complex processes (scientific, technical, and social); creative, innovative, but carefully focused research; the production of well-packaged, reliable software components written in clean, modular, supportable code and delivered on an announced schedule with dependable commitment; the forging of detailed integration plans with the experiments; and the support, evaluation, and continued improvement and re-honing of our deployed software.

This report is being written at the start of GriPhyN’s 33rd month of funding. Our understanding of the problems of virtual data tracking and request execution has matured, and our research software for testing these concepts has now been applied to all participating experiments. The Virtual Data Toolkit (VDT) has increased in capability, robustness, and ease of installation, and continues to prove itself as an effective technology dissemination vehicle, being adopted by many projects worldwide including the international LHC Computing Grid project. We demonstrated several important milestones in GriPhyN research and its application to the experiments at Supercomputing 2003 in November 2003, where in collaboration with our Trillium partners we used real science scenarios from HEP, LIGO, and SDSS to demonstrate the application of virtual data management and grid request planning and execution facilities on the large Grid3 distributed environment. We demonstrated large virtual-data-enabled pipelines running galaxy cluster finding algorithms on Sloan Digital Sky Survey data, and have applied virtual data techniques and tracking databases to ATLAS Monte Carlo collision event simulations. We have moved the CMS event simulation environment close to production readiness by greatly increasing the event production rates and reliability. Notably, another research project is now using GriPhyN technology for BLAST genome comparison challenge problems.

We have continued to push our CS research forward, particularly in the areas of extending the scope of the virtual data paradigm and simulating and testing mechanisms and algorithms for policy-cognizant resource allocation, and we have continued to deploy and stress-test scalability enhancements to various Globus Toolkit components (e.g., the replica location service and the GRAM job submission system), Condor-G, and DAGMan.

Our project continues to be managed by project coordinators Mike Wilde of Argonne and Rick Cavanaugh of the University of Florida. Alain Roy of the University of Wisconsin, Madison, manages Virtual Data Toolkit development, and other staff manage distinct project areas. Technical support for the project and for the development and enhancement of web communication resources and conferencing facilities has been supplied by staff at the University of Florida.

During this time, we have held an all-hands collaboration-wide meeting in conjunction with the iVDGL project, with broad participation from researchers and graduate students, application scientists and computer scientists. We have also conducted smaller, more focused workshop-style meetings on specific CS research topics and CS-experiment interactions.

Computer Science Research Activities

Computer science research during the past 12 months of the GriPhyN project has focused on the following primary activities:

• Development, packaging, and support of twelve new releases of the Chimera Virtual Data System and the Pegasus planner

• Request planning and execution: algorithms and architectural issues

• Understanding the architectural interactions of data flow, work flow management, and scheduling

• Data scheduling research into advance data placement algorithms and mechanisms

• Request execution – continued to enhance, harden, and tune grid execution mechanisms

• Understanding fault tolerance for the grid: metrics and strategies

• Resource sharing policy algorithm design, simulation, and prototyping

• Performance prediction of data grid workloads in specific GriPhyN application domains

• Development of languages for specific aspects of Grid computing: database workflow and grid data flow.

In this section we review our progress in each of these areas.

1 Virtual Data Language and Processing System

The GriPhyN virtual data language, VDL, continues to evolve and to see expanding usage across all four GriPhyN experiments, as well as in other projects that collaborated with GriPhyN via iVDGL and the Grid2003 effort (including BTeV and the NCSA “PACI Data Quest” project in computational biology and astronomy).

GriPhyN research in the area of virtual data during this period continued to focus on the packaging, support, tuning and functional enhancement of the virtual data language processing stack.

During this period, the virtual data language and the Chimera Virtual Data System were enhanced and supported for the application to all GriPhyN experiments. The team also provided support for the use of Chimera and the VDT to problems in computational biology and astronomy, as part of the NCSA Alliance Expedition “PACI Data Quest”.

Twelve editions of the Virtual Data System component of the VDT were created and released, and another release is posted in release-candidate state at the time of this writing. The features provided in these releases included the following highlights:

• Attributes were provided to specify run-time handling for virtual data arguments – whether or not data transfer was required, and whether or not cataloging of each output data product was required. These features, driven by application requirements, enabled the system to address a wider set of use cases. For example, some applications (the first of which was SDSS galaxy cluster finding) require that the data transfer and cataloging stage after a transformation executes depend on whether or not the application actually produced a file. This feature allows for such semantics, where an application may optionally decide to create an output file depending on conditions present in its inputs.

• Compound transformations were enhanced to simplify their interface, eliminating the need to place internal temporary files into the argument list of the outer compound.

• The uniform execution environment capture mechanism, named “kickstart”, was significantly enhanced in a new version that can be more flexibly, scalably, and transparently integrated into application workflows. This utility launches applications in a uniform manner across different schedulers and resource managers, and captures important details about the runtime environment and behavior of the application. This information is then returned with each job and placed into the virtual data catalog to complete the “provenance trail” of each Grid job execution and of the resulting data files. Support for Cygwin (for Windows execution) and for LCG1 was added, along with the ability to create necessary directories on the fly. A new version under development provides stat information on arbitrary transformation input and output files, multiple pre-, post-, init-, and cleanup jobs, optional md5sums on files, second-level staging (for LCG2), and improved XML provenance records, and is driven by a configuration file to overcome command-line length limitations.

• Transfer scripts were enhanced and generalized to permit a wider range of transfer utilities and techniques to be used at workflow runtime, including support for the efficient batching of the files needed by each job, a local file symlinking option to avoid redundant file copies, a new POSIX-threaded implementation to permit multiple simultaneous transfers, and retries of failed transfers (a simplified threaded-transfer sketch appears after this list). A new data transfer utility under development supports an input file of transfer directives, an added logical filename for logging purposes, and multiple choices for input sources.

• Features were added to the abstract workflow generator (“gendax”): local variable declarations for compound statements, an optimized abstract DAG format, and improved logical filename attributes for data transfer control.

• A greatly improved Java API for direct embedding of VDL processing into portals and other user interfaces, an “ant”-driven build process, and additional Java documentation files for packages.
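To illustrate the threaded-transfer approach described in the list above, the following sketch shows a batch-transfer loop with a fixed pool of worker threads, per-file retries, and a local symlink shortcut. This is our own simplified illustration in Python, not the actual VDS transfer script; the copy call stands in for whatever grid transfer tool is configured.

    # A minimal sketch (not the actual VDS transfer script): fetch a batch of
    # files with a fixed pool of worker threads, retry failed transfers, and
    # optionally symlink local files instead of copying them.
    import os
    import queue
    import shutil
    import threading

    MAX_RETRIES = 3
    NUM_WORKERS = 4

    def transfer_one(src, dst, symlink_ok=False):
        if symlink_ok and os.path.exists(src):
            os.symlink(os.path.abspath(src), dst)  # avoid a redundant local copy
        else:
            shutil.copyfile(src, dst)  # stand-in for a real grid transfer tool

    def worker(work, failures):
        while True:
            try:
                src, dst = work.get_nowait()
            except queue.Empty:
                return
            for attempt in range(1, MAX_RETRIES + 1):
                try:
                    transfer_one(src, dst)
                    break
                except OSError as err:
                    if attempt == MAX_RETRIES:
                        failures.append((src, dst, str(err)))
            work.task_done()

    def transfer_batch(pairs):
        """Transfer a batch of (source, destination) pairs in parallel."""
        work, failures = queue.Queue(), []
        for pair in pairs:
            work.put(pair)
        threads = [threading.Thread(target=worker, args=(work, failures))
                   for _ in range(NUM_WORKERS)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return failures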

The chimera-support email list was used to support the virtual data system tools, with the support load being jointly shared between the University of Chicago and ISI.

Major enhancements were made in the virtual data catalog and the run-time environment for virtual data jobs, to record the detailed execution environment and run-time behavior of invoked transformations. An execution wrapper named “gridlaunch” was created to give each executed transformation a uniform run-time environment, to record the details of the invocation’s return code and exit status, to record the amount of system resources consumed by the invocation, and to capture any log and error output produced by the invocation.
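The wrapper idea can be sketched as follows: launch the transformation, record its exit code, wall time, and resource usage, and emit a provenance record for the catalog. This is a minimal illustration only; the element names are invented and do not reflect the real kickstart/gridlaunch record schema.

    # Minimal sketch of a kickstart/gridlaunch-style launcher: run a
    # transformation, capture exit status, wall time, resource usage, and
    # stdout/stderr, and emit an XML-style provenance record. The element
    # names here are invented for illustration.
    import resource
    import subprocess
    import sys
    import time
    from xml.sax.saxutils import escape

    def launch(argv):
        start = time.time()
        proc = subprocess.run(argv, capture_output=True, text=True)
        elapsed = time.time() - start
        usage = resource.getrusage(resource.RUSAGE_CHILDREN)
        record = (
            '<invocation exitcode="%d" duration="%.3f" utime="%.3f" '
            'stime="%.3f" maxrss="%d">\n'
            '  <command>%s</command>\n'
            '  <stdout>%s</stdout>\n'
            '  <stderr>%s</stderr>\n'
            '</invocation>'
            % (proc.returncode, elapsed, usage.ru_utime, usage.ru_stime,
               usage.ru_maxrss, escape(" ".join(argv)),
               escape(proc.stdout), escape(proc.stderr)))
        sys.stdout.write(record + "\n")
        return proc.returncode

    if __name__ == "__main__":
        sys.exit(launch(sys.argv[1:]))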

Work on datasets and a supporting dataset typing system has continued during this period, but has proven to be an exceedingly difficult problem and has not yet reached the stage where results can be included into the virtual data system. This area will continue to be a focus of our virtual data research in the next project year.

Research in virtual data at U of Chicago has been conducted by Yong Zhao, Ian Foster, Jens Voeckler, and Mike Wilde. This research has been focused on virtual data discovery and metadata management, and has yielded several developments:

• a web-based visual interface for the discovery and management of virtual data definitions

• support for user-provided metadata in the virtual data model and catalog

• support for a unified treatment of services and application programs as virtual data transformations.

The current virtual data system offers a set of command-line interfaces, but users across all these domains need an interactive environment in which they can easily discover and share virtual data products, compose workflows and monitor their executions across multiple grid sites, and integrate virtual data capabilities into their specific user communities and science applications. Moreover, it is nontrivial to set up and configure the virtual data system and tie it together with grid compute and storage resources.

These considerations led Yong Zhao to develop, apply, and evaluate Chiron, a “Grid portal” designed to allow for convenient interaction with the virtual data system and to provide an integrated, interactive environment for virtual data discovery and integration [Chiron]. The portal has been used extensively by the QuarkNet project as the platform for its physics education applications.

The Chiron virtual data portal is an integrated environment that provides a single access point to virtual data and grid resources. It allows users to

• manage user accounts and track user activities;

• describe and publish applications and datasets;

• discover, validate, share and reuse applications and datasets published by other users;

• configure and discover Grid resources;

• compose workflows and dataset derivations;

• execute workflows either locally or in a Grid environment and monitor job progress; and

• record provenance information.

Chiron provides both user level and service (middleware) level functionalities that portal users can use to build customized virtual data interfaces for specific user communities and science applications.

The architecture of the Chiron interface is depicted below, along with workflows captured from several applications using the GriPhyN Virtual Data Language:

[pic]

Figure 1 Chiron System Architecture

[pic]

Figure 2 Workflows for CompBio, SDSS, Quarknet, and TerraServer

Yong also successfully integrated web service invocation into Chimera, so that VDL can provide a common abstract description of both applications and web services; users can compose workflows based on this abstract description, invoke applications or web services in the planning stage, and track the provenance of both. He is also prototyping VDL2, which adds a type system and will allow intelligent workflow composition and dataset management.

The lifecycle of virtual data consists of description, discovery, validation, and reuse, and the key to virtual data sharing is discovery. Metadata facilitates flexible and powerful discovery of virtual data. Yong enhanced the virtual data system with metadata management capabilities, which allow users to associate metadata attributes with virtual data objects and search for them in a SQL style query language.
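As a concrete illustration of attribute-based discovery, the sketch below stores metadata as (object, attribute, value) triples and searches them with SQL. The schema and the query are hypothetical stand-ins for the virtual data catalog's own facilities.

    # Illustrative sketch of attribute-based virtual data discovery: metadata
    # stored as (object, attribute, value) triples and searched with SQL. The
    # schema and query are hypothetical stand-ins for the VDC's own facilities.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE metadata (object TEXT, attribute TEXT, value TEXT)")
    con.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [("galaxy_cluster.out", "experiment", "SDSS"),
         ("galaxy_cluster.out", "creator", "yongzh"),
         ("mc_events.root", "experiment", "ATLAS")])

    # Find all virtual data objects produced for SDSS.
    rows = con.execute(
        "SELECT DISTINCT object FROM metadata "
        "WHERE attribute = 'experiment' AND value = 'SDSS'").fetchall()
    print([r[0] for r in rows])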

2 Grid Workflow Language

Gridflow (or Grid Workflow), the work of Reagan Moore and Arun Jagatheesan at SDSC, is a form of workflow that is executed on a distributed infrastructure shared between multiple autonomous participants, according to a set of local domain policies and global grid policies. Since the resources are shared, the execution of these tasks has to obey constraints created by the participants. Scientific applications executed on the grid often involve multiple data processing pipelines that can take advantage of the gridflow technology.

A language to describe scientific pipelines needs an abstract description of the execution infrastructure, to facilitate late binding of resources in a distributed grid environment. The language has to provide the ability to describe or refer to virtual data, derived data sets, collections, virtual organizations, ECA (event-condition-action) rules, gridflow patterns, and fundamental data grid operations. Even though traditional workflow languages are available, they do not provide a way to describe the fundamental data grid operations as part of the language itself. In addition, the language needs to support data types like logical collections or logical data sets for location- and infrastructure-independent descriptions of gridflow pipelines. Additional requirements we have observed in production data grid projects are the description of simple ECA rules that govern the execution of gridflows and a way to query another executing gridflow as part of the gridflow language itself.

The SDSC Matrix Project, supported by GriPhyN, carries out research and development to create a gridflow language capable of describing collaborative gridflows for emerging grid infrastructures. The GriPhyN Data Grid Language (DGL) developed by GriPhyN IT research is currently used in multiple data grid projects at SDSC. DGL is an XML Schema-based language that defines the format of gridflow requests and gridflow responses. DGL borrows well-known computer science concepts from compiler design (recursive grammar, scoped variables, and execution stack management), data modeling (schema definitions for all logical data types), and grid computing (fundamental data grid operations), and uses features such as XQuery to query gridflow execution. DGL currently supports the following gridflow patterns: sequential flow, parallel flow, while loop, for loop, milestone execution, switch-by-context, concurrent parameter sweep, n-of-m execution, and others. It also allows description of pre-process and post-process rules to be executed, very similar to triggers in databases.

As part of the drive to make DGL an accepted standard for gridflow description, query and execution, we are investigating the integration of DGL with other languages like BPEL [BPEL], to describe web service based process pipelines and VDL [VDL] to describe virtual data within DGL. Researchers from SDSC and ISI collaborated on the development of research prototypes to compile DGL at runtime into a format called DAX (DAG in XML). This format serves as input for the GriPhyN Pegasus Planner [PEGS] and can be used to start DGL computational processes in the grid using Condor-G [CONG]. The acceptance of GriPhyN DGL in multiple data grid projects could lead it to become a viable abstract assembly language for grid computing [SCEC].

3 Grid Workflow Planning

During the third year, GriPhyN research at ISI focused on exploring the elements of the request planning space. This work made many advances during the period; it was led by Ewa Deelman and involved the efforts of Gaurang Mehta and Karan Vahi.

Deferred planning has been incorporated into the Pegasus framework. In this mode, Pegasus defers the mapping of jobs until they are ready to run. It consists of two steps:

• Partitioning the workflow

• Creating a Condor MegaDAG to run the partitions

1 Partitioning the workflow

Partitioning involves subdividing the workflow into smaller workflows (partitions) using a partitioner. Currently, the partitioner can partition the workflow in the following two ways:

Level-based partitioning: In this approach, a modified breadth-first search of the abstract workflow is performed.

Single-node partitioning: Each job in the abstract workflow is treated as a partition. This is being used to integrate Euryale functionality into Pegasus; such partitioning delays the mapping decision until just before a job is about to run.

Figure 3 shows a breadth-first partitioning:

Figure 3: A simple workflow partitioning scheme. The resulting abstract workflow determines the lookahead for Pegasus.
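The level-based strategy can be summarized with a small sketch of our own (not the Pegasus partitioner itself): compute each job's breadth-first depth from the workflow's dependency lists and group jobs of equal depth into one partition.

    # Simplified illustration of level-based workflow partitioning: jobs at the
    # same breadth-first depth form one partition. The example DAG is invented.
    from collections import defaultdict, deque

    def partition_by_level(parents):
        """parents maps each job to the list of jobs it depends on."""
        level = {}
        ready = deque(j for j, deps in parents.items() if not deps)
        for j in ready:
            level[j] = 0
        while ready:
            job = ready.popleft()
            for child, deps in parents.items():
                if child not in level and job in deps and all(d in level for d in deps):
                    level[child] = max(level[d] for d in deps) + 1
                    ready.append(child)
        partitions = defaultdict(list)
        for job, lvl in level.items():
            partitions[lvl].append(job)
        return dict(partitions)

    # A diamond-shaped workflow: 'a' feeds 'b' and 'c', which both feed 'd'.
    print(partition_by_level({"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}))
    # -> {0: ['a'], 1: ['b', 'c'], 2: ['d']}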

2 Creation of “MegaDAGs”

In the default deferred planning mode, each partition is run as a DAGMan job. The DAGMan job maintains the flow of execution for that partition. Hence, one has to create a mega Condor DAG that runs Pegasus to create the concrete Condor submit files for each partition, and then runs that partition as a DAGMan job. The structure of the MegaDAG is determined by the partition graph, which is generated during the partitioning of the original workflow. The partition graph identifies the relationships between the partitions.

[pic]

Figure 4: An example of a MegaDAG.

The above approach is expected to work well for medium to large partitions. However, for partitions with just one node, the overhead of running a DAGMan job can be significant compared to the execution time of the job itself. We are currently investigating other ways to incorporate such partitions into the MegaDAG, where a single-node partition would be run directly as a Condor-G job rather than as a DAGMan job wrapping a Condor-G job.
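A sketch of the MegaDAG idea follows: emit a Condor DAGMan file in which each node corresponds to one partition, a PRE script plans that partition with Pegasus just before it runs, and PARENT/CHILD lines reproduce the partition graph. The submit-file and script names are hypothetical; only the overall structure is meant to be illustrative.

    # Illustrative sketch of MegaDAG generation: for each partition, a DAGMan
    # node whose PRE script runs Pegasus to plan that partition just before it
    # executes. File names and the planning script name are hypothetical.
    def write_megadag(partition_graph, path="mega.dag"):
        """partition_graph maps each partition name to its parent partitions."""
        with open(path, "w") as dag:
            for part in partition_graph:
                # Each node is itself a DAGMan run over the partition's sub-DAG;
                # the submit file would invoke condor_dagman on that sub-DAG.
                dag.write("JOB %s %s.condor.sub\n" % (part, part))
                # Deferred planning: plan the partition only when it is ready.
                dag.write("SCRIPT PRE %s run-pegasus-on-partition %s\n" % (part, part))
            for part, parents in partition_graph.items():
                for parent in parents:
                    dag.write("PARENT %s CHILD %s\n" % (parent, part))

    write_megadag({"p0": [], "p1": ["p0"], "p2": ["p0"], "p3": ["p1", "p2"]})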

3 Retry Mechanism for deferred planning

A retry mechanism is also being incorporated for the deferred planning case that allows retries at the partition level if any failures occur during the execution of a partition. Errors are detected from both kickstart records and Condor log files.

Currently, this has been prototyped as follows. A prescript checks the Condor log files for failures. If an error is found, the resource pool on which the failure occurred is removed from subsequent consideration. To achieve this, a per-partition cache file listing the sites is maintained. The log files are checked to see which job in the partition failed, and the site at which the failed job ran is taken out of the list of runnable execution pools given to Pegasus on retry. Pegasus then regenerates the plan for that partition, ignoring the problem site.
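The prescript logic can be illustrated with a short sketch: scan the Condor log for failed jobs, look up the sites those jobs ran on in the per-partition cache, and hand Pegasus only the surviving sites on retry. The log and cache formats used here are invented for illustration.

    # Sketch of the partition-level retry idea: find which site a failed job ran
    # on (from a per-partition cache of job-to-site mappings) and exclude it from
    # the site list passed to Pegasus on the next planning attempt. The log and
    # cache formats here are invented for illustration.
    def surviving_sites(condor_log, site_cache, all_sites):
        failed_jobs = set()
        with open(condor_log) as log:
            for line in log:
                # Assume the log records failures as: "FAILED <jobname>"
                if line.startswith("FAILED"):
                    failed_jobs.add(line.split()[1])
        bad_sites = set()
        with open(site_cache) as cache:
            for line in cache:
                job, site = line.split()  # assume "<jobname> <site>" per line
                if job in failed_jobs:
                    bad_sites.add(site)
        return [s for s in all_sites if s not in bad_sites]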

4 Site Selection

Site selection techniques and APIs were improved. A round-robin site scheduler was incorporated into Pegasus in addition to the existing random site scheduler.

A non-Java callout site selector has also been incorporated, which allows Pegasus to interface with any external site selector. The callout writes a temporary file containing the relevant job description and grid parameters (execution pools, GridFTP servers), which is parsed by the external site selector. The external site selector writes its solution to stdout, which is piped back to Pegasus.
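Under that callout convention, an external site selector can be very simple, as the following hypothetical sketch shows: read the temporary file of job and resource descriptions, choose a pool, and print the choice on stdout for Pegasus to read back. The key/value file format assumed here is illustrative only.

    # Sketch of an external site selector under the callout convention described
    # above: read the temporary file Pegasus writes (job description plus the
    # candidate execution pools), choose a site, and print the answer on stdout.
    # The key/value file format used here is hypothetical.
    import sys

    def select_site(tmpfile):
        candidates = []
        with open(tmpfile) as f:
            for line in f:
                key, _, value = line.partition("=")
                if key.strip() == "pool":
                    candidates.append(value.strip())
        # Trivial policy for illustration: pick the first candidate pool.
        return candidates[0] if candidates else "NONE"

    if __name__ == "__main__":
        print(select_site(sys.argv[1]))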

5 Simulator

A workflow simulator was developed on top of the NS network simulator to test various workflow scheduling heuristics. The simulator was developed with the idea of testing scheduling techniques before their incorporation into Pegasus. We have evaluated the min-min, max-min, and sufferage heuristics within the simulator. In addition, an AI-based search technique (GRASP) is being evaluated.

The simulator allows the user to specify sites, the hosts making up each site, the compute power of the hosts, and the bandwidths between sites. For detailed simulations, one can either use ns routing during the simulation to transfer data between sites or use average bandwidth data obtained from previous runs to estimate runtime bandwidths.

6 Transformation Catalog

A new Transformation Catalog (TC) has been implemented using the MySQL database. Schemas to support Postgres and other databases will be provided by UChicago. The other implementation is a file-based TC. A common tc-client has been implemented which can work with any TC implementation (stock or user provided) as long as it adheres to the TC APIs. The operations supported are basic add/update mechanisms for information about transformations, the profiles associated with transformations, and the transformation type, architecture, and OS; operations to delete and query the TC are also provided.
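The following in-memory sketch illustrates the kind of operations the TC API exposes (add, query, delete) over entries carrying the site, physical path, type, architecture, OS, and profiles. It illustrates the shape of the interface only; it is not the real tc-client.

    # Minimal in-memory illustration of Transformation Catalog operations
    # (add, query, delete). The entry fields mirror those described above
    # (site, physical path, type, architecture, OS, profiles), but this is
    # not the real tc-client interface.
    class TransformationCatalog:
        def __init__(self):
            self.entries = []

        def add(self, logical_tx, site, physical_path, tx_type="INSTALLED",
                arch="INTEL32", os_name="LINUX", profiles=None):
            self.entries.append({
                "tx": logical_tx, "site": site, "path": physical_path,
                "type": tx_type, "arch": arch, "os": os_name,
                "profiles": profiles or {}})

        def query(self, logical_tx, site=None):
            return [e for e in self.entries
                    if e["tx"] == logical_tx and (site is None or e["site"] == site)]

        def delete(self, logical_tx, site):
            self.entries = [e for e in self.entries
                            if not (e["tx"] == logical_tx and e["site"] == site)]

    tc = TransformationCatalog()
    tc.add("sdss::findclusters", "uchicago", "/opt/sdss/bin/findclusters",
           profiles={"env": "SDSS_HOME=/opt/sdss"})
    print(tc.query("sdss::findclusters"))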

7 Performance tests and tuning

ISI has been tuning the Condor pool settings using the grid monitor and changing Condor submission rates to see what overhead, if any, can be removed.

Full Pegasus performance tests have not yet been done, but improvements have been made slowly and steadily, such as the addition of bulk RLS LFN and attribute queries, which substantially decrease RLS query time for huge workflows. RLS mechanism jobs have also been moved from the vanilla universe to the scheduler universe to allow them to run much faster.

A full Java profiler still needs to be run on a huge workflow with Pegasus to identify any remaining hot spots in the code, but this requires a machine with a large amount of memory to observe all the states.

Discussions have also been held with UChicago to set up nightly CVS validation tests with real examples from ATLAS, Montage, SDSS, and LIGO.

4 Research on Data Placement Planning

The research effort on this topic at Chicago considered scheduling and planning algorithms for large jobs which need large input data files, as is typical in high energy physics experiments. Our studies (both simulation based and on a Grid testbed) indicate that decoupling replication and scheduling strategies works to efficiently use Grid resources and decrease response times of users. In our design for an architecture to support this model of data placement planning, three elements manage the distribution of jobs and data files. Each user is associated with an External Scheduler (ES), and submits jobs to that ES. The Grid can have any number of ES's and different combinations of users and ES's give rise to different configurations. The ES then decides which remote site to send the job to depending on the scheduling algorithms being tested.

Jobs submitted to a site are managed by the site's Local Scheduler (LS). The LS decides which job to run on which local processor, depending on the local scheduling policy or algorithm. Each site also has a Data Scheduler (DS), which manages the files present at that site. The DS decides when to replicate data files to remote sites (depending on a replication policy). It also decides when and which files to delete from local storage.
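The decoupling can be made concrete with a small sketch: an External Scheduler that places each job at the site holding most of its input data, per-site queues standing in for the Local Schedulers, and a Data Scheduler policy that replicates the most popular files. The specific policies shown are illustrative choices, not the exact algorithms evaluated in this work.

    # Simplified sketch of the decoupled architecture described above: an
    # External Scheduler (ES) choosing sites, per-site Local Scheduler (LS)
    # queues, and Data Scheduler (DS) state for popularity-driven replication.
    # The policies shown are illustrative, not the exact algorithms studied.
    from collections import Counter, deque

    class Site:
        def __init__(self, name):
            self.name = name
            self.files = set()           # files held at this site (DS state)
            self.queue = deque()         # LS job queue
            self.popularity = Counter()  # DS file-access counts

    class ExternalScheduler:
        def __init__(self, sites):
            self.sites = sites

        def submit(self, job_name, input_files):
            # Data-affinity policy: run where most of the input data already is.
            best = max(self.sites, key=lambda s: len(s.files & set(input_files)))
            best.queue.append((job_name, input_files))
            best.popularity.update(input_files)
            return best.name

    def replicate_popular(source, target, top_n=1):
        # DS policy sketch: copy the most-requested files to another site.
        for f, _ in source.popularity.most_common(top_n):
            target.files.add(f)

    a, b = Site("siteA"), Site("siteB")
    a.files.update({"run01.dat", "run02.dat"})
    es = ExternalScheduler([a, b])
    print(es.submit("analysis-1", ["run01.dat", "run02.dat"]))  # -> siteA
    replicate_popular(a, b)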

We have implemented the architecture described above on a large Grid testbed (Grid3), both by implementing new modules and by using the Globus infrastructure for basic services such as replica location and authentication. A simulator, ChicSim, was also constructed to evaluate different Grid scheduling and replication strategies. ChicSim is a modular and extensible discrete event Data Grid simulation system built using Parsec (a discrete event simulation language). A wide variety of scheduling and replication algorithms can be implemented and their joint impact evaluated using ChicSim. Discrete event simulation is typically composed of entities and messages (interactions between the entities). The main entities in our simulation correspond to our view of a grid as three distinct components: the underlying network infrastructure, the actual entities that are part of the grid, and the applications that run at those entities.

We have designed and implemented a Grid testbed setup that runs many of the simulated experiments on real-life scenarios from the GriPhyN science experiments. This work is based on the execution of a Chimera workload. We have implemented, tested, and used new modules for file transfer jobs, a data movement scheduler, a site selector, and a file popularity updater used for both cache management and dynamic replication of files.

We have run a large number of jobs on the 27-site Grid3 testbed, using a range of application workloads, to compare our various data and job scheduling strategies. These results corroborate our earlier simulation results, confirming that it is important for job-placement decisions to include data-location information, among other parameters. We also find that dynamic replication of popular datasets helps ease load across sites.

5 Policy-based resource sharing and allocation mechanisms

Catalin Dumitrescu at U of Chicago conducted research on policy-based resource allocation for Grid-enabled environments. This effort focused on challenging policy issues that arise within virtual organizations (VOs) that integrate participants and resources spanning multiple physical institutions. Its goal is to create techniques for site and VO resource allocation policies to be expressed, discovered, interpreted, reconciled and enforced by, for, and to Grid jobs in order to maximize both local site and distributed VO utilization and value.

Resource sharing within Grid collaborations usually implies specific sharing mechanisms at participating sites. Challenging policy issues can arise within virtual organizations (VOs) that integrate participants and resources spanning multiple physical institutions. Resource owners may wish to grant to one or more VOs the right to use certain resources subject to local policy and service level agreements, and each VO may then wish to use those resources subject to VO policy. Thus, we must address the question of what usage policies (UPs) should be considered for resource sharing in VOs.

Additional work [UPRS, CORC], done during a summer internship at IBM, is closely related to and continues the GriPhyN research by examining the relationship between SLAs and usage policies. This work focused on agreement-based resource allocation for Grid-enabled environments, specifically the use of agreement-based resource reservation to provide guarantees of access to certain resources within VOs. It explored how such agreements are negotiated, deployed, and monitored in order to signal agreement or policy violations.

6 Workflow Execution and Data Movement

In the last year, the University of Wisconsin-Madison team has made significant progress on request execution and data movement.

We have continued developing Stork, our tool for data placement. Stork has become significantly more stable. In addition to the graduate student who initially developed Stork, two academic staff employees have contributed to hardening Stork and adding features. This allows progress on the research aspects of Stork while simultaneously ensuring that users have access to stable software.

Stork has added support for more protocols. It has also improved security in two ways: first, by allowing multiple users to share a Stork server without sharing certificates, and second by using MyProxy to control access to proxy certificates.

Recently, collaborators have been showing increased interest in Stork, both within and outside GriPhyN. Within GriPhyN, the Pegasus team has been working with Stork to see if it can be integrated into the workflows generated by Pegasus. Outside of GriPhyN, collaborators from LCG at CERN are investigating Stork to see if it can be used in their workflows. We hope to build on this developing momentum by continuing to support these users and finding new users for Stork.

We have continued development of DAGMan for executing workflows. We have recently had conversations with the Virtual Data System team about enhancements needed in DAGMan, and we have implemented an improved retry capability to allow DAGMan to interact more easily with planners.

7 Peer-to-Peer Usage Patterns in Data Intensive Experiments

The GriPhyN project addresses the data sharing and analysis requirements of large-scale scientific communities through a distributed, transparent infrastructure. This infrastructure necessarily includes components for file location and management as well as for computation and data transfer scheduling. However, there is little information available on the specific usage patterns that emerge in these data-intensive, scientific projects. As a consequence, current designs either employ a cover-all approach (provisioning for all possible usage patterns and not optimizing for any), or start from unproved assumptions on user behavior by extrapolating results from other systems (such as file systems, the web, etc.). If inappropriate, these assumptions may lead to inadequate design decisions or to performance evaluations under unrealistic conditions. Research in this area was conducted at the University of Chicago by Foster, Iamnitchi, Ranganathan, and Ripeanu. [IR02]

In [RR+03] we analyzed the performance of incentive schemes in P2P file-sharing systems by defining and applying an analytic model based on Schelling’s Multi-Person Prisoner’s Dilemma (MPD). We use both this framework and simulations to study the effectiveness of different incentive schemes for encouraging sharing in distributed file sharing systems.

Internet traffic is experiencing a shift from web traffic to file-swapping traffic. A significant part of Internet traffic is generated by peer-to-peer applications, mostly by the popular Kazaa application. Yet, to date, few studies have analyzed Kazaa traffic, thus leaving the bulk of Internet traffic in the dark. In [LRW03] we present a large-scale investigation of Kazaa traffic based on logs collected at a large ISP, which capture roughly a quarter of all traffic between Israel and the US.

The Gnutella network was studied as a potential model for file sharing in large scientific collaborations. The open architecture, achieved scale, and self-organizing structure of this network make it an interesting P2P architecture to analyze. The main contributions of this macroscopic evaluation are: (1) although Gnutella is not a pure power-law network, its current configuration has the benefits and drawbacks of a power-law structure, (2) we estimate the aggregated volume of generated traffic, and (3) the Gnutella virtual network topology does not match well the underlying Internet topology, hence leading to ineffective use of the physical networking infrastructure. This research is reported in [RF02].

Further research in P2P resource discovery was reported in “A peer-to-peer approach in resource discovery in grid environments,” presented at HPDC-11 in July 2002, and in our poster “On Fully Decentralized Resource Discovery in Grid Environments” at the same conference.

In [FI03] we contend that any large-scale distributed system must address both failure and the establishment and maintenance of infrastructure.

We propose a novel structure, the data-sharing graph, for characterizing sharing patterns in large-scale data distribution systems. We analyze this structure in two such systems and uncover small-world patterns for data-sharing relationships. We conjecture that similar patterns arise in other large-scale systems and that these patterns can be exploited for mechanism design.

Lastly, continuing important work on the replica location service (RLS) from the previous period, we proposed a general framework for creating replica location services, which we call the GIGa-scale Global Location Engine (Giggle). This framework defines a parameterized set of basic mechanisms that can be instantiated in various ways to create a wide range of different replica location services. By adjusting the system parameters we can trade off reliability, storage and communication overhead, and update and access costs. Matei Ripeanu assisted in the construction and operation of the Giggle world-wide replica location challenge demonstrated at SC'02.

8 Resource Discovery

Adriana Iamnitchi completed her dissertation "Resource Discovery in Large-scale Resource Sharing Environments" and received her Ph.D. from the U of Chicago in December 2003.

[SWFS] studies the commonalities between web caches, content distribution networks, peer-to-peer file sharing networks, distributed file systems, and data grids. These all involve a community of users who generate requests for shared data. In each case, overall system performance can be improved significantly if we can first identify and then exploit interesting structure within a community's access patterns. To this end, we propose a novel perspective on file sharing based on the study of the relationships that form among users based on the files in which they are interested.

This work proposes a new structure that captures common user interests in data, the data-sharing graph, and justifies its utility with studies on three data-distribution systems: a high-energy physics collaboration, the Web, and the Kazaa peer-to-peer network. We find small-world patterns in the data-sharing graphs of all three communities. We analyze these graphs and propose some probable causes for these emergent small-world patterns. The significance of small-world patterns is twofold: they provide rigorous support for intuition and, perhaps most importantly, they suggest ways to design mechanisms that exploit these naturally emerging patterns.
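The construction of a data-sharing graph is simple to illustrate: users become nodes, and two users are linked when they request at least a threshold number of common files. The request data in the sketch below is made up; the actual studies used traces from the three communities named above.

    # Illustrative construction of a data-sharing graph: users become nodes,
    # and two users are linked when they request at least `threshold` common
    # files. The request data below is invented for illustration.
    from itertools import combinations

    def data_sharing_graph(requests, threshold=2):
        """requests maps each user to the set of files that user accessed."""
        edges = set()
        for u, v in combinations(requests, 2):
            if len(requests[u] & requests[v]) >= threshold:
                edges.add((u, v))
        return edges

    requests = {
        "alice": {"f1", "f2", "f3"},
        "bob":   {"f2", "f3", "f4"},
        "carol": {"f5"},
    }
    print(data_sharing_graph(requests))   # -> {('alice', 'bob')}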

In [SWUP] we built on previous work showing that the graph obtained by connecting users with similar data interests has small-world properties. We conjectured then that observing these properties is important not only for system description and modeling, but also for designing mechanisms. In this paper we present a set of mechanisms that exploit the small-world patterns for delivering increased performance. We propose an overlay construction algorithm that mirrors the interest similarity of data consumers and a technique to identify clusters of users with similar interests. To appreciate the benefits of these components in a real context, we take file location as a case study: we show how these components can be used to build a file-location mechanism and evaluate its performance.

Iamnitchi’s research focused on designing decentralized mechanisms for the management of large-scale distributed systems and on understanding the associated tradeoffs between performance and scalability. Grid and peer-to-peer (P2P) resource-sharing environments provide the real-world challenges for this research. These two environments appear to have the same final objective, the pooling and coordinated use of large sets of distributed resources, but they are rooted in different communities and, at least in their current instantiations, focus on different requirements: while Grids provide complex functionality to relatively small, stable, homogeneous scientific collaborations, P2P environments provide rudimentary services to large-scale, untrusted, and dynamic communities.

This research explored the convergence area of these two environments: mechanisms that cope with the scale and volatility of peer-to-peer systems while providing complex services typical of Grids. Several relevant problems were explored in this context: resource and file location, replication for data availability, characterization of system and usage behavior, performance predictions for a numerical simulation application, and fault-tolerant algorithms. In order to deal with large scale and unpredictability, the solutions often exploit problem and usage characteristics to improve performance.

9 Grid Fault Tolerance and Availability

Research on Grid fault tolerance at UCSD has concentrated on the infrastructure needed to support fault tolerance in a Grid services architecture. At first glance, this does not seem to be a terribly challenging problem: there are commercial products, for example, that support making network services highly available. Looking more closely, however, there are four aspects of Grid services that need to be considered when high availability is desired:

Engineering is an issue. Despite decades of experience demonstrating repeatedly that fault-tolerance, like security, is an aspect of a service that is best considered at the onset of design, it is almost always added after deployment. Luckily, many of the design constraints for scalable services (such as statelessness, which means that any state a service maintains is held in a database) make retrofitting high availability easier.

Grid services are often stateful. This means that there can be information in a server that is not stored in a database. Grid services are stateful in part to avoid the cost of using a database, and also because much of the state is "soft": it can be recovered in the face of a failure. This complicates retrofitting high availability.

Grid services are often nondeterministic. Services are often nondeterministic for low-level reasons (e.g., when interrupts occur), which makes replication hard. But many grid services are frankly nondeterministic; for example, resource schedulers often make nondeterministic assignments in order to spread load more effectively.

Grid services are built on high-level protocols. One of the major trends in distributed systems has been the constant raising of the level of abstraction used for gluing RPC-like systems together. We keep going higher to make interconnection easier. The current level, SOAP and HTTP, works well because of its use for interconnecting web services. Building at this high a level has an impact on performance (average latency becomes larger and peak bandwidth may be reduced) and on timeliness (worst-case latency is finite but may be effectively unbounded).

Our status is as follows:

Primary-backup is a logical choice for a stateful, nondeterministic service. It is also amenable to being implemented as a "wrapper" based on interception, and so should be relatively easy to add to an existing service. It has significant drawbacks: it can tolerate only benign failures and requires a bound on the worst-case latency. Building it on top of a high-level protocol is worrisome both in terms of correctness and of performance. So, we have designed and built two primary-backup services for Grid services: one using the high-level protocols supplied by Globus, including a notification service, and one using low-level TCP/IP. The former is attractive for retrofitting purposes. We also built a protocol that used an intermediate approach: state replication was done at a high level but without using the notification service. Not surprisingly, we found the performance of the high-level approaches worse than that of the low-level approach. However, the most severe performance problems were due to the notification service. This gives hope that tuning might make the high-level approach feasible in terms of performance. We have a paper on this work in the upcoming Clusters conference.
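The primary-backup pattern at the heart of this work can be sketched in a few lines: the primary applies each update, forwards it to the backup, and replies to the client only once the backup has acknowledged. The sketch below keeps both replicas in one process and elides the transport entirely; it shows the ordering constraint, not our Globus- or TCP/IP-based implementations.

    # Minimal sketch of the primary-backup pattern: the primary applies each
    # update, forwards it to the backup, and replies to the client only after
    # the backup acknowledges. Transport is elided; both replicas live in one
    # process purely for illustration.
    class Replica:
        def __init__(self):
            self.state = {}

        def apply(self, key, value):
            self.state[key] = value
            return "ack"

    class Primary(Replica):
        def __init__(self, backup):
            super().__init__()
            self.backup = backup

        def handle_request(self, key, value):
            self.apply(key, value)
            # Do not reply to the client until the backup is up to date.
            if self.backup.apply(key, value) != "ack":
                raise RuntimeError("backup update failed")
            return "ok"

    backup = Replica()
    primary = Primary(backup)
    print(primary.handle_request("job-42", "scheduled"))
    print(backup.state)   # the backup mirrors the primary's state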

Our treatment of state is naive. Some of the latency in primary-backup comes from ensuring data is stable before replying to the client (thus avoiding the "output commit" problem). Recall that Grid services are often based on soft state. Hence, data does not have to be as stable in a primary-backup service before replying: it only has to be in the volatile memory of the backups rather than written to disk. This implies that some performance gain can be obtained by associating with each datum the recovery semantics it requires. If one were to do this for a database replication protocol, then one might be able to use a "stateless" approach for Grid services without losing the performance benefits of having the services be based on soft state. We are currently investigating this approach.

We need to address the issue of asynchronous execution. Our current plan is to use an optimistic approach based on Paxos. Paxos is an agreement protocol that is safe but not live: it terminates only if the system behaves in a timely manner for some period of time. Paxos, like all state-machine-based replication approaches, requires all replicas to be deterministic. We believe that we can modify Paxos to work for replicated nondeterministic state machines. We will start investigating this approach this summer.

10 Prediction

This year the TAMU group has worked extensively with the ISI and University of Chicago groups to provide predictions for identifying appropriate resources with the GriPhyN resource planning tools. To get good predictions, one must have mechanisms for collecting and archiving adequate performance data. It is important to collect this information at three different levels:

1. Applications: need to have performance data about the computations and I/O

2. Middleware: need to have performance data about the different middleware functions that interface to the application as well as those that interact with the storage resources.

3. Resources: need to have performance data about the compute, network and I/O resources.

The focus thus far has been on predicting application performance for use with the Pegasus planner. The current approach entails interfacing the Pegasus planner with the Prophesy Predictor. Prophesy is an infrastructure for performance analysis and modeling of parallel and distributed applications. Prophesy includes three components: automatic instrumentation of applications, databases for archiving information, and automatic development of performance models using different techniques. It is the last component, the model builder, that is used to formulate predictions based upon performance data. The current plan is represented in the figure below.

Figure 4. The relations between Prophesy and Chimera/Pegasus

Pegasus receives an abstract workflow description from Chimera, contacts the Prophesy Predictor to get the desired prediction information (this process may be iterative), then produces a concrete workflow and submits it to Condor-G/DAGMan [FT01] for execution. Prophesy also provides interfaces to other performance tools and databases, so that the Prophesy Predictor can make more accurate predictions for the Pegasus planner.
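The shape of that exchange can be sketched as follows: for each transformation, input size, and candidate site, the planner asks the predictor for an estimated runtime and ranks the sites accordingly. The function names and the toy linear models below are hypothetical stand-ins for the Prophesy Predictor interface.

    # Sketch of the planner-to-predictor interaction described above. The
    # interface and the toy linear models are hypothetical stand-ins for the
    # Prophesy Predictor; the point is the shape of the exchange: Pegasus asks
    # for a prediction per (transformation, input size, site) and uses it to
    # rank candidate sites.
    def predict_runtime(transformation, input_mb, site):
        # Toy archived models: runtime (seconds) = a * input_mb + b, per site.
        models = {
            ("geoligo::search", "siteA"): (0.8, 30.0),
            ("geoligo::search", "siteB"): (1.5, 10.0),
        }
        a, b = models[(transformation, site)]
        return a * input_mb + b

    def choose_site(transformation, input_mb, sites):
        return min(sites, key=lambda s: predict_runtime(transformation, input_mb, s))

    print(choose_site("geoligo::search", 100.0, ["siteA", "siteB"]))  # -> siteA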

Currently, we are working on the Pegasus-Prophesy interface using the GeoLIGO application. This benchmark is being used to develop an initial language that allows different queries to predict the compute resources as well as the I/O resources needed for an application. Once the details have been worked through with GeoLIGO, we will work with the other experiments.

The initial performance data is focused on the following events:

• Execution time (real time) and number of processors needed for efficient executions

• File access time (combination of access latency and time to get files in place for execution)

• Input files

• File size and location

• Disk requirements

• Output & temporary files

The results of this work are described in detail in [PRED].

11 GridDB: A Data-Centric Overlay for Scientific Grids

In the 2003-2004 academic year, the GriPhyN group at UC Berkeley generalized its interactive query processing framework from the previous year into GridDB, a full-fledged system for providing data-centric services in scientific grids. The GridDB system integrates ideas from workflow and database systems to address new challenges posed by computational scientists using grid platforms. The GridDB design was informed by a long-running collaboration among various groups inside and outside of GriPhyN: the database group at UC Berkeley; the Chimera team at Argonne National Laboratory; and two groups of scientists, HEP physicists at LBNL and the University of Chicago and astrophysicists at Fermilab. The work on GridDB has resulted in publications in VLDB’04 and SemPGrid’04. It has also served as the basis of David Liu’s master’s thesis at UC Berkeley. Finally, several presentations and demos of the system have been given, which have opened the door for further collaborations, including interest in adoption from LSST, an astrophysics project in its embryonic stages.

GridDB is a software overlay built on top of currently deployed “process-centric” middleware such as Condor and Globus (Figure 5). Process-centric middleware provides low-level services, such as the execution of processes and management of files. In contrast, GridDB provides higher-level services, accessed through a declarative SQL interface. As a result, users can focus on the manipulation of data generated by process workflows, rather than on the workflows themselves.

[pic]

Figure 5: The GridDB data-centric software overlay

GridDB’s model mirrors that of traditional databases: a user first describes workloads during a data definition phase, enabling the system to provide special services to users. Specifically, a modeler describes grid workloads in the Functional Data Model with Relational Covers (FDM/RC), a model we have defined to represent grid workflows and data. In contrast to existing workflow models (e.g., those used by Chimera and DAGMan) and data models (e.g., the relational model), the FDM/RC allows the description of both workflows and the data transformed by workflows. By modeling both concepts, GridDB provides a uniform set of services typically found either in workflow systems or in database systems, including declarative interfaces, type checking, work-sharing and memoization, and data provenance. Furthermore, GridDB is able to provide novel services such as data-centric computational steering and smart scheduling, which arise from the synergy between workflow and data models.
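As a hypothetical illustration of the data-centric interface, the sketch below exposes the relational cover of a simulation workflow as a table that a user queries with ordinary SQL; which jobs must run (or be reused) to materialize the result is left to the system. Here sqlite3 stands in for GridDB and the schema is invented.

    # Hypothetical illustration of GridDB's data-centric interface: the
    # "relational cover" of a simulation workflow is exposed as a table, and a
    # user steers analysis with ordinary SQL rather than by manipulating jobs
    # and files. sqlite3 stands in for GridDB; the schema is invented.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE events (
        event_id INTEGER, pt REAL, detector TEXT, produced_by TEXT)""")
    con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
        (1, 12.5, "barrel", "sim_run_007"),
        (2, 48.1, "endcap", "sim_run_007"),
        (3, 55.9, "barrel", "sim_run_008"),
    ])

    # A data-centric request: the user asks for high-pT events; materializing
    # them (running or reusing simulation jobs) is the system's concern.
    for row in con.execute("SELECT event_id, pt FROM events WHERE pt > 40 ORDER BY pt"):
        print(row)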

We have presented GridDB at several venues and workshops, including the Lawrence Livermore National Lab (LLNL), the Lawrence Berkeley National Lab (LBNL), Microsoft’s Bay Area Research Center (BARC), the 2nd International Workshop on Semantics in Peer-to-peer and Grid Computing (SemPGrid’04), and the GriPhyN 2004 all-hands meeting. As a direct consequence of these talks, we have established a collaboration with the Large Scale Synoptic Telescope (LSST) project, a next-generation astronomical survey requiring real-time and ad hoc analysis of telescope image data.

At this point, we believe that the main research ideas of GridDB are mature, and we are working on rounding out the functionality of the system and transferring the technology into the hands of grid scientists.

VDT – The Virtual Data Toolkit

1 VDT Development, Deployment, and Support Highlights

The VDT has made significant strides in the last year. In particular:

The VDT has upgraded many of the software packages, particularly Condor (which is now version 6.6.5) and Globus (which is now version 2.4.3). In addition, the VDT has roughly doubled the number of components that are included.

The VDT has solidified and strengthened its collaboration with European projects, particularly LCG and EGEE. LCG has contributed many bug reports and bug fixes to the VDT and has proven to be a valuable partner. Collaboration with EGEE is just beginning, but promises to be similarly fruitful.

The VDT was adopted by the PPDG collaboration in the Fall of 2003.

The VDT was installed on 27 sites for Grid2003 in the Fall of 2003. The VDT team interacted with the Grid2003 team to improve the VDT in response to user suggestions, and the team was closely involved in the deployment. Such close involvement is critical to the VDT's success: it is not only useful to the users, but also educational for the VDT team.

Late in 2003, the VDT developed and deployed a nightly test infrastructure, to test both the installation process and the functionality of the VDT. The nightly tests have proven invaluable by catching many errors before the releases of the VDT.

The VDT has expanded the ways it can be installed. In addition to the standard Pacman installation, RPMs are now available for three varieties of Linux. Also, instead of only having a monolithic VDT Pacman installation, users can now install just the portions of the VDT they wish, using the same Pacman mechanism.

During the development of VDT 1.1.13, the VDT provided 21 bug fixes to the Globus Alliance, 15 of which were accepted into the standard Globus distribution. Not all patches were accepted because some components were being replaced with newer components, so some of the bug fixes no longer made sense. In addition, the VDT provided four new Globus features as patches, and two of them were accepted.

Collaboration with NMI has increased, and NMI is now central to the build process for the VDT.

The VDT team hired a new member to work with our European collaborations.

2 The VDT Build and Test Process

Most of the VDT components are built using machinery (software and computers) developed and deployed by the NSF Middleware Initiative. This picture illustrates how the software is built:

[pic]

Figure 6: VDT Building Process and Relation to NMI

1 NMI build

NMI builds several components for VDT 1.1.13: Globus, Condor, MyProxy, KX509, GSI OpenSSH, PyGlobus, and UberFTP. NMI checks the software out of the appropriate CVS repositories, and then patches the software. Currently, only Globus is patched. Each of these software packages uses GPT, so GPT source bundles are then created. These are built and tested on the NMI build pool, then given to the VDT.

The NMI build pool is a cluster of nearly forty computers, and it uses Condor to distribute the builds. There are a wide variety of computer architectures and operating systems in the build pool. When a VDT build is started, we currently tell Condor to run the same build on three architectures: RedHat 7, RedHat 9, and RedHat 7 with the gcc 3 compiler. (This last architecture is a requirement from LCG.) After the build, NMI automatically deploys the software and does basic verification testing to ensure that the software works. After NMI completes the build, the VDT team imports the software into the VDT cache.

There is very close collaboration between the VDT and NMI groups. In fact, one person, Parag Mhashilkar, spends 50% of his time on the VDT and 50% of his time on NMI. He is able to facilitate a close collaboration.

2 VDT build

The VDT team builds a few software packages such as the fault tolerant shell and expat. Currently we are working with the NMI team to use the NMI machinery to build this software for us. As we expand the number of architectures that we support, this will save us a lot of time and will eliminate errors.

3 Contributor builds

Some software in the VDT is built by the contributors who wrote it. The Virtual Data System (VDS), NetLogger, and MonALISA are three such systems. We trust these contributors to give us good builds of their software.

4 Final packaging

When all of the builds have been assembled, the VDT team packages them together. All software goes into a Pacman cache, along with scripts to configure the software during the installation. A subset of the software is turned into RPMs and put onto the VDT web site for download. (We expect this subset to grow larger in the near future, and it will hopefully encompass the entire set of VDT software in the future.)

5 Testing

At this point, the VDT team begins rigorous testing. We have a nightly test that checks that the installation went smoothly (nothing failed, all files look good, etc.) and also tests the functionality of the software (we can submit jobs, transfer files, etc.). After a couple of days of good tests, the VDT Testing Group is brought in to do more testing. This is a group that installs the VDT on a wide variety of sites and does both local testing and testing between sites. After about a week of testing, the VDT is released to everyone.
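A simplified sketch of such a nightly functional check appears below: it submits a trivial job with globus-job-run and copies a file with globus-url-copy, reporting any failures. The host names are hypothetical, and the real VDT tests are considerably more thorough; this only shows the flavor of the checks.

    # Simplified sketch of a nightly functional check in the spirit described
    # above: submit a trivial job and transfer a file, and report any failures.
    # The host names are hypothetical; globus-job-run and globus-url-copy are
    # the standard Globus 2.x client tools shipped in the VDT.
    import subprocess

    CHECKS = [
        ["globus-job-run", "gatekeeper.example.edu", "/bin/date"],
        ["globus-url-copy",
         "gsiftp://gatekeeper.example.edu/tmp/testfile",
         "file:///tmp/testfile"],
    ]

    def run_checks():
        failures = []
        for cmd in CHECKS:
            try:
                result = subprocess.run(cmd, capture_output=True, text=True)
            except FileNotFoundError as err:   # client tools not installed
                failures.append((cmd[0], -1, str(err)))
                continue
            if result.returncode != 0:
                failures.append((cmd[0], result.returncode, result.stderr.strip()))
        return failures

    if __name__ == "__main__":
        for name, code, err in run_checks():
            print("FAILED %s (exit %d): %s" % (name, code, err))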

3 VDT Plans

In the next year, we look forward to meeting several goals, particularly:

• Expanding the number of platforms supported by the VDT, to better meet the needs of our users.

• Improving our testing infrastructure to better test the VDT, particularly compatibility between versions.

• Supporting the emerging web services infrastructure by supporting both Condor 6.7.x and Globus 4.0.

GriPhyN Activities in ATLAS

GriPhyN research for ATLAS is done within the matrix of grid computing projects, the U.S. ATLAS Software and Computing project, and the international ATLAS Collaboration. The GriPhyN-ATLAS group members come from the University of Chicago, Argonne National Laboratory, University of Texas (Northwestern), and Boston University. Work is done closely with PPDG, iVDGL, and the European peer Grid projects, e.g., DataTAG, EDG, and LCG. Over the past year the main areas of concentration were application performance monitoring (covered earlier in the Prophesy section), application packaging/configuration for grid execution, grid portal development for physicist user-interfaces, and prototyping of virtual data methods using SUSY Monte Carlo simulations and ATLAS reconstruction data challenges (DC1) as application drivers.

1 Application Packaging and Configuration with Pacman

Packaging and configuring the ATLAS software so that it is suitable for execution in a grid environment is a major task, but a necessary one before any research using virtual data techniques can proceed. Work continued here in collaboration with iVDGL (Pacman development, and VDT integration for ATLAS). The GriPhyN-related work required designing transformations appropriate for the particular dataset creation task. For the Athena-based reconstruction data challenge, this meant, for example, developing scripts that set up the Athena environment (including specification of input datasets and outputs) and its input parameters (as specified in an accompanying “JobOptions” text file).
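
As an illustration only, the sketch below shows the general shape of such a wrapper: it sources an (assumed) release setup script, points the job at its input and output datasets, and runs Athena against the accompanying JobOptions file. The paths, environment variable names, and exact athena invocation are assumptions, not the actual ATLAS scripts.

```python
"""Illustrative sketch of an Athena transformation wrapper of the kind described above."""


def athena_wrapper(input_lfn: str, output_lfn: str, joboptions: str) -> str:
    """Compose the shell commands a grid job would run for one reconstruction task."""
    return "\n".join([
        "source /opt/atlas/setup.sh",          # assumed release setup script
        f"export INPUT_DATASET={input_lfn}",   # input specification passed to Athena
        f"export OUTPUT_DATASET={output_lfn}",
        f"athena {joboptions}",                # run with the accompanying JobOptions file
    ])


if __name__ == "__main__":
    # A real wrapper would hand this script to the grid job; here we just print it.
    print(athena_wrapper("dc1.zebra.0001", "dc1.recon.0001", "RecExample_jobOptions.txt"))
```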

2 Grappa: a Grid Portal for ATLAS Applications

Work continued on the development of Grappa, a grid portal tool for ATLAS applications. Grappa is a Grid portal effort designed to provide physicists convenient access to Grid tools and services. The ATLAS analysis and control framework, Athena, was used as the target application. Grappa provides basic Grid functionality such as resource configuration, credential testing, job submission, job monitoring, results monitoring, and preliminary integration with the ATLAS replica catalog system, MAGDA.

Grappa uses Jython to combine the ease of scripting with the power of Java-based toolkits. This provides a powerful framework for accessing diverse Grid resources with uniform interfaces. The initial prototype system was based on the XCAT Science Portal developed at the Indiana University Extreme Computing Lab and was demonstrated by running Monte Carlo production on the U.S. ATLAS testbed. The portal also communicated with a European resource broker on WorldGrid as part of the joint iVDGL-DataTAG interoperability project for the IST2002 and SC2002 demonstrations. The current prototype replaces the XCAT Science Portal with an Xbooks JetSpeed portlet for managing user scripts. The work on Grappa was presented at CHEP (Conference on Computing in High Energy Physics) and will be published in the forthcoming proceedings.

3 Virtual Data Methods for Monte Carlo Production

Research on virtual data methods for ATLAS was driven by three goals:

1. To prototype simple “analysis-like” transformations to gain experience with the VDS system.

2. To use VDS to perform SUSY dataset creation; these Monte Carlo simulations contributed to the DC1 dataset.

3. To use VDS to perform reconstruction of DC1 output.

Several talks on these results were given at ATLAS meetings in the U.S. and at CERN.

We employ and/or interface with grid components, services, and systems from a large number of development projects, with the initial aim of maximizing ATLAS’s use of grid resources in a production environment. An environment specific to the recent reconstruction data challenge, and making use of the Chimera virtual data system, was constructed. Figure 7 below shows the major functional components in the current “GCE environment”. The two main components that require user intervention are the “submit host” (for the client/physicist) and the “computing host” (for the site administrator).

This environment was packaged and documented, based on VDT 1.1.9, Chimera VDS 1.0.3, and ATLAS release version 6.0.3. A set of command line tools was developed to create transformations and load them into a virtual data catalog (VDC); to generate abstract DAGs corresponding to a particular reconstruction task, expressed as “reconstruct this list of partitions (files) from this particular dataset”; and to generate concrete DAGs and submit the jobs to remote execution hosts using Condor-G.
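
The sketch below mirrors the steps of that pipeline with stand-in helpers; the function names are placeholders rather than the actual VDS command-line tools, and only condor_submit_dag is a real command assumed to be on the path.

```python
"""Sketch of the reconstruction pipeline described above, with stand-in helpers."""

import subprocess


def register_transformation(name: str, catalog: str) -> None:
    """Stand-in for loading a transformation definition into the virtual data catalog (VDC)."""
    print(f"registered transformation {name} in {catalog}")


def derive_abstract_dag(dataset: str, partitions: list) -> str:
    """Stand-in for 'reconstruct this list of partitions from this dataset' -> abstract DAG."""
    print(f"abstract DAG covering {len(partitions)} partitions of {dataset}")
    return f"{dataset}.adag.xml"


def plan_concrete_dag(abstract_dag: str, site_catalog: str) -> str:
    """Stand-in for binding the abstract DAG to concrete execution sites."""
    return abstract_dag.replace(".adag.xml", ".cdag.dag")


def reconstruct(dataset: str, partitions: list) -> None:
    register_transformation("atlas-reconstruct", catalog="vdc.db")
    abstract_dag = derive_abstract_dag(dataset, partitions)
    concrete_dag = plan_concrete_dag(abstract_dag, site_catalog="sites.xml")
    # Hand the concrete DAG to Condor-G/DAGMan for execution on the remote hosts.
    subprocess.run(["condor_submit_dag", concrete_dag], check=True)


if __name__ == "__main__":
    reconstruct("dc1.002000", partitions=["0001", "0002", "0003"])
```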

Before running the GCE reconstruction framework, the user defined a dataset and a list of partitions to reconstruct. Input data consisted of Monte Carlo simulation output (Zebra files) and was specified by logical file identifiers (LFNs) following a particular naming convention. To determine the site/location of the corresponding physical file names (PFNs), we used the MAGDA file catalog, through both its web interface[2] and its client tools. Users also interacted with a production database and a “Cookbook” of JobOptions files.
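
The lookup step can be pictured as below; the catalog contents and naming convention are invented for illustration, whereas the real mapping lived in MAGDA and was queried through its web interface and client tools.

```python
"""Minimal sketch of the LFN-to-PFN lookup step described above."""

# A toy stand-in for the replica catalog: logical file name -> physical replicas.
CATALOG = {
    "dc1.002000.simul.0001.zebra": [
        "gsiftp://tier2.example.edu/atlas/dc1/dc1.002000.simul.0001.zebra",
    ],
}


def resolve(lfn: str) -> str:
    """Return one physical file name (PFN) for the given logical file name (LFN)."""
    replicas = CATALOG.get(lfn)
    if not replicas:
        raise LookupError(f"no replica registered for {lfn}")
    return replicas[0]


if __name__ == "__main__":
    print(resolve("dc1.002000.simul.0001.zebra"))
```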

[pic]

Figure 7: ATLAS Grid Component Environment used for reconstruction data challenge

GriPhyN Activities in CMS

The CMS group within the GriPhyN Collaboration includes the California Institute of Technology, the Fermi National Accelerator Laboratory, the University of California-San Diego, the University of Florida, and the University of Wisconsin-Madison. All activities continue to be conducted in close collaboration with the iVDGL, the Particle Physics Data Grid (PPDG), and the US-CMS Software and Computing Project (USCMSSC).

1 Data Challenge 04 and Production of Monte Carlo Simulated CMS Data

During year four—aided by the earlier prototyping, testing and hardening efforts—large scale and continuous production of simulated CMS data was achieved. Relevant components of the grid enabled US-CMS Production system currently include a centralised Production metadata database, known as the RefDB [REFDB2], a metadata manager and Production workflow generator, known as MCRunJob [MCRUN], and a grid job-planner, known as MOP [MOP]. A typical CMS production request requires that many thousands of jobs be created, submitted, and monitored for successful completion. In addition, CMS requires that all sites producing CMS simulated data read their input parameters from the RefDB metadata database and write back to the database a list of values describing how and when the data was produced [REFDB1].
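
The bookkeeping contract this imposes on each site is sketched below: read the request’s input parameters from RefDB, run the jobs, and write back how and when the data was produced. The function bodies and parameter values are toy stand-ins; in the real system MCRunJob generates the workflow and MOP plans the grid jobs.

```python
"""Sketch of the production bookkeeping loop described above, with stand-in RefDB access."""

import datetime


def fetch_request_parameters(request_id: str) -> dict:
    """Stand-in for reading a production request's input parameters from RefDB."""
    return {"generator": "pythia", "events_per_job": 250, "n_jobs": 4}  # toy values


def record_provenance(request_id: str, job_id: int, outcome: dict) -> None:
    """Stand-in for writing back how and when the data was produced."""
    print(f"RefDB <- request {request_id}, job {job_id}: {outcome}")


def run_production(request_id: str) -> None:
    params = fetch_request_parameters(request_id)
    for job_id in range(params["n_jobs"]):          # a real request spawns thousands of jobs
        outcome = {
            "events": params["events_per_job"],
            "finished": datetime.datetime.utcnow().isoformat(),
            "status": "OK",
        }
        record_provenance(request_id, job_id, outcome)


if __name__ == "__main__":
    run_production("example_request")  # hypothetical request name
```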

1 Pre-parallel-post-Challenge Production using Grid3

Production of CMS simulated data was distributed across the resources of Grid3 [GRID3] during year four, helping the production effort to reach new milestones in scale and overall throughput. Grid3 is a consortium of GriPhyN, iVDGL, PPDG, and several experiments, for which GriPhyN provides the grid middleware infrastructure. CMS participation in Grid3 resulted in an additional 50% of simulated events over what was produced by CMS-owned computing resources. Over 15 million events were simulated with a GEANT4 application over a 6 month period (see Figure 8), and peak simultaneous CPU usage reached 1200 CPUs managed by a single FTE. The achieved overall throughput corresponds to approximately one CPU century and 15 TB of data produced. The overall number of files produced was on the order of 45,000.

[pic]

Figure 8: USCMS simulation throughput. The lower area (red) represents the number of events produced on canonical USCMS resources, whilst the upper area (blue) represents the number of events produced on non-US-CMS resources from Grid3.

2 Grid Enabled Analysis of CMS Data

Activities in grid-enabling the analysis of CMS data continued during year four and, as in previous years, the CMS software environment also continues to evolve at a fast rate. Relevant components for the local analysis of CMS data include the CMS software frameworks COBRA [COBRA], CARF, ORCA, and IGUANA [IGUANA], as well as the LHC data persistency implementation, known as POOL [POOL]. The CMS participants in GriPhyN are collaborating closely with PPDG [PPDG] to design, prototype, and build grid-enabled systems of distributed services supporting both interactive and scheduled analysis of distributed CMS data.

1 CMS Analysis Application Activities and Status

Over the past year, the CMS software environment has matured to the level of supporting local analysis of CMS data. As a result, a significant effort is being devoted towards the development of several prototype data analysis projects, including a Higgs search analysis (Caltech), an exotic-physics search analysis and a standard model Drell-Yan dimuon analysis (both at Florida). These analysis projects serve as analysis benchmarks to define the user-level interaction with a grid-enabled analysis environment. In addition, these activities serve to provide a detailed understanding of the actual application execution environment, metadata management needs for data discovery, and job preparation/submission requirements for local and distributed CMS data analysis.

As part of the recently completed DC04, several complete CMS datasets (also called DSTs, for the misnomer “Data Summary Tapes”), representing several million fully simulated and reconstructed events and amounting to several gigabytes, were transferred from the CERN Tier-0 site to the Fermilab Tier-1 site and subsequently distributed to the Caltech and Florida Tier-2 sites. Now, in the post-DC04 phase, these datasets are available for analysis by physicists. Prototype applications of the above benchmark analyses are being developed using the CMS ORCA libraries and have, so far, analysed over 1 million events from DST datasets at local Tier-2 resources.

From this work, several important lessons and requirements for current CMS data analysis that are relevant to GriPhyN integration activities have emerged; they are summarized below:

• The CMS POOL metadata model (critically required for managing data access and data discovery) is not yet mature and is rapidly evolving. This presents a challenge to the integration of CMS with GriPhyN technology, as understanding the metadata model is crucial for identifying input/output datasets and in building analysis applications to process the data.

• The model for CMS low- to mid-level data types is beginning to solidify and currently consists of three types of datasets: Hits, Digis, and DSTs. Derived object collections in a derived dataset contain internal “links” back to the original object collections used in the original dataset. This implies that a “complete” dataset (or sub-dataset consisting of runs or events) is only completely specified when one specifies the corresponding Hits, Digis, and DSTs together (a toy illustration of this linkage appears in the sketch after this list). Access to all of the above CMS POOL datasets requires applications to be built from ORCA/COBRA libraries. The requirement on the GriPhyN Virtual Data Language (and System) is that a strong notion of “datasets” will need to be supported.

• The model for CMS high-level data types such as Analysis Object Data (AODs) and TAGs is not yet defined and may be generic enough so as not to require CMS-specific ORCA/COBRA libraries for access. The extent to which AODs and TAG data contain copies of previously derived data or simply provide internal links back to the external original data is not yet decided. Currently, ROOT Trees are being used in place of a well defined AOD type. This implies that some degree of uncertainty in GriPhyN integration tasks will persist until such high-level datasets become mature.

• The CMS Software is composed of a set of libraries and tools, and effectively provides a Software Development Kit (SDK) and environment for data analysis. This enables users to build and execute personalised CMS analysis applications, in which a significant part of the user analysis activity at the DST level (and possibly the AOD level) is devoted to developing the analysis application through iterations over a small local dataset. Once refined, the application needs to be run over much larger DST datasets in a production-like manner across the Grid to produce personal ROOT Trees (or personal AODs when they are available in CMS). The requirement on GriPhyN technology is that the analysis application cannot be considered a simple shell script or binary executable; rather, it is a user-composed, complex execution environment.

• Users need the ability to easily package, publish, and/or distribute their own complete local execution environment (including any and all shared object libraries, environment setup tools, wrapper scripts, etc.) of their personal CMS application to grid resources for distributed execution over distributed datasets. Such a package might be considered an input dataset to a “Transformation” in the GriPhyN Virtual Data Language.

• The CMS Software Environment actively supports “reconstruction on demand” which means that if the requested object collection does not persistently exist in the input dataset, it will be instantiated by recursively invoking “reconstruction” algorithms on any necessary internal or external (but linked) object collections. This provides the user with significant power and flexibility in defining precisely the reconstructed object collection of interest. It also implies that users will need to take care in specifying which object collections they wish to access: reconstructing a new object collection on demand can take several orders of magnitude longer than simply reading a persistent object collection from disk; indeed, different datasets may require very different resource usage constraints for the same application. Such differences in performance will strongly influence GriPhyN planning decisions of whether to move the dataset with the job (to exploit compute resources) or to simply schedule the job at the original dataset location (to exploit storage resources).

• Once the ROOT Trees (or AODs, when available) are produced from DSTs using the user’s specific application, the user needs access to them. If small enough, they can be transported, stored, and analysed locally. If too large, they could be accessed interactively over the wide area network, or they could be further refined and reduced so that they fit within local storage constraints. Users will want to have both options.
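
As noted in the dataset bullet above, the toy model below shows why a “complete” DST specification drags in the Digis and Hits it links back to. The class and field names are purely illustrative and are not the CMS POOL schema.

```python
"""Toy model of the Hits/Digis/DST linkage constraint described in the list above."""

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    name: str
    kind: str                               # "Hits", "Digis", or "DST"
    parent: Optional["Dataset"] = None      # internal "link" back to the dataset it was derived from
    files: List[str] = field(default_factory=list)

    def closure(self) -> List["Dataset"]:
        """Everything needed to fully specify this dataset: itself plus all linked ancestors."""
        chain = [self]
        if self.parent is not None:
            chain.extend(self.parent.closure())
        return chain


if __name__ == "__main__":
    hits = Dataset("example_ttbar.hits", "Hits")
    digis = Dataset("example_ttbar.digis", "Digis", parent=hits)
    dst = Dataset("example_ttbar.dst", "DST", parent=digis)
    # A "complete" DST specification drags in the Digis and Hits it links back to.
    print([d.name for d in dst.closure()])
```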

2 Grid and Web Services Architecture for CMS Data Analysis

In parallel to and in close step with the above CMS analysis application work, important and substantial activity continues to be devoted towards designing a distributed, dynamic system of services to support CMS Data Analysis.

1 Initial Prototype at Supercomputing 2003

Considerable work over the past year was devoted towards the deployment of an initial end-to-end system of distributed high-level services supporting grid-enabled data analysis, and resulted in a demonstration at Supercomputing 2003. In particular, the prototype integrated a typical physics user interface client [ROOT], a uniform web-services interface to grid services [CLARENS], a virtual data service [CHIMERA], a request scheduling service [SPHINX], a monitoring service [MONALISA], a workflow execution service [VDT, CONDOR], a remote data file service [CLARENS], a grid resource service [VDT, GLOBUS], and a replica location service [RLS]. For testing and evaluation purposes, the prototype was deployed across a modest-sized U.S. regional CMS Grid Test-bed (consisting of sites in California, Florida, Illinois, and Iowa) and exhibited the early stages of interactive remote data access [GAE], demonstrating interactive workflow generation and collaborative data analysis using virtual data and data provenance [VDCMS], as well as showing non-trivial examples of policy-based scheduling of requests in a resource-constrained grid environment [QOS]. Using similar prototypes, this work will be used to characterize the performance of the system as a whole, including the determination of request-response latencies in a distributed service model and the classification of high-level failure modes in a complex system. The prototype was based upon a ROOT-based data analysis and did not at that time make use of the CMS Software Environment. Later services (see below) are beginning to integrate the rest of the CMS Software Environment.

[pic]

Figure 9: Diagram showing operation of the CMS Grid-enabled Analysis Prototype at SuperComputing 2003

2 The Clarens Web Services Architecture

Clarens [CLARENS] provides a “portal” to a host of Grid computing services. This portal takes the form not only of a standard web-based application, but also of programmatic interfaces to analysis services that physicists use to access and analyze data as part of a dynamic Virtual Organization. Currently both a Java and a Python version of Clarens are available.

Clarens acts as the “backbone” for web services that are currently being developed within the context of the Grid Analysis Environment [GAE]. After the Supercomputing 2003 conference, several new core services were developed within the Clarens Web Service Environment and deployed on the GAE testbed:

• Rendezvous: dynamically discovers services that are deployed on multiple Clarens servers within a distributed environment. A JavaScript front end enables users to search for specific services and displays each service interface using the Web Services Description Language (WSDL).

• System management: enables fine granularity access control management and information about services active on a specific Clarens server.

• File Access: enables fine granularity access control management and lookup capability on files hosted by Clarens servers.

• VO Management.

Part of the Clarens development is providing web service interfaces to existing applications that are used in (or are important for) CMS. Equally important is providing transparent access to these services, not only for applications (by using WSDL interfaces and dynamic discovery), but also for users working through graphical front ends (see Figure 10).
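
For programmatic access, a client call might look like the sketch below, which assumes an XML-RPC style endpoint and standard introspection; the server URL is hypothetical, and GSI authentication (which the real Clarens clients handle) is omitted.

```python
"""Minimal sketch of a programmatic client call to a Clarens-style web service."""

import xmlrpc.client

# Hypothetical Clarens endpoint on the GAE testbed.
SERVER_URL = "https://clarens.example.edu:8443/clarens/"


def list_remote_methods(url: str) -> list:
    """Ask the server which methods it offers (standard XML-RPC introspection)."""
    proxy = xmlrpc.client.ServerProxy(url)
    return proxy.system.listMethods()


if __name__ == "__main__":
    for method in list_remote_methods(SERVER_URL):
        print(method)
```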

There are currently 20 known deployments of Clarens: Caltech (5), Florida (4), Fermilab (3), CERN (3), Pakistan (2+2), INFN (1).

Work has started to include Clarens (and in a later stage other GAE services) in the official CMS software distribution and the DPE distribution.

[pic]

Figure 10: Snapshot of graphical front ends to web services within the Clarens framework.

3 Collaborative Analysis Versioning Environment Service (CAVES)

The Collaborative Analysis Versioning Environment System (CAVES) project concentrates on the interactions between users performing data and/or computing intensive analyses on large data sets, as encountered in many contemporary scientific disciplines. In modern science increasingly larger groups of researchers collaborate on a given topic over extended periods of time. The logging and sharing of knowledge about how analyses are performed or how results are obtained is important throughout the lifetime of a project. Here is where virtual data concepts play a major role. The ability to seamlessly log, exchange and reproduce results and the methods, algorithms and computer programs used in obtaining them enhances in a qualitative way the level of collaboration in a group or between groups in larger organizations. It makes it easier for newcomers to start being productive almost from day one of their involvement or for referees to audit a result and gain easy access to all the relevant details. Also when scientists move on to new endeavours they can leave their expertise in a form easily utilizable by their colleagues. The same is true for archiving the knowledge accumulated in a project for reuse in future undertakings.

The CAVES project takes a pragmatic approach in assessing the needs of a community of scientists by building a series of prototypes of increasing sophistication. By extending the functionality of existing data analysis packages with virtual data capabilities, these prototypes provide an easy and familiar entry point for researchers to explore virtual data concepts in real-life applications and to provide valuable feedback for refining the system design. The architecture is modular, based on Web, Grid, and other services that can be plugged in as desired. As a proof of principle, we built a first system by extending the very popular data analysis framework ROOT, widely used in high energy physics and other fields, making it virtual-data enabled.
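
The core idea can be illustrated with the minimal sketch below: tag an analysis session, log the exact commands, and let a collaborator retrieve and rerun them later. The in-memory store and tag names are placeholders; the actual prototype extends ROOT and keeps logged sessions in a shared repository.

```python
"""Illustrative sketch of the session-logging idea behind CAVES."""

SESSION_STORE: dict = {}  # stand-in for the shared repository of logged analyses


def log_session(tag: str, commands: list) -> None:
    """Record the exact sequence of analysis commands under a named tag."""
    SESSION_STORE[tag] = list(commands)


def reproduce(tag: str) -> list:
    """Retrieve a logged session so a collaborator (or referee) can rerun it."""
    return SESSION_STORE[tag]


if __name__ == "__main__":
    log_session("dimuon-v1", ["load data", "apply cuts", "fit mass peak"])
    print(reproduce("dimuon-v1"))
```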

4 Requirements and analysis for Distributed Heterogeneous Relational Data Warehouses under the CMS context

Progress in studying Distributed Heterogeneous Relational Data Warehouses (DHRD) for CMS continued in year four. In particular, functional requirements on DST and AOD data types deriving from the recent Data Challenge ’04 were studied. One of the conclusions of this work is that a single schema for all DST and AOD data is desirable and conceivably achievable. This would require that CMS applications be aware of only one schema, enabling economical access to multiple heterogeneous databases. Finally, a framework of DHRD services is proposed that would provide a coherent system for CMS data analysis services and for replication from Tier-0 to Tier-1 to Tier-2 centres.

3 Future Plans for CMS Activities in GriPhyN

The main goal of the GAE is to provide a distributed analysis environment for the CMS experiment in which users can perform (interactive/batch) analysis based on a transparent view of the resources available. Within the GAE project, Clarens web servers will act as the “backbone” for deployment and development of web services.

The first phase of the GAE project has focused on the design of a distributed analysis environment architecture, and on creation and deployment of robust “basic” services (see section 5.2.2.2) that can be used to develop “higher level” services.

The second phase focuses on integration of existing CMS software within the GAE. Part of this activity is deploying current GAE services and CMS software on the GAE testbed, to identify which CMS components can be exposed as web services. Several of these components have been made into web services: [POOL], [BOSS], [REFDB], [RUNJOB], [SPHINX]. Multiple instances of these web services are being deployed on multiple hosts in the current GAE testbed to provide a robust analysis environment (if one host fails, instances of a given service remain available on other hosts). Within the second phase, work has also started to integrate monitoring information from MonaLisa [MONALISA] into the Clarens environment. The monitoring information will be used in developing “high level” services.
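
The robustness argument above amounts to simple client-side failover, sketched below under the assumption of XML-RPC style endpoints; the instance URLs are hypothetical, and a real GAE client would discover live instances through the Rendezvous service rather than a hard-coded list.

```python
"""Sketch of client-side failover across replicated service instances."""

import xmlrpc.client

# Hypothetical replicas of the same service deployed on several GAE testbed hosts.
INSTANCES = [
    "https://gae-a.example.edu:8443/service/",
    "https://gae-b.example.edu:8443/service/",
    "https://gae-c.example.edu:8443/service/",
]


def call_with_failover(method: str, *args):
    """Try each instance in turn; if a host is down, fall back to the next one."""
    last_error = None
    for url in INSTANCES:
        try:
            target = xmlrpc.client.ServerProxy(url)
            for part in method.split("."):      # support dotted names like "system.listMethods"
                target = getattr(target, part)
            return target(*args)
        except OSError as err:                  # connection refused, host unreachable, ...
            last_error = err
    raise RuntimeError(f"all service instances failed: {last_error}")


if __name__ == "__main__":
    # Ask whichever instance responds first to list its methods (standard introspection).
    print(call_with_failover("system.listMethods"))
```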

A risk of integrating third-party components (in our case, CMS-related components) is that not every application currently has a stable interface. As a result, from time to time existing web services need to be updated to reflect the latest versions of the third-party software they interface to. The GAE team has identified such web service updates as important but also as a risk (as they prevent team members from doing other tasks). It is therefore important to perform “education and outreach” to the developers of third-party components that the GAE team considers necessary within the GAE, and to convince those developers to provide web service interfaces to their applications.

The third phase of the GAE project will focus on developing “high level” services. High-level services can be described as services that take input from the “low level” passive services and proactively make decisions on where to perform an analysis job, when to replicate data, and so on. Part of this phase is the development of an accounting service that allows for “fair” sharing of resources. The accounting service, together with other high-level services, prevents scenarios in which one user allocates all resources for a very long time without the organization’s consent.

1 Application Level Integration

During all phases the GAE team conducts and investigates different types of analysis scenarios based on ROOT [ROOT] and ORCA [ORCA]. As a result, efforts will continue in year five to integrate the CMS Analysis Software Environment into the GAE using a “bottom-up” approach. In particular, the benchmark data analyses that were developed in year four in a local environment will begin to migrate in year five to use GAE services for job submission, remote data access, and scientific collaboration. The work plan below represents the currently envisaged migration from local CMS data analysis to fully distributed CMS data analysis in a collaborative GAE environment:

1. Migrate Benchmark Analysis to GAE

1. Demonstrate interactive, remote data analysis capability

1. Distribute signal and background ROOT/AOD samples to different GAE sites

2. Access Root Tree AODs using Root-Clarens Client

3. Optimize selection of signal and rejection of background

4. Apply statistical tools for setting discovery potential limits

2. Demonstrate integration with CMS Software Environment

1. Integrate ORCA and POOL into GAE

2. Access POOL AODs using ORCA dictionary libs in Root-Clarens Client

3. Optimize selection of signal and rejection of background

4. Apply statistical tools for setting discovery potential limits

5. Perform systematic analysis of results

3. Demonstrate collaborative capability

1. Integrate Analysis steps with CAVES and Chimera Virtual Data Service

2. Divide analysis between 2 people

3. Perform analysis demonstrating shared use of data

4. Perform analysis demonstrating shared use of code/libraries

5. Apply statistical tools for setting discovery potential limits

6. Perform systematic analysis of results

2. Scale up Analysis on GAE

1. Perform analysis on ................
