ITR/AP+IM+SI+SY: A Prototype Knowledge Environment for …



A. Project Summary

We propose to create a prototype Knowledge Environment for the Geosciences (KEG) that demonstrates a seamless, virtual laboratory for Earth system science research and education. This environment is a platform for fundamental IT research enabling advances in methodologies and tools for distributed, large-scale collaborative research, knowledge evolution, and distributed information environments. It will organize research products into a searchable, shared group resource and thus a knowledge-based problem-solving environment for geosciences. It is targeted at the heart of the research activity—the process—not the end result. This will be the first time a comprehensive knowledge system has been proposed to integrate frontier research on high performance simulation of the Earth system with both archival and interactive geosciences learning environments. The complexity of a geophysical knowledge framework is a new and fertile context for basic IT research. Moreover, the proposed research is structured to foster iteration among IT and the core geophysical problems, whereby basic IT research contributions spur new geosciences developments that in turn pose new IT research challenges.

The need to understand the physical and biological processes that shape our environment is a grand challenge for this century. The possible influence of human activities on the Earth system and the vulnerability of a complex global economy to severe natural events make this an urgent problem. However, the Earth/Sun system operates at disparate spatial and temporal scales, and advancing our understanding of this system requires a vast array of observational data, many scientific disciplines, and many scientific models. Moreover, scientific understanding of the environment must be accessible to diverse groups, from scientists to policy makers and from educators to students. Traditional modes of scientific research are stymied when applied to the breadth of scales encountered in the geosciences and often reach only a limited audience of specialists. To meet this grand challenge, new modes of collaboration and dissemination that leverage IT will be necessary. This proposal will contribute to the digital infrastructure vital for the next generation of geosciences research.

The prototype knowledge environment for the geosciences proposed here will be assembled in three layers:

1) Interaction Portal (IP)

2) Knowledge Framework (KF)

3) Multiscale Earth System Repository (MS)

The IP is the connection between the community and the knowledge environment and consists of tools, components, and interfaces built upon the common fabric of the Knowledge Framework. Areas of emphasis for the IP include a common code development environment, visualization, and support for interactions among geographically distributed groups. The KF mediates knowledge between the MS and the IP using principles of encapsulation, polymorphism, and data abstraction to facilitate interdisciplinary research with a set of distributed methods, classes, and tools. Finally, the MS generalizes current Earth system models into a linked hierarchy of models at several scales. The prototype multiscale model will have a nonhydrostatic atmosphere with interactive chemistry and cloud microphysical processes.

A fundamental component of our system design is a shared Earth system modeling framework that will provide a “commons” for university and NCAR computer and Earth scientists to compare, test, and evaluate new tools and methods for modeling complex Earth system processes. This diverse scientific effort will be organized and archived with “middleware” that enhances opportunities for applications of the model products to research on impacts and consequences of weather and climate variability.  The final modeling and analysis products will be transmitted to the Digital Library for Earth System Education [DLESE] for peer-review and posting in the collection. The integrative nature of learning about the Earth demands a core information technology infrastructure that makes distributed learning a reality—the time to act is now.

The IT research will be accomplished by a multidisciplinary scientific team with expertise spanning knowledge representation, reasoning, and problem-solving environments; collaboration research; parallel computation; scientific visualization and data analysis; human-computer interaction; and software process and architecture. Support for the substantive geophysical model development is also broad and leverages the considerable resources of the participating universities (Courant Institute of Mathematical Sciences, Howard University, Purdue University, Stanford University, University of Alabama-Huntsville, University of California at Los Angeles, University of Chicago, University of Colorado, University of Illinois Urbana-Champaign, University of Michigan, and University of Wisconsin) as well as the National Center for Atmospheric Research.

B. Table of Contents

A. Project Summary i

B. Table of Contents iii

C. Project Description 1

1 The Information Technology Revolution for the Geosciences 1

1.1 National and Global Context 1

1.2 A Vision for Enabling Virtual Communities of Researchers and Educators 1

1.3 NCAR’s Role 2

2 Elements of the KEG and Related Work 2

2.1 Problem Solving Environments 2

2.1.1 Problems/limitations with existing systems: Need integration, scalability, etc… 2

2.2 Collaboratories and Related Infrastructure 2

2.3 Portals for Scientific Research & Education 2

2.3.1 An Environment for Hypothesis Development and Testing 2

2.3.2 Frameworks for Realizing the Portal 2

2.4 Scientific Data: Complex, Diverse, and Very Large 2

2.5 Mining: Data, Information, and Knowledge 3

2.6 Discovery of Information, Data, Software, Tools, and Knowledge 4

2.7 Visualization: Multiscale and Terascale 5

2.8 Advanced Collaborative Environments 5

2.9 Next-generation Multiscale Earth System Models 6

2.10 Distributed Group Development of Frameworks, Tools, Models, and Agents 6

2.11 Executing and Managing Simulation Processes 6

2.12 Knowledge Systems: Ontologies for the Geosciences 6

2.13 The IT Challenges: Scalability, Overall Integration, IT Research 7

3 Research Design and Methods 7

3.1 The Concept 7

3.2 IT Research Challenges 8

3.3 KEG Definitions 8

3.4 Goals, Requirements, and Characteristics 9

3.5 Design 9

3.6 Architecture and Enabling Frameworks 10

3.6.1 Detailed description of the architecture 10

3.6.1.1 IP layer 10

3.6.1.2 KF1 layer 11

3.6.1.3 KF2 layer 13

3.6.1.4 KF3 layer 14

3.6.1.5 KF4 layer 18

3.6.1.6 MS layer 18

3.7 Outcomes 20

3.7.1 Infrastructure 21

3.7.2 Education and research 21

3.7.3 Services 21

3.7.4 An expandable framework 21

3.8 Software Engineering Challenge 21

4 A Multi-scale Earth System Model 21

4.1 Definition: 21

4.2 The Problems: Modeling and Software 24

4.3 The Approach 24

4.4 Goals and Outcome 25

5 A KEG for Everyone! 25

5.1 Deliverables 25

5.2 Technology Transfer 25

6 Education and Outreach 25

6.1 Outreach to the Scientific Community – Summer Design Institutes 25

6.2 Outreach to Communities 25

6.3 The K-12 Educational Community – K-12 KEG 25

6.4 Outreach to a Diverse Community – Collaboration with SOARS 26

6.5 Outreach to the Public – Sharing Information about KEG: 26

7 Usage Scenarios 26

7.1 Hurricane Landfall (HAL) Test Bed 26

7.2 El Nino Southern Oscillation (ENSO) Test Bed 26

7.3 Megacity Impact on Regional And Global Environment (MIRAGE) Test Bed 27

8 Broader Impacts 27

9 Management Plan 28

10 Prior Results 30

D. References Cited I

E. Biographical Sketches VI

F. Proposal Budget VII

G. Current and Pending Support VIII

H. Facilities, Equipment, and Other Resources VIII

I. Special Information and Supplementary Documents VIII

J. Appendices VIII

K. Attic VIII

11 Data Mining from UAH VIII

11.1 Data mining in a distributed Environment VIII

11.1.1 Goals VIII

11.1.2 Requirements VIII

11.1.3 Basics IX

11.1.4 Applying Data Mining IX

C. Project Description

1 The Information Technology Revolution for the Geosciences

1.1 National and Global Context

Over the last 30 years, the global population has doubled, the atmospheric carbon dioxide concentration has increased from 315 to 370 ppm, and the mean global temperature has risen from 13.9 degrees C to 14.4 degrees C. A gaping ozone hole appears every spring over Antarctica, and another seems to be developing over the Arctic. Air and water pollution problems are global in scale. Never before has the need to understand our planet, the complex interactions of its processes, and our own impact upon the system been as urgent and compelling as now. The unique challenge of the geosciences is to address as a whole the many interlocking processes in the atmosphere, oceans, land surfaces, ice sheets, and biota that together determine the behavior of the planet. This holistic set of processes requires an Earth system approach to global, and even regional and local, problems, combining many specialties in a way that is not required in other scientific pursuits. The research issues are ceasing to be the purview of any single discipline and span multiple communities, with stakeholders in education, environmental and societal impacts, and multiple Earth system disciplines.

Detailed observations of the Earth, distributed and diverse data and information holdings, powerful simulation and analysis capabilities, knowledge holdings, and collaboration environments, to name but a few, clearly have tremendous potential to elevate our knowledge and understanding of our planet. The information technology revolution brings unprecedented new capabilities that offer substantial promise for integrating these resources and turning them into powerful new tools and environments. As we consider our future, however, simple extensions of extant technologies and methodologies will not begin to address our requirements. A new era of scientific discovery is within reach, if these new capabilities can be effectively harnessed in the service of science.

1.2 A Vision for Enabling Virtual Communities of Researchers and Educators

A centerpiece of NCAR’s long-term vision is to develop a Geosciences Decision Support Environment to substantially improve our understanding of and to provide accurate and timely information about the Earth system in which we live. This information and the decision support environment itself will be used to facilitate and accelerate fundamental scientific research, enrich education programs and to feed into policy decisions and assessments. This vision is consistent with the PITAC report [PITAC99], “Research is conducted in virtual laboratories in which scientists and engineers can routinely perform their work without regard to physical location—interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information in digital libraries.”

To make strides toward realizing this vision we propose to create a prototype Knowledge Environment for the Geosciences (KEG) that will produce a knowledge enabled collaborative problem-solving environment for Earth system research and education. This environment is a platform for fundamental IT research enabling advances in methodologies and tools for distributed, large-scale collaborative research, knowledge evolution, and distributed information environments. It will organize research products into a searchable, shared group resource that underlies a compelling concept: a knowledge-based problem-solving environment for geosciences research, education, and assessment. It is targeted at the heart of the research activity—the process—not just the end result. This will be the first time a comprehensive knowledge system has been proposed to integrate frontier research on high performance simulation of the Earth system with both archival and interactive geosciences learning environments. The complexity of a geophysical knowledge framework is a new and fertile context for basic IT research. The proposed research is structured to foster iteration among IT and the core geophysical problems, whereby basic IT research contributions spur new geosciences developments that in turn pose new IT research challenges.

1.3 NCAR’s Role

NCAR’s primary function is to serve as an integrator of people, disciplines, methods, technologies, and activities in the pursuit of advancing the national research agenda. It also acts as a catalyst, bringing together many specialists, disciplines, approaches, technologies, and activities to propel the science forward. While these roles have traditionally been in the context of earth system research, they must now extend into the information technology realm as well if geoscience is to achieve the progress that is needed.

NCAR is well positioned to play a prominent role in motivating the evolution of information technology research in the context of the geosciences. Broad community projects and large-scale simulation efforts push the envelope of what’s possible, and serve as harbingers of future community needs. In this proposed work we team earth system researchers with their counterparts in computational science in order to develop new understanding of the problem domain and to attack the basic research problems in computational science. This synergistic partnership is crucial to advancing the research agenda for all of the disciplines involved.

NCAR has a responsibility to foster the development of important, long-term community infrastructure and to support it as a persistent resource for research. This role complements this work by providing a path for sustaining and providing longevity for the prototype environments, frameworks, tools, and software that are produced as a result of this effort.

2 Elements of the KEG and Related Work

In considering a next generation environment for supporting distributed group research, one can identify a number of logical components that we understand fairly well today. Collaboratories present shared, virtual spaces where groups of researchers can conduct experiments, share results, collectively produce intermediate analyses, and work together to produce knowledge products such as publications. In geosciences research, terascale simulations produce terascale data holdings and these in turn must be analyzed in the context of the observed record - massive data in its own right. Recent advances in Grid technologies provide a model for an underlying computational and data fabric conceived for terascale modeling and analysis. Generalized frameworks for advanced numerical models are emerging that not only facilitate plug-and-play flexibility for algorithms, but also have substantial promise for supporting domain-specific problem solving environments.

The overarching challenge is to enable all of these technologies to be combined into effective problem-solving environments. The effort proposed here is aimed at building upon a number of other research efforts and extending them such that a knowledge-enabled meta-framework is realized. This environment provides an interdisciplinary team with virtual proximity to all required resources and to each other. In the sections that follow we describe the primary building blocks of the prototype Knowledge Environment for the Geosciences, related work, and research challenges.

2.1 Problem Solving Environments

2.1.1 Problems/limitations with existing systems: Need integration, scalability, etc…

2.2 Collaboratories and Related Infrastructure

2.3 Portals for Scientific Research & Education

2.3.1 An Environment for Hypothesis Development and Testing

2.3.2 Frameworks for Realizing the Portal

2.4 Scientific Data: Complex, Diverse, and Very Large

Observational programs such as NASA’s Earth Observing System (EOS Terra, Aqua, and Aura) [] present a proverbial fire hose of data for the Earth system community. Space science will face similar challenges when platforms such as the Stratospheric Observatory for Infrared Astronomy (SOFIA) [] and other advanced observatories become operational. At the same time, researchers are successfully harnessing parallel computational platforms to simulate phenomena at unprecedented resolution, while nested and multiscale models add a further level of complexity. Furthermore, climate and weather researchers and impacts-assessment stakeholders require tremendous flexibility to combine and compare multiple disparate datasets, including GIS. Overall, the geosciences community faces massive growth in the scope, complexity, and ultimate size of crucial scientific data, with volumes escalating into the terabyte and petabyte range during this decade. This Data Problem challenges our very ability to understand the systems and underlying processes and has the potential to stand as a formidable barrier to research progress if not addressed.

A meta-framework that anticipates future data requirements must possess extraordinary qualities relative to performance, scalability, flexibility, distributed operation and, above all, the incorporation of semantic content. We propose to build upon and coalesce several community efforts, each of which contributes a unique part to the KEG concept. Recent work in HDF5 [] addresses scalability and performance in the context of parallel computation and exposes a powerful and flexible data model. The Distributed Oceanographic Data System (DODS) is a popular framework for enabling data abstraction and distributed access but has not been targeted at high-performance applications. Recent work at the University of Wisconsin on the VisAD class library [] provides an elegant abstraction of data that is highly synergistic with both DODS and HDF5. One aspect of this research will be coalescing these into the meta-framework context with a coupling to DataGrid technologies, which enable distributed operation and address performance issues. One of the outstanding opportunities presented by this research is to explore the possibilities afforded by coupling this best-of-class synthesis with geoscience-specific ontologies, which enable management, discovery, and usage based upon semantic content.
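
As a concrete illustration of the kind of data abstraction this synthesis implies, the following minimal sketch (in Python, with hypothetical class and scheme names; it is not drawn from HDF5, DODS, or VisAD themselves) shows how heterogeneous back ends could be hidden behind a single discipline-neutral interface:

from abc import ABC, abstractmethod

class GeoDataSource(ABC):
    """Hypothetical abstraction over heterogeneous back ends (an HDF5 file,
    a DODS/OPeNDAP server, a VisAD data object, ...)."""

    @abstractmethod
    def variables(self):
        """Names of the variables this source can serve."""

    @abstractmethod
    def read(self, name, region=None):
        """Return the named variable, optionally subset to a region."""

class InMemorySource(GeoDataSource):
    """Stand-in back end used only so this sketch runs end to end."""

    def __init__(self, fields):
        self._fields = fields

    def variables(self):
        return sorted(self._fields)

    def read(self, name, region=None):
        data = self._fields[name]
        return data if region is None else [row[region] for row in data]

def open_source(url):
    """Dispatch on the URL scheme; the schemes and back ends are assumptions."""
    if url.startswith("mem://"):
        return InMemorySource({"temperature": [[287.1, 288.4], [289.0, 290.2]]})
    raise NotImplementedError("hdf5://, dods://, visad:// back ends would plug in here")

src = open_source("mem://example")
print(src.variables())                       # ['temperature']
print(src.read("temperature", slice(0, 1)))  # first column of each row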

2.5 Mining: Data, Information, and Knowledge

Data Mining is concerned with the technologies that provide the ability to extract meaningful information and knowledge from large, heterogeneous data sources. Currently, large numbers of observations are acquired and stored in diverse and distributed data repositories, resulting in the need for “theories” that distill the information and knowledge content. The challenge of extracting meaningful information becomes progressively more formidable for the Geoscience community with the launch of the components of NASA’s Earth Observing System (EOS Terra, Aqua and Aura) and future missions. Similar challenges will face the Space Science community when platforms such as the Stratospheric Observatory for Infrared Astronomy (SOFIA) and other advanced observatories become operational. Since the acquisition of data is a continuing process, general tools and algorithms are needed for analyzing data, as well as for creating and testing theories or hypotheses. Due to the vast amounts of data involved, automated approaches that limit the need for human intervention are desirable.

Much progress has been made in both data mining and knowledge discovery over the past few years. For example, these techniques have proven useful for the automation of the analysis process and reducing data volume. However, the domain is still fairly new and this research frontier offers many areas for substantial improvement, such as the utilization of background knowledge, provability of results, scalability and the use of distributed computing approaches.

Current and near-term data mining systems are limited in scope and exist only as complementary tools for use with manual scientific analysis. Even so, there are several sound reasons for pursuing the use of data mining in the geosciences domain. It is a powerful tool for the analysis of huge amounts of science data and for cases where manual examination of the data is impossible. Furthermore, the careful use of mining tools provides the ability to refine and add more layers to the knowledge bases associated with a given science domain. Both of these help to minimize the domain scientists’ data handling tasks and allow them to maximize their research time. Because mining plans and techniques can be stored and documented in a reusable knowledge base, scientists can also publish their techniques for reuse and verification, and for use in other domains, thus reducing the tendency to reinvent the wheel. Such new mining approaches can also be readily integrated into a larger set of services in a framework to provide a new hybrid tool that is much greater than the sum of its parts.

The intelligent application of algorithms to data is the means by which we extract knowledge. In data mining, the number of algorithms that can be used to filter, manipulate, and analyze data from its raw form into information, and ultimately into knowledge, continues to expand. The interdisciplinary focus of scientific research today has added complexity, with newer algorithms and different uses for the collected data. A new paradigm for data mining is required: a framework for modeling distributed data, algorithms, and their inter-relationships. This framework can be used as the foundation upon which to build automated systems for the discovery of causal relationships, hypothesis testing, and theory formation, and to provide means for the detection of interesting correlations and patterns.

This next generation of mining tools, working within a larger framework, can provide a number of new capabilities for better dealing with the rising tide of geoscience data sets. They can provide flexible support for scientific analysis, ranging from reducing and refining the volume of data, to automation of the analysis process itself. Such tools should be accessible from all platforms including workstations, data archive centers and data mining environments. They should provide benefit to all levels of users, from the casual user to domain experts. And the tools should be available in real-time, on-demand, on-ingest and must provide verifiable and repeatable results.

To accomplish this, a number of challenges must be met. Foremost is to define common standard interfaces for interoperability of distributed services and to incorporate these standards into the geosciences domain. This common standard definition has begun with the efforts of projects such as the Earth Science Markup Language (ESML) and OGC classification and services. Along with this common language, new frameworks must be developed to support this paradigm of interoperability. Such a framework should support network-accessible data sets (e.g., the data pool concept). It should support a catalog of distributed capabilities and planning for resource allocation and load balancing. It should be able to deal with new and changing environments, such as streaming input, on-orbit processing, and grid environments such as NASA’s IPG. The need for such a framework has been clearly stated in several arenas. For example, the scientists attending the ITSC-hosted NASA Data Mining Workshop expressed the desire for a taxonomy of data mining techniques and a semantic model for data.

Such a framework must be able to scale to much larger levels and be robust enough to survive in the rapidly changing computing environment of today and tomorrow. For example, a web-based user interface is needed for visually connecting distributed processes, querying information about services and algorithms, and interactively inspecting data results. New methods of knowledge representation will be required in order to contain and use the large amounts of data associated not just with data sets, but also with algorithms, processes, and all the other facets of large-scale geoscience research.

In support of the KEG Framework, the team will also create an ontology for formally representing both the data and the algorithms used to manipulate that data, and more importantly, the complex associations among them. Applying its expertise in data mining and pattern recognition the research team will then develop access methodologies that will exploit this ontology. The integrated framework gained from merging a formal ontology with an overall search and modification capability will facilitate the transformation, exploitation, and exploration of data, in order to improve the ability to extract knowledge.
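
A minimal sketch of this idea follows; the triples, class names, and algorithm names are purely illustrative assumptions rather than elements of an existing KEG ontology:

# (subject, predicate, object) statements relating data sets and algorithms.
ONTOLOGY = {
    ("cloud_classifier", "is_a", "clustering_algorithm"),
    ("cloud_classifier", "applies_to", "multispectral_imagery"),
    ("storm_tracker", "is_a", "feature_detection_algorithm"),
    ("storm_tracker", "applies_to", "radar_reflectivity"),
    ("goes_ir_archive", "has_type", "multispectral_imagery"),
}

def algorithms_for(dataset):
    """Algorithms whose declared input type matches the dataset's declared type."""
    types = {o for s, p, o in ONTOLOGY if s == dataset and p == "has_type"}
    return sorted(s for s, p, o in ONTOLOGY if p == "applies_to" and o in types)

print(algorithms_for("goes_ir_archive"))  # ['cloud_classifier']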

2.6 Discovery of Information, Data, Software, Tools, and Knowledge

2.7 Visualization: Multiscale and Terascale

2.8 Advanced Collaborative Environments

In order to facilitate the creation of a knowledge environment for the geosciences that demonstrates a seamless, virtual laboratory for Earth system science and research, the environment needs to support collaboration at a fundamental level. How are participants connected to the virtual laboratory? How is the laboratory organized and structured? If participants are located at many different sites across the United States, with some resources collocated with participants and others not, how is information shared and how do groups collaborate?

Argonne National Laboratory, with which the University of Chicago group is associated, has begun to address these issues via the development of the Access Grid [1, 2]. The Access Grid is an ensemble of network, computing, and interaction resources that supports group-to-group human interaction across the grid. It consists of large-format multimedia displays, presentation and interactive software environments, interfaces to grid middleware, and interfaces to remote visualization environments. Access Grid nodes are deployed into “designed spaces” that explicitly support the high-end audio and visual technology needed to provide a high-quality, compelling, and productive user experience. Access Grid nodes are connected via the high-speed Internet, typically using multicast video and audio streams, and it is possible to participate in Access Grid sessions from multimedia PCs. We are currently exploring the use of commodity technologies such as game consoles as high-performance, low-cost Access Grid nodes, exploiting the fact that gaming consoles provide more advanced graphics than most PCs and are beginning to provide virtual environment support as well. The Access Grid enables distributed meetings, collaborative teamwork sessions, seminars, lectures, tutorials, and training. The Access Grid design point is modest (3-20 people per site) but promotes group-to-group collaboration and communication. Large-format displays integrated with intelligent or active meeting rooms are a central feature of Access Grid nodes.

The development of the Access Grid connects groups together, but that is just the beginning: it only forms the foundation for the development of virtual communities. Once a community is formed, the infrastructure and tools need to be in place for it to accomplish its goals. Chicago proposes to work on the development of collaborative visualization tools, layered on the Access Grid collaboration infrastructure, that support a wide variety of endpoints, be it a desktop or an advanced display environment.

Work has been done on supporting collaborative visualization and analysis over network-based environments. Foster et al. [3] outline the requirements for enabling collaboration-oriented environments, based on what has evolved into today’s grid infrastructure [4]. An early project in collaborative visualization is the Collaborative Visualization (CoVis) project [5], which is used as an educational tool. As an example, CoVis has taken tools used by atmospheric scientists and simplified them for use in education. CoVis uses existing network-based tools in an ad hoc manner to build the collaboration environment. This requires that the tools be available at all the sites and that each user become familiar with the needed tools, which can be a problem in a large collaboration where infrastructure and capabilities vary. Other collaborative visualization work, which is connection oriented, has been done in the context of the Web, using browser and Internet infrastructure as the backbone on which tools are built [6, 7]. Shastra defines an architecture for collaboration in a LAN-based setting, which is stream based [8]. Chicago plans to build on past experience in the area of collaborative visualization [9-12] and the work of the community in developing collaborative visualization tools that exploit the emerging Grid infrastructure [13-21] and Access Grid framework. In order for collaborative visualization to succeed, the underlying infrastructure needs to be flexible enough to operate across a variety of network connections and on a variety of endpoints. The endpoint may be a user’s laptop, a room-sized high-resolution display, or even an immersive virtual environment. The datasets whose analysis users will be collaborating on will also vary in size, and some machines will be incapable of holding all the data, much less have the computational resources available to process it into a useful form. This unbalanced environment of resources means that the infrastructure needs to be aware of the use of visualization servers, enabling distant/remote visualization tools. Chicago plans to integrate all of its collaborative visualization tools with back ends that build on existing efforts in the efficient execution of the visualization pipeline [22, 23].
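
The following small sketch illustrates one way such resource awareness might be expressed; the endpoint classes, memory figures, and policy are illustrative assumptions only:

# Assumed endpoint capabilities, in gigabytes of memory available for data.
ENDPOINT_MEMORY_GB = {"laptop": 1, "workstation": 8, "tiled_display": 32}

def choose_pipeline(endpoint, dataset_gb):
    """Run the pipeline locally when the endpoint can hold the data; otherwise
    fall back to a remote visualization server that streams rendered imagery."""
    capacity = ENDPOINT_MEMORY_GB.get(endpoint, 0)
    return "local" if dataset_gb <= capacity else "remote_render_server"

print(choose_pipeline("laptop", 500))      # remote_render_server
print(choose_pipeline("workstation", 2))   # local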

2.9 Next-generation Multiscale Earth System Models

2.10 Distributed Group Development of Frameworks, Tools, Models, and Agents

2.11 Executing and Managing Simulation Processes

2.12 Knowledge Systems: Ontologies for the Geosciences

Increasingly, new knowledge products are stored in and used through distributed computing resources. Within a particular discipline, specific storage formats and transmittal protocols are often used. Problems quickly arise when trying to understand interactions among complex or interdisciplinary systems, something that is increasingly common in both education and scientific research. Suddenly, the discipline-specific storage formats and protocols become an additional barrier to opening the information up to another area using modern computer technology. Even today, the products educators and scientists seek are based in complex structures and/or contain or rely upon large volumes of reference materials, (often undocumented) computer code, visual representations, and/or numeric data. In these cases, the information is only useful if the software used to access it accurately represents the structures and underlying assumptions. Even though the connectivity and capability of people using computers via networks continue to improve, and an increasing amount of information is available digitally, the two have not been brought together in a persistent way so that additional knowledge can be built upon them and captured.

Large-scale ontologies are becoming an essential component of many applications, including standard search (such as Yahoo and Lycos), e-commerce (such as Amazon and eBay), configuration (such as Dell and PC-Order), and government intelligence (such as DARPA's High Performance Knowledge Base (HPKB), Rapid Knowledge Formation (RKF), and Agent Markup Language (DAML) programs).

While ontologies are fairly common in e-commerce, various expert and recommender systems, and even Geographic Information Systems, significant usage within the context of the geosciences is fertile ground for research.

Reusable ontologies are becoming increasingly important for tasks such as information integration, knowledge-level interoperation, and knowledge-base development. Ontolingua [] provides a set of tools and services to support the process of achieving consensus on common shared ontologies by geographically distributed groups using web-based tools. These tools make use of the worldwide web to enable wide access and provide users with the ability to publish, browse, create, and edit ontologies stored on an ontology server. Users can quickly assemble a new ontology from a library of modules. Ontolingua utilizes open standards such as KIF and OKBC and has been used for years in academic and commercial settings. The Ontolingua Server may be accessed through its URL [].

Ontologies are becoming so large that it is not uncommon for distributed teams of people with broad ranges of training to be in charge of ontology development, design, and maintenance. Standard ontologies (such as UN/SPSC) are emerging as well, which need to be integrated into large application ontologies, sometimes by people who do not have much training in knowledge representation. This process has generated the need for tools that support broad ranges of users in (1) merging ontological terms from varied sources, (2) diagnosing the coverage and correctness of ontologies, and (3) maintaining ontologies over time. Chimaera [] is an ontology environment aimed at supporting these tasks.

Within the context of the prototype KEG, we envision several important areas for using ontologies: representing the semantic meaning of data, recommending tools, mining data and information, and facilitating the use of the KEG by a diverse community. Building upon best-of-breed knowledge technologies and research efforts, the primary research challenges lie in developing geoscience-specific ontologies, addressing scale and pruning algorithms, and addressing knowledge usage in an environment of high uncertainty.
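
As an illustrative sketch of the tool-recommendation use, the fragment below maps variables to quantity kinds and kinds to candidate representations; all terms and tool names are hypothetical placeholders rather than an agreed KEG vocabulary:

# Variable -> quantity kind -> recommended representations (all hypothetical).
QUANTITY_KIND = {
    "sea_surface_temperature": "scalar_field_2d",
    "wind": "vector_field_3d",
}
RECOMMENDED_TOOLS = {
    "scalar_field_2d": ["color-mapped plan view", "zonal-mean plot"],
    "vector_field_3d": ["streamlines", "isosurface of wind speed"],
}

def recommend(variable):
    """Follow variable -> kind -> tools; fall back to a generic view."""
    kind = QUANTITY_KIND.get(variable)
    return RECOMMENDED_TOOLS.get(kind, ["generic table view"])

print(recommend("wind"))  # ['streamlines', 'isosurface of wind speed']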

It is important to note that there are myriad other enticing possibilities that could be explored, including spatiotemporal feature detection, algorithm selection, visual representation recommendation, and the dynamic construction of cooperative networks of components. As the meta-framework and knowledge components are integrated, we expect the overall environment to evolve into community infrastructure that fundamentally enables another generation of research in computational science and geoscience.

2.13 The IT Challenges: Scalability, Overall Integration, IT Research

3 Research Design and Methods

The prototype Knowledge Environment for the Geosciences proposed here is composed of coordinated and tightly coupled IT research activities. Here we elaborate upon these coordinated activities and the prototype concept.

3.1 The Concept

The prototype Knowledge Environment for the Geosciences proposed here is composed of three coordinated and tightly coupled IT research activities: the Interaction Portal (IP), the Knowledge Framework (KF), and the Multi-scale repository (MS). These activities are arranged in a multi-tiered architecture. The upper tier is where users interact with the environment; the lower tier represents the primary sources of information to be built and used; and the middle tier provides the glue to mediate between the upper and lower tiers. It is the middleware for knowledge encapsulation, discovery, recording, and sharing for large-scale distributed and collaborative research work. Each element in the architecture is designed to be highly distributed and scalable.
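
The following schematic sketch (illustrative only; all class and holding names are assumptions) shows the intended mediation pattern, including the symmetric flow in which analysis products are deposited back through the KF into the repository:

class MultiScaleRepository:
    """Lower tier: the primary holdings (contents here are placeholders)."""

    def __init__(self):
        self._holdings = {"ENSO_run_042": "coupled ocean-atmosphere output"}

    def fetch(self, key):
        return self._holdings.get(key)

    def deposit(self, key, value):
        self._holdings[key] = value

class KnowledgeFramework:
    """Middle tier: mediates requests and records them as searchable knowledge."""

    def __init__(self, repository):
        self.repository = repository
        self.usage_log = []

    def request(self, key):
        self.usage_log.append(("fetch", key))
        return self.repository.fetch(key)

    def publish(self, key, product):
        self.usage_log.append(("deposit", key))
        self.repository.deposit(key, product)

class InteractionPortal:
    """Upper tier: where the user works; results flow back down symmetrically."""

    def __init__(self, kf):
        self.kf = kf

    def analyze(self, key):
        data = self.kf.request(key)
        result = "analysis of " + str(data)
        self.kf.publish(key + "_analysis", result)  # product returns to the MS
        return result

portal = InteractionPortal(KnowledgeFramework(MultiScaleRepository()))
print(portal.analyze("ENSO_run_042"))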

3.2 IT Research Challenges

This proposal features many IT research challenges in fully addressing the complexity and interdisciplinary nature of stored and presented information and knowledge. They include:

- Provide a robust and scalable information technology education and research environment that facilitates the processes of learning and researching.

- Validate applications of object-oriented and knowledge systems engineering methods in the large-scale geosciences domain.

- A data model that extends and integrates data semantics (e.g. references from real fields back to the mathematical equations underlying simulations and references from real and text fields back to a semantic network of terms in physics or more specialized disciplines).

- Hypermedia...

- Resource description...

3.3 KEG Definitions

The Interaction Portal (IP): The IP is where the community member connects with the knowledge environment. One model for the interaction portal is that of a knowledge-enabled, next-generation problem-solving environment [e.g. PYTHIA]: a powerful collection of tools, components, and interfaces that operate seamlessly with one another, enable a broad range of research activities, and can be deployed across a range of end-user portals. Layered upon the KF and targeted initially at the MS application domain, the IP will realize a prototype next-generation portal for IT and domain research for the geosciences. The IP must support a wide variety of human activities, including finding, accessing, combining, processing, analyzing, and visualizing up to terascale data objects, developing and evaluating model components and tools, collaborating with other researchers, and drawing from and contributing to the knowledge base. Several levels of access must be addressed, ranging from individual web-based browser access up to collaborative distributed group interactions in a sophisticated tiled, project environment (e.g. the AccessGrid). In order to address educational, research, and multi-disciplinary impacts assessment usage, the IP will also be aimed at a range of levels of user sophistication, from student to scientist.

The Multi-Scale repository (MS): The scientific cornerstone of KEG is the investigation of multi-scale Earth system characteristics, including analysis, simulation, and prediction. This concept is a significant generalization of the current standard methods of Earth system research, and is composed of a set of matched and coupled collections of data and numerical component models depicting the climate system. The multi-scale concept expands this structure of coupled individual components to linked, embedded hierarchies of data and models within each component module. This multi-scale structure will permit the investigation of the most uncertain aspects of selected geophysical systems: numerical simulations, sub-grid-scale parameterizations, etc. It will allow us to address an expanding list of scientific questions that involve interactions between disparate scales in space and time [Ghil85; Ghil87]. Such questions are at the core of many of the most societally relevant research problems to be addressed in the geosciences in the next 5-10 years.

The Knowledge Framework (KF): The KF is a layered set of services with well-defined access and application programming interfaces; it features a highly distributed set of methods, classes, tools, and components, and is a fusion of knowledge and object representation. Each element of the KF utilizes a protocol and content specification for passing data, metadata, and knowledge among the KF elements and with those in the IP and MS layers. The framework will support the modeling, visualization, analysis, software development, display, and collaboration aspects of the environment.

3.4 Goals, Requirements, and Characteristics

KEG itself is conceived as a symmetric architecture, unlike most conventional web-based architectures, in which the flow of information is mostly one way from the information layer (e.g. web pages) through the middleware (e.g. web server) to the user (e.g. web browser). The KF will mediate the representation of knowledge as it passes back and forth between the MS and IP layers and is the key to implementing the prototype KEG. Viewed from the MS, the KF is a toolkit of successively layered software classes and services for assembling components of, or an entire, set of (physical) multi-scale resources (datasets, models, etc.), which is the planned ultimate prototype of the KEG proposed here. As a result of this requirement, elements of the KF must function in both “input” and “output” modes.

Attention will be given to establishing and maintaining integrity of the captured knowledge, resource distribution and utilization, load balancing and error handling. An important diagnostic or validation step for the knowledge framework components is their publication. This will be the basis for debate and benchmarking leading to their credibility, quality and acceptance. The KF will allow for an immense variety of applications and interfaces built upon open language and protocol standards and allow data, metadata and knowledge sources to be added and accessed/served. Ultimately, users can gain needed resources, codes, data and metadata efficiently without having to know details about storage or location.

The KF will use traditional object concepts of encapsulation and polymorphism that allow discipline-specific features to be hidden from users [Booch94; Taylor98], together with high levels of abstraction, which can be largely discipline-independent [Hibbard98]. When accessed from the IP or MS layers, a particular component would take its discipline-specific form. The KF therefore will allow interdisciplinary scientific progress on large scales as well as be able to scale across the spectrum of user requirements, from K-16 to senior researchers.
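
A minimal sketch of this object-oriented idea is given below; the field classes and sample values are hypothetical and serve only to show how a KF service can remain discipline-independent:

from abc import ABC, abstractmethod

class GeoField(ABC):
    """Discipline-independent view of a gridded quantity."""

    @abstractmethod
    def units(self):
        """Physical units of the quantity."""

    @abstractmethod
    def sample(self, lat, lon):
        """Value of the quantity at a point."""

class AtmosphericTemperature(GeoField):
    def units(self):
        return "K"

    def sample(self, lat, lon):
        return 288.0  # placeholder value

class OceanSalinity(GeoField):
    def units(self):
        return "psu"

    def sample(self, lat, lon):
        return 35.0  # placeholder value

def summarize(field, lat, lon):
    """A KF-style service that never needs to know which discipline it serves."""
    return "{} {}".format(field.sample(lat, lon), field.units())

print(summarize(AtmosphericTemperature(), 40.0, -105.0))  # 288.0 K
print(summarize(OceanSalinity(), 0.0, -140.0))            # 35.0 psu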

3.5 Design

A preliminary architectural design was developed for the prototype KEG during the preparation of this proposal and is depicted in Figure PKEG1. While this design utilizes six tiers, or layers, and will be used to discuss important technologies for this proposal, the design must remain iterative and coupled to end-user requirements and the actual development processes. A formal capabilities and design document will be developed in the first year of the project and used to establish the software design.

Figure PKEG1.

We will utilize UML capable tools for the detailed design, code stub generation, reverse engineering, etc. In addition, we will hold modeling sessions with domain experts early in the project to formalize the functional requirements and characteristics.

3.6 Architecture and Enabling Frameworks

3.6.1 Detailed description of the architecture

Starting from the preliminary KEG architecture shown above, we now focus on the KF elements in more detail and describe them, their provisional protocols, proposed content specifications, some functional characteristics, and how they may fit together. This section deals with the specific use of, and reasoning for, particular technologies, as well as what needs to be developed.

3.6.1.1 IP layer

The IP must support a wide variety of human activities including finding, accessing, combining, processing, analyzing, and visualizing terascale data objects, developing and evaluating model components and tools, collaborating with other researchers, and drawing from and contributing to the knowledge base. Several levels of access must be addressed, ranging from individual web-based browser access up to collaborative distributed group interactions in a sophisticated tiled, project environment. In order to address educational, research, and multi-disciplinary impacts assessment usage, the IP is aimed at a range of levels of user sophistication - from student to scientist.

The IP necessarily builds upon the layered fabric of the KF, as a collection of complementary and interoperating components, some of which have analogs in extant information technology. Powerful access and interaction with the knowledge base is a central element of the IP. The knowledge base for this effort will be the Multi-Scale repository (MS), a distributed store of simulation code components, simulated and observed data, tools, analyses, commentary, documents, images, movies, and even virtual collaborative experiences (e.g. AccessGrid sessions).

The ability to share, browse, study, and even experience high-level knowledge holdings will enable distributed collaborative IT and domain research at a new level. The IP will present an environment for software design and code development for all levels of the KEG, including the model development and validation processes and be accessible to discipline and information scientists, algorithm specialists and software engineers. The concept will encapsulate SourceForge-type models, including in-portal execution of analysis, visualization, models, and collaboration components, etc.

Data analysis and visualization are vital to the process of developing understanding and, ultimately, new knowledge. One of our goals is to enable geographically distributed researchers to effectively hypothesize and subsequently explore, analyze, and compare large, complex geosciences data from their home locations. The geosciences application domains addressed in this effort pose formidable challenges. Most already involve terabyte-class datasets, many growing to petabyte-class by mid-decade. Present-day studies of the Earth system are not only growing rapidly in scale/resolution, but also in complexity, and include oceans, rivers, atmosphere, sea ice, vegetation, and biogeochemistry. Future research, incorporating a multiplicity of nested grid scales, presents logical data complexity beyond anything encountered before. We propose to address these areas by combining distance visualization techniques, multi-resolution approaches, access to powerful data fusion capabilities, and metadata/knowledge-base awareness.

Another goal of this endeavor is to develop prototype tools that facilitate saving an exploration, a powerful concept aimed at addressing ‘Do you see what I see?’ questions among geographically distributed researchers. The AccessGrid (see [AccessGrid]) is a rapidly developing technology aimed at facilitating group-to-group interaction among geographically distributed people.  Access to the collaboration services in the KF layer for this effort will provide a next-generation realization of the AccessGrid fused with the IP and other problem solving environments. This effort requires extensions to the AccessGrid framework so that prototype visual analysis tools function in a distributed group mode. In addition, we propose to extend the framework such that AccessGrid experiences may be recorded, annotated, and deposited in the knowledge base as a record of human experience and knowledge exchange. These experiences will be searchable in the same way that other knowledge holdings are.

The Interaction Portal will serve as a model for a next-generation user environment for geosciences research and will offer a springboard for future development. The knowledge and data search capability will be a quantum leap beyond what is currently possible, and will address major shortcomings in current community capabilities. We envision this being a major step towards the concept of access to a Semantic Web [Berners-Lee98] in the context of geosciences research. The new software development environment will revolutionize how next-generation analysis, visualization and models are developed and will be redeployable for similar efforts at other research centers and universities.

3.6.1.2 KF1 layer

The KF1 layer is a set of interaction and discovery services including special and general-purpose modules.

User Interface services:

 

Collaboration services:

Supporting the complex mix of activities that constitute a scientific research endeavor requires collaboration systems that extend beyond simply sharing screens. A collaborative environment must provide facilities for creating a shared workspace where work can be performed both as a group and individually over an extended period of time. While working simultaneously as a group, scientists may interact via synchronous collaboration tools. Synchronous collaboration extends across a range of technologies, from shared displays (pixel sharing) to shared data or visualization streams to shared applications (event sharing). Each of these has its own cost/benefit tradeoffs. Display sharing technologies will be used to support collaboration in the IP. In the KF, we will be concerned with application sharing and support for data stream sharing. Scientists will also work asynchronously, contributing work when others in the group are not necessarily available and online. Tools for asynchronous collaboration include discussion lists, calendars, announcements, e-mail lists, and folders to organize resources. They also include the ability to create, modify, and annotate shared resources.

For the collaboration system of the KF, we will develop a middleware infrastructure for collaboration systems.  This middleware will provide frameworks that will allow the collaboration system to be customized to the needs of the user and the user's task.  These frameworks will include an application framework, a user management framework, and a resource framework.

The application framework will provide components to register, locate, invoke, and share applications.   We will develop an API through which applications can access sharing services.  The framework will also provide wrappers through which sharing may be added to applications which are not built for that purpose.  To make it possible for the user to find applications to add to their portal, there must be metadata defined for each application and that metadata must be registered in a directory. Where possible, we will make use of appropriate metadata and directory standards.
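
A minimal sketch of such a register/locate/invoke cycle follows; the metadata fields, application name, and API shape are assumptions for illustration, not a specification of the framework:

class ApplicationRegistry:
    """Hypothetical directory of applications and their descriptive metadata."""

    def __init__(self):
        self._directory = {}  # name -> (metadata, entry_point)

    def register(self, name, metadata, entry_point):
        self._directory[name] = (metadata, entry_point)

    def locate(self, **query):
        """Names whose metadata matches every key/value in the query."""
        return [name for name, (meta, _) in self._directory.items()
                if all(meta.get(k) == v for k, v in query.items())]

    def invoke(self, name, *args, **kwargs):
        _, entry_point = self._directory[name]
        return entry_point(*args, **kwargs)

registry = ApplicationRegistry()
registry.register("timeseries_plotter",
                  {"category": "visualization", "shareable": True},
                  lambda data: "plotting {} points".format(len(data)))
print(registry.locate(category="visualization"))     # ['timeseries_plotter']
print(registry.invoke("timeseries_plotter", [1, 2, 3]))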

The user management framework will provide directory services for storing user attributes and group affiliations.  This directory will be available to all applications to allow for user identification, presence, and security.  The resource framework will interface with the KEG cataloging service and KF/MS interface services to provide search, location, storage, and retrieval of resources.

The collaboration system provides the problem-solving environment for KEG.  As such, it must allow the user to define the workflows that support a particular task. 

The application framework will monitor and log all user activity in a manner that provides security and privacy.  The reason for this is twofold. First, it allows researchers and system developers to mine this data for information on how the collaboration system is being used.  This information can then be fed back into the development process to improve and expand the system in the most cost-effective ways.  Second, it allows agents to be written which can look at the data and make inferences based on what a given user, or a group of users in aggregate, has done. For example, if a user is setting up a workflow for a simulation run, the agent may detect the type of activity being performed and provide assistance in formulating steps to be included.  Or, it may be able to suggest collaborators that are working on similar problems or datasets or other resources that may be of use.

We plan to expand the information resource discovery to use the Resource Description Framework [RDF] format, which facilitates ways that “knowledge” about artifacts can be stored and processed by agents, and directly captures knowledge about the objects that exist in the KF. This knowledge includes the information stored in the MS layer and knowledge about the KF objects themselves (“metadata”). The task here is to develop information-integration tools that can be used to express meta-information and constraints about entities in an information environment, and agents (or processes) that can exploit that information to provide relevant services to users of the environment. For instance, it should be possible to provide a service in the knowledge framework that tells you how many objects can be used to manipulate a particular type of information in the MS layer.
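
The sketch below illustrates this with RDF-style (subject, predicate, object) statements and the example query just mentioned; the property and object names are hypothetical:

# RDF-style statements about KF objects and MS holdings (names are hypothetical).
TRIPLES = [
    ("kf:VolumeRenderer", "kf:canManipulate", "ms:GriddedModelOutput"),
    ("kf:ZonalMeanTool",  "kf:canManipulate", "ms:GriddedModelOutput"),
    ("kf:StationBrowser", "kf:canManipulate", "ms:SurfaceObservations"),
    ("ms:CCM_run_17",     "rdf:type",         "ms:GriddedModelOutput"),
]

def objects_that_manipulate(info_type):
    """All KF objects declaring that they can manipulate the given MS type."""
    return [s for s, p, o in TRIPLES
            if p == "kf:canManipulate" and o == info_type]

print(len(objects_that_manipulate("ms:GriddedModelOutput")))  # 2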

The ANSI/NISO Z39.50 [Z3950a, Z3950b] standard defines a way for elements in a layered framework to communicate for the purpose of information retrieval. Z39.50 makes it easier to use large information databases by standardizing the procedures and features for searching and retrieving information in a distributed environment. The particular (Open Source) implementation of Z39.50 we plan to utilize is from Island Edge Research [IER] which forms queries and responses in XML. Z39.50 is attractive because it is simple and an established open standard. As the project evolves we will evaluate new protocol standards as they emerge and are adopted.
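
The exact query and response schema of that implementation is not reproduced here; the following sketch merely suggests, with hypothetical element names, the general shape of an XML-encoded search request passed between KF elements:

import xml.etree.ElementTree as ET

# Build an illustrative XML search request (element and attribute names are assumptions).
query = ET.Element("searchRequest", database="keg-holdings")
term = ET.SubElement(query, "term", use="title")
term.text = "tropical cyclone landfall"
ET.SubElement(query, "maximumRecords").text = "10"

print(ET.tostring(query, encoding="unicode"))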

Development services:

Within the prototype KEG, and especially driven by the formation of hypotheses (problems) in the IP layer, it will be necessary to provide a portion of the framework for developing new analyses, visualizations, model runs, classifications or evolutions of new knowledge areas, or even a combination of these and/or other services. This service is the one typically not provided within a framework environment; instead, these capabilities are developed externally and then inserted. We argue that to successfully KF-enable these developments, they should arise from within the KF.

Since a variety of methods already exist, some just need to be encapsulated within the framework, some will need to have service interfaces or APIs developed (e.g. in adapting a SourceForge-type capability), and still others will need to be invented or developed. Particular services for development within the KF may include: makefiles, validation, debugging, ontology creation, evolution, agents, code segments or class libraries, or even an entire model. We envision that at least a new XML-derivative language may be required for internal expression of development requirements between the KF elements, in essence a ‘DevML’, which we will consider in the early design phases.
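
Purely as an illustration of what a DevML request might express (every tag and value below is a hypothetical placeholder, since no such language yet exists):

import xml.etree.ElementTree as ET

# A hypothetical DevML request; every tag and value is a placeholder.
DEVML_REQUEST = """<devml version="0.1">
  <request kind="build">
    <target>cloud-microphysics-component</target>
    <step>generate-makefile</step>
    <step>compile</step>
    <step>run-validation-suite</step>
  </request>
</devml>"""

# A KF element receiving the request could parse out the ordered steps.
steps = [s.text for s in ET.fromstring(DEVML_REQUEST).iter("step")]
print(steps)  # ['generate-makefile', 'compile', 'run-validation-suite']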

3.6.1.3 KF2 layer

Visualization services:

The scalability requirements for the KEG dictate the need for a flexible set of KF visualization elements. In effect, this means they must accommodate everything from interactive visualization of terascale data sets residing on clusters to simple or on-the-fly summary representations for a quick look; in the ideal case, both of these extremes should be accessible to educators and scientists alike. In addition, these elements must cooperate, i.e. interface, with the services in the KF1 layer.

Important community resources like VisAD [HibbardXX] lead the way in powerful, flexible visualization component frameworks, and can be complemented with script-based languages such as NCL [Alpert9X] and IDL [RSI9x] and with state-of-the-art volume visualization techniques.

As part of the KF2 visualization service, we will create new implementation classes for VisAD data and display component interfaces to support visualization of terascale data sets distributed across the nodes of large processor clusters. By suitable interfacing to data services (mining, management, etc. in lower KF layers), the data components may be “tiled” across a distributed set of resources (e.g. compute nodes), typically (but not necessarily) via a requested partition of the data set's spatial domain. Display components from the KF1 layer will access data components visualized locally on each cluster node, so that display information rather than data is transmitted to user displays.
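
The sketch below illustrates the tiling idea in schematic form (it is not VisAD code; the grid size, node count, and partitioning rule are arbitrary assumptions):

def tile_domain(nx, ny, nodes):
    """Split an nx-by-ny grid into `nodes` contiguous column bands."""
    band = nx // nodes
    return [(node, (node * band, min((node + 1) * band, nx)), (0, ny))
            for node in range(nodes)]

def render_tile(node, x_range, y_range):
    """Pretend each node renders its tile locally and returns only imagery metadata."""
    return {"node": node,
            "pixels": (x_range[1] - x_range[0]) * (y_range[1] - y_range[0])}

tiles = tile_domain(nx=4096, ny=2048, nodes=8)
imagery = [render_tile(*tile) for tile in tiles]
print(sum(piece["pixels"] for piece in imagery))  # covers all 4096 x 2048 points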

In addition, VisAD display components have a built-in collaboration capability so that any display can be used to construct a collaborating display on another system.  For example, VisAD display components can run in web applets and in virtual reality. Using collaboration services, VisAD can be integrated with the AccessGrid for “collaborative distributed group interactions” that include access via web browsers, workstations and virtual reality.

All visualization services must be able to access distributed data holdings, and we propose to utilize a Data Access Protocol, initially based on the DODS DAP [DODS], which is discussed in more detail in a later section.
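As a minimal sketch of how a visualization element might pull a subset of a remote data set over a DODS/DAP-style interface, the code below appends a constraint expression to a dataset URL and fetches the result. The server URL and variable name are placeholders, and the exact suffixes and constraint syntax vary by server, so this is an illustration of the request pattern rather than a definitive client.

```python
# Minimal sketch of requesting a subset from a DODS/OPeNDAP-style server by
# appending a constraint expression to the dataset URL. URL, variable name,
# and response suffix are assumptions for illustration.
from urllib.request import urlopen

def fetch_subset(dataset_url: str, variable: str, rows: slice, cols: slice) -> str:
    """Request variable[rows][cols] as ASCII text from a DAP server."""
    constraint = f"{variable}[{rows.start}:{rows.stop}][{cols.start}:{cols.stop}]"
    with urlopen(f"{dataset_url}.ascii?{constraint}") as response:
        return response.read().decode("utf-8")

# Example (hypothetical server and variable):
# text = fetch_subset("http://example.org/dap/sst.nc", "sst", slice(0, 10), slice(0, 20))
```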

Analysis services:

It is common for analyses of data at different stages of the research, and sometimes education, process to be scattered across standalone codes, model code, and even visualization codes. We propose to extract these functional requirements into a set of flexible, deployable analysis services that can be utilized in concert with other elements within the KF2 layer and exchanged with the KF1 layer.
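One way to picture such a deployable analysis service is a registry of named analysis operations that any KF element can invoke uniformly, instead of each code carrying its own copy. The registry mechanism and routine names below are assumptions for illustration, not a committed design.

```python
# Illustrative registry of analysis operations that standalone codes, models,
# and visualization tools could share. Names are hypothetical.
from typing import Callable, Dict, Sequence

ANALYSES: Dict[str, Callable[[Sequence[float]], float]] = {}

def register(name: str):
    """Decorator that makes an analysis routine available as a KF service."""
    def wrap(fn: Callable[[Sequence[float]], float]):
        ANALYSES[name] = fn
        return fn
    return wrap

@register("mean")
def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)

@register("anomaly")
def anomaly(values: Sequence[float]) -> float:
    """Departure of the most recent value from the series mean."""
    return values[-1] - mean(values)

if __name__ == "__main__":
    print(ANALYSES["anomaly"]([1.0, 2.0, 3.0, 7.0]))
```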


Assimilation services:

In this context, assimilation includes data and knowledge fusion. This service within the prototype KEG will provide advanced capabilities for integrating spatially and temporally disparate data for use within the other KF layers and ultimately for presentation to a user. The service accepts configuration information, such as control over the aggregation and resolution of the temporal domain.
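To illustrate the kind of configuration-driven temporal aggregation this service would perform, the sketch below averages time-stamped observations into bins of a requested width. The configuration key is a placeholder; real assimilation would of course involve far richer fusion than simple binning.

```python
# Sketch of configuration-driven temporal aggregation: time-stamped values are
# averaged into bins of a requested width. Configuration keys are hypothetical.
from collections import defaultdict
from typing import Dict, List, Tuple

def aggregate(obs: List[Tuple[float, float]], config: Dict[str, float]) -> Dict[float, float]:
    """obs is a list of (time, value); config["bin_seconds"] sets the resolution."""
    width = config["bin_seconds"]
    bins: Dict[float, List[float]] = defaultdict(list)
    for t, value in obs:
        bins[(t // width) * width].append(value)
    return {start: sum(v) / len(v) for start, v in sorted(bins.items())}

if __name__ == "__main__":
    samples = [(0, 1.0), (1800, 2.0), (3600, 4.0), (5400, 6.0)]
    print(aggregate(samples, {"bin_seconds": 3600}))   # hourly means
```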

Modeling services:

Numerical simulation models are among the primary research tools for the multi-scale science problems of interest. Just as individual datasets and discrete knowledge holdings have not previously been brought together within a framework such as that proposed here, neither have ensembles or components of models. NCAR, together with its research community, is developing partnerships to address new model frameworks, especially in the area of Earth System modeling. For the KEG, and for this specific KF service, we will leverage those activities and focus them on both the multi-scale requirements we have identified and the accompanying framework of other services.

Particular capabilities for this service include:
1) scientific utilities;
2) advanced data structures that build on the basic data structures within the framework, such as differential operators, pointwise operators, and scale coupling;
3) scale-aware physics parameterizations;
4) model-specific math libraries, such as linear and nonlinear solvers and fast spectral transforms;
5) model components to advance dynamical equation sets applicable to a given scale;
6) model validation and diagnostic capability;
7) model steering and interfaces for real-time summary extraction; and
8) a capability to build production model runs, including input and output streams, independent of target platforms and operating environments.
It is likely that some new XML-derivative language will be required to implement many of these capabilities, for example a new ModelML. A small example of the kind of differential operator such data structures would provide is sketched below.
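As a minimal illustration of item 2) above, the following computes a centered first derivative on a uniform 1-D grid. It stands in for the richer, multi-dimensional operator classes the modeling service would actually provide; the function name and signature are illustrative only.

```python
# Minimal sketch of a differential operator built on a basic data structure
# (a uniform 1-D grid): a centered first derivative with one-sided differences
# at the boundaries. Real KF operators would generalize to multiple dimensions,
# staggered grids, and varying resolution.
from typing import List

def d_dx(field: List[float], dx: float) -> List[float]:
    """Centered finite-difference approximation of d(field)/dx."""
    n = len(field)
    deriv = [0.0] * n
    for i in range(1, n - 1):
        deriv[i] = (field[i + 1] - field[i - 1]) / (2.0 * dx)
    deriv[0] = (field[1] - field[0]) / dx          # one-sided at the boundaries
    deriv[-1] = (field[-1] - field[-2]) / dx
    return deriv

if __name__ == "__main__":
    print(d_dx([0.0, 1.0, 4.0, 9.0], 1.0))   # derivative of x**2 sampled at 0..3
```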

4 KF3 layer

Mining services:

Research extensions of existing 'mining' [Hinke97; DATA99; Ramac00] and semantic analysis techniques will be required to address the aggregation and organization of information and knowledge resources as they pass through the KF portion of the KEG. This means mining in a sense well beyond ‘data’. Algorithms, templates, and agents for application to the multi-scale repository must be assembled and/or developed and in turn stored as part of the knowledge holdings as well as possibly kept active in a server process ready to perform common operations in response to requests from KF2 layer services. A comprehensive and scalable data model design will be required for use within this service and we intend to leverage existing efforts from the investigator team for the [DODS], [Globus], [VisAD] and [STT] projects.

The reality that large, effectively immovable datasets are part of today's multi-scale repository may also require custom order processing (using mining, subsetting, and other methods) as a means of reducing data volume for visualization and other purposes.

Since most existing mining systems are themselves end-to-end, an important first step is to migrate the data processing services and resulting mining applications into a form that supports the integration of highly distributed, interoperable software components and adds a high degree of scalability. This process will free these services from the tightly coupled, single-server approach currently in use. The decoupling will also allow other KF layer services to utilize processing and data resources that are widely dispersed, both in physical location and in hardware and software implementation.
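The sketch below illustrates this decoupling: a mining operation (here, a simple threshold-based subsetting) is expressed as a stateless request/response function, so it can be deployed on whatever node holds the data rather than inside one monolithic server. The request and response fields are assumptions for illustration.

```python
# Sketch of a mining operation recast as a stateless, relocatable service:
# the request names a variable and a predicate, the response carries only the
# matching subset, so little data leaves the data's host. Names are hypothetical.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MiningRequest:
    variable: str
    threshold: float

@dataclass
class MiningResponse:
    variable: str
    matches: List[float]

def run_mining(request: MiningRequest, data: Dict[str, List[float]]) -> MiningResponse:
    """Return only the values of the requested variable exceeding the threshold."""
    values = data.get(request.variable, [])
    return MiningResponse(request.variable,
                          [v for v in values if v > request.threshold])

if __name__ == "__main__":
    local_data = {"sst": [271.5, 290.2, 302.7, 285.0]}
    print(run_mining(MiningRequest("sst", 289.0), local_data))
```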

At the same time, full access to the KF metadata and knowledge components may be associated with the mining services. To make this extension, another facet of this research will be to extend traditional search mechanisms with advanced search capabilities (e.g. latent semantic analysis) in order to provide a new level of domain and problem specific knowledge and information finding capability. The mining services will support the fusion of data and knowledge sets at different spatial and temporal resolutions as provided in the KF2 layer.

As with all KF elements, mining services will incorporate pertinent standards and emerging standards, such as the Simple Object Access Protocol (SOAP), to ensure the highest possible level of interoperability with other components of the KF, to achieve the most open access to data and services possible, and to allow the mining services to be deployed in specific end-to-end applications. Inherent in this type of distributed environment will also be the ability to access heterogeneous data sets using a common data protocol, together with emerging metadata technologies such as the Earth Science Markup Language [ESML].

Cataloging and Tracking services:

In order to effectively support the needs of geoscientists, the knowledge framework will need to provide cataloging and tracking services. For instance, scientists will need to be able to classify both objects and relationships in the knowledge framework. That is, it will be important to assert not only that a data set contains information generated by an atmospheric model, but also that it is related to another data set generated by a previous version of that model. This type of relationship is important for evaluating the evolution of a model: Is the new version generating more accurate data? Is the new version generating more compact data? Since this is just one type of valuable relationship, the knowledge framework will clearly need to support multiple types of relationships, and since a significant portion of knowledge framework objects will not be stored in a Web-accessible form, it will need to be able to create, maintain, and evolve relationships between arbitrary data formats.

One important problem to which cataloging services can be directly applied is relating the files produced by an ensemble model run to one another. Ensemble model runs involve multiple models running in tandem, producing data that is in some sense synchronized; e.g., these ten files each represent data produced by models that were processing the same region of space at the same time but addressing different concerns (such as atmospheric conditions, sun intensity, and ocean temperature). Ensemble model runs typically generate thousands of files, and currently nothing relates these files to one another other than ad hoc techniques such as file naming conventions. What is needed is a service that can group the output of ensemble model runs into conceptual wholes linked by relationships that indicate the semantic groupings of the files: e.g., these ten files represent model data for Boulder, Colorado on April 23, 2001 at 2 PM, and they are linked to these ten files that represent the same model data for Boulder on April 23rd at 1 PM, and so on.
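As a concrete sketch of grouping ensemble output into conceptual wholes, the code below keys each file on its (region, valid time) metadata so that the ten-files-per-time-step relationship described above is recorded explicitly rather than implied by file names. The metadata field names are assumptions for illustration.

```python
# Sketch: group ensemble output files by (region, valid_time) so that files
# produced "in tandem" are linked explicitly instead of by naming convention.
# Metadata field names are hypothetical.
from collections import defaultdict
from typing import Dict, List, Tuple

def group_ensemble(files: List[Dict[str, str]]) -> Dict[Tuple[str, str], List[str]]:
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for f in files:
        groups[(f["region"], f["valid_time"])].append(f["path"])
    return dict(groups)

if __name__ == "__main__":
    catalog = [
        {"path": "atm_001.nc", "region": "Boulder, CO", "valid_time": "2001-04-23T14:00"},
        {"path": "ocn_001.nc", "region": "Boulder, CO", "valid_time": "2001-04-23T14:00"},
        {"path": "atm_000.nc", "region": "Boulder, CO", "valid_time": "2001-04-23T13:00"},
    ]
    for key, members in group_ensemble(catalog).items():
        print(key, "->", members)
```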

With respect to tracking services, a critical capability is a recommender service for geoscience data. Recommender services are becoming more common with the rise of the World Wide Web (although recommender techniques and technology predate the Web []) and are typified by online booksellers' ability to generate statements like “People who bought this book also purchased these titles...”. In the knowledge framework, it will be important to derive patterns of use over existing data sets. Thus, if scientists always access data set B after retrieving data set A, the knowledge framework should automatically inform scientists who access data set A about the existence of data set B.
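A minimal sketch of this usage-pattern idea appears below: from an ordered access log, count how often one data set is retrieved immediately after another and recommend the most frequent successor. A production recommender would be far more sophisticated; the data structures here are purely illustrative.

```python
# Sketch of a co-access recommender: count "B follows A" pairs in an access
# log and suggest the most common successor of a data set.
from collections import Counter, defaultdict
from typing import Dict, List, Optional

def build_successors(log: List[str]) -> Dict[str, Counter]:
    followers: Dict[str, Counter] = defaultdict(Counter)
    for first, second in zip(log, log[1:]):
        followers[first][second] += 1
    return followers

def recommend(followers: Dict[str, Counter], dataset: str) -> Optional[str]:
    if dataset not in followers:
        return None
    return followers[dataset].most_common(1)[0][0]

if __name__ == "__main__":
    access_log = ["A", "B", "A", "B", "A", "C"]
    print(recommend(build_successors(access_log), "A"))   # -> "B"
```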

We intend to explore techniques for providing cataloging and tracking services based on an approach known as open hypermedia. Open hypermedia [Østerbye, et al., 1996] is a technology that supports the creation, navigation, and management of relationships among a heterogeneous set of applications and data formats. In the same way that the hypermedia services of the World Wide Web [Berners-Lee, 1996] enable navigation among related Web pages, open hypermedia allows users to navigate among the documents of integrated applications. Open hypermedia, however, provides a higher level of hypermedia sophistication than that found on the Web. In particular, open hypermedia is not limited to a read-only environment; users have the power to freely create links between information regardless of the number and types of information involved. Three technical characteristics of open hypermedia systems enable this ability: advanced hypermedia data models, externally managed relationships, and third-party application integration.

Advanced hypermedia data models: The advanced hypermedia data models of open hypermedia systems allow them to create links between multiple sets of documents. Links can be typed, and meta-information about the documents being related can be stored. These models are more powerful when compared to the hypermedia data model of the Web (as provided by HTML), which consists of one-way pointers embedded inside documents.

Externally managed relationships: Externally managed relationships imply that links are stored separately from the content being related. This approach, again, contrasts with the approach taken by the Web, in which link information is embedded directly within the content of HTML documents. The benefits of externally managed links include the ability to link to information located on read-only media or within material protected by read-only access permissions, and the ability to associate multiple sets of links with the same information.

Third-party application integration: Finally, third-party application integration allows hypermedia services to be provided in the tools used by users on a daily basis. This approach also contrasts with the Web where typically the only applications that provide hypermedia services are Web browsers. With the open hypermedia approach, end-user applications provide hypermedia services directly, allowing users to create relationships over a wide range of applications and data formats. The open hypermedia field has developed a wide range of techniques for integrating legacy systems into an open hypermedia framework [Davis, et al., 1994; Whitehead, 1997; Wiil, et al., 1996] while new applications can be designed and implemented to directly support and provide open hypermedia services.
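The sketch below illustrates the first two characteristics: links are typed, may relate more than two anchors, and live in a repository separate from the documents they connect. The record layout is an assumption for illustration, not Chimera's actual data model.

```python
# Sketch of an externally managed, typed, n-ary link: anchors point into
# documents by identifier and selector, and the link itself is stored in a
# repository separate from the content. Not Chimera's actual data model.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Anchor:
    document_id: str      # any object in the knowledge framework
    selector: str         # e.g. a byte range, variable name, or time slice

@dataclass
class Link:
    link_type: str        # e.g. "generated-by-previous-version-of"
    anchors: List[Anchor]
    metadata: Dict[str, str] = field(default_factory=dict)

class LinkRepository:
    """Stores links outside the documents they relate."""
    def __init__(self):
        self._links: List[Link] = []

    def add(self, link: Link) -> None:
        self._links.append(link)

    def links_for(self, document_id: str) -> List[Link]:
        return [link for link in self._links
                if any(a.document_id == document_id for a in link.anchors)]
```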

Open hypermedia systems provide the extensible infrastructure that enable hypermedia services over an open set of data, tools, and users [Østerbye, et al., 1996]. The first open hypermedia system appeared in 1989 [Pearl, 1989] and the field has been evolving rapidly ever since. Østerbye and Wiil present a good history and overview of open hypermedia systems in [Østerbye, et al., 1996]. There are many open hypermedia systems in existence. Examples include Chimera [Anderson, 1997; Anderson, 1999a; Anderson, 1999b; Anderson, et al., 2000a; Anderson, et al., 1994; Anderson,   et al., 2000b], Microcosm [Hall, et al., 1996], DHM [Grønbæk, et al., 1993], HOSS [Nürnberg, et al., 1996], and HyperDisco [Wiil, et al., 1996]. For the purposes of the proposed research, we have selected Chimera to help support our prototype development efforts. Chimera specializes in supporting the creation of relationships across heterogeneous data sources [Anderson, et al., 2000b] and is, thus, well suited to the domain of the proposed research.

Techniques in open hypermedia can be usefully applied to the infrastructure design of the knowledge framework. Not all objects in the knowledge framework will be Web-based (in fact, maybe only a small percentage of objects will be Web-accessible) and so standard Web linking technologies will not be sufficient for capturing the relationships that exist between these objects. An open hypermedia system, however, can easily provide the infrastructure to establish and maintain relationships between the heterogeneous data formats of the knowledge framework. The fundamental research challenge will involve adapting open hypermedia infrastructure to the scale of the knowledge framework (which is envisioned to contain thousands to tens of thousands of objects). Past work in open hypermedia has addressed scalability in the number of relationships over a comparatively small set of objects [Anderson, 1999a; Anderson, 1999b]. Now we must scale the technology so it can support the creation of hundreds of thousands (if not millions) of relationships over a large set of objects.

In addition, research must be performed to provide distributed relationship management capabilities to each object of the knowledge framework. It should be possible, for instance, for a user of the interaction portal to import two knowledge framework objects into a design environment and establish a relationship between them (since both objects would be “hypermedia-enabled”), regardless of where these objects actually live. Moreover, the link could potentially be made available outside of the design environment that created it. This will require distributed hypermedia link repositories that live in the knowledge framework and replicate link information. This characteristic implies an additional design challenge: how should link consistency be characterized and enforced in the knowledge framework? That is, if a link is deleted, how long will it take before all copies of that link are removed from all of the distributed link repositories?

Notification services:

A key functionality of the services layer of the knowledge framework is notification. For instance, scientists will want to be notified when interesting things occur in the framework, such as a particular paper being published, or when a running model has discovered an interesting feature that may be useful for steering future model runs. In addition, ensemble model runs may make use of notification services to coordinate the different elements running in the ensemble. Educators or students may want notification of events of topical or social interest.

One approach to notification services is event publish/subscribe technology (also known as event notification technology). The basic notion involves a set of publishers and a set of subscribers. The publishers publish specifications of particular event types they may generate at some point in the future. Subscribers can subscribe to a particular set of published events and will receive notifications each time an event is generated for one of their subscriptions. For instance, a Web server may generate an event each time one of its pages is updated. Subscribers interested in those events would subscribe to the “page updated” event published by the Web server. Then, each time a page is updated, the Web server creates a new instance of the “page updated” event configured with information about the newly updated page and hands the event to an event notification system. The event system takes this instance and sends it to each subscriber interested in the “page updated” event. A key benefit enabled by event notification technology is that the publishers and subscribers remain completely ignorant, and thus completely independent, of one another. A publisher simply generates an event, and the event notification system takes care of sending the event to all of the interested recipients.
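The publish/subscribe pattern just described can be captured in a few lines of Python, as sketched below: publishers and subscribers know only the broker, never each other. This is a toy in-process broker for illustration, not Siena.

```python
# Toy in-process event broker illustrating publish/subscribe decoupling:
# publishers hand events to the broker, which forwards them to every subscriber
# of that event type. An illustration only, not Siena.
from collections import defaultdict
from typing import Callable, Dict, List

class EventBroker:
    def __init__(self):
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            handler(payload)

if __name__ == "__main__":
    broker = EventBroker()
    broker.subscribe("page updated", lambda e: print("notified:", e["url"]))
    broker.publish("page updated", {"url": "http://example.org/index.html"})
```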

Event notification technology has been in existence since the mid-eighties [Reiss, 1990] but, until very recently, has been restricted to supporting publishers and subscribers on local area networks. The University of Colorado has been active in pushing the scope of this technology to wide-area networks (WANs) with the Siena project [Carzaniga, 2000]. Siena's support for Internet-scale event notification makes it an ideal candidate for developing notification services for the knowledge framework. In its current form, Siena can support the construction of experimental prototypes that explore the use of notification services between a small set of knowledge framework entities. However, Siena is not currently able to scale to the thousands of objects envisioned for a production P-KEG environment.

As such, basic computer science research will be needed to expand the scope of event notification technology from small sets of publishers and subscribers distributed across a WAN to an environment that can consist of thousands (if not more) of publishers and subscribers distributed across both LANs and WANs, similar to the proposed P-KEG environment. For Siena to reach this level of scalability, new research will be needed in low-level distributed event protocols and in advanced event routing algorithms.

An additional area of research for event notification technology within the context of the P-KEG environment is support for domain-specific event patterns. For instance, while scientists will most likely be interested in coarse-grained events from models such as “model run started” and “model run ended”, they may also be interested in finer-grained events such as “hurricane detected” or “sea temperature rising” that might be generated, for instance, by atmospheric and oceanic models. Furthermore, they may be interested in particular event sequences: if each time an oceanic model generates three “sea temperature rising” events, an atmospheric model generates a “hurricane detected” event, then scientists may want to be notified when such a pattern occurs. In particular, we believe that very powerful analytical tools can be built on top of a notification service that provides advanced event pattern services. We intend to explore such services in the P-KEG environment using Siena as a development platform.
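A minimal sketch of the event-sequence idea follows: a detector that raises a composite notification after three consecutive “sea temperature rising” events. The class and event names are illustrative; a real pattern service would support far richer sequence and timing expressions.

```python
# Sketch of a simple event-pattern detector: after three consecutive
# "sea temperature rising" events, emit a composite notification.
# Illustrative only; pattern languages in a real service would be much richer.
from typing import Callable

class RisingSeaTempPattern:
    def __init__(self, notify: Callable[[str], None], needed: int = 3):
        self.notify = notify
        self.needed = needed
        self.count = 0

    def on_event(self, event_type: str) -> None:
        if event_type == "sea temperature rising":
            self.count += 1
            if self.count >= self.needed:
                self.notify("possible hurricane precursor: "
                            f"{self.needed} consecutive sea-temperature-rise events")
                self.count = 0
        else:
            self.count = 0   # the pattern requires consecutive events

if __name__ == "__main__":
    detector = RisingSeaTempPattern(print)
    for event in ["sea temperature rising"] * 3:
        detector.on_event(event)
```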

5 KF4 layer

In the present KEG design, the KF4 layer mediates the flow of data, metadata, and knowledge from distributed repositories to the services in the KF3 layer and above. As such, these capabilities are referred to as systems rather than services, since they must accommodate a variety of protocols and content formats, and different sizes and frequencies of requests, in a reliable and scalable manner.

Knowledge Management System:

................
................
