Data Warehousing—



Data WarehousingÑ

Advantages for Decision Support

|Data Management Technologies |[pic] |

|126 E. Wing St. Suite 225 | |

|Arlington Heights, IL 60004 | |

|847-394-8387 Fax 847-509-9096 | |

Copyright © 1994 Data Management Technologies.

All Rights Reserved. Printed in U.S.A.

No part of this document may be reproduced, stored in retrieval systems, or transmitted in any form or by any means, electronic, mechanical, photocopy, recording, or otherwise, without prior written permission of Data Management Technologies. Data Management Technologies assumes no responsibility for any errors that may appear in this document.

No responsibility is assumed for the use or reliability of software or hardware described in this document.

TIME MACHINE is a registered trademark of Data Management Technologies.

BusinessObjects is a trademark of Business Objects, Inc.

OpenVMS is a trademark of Digital Equipment Corporation.

Express is a trademark of Information Resources, Inc.

Information Warehouse is a trademark and IBM is a registered trademark of International Business Machines, Inc.

Microsoft Access and Windows NT are trademarks and Microsoft is a registered trademark of Microsoft Corporation.

NIELSEN is a registered trademark of AC Nielsen Company.

PowerBuilder is a trademark of PowerSoft Corporation.

Knowledge workers depend on information to determine the direction and set the strategy for their organizations. To succeed, they must be able to locate and retrieve data rapidly, incorporate it in their analyses, and use it to make sound decisions. The accessibility and reliability of the data at their disposal can mean the difference between competitive advantage and mediocre management.

In fields as diverse as financial services, manufacturing, retail and packaged goods, telecommunications, and government and aerospace, knowledge workers rely on data to shape and support both near- and long-term plans. To do so effectively, they need a responsive data warehouse that provides straightforward, efficient access to data that reflects trends in sales, customer support, production, distribution, and other functions that can determine whether a company succeeds or fails.

Strategic Data Has Distinct Requirements

Unlike users of transaction-oriented data, knowledge workers typically do not insert, delete, or update information frequently. Instead, they consult databases for the statistics they need and retrieve information relevant to their business analyses. They then use strategic, or decision-support, data to analyze and determine corporate direction.

Another type of data, transaction-oriented, or operational, data is used to support the operations of a business. One example is data entered, manipulated, and retrieved through a transaction processing (TP) or on-line transaction processing (OLTP) system. This type of data is updated frequently and accessed by multiple users. Table 1 compares the two types of data.

[pic]

Because knowledge workers interact with databases in a unique way, they require database management systems unlike those that manage operational systems. For that reason, knowledge users, as well as database administrators responsible for managing critical corporate data, should carefully compare data warehousing technologies and select the system that best supports their business goals.

Because operational and strategic data are used differently, they must embody different characteristics. For example, an operational system must:

¥ Support frequent and efficient insert, update, and delete operations.

¥ Support multiple user updates.

¥ Provide record locking, deadlock detection/prevention, and two-phased commit.

¥ Maintain available space for future insertion of records.

¥ Provide consistent indexing performance for updating.

¥ Allow for efficient retrieval of individual records/transactions (by providing a minimum rate of transactions per second).

¥ Demonstrate data integrity Ñ in journaling, checkpointing, recovery, and backout.

A strategic system, on the other hand, must:

¥ Handle loading of new time period, old time period unload, and compress functions efficiently.

¥ Minimize database fragmentation.

¥ Provide efficient management of multi-spindle tables.

¥ Support single-user read/write but eliminate record locking, deadlock detection, and resource commit support.

¥ Store data as efficiently (densely) as possible.

¥ Support high-performance indexing and multiple indexing methods (through speed and storage efficiency).

¥ Support data drill down.

¥ Measure performance in elapsed query response time.

¥ Retrieve multiple records per physical I/O.

¥ Offer fast unload source database, fast backup and restore.

¥ Minimize downtime caused by the failure of disk drives.

¥ Eliminate journaling and other write support.

Data Warehouse Systems Offer Advantages

Conventional, or ad hoc, decision-support systems use familiar data extraction tools such as spreadsheets to query operational systems. Recently, graphical user interface (GUI) tools such as PowerSoft CorporationÕs PowerBuilderª have begun providing knowledge workers with powerful tools to paint decision templates. These tools permit the extraction of data through complex Structured Query Language (SQL) queries without requiring that the user know which language is being used. Conventional systems call on an operational server, designed for transactions, to support both operational clients and strategic clients and their respective loads on the system. For conventional systems, this often proves burdensome.

Knowledge workers may be unaware and unconcerned about the burden their usage imposes on an operational system. Intent on completing their analyses, they do not consider how joining multiple million record tables can encumber an active transaction system. Their undisciplined use of tools can cause headaches for MIS managers, however, especially when it results in system slow-down or other types of disruptions. Sophisticated query systems such as BusinessObjectsª address this problem with a governor available to the database administrator, which systematically enforces query time or extent limitations. Query views can similarly restrict usage.

Fortunately, a much more efficient and less restrictive solution is available. The data warehouse is designed to allow knowledge workers to utilize decision support tools efficiently. By adding a warehouse server to the network, MIS managers can provide knowledge workers with a system tailored to the way they retrieve and use information.

Periodically, a warehouse server must be loaded with data from the operational server. It can thus act as a buffer, allowing knowledge worker clients to run their queries directly against the data warehouse. This scheme ensures that there is little or no risk of decision-support queries interfering with the operational system. To further optimize the decision-support system and increase the productivity of knowledge workers, a data warehouse system can be customized to take full advantage of GUI query tools, 4GLs, and proprietary business analysis tools.

By exploring a data warehouse, an organization can provide an efficient means for knowledge workers to retrieve data while maintaining an operational system solely for its intended purpose. Knowledge workers should use report writers and other decision-support systems to query an operational system only when the ad hoc queries are trivial, or when they require only summary data that the system can generate. In such cases, however, the operational system must have sufficient capacity to process decision-support loads, or its primary functions will be disrupted.

TIME MACHINE¨

TIME MACHINE, from Data Management Technologies (DMT), is a data warehouse system with a balanced data warehouse architecture based on such industry standards as SQL and ANSI C. A relational data base management system (RDBMS), TIME MACHINE is designed to support time-based strategic data such as the information UPC scanners collect at points of sale.

By off-loading processing from costly mainframes and distributing it over the network, TIME MACHINE places key decision-making data on inexpensive servers. TIME MACHINE also supports user scalability requirements. It is available on a range of servers, including Intel 386/486 platforms in PC LANs, RISC-based UNIX servers in TCP/IP networks, and MVS mainframes. Moreover, because TIME MACHINE is written in ANSI C, it is highly portable.

The major advantage that TIME MACHINE offers over conventional data warehouse products based on multipurpose DBMs is its time-based design, which allows for storing each significant time interval in its own file. TIME MACHINE also provides performance gains over conventional RDBMs used as data warehouses, particularly over databases storing more than 5 GB of data.

The balanced design of TIME MACHINE results in a high performance system. For example, it permits rapid loading, so that new files can be added without requiring the reloading and organization of the entire database. The design also enhances the performance of SQL queries, and dramatically reduces disk space requirements. In addition, TIME MACHINEÕs balanced design allows the system to support a wide range of computers, including IBM MVS, VM, UNIX variants (such as IBMÕs AIX, SunOS, and HP-UX) as well as PC DOS and MS Windows.

TIME MACHINE fits well in multi-vendor distributed production systems because it can load quickly and perform data translation. In fact, TIME MACHINE data can be accessed by any program or end-user tool that generates SQL (such as Microsoft ACCESS). Special interfaces for 4GLs as well as analysis tools that have proprietary APIs (such as Expressª from INFORMATION RESOURCES, Inc.) are also available.

DMTÕs TIME MACHINE technology is licensed to AC NIELSEN¨ for sale to packaged goods retailers and manufacturers. It also is available in solutions by other remarketers. It is incorporated into complete solutions by UNISYS Corporation on their series 2200 computers.

Departmental and Enterprise Data Warehouses

Data warehousing systems often are implemented as two dissimilar configurations so that each can address the needs of either departments or entire enterprises. The functional, or departmental, data warehouse (DDW) allows an organization to deploy data in a way that is set up for a key function such as customer support. In this way, a DDW enables a department to spot trends quickly and make appropriate adjustments. Using the TIME MACHINE server, a DDW can be set up cost-effectively using a PC LAN or a RISC-based UNIX server system for a low cost-per-client.

Alternatively, an enterprise-wide data warehouse system (EDW) spans several functions, often from source systems in disparate geographic locations. An EDW must be available on all major PC LAN, RISC-based UNIX, VAX/VMS, VM, and MVS platforms to meet interoperability and scalability goals set for system performance, as well as hold down IS costs, and support corporate strategy. By doing so, it allows users to take advantage of new technology while protecting their investments in hardware, software, and training.

More specifically, data warehouse loading rates must be fast to meet specific time windows for off-loading the data. In addition, system managers must be confident that they can estimate, schedule, control, and perform these loads with predictable speed. They do not want to have to reorganize their data warehouse systems to accommodate the addition of new data. To serve an enterprise, an EDW also must be able to communicate with and load data from different transaction RDBMS platforms.

Figure 1 highlights some of the differences between these two configurations.

[pic]

Conventional Relational DBMS and Data Warehouse Systems

The characteristics and requirements of systems processing operational data differ from those used for processing strategic data. OLTP relational DBMSs have specific shortcomings that make them weak choices for handling decision-support data.

Conventional B-tree Indexing

Relational systems for transaction processing use b-tree or b*-tree indexing. These are excellent selections for retrieving or inserting single records. Continuous values that change over time, however, cannot be indexed effectively with these methods. To index the entire database for decision support, a b-tree index would grow several sizes larger than the database tables themselves. As a result, single-file databases, prevalent in the UNIX environment, would grow enormous and could cause disk thrashing.

Query Optimizers

Decision support concerns ad hoc queries, which are, by definition, unpredictable. Query optimizer schemes based on predicting precedence of terms to process first have very limited chance of success.

Join Parsing

OLTP joins use a single value that is then combined with those of other tables. This approach is inadequate for decision support, which requires that entire ranges of values to be joined with other tables.

Time-Based Data A Critical User Concern

In Building the Data Warehouse, William Inmon defines data warehousing as a:

Ò¥ Subject-oriented

¥ Integrated

¥ Nonvolatile

¥ Time-variant

collection of data in support of management decisions.Ó

The data warehouse is subject-oriented because it contains exhaustive information about a specific subject and is well connected with the overall information processing system. It meets the requirement for non-volatility by preserving information that is not updated. Finally, the data warehouse is also time variant, supporting the modeling and processing of time-based data.

Managers and users of decision-support systems want to be able to model and process time-based data. The conventional RDBMS at their disposal, however, lack storage structures and index options optimized to do so. A typical operational RDBMS, for example, contains its entire database in a single file with a single index. When the data for another week is added, the entire database must be reorganized, and often reloaded, which can bring normal processing to a standstill.

Using conventional SQL parsers can also provide less-than-adequate results because they are not set up to handle time-based data. In processing joins of time series data, SQL parsers consume large portions of the CPUs resources, which can halt normal processing. Thus, data warehouse systems based on transactional RDBMSs must be detuned to process queries one record at a time. This moves the burden to the I/O system, an unacceptable solution. Furthermore, the recommendation to use faster RISC or parallel processing hardware merely results in faster runaway queries.

A More Advantageous Alternative

A data warehouse system must support efficient use of disks through file allocation and compression. It cannot restrict the size of ad hoc queries. It also must rely on standards such as SQL, and support all standard machine architectures. DMT has taken these requirements into account when taking their balanced approach to data warehouse design. Their approach emphasizes speed in loading, updating, and query. Most importantly, DMT has designed TIME MACHINE to avoid the tradeoff of one of these requirements at the expense of another.

As a result, the TIME MACHINE data warehouse does more than merely off-load strategic data processing from the transaction system. TIME MACHINE treats each significant time period as its own file with its own index struc-ture. By taking advantage of DASD loading and indexing and the application of query theory, the TIME MACHINE relational data warehouse system effectively deals with time-based data. TIME MACHINE is based on tenets of good read performance, providing:

¥ Minimal disk head movement.

¥ Multiple records read per I/O.

¥ Indexed sequential reads.

¥ Data-independent indices.

TIME MACHINE Performs Fast Initial and Incremental Loads

Initial load of a TIME MACHINE data warehouse from an operational database is performed by the DBMS UNLOAD command, which writes fixed length flat files to the data warehouse. Unlike the conventional DBMS-based data warehouse, in which the tables and indices are built during the load, TIME MACHINE first unloads data from the source and then builds the index and tables. This approach greatly enhances load speed and is especially valuable when distributed operational databases are available only for short periods.

Data Translation from Multi-Vendor Sources

TIME MACHINE provides a SOURCE option at table load time that translates input data, including the data types of IBM 370, IBM RS/6000, IBM PC, Digital VAX, Hewlett-Packard HP9000, and SUN Microsystems systems, to the NATIVE data type of the machine on which TIME MACHINE is running. In addition, a TARGET option is used to convert the type of the data a TIME MACHINE table unloads to that of whichever architecture the database administrator specifies.

These conversion capabilities are powerful features for organizations with computing environments made up of multi-vendor systems. They allow knowledge workers to work with heterogeneous data warehouses without having to worry about the source of the data. Such transparency is a key goal for the open environments to which most companies are migrating.

File Structure Enhances Loading and Query Speed

Transaction-oriented RDBMSs traditionally are contained in one file. A TP system (for example, a file) with a million records (perhaps one per product, or customer) would be huge. For a data warehouse, however, a million records does not pose a challenge. TIME MACHINE breaks data into significant time chunks, or partitions, and then stores each chunk in its own two files, one that holds the index and the other the data. The size of the index file is typically 1% that of the data file. These small indices can be held in memory easily. For example, if a query looking at weekly periods over the course of a year were run, TIME MACHINE would load 52 small indices into memory instead of one large one that might have to be swapped in and out as processing occurs. Figure 2 provides a typical data warehouse table creation statement.

[pic]

Multiple Index Types Offer Flexibility and High Query Performance

TIME MACHINE supports three types of indices: partitioned, clustered, and inverted. A partitioned index creates separate files for records with discrete values for the partitioned attribute. Typically, the partitioned value is a unit of time. For example, with a partitioned index, a user could look at weekly results by consulting one of 52 partitions for a given year.

Partitioned indices offer the fastest retrieval of the three indexing types as well as other advantages in retrieval and storage. By allowing users to access only those files needed, retrieval time is minimized. Storage space is used efficiently because the key value need not be stored in the partition; it defines the partition. In addition, those portions of the database that are not accessed frequently can be compressed to as little as 25% of their original size. Figure 3 represents the partitioned index.

[pic]

Clustered indices store all records sharing the same value contiguously, thereby reducing the number of operations required to retrieve data and, ultimately, reducing data retrieval time. This approach allows TIME MACHINE to access a group of records simultaneously, and not be limited to accessing individual ones.

An example of clustered indices might be those pertaining to a companyÕs product. These indices are created at initial load time. Each weekÕs partition may contain a record for each product. The table is then sorted and stored based on this index. TIME MACHINE breaks these clustered records into I/O blocks optimized for the underlying hardware, typically 32 KB. If each record were 32 bytes, theoretically, TIME MACHINE could fetch as many as 1000 records per I/O, after the index file had been read into memory. Typically, TIME MACHINE reads 150 records per I/O, whereas a transaction system reads one (or fewer) per I/O.

A third index structure, often used for market data, is called an inverted list index. This type of indexing creates a list of all the possible values for the indexed attribute, with a pointer to each associated record. Multiple inverted list indices add to the size of the database. Consequently, data retrieval from this type of index is not as efficient as that for clustered indices. For that reason, TIME MACHINE normally runs a query against a clustered index.

Figure 4 depicts the inverted index.

[pic]

TIME MACHINE also allows for the definition of an optional primary key, which sorts the records when the table is loaded to produce faster queries by retrieving these records, often obviating the sort step in joins.

Table Updates Isolated from Rest of Database

When updates occur, they can usually be limited to adding a new file or partition to a single table. This reduces the need for time-consuming data warehouse reorganizations or rebuilds. If a new product is introduced and enters the data warehouse at week 10, the index structure is changed from this point only, without affecting the rest of the table. Because TIME MACHINE does not enforce referential integrity, new tables can be added and processed without requiring a database reorganization.

Management of the database is always a concern. However, with TIME MACHINE, a new time interval is stored in its own partition, with its own index. The existing database does not need to be reorganized or reloaded. This is more desirable than having to reorganize the data when updates occur as required by competitive data warehouse managers or conventional RDBMSs.

Data Preserved in Third Normal Form

In order to speed queries, several new data warehouse query tools require that the data warehouse be denormalized from 3rd normal form. Adding redundancy can increase the size of the data warehouse and can also add to the complexity of the Data Base AdministratorÕs (DBA) task. Additional metadata management tools may be required. None of these additional activities or products are required by TIME MACHINE, which accepts fully normalized data and loads and stores it efficiently.

TIME MACHINE Permits Reduced Storage Capacity

In a DB2 environment, for example, TIME MACHINE yields typically a 50% space savings. This savings occurs without optional compression being enabled and is available because only the data, and not the space that the DB2 indices require, are stored by TIME MACHINE. With compression, users have reported a 200 GB DB2 database stored in a 25 GB TIME MACHINE data warehouse. This compression can be applied selectively by the DBA to tables or parts of tables, for example, on less frequently accessed time periods. TIME MACHINE uses a standard algorithm for compression. (In a RISC environment, compression may actually provide an I/O performance advantage as a result of the availability of CPU cycles necessary to perform decompression.)

TIME MACHINE SQL Implementation Enhances Query Speed

Without Pre-defined Joins

TIME MACHINE features an optimizer that takes typical queries that result in joins and parses them into a nested select that obviates the joins. Figure 5 depicts how this query is parsed. The system does this without user intervention, allowing knowledge workers to use GUI tools familiar to them. They need not become SQL experts in order to use TIME MACHINE effectively.

[pic]

Unlike other data warehouse products, TIME MACHINE does not require pre-defined joins. An option is to sort the table on a primary key when the table is created. Doing so enhances query speed for frequent queries by eliminating the sort step in joins.

TIME MACHINE Uses ANSI SQL

In addition, users can use effectively without learning or programming SQL extensions. DMT believes that SQL is best left for data queries and that analysis tools should perform mathematical operations on data. Thus TIME MACHINE allows the warehouse server to remain a file or I/O server while applications run on the client.

TIME MACHINE Uses Operating System Utilities

Unlike competitive systems TIME MACHINE uses standard operating system utilities and features for security and backup. System managers need not buy and learn new software to have a fully functioning system.

TIME MACHINE Schema Option

TIME MACHINE does not require users to log into an assigned database to as a security measure. Thus, by using the TIME MACHINE ANSI schema option, a user can perform multi-schema queries. Consequently, different databases with the same table names can be defined and queried.

Parallel Processing Capability

TIME MACHINE includes an option that allows queries to be paralleled in UNIX and Windows NT operating systems. In single processor systems, this provides a performance advantage for I/O, because each query can be run against a specific data partition. In multiple processor systems, overall load and query performance is increased because of the additional processors.

Open APIs

DMT documents the TIME MACHINE API so that other vendors can write an optimized interface for their product. Other systems have unpublished or proprietary APIs and require that users count on future vendor cooperation to interface their systems. Table 2 lists many of the systems TIME MACHINE runs on or interfaces with more are being added. [pic]

Data Import and Export to Tools Like Expressª

One of the barriers to the rapid deployment of data warehouse systems is their lack of integration with 3GLs, 4GLs and other sophisticated data analysis tools. TIME MACHINE, however, which provides several features that contribute to integration with data analysis tools can:

¥ Use fixed format sequential files with a known record layout as a table without adding them to the data warehouse.

¥ Unload a table from TIME MACHINE to a file with a user-specified format.

¥ Unload data in a flat file fixed record format for use by 3GLs and 4GLs.

¥ Unload data to IRIÕs Expressª for further analysis.

Conclusion...

Today business is conducted in an ever-changing, increasingly competitive climate. Only those companies with ready access to strategic data will successfully guide their organizations through the challenging times ahead.

Companies can insure that their knowledge workers have the best tools available and the means to manage themÑ with a data warehouse. A well-designed data warehouse separates strategic data from the data that supports day-to-day operations, and thus empowers decision makers by providing fast, easy access to vital information.

The TIME MACHINE data warehouse offers a balanced, time-based design that supports corporate computing goals for interoperability, scalability, and portability. This standards-based system allows companies to off-load mainframes and use less expensive servers. By cost-effectively arming knowledge workers with the data they need, TIME MACHINE provides the decision support to help companies in demanding global markets.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download