This is is an example for ITI confrence



Real Life Data Mart - Models Comparison

Mario Milicevic, Vedran Batos

Polytechnic of Dubrovnik

Cira Carica 4, 20000 Dubrovnik

CROATIA

Vedran Mornar

Faculty of Electrical Engineering and Computing, University of Zagreb

Unska 3, 10000 Zagreb

CROATIA

Abstract: During the implementation of Data Warehouse data is derived from OLTP normalized database and then translated into denormalized dimensional model. Basic parameters for evaluation of this process are ETL process speed, disk space consumption, query performances and user friendliness. Any of mentioned parameters could be favored with appropriate model.

Keywords: Data warehouse, Data mart, Dimensional modeling, Star schema

1 Introduction

Analyzed data is based on three years of data from OLTP database owned by tourist agency - more precisely from the part of database dedicated to excursions preparation and manipulation. For the purpose of better differentiating the models and better prediction of future behaviour, the quantity of data is enlarged for significant factor. The original and characteristic distribution of data is preserved during this process.

The total quantity of data is representing more then 7 000 000 transactions, i.e. more then 36 000 000 passengers - during the three years interval.

On the basis of transactional data Data Mart (DM) was built, representing the prototype of the future data warehousing (DWH) solution.

2 Initial transactional database

In analyzed case OLTP database is the only one source of date for DM - what became quit usual situation during the last few years. Only for minimal amount of input data, textual files or spreadsheets are taken into consideration.

The consequence of quoted is that the ETL procedures are being simplified, but still we must pay attention on their efficiency.

Transactional database is following relational model consistently – all relations are normalized (3NF), referential and domain integrities are accomplished where it is necessary, etc.

More then 100 relations (database tables) are used for storing data for preparation, booking and sale of the excursions.

3 Dimensional model

Analyzed DM is built around the most interesting event in the system - sale transaction. However, after taking into consideration real-life aspect of this case, it is clear that data gathered during the sale transaction are not always complete.

On the basis of users' requirements transaction is determined with ticket. Specific detail is the fact that ticket may be issued for more passengers. The consequence is faster sale process - what is very important because tickets are often being sold just before the excursion - even in the moment of embarkation into the vehicle.

All passengers an same ticket share the common attributes (attributes A):

o ticket S/N

o excursion (DIM)

o excursion date (DIM)

o exc.time (embarkation) (DIM)

o embarkation location (DIM)

o language (guide) (DIM)

o reseller (DIM)

o booking date (DIM)

o mode of payment (DIM)

o number of adults

o number of childs (3-12)

o number of infants (0-2)

o amount (EUR)

Foreign keys pointing to the corresponding primary key of the dimension table are marked (DIM).

As one can notice attributes like name, gender, age etc are not registered. It is acceptable because excursions are usually lasting within one day and corresponding money amounts are low.

With one transaction (ticket) it is possible to register more passengers (even to 100) - so this fact table isn't built on the at the lowest level of granularity possible.

However, there is one important exception: according to regulations set by custom authorities - for excursions abroad - additional data must be registered:

• first name

• family name

• date of birth

• gender

• passport number

• country of citizenship

... so additional attributes can be created (attributes B):

o age

o age group (ADL/CHD/INF)

o gender

o country of citizenship (DIM)

o native language (DIM)

o excursion price

Although only for 10% of total number of passengers additional attributes exist, this data is invaluable for different demographic analyses.

3.1 DM models

Three different models are analysed:

A) extended fact table

B) fact table with incomplete data

C) two independent fact tables

All models include same data - but same output reports will be accomplished with different SQL queries. Also, significant differences can be expected in space consumptions and queries duration.

extended fact table

This model can be found in the literature for a while, but it is used relatively rarely. Possible reason is in the fact that level of normalization is somewhat higher (two related fact tables with different grain), so some queries must use additional join operation. In this example two fact tables share same trans_ID - surrogate key that is representing the unique sale transaction number (ticket).

[pic]

Fig. 1. Data Mart – Model A

Extended fact table (PAX = passenger) contains additional data only for transactions where that data is available

fact table with incomplete data

[pic]

Fig. 2. Data Mart – Model B

Only one fact table is used - at the lowest level of granularity possible (data for single passenger). However, as noted before, only for 10% of passengers additional data is available - so in remaining 90% rows of table attributes like age or gender has value 'UNKNOWN' (NULL values are avoided).

two independent fact tables

Data is located in two independent fact tables - with different grain and number of rows. However, during the initial load and forthcoming updates same surrogate key trans_ID is maintained in both tables - but just for necessities of this analysis and for possibility of comparison with other models.

[pic]

Fig. 3. Data Mart – Model C

It is necessary to stress that chosen model will be completed with a few aggregation tables. That tables are omitted in this work because their influence on measured parameters is not essential.

Of course, assumption is that analyzed queries after all won't use that aggregation tables - because grain or method of aggregation is not adequate in observed case.

3.2 Space consumption analysis

All implementations of DWH technologies are linked with hard space consumption considerations. Often this is the crucial factor for selecting optimal design.

In this case study DM size is determined mostly (99%) by size of fact table(s) and accompanied indexes (the biggest dimension table has only 9000 rows).

|Model |Object |Rows |Size |

| | |(106) |(MB) |

|A |TICKET fact table |7,2 |664 |

|A |PAX fact table |3,2 |112 |

|A |INDEXES | |60 |

|A |Total: 836 |

|B |PAX fact table |36,5 |3226 |

|B |INDEXES | |147 |

|B |Total: 3373 |

|C |TICKET fact table |7,2 |664 |

|C |PAX fact table |3,2 |320 |

|C |INDEXES | |80 |

|C |Total: 1064 |

Table 1. Data Mart Size

Chart 1. Data Mart Size

It is evident that model B has the least efficient use of space - as the consequence of forcing the lowest level of granularity - even when all attributes are not known.

For DM size estimation bitmap indexes on FK in fact tables are taken into consideration - for example 8 indexes on TICKET fact table from model A.

Besides other advantages over the B-tree indexes (more about that later in paper) bitmap indexes need significantly less space:

Chart 2. Index Size Comparison

It is obvious that bitmap indexes occupy less DM space - i.e. for model B 147MB (bitmap) vs. 4560MB (B-tree).

However, additional fact must be mentioned in context of bitmap indexes:

o as expected - space needed for bitmap index is in direct correlation with indexed attribute cardinality. Even attribute with 9000 different values could be good index candidate - when corresponding table has few millions of rows;

o less expected (but quite logical) – amount of space used for bitmap index is depending significantly on distribution of indexed attribute - partially presorted data can reduce needed space significantly;

o less expected (but quite logical) - bit map indexes respond inadequately on insert or update of corresponding table - needed space is not increased proportionally - for example:

a) PAX fact table (model B) has 36 533 622 rows (3 years data). Bitmap index on column time_key occupy 14MB of space. Complex query using 4 bitmap indexes lasts 5s;

b) during the daily update of DM 130657 rows are being inserted into fact table (increase of 0,36%). After the insert bitmap index on column time_key occupy 40MB of space (increase of 185%). Query execution time is increased from 5s to 10s (100%).

Quoted facts are the reason for recreation of bitmap indexes after the massive inserts or updates.

3.3 Queries performances

Significant number of queries has been analyzed - with multiple joins of fact and dimension tables. Special attention is dedicated to comparison of the characteristics of bitmap and b-tree indexes. Both space consumption and query performances are analyzed before b-tree indexes are dropped.

Each query is executed several times in different environment to ensure that effect of cache memory (database buffers) is controlled. Presented values are the average value - because it is realistic assumption that each query will be executed more than once (maybe with minor changes in WHERE clause).

|Execution |Q1 |Q2 |Q3 |Q4 |Q5 |Q6 |

|Time |(s) |(s) |(s) |(s) |(s) |(s) |

|Model A |0,5 |10,0 |1,0 |50,0 |1,0 |25,0 |

|Model B |1,5 |20,0 |1,0 |80,0 |50,0 |5,0 |

|Model C |0,5 |10,0 |0,1 |50,0 |0,1 |2,5 |

Table 2. Query Execution Time

[pic]

Chart 3. Query Execution Time

Query Q1 and Q2 are using data from TICKET fact table (sale transaction level) and associated dimension tables. Queries Q3 to Q6 are using data also from PAX fact table (passengers level).

Queries have different selectivity - for example Q4 retrieves number of passengers in year 2002. - grouped by months, language and gender. In this case Cost Based Optimizer (CBO) estimates that there is not need to use available indexes. For distinction of Q4, query Q5 retrieve number of passengers grouped by dates and gender, but filtered with the precise conditions: year ‘2002’, month ‘8’, reseller ‘DWH tours’, language ‘English’, departure time ‘ ................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download