Automation of Candidate Extract and Load Process



DWH Material

Version 1.0

REVISION HISTORY

The following table reflects all changes to this document.

|Date |Author / Contributor |Version |Reason for Change |

|01-Nov-2004 | |1.0 |Initial Document |

|14-Sep-2010 | |1.1 |Updated Document |

Table of Contents

1 Introduction 4

1.1 Purpose 4

2 ORACLE 4

2.1 DEFINATIONS 4

NORMALIZATION: 5

First Normal Form: 5

Second Normal Form: 5

Third Normal Form: 5

Boyce-Codd Normal Form: 6

Fourth Normal Form: 6

ORACLE SET OF STATEMENTS: 6

Data Definition Language :(DDL) 6

Data Manipulation Language (DML) 6

Data Querying Language (DQL) 7

Data Control Language (DCL) 7

Transactional Control Language (TCL) 7

Syntaxes: 7

ORACLE JOINS: 9

Equi Join/Inner Join: 10

Non-Equi Join 10

Self Join 10

Natural Join 11

Cross Join 11

Outer Join 11

Left Outer Join 11

Right Outer Join 11

Full Outer Join 12

What’s the difference between View and Materialized View? 12

View: 12

Materialized View: 13

Inline view: 13

Indexes: 19

Why hints Require? 19

Explain Plan: 21

Store Procedure: 23

Packages: 24

Triggers: 24

Data files Overview: 26

2.2 IMPORTANT QUERIES 27

3 DWH CONCEPTS 29

What is BI? 29

4 ETL-INFORMATICA 53

4.1 Informatica Overview 53

4.2 Informatica Scenarios: 90

4.3 Development Guidelines 97

4.4 Performance Tips 101

4.5 Unit Test Cases (UTP): 103

5 UNIX 106

Detailed Design Document

Introduction

1 Purpose

The purpose of this document is to provide the detailed information about DWH Concepts and Informatica based on real-time training.

ORACLE

1 DEFINATIONS

Organizations can store data on various media and in different formats, such as a hard-copy document

in a filing cabinet or data stored in electronic spreadsheets or in databases.

A database is an organized collection of information.

To manage databases, you need database management systems (DBMS). A DBMS is a program that

stores, retrieves, and modifies data in the database on request. There are four main types of databases:

hierarchical, network, relational, and more recently object relational(ORDBMS).

NORMALIZATION:

Some Oracle databases were modeled according to the rules of normalization that were intended to eliminate redundancy.

Obviously, the rules of normalization are required to understand your relationships and functional dependencies

First Normal Form:

A row is in first normal form (1NF) if all underlying domains contain atomic values only.

• Eliminate duplicative columns from the same table.

• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

Second Normal Form:

An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form (1NF) and additionally:

• Does not have a composite primary key. Meaning that the primary key can not be subdivided into separate logical entities.

• All the non-key columns are functionally dependent on the entire primary key.

• A row is in second normal form if, and only if, it is in first normal form and every non-key attribute is fully dependent on the key.

• 2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key. An example is resolving many: many relationships using an intersecting entity.

Third Normal Form:

An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form (2NF) and additionally:

• Functional dependencies on non-key fields are eliminated by putting them in a separate table. At this level, all non-key fields are dependent on the primary key.

• A row is in third normal form if and only if it is in second normal form and if attributes that do not contribute to a description of the primary key are move into a separate table. An example is creating look-up tables.

Boyce-Codd Normal Form:

Boyce Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A row is in Boyce Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are already in BCNF.

Fourth Normal Form:

An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form (3NF) and additionally:

Has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a single one-to-many relationship.

ORACLE SET OF STATEMENTS:

Data Definition Language :(DDL)

Create

Alter

Drop

Truncate

Data Manipulation Language (DML)

Insert

Update

Delete

Data Querying Language (DQL)

Select

Data Control Language (DCL)

Grant

Revoke

Transactional Control Language (TCL)

Commit

Rollback

Save point

Syntaxes:

CREATE OR REPLACE SYNONYM HZ_PARTIES FOR SCOTT.HZ_PARTIES

CREATE DATABASE LINK CAASEDW CONNECT TO ITO_ASA IDENTIFIED BY exact123 USING ' CAASEDW’

Materialized View syntax:

CREATE MATERIALIZED VIEW EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW

REFRESH COMPLETE

START WITH sysdate

NEXT TRUNC(SYSDATE+1)+  4/24  

WITH PRIMARY KEY

AS

select * from HWMD_MTH_ALL_METRICS_CURR_VW;

Another Method to refresh:

DBMS_MVIEW.REFRESH('MV_COMPLEX', 'C');

Case Statement:

Select NAME,

(CASE

WHEN (CLASS_CODE = 'Subscription')

THEN ATTRIBUTE_CATEGORY

ELSE TASK_TYPE

END) TASK_TYPE,

CURRENCY_CODE

From EMP

Decode()

Select empname,Decode(address,’HYD’,’Hyderabad’,

‘Bang’, Bangalore’, address) as address from emp;

Procedure:

CREATE OR REPLACE PROCEDURE Update_bal (

cust_id_IN In NUMBER,

amount_IN In NUMBER DEFAULT 1) AS

BEGIN

Update account_tbl Set amount= amount_IN where cust_id= cust_id_IN

End

Trigger:

CREATE OR REPLACE TRIGGER EMP_AUR

AFTER/BEFORE UPDATE ON EMP

REFERENCING

NEW AS NEW

OLD AS OLD

FOR EACH ROW

DECLARE

BEGIN

IF (:NEW.last_upd_tmst :OLD.last_upd_tmst) THEN

-- Insert into Control table record

Insert into table emp_w values('wrk',sysdate)

ELSE

-- Exec procedure

Exec update_sysdate()

END;

ORACLE JOINS:

• Equi join

• Non-equi join

• Self join

• Natural join

• Cross join

• Outer join

➢ Left outer

➢ Right outer

➢ Full outer

Equi Join/Inner Join:

SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno=d.deptno;

USING CLAUSE

SQL> select empno,ename,job ,dname,loc from emp e join dept d using(deptno);

ON CLAUSE

SQL> select empno,ename,job,dname,loc from emp e join dept d on(e.deptno=d.deptno);

Non-Equi Join

A join which contains an operator other than ‘=’ in the joins condition.

Ex: SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno > d.deptno;

Self Join

Joining the table itself is called self join.

Ex: SQL> select e1.empno,e2.ename,e1.job,e2.deptno from emp e1,emp e2 where e1.empno=e2.mgr;

Natural Join

Natural join compares all the common columns.

Ex: SQL> select empno,ename,job,dname,loc from emp natural join dept;

Cross Join

This will gives the cross product.

Ex: SQL> select empno,ename,job,dname,loc from emp cross join dept;

Outer Join

Outer join gives the non-matching records along with matching records.

Left Outer Join

This will display the all matching records and the records which are in left hand side table those that are not in right hand side table.

Ex: SQL> select empno,ename,job,dname,loc from emp e left outer join dept d on(e.deptno=d.deptno);

Or

SQL> select empno,ename,job,dname,loc from emp e,dept d where

e.deptno=d.deptno(+);

Right Outer Join

This will display the all matching records and the records which are in right hand side table those that are not in left hand side table.

Ex: SQL> select empno,ename,job,dname,loc from emp e right outer join dept d on(e.deptno=d.deptno);

Or

SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+) = d.deptno;

Full Outer Join

This will display the all matching records and the non-matching records from both tables.

Ex: SQL> select empno,ename,job,dname,loc from emp e full outer join dept d on(e.deptno=d.deptno);

OR

SQL> select p.part_id, s.supplier_name

2 from part p, supplier s

3 where p.supplier_id = s.supplier_id (+)

4 union

5 select p.part_id, s.supplier_name

6 from part p, supplier s

7 where p.supplier_id (+) = s.supplier_id;

What’s the difference between View and Materialized View?

View:

Why Use Views?

• To restrict data access

• To make complex queries easy

• To provide data independence

A simple view is one that:

– Derives data from only one table

– Contains no functions or groups of data

– Can perform DML operations through the view.

A complex view is one that:

– Derives data from many tables

– Contains functions or groups of data

– Does not always allow DML operations through the view

A view has a logical existence but a materialized view has

a physical existence.Moreover a materialized view can be

Indexed, analysed and so on....that is all the things that

we can do with a table can also be done with a materialized

view.

We can keep aggregated data into materialized view. we can schedule the MV to refresh but table can’t.MV can be created based on multiple tables.

Materialized View:

In DWH materialized views are very essential because in reporting side if we do aggregate calculations as per the business requirement report performance would be de graded. So to improve report performance rather than doing report calculations and joins at reporting side if we put same logic in the MV then we can directly select the data from MV without any joins and aggregations. We can also schedule MV (Materialize View).

Inline view:

If we write a select statement in from clause that is nothing but inline view.

Ex:

Get dept wise max sal along with empname and emp no.

Select a.empname, a.empno, b.sal, b.deptno

From EMP a, (Select max (sal) sal, deptno from EMP group by deptno) b

Where

a.sal=b.sal and

a.deptno=b.deptno

What is the difference between view and materialized view?

|View |Materialized view |

|A view has a logical existence. It does not contain data. |A materialized view has a physical existence. |

|Its not a database object. |It is a database object. |

|We cannot perform DML operation on view. |We can perform DML operation on materialized view. |

|When we do select * from view it will fetch the data from base |When we do select * from materialized view it will fetch the data |

|table. |from materialized view. |

|In view we cannot schedule to refresh. |In materialized view we can schedule to refresh. |

| |We can keep aggregated data into materialized view. Materialized |

| |view can be created based on multiple tables. |

What is the Difference between Delete, Truncate and Drop?

DELETE

The DELETE command is used to remove rows from a table. A WHERE clause can be used to only remove some rows. If no WHERE condition is specified, all rows will be removed. After performing a DELETE operation you need to COMMIT or ROLLBACK the transaction to make the change permanent or to undo it.

TRUNCATE

TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUCATE is faster and doesn't use as much undo space as a DELETE.

DROP

The DROP command removes a table from the database. All the tables' rows, indexes and privileges will also be removed. The operation cannot be rolled back.

Difference between Rowid and Rownum?

ROWID

A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when it is removed from a table.'BBBBBBBB.RRRR.FFFF' where BBBBBBBB is the block number, RRRR is the slot(row) number, and FFFF is a file number.

ROWNUM

For each row returned by a query, the ROWNUM pseudo column returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1, the second has 2, and so on.

You can use ROWNUM to limit the number of rows returned by a query, as in this example:

SELECT * FROM employees WHERE ROWNUM < 10;

|Rowid |Row-num |

|Rowid is an oracle internal id that is allocated every time a new|Row-num is a row number returned by a select statement. |

|record is inserted in a table. This ID is unique and cannot be | |

|changed by the user. | |

|Rowid is permanent. |Row-num is temporary. |

|Rowid is a globally unique identifier for a row in a database. It|The row-num pseudocoloumn returns a number indicating the order |

|is created at the time the row is inserted into the table, and |in which oracle selects the row from a table or set of joined |

|destroyed when it is removed from a table. |rows. |

Order of where and having:

SELECT column, group_function

FROM table

[WHERE condition]

[GROUP BY group_by_expression]

[HAVING group_condition]

[ORDER BY column];

The WHERE clause cannot be used to restrict groups. you use the

HAVING clause to restrict groups.

Differences between where clause and having clause

|Where clause |Having clause |

|Both where and having clause can be used to filter the data. |

|Where as in where clause it is not mandatory. |But having clause we need to use it with the group by. |

|Where clause applies to the individual rows. |Where as having clause is used to test some condition on the group |

| |rather than on individual rows. |

|Where clause is used to restrict rows. |But having clause is used to restrict groups. |

|Restrict normal query by where |Restrict group by function by having |

|In where clause every record is filtered based on where. |In having clause it is with aggregate records (group by functions). |

MERGE Statement

You can use merge command to perform insert and update in a single command.

Ex: Merge into student1 s1

Using (select * from student2) s2

On (s1.no=s2.no)

When matched then

Update set marks = s2.marks

When not matched then

Insert (s1.no, s1.name, s1.marks) Values (s2.no, s2.name, s2.marks);

What is the difference between sub-query & co-related sub query?

A sub query is executed once for the parent statement

whereas the correlated sub query is executed once for each

row of the parent query.

Sub Query:

Example:

Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade=’A’ or sal_grade=’B’)

Co-Related Sun query:

Example:

Find all employees who earn more than the average salary in their department.

SELECT last-named, salary, department_id FROM employees A

WHERE salary > (SELECT AVG (salary)

FROM employees B WHERE B.department_id =A.department_id

Group by B.department_id)

EXISTS:

The EXISTS operator tests for existence of rows in

the results set of the subquery.

Select dname from dept where exists

(select 1 from EMP

where dept.deptno= emp.deptno);

|Sub-query |Co-related sub-query |

|A sub-query is executed once for the parent Query |Where as co-related sub-query is executed once for each row of the|

| |parent query. |

|Example: |Example: |

|Select * from emp where deptno in (select deptno from dept); |Select a.* from emp e where sal >= (select avg(sal) from emp a |

| |where a.deptno=e.deptno group by a.deptno); |

Indexes:

1. Bitmap indexes are most appropriate for columns having low distinct values—such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however. In reality, a bitmap index is always advisable for systems in which data is not frequently updated by many concurrent systems. In fact, as I'll demonstrate here, a bitmap index on a column with 100-percent unique values (a column candidate for primary key) is as efficient as a B-tree index.

2. When to Create an Index

3. You should create an index if:

4. A column contains a wide range of values

5. A column contains a large number of null values

6. One or more columns are frequently used together in a WHERE clause or a join condition

7. The table is large and most queries are expected to retrieve less than 2 to 4 percent of the rows

8. By default if u create index that is nothing but b-tree index.

Why hints Require?

It is a perfect valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all.

Sometimes, however, the characteristics of the data in the database are changing rapidly, so that the optimizer (or more accuratly, its statistics) are out of date. In this case, a hint could help.

You should first get the explain plan of your SQL and determine what changes can be done to make the code operate without using hints if possible. However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance

Tables analyze and update Analyze Statement

The ANALYZE statement can be used to gather statistics for a specific table, index or cluster. The statistics can be computed exactly, or estimated based on a specific number of rows, or a percentage of rows:

ANALYZE TABLE employees COMPUTE STATISTICS;

ANALYZE TABLE employees ESTIMATE STATISTICS SAMPLE 15 PERCENT;

EXEC DBMS_STATS.gather_table_stats('SCOTT', 'EMPLOYEES');

Automatic Optimizer Statistics Collection

By default Oracle 10g automatically gathers optimizer statistics using a scheduled job called GATHER_STATS_JOB. By default this job runs within maintenance windows between 10 P.M. to 6 A.M. week nights and all day on weekends. The job calls the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC internal procedure which gathers statistics for tables with either empty or stale statistics, similar to the DBMS_STATS.GATHER_DATABASE_STATS procedure using the GATHER AUTO option. The main difference is that the internal job prioritizes the work such that tables most urgently requiring statistics updates are processed first.

Hint categories:

Hints can be categorized as follows:

• ALL_ROWS

One of the hints that 'invokes' the Cost based optimizer

ALL_ROWS is usually used for batch processing or data warehousing systems.

(/*+ ALL_ROWS */)

• FIRST_ROWS

One of the hints that 'invokes' the Cost based optimizer

FIRST_ROWS is usually used for OLTP systems.

(/*+ FIRST_ROWS */)

• CHOOSE

One of the hints that 'invokes' the Cost based optimizer

This hint lets the server choose (between ALL_ROWS and FIRST_ROWS, based on statistics gathered.

• Hints for Join Orders,

• Hints for Join Operations,

• Hints for Parallel Execution, (/*+ parallel(a,4) */) specify degree either 2 or 4 or 16

• Additional Hints

• HASH

Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash index to find corresponding records. Therefore not suitable for < or > join conditions.

/*+ use_hash */

Use Hint to force using index

SELECT /*+INDEX (TABLE_NAME INDEX_NAME) */ COL1,COL2 FROM TABLE_NAME

Select ( /*+ hash */ ) empno from

ORDERED-( This hint forces tables to be joined in the order specified. If you know table X has fewer rows, then ordering it first may speed execution in a join.

PARALLEL (table, instances)(This specifies the operation is to be done in parallel.

If index is not able to create then will go for /*+ parallel(table, 8)*/-----For select and update example---in where clase like st,not in ,>,< , then we will use.

Explain Plan:

Explain plan will tell us whether the query properly using indexes or not.whatis the cost of the table whether it is doing full table scan or not, based on these statistics we can tune the query.

The explain plan process stores data in the PLAN_TABLE. This table can be located in the current schema or a shared schema and is created using in SQL*Plus as follows:

SQL> CONN sys/password AS SYSDBA

Connected

SQL> @$ORACLE_HOME/rdbms/admin/utlxplan.sql

SQL> GRANT ALL ON sys.plan_table TO public;

SQL> CREATE PUBLIC SYNONYM plan_table FOR sys.plan_table;

What is your tuning approach if SQL query taking long time? Or how do u tune SQL query?

If query taking long time then First will run the query in Explain Plan, The explain plan process stores data in the PLAN_TABLE.

it will give us execution plan of the query like whether the query is using the relevant indexes on the joining columns or indexes to support the query are missing.

If joining columns doesn’t have index then it will do the full table scan if it is full table scan the cost will be more then will create the indexes on the joining columns and will run the query it should give better performance and also needs to analyze the tables if analyzation happened long back. The ANALYZE statement can be used to gather statistics for a specific table, index or cluster using

ANALYZE TABLE employees COMPUTE STATISTICS;

If still have performance issue then will use HINTS, hint is nothing but a clue. We can use hints like

• ALL_ROWS

One of the hints that 'invokes' the Cost based optimizer

ALL_ROWS is usually used for batch processing or data warehousing systems.

(/*+ ALL_ROWS */)

• FIRST_ROWS

One of the hints that 'invokes' the Cost based optimizer

FIRST_ROWS is usually used for OLTP systems.

(/*+ FIRST_ROWS */)

• CHOOSE

One of the hints that 'invokes' the Cost based optimizer

This hint lets the server choose (between ALL_ROWS and FIRST_ROWS, based on statistics gathered.

• HASH

Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash index to find corresponding records. Therefore not suitable for < or > join conditions.

/*+ use_hash */

Hints are most useful to optimize the query performance.

Store Procedure:

|What are the differences between stored procedures and triggers? |

|Stored procedure normally used for performing tasks |

|But the Trigger normally used for tracing and auditing logs. |

| |

|Stored procedures should be called explicitly by the user in order to execute |

|But the Trigger should be called implicitly based on the events defined in the table. |

| |

|Stored Procedure can run independently |

|But the Trigger should be part of any DML events on the table. |

| |

|Stored procedure can be executed from the Trigger |

|But the Trigger cannot be executed from the Stored procedures. |

| |

|Stored Procedures can have parameters. |

|But the Trigger cannot have any parameters. |

|Stored procedures are compiled collection of programs or SQL statements in the database. |

|Using stored procedure we can access and modify data present in many tables. |

|Also a stored procedure is not associated with any particular database object. |

|But triggers are event-driven special procedures which are attached to a specific database object say a table. |

|Stored procedures are not automatically run and they have to be called explicitly by the user. But triggers get executed when the |

|particular event associated with the event gets fired. |

Packages:

Packages provide a method of encapsulating related procedures, functions, and associated cursors and variables together as a unit in the database.

package that contains several procedures and functions that process related to same transactions.

A package is a group of related procedures and functions, together with the cursors and variables they use,

Packages provide a method of encapsulating related procedures, functions, and associated cursors and variables together as a unit in the database.

Triggers:

Oracle lets you define procedures called triggers that run implicitly when an INSERT, UPDATE, or DELETE statement is issued against the associated table

Triggers are similar to stored procedures. A trigger stored in the database can include SQL and PL/SQL

Types of Triggers

This section describes the different types of triggers:

• Row Triggers and Statement Triggers

• BEFORE and AFTER Triggers

• INSTEAD OF Triggers

• Triggers on System Events and User Events

Row Triggers

A row trigger is fired each time the table is affected by the triggering statement. For example, if an UPDATE statement updates multiple rows of a table, a row trigger is fired once for each row affected by the UPDATE statement. If a triggering statement affects no rows, a row trigger is not run.

BEFORE and AFTER Triggers

When defining a trigger, you can specify the trigger timing--whether the trigger action is to be run before or after the triggering statement. BEFORE and AFTER apply to both statement and row triggers.

BEFORE and AFTER triggers fired by DML statements can be defined only on tables, not on views.

Difference between Trigger and Procedure

|Triggers |Stored Procedures |

|In trigger no need to execute manually. Triggers will be fired |Where as in procedure we need to execute manually. |

|automatically. | |

|Triggers that run implicitly when an INSERT, UPDATE, or DELETE | |

|statement is issued against the associated table. | |

Differences between stored procedure and functions

|Stored Procedure |Functions |

|Stored procedure may or may not return values. |Function should return at least one output parameter. Can return more|

| |than one parameter using OUT argument. |

|Stored procedure can be used to solve the business logic. |Function can be used to calculations |

|Stored procedure is a pre-compiled statement. |But function is not a pre-compiled statement. |

|Stored procedure accepts more than one argument. |Whereas function does not accept arguments. |

|Stored procedures are mainly used to process the tasks. |Functions are mainly used to compute values |

|Cannot be invoked from SQL statements. E.g. SELECT |Can be invoked form SQL statements e.g. SELECT |

|Can affect the state of database using commit. |Cannot affect the state of database. |

|Stored as a pseudo-code in database i.e. compiled form. |Parsed and compiled at runtime. |

Data files Overview:

A tablespace in an Oracle database consists of one or more physical datafiles. A datafile can be associated with only one tablespace and only one database.

Table Space:

Oracle stores data logically in tablespaces and physically in datafiles associated with the corresponding tablespace.

A database is divided into one or more logical storage units called tablespaces. Tablespaces are divided into logical units of storage called segments.

Control File:

A control file contains information about the associated database that is required for access by an instance, both at startup and during normal operation. Control file information can be modified only by Oracle; no database administrator or user can edit a control file.

39 IMPORTANT QUERIES

1. Get duplicate rows from the table:

Select empno, count (*) from EMP group by empno having count (*)>1;

2. Remove duplicates in the table:

Delete from EMP where rowid not in (select max (rowid) from EMP group by empno);

3. Below query transpose columns into rows.

|Name |No |Add1 |Add2 |

|abc |100 |hyd |bang |

|xyz |200 |Mysore |pune |

Select name, no, add1 from A

UNION

Select name, no, add2 from A;

4. Below query transpose rows into columns.

select

emp_id,

max(decode(row_id,0,address))as address1,

max(decode(row_id,1,address)) as address2,

max(decode(row_id,2,address)) as address3

from (select emp_id,address,mod(rownum,3) row_id from temp order by emp_id )

group by emp_id

Other query:

select

emp_id,

max(decode(rank_id,1,address)) as add1,

max(decode(rank_id,2,address)) as add2,

max(decode(rank_id,3,address))as add3

from

(select emp_id,address,rank() over (partition by emp_id order by emp_id,address )rank_id from temp )

group by

emp_id

5. Rank query:

Select empno, ename, sal, r from (select empno, ename, sal, rank () over (order by sal desc) r from EMP);

6. Dense rank query:

The DENSE_RANK function works acts like the RANK function except that it assigns consecutive ranks:

Select empno, ename, Sal, from (select empno, ename, sal, dense_rank () over (order by sal desc) r from emp);

7. Top 5 salaries by using rank:

Select empno, ename, sal,r from (select empno,ename,sal,dense_rank() over (order by sal desc) r from emp) where r=$$Pre_sess_max_upd (Mapping var)

3) In the expression assign max last_upd_date value to $$Pre_sess_max_upd(mapping var) using set max var

4) Because its var so it stores the max last upd_date value in the repository, in the next run our source qualifier query will fetch only the records updated or inseted after previous run.

Approach_2: Using parameter file

1. First need to create mapping parameter ($$Pre_sess_start_tmst )and assign initial value as old date (01/01/1940) in the parameterfile.

2. Then override source qualifier query to fetch only LAT_UPD_DATE >=$$Pre_sess_start_tmst (Mapping var)

3. Update mapping parameter($$Pre_sess_start_tmst) values in the parameter file using shell script or another mapping after first session get completed successfully

4. Because its mapping parameter so every time we need to update the value in the parameter file after comptetion of main session.

Approach_3: Using oracle Control tables

1. First we need to create two control tables cont_tbl_1 and cont_tbl_1 with structure of session_st_time,wf_name

2. Then insert one record in each table with session_st_time=1/1/1940 and workflow_name

3. create two store procedures one for update cont_tbl_1 with session st_time, set property of store procedure type as Source_pre_load .

4. In 2nd store procedure set property of store procedure type as Target _Post_load.this proc will update the session _st_time in Cont_tbl_2 from cnt_tbl_1.

5. Then override source qualifier query to fetch only LAT_UPD_DATE >=(Select session_st_time from cont_tbl_2 where workflow name=’Actual work flow name’.

SCD Type-II Effective-Date Approach

• We have one of the dimension in current project called resource dimension. Here we are maintaining the history to keep track of SCD changes.

• To maintain the history in slowly changing dimension or resource dimension. We followed SCD Type-II Effective-Date approach.

• My resource dimension structure would be eff-start-date, eff-end-date, s.k and source columns.

• Whenever I do a insert into dimension I would populate eff-start-date with sysdate, eff-end-date with future date and s.k as a sequence number.

• If the record already present in my dimension but there is change in the source data. In that case what I need to do is

• Update the previous record eff-end-date with sysdate and insert as a new record with source data.

Informatica design to implement SDC Type-II effective-date approach

• Once you fetch the record from source qualifier. We will send it to lookup to find out whether the record is present in the target or not based on source primary key column.

• Once we find the match in the lookup we are taking SCD column from lookup and source columns from SQ to expression transformation.

• In lookup transformation we need to override the lookup override query to fetch Active records from the dimension while building the cache.

• In expression transformation I can compare source with lookup return data.

• If the source and target data is same then I can make a flag as ‘S’.

• If the source and target data is different then I can make a flag as ‘U’.

• If source data does not exists in the target that means lookup returns null value. I can flag it as ‘I’.

• Based on the flag values in router I can route the data into insert and update flow.

• If flag=’I’ or ‘U’ I will pass it to insert flow.

• If flag=’U’ I will pass this record to eff-date update flow

• When we do insert we are passing the sequence value to s.k.

• Whenever we do update we are updating the eff-end-date column based on lookup return s.k value.

Complex Mapping

• We have one of the order file requirement. Requirement is every day in source system they will place filename with timestamp in informatica server.

• We have to process the same date file through informatica.

• Source file directory contain older than 30 days files with timestamps.

• For this requirement if I hardcode the timestamp for source file name it will process the same file every day.

• So what I did here is I created $InputFilename for source file name.

• Then I am going to use the parameter file to supply the values to session variables ($InputFilename).

• To update this parameter file I have created one more mapping.

• This mapping will update the parameter file with appended timestamp to file name.

• I make sure to run this parameter file update mapping before my actual mapping.

How to handle errors in informatica?

• We have one of the source with numerator and denominator values we need to calculate num/deno

• When populating to target.

• If deno=0 I should not load this record into target table.

• We need to send those records to flat file after completion of 1st session run. Shell script will check the file size.

• If the file size is greater than zero then it will send email notification to source system POC (point of contact) along with deno zero record file and appropriate email subject and body.

• If file size=($$INCREMENT_TS (Mapping var)

3) In the expression assign max last_upd_date value to ($$INCREMENT_TS (mapping var) using set max var

4) Because its var so it stores the max last upd_date value in the repository, in the next run our source qualifier query will fetch only the records updated or inseted after previous run.

[pic]

Logic in the mapping variable is

[pic]

Logic in the SQ is

[pic]

In expression assign max last update date value to the variable using function set max variable.

[pic]

[pic]

Logic in the update strategy is below

[pic]

[pic]

Approach_2: Using parameter file

First need to create mapping parameter ($$LastUpdateDate Time )and assign initial value as old date (01/01/1940) in the parameterfile.

Then override source qualifier query to fetch only LAT_UPD_DATE >=($$LastUpdateDate Time (Mapping var)

Update mapping parameter($$LastUpdateDate Time) values in the parameter file using shell script or another mapping after first session get completed successfully

Because its mapping parameter so every time we need to update the value in the parameter file after comptetion of main session.

Parameterfile:

[GEHC_APO_DEV.WF:w_GEHC_APO_WEEKLY_HIST_LOAD.WT:wl_GEHC_APO_WEEKLY_HIST_BAAN.ST:s_m_GEHC_APO_BAAN_SALES_HIST_AUSTRI]

$DBConnection_Source=DMD2_GEMS_ETL

$DBConnection_Target=DMD2_GEMS_ETL

$$LastUpdateDate Time =01/01/1940

Updating parameter File

[pic]

Logic in the expression

[pic]

Main mapping

[pic]

Sql override in SQ Transformation

[pic]

Workflod Design

[pic]

2 Informatica Scenarios:

1) How to populate 1st record to 1st target ,2nd record to 2nd target ,3rd record to 3rd target and 4th record to 1st target through informatica?

We can do using sequence generator by setting end value=3 and enable cycle option.then in the router take 3 goups

In 1st group specify condition as seq next value=1 pass those records to 1st target simillarly

In 2nd group specify condition as seq next value=2 pass those records to 2nd target

In 3rd group specify condition as seq next value=3 pass those records to 3rd target.

Since we have enabled cycle option after reaching end value sequence generator will start from 1,for the 4th record seq.next value is 1 so it will go to 1st target.

2) How to do Dymanic File generation in Informatica?

I want to generate the separate file for every State (as per state, it should generate file).It has to generate 2 flat files and name of the flat file is corresponding state name that is the requirement.

Below is my mapping.

Source (Table) -> SQ -> Target (FF)

Source:

|State |Transaction |City |

|AP |2 |HYD |

|AP |1 |TPT |

|KA |5 |BANG |

|KA |7 |MYSORE |

|KA |3 |HUBLI |

This functionality was added in informatica 8.5 onwards earlier versions it was not there.

We can achieve it with use of transaction control and special "FileName" port in the target file .

In order to generate the target file names from the mapping, we should make use of the special "FileName" port in the target file. You can't create this special port from the usual New port button. There is a special button with label "F" on it to the right most corner of the target flat file when viewed in "Target Designer".

When you have different sets of input data with different target files created, use the same instance, but with a Transaction Control transformation which defines the boundary for the source sets.

in target flat file there is option in column tab i.e filename as column.

when you click that one non editable column gets created in metadata of target.

in transaction control give condition as iif(not isnull(emp_no),tc_commit_before,continue) else tc_commit_before

map the emp_no column to target's filename column

ur mapping will be like this

source -> squlf-> transaction control-> target

run it ,separate files will be created by name of Ename

3) How to concatenate row data through informatica?

Source:

|Ename |EmpNo |

|stev |100 |

|methew |100 |

|john |101 |

| tom |101 |

Target:

|Ename |EmpNo |

|Stev methew |100 |

|John tom |101 |

Approach1: Using Dynamic Lookup on Target table:

If record doen’t exit do insert in target .If it is already exist then get corresponding Ename vale from lookup and concat in expression with current Ename value then update the target Ename column using update strategy.

Approch2: Using Var port :

Sort the data in sq based on EmpNo column then Use expression to store previous record information using Var port after that use router to insert a record if it is first time if it is already inserted then update Ename with concat value of prev name and current name value then update in target.

4) How to send Unique (Distinct) records into One target and duplicates into another tatget?

Source:

|Ename |EmpNo |

|stev |100 |

|Stev |100 |

|john |101 |

|Mathew |102 |

Output:

Target_1:

|Ename |EmpNo |

|Stev |100 |

|John |101 |

|Mathew |102 |

Target_2:

|Ename |EmpNo |

|Stev |100 |

Approch 1: Using Dynamic Lookup on Target table:

If record doen’t exit do insert in target_1 .If it is already exist then send it to Target_2 using Router.

Approch2: Using Var port :

Sort the data in sq based on EmpNo column then Use expression to store previous record information using Var ports after that use router to route the data into targets if it is first time then sent it to first target if it is already inserted then send it to Tartget_2.

5) How to Process multiple flat files to single target table through informatica if all files are same structure?

We can process all flat files through one mapping and one session using list file.

First we need to create list file using unix script for all flat file the extension of the list file is .LST.

This list file it will have only flat file names.

At session level we need to set

source file directory as list file path

And source file name as list file name

And file type as indirect.

6) How to populate file name to target while loading multiple files using list file concept.

In informatica 8.6 by selecting Add currently processed flatfile name option in the properties tab of source definition after import source file defination in source analyzer.It will add new column as currently processed file name.we can map this column to target to populate filename.

7) If we want to run 2 workflow one after another(how to set the dependence between wf’s)

• If both workflow exists in same folder we can create 2 worklet rather than creating 2 workfolws.

• Finally we can call these 2 worklets in one workflow.

• There we can set the dependency.

• If both workflows exists in different folders or repository then we cannot create worklet.

• We can set the dependency between these two workflow using shell script is one approach.

• The other approach is event wait and event rise.

If both workflow exists in different folrder or different rep then we can use below approaches.

1) Using shell script

• As soon as first workflow get completes we are creating zero byte file (indicator file).

• If indicator file is available in particular location. We will run second workflow.

• If indicator file is not available we will wait for 5 minutes and again we will check for the indicator. Like this we will continue the loop for 5 times i.e 30 minutes.

• After 30 minutes if the file does not exists we will send out email notification.

2) Event wait and Event rise approach

We can put event wait before actual session run in the workflow to wait a indicator file if file available then it will run the session other event wait it will wait for infinite time till the indicator file is available.

8) How to load cumulative salary in to target ?

Solution:

Using var ports in expression we can load cumulative salary into target.

[pic]

[pic]

3 Development Guidelines

General Development Guidelines

The starting point of the development is the logical model created by the Data Architect. This logical model forms the foundation for metadata, which will be continuously be maintained throughout the Data Warehouse Development Life Cycle (DWDLC). The logical model is formed from the requirements of the project. At the completion of the logical model technical documentation defining the sources, targets, requisite business rule transformations, mappings and filters. This documentation serves as the basis for the creation of the Extraction, Transformation and Loading tools to actually manipulate the data from the applications sources into the Data Warehouse/Data Mart.

To start development on any data mart you should have the following things set up by the Informatica Load Administrator

➢ Informatica Folder. The development team in consultation with the BI Support Group can decide a three-letter code for the project, which would be used to create the informatica folder as well as Unix directory structure.

➢ Informatica Userids for the developers

➢ Unix directory structure for the data mart.

➢ A schema XXXLOAD on DWDEV database.

Transformation Specifications

Before developing the mappings you need to prepare the specifications document for the mappings you need to develop. A good template is placed in the templates folder You can use your own template as long as it has as much detail or more than that which is in this template.

While estimating the time required to develop mappings the thumb rule is as follows.

➢ Simple Mapping – 1 Person Day

➢ Medium Complexity Mapping – 3 Person Days

➢ Complex Mapping – 5 Person Days.

Usually the mapping for the fact table is most complex and should be allotted as much time for development as possible.

Data Loading from Flat Files

It’s an accepted best practice to always load a flat file into a staging table before any transformations are done on the data in the flat file.

Always use LTRIM, RTRIM functions on string columns before loading data into a stage table.

You can also use UPPER function on string columns but before using it you need to ensure that the data is not case sensitive (e.g. ABC is different from Abc)

If you are loading data from a delimited file then make sure the delimiter is not a character which could appear in the data itself. Avoid using comma-separated files. Tilde (~) is a good delimiter to use.

Failure Notification

Once in production your sessions and batches need to send out notification when then fail to the Support team. You can do this by configuring email task in the session level.

Naming Conventions and usage of Transformations

Port Standards:

Input Ports – It will be necessary to change the name of input ports for lookups, expression and filters where ports might have the same name. If ports do have the same name then will be defaulted to having a number after the name. Change this default to a prefix of “in_”. This will allow you to keep track of input ports through out your mappings.

Prefixed with: IN_

Variable Ports – Variable ports that are created within an expression

Transformation should be prefixed with a “v_”. This will allow the developer to distinguish between input/output and variable ports. For more explanation of Variable Ports see the section “VARIABLES”.

Prefixed with: V_

Output Ports – If organic data is created with a transformation that will be mapped to the target, make sure that it has the same name as the target port that it will be mapped to.

Prefixed with: O_

Quick Reference

|Object Type |Syntax |

|Folder |XXX_ |

|Mapping |m_fXY_ZZZ__x.x |

|Session |s_fXY_ZZZ__x.x |

|Batch |b_ |

|Source Definition | |

|Target Definition | |

|Aggregator |AGG_ |

|Expression |EXP_ |

|Filter |FLT_ |

|Joiner |JNR_ |

|Lookup |LKP_ |

|Normalizer |Norm_ |

|Rank |RNK_ |

|Router |RTR_ |

|Sequence Generator |SEQ_ |

|Source Qualifier |SQ_ |

|Stored Procedure |STP__ |

|Update Strategy |UPD__xxx |

|Mapplet |MPP_ |

|Input Transformation |INP_ |

|Output Tranformation |OUT_ |

|Database Connections |XXX__ |

4 Performance Tips

What is Performance tuning in Informatica

The aim of performance tuning is optimize session

performance so sessions run during the available load window

for the Informatica Server.

Increase the session performance by following.

The performance of the Informatica Server is related to

network connections. Data generally moves across a network

at less than 1 MB per second, whereas a local disk moves

data five to twenty times faster. Thus network connections

ofteny affect on session performance. So avoid work

connections.

1. Cache lookups if source table is under 500,000 rows and DON’T cache for tables over 500,000 rows.

2. Reduce the number of transformations. Don’t use an Expression Transformation to collect fields. Don’t use an Update Transformation if only inserting. Insert mode is the default.

3. If a value is used in multiple ports, calculate the value once (in a variable) and reuse the result instead of recalculating it for multiple ports.

4. Reuse objects where possible.

5. Delete unused ports particularly in the Source Qualifier and Lookups.

6. Use Operators in expressions over the use of functions.

7. Avoid using Stored Procedures, and call them only once during the mapping if possible.

8. Remember to turn off Verbose logging after you have finished debugging.

9. Use default values where possible instead of using IIF (ISNULL(X),,) in Expression port.

10. When overriding the Lookup SQL, always ensure to put a valid Order By statement in the SQL. This will cause the database to perform the order rather than Informatica Server while building the Cache.

11. Improve session performance by using sorted data with the Joiner transformation. When the Joiner transformation is configured to use sorted data, the Informatica Server improves performance by minimizing disk input and output.

12. Improve session performance by using sorted input with the Aggregator Transformation since it reduces the amount of data cached during the session.

13. Improve session performance by using limited number of connected input/output or output ports to reduce the amount of data the Aggregator transformation stores in the data cache.

14. Use a Filter transformation prior to Aggregator transformation to reduce unnecessary aggregation.

15. Performing a join in a database is faster than performing join in the session. Also use the Source Qualifier to perform the join.

16. Define the source with less number of rows and master source in Joiner Transformations, since this reduces the search time and also the cache.

17. When using multiple conditions in a lookup conditions, specify the conditions with the equality operator first.

18. Improve session performance by caching small lookup tables.

19. If the lookup table is on the same database as the source table, instead of using a Lookup transformation, join the tables in the Source Qualifier Transformation itself if possible.

20. If the lookup table does not change between sessions, configure the Lookup transformation to use a persistent lookup cache. The Informatica Server saves and reuses cache files from session to session, eliminating the time required to read the lookup table.

21. Use :LKP reference qualifier in expressions only when calling unconnected Lookup Transformations.

22. Informatica Server generates an ORDER BY statement for a cached lookup that contains all lookup ports. By providing an override ORDER BY clause with fewer columns, session performance can be improved.

23. Eliminate unnecessary data type conversions from mappings.

24. Reduce the number of rows being cached by using the Lookup SQL Override option to add a WHERE clause to the default SQL statement.

5 Unit Test Cases (UTP):

QA Life Cycle consists of 5 types of

Testing regimens:

1. Unit Testing

2. Functional Testing

3. System Integration Testing

4. User Acceptance Testing

Unit testing: The testing, by development, of the application modules to verify each unit (module) itself meets the accepted user requirements and design and development standards

Functional Testing: The testing of all the application’s modules individually to ensure the modules, as released from development to QA, work together as designed and meet the accepted user requirements and system standards

System Integration Testing: Testing of all of the application modules in the same environment, database instance, network and inter-related applications, as it would function in production. This includes security, volume and stress testing.

User Acceptance Testing(UAT): The testing of the entire application by the end-users ensuring the application functions as set forth in the system requirements documents and that the system meets the business needs.

UTP Template:

| | | | |Actual Results, |Pass or Fail|Tested By |

| | | | | |(P or F) | |

|Step |Description |Test Conditions |Expected Results | | | |

|# | | | | | | |

|1 |Check for the total |SOURCE: |Both the source and target table load |Should be same as |Pass |Stev |

| |count of records in| |record count should match. |the expected | | |

| |source tables that |SELECT count(*) FROM XST_PRCHG_STG | | | | |

| |is fetched and the | | | | | |

| |total records in the| | | | | |

| |PRCHG table for a |TARGET: | | | | |

| |perticular session | | | | | |

| |timestamp |Select count(*) from _PRCHG | | | | |

|2 |Check for all the |select PRCHG_ID, |Both the source and target table record|Should be same as |Pass |Stev |

| |target columns |PRCHG_DESC, |values should return zero records |the expected | | |

| |whether they are |DEPT_NBR, | | | | |

| |getting populated |EVNT_CTG_CDE, | | | | |

| |correctly with |PRCHG_TYP_CDE, | | | | |

| |source data. |PRCHG_ST_CDE, | | | | |

| | |from T_PRCHG | | | | |

| | |MINUS | | | | |

| | |select PRCHG_ID, | | | | |

| | |PRCHG_DESC, | | | | |

| | |DEPT_NBR, | | | | |

| | |EVNT_CTG_CDE, | | | | |

| | |PRCHG_TYP_CDE, | | | | |

| | |PRCHG_ST_CDE, | | | | |

| | |from PRCHG | | | | |

|3 |Check for Insert |Identify a one record from the source|It should insert a record into target |Should be same as |Pass |Stev |

| |strategy to load |which is not in target table. Then |table with source data |the expected | | |

| |records into target |run the session | | | | |

| |table. | | | | | |

|4 |Check for Update |Identify a one Record from the |It should update record into target |Should be same as |Pass |Stev |

| |strategy to load |source which is already present in |table with source data for that |the expected | | |

| |records into target |the target table with different |existing record | | | |

| |table. |PRCHG_ST_CDE or PRCHG_TYP_CDE values | | | | |

| | |Then run the session | | | | |

UNIX

How strong you are in UNIX?

1) I have Unix shell scripting knowledge whatever informatica required like

If we want to run workflows in Unix using PMCMD.

Below is the script to run workflow using Unix.

cd /pmar/informatica/pc/pmserver/

/pmar/informatica/pc/pmserver/pmcmd startworkflow -u $INFA_USER -p $INFA_PASSWD -s $INFA_SERVER:$INFA_PORT -f $INFA_FOLDER -wait $1 >> $LOG_PATH/$LOG_FILE

2) And if we suppose to process flat files using informatica but those files were exists in remote server then we have to write script to get ftp into informatica server before start process those files.

3) And also file watch mean that if indicator file available in the specified location then we need to start our informatica jobs otherwise will send email notification using

Mail X command saying that previous jobs didn’t completed successfully something like that.

4) Using shell script update parameter file with session start time and end time.

This kind of scripting knowledge I do have. If any new UNIX requirement comes then I can Google and get the solution implement the same.

Basic Commands:

Cat file1 (cat is the command to create none zero byte file)

cat file1 file2 > all -----it will combined (it will create file if it doesn’t exit)

cat file1 >> file2---it will append to file 2

o > will redirect output from standard out (screen) to file or printer or whatever you like.

o >> Filename will append at the end of a file called filename.

o < will redirect input to a process or command.

How to create zero byte file?

Touch filename (touch is the command to create zero byte file)

how to find all processes that are running

ps -A

Crontab command.

Crontab command is used to schedule jobs. You must have permission to run this command by Unix Administrator. Jobs are scheduled in five numbers, as follows.

Minutes (0-59) Hour (0-23) Day of month (1-31) month (1-12) Day of week (0-6) (0 is Sunday)

so for example you want to schedule a job which runs from script named backup jobs in /usr/local/bin directory on sunday (day 0) at 11.25 (22:25) on 15th of month. The entry in crontab file will be. * represents all values.

25 22 15 * 0 /usr/local/bin/backup_jobs

The * here tells system to run this each month.

Syntax is

crontab file So a create a file with the scheduled jobs as above and then type

crontab filename .This will scheduled the jobs.

Below cmd gives total no of users logged in at this time.

who | wc -l

echo "are total number of people logged in at this time."

Below cmd will display only directories

$ ls -l | grep '^d'

Pipes:

The pipe symbol "|" is used to direct the output of one command to the input

of another.

Moving, renaming, and copying files:

Cp file1 file2 copy a file

mv file1 newname move or rename a file

mv file1 ~/AAA/ move file1 into sub-directory AAA in your home directory.

rm file1 [file2 ...] remove or delete a file

To display hidden files

ls –a

Viewing and editing files:

cat filename Dump a file to the screen in ascii.

More file name to view the file content

head filename Show the first few lines of a file.

head -5 filename Show the first 5 lines of a file.

tail filename Show the last few lines of a file.

Tail -7 filename Show the last 7 lines of a file.

Searching for files :

find command

find -name aaa.txt Finds all the files named aaa.txt in the current directory or

any subdirectory tree.

find / -name vimrc Find all the files named 'vimrc' anywhere on the system.

find /usr/local/games -name "*xpilot*"

Find all files whose names contain the string 'xpilot' which

exist within the '/usr/local/games' directory tree.

You can find out what shell you are using by the command:

echo $SHELL

If file exists then send email with attachment.

if [[ -f $your_file ]]; then

uuencode $your_file $your_file|mailx -s "$your_file exists..." your_email_address

fi

Below line is the first line of the script

#!/usr/bin/sh

Or

#!/bin/ksh

What does #! /bin/sh mean in a shell script?

It actually tells the script to which interpreter to refer. As you know, bash shell has some specific functions that other shell does not have and vice-versa. Same way is for perl, python and other languages.

It's to tell your shell what shell to you in executing the following statements in your shell script.

Interactive History

A feature of bash and tcsh (and sometimes others) you can use

the up-arrow keys to access your previous commands, edit

them, and re-execute them.

Basics of the vi editor

Opening a file

Vi filename

Creating text

Edit modes: These keys enter editing modes and type in the text

of your document.

i Insert before current cursor position

I Insert at beginning of current line

a Insert (append) after current cursor position

A Append to end of line

r Replace 1 character

R Replace mode

Terminate insertion or overwrite mode

Deletion of text

x Delete single character

dd Delete current line and put in buffer

:w Write the current file.

:w new.file Write the file to the name 'new.file'.

:w! existing.file Overwrite an existing file with the file currently being edited.

:wq Write the file and quit.

:q Quit.

:q! Quit with no changes.[pic]

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download