Miscellaneous Notes



Db2 for z/OS LPL and WEPR Overview

Db2 for z/OS tracks write I/O activity to Db2 page sets (aka VSAM Linear Data sets (LDS’s) backing Db2 objects) as changes occur to the data in the page sets (e.g. INSERT, UPDATE, DELETE activity, etc.). This tracking is separate from Db2 Log activity which is employed by Db2 for the ongoing maintenance of data integrity. Db2 is designed to identify when a write I/O problem occurs on a Db2 page set, and takes appropriate action when such a scenario is identified.

Db2 for z/OS employs two internal lists for tracking underlying Db2 page sets with logical or physical problems/errors. These two lists are called the Logical Page List (LPL) and the Write Error Page Range (WEPR). While the two lists have a great deal in common, they are used internally for Db2 under slightly different circumstances

This paper discusses at a moderately technical level how Db2 employs the LPL and the WEPR to aid in maintaining data integrity for the data managed by Db2 for z/OS. The paper discusses the differences between the LPL and the WEPR, and the process by which Db2 might employ each list.

For the remainder of this discussion, the term Db2 will mean Db2 for z/OS (as opposed to Db2 LUW, etc.). This discussion is exclusively for Db2 for z/OS, where the term Db2 is concerned and used. References to z/OS are exclusively for those services and programs executing in a z/OS operating system.

DISCLAIMER: The following discussion is a personal interpretation of the IBM provided documentation on these topics. Because interpretations of technical documentation may vary from person to person, and are therefore subject to scrutiny, the user should always rely on IBM documentation for the definitive and authoritative explanation of each topic.

This paper is as of Db2 12 for z/OS and z/OS 2.2, and was written on Feb. 9, 2018 by:

Kurt Bohnert

Manager, Db2 Systems Programming

Rocket Software, Inc.

When Db2 for z/OS writes changes to data as the result of some action (e.g. INSERT, UPDATE, DELETE, etc.), the change is not (usually, under normal circumstances) immediately written to the underlying Db2 page set. The change is written to a buffer in memory (above the 2G bar in the Db2 DBM1 address space), in the Buffer Pool (BP) assigned for that object. The change is solidified in the log when the application performing the change issues a COMMIT. It is not necessarily immediately written (externalized) to the page set however, even after COMMIT. The change is asynchronously externalized (written) to the page set via an internal Db2 feature known as the Deferred Write process. When a change is made to Db2 data, the page stored in the Db2 BP in memory is changed and the page containing that change is placed on the Db2 Deferred Write Queue. At some later point in time, an event will drive what is termed Buffer Pool Externalization. The changed pages on the Deferred Write Queue are then asynchronously written (via an enclave SRB) to the appropriate page sets. There are a variety of events which can drive Buffer Pool Externalization, including but not limited to the following:

1. Db2 Checkpoint (CHKPT)

2. The BP DWTH (Deferred Write Threshold … by BP)

3. The BP VDWTH (Vertical Deferred Write Threshold … by TS within BP)

4. The BP IWTH (Immediate Write Threshold)

5. Physical Close of a page set

6. Pseudo Close of a page set

7. QUIESCE utility with WRITE(YES)

When this physical I/O to the page set happens, there are occasional problems/errors which can occur. While quite infrequent, this is true for any write I/O to any data set (in Db2 or external to Db2). When such an error occurs in Db2, the write fails and the page set is not updated with the changed data. In such a case, Db2 must track all changed pages residing in Db2 page sets which were not successfully updated (the I/O failed). The I/O error can be categorized as either a Logical Error or a Physical Error. An example of a Logical Error might be in a scenario where connectivity to the DASD housing the page set has been interrupted. An example of a Physical Error is almost always due to a DASD hardware (device) error.

A Logical Error means that the affected page is “logically correct” in the Db2 Log, but it is not correct on the physical page due to an inability to access the data set. The page is then placed on the LPL, as either a single page or a range of pages.

A Physical Error means that the affected page is “logically correct” in the Db2 Log, but the DASD on which the page resides has sustained physical damage (a device error). The page is then place on the WEPR list. The WEPR is a list of page ranges residing on DASD which is physically damaged (device error). The range has a low page and high page (a range of pages).

Note again that in both cases, the page is logically correct in the Db2 Log. As will be seen, this is crucial for recovery, and for the maintenance of data integrity.

Perhaps the best way to summarize the difference between the two types of errors (Logical vs. Physical) from a Db2 perspective, is as follows: A page on the WEPR will always require a solution involving the creation of a new physical data set (page set). A page on the LPL may or may not require a solution involving the creation of a new physical data set (e.g. an LPL recovery may accommodate resolution without the need to create a new physical data set … see below for more on LPL recovery).

If the problem accessing the data set is an issue of connectivity to the DASD, the problem can be resolved simply by restoring connectivity and then initiating an LPL recovery. While there are many reasons why connectivity might have been lost, they are beyond the scope of this discussion. Once the connectivity issue is resolved, an LPL recovery will resolve the problem. During LPL recovery, Db2 will employ the Db2 Log to correct the page in the physical page set once it is again accessible. After a successful LPL recovery, Db2 removes the page from the LPL. There are a variety of ways in which Db2 will be prompted to initiate LPL recovery. One way to initiate an LPL recovery is by issuing a -START DATABASE command on the DB housing the object. When the -START DATABASE command is initiated, Db2 will always attempt LPL recovery for all pages in the DB residing on the LPL. There are other ways in which Db2 will be prompted to initiate LPL recovery, the -START DATABASE command is but one example. Note that a page on the LPL is considered inaccessible for all Db2 access, until the issue is resolved.

The only way to fix a page on the WEPR is via the creation of a new data set to replace the data set residing on damaged DASD. One commonly used method for this is to employ the Db2 RECOVER utility, which will create a new physical data set in place of the old data set. The data set must be recreated because the original is no longer accessible due to the DASD device error.

In the case where a page is on the LPL and not on the WEPR, there are several ways other than a -START DATABASE command (which drives LPL recovery) to resolve the problem. One example of many: A mass delete or TRUNCATE will also clear an LPL status, but obviously not a WEPR status because the data set is still residing on damaged DASD.

There are a variety of other reasons why Db2 might put pages on the LPL. Another example might be due to an ICF Coupling Facility R/W issue. Other examples include a mass delete error, a Not-Logged Tablespace error, a deferred restart of Db2, there are many others.

If one wished to drive Db2 to place a page on the LPL, one could do so with the following process:

1. CREATE DB, TS, TB, IX’s and populate the TB with some rows, and COMMIT (in this example, the DB name is KFBDB and the TS is KFBTS).

2. INSERT another row without a COMMIT … i.e. an INFLIGHT thread.

3. /C ssidIRLM to cancel the IRLM and force a Db2 abnormal termination while the INSERT in #2 above is INFIGHT (i.e. not committed).

4. Setup for deferred restart for that DB (KFBDB) only. Note that in a deferred restart, the log is processed but data set processing is deferred for those objects (i.e. KFBDB) specified for DEFER.

5. Start Db2 and a message or series of messages will appear in the startup Db2 subsystem SYSLOG:

DSNI001I !TC1A RESTART HAS BEEN DEFERRED

REASON 00C90095

TYPE 00000200

NAME KFBDB.KFBTS

DSNB250E !TC1A DSNIIMPD A PAGE RANGE WAS ADDED TO

THE LOGICAL PAGE LIST

DATABASE NAME=KFBDB

SPACE NAME=KFBTS

INSTANCE=1

DATA SET NUMBER=1

PAGE RANGE X'00000001' TO X'00000001'

START LRSN=X'000000000000A86049B8'

END LRSN=X'000000000000A86187E6'

START RBA=X'000000000000A86049B8'

LPL TRACE ID=00000008

LPL REASON TYPE=LOGAPPLY

The command … -DISPLAY DATABASE(KFBDB) SPACENAM(*) … will provide the following display (note that associated indexes are also on the LPL):

DSNT360I !TC1A ***********************************

DSNT361I !TC1A * DISPLAY DATABASE SUMMARY

DSNT360I !TC1A ***********************************

DSNT362I !TC1A DATABASE = KFBDB STATUS = RW

DBD LENGTH = 4028

DSNT397I !TC1A 221

NAME TYPE PART STATUS PHYERRLO PHYERRHI CATALOG PIECE

-------- ---- ----- ----------------- -------- -------- -------- -----

KFBTS TS RW,LPL

KFBIX1 IX RW,LPL

KFBIX2 IX RW,LPL

******* DISPLAY OF DATABASE KFBDB ENDED **********************

When a -START DATABASE(KFBDB) SPACENAM(*) command is issued, Db2 will perform LPL recovery for the object(s), and the resulting messages will look something like this:

DSN9022I !TC1A DSNTDDIS 'START DATABASE' NORMAL COMPLETION

DSNI006I !TC1A DSNISREC ASYNCHRONOUS GRECP OR LPL

RECOVERY IN PROGRESS FOR

START DATABASE COMMAND-ID = 1

TYPE 00000200

NAME KFBDB.KFBTS

DSNI042I !TC1A DSNISREC -PAGE SET HEADER RBA

START DATABASE COMMAND-ID = 1

DB ID 0124

PS ID 0002

PART # 0

RBA 000000000000A85F9E26

DSNI006I !TC1A DSNISREC ASYNCHRONOUS GRECP OR LPL

RECOVERY IN PROGRESS FOR

START DATABASE COMMAND-ID = 1

TYPE 00000201

NAME KFBDB.KFBIX1

.

.

.

(much more) …

.

.

.

DSNI021I !TC1A DSNICLST GRECP OR LPL RECOVERY FOR

START DATABASE COMMAND-ID = 1

TYPE 00000200

NAME KFBDB.KFBTS

HAS COMPLETED SUCCESSFULLY.

DSNI021I !TC1A DSNICLST GRECP OR LPL RECOVERY FOR

START DATABASE COMMAND-ID = 1

TYPE 00000201

NAME KFBDB.KFBIX1

HAS COMPLETED SUCCESSFULLY.

DSNI021I !TC1A DSNICLST GRECP OR LPL RECOVERY FOR

START DATABASE COMMAND-ID = 1

TYPE 00000201

NAME KFBDB.KFBIX2

HAS COMPLETED SUCCESSFULLY.

DSNI049I !TC1A DSNICLST GRECP OR LPL RECOVERY FOR

START DATABASE COMMAND-ID = 1

HAS COMPLETED.

Note that Db2 recovered the DB.TS and all associated indexes also.

The command … -DISPLAY DATABASE(KFBDB) SPACENAM(*) … will now provide the following display (note that all objects, including the indexes are no longer on the LPL, all objects are now Read/Write (RW):

DSNT360I !TC1A ***********************************

DSNT361I !TC1A * DISPLAY DATABASE SUMMARY

DSNT360I !TC1A ***********************************

DSNT362I !TC1A DATABASE = KFBDB STATUS = RW

DBD LENGTH = 4028

NAME TYPE PART STATUS PHYERRLO PHYERRHI CATALOG PIECE

-------- ---- ----- ----------------- -------- -------- -------- -----

KFBTS TS RW

KFBIX1 IX RW

KFBIX2 IX RW

******* DISPLAY OF DATABASE KFBDB ENDED **********************

A reference was made previously in this paper to the ICF Coupling Facility (CF), and the potential that problems in the CF could drive Db2 to place pages on the LPL. A brief discussion of the CF and its use by Db2 will help to explain how this might occur.

In IBM z/OS environments, XCF stands for Cross-system Coupling Facility. The terms CF and XCF are synonymous in this discussion. The XCF is a component of z/OS which manages communications between applications in a SYSPLEX. A SYSPLEX (also commonly called a Parallel SYSPLEX) is a cluster of z/OS Logical Partitions (LPAR’s) acting together within a z/OS environment. There can be up to thirty-two (32) LPAR’s in a SYSPLEX. While usually processing on the same hardware, LPAR’s can be thought of as separate mainframes communicating with each other for a common purpose. Each LPAR is IPL’d (Initial Program Load) with its own z/OS operating system. The LPAR’s in a SYSPLEX communicate with each other via a variety of mechanisms and z/OS features, including the use of an XCF.

From a Db2 for z/OS perspective, an XCF is simply a set of defined data structures shared across all LPAR’s in a SYSPLEX. Because the structures are shared by all LPAR’s, they are accessible to all Db2 for z/OS subsystems running in those LPAR’s. When Db2 for z/OS is configured for Data Sharing (DS), many members of the DS “group” can be started on the same or different LPAR’s in the SYSPLEX. Each member can have full access to the same data … i.e. the individual Db2 subsystems (aka members) share data between them. While the advantages of running Db2 in a DS configuration extend far beyond the scope of this paper, one simple example will be described: A four-member Db2 DS group consists of four Db2 for z/OS subsystems, all communicating with each other (via the XCF) and sharing the same data. Each member can be started on a different LPAR in the SYSPLEX. If one LPAR becomes inaccessible for any reason, the workload which might have been directed to that Db2 subsystem can be redirected to one of the other members (Db2 subsystems) on one of the other LPAR’s not impacted by the problem. Mission critical work can continue without incident due to the problem occurring on only one LPAR in the SYSPLEX.

To assure data integrity in a Db2 DS environment, each member of the DS group must be aware of (and respect) other members access to the same data. When multiple members wish to access the same data, and one of the members is attempting to change that data, communications become critical for the preservation of the integrity of the data.

There are three types of XCF structures built to support Db2 data sharing. The three types of structures created in support of a Db2 DS group are:

1. Lock Structure: The lock structure is created in the XCF for a Db2 DS group to support Db2 locking mechanisms, which are critical to data integrity and concurrency.

2. List Structure (aka SCA): The SCA is created in the XCF for a Db2 DS group to support a variety of Db2 functions, most notably data recovery and data integrity. It should be noted that when operating in a data sharing environment, the Db2 DS group shares a single LPL, and that LPL is stored in the SCA.

3. Group Buffer Pools (GBP’s): GBP’s are created in the XCF for a Db2 DS group to support the sharing of data between the various members of the group. While beyond the scope of this discussion, GBP’s can be called upon to store data pages under circumstances where more than one member of the group has an interest in the same data page. When only one member of the group has an interest in a data page, that one member can store that page in its private (i.e. local) Buffer Pool (BP). When more than one member of the group has an interest in the same data page, then the page can be moved internally (moved by Db2) to a GBP in the XCF. While residing in the GBP, it is accessible to all members of the group simultaneously.

Db2 for z/OS employs all three of the structure types above to allow for maximum data concurrency (the ability of multiple applications to access the same data) while maintaining data integrity.

Because data pages can be stored in GBP’s they are subject to the same types of issues discussed above for data pages residing in a local Db2 BP. Consequently, the same reasons for placing a data page residing in a local BP on the LPL/WEPR might drive Db2 to place a data page residing in a GBP on the LPL/WEPR as well.

Also, when data pages reside in a GBP, they effectively reside outside of Db2 subsystem internal storage. If a GBP contains data pages and that GBP suffers an error for any reason, the pages residing in that GBP must be placed on the LPL. When such a GBP error occurs, Db2 always attempts to automatically rebuild the failed pool. If successful, no pages must be placed on the LPL. If the GBP rebuild is not successful (and there are many reasons why the GBP rebuild might fail … e.g. Db2 has lost communications with the XCF), then the pages in that GBP are placed on the LPL. If the automatic rebuild is successful however, then no page is placed on the LPL.

In a scenario where a Db2 member abnormally terminates, it will attempt automatic LPL recovery at member restart. The same rules apply as above. If the LPL recovery is successful, the page(s) is removed from the LPL, else the page(s) continues to reside on the LPL (and continues to be inaccessible to all members) until a recovery can be facilitated.

There are many very unusual scenarios which could drive Db2 to place a data page on the LPL. In fact, there are more such scenarios than can be addressed in this paper. Several of the scenarios were discussed above. Two more will be addressed below. Just know that these are not the only such scenarios, they are merely intended as examples of how this might occur.

Another scenario where a Db2 data page could be placed on the LPL is when NOT LOGGED table spaces (NL TS’s) are employed and an error occurs. An NL TS is a TS in which Db2 has been instructed to not log changes made to the underlying data (i.e. data pages) in the TS. There are variety of ways in which Db2 can be instructed to skip logging for a TS, and are beyond the scope of this paper. Suffice it to say that when Db2 is instructed to skip logging for the TS, the Db2 log cannot be employed to recover lost data. In such a scenario where recovery is not available and an error occurs while changing data in the TS, the entire page set must be placed on the LPL.

If a TS is NOT LOGGED, and a ROLLBACK or BACKOUT of changed data is attempted, Db2 does not have the log data to accommodate the ROLLBACK/BACKOUT and the integrity of the data in the TS is effectively violated. Any problem in an NL TS requiring log recovery will result in the associated page set be placed on the LPL and marked as Recovery Pending (RECP). Neither a -START DATABASE command (see above) nor a Db2 restart will alleviate the problem, because in both cases Db2 relies on its log to perform the LPL recovery, and the TS is designated as NOT LOGGED. In such cases, another remedy must be employed … examples follow:

• PIT RECOVER (recover to a prior PIT … Point-In-Time)

• LOAD REPLACE

• REFRESH TABLE for an MQT

• DROP and CREATE the object new

• DELETE * FROM … (mass delete)

• TRUNCATE (mass delete)

A final example in which Db2 might place a data page on the LPL is during a Deferred Restart of a Db2 subsystem. When Db2 is started, it can be instructed to defer page set processing during phase 3 and phase 4 of the start-up process. This instruction can be to defer processing for ALL objects requiring action during start-up, or to defer processing for specifically identified objects. In either case, Db2 will only defer processing the impacted page sets. Db2 will not defer log processing during the restart. Db2 will still process the entire log (as required) during the restart, and all objects will remain “logically correct” (see above) in the log. The result of this can be that while the impacted data pages are logically correct in the log, the physical data pages are no longer correct. In such a scenario, the data pages must be placed on the LPL until a recovery action is taken.

There are several scenarios in which a Deferred Restart (start-up) might be desired, and the list of reasons for employing such a restart is beyond the scope of this paper. One example in which a Deferred Restart (start-up) might be employed is in a Disaster Recovery scenario, where either all or some of the underlying VSAM LDS’s impacted are not available during start-up. Instructing Db2 to defer page set processing while completing all log processing will place the systems administrator (and/or DBA) in a better position to recover business critical data via a variety of mechanisms after start-up has completed.

I final note is made here regarding some of the Db2 commands that can help to identify data pages on the LPL/WEPR.

This command will display all objects with data pages residing on the LPL:

-DISPLAY DATABASE(*) SPACENAM(*) LPL ONLY

This command will display all objects with data pages residing on the WEPR:

-DISPLAY DATABASE(*) SPACENAM(*) WEPR ONLY

When a start data base command is executed in Db2 …

-START DATABASE(db-name) SPACENAM(ts-name)

… the subsystem will automatically identify pages in the DB.TS residing on the LPL and attempt to perform an LPL recovery using the Db2 log. If the recovery is successful, the pages will be removed from the LPL. If the recovery cannot be completed successfully for any reason, the pages will remain on the LPL, until the obstacle/problem preventing recovery can be alleviated and the recovery process can successfully recover the impacted pages.

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download