Constructing Begin-End Date Intervals from Event or As-of ...

Constructing Begin-End Date Intervals from Event or As-of Dates in a SAS? DATA Step

Thomas E. Billings, Union Bank, San Francisco, California

This work by Thomas E. Billings is licensed (2013) under a Creative Commons Attribution-ShareAlike 3.0 United States License.

Abstract

The challenge: you have a SAS? table that contains a series of event or as-of dates or datetimes, and you want to create a version of the table that includes date intervals: [begin date, end date] variables that specify when the data for each row are relevant. Basic concepts for dates and intervals are reviewed, followed by an example of a table with a series of dates, and the target date intervals to be constructed. The different methods available to construct intervals are briefly described. Design considerations and options, including possible problems, are covered in-depth. Code for a single DATA step that uses the LAG function to construct date intervals is reviewed. An appendix provides a numeric method to minimize possible data loss when using datetimes of arbitrary precision.

Introduction

Database tables may contain data (rows) that are valid over a specific period of time. Such data may be captured on a particular event or as-of date (or datetime), and the data are valid until the next relevant date/datetime when additional records are captured. Data capture may be driven by specific real-world events or may follow a schedule, e.g., calendar milestones like the end of a month.

For this kind of data, it is common practice to designate the interval when the data are valid by adding variables that specify the begin-effective-date and end-effective-dates for each row. Depending on the application, these intervals (sometimes referred to as date bands) may need to be created for data that occur in blocks defined by key or other BY-variables that make the blocks unique (ID variables). An example may clarify this; e.g.: you have a table that records events by customer ID, and you want to create a version of the table where the event dates are used to create date intervals on a by-customer basis. Such a table might be joined with other tables on a by-customer, bydate basis for applications like reporting.

The challenge in creating and inserting date intervals is that you need the event or as-of date from the next (following) relevant event record to establish the end date. That means you have to access dates across rows ? a clumsy process in SQL, but considerably more elegant in the SAS? DATA step. The possibility that you have to create datetime intervals when the datetime can be any number in the legitimate range presents another challenge: if the end-datetime for one interval is the same as the begin-datetime for the next interval, then, since they are created from observed event datetimes, a join using such intervals can yield more than one row (i.e., multiple rows for one datetime in the join). An additional issue to resolve: how to build intervals to handle dates before the earliest or after the latest event/as-of date in the source files?

We consider these questions and more below, and start with a review of basic concepts and a sample table with date intervals.

Concepts

Dates and datetimes can be in many forms, varying by source and/or per cultural practice. SAS dates are the number of days from 1 Jan 1960, and datetimes are the number of seconds from time 00:00:00 on that same day. In general, dates and datetimes are usually converted into their SAS internal form for use. A wide variety of informats and formats are available to convert dates to/from SAS internal form.

1

By definition, dates are integers, though they are stored internally in floating-point numbers by the SAS system. Datetimes can be integers (whole seconds) or can be more precise. Sample dates/datetimes (all for time 00:00:00) and their internal values are shown in the following:

date_value date_internal

datetime_value

01JAN1950

-3652 01JAN1950:00:00:00

01JAN1960

0 01JAN1960:00:00:00

01JAN2000

14610 01JAN2000:00:00:00

01JAN2010

18263 01JAN2010:00:00:00

01JAN2020

21915 01JAN2020:00:00:00

01JAN2050

32873 01JAN2050:00:00:00

01JAN3000

379853 01JAN3000:00:00:00

datetime_internal -315532800 0 1262304000 1577923200 1893456000 2840227200

32819299200

Example of a table with date intervals

An example will help to clarify the subject. Given an artificial sample raw data set where the data are in blocks identified by two "ID variables", customer_id and archive_id with an event date in each row:

Obs CUSTOMER_ID ARCHIVE_ID

1

348128

6

2

386548

3

3

386548

3

4

386548

6

5

386548

12

6

386548

12

7

387434

40

8

387434

41

9

387434

43

10

387434

46

11

388038

2

12

388038

2

13

388038

12

14

388038

12

15

388038

15

event_date 15JAN2013 28JUN2010 28JUN2010 15OCT2010 23NOV2011 23NOV2011 20SEP2012 04DEC2012 15APR2013 25JUN2013 16JUL2010 16JUL2010 14SEP2012 14SEP2012 14SEP2012

We want to create a derivative or new version of the data set that looks like this:

Obs CUSTOMER_ID ARCHIVE_ID

1

348128

6

2

348128

6

3

386548

3

4

386548

3

5

386548

3

6

386548

6

7

386548

12

8

386548

12

9

387434

40

10

387434

40

11

387434

41

12

387434

43

13

387434

46

14

388038

2

event_date begin_date 15JAN2013 01JAN1600 15JAN2013 15JAN2013 28JUN2010 01JAN1600 28JUN2010 28JUN2010 28JUN2010 28JUN2010 15OCT2010 15OCT2010 23NOV2011 23NOV2011 23NOV2011 23NOV2011 20SEP2012 01JAN1600 20SEP2012 20SEP2012 04DEC2012 04DEC2012 15APR2013 15APR2013 25JUN2013 25JUN2013 16JUL2010 01JAN1600

2

end_date 14JAN2013 31DEC3000 27JUN2010 28JUN2010 14OCT2010 22NOV2011 23NOV2011 31DEC3000 19SEP2012 03DEC2012 14APR2013 24JUN2013 31DEC3000 15JUL2010

Obs CUSTOMER_ID ARCHIVE_ID

15

388038

2

16

388038

2

17

388038

12

18

388038

15

19

388038

12

event_date 16JUL2010 16JUL2010 14SEP2012 14SEP2012 14SEP2012

begin_date 16JUL2010 16JUL2010 14SEP2012 14SEP2012 14SEP2012

end_date 16JUL2010 13SEP2012 14SEP2012 14SEP2012 31DEC3000

Artificial begin and end dates have also been added (for completeness) to the sequence, creating an extra row in each block. The added row is a duplicate of the row associated with the first event date, so ? in the example above ? if customer_id and archive_id are components of a primary key, then an additional variable (perhaps a sequence number) needs to be added to the key to preserve uniqueness.

Methods to create date intervals

As previously mentioned, we need to inject the date value from the following/next row (when sorted by the variables that define the data into blocks plus the event or as-of date) into the current row in order to calculate the begin and end dates. This is sometimes referred to as a look-ahead read. SAS provides multiple ways to accomplish this, and the target processing can be done in SQL as well. Some of the ways to do this are as follows.

1. SQL self-join. To avoid a cross-product, we need a sequence number for each row within the natural blocks the data occurs in. In some SQL dialects, the DENSE_RANK function may be used for this purpose. However, DENSE_RANK is not supported in native SAS SQL; instead use PROC RANK to insert a sequence number in the data before doing self-joins. (If you are not familiar with PROC RANK, you can insert sequence numbers via a DATA step that uses RETAIN count variables that are reset at the beginning of each data block using FIRST.BYVARIABLES.)

2. Hamilton et al. (2005) describes 4 different methods to perform a look-ahead read using the DATA step. 3. This paper uses a 5th method to do a look-ahead read: Presort the target file so the rows are in decreasing

date order within the natural blocks the data occurs in, i.e., in "reverse" order. A single DATA step then uses the LAG function to pull the relevant date into the row, at which point the begin and end dates are calculated. This method is described in detail below.

Application and design considerations

Before constructing date intervals, be sure to determine how the intervals will/should be used, and assess other relevant factors, to make sure that the intervals you construct will meet the underlying requirements.

Operational environment(s). Which environments do/will the target data tables ? the ones that need date intervals ? reside in? The options here include native SAS tables, RDBMS tables (e.g., Oracle, Pivotal, DB2, etc.), XML (usually ISO 8601-compliant dates), etc.

Note that RDBMS dates can be datetimes in SAS, so conversions may be required. Daily, weekly, monthly snapshot records in databases often have as-of datetimes with zero times and such datetimes may be converted to/processed as dates in SAS, then converted back to datetimes for transfer back to the RDBMS. ISO 8601 dates and times (as used in XML) can be complicated; there are SAS User Group papers on the topic ? do a search at for references.

How will the date intervals be used in joins? Given a date or datetime interval with begin and end markers, will the usage in joins be:

1. begin_date test_date end_date 2. SQL: test_date BETWEEN begin_date and end_date, 3. begin_date test_date < end_date (or a variation thereof).

Methods #1 and #2 are equivalent (in SQL) because BETWEEN uses the operator. Methods #1, 2 are the most common and you should expect that is how the date intervals will be used. Note the subtle difference between #1 and #3 above. Method #3 is possible and supports more flexibility in assigning end dates. However, method #3 is nonstandard and can cause problems when used in joins.

3

To understand the underlying issues, let's say we have 3 separate and unequal event dates: d(1) < d(2) < d(3), and we want to construct date intervals for them. Now dates are integers and the intervals here would clearly be calculated as the following, in [begin_date, end_date] form:

[d(1), d(2) - 1] [d(2), d(3) - 1]

where we subtract 1 from the event end dates to work correctly in tests based on and BETWEEN. Now instead let d(i) be datetimes . If your datetimes are always measured at a certain known precision level, say for example 0.5 seconds, then an approach similar to the above will work for datetimes:

[d(1), d(2) - e] [d(2), d(3) - e]

where e is an epsilon smaller than the precision level, e.g., say 0.025 when the precision is 0.5 seconds. A problem arises if your intervals are based on datetimes that can be any valid number less than the event datetime used to calculate the end datetime for the interval. The approach above leaves tiny gaps in the intervals, and if the data includes datetimes with values in those gaps, then data can be lost when using the intervals in joins.

If you set e=0 in the above to try to solve the problem of possible data loss, then the end datetime for interval 1 will equal the begin datetime for interval #2 above. Given that the interval begin and end datetimes are constructed using actual datetimes that occur in the data, then a join with the data will presumably yield multiple rows. This creates a dilemma: users have to know they may get multiple rows and need to unduplicate OR they have to know to use method #3 (and that they should not use BETWEEN). Recognizing that users frequently don't read data dictionaries, how can we compute the epsilon e to avoid duplicate rows and also minimize data loss? A solution that supports and BETWEEN is described in the appendix.

How to handle dates outside the observed range? Since the purpose of the date intervals is to facilitate a join by dates between 2 tables, how to handle dates before the earliest begin date or after the last end date? Typically, artificial begin and end dates are used for this. For an artificial begin date, 1 Jan 1600 or 1 Jan 1900 might be used (the earliest valid SAS date is the year 1582), and for an artificial end date, 31 Dec 3000 or 31 Dec 9999. (Of course, you can choose other artificial dates if you prefer.) The use of artificial begin and end dates can force the construction of an extra row in each block of the data when inserting intervals; this is illustrated in the code below.

Duplicate dates. If you are constructing date intervals for a series of event dates or datetimes that contains duplicates, the options are to have 1 or more intervals where begin_date = end_date ? which may produce multiple rows in a join, or unduplicate the data to avoid multiple rows. Research and/or discussion with the table users may be required to determine the appropriate processing for duplicate dates.

Missing vs. NULL. In general, NULL or missing values should not be included in date intervals because it introduces major complications when the table is used in SQL. Artificial begin/end dates are used specifically to avoid the problems introduced by NULL values in RDBMS tables. For the most part, null or missing event/as-of dates are deliberately excluded from the logic used to create date intervals.

Recognizing that joining by dates is important ? and the motivation for creating date intervals ? the following approaches may be considered to handle dates that are missing in joins:

Accept that there are some missing date values and produce an exception report if needed. Pre-process the input file before doing the join by date, to replace missing dates with a default or other

"best-estimate" date. Filter out rows with missing dates and join them using alternate logic that does not require a date.

The bottom line is that if your data have a large number of missing dates, you may need to add separate logic to handle the rows with missing dates.

Source Code for DATA step using the LAG function

#1: Sort the data into blocks, with event or as-of date descending (as we want to use the LAG function to bring in the "lagged" date as the end date. proc sort data=extract_db;

4

by customer_id descending event_date archive_ID; run;

#2: Specify artificial begin, end dates to handle dates before/after the range of event or as=of dates.

%let art_begin_date="01JAN1600"d; %let art_end_date="31DEC3000"d;

#3: DATA step to derive begin, end dates. Note the handling of duplicate event or as-of dates. The processing is defined in a DATA step view for efficiency.

data date_banded_x / view=date_banded_x; set extract_db (keep= customer_id archive_id event_date); by customer_id;

* apply LAG function up front, in unconditional logic; lagdate = lag(event_date);

* set 1st guess at end date to default or lagged value; if first.customer_id then

offset_date = &art_end_date.; else offset_date = lagdate;

* adjust begin, end date values as needed; begin_date = event_date; end_date = offset_date;

if ((end_date ne begin_date) and (end_date ne &art_end_date.)) then end_date=end_date-1;

output;

* output additional row for joins with dates before first event date in block; if (last.CUSTOMER_ID) then

do; end_date = begin_date-1; begin_date = &art_begin_date.; output;

end;

format begin_date end_date offset_date date9.; drop lagdate offset_date; run;

#4: Execute DATA step view and create the (target) sorted data set with date intervals.

* add date intervals and re-sort in 1 step; proc sort data=date_banded_x

out=date_intervals; by CUSTOMER_ID begin_date end_date; run;

The code above created the sample file with date intervals shown in the section above titled "Example of a table with date intervals".

5

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download