Identifying Duplicate Values

NESUG 2006

Coders' Corner

Identifying Duplicate Values

Christopher J. Bost, MDRC, New York, NY

ABSTRACT

This paper describes techniques for identifying duplicate values across observations, including PROC SORT,

PROC FREQ, PROC SQL, and FIRST./LAST. variables. Pros and cons of each technique are noted.

INTRODUCTION

Duplicate values may or may not be a problem, depending on the data source. Four techniques to identify duplicate values are detailed below. Each is followed by an optional technique to store observations with duplicate

values in another SAS? data set.

SAMPLE DATA SET

Temporary SAS data set TEST is used in all examples. It contains two variables and ten observations:

Data set TEST

Obs

ID

N

1

2

3

4

5

6

7

8

9

10

111

222

333

444

444

555

555

666

666

666

1

1

1

1

2

1

2

1

2

3

Note that IDs 111, 222, and 333 each occur once. IDs 444 and 555 each occur twice. ID 666 occurs three times.

In other words, there are three single values, two duplicate values, and one triplicate value. Variable N numbers

the relative occurrence of each ID.

IDENTIFYING DUPLICATES WITH PROC SORT

Use PROC SORT to remove duplicate values:

proc sort data=test nodupkey;

by id;

run;

Observations in data set TEST are sorted by ID in ascending order. The NODUPKEY option deletes any observations with duplicate BY values (i.e., observations that duplicate a previously encountered value of ID).

More than one variable may be specified on the BY statement; SAS will delete any observations that duplicate

previously encountered combinations of BY values.

Use the NODUPREC option to delete duplicate observations (i.e., where all variable values are repeated).

1

NESUG 2006

Coders' Corner

STORING DUPLICATES

Use the DUPOUT= option with NODUPKEY (or NODUPREC) to output duplicates to the specified SAS data set:

proc sort data=test nodupkey dupout=dups;

by id;

run;

Observations in data set TEST are sorted by ID in ascending order. The NODUPKEY option deletes any observations with duplicate BY values from data set TEST. The DUPOUT= option outputs observations with duplicate

BY values to data set DUPS. The resulting data sets look as follows:

Data set TEST [NODUPKEY]

Data set DUPS

Obs

ID

N

Obs

ID

N

1

2

3

4

5

6

111

222

333

444

555

666

1

1

1

1

1

1

1

2

3

4

444

555

666

666

2

2

2

3

Data set TEST contains six observations, including the first occurrence of each value of ID. (Note that N equals

1 for each.) Data set DUPS contains four observations, including the subsequent occurrences of each value of

ID. (Note that N equals 2 or 3 for each.)

PROS AND CONS

Pros: easy to use; can output duplicates to another SAS data set; can delete duplicate observations

Cons: original and duplicates not stored together; no frequencies; noncontiguous duplicates may not be detected

IDENTIFYING DUPLICATES WITH PROC FREQ

Use PROC FREQ to count the number of times each ID occurs:

proc freq data=test;

tables id;

run;

PROC FREQ prints the following output:

The FREQ Procedure

Cumulative

Cumulative

ID

Frequency

Percent

Frequency

Percent

©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤

111

1

10.00

1

10.00

222

1

10.00

2

20.00

333

1

10.00

3

30.00

444

2

20.00

5

50.00

555

2

20.00

7

70.00

666

3

30.00

10

100.00

IDs 111, 222, and 333 each have a Frequency of 1; IDs 444 and 555 each have a Frequency of 2; and ID 666

has a Frequency of 3. In other words, there are three single values, two duplicate values, and one triplicate value.

2

NESUG 2006

Coders' Corner

COUNTING FREQUENCIES

The Frequency indicates how many times each ID occurs. This categorizes each ID as being a single occurrence (Frequency equals 1), a duplicate (Frequency equals 2), a triplicate (Frequency equals 3), and so on.

PROC FREQ may produce voluminous output, however, depending on the number of IDs. Output the frequency

counts to a SAS data set, and run PROC FREQ on the Frequency variable to summarize duplicates:

proc freq data=test noprint;

tables id/out=freqs;

run;

The NOPRINT option on the PROC FREQ statement suppresses printed output. Frequency-count information

for ID is saved to temporary SAS data set FREQS, which looks as follows:

Data set FREQS

Obs

ID

COUNT

PERCENT

1

2

3

4

5

6

111

222

333

444

555

666

1

1

1

2

2

3

10

10

10

20

20

30

Data set FREQ contains six observations and three variables. There is one observation for each value of ID.

The frequency of each ID is stored in variable COUNT. The percentage that each count represents of the total

number of observations is stored in variable PERCENT.

Run a second PROC FREQ to process data set FREQS:

proc freq data=freqs;

tables count;

run;

PROC FREQ prints the following output:

The FREQ Procedure

Frequency Count

Cumulative

Cumulative

COUNT

Frequency

Percent

Frequency

Percent

©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤

1

3

50.00

3

50.00

2

2

33.33

5

83.33

3

1

16.67

6

100.00

COUNT has the value 1 three times (i.e., has a Frequency of 3), meaning there are three single values.

COUNT has the value 2 two times, meaning there are two duplicates.

COUNT has the value 3 one time, meaning there is one triplicate.

The two PROC FREQ steps count all occurrences (i.e., unique values, duplicates, triplicates, and so on,

depending on the data).

3

NESUG 2006

Coders' Corner

STORING DUPLICATES

Merge data set FREQS with data set TEST to subset observations that do not have unique values of ID:

data dups;

merge freqs(in=inF where=(count>1) keep=id count)

test (in=inT);

by id;

if inF and inT;

drop count;

run;

Use the IN= data set option to create variables that store information about the origin of each observation. SAS

creates temporary variables INF and INT that have the value 1 when the respective data set contributes to the

current observation and the value 0 when it does not. Use the WHERE= option to merge only those observations

from data set FREQS where COUNT is greater than 1 (that is, a duplicate). Use the KEEP= option to keep only

the variables ID and COUNT (used in the WHERE= option).

Use an IF-THEN statement to output only those observations with an ID in both data sets FREQS and TEST to

data set DUPS. Use a DROP statement to drop the COUNT variable. The resulting data set looks as follows:

Data set DUPS

Obs

ID

N

1

2

3

4

5

6

7

444

444

555

555

666

666

666

1

2

1

2

1

2

3

Data set TEST must be in sort order, sorted with PROC SORT, or indexed by ID. Data set FREQS ¡ª the output

data set from the FREQ procedure ¡ª is already sorted by ID in ascending order.

PROS AND CONS

Pros: easy to summarize counts and percentages; data-driven

Cons: requires two steps; cannot count combinations of variable values

IDENTIFYING DUPLICATES WITH PROC SQL

Use PROC SQL to count the number of unique IDs and the number of observations:

proc sql;

select count(distinct id) as UniqueIDs,

count(*) as NObs

from test;

quit;

The COUNT function with the DISTINCT keyword counts nonmissing unique values in a column (variable).

The COUNT function with an asterisk (*) counts the number of rows (observations) in the table (data set).

The AS keyword names the variable to store the result. PROC SQL prints the following output:

UniqueIDs

NObs

©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤

6

10

4

NESUG 2006

Coders' Corner

PROC SQL counted six unique ID values (UNIQUEIDS) out of ten observations (NOBS). This indicates that one

or more ID values are duplicated, but it does not indicate which ones. This query is perhaps best used to document that there are no duplicates (i.e., when the number of UNIQUEIDS equals NOBS).

Use PROC SQL to count the number of times each ID occurs:

proc sql;

select id, count(*) as IDCount

from test

group by id;

quit;

This query treats observations with the same value of ID as a group and counts the number of times each occurs.

PROC SQL prints the following output:

ID

IDCount

©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤

111

1

222

1

333

1

444

2

555

2

666

3

Save the query results to a SAS data set. Run a second query to count the number of times each IDCOUNT

occurs:

proc sql;

create table counts as

select id, count(*) as IDCount

from test

group by id;

select IDCount, count(*) as N

from counts

group by IDCount;

quit;

CREATE TABLE outputs the results of the first query to SAS data set COUNTS. The second query treats observations with the same value of IDCOUNT as a group and counts the number of times each occurs. PROC SQL

prints the following output:

IDCount

N

©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤

1

3

2

2

3

1

IDCOUNT 1 has an N of 3, meaning there are three single values. IDCOUNT 2 has an N of 2, meaning there are

two duplicates. IDCOUNT 3 has an N of 1, meaning there is one triplicate.

Note that PROC FREQ could be run on data set COUNTS to produce these results plus percentages. This

would require a second step (i.e., instead of a second query within one PROC SQL step).

STORING DUPLICATES

Perform an inner join of observations in data sets TEST and COUNTS where the ID occurs in COUNTS:

proc sql;

create table dups as

select T.*

5

................
................

In order to avoid copyright disputes, this page is only a partial summary.

Google Online Preview   Download