Identifying Duplicate Values
NESUG 2006
Coders' Corner
Identifying Duplicate Values
Christopher J. Bost, MDRC, New York, NY
ABSTRACT
This paper describes techniques for identifying duplicate values across observations, including PROC SORT,
PROC FREQ, PROC SQL, and FIRST./LAST. variables. Pros and cons of each technique are noted.
INTRODUCTION
Duplicate values may or may not be a problem, depending on the data source. Four techniques to identify duplicate values are detailed below. Each is followed by an optional technique to store observations with duplicate
values in another SAS? data set.
SAMPLE DATA SET
Temporary SAS data set TEST is used in all examples. It contains two variables and ten observations:
Data set TEST
Obs
ID
N
1
2
3
4
5
6
7
8
9
10
111
222
333
444
444
555
555
666
666
666
1
1
1
1
2
1
2
1
2
3
Note that IDs 111, 222, and 333 each occur once. IDs 444 and 555 each occur twice. ID 666 occurs three times.
In other words, there are three single values, two duplicate values, and one triplicate value. Variable N numbers
the relative occurrence of each ID.
IDENTIFYING DUPLICATES WITH PROC SORT
Use PROC SORT to remove duplicate values:
proc sort data=test nodupkey;
by id;
run;
Observations in data set TEST are sorted by ID in ascending order. The NODUPKEY option deletes any observations with duplicate BY values (i.e., observations that duplicate a previously encountered value of ID).
More than one variable may be specified on the BY statement; SAS will delete any observations that duplicate
previously encountered combinations of BY values.
Use the NODUPREC option to delete duplicate observations (i.e., where all variable values are repeated).
1
NESUG 2006
Coders' Corner
STORING DUPLICATES
Use the DUPOUT= option with NODUPKEY (or NODUPREC) to output duplicates to the specified SAS data set:
proc sort data=test nodupkey dupout=dups;
by id;
run;
Observations in data set TEST are sorted by ID in ascending order. The NODUPKEY option deletes any observations with duplicate BY values from data set TEST. The DUPOUT= option outputs observations with duplicate
BY values to data set DUPS. The resulting data sets look as follows:
Data set TEST [NODUPKEY]
Data set DUPS
Obs
ID
N
Obs
ID
N
1
2
3
4
5
6
111
222
333
444
555
666
1
1
1
1
1
1
1
2
3
4
444
555
666
666
2
2
2
3
Data set TEST contains six observations, including the first occurrence of each value of ID. (Note that N equals
1 for each.) Data set DUPS contains four observations, including the subsequent occurrences of each value of
ID. (Note that N equals 2 or 3 for each.)
PROS AND CONS
Pros: easy to use; can output duplicates to another SAS data set; can delete duplicate observations
Cons: original and duplicates not stored together; no frequencies; noncontiguous duplicates may not be detected
IDENTIFYING DUPLICATES WITH PROC FREQ
Use PROC FREQ to count the number of times each ID occurs:
proc freq data=test;
tables id;
run;
PROC FREQ prints the following output:
The FREQ Procedure
Cumulative
Cumulative
ID
Frequency
Percent
Frequency
Percent
©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤
111
1
10.00
1
10.00
222
1
10.00
2
20.00
333
1
10.00
3
30.00
444
2
20.00
5
50.00
555
2
20.00
7
70.00
666
3
30.00
10
100.00
IDs 111, 222, and 333 each have a Frequency of 1; IDs 444 and 555 each have a Frequency of 2; and ID 666
has a Frequency of 3. In other words, there are three single values, two duplicate values, and one triplicate value.
2
NESUG 2006
Coders' Corner
COUNTING FREQUENCIES
The Frequency indicates how many times each ID occurs. This categorizes each ID as being a single occurrence (Frequency equals 1), a duplicate (Frequency equals 2), a triplicate (Frequency equals 3), and so on.
PROC FREQ may produce voluminous output, however, depending on the number of IDs. Output the frequency
counts to a SAS data set, and run PROC FREQ on the Frequency variable to summarize duplicates:
proc freq data=test noprint;
tables id/out=freqs;
run;
The NOPRINT option on the PROC FREQ statement suppresses printed output. Frequency-count information
for ID is saved to temporary SAS data set FREQS, which looks as follows:
Data set FREQS
Obs
ID
COUNT
PERCENT
1
2
3
4
5
6
111
222
333
444
555
666
1
1
1
2
2
3
10
10
10
20
20
30
Data set FREQ contains six observations and three variables. There is one observation for each value of ID.
The frequency of each ID is stored in variable COUNT. The percentage that each count represents of the total
number of observations is stored in variable PERCENT.
Run a second PROC FREQ to process data set FREQS:
proc freq data=freqs;
tables count;
run;
PROC FREQ prints the following output:
The FREQ Procedure
Frequency Count
Cumulative
Cumulative
COUNT
Frequency
Percent
Frequency
Percent
©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤
1
3
50.00
3
50.00
2
2
33.33
5
83.33
3
1
16.67
6
100.00
COUNT has the value 1 three times (i.e., has a Frequency of 3), meaning there are three single values.
COUNT has the value 2 two times, meaning there are two duplicates.
COUNT has the value 3 one time, meaning there is one triplicate.
The two PROC FREQ steps count all occurrences (i.e., unique values, duplicates, triplicates, and so on,
depending on the data).
3
NESUG 2006
Coders' Corner
STORING DUPLICATES
Merge data set FREQS with data set TEST to subset observations that do not have unique values of ID:
data dups;
merge freqs(in=inF where=(count>1) keep=id count)
test (in=inT);
by id;
if inF and inT;
drop count;
run;
Use the IN= data set option to create variables that store information about the origin of each observation. SAS
creates temporary variables INF and INT that have the value 1 when the respective data set contributes to the
current observation and the value 0 when it does not. Use the WHERE= option to merge only those observations
from data set FREQS where COUNT is greater than 1 (that is, a duplicate). Use the KEEP= option to keep only
the variables ID and COUNT (used in the WHERE= option).
Use an IF-THEN statement to output only those observations with an ID in both data sets FREQS and TEST to
data set DUPS. Use a DROP statement to drop the COUNT variable. The resulting data set looks as follows:
Data set DUPS
Obs
ID
N
1
2
3
4
5
6
7
444
444
555
555
666
666
666
1
2
1
2
1
2
3
Data set TEST must be in sort order, sorted with PROC SORT, or indexed by ID. Data set FREQS ¡ª the output
data set from the FREQ procedure ¡ª is already sorted by ID in ascending order.
PROS AND CONS
Pros: easy to summarize counts and percentages; data-driven
Cons: requires two steps; cannot count combinations of variable values
IDENTIFYING DUPLICATES WITH PROC SQL
Use PROC SQL to count the number of unique IDs and the number of observations:
proc sql;
select count(distinct id) as UniqueIDs,
count(*) as NObs
from test;
quit;
The COUNT function with the DISTINCT keyword counts nonmissing unique values in a column (variable).
The COUNT function with an asterisk (*) counts the number of rows (observations) in the table (data set).
The AS keyword names the variable to store the result. PROC SQL prints the following output:
UniqueIDs
NObs
©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤
6
10
4
NESUG 2006
Coders' Corner
PROC SQL counted six unique ID values (UNIQUEIDS) out of ten observations (NOBS). This indicates that one
or more ID values are duplicated, but it does not indicate which ones. This query is perhaps best used to document that there are no duplicates (i.e., when the number of UNIQUEIDS equals NOBS).
Use PROC SQL to count the number of times each ID occurs:
proc sql;
select id, count(*) as IDCount
from test
group by id;
quit;
This query treats observations with the same value of ID as a group and counts the number of times each occurs.
PROC SQL prints the following output:
ID
IDCount
©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤
111
1
222
1
333
1
444
2
555
2
666
3
Save the query results to a SAS data set. Run a second query to count the number of times each IDCOUNT
occurs:
proc sql;
create table counts as
select id, count(*) as IDCount
from test
group by id;
select IDCount, count(*) as N
from counts
group by IDCount;
quit;
CREATE TABLE outputs the results of the first query to SAS data set COUNTS. The second query treats observations with the same value of IDCOUNT as a group and counts the number of times each occurs. PROC SQL
prints the following output:
IDCount
N
©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤©¤
1
3
2
2
3
1
IDCOUNT 1 has an N of 3, meaning there are three single values. IDCOUNT 2 has an N of 2, meaning there are
two duplicates. IDCOUNT 3 has an N of 1, meaning there is one triplicate.
Note that PROC FREQ could be run on data set COUNTS to produce these results plus percentages. This
would require a second step (i.e., instead of a second query within one PROC SQL step).
STORING DUPLICATES
Perform an inner join of observations in data sets TEST and COUNTS where the ID occurs in COUNTS:
proc sql;
create table dups as
select T.*
5
................
................
In order to avoid copyright disputes, this page is only a partial summary.
To fulfill the demand for quickly locating and searching documents.
It is intelligent file search solution for home and business.
Related download
- data preparation for gephi step by step
- quantitative analysis using excel
- student unique identifier uniq id steps to assigning nde
- in this topic we will explore the contents of the item
- reshaping panel data using excel and stata
- steps to creating your intelligent mail barcode
- rre rapid results entry and policy
- company identifiers insead
- identifying duplicate values
- cspro data entry user s guide v7 5
Related searches
- identifying personal values worksheet
- request duplicate nursing license pa
- dmv duplicate title
- how to find duplicate value in excel
- find duplicate rows in excel
- microsoft excel duplicate formula
- remove duplicate from arraylist
- replicate vs duplicate environmental samples
- duplicate vs replicate
- duplicate vs replicate sampling
- field duplicate epa
- sample duplicate rpd calculation