Transposing Data Using PROC SUMMARY'S IDGROUP Option

ABSTRACT

When clinical data are stored with multiple observations per subject, a common task is to rearrange the data so there is only one observation per subject. That single observation contains all or part of the information previously spread over multiple observations. Such rearranging of data is commonly done with either PROC TRANSPOSE or with a data step that often contains one or more arrays. There is a little used alternative to these methods, the IDGROUP option in PROC SUMMARY. The method is little used since the task is not made explicit in the documentation for the procedure and it has not been described in any papers that the authors could find at SUGI, SAS Global Forum, or various regional user group meetings. This paper describes several situations and shows the SAS® code needed for using the IDGROUP option in PROC SUMMARY as an alternative to the more common methods for rearranging data.


INTRODUCTION
The “id-group-specification” was added to PROC SUMMARY in version 8; this excerpt from the online users guide describes it as follows …

“id-group-specification combines the features and extends the ID statement, the IDMIN option in the PROC statement, and the MAXID and MINID options in the OUTPUT statement to create an OUT= data set that identifies multiple extreme values. The form of the id-group-specification is 

IDGROUP (<MIN|MAX (variable-list-1) <...MIN|MAX (variable-list-n)>> <<MISSING> <OBS>
<LAST>> OUT <[n]> (id-variable-list)=<name(s)>)”

 

This statement implies that IDGROUP supersedes all the other ID forms and that we should consider those forms depreciated. The IDGROUP keyword can be used multiple times in an OUTPUT statement, and the OUTPUT statement can be use multiple times in a single call to PROC SUMMARY.
The MIN|MAX keywords can appear zero or more times within an IDGROUP specification and the effect is somewhat analogous to a BY statement where MAX is similar to DESCENDING and MIN is similar to ASCENDING. For example, a BY statement in PROC SORT “BY descending A B” would order observations first by A, largest to smallest, and within each level of A, B would be ordered from smallest to largest. This would be written in IDGROUP as MAX (A) MIN (B). Also
note that while the options MIN and MAX when used in the context of a summary procedure like PROC SUMMARY may imply that the target variables be numeric, that is not the case for IDGROUP variables. They may be either character or numeric. This makes it possible to find the MAX and MIN of character variables. In summary, the MIN|MAX keywords and the associated variables function similar to the BY statement to order the observations within each level of the
class variables.

The MISSING keyword option directs PROC SUMMARY to treat missing values as valid levels for the ID variables. The default as with most procedures is to ignore observations with missing values. The OBS keyword option directs PROC SUMMARY to include variable(s) in the output data set to identify the observation number of the extreme value(s). The keyword option LAST directs PROC SUMMARY to change the criterion for selecting observations when there are
multiple observations that fall into an ID group.

FINDING EXTREME VALUES AND TRANSPOSING DATA USING IDGROUP

One common example of finding extreme values is finding CMAX and TMAX from concentration data. Using the following small sample data, the example computes the extreme values CMAX, TMAX, CLAST. TMAX and CLAST could be obtained sorting by and merging but using PROC SUMMARY the values are obtained in one step without sorting. While finding interesting extremes we can also compute common summary statistics for concentration.

 

data conc;
input subjid:$3. time:time5. conc @@;
format time time5.;
cards;
001 08:00 0 001 08:30 .1 001 09:00 .25
001 09:35 .2 001 10:05 .12 001 10:30 .05
002 08:01 0 002 08:31 .2 002 09:01 .2
002 09:36 .33 002 10:06 .22 002 10:31 .10
;;;;
run;
proc summary data=conc nway; 1
class subjid; 2
var conc; 3
output out=concStats(drop=_:) 4
idgroup(max(conc) out(conc time)=cmax tmax) 5
idgroup(max(time) out(conc)=clast) 6
n=n mean=mean std=std median=median 7
;
run;


Code details:
1. Call PROC SUMMARY using NWAY option to get class groups by SUBJID only.
2. Specify the CLASS or grouping variable SUBJID;
3. Tell PROC SUMMARY to use CONC as an analysis variable.
4. Begin the OUTPUT statement. Name the output data set and drop variables beginning with the underscore character; we don’t need them for this application.  No output is produce without at least one OUTPUT statement.
5. This IDGROUP option finds maximum concentration for each subject and outputs
    OUT (CONC TIME)= the maximum concentration and the TIME associated with the
    concentration naming them CMAX and TMAX.
6. This second IDGROUP options finds the maximum time for each subject and outputs
    OUT (CONC) the concentration associated with that time naming it CLAST.
7. These statistic options request summary statistics for the variable named in the VAR
    statement. This syntax is acceptable when there is one variable in the VAR list.

PROC SUMMARY produces the following data set, the maximum concentration (CMAX), the time (TMAX) associated with that extreme value and the concentration from the last observed time (CLAST).

Another example of finding multiple extreme values where all values from all observations are returned is very similar to a transpose. Transposing multiple variables to wide format usually requires at least two calls to PROC TRANSPOSE or a somewhat complicated data step using multiple arrays. Consider the following data and the PROC TRANSPOSE steps required to convert it to wide format. The data have FORMAT/INFORMAT and LABEL attributes associated
with each variable. This will help show how the attributes are inherited by the different transpose methods demonstrated.

data have;
attrib memberID length=$1 informat=$1. format=$1. label='Member ID';
attrib IC length=$3 informat=$3. format=$upcase3. label='IC';
attrib Charge length=8 informat=F8. format=dollar10.
label='Charge';
input (_all_)(:) @@;
cards;
1 301 5207 1 301 . 1 . 6082
2 473 5207 2 473 5207 2 301 6082
3 325 6082 3 473 5207
4 473 94 4 352 94
;;;;
run;

PROC CONTENTS and PROC PRINT of data set HAVE …

proc transpose data=have out=IC(drop=_:) prefix=IC_; 1
by memberID; 2
var IC; 3
run;
proc transpose data=have out=charge(drop=_:) prefix=Charge_; 1
by memberID; 2
var charge; 3
run;
data wideFromTranspose; 4
merge IC charge; 5
by memberID; 6
run;


Code details:
1. Call PROC TRANSPOSE
a. Specify input and output data sets names.
b. Output name is transpose variable name
c. PREFIX= specifies the root name of the SAS enumerated variable list created by PROC TRANSPOSE
2. BY statement to group the observations for each MEMBERID, data must be sorted.
3. Specify the name of the variable to transpose.
4. Create a new wide data set.
5. Merge statement lists the names of the output data sets from the PROC TRANSPOSE steps.
6. Match the observations in the merge BY the variable(s) listed.

PROC CONTENTS and PROC PRINT of data set WIDEFROMTRANSPOS …

Advantages to this technique are inheritance of the FORMAT attribute and you do not need to know the dimension of the SAS enumerated variable lists that are created. Disadvantages include the amount of code needed; you need one call to proc transpose for each variable transposed plus the data step to combine the data.

We could use a data set transpose but we consider that method the least desirable option due to complexity and increased possibility for error.

data wideFromDataStep; 1
array IC_[3] $3; 2
array Charge_[3]; 2
do _n_ = 1 by 1 until(last.memberID); 3
set have; 4
by memberID; 4
IC_[_n_] = IC; 5
Charge_[_n_] = Charge; 5
end; 3
drop IC Charge; 6
run; 7


Code details:
1. Start a data step and name the output.
2. Define ARRAY of variables to receive the values of the transposed variables.
3. Use a DO loop UNTIL (LAST. to read each BY group of the data. This avoids the need to RETAIN the array elements and takes advantage of the default initialize to  missing feature of the data step.
4. SET statement names input data set and BY statement sets up FIRST and LAST variables.
5. Using index variable _N_ assign the transpose variables to the associated ARRAYS.
6. Drop the transposed variables.
7. End the data step.

This code produces exactly the same data but not the same data set as above because none of the attributes for the transposed variables are inherited. Each transposed variable must be specified in an array, of the proper type and length, contributing to the possibility of error.

PROC CONTENTS output for data set WIDEFROMDATASTEP.

Contrast the above with a similar transpose produced using PROC SUMMARY, with a little help from PROC SQL.

proc sql noprint; 1
select max(obs) into :obs 2
from 3
( 3
select count(*) as obs 3
from have 3
group by memberid 3
) 3
;
quit;
run;
proc summary nway data=have missing; 4
class memberID; 5
output 6
out = work.wide(drop=_type_ _freq_) 7
idgroup(out[&obs](ic charge)=) 8
;
run;


Code details:
1. Call PROC SQL with NOPRINT option we are creating a macro variable no printed output is required.
2. Create a macro variable OBS as MAX (OBS), we need a macro variable because the value needs to become part of the code for PROC SUMMARY.
3. Sub query is count of records for each MEMBERID.
4. Call PROC SUMMARY
a. NWAY causes CLASS to output the highest WAY of the class levels. Similar to using BY MEMBERID but no sorting needed.
b. MISSING specifies that MISSING values of the CLASS variable be treated as a valid class level.
5. Specify MEMBERID as the CLASS variable to group the observations.
6. OUTPUT statement needed to get PROC SUMMARY to create output data set.
7. OUT= specifies the name of the output data set. We don’t need the two variables (_TYPE_ and _FREQ_) created by PROC SUMMARY so drop them.

8. This IDGROUP specification with no explicit ID variable(s) outputs each variable named in the OUT parameter in the order the values are encountered in the data set.
a. The variables names mentioned in the parenthesized list following OUT[] are the variables for whom we want extreme values.
b. The = following tells PROC SUMMARY to use the same names as the names in the parenthesized list.
c. The number contained in the macro variable &OBS is the number of extreme observations.

This code produced exactly the same output as with a PROC TRANSPOSE for each variable and a MERGE step to combine the wide data sets, or the one step data step transpose. The advantage being that any number of character or numeric variables can be transposed in one step, and all variable attributes are inherited. Since we did not use the MIN or MAX options the extreme (transposed) values are returned in observation order. The disadvantage being the dimension of the SAS enumerated variable lists created by PROC SUMMARY needs to be determined before calling PROC SUMMARY. Here, done using PROC SQL, and the maximum number of observations for any CLASS group cannot exceed 100.

CONCLUSION
The IDGROUP option in PROC SUMMARY can be used to rearrange a data set from narrow (multiple observations per subject) to wide (single observation per subject with multiple variables). There are some limitations, one being the number of observations within CLASS groups.
However, the reduced amount of SAS code required makes the IDGROUP a useful alternative to other conventional methods for transposing many variables in a data set to wide format 


CONTACT INFORMATION
Comments and suggestions are welcome and can be sent to the authors via e-mail.
John King ouachitaclinicaldataservices@gmail.com
Mike Zdeb msz03@albany.edu

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值