SAS 中的Full Outer Join:Merge 语句

data dstarget;
 merge ds1 (in=a)
  ds2 (in=b);
 by USER_ID;
 if a or b
    %do_something;
run;

====================================================

What is a match-merge?

A match-merge

  • combines observations from two or more SAS data sets based on the values of specified common variables (one or more)
  • creates a new data set (the merged data set)
  • is done in a data step with the statements
    • MERGE to name the input data sets
    • BY to name the common variable(s) to be used for matching

Prerequisites for a match-merge

  • input data sets must have a common variable
  • input data sets must be sorted by the common variable(s)

From now on when we use the term "merge" it will mean "match-merge."

 

Examples of merging

When we want to combine two data sets by merging and we know beforehand that each observation in each data set has a match in the other data set we can do a very straight-forward merge. Here are a couple of examples.

Example 1.

Often different data on the same cases are stored in two or more different data sets. For example, you may have two person level data sets on exactly the same individuals but containing different information on those individuals. To combine the data on those individuals into one data set requires a merge. Here is the way the merge would work using the simplest example:

 

        Data set one           Data set two

ID A B ID C

10 1 2 10 0
20 3 4 20 5
30 5 6 30 7


This data step does a merge of data set one and two by ID:

        data three;
merge one two;
by id;


The output data set three looks like this:

            ID    A   B   C

10 1 2 0
20 3 4 5
30 5 6 7

The data sets being merged in the example above contain different data on the same cases and the variable ID is a unique identifier (no duplicates).


Example 2.

Often we want to combine data from two data sets where each observation has a match on the BY-variable(s) but there are duplicate values of the BY-variable(s) in one of the data sets. For example, we may want to add household level variables (in one data set) to persons from those same households (in a second data set). Or, we may want to add community level variables (in one data set) to the households of those same communities (in a second data set).

As an example we will merge two data sets used in examples in the section "Working with grouped observations"--a person level data set named percps99 and a household level data set named hhcps99. We know that the households represented are the same in both data sets and the household identifier is the variable H_SEQ. In the person level data set the variable H_SEQ has duplicate values (when the household has more than one member). In the household level data set, however, H_SEQ is unique.

We want to add the household level variables (in hhcps99) to the persons (in percps99). The two input data sets to the merge are permanent SAS data sets in the same directory. The merged data set will be a temporary SAS data set named indiv.

	libname in '/afs/isis/depts/cpc/computer/stone/data/class01/';

data indiv;
merge in.percps99
in.hhcps99;
by h_seq;

run;

 

More on MERGE: when matching is not perfect

What happens when there is no match on the BY variable(s) in a match-merge?

Here is a simple example:

        Data set one         Data set two

ID A B ID C

10 1 2 10 0
20 3 4 30 1
30 5 6

The second observation (ID=20) in data set one does not have a match on ID in data set two. A match-merge of one and two by ID like this,

         data three;
merge one two;
by id;

results in an output data set three which looks like this:

         ID   A   B     C

10 1 2 0 both data sets contributed to this observation
20 3 4 . only data set one (left-hand data set) contributed to this observation
30 5 6 1 both data sets contributed to this observation

 

The IN=variables

What if you want to keep in the output data set of a merge only the matches (only those observations to which both input data sets contribute)? SAS will set up for you special temporary variables, called the "IN=" variables, so that you can do this and more.

Here's what you have to do:

  • signal to SAS on the MERGE statement that you need the IN= variables for the input data set(s)
  • use the IN= variables in the data step appropriately

So to keep only the matches in the match-merge above, ask for the IN= variables and use them:

         data  three;
merge one(in=x) two(in=y); /* x & y are your choices of names */
by id; /* for the IN= variables for data */
if x=1 and y=1; /* sets one and two respectively */
run;

Data set three will now consist of only the matches on ID:

         ID   A   B     C

10 1 2 0
30 5 6 1

Only the matches are kept in the output data set above because of the way the IN= variables X and Y take on values in the PDV:

  • 1 if the data set contributes to the observation
  • 0 if the data set does not contribute to the observation

For the above example, you can picture the IN= variables X and Y taking on values like this:

      
ID A B C X Y

10 1 2 0 1 1
20 3 4 . 1 0
30 5 6 1 1 1


If you want to keep not only the matches, but also to keep track in separate data sets of the non-matches, you can let the data step create three data sets like this:

     data x1y1          /* x1y1, x1y0, x0y1 are your choices of data set names */
x1y0
x0y1;

merge one(in=x) two(in=y);
by id;

if x=1 and y=1 then output x1y1; /* write all matches to x1y1 */
if x=1 and y=0 then output x1y0;
if x=0 and y=1 then output x0y1;
run;

 

How do you know when you merged correctly?

Before doing a merge, ask yourself these questions:

  • Do you know your input data sets to MERGE?
    • Is each input data set sorted by the BY-variables?
    • Are the BY-variables named the same in each input data set? If not, you will need to do some renaming.
    • Are there other common variables? If yes, watch out--results may surprise you.
    • How many observations and how many variables are in the input data sets?
  • Do you need to keep all variables in the input data sets? If not, you may want to use the DROP= or KEEP= data set options.
  • Can you predict the number of variables in the output (merged) data set? (always possible)
  • Can you predict the N of the output data set? (not always possible, but it's a good idea to ask yourself this question anyhow)

Test your program: print and examine observations BEFORE and AFTER the merge.

Finally, take time to examine the log carefully.

 

Merge a single observation data set to every observation of a multi-observation data set.


proc summary nway data=income;
  var yr_income;
  output out=avg_income (drop= _type_ _freq_)
          mean=avg_yr_income;
run;


data income;
  if _N_ = 1 then set avg_income(keep = avg_yr_income);
  set income;
run;

** NOTE:  This method re-orders the variables in your data set such that avg_yr_income will be the first variable. **;

  • 0
    点赞
  • 3
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值