SAS code: 1:N case-control matching (gmatch)

刘·划水摸鱼

Last modified 2022-04-24 17:25:31

First published 2022-04-24 10:57:51

Article link: https://blog.csdn.net/liuyi1750/article/details/124377601

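This article introduces the SAS macro gmatch, which uses a greedy algorithm for 1:N case-control matching. It covers the matching principle, parameter settings, matching-variable weights and distance calculation, and walks through matching on an example data set. By setting weights on the matching variables, gmatch can meet different matching requirements, such as exact matching or distance-based matching.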
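
To make the greedy rule concrete, here is a minimal SAS sketch of 1:1 greedy matching on a single variable. It is not the gmatch source: the ages, the array names (case_x, ctl_x, used) and the output data set greedy_demo are made up for illustration, and the sketch leaves out weights, distance limits and multiple controls per case.

```sas
/* Illustrative sketch only; values and names are hypothetical. */
data greedy_demo;
    array case_x[3] _temporary_ (30 41 52);        /* case ages */
    array ctl_x[5]  _temporary_ (29 33 40 55 60);  /* control ages */
    array used[5]   _temporary_ (0 0 0 0 0);       /* 1 once a control is taken */

    do i = 1 to dim(case_x);
        best_j = .;
        best_d = .;
        do j = 1 to dim(ctl_x);
            if not used[j] then do;
                d = abs(case_x[i] - ctl_x[j]);
                /* strict < keeps the first control encountered on ties */
                if best_j = . or d < best_d then do;
                    best_j = j;
                    best_d = d;
                end;
            end;
        end;
        if best_j ne . then do;
            used[best_j] = 1;   /* once taken, a control is never released */
            case_age = case_x[i];
            control_age = ctl_x[best_j];
            output;             /* one row per matched pair */
        end;
    end;
    keep i case_age control_age best_d;
run;
```

Each case takes the closest control that is still available, and a control that has been taken is never handed to a later case, even if it would have been a closer match for that case.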

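/* timex (passed to gmatch as time=): 2 for cases, 3 for controls */
data a; set a;
if gdm=1 then timex=2; if gdm=0 then timex=3; run;
/* keep only rows with _COL2 equal to A1_1 or A1_2 */
data b; set a;
if _COL2="A1_1" or _COL2="A1_2" then output b; run;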

%gmatch(data=b,group=gdm,id=sample,
mvars= age _COL3 batch_ad,wts=5 0 0,dmaxk=5 0 0,dmax=,transf=0,
time=timex, dist= 2,
ncontls=1,seedca=2022,seedco=2021,
out=GDMcc,outnmca=non_case,outnmco=non_control,print=Y );
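
The call above passes a weight for each matching variable (wts=) and a distance option (dist=). The macro documentation defines the distance Dij between case i and control j either as a weighted Euclidean distance or as a weighted sum of absolute differences. As a rough sketch, both can be computed for one case-control pair as follows. The values, weights and names (xi, xj, w, dij_demo) are made up for illustration and are not taken from the macro.

```sas
/* Illustrative sketch only; values and names are hypothetical. */
data dij_demo;
    array xi[2] _temporary_ (30 1);   /* case i: e.g. age, sex */
    array xj[2] _temporary_ (33 0);   /* control j: same factors */
    array w[2]  _temporary_ (1 10);   /* W.k: weight of each factor */

    d_abs = 0;   /* running SUM { W.k*ABS(X.ik-X.jk) } */
    ss = 0;      /* running SUM { W.k*(X.ik-X.jk)**2 } */
    do k = 1 to dim(w);
        d_abs = d_abs + w[k] * abs(xi[k] - xj[k]);
        ss = ss + w[k] * (xi[k] - xj[k])**2;
    end;
    d_euc = sqrt(ss);   /* weighted Euclidean distance */

    put d_abs= d_euc=;  /* write both distances to the log */
    drop k ss;
run;
```

With these made-up values the absolute-difference distance is 1*3 + 10*1 = 13 and the Euclidean distance is SQRT(1*9 + 10*1) = SQRT(19). The large weight on the second factor makes a mismatch on it costly, which is how a relatively large weight pushes the match toward agreement on a two-level factor.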

/*------------------------------------------------------------------*
| The documentation and code below is supplied by HSR CodeXchange.
|
*------------------------------------------------------------------*/



/*------------------------------------------------------------------*
| MACRO NAME : gmatch
| SHORT DESC : Match 1 or more controls to cases using the
| GREEDY algorithm
*------------------------------------------------------------------*
| CREATED BY : Kosanke, Jon (04/07/2004 16:32)
| : Bergstralh, Erik
*------------------------------------------------------------------*
| PURPOSE
|
| GMATCH Macro to match 1 or more controls for each of N cases
| using the GREEDY algorithm--REPLACES GREEDY option of MATCH macro.
| Changes:
| --cases and controls in same dataset
| --not mandatory to randomly pre-sort cases and controls, but recommended
| --options to transform X's and to choose distance metric
| --input parameters consistent with %DIST macro for optimal matching
|
| *******
|
| Macro name: %gmatch
|
| Authors: Jon Kosanke and Erik Bergstralh
|
| Date: July 23, 2003
| October 31, 2003...tweaked print/means based on "time" var
|
| Macro function:
|
| Matching using the GREEDY algorithm
|
| The purpose of this macro is to match 1 or more controls(from a total
| of M) for each of N cases. The controls may be matched to the cases by
| one or more factors(X's). The control selected for a particular
| case(i) will be the control(j) closest to the case in terms of Dij.
| Dij can be defined in multiple ways. Common choices are the Euclidean
| distance and the weighted sum of the absolute differences between the
| case and control matching factors. I.e.,
|
| Dij= SQRT [SUM { W.k*(X.ik-X.jk)**2} ], or
|
| Dij= SUM { W.k*ABS(X.ik-X.jk) },
|
| where the sum is over the number of matching factors X (with index k),
| W.k = the weight assigned to matching factor k, and X.ik = the value of
| variable X(k) for subject i.
|
| The control(j) selected for a case(i) is the one with the smallest Dij
| (subject to constraints DMAX and DMAXK, defined below). In the case of
| ties, the first one encountered will be used. The higher the user-defined
| weight, the more likely it is that the case and control will be matched
| on the factor. Assign large weights (relative to the other weights) to
| obtain exact matches for two-level factors such as gender. An alternative to
| using weights might be to standardize the X's in some fashion. The macro
| has options to standardize all X's to mean 0 and variance 1 and to use
| ranks.
|
| The matching algorithm used is the GREEDY method. Using the greedy method,
| once a match is made it is never broken. This may result in inefficiencies
| if a previously matched control would be a better match for the current
| case than those controls currently available. (An alternative method is to
| do optimal matching using the VMATCH & DIST macros. This method guarantees
| the best possible matched set in terms of minimizing the total Dij.)
| The GREEDY method generally produces very good matches, especially if the
| control pool is large relative to the number of cases. When multiple
| controls/case are desired, the algorithm first matches 1 control to all
| cases and then proceeds to select second controls.
|
|
| The gmatch macro checks for missing values of matching variables and the
| time variable(if specified) and deletes those observations from the input
| dataset.
|
| Call statement:
|
|
| %gmatch(data=,group=,id=,
| mvars=,wts=,dmaxk=,dmax=,transf=,
| time=, dist=,
| ncontls=,seedca=,seedco=,
| out=,outnmca=,outnmco=,print=);
|
| Parameter definitions(R=required parameter):
|
|
| R data      SAS data set containing cases and potential controls. Must
|             contain the ID, GROUP, and the matching variables.
|
| R group     SAS variable defining cases. Group=1 if case, 0 if control.
|
| R id        SAS CHARACTER ID variable for the cases and controls.
|
|
| R mvars     List of numeric matching variables common to both case and
|             contr