等距采样

I ran into a curious problem this week. Let's say you have unevenly-spaced data, and you want to pick the subset that is most nearly equally spaced. How do you do it?

I found a stackoverflow post that gave a really cool answer, but since the post didn't come with a lot of explanation, I thought I'd talk about it a bit.


I've been running some statistical analyses on evolutionary models of stars, and one thing that I noticed was that each track has a different number of points in it. This can happen for a lot of reasons, the simplest of which being that some stars live longer than others. But there are other reasons too: for example, some stars evolve more quickly than others, and they therefore need smaller time steps in order to capture the rapid changes that are occurring.

It dawned on me that this will bias the statistics towards the types of stars that have more data points. To correct for this bias, I wanted to grab a subset of models from every evolutionary track so that they're all the same size. Since the biggest thing that changes about a star throughout its main-sequence lifetime is the amount of hydrogen it has in its core, I decided that I would take the subset of models that are most evenly-spaced in core-hydrogen abundance.

But here's the problem: the points aren't distributed evenly throughout the evolution of the star, since some time steps are bigger and some are smaller than others. So how do I pick?

For the sake of visualization, let's say we have N=50 points, and we want to pick from that list the M=10 points that are the most spread out. Here's a picture of the situation: 


The obvious solution is to just go through the ideal spacing and grab the data point that is closest to each one. This works very well, but it is not actually the best solution.

Why? In some cases, it will result in the same point being picked twice. No problem, you say, just pick the next one that hasn't been picked already. But what if that point was actually the right pick, and the previous one should have been the one to choose another point instead? And what if that previous one was also not able to pick its closest point either? You can see how this gets hairy quickly.

So how do we find the best solution?


Integer programming

It turns out that we can solve this problem using integer programming. We know how far each point is from the ideal spacing; this becomes our cost matrix. We simply need to enforce that we choose the exact number of points that we want such that this cost matrix is minimized, and furthermore, that no point is used twice.

This situation ends up belonging to the class known as transportation problems. These problems are pervasive, and as you might imagine from the name, often arise when having to figure out how to ship things when you have a finite number of vehicles and a cost associated with each route.

The way we solve it is by first saying that what the best result possible is, let's call that B (you know, for best):

     B⃗ [min(H),,(Mi)min(H)+max(H)M1,,max(H)].

This was visualized above with the evenly-spaced dots.

Next, we need to calculate the cost  C  from every point to the ideal spacings:

     C|B1H1||B2H1||BMH1||B1H2||B2H2||BMH2||B1HN||B2HN||BMHN|.

Recall that  N  is the size of our original data, and  M  is the number of points we want to pick.

Now that we have all of the ingredients, all we need to say is that we want to choose exactly  M  of the  N  points such that the cost of that choice is minimized, and furthermore that none of the points are duplicated. We'll call the optimal solution  S^ . It will have the same shape as our cost matrix, and each column of  S^  that contains a 1 indicates a member we should take from  H⃗  . We can calculate it in the following optimization statement:

     S^=argminSsubject to and ijSijCijjSij=1 for all i=1NiSij1 for all j=1M.

We can plug this into our favorite linear transport problem solver, like R's lpSolve, and out comes our final solution: 


and we are done. Happy coding!

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值