4. Median of Two Sorted Arrays

4. Median of Two Sorted Arrays

There are two sorted arrays  nums1 and  nums2 of size m and n respectively.

Find the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)).

Example 1:

nums1 = [1, 3]
nums2 = [2]

The median is 2.0

Example 2:

nums1 = [1, 2]
nums2 = [3, 4]

The median is (2 + 3)/2 = 2.5

Subscribe to see which companies asked this question

这是我做的第二个leetcode题目,一开始以为和第一个一样很简单,但是做的过程中才发现这个题目非常难,给人一种“刚上战场就踩上地雷挂掉了”的感觉。后来搜了一下leetcode的难度分布表(leetcode难度及面试频率)才发现,该问题是难度为5的问题,真是小看了它!网上搜了很多答案,但是鲜见简明正确的解答,唯有一种寻找第k小值的方法非常好,在此整理一下。

       首先对leetcode的编译运行吐槽一下:貌似没有超时判断,而且small和large的数据集相差很小。此题一开始我采用最笨的方法去实现,利用排序将两个数组合并成一个数组,然后返回中位数:

[cpp]  view plain  copy
 print ? 在CODE上查看代码片 派生到我的代码片
  1. class Solution {  
  2. public:  
  3.     double findMedianSortedArrays(int A[], int m, int B[], int n) {  
  4.         // Start typing your C/C++ solution below  
  5.         // DO NOT write int main() function  
  6.         int *a=new int[m+n];  
  7.           
  8.         memcpy(a,A,sizeof(int)*m);  
  9.         memcpy(a+m,B,sizeof(int)*n);  
  10.           
  11.         sort(a,a+n+m);  
  12.           
  13.         double median=(double) ((n+m)%2? a[(n+m)>>1]:(a[(n+m-1)>>1]+a[(n+m)>>1])/2.0);  
  14.           
  15.         delete a;  
  16.           
  17.         return median;  
  18.     }  
  19. };  

该方法居然也通过测试,但是其复杂度最坏情况为O(nlogn),这说明leetcode只对算法的正确性有要求,时间要求其实不严格。

另一种方法即是利用类似merge的操作找到中位数,利用两个分别指向A和B数组头的指针去遍历数组,然后统计元素个数,直到找到中位数,此时算法复杂度为O(n)。之后还尝试了根据算法导论中的习题(9.3-8)扩展的方法,但是该方法会存在无穷多的边界细节问题,而且扩展也不见得正确,这个可从各网页的评论看出,非常不建议大家走这条路。

最后从medianof two sorted arrays中看到了一种非常好的方法。原文用英文进行解释,在此我们将其翻译成汉语。该方法的核心是将原问题转变成一个寻找第k小数的问题(假设两个原序列升序排列),这样中位数实际上是第(m+n)/2小的数。所以只要解决了第k小数的问题,原问题也得以解决。

首先假设数组A和B的元素个数都大于k/2,我们比较A[k/2-1]和B[k/2-1]两个元素,这两个元素分别表示A的第k/2小的元素和B的第k/2小的元素。这两个元素比较共有三种情况:>、<和=。如果A[k/2-1]<B[k/2-1],这表示A[0]到A[k/2-1]的元素都在A和B合并之后的前k小的元素中。换句话说,A[k/2-1]不可能大于两数组合并之后的第k小值,所以我们可以将其抛弃。

证明也很简单,可以采用反证法。假设A[k/2-1]大于合并之后的第k小值,我们不妨假定其为第(k+1)小值。由于A[k/2-1]小于B[k/2-1],所以B[k/2-1]至少是第(k+2)小值。但实际上,在A中至多存在k/2-1个元素小于A[k/2-1],B中也至多存在k/2-1个元素小于A[k/2-1],所以小于A[k/2-1]的元素个数至多有k/2+ k/2-2,小于k,这与A[k/2-1]是第(k+1)的数矛盾。

当A[k/2-1]>B[k/2-1]时存在类似的结论。

当A[k/2-1]=B[k/2-1]时,我们已经找到了第k小的数,也即这个相等的元素,我们将其记为m。由于在A和B中分别有k/2-1个元素小于m,所以m即是第k小的数。(这里可能有人会有疑问,如果k为奇数,则m不是中位数。这里是进行了理想化考虑,在实际代码中略有不同,是先求k/2,然后利用k-k/2获得另一个数。)

通过上面的分析,我们即可以采用递归的方式实现寻找第k小的数。此外我们还需要考虑几个边界条件:

  • 如果A或者B为空,则直接返回B[k-1]或者A[k-1];
  • 如果k为1,我们只需要返回A[0]和B[0]中的较小值;
  • 如果A[k/2-1]=B[k/2-1],返回其中一个;

最终实现的代码为:

[cpp]  view plain  copy
 print ? 在CODE上查看代码片 派生到我的代码片
  1. double findKth(int a[], int m, int b[], int n, int k)  
  2. {  
  3.     //always assume that m is equal or smaller than n  
  4.     if (m > n)  
  5.         return findKth(b, n, a, m, k);  
  6.     if (m == 0)  
  7.         return b[k - 1];  
  8.     if (k == 1)  
  9.         return min(a[0], b[0]);  
  10.     //divide k into two parts  
  11.     int pa = min(k / 2, m), pb = k - pa;  
  12.     if (a[pa - 1] < b[pb - 1])  
  13.         return findKth(a + pa, m - pa, b, n, k - pa);  
  14.     else if (a[pa - 1] > b[pb - 1])  
  15.         return findKth(a, m, b + pb, n - pb, k - pb);  
  16.     else  
  17.         return a[pa - 1];  
  18. }  
  19.   
  20. class Solution  
  21. {  
  22. public:  
  23.     double findMedianSortedArrays(int A[], int m, int B[], int n)  
  24.     {  
  25.         int total = m + n;  
  26.         if (total & 0x1)  
  27.             return findKth(A, m, B, n, total / 2 + 1);  
  28.         else  
  29.             return (findKth(A, m, B, n, total / 2)  
  30.                     + findKth(A, m, B, n, total / 2 + 1)) / 2;  
  31.     }  
  32. };  

我们可以看出,代码非常简洁,而且效率也很高。在最好情况下,每次都有k一半的元素被删除,所以算法复杂度为logk,由于求中位数时k为(m+n)/2,所以算法复杂度为log(m+n)。



This problem is notoriously hard to implement due to all the corner cases. Most implementations consider odd-lengthed and even-lengthed arrays as two different cases and treat them separately. As a matter of fact, with a little mind twist. These two cases can be combined as one, leading to a very simple solution where (almost) no special treatment is needed.

First, let's see the concept of 'MEDIAN' in a slightly unconventional way. That is:

"if we cut the sorted array to two halves of EQUAL LENGTHS, then
median is the AVERAGE OF Max(lower_half) and Min(upper_half), i.e. the
two numbers immediately next to the cut
".

For example, for [2 3 5 7], we make the cut between 3 and 5:

[2 3 / 5 7]

then the median = (3+5)/2. Note that I'll use '/' to represent a cut, and (number / number) to represent a cut made through a number in this article.

for [2 3 4 5 6], we make the cut right through 4 like this:

[2 3 (4/4) 5 7]

Since we split 4 into two halves, we say now both the lower and upper subarray contain 4. This notion also leads to the correct answer: (4 + 4) / 2 = 4;

For convenience, let's use L to represent the number immediately left to the cut, and R the right counterpart. In [2 3 5 7], for instance, we have L = 3 and R = 5, respectively.

We observe the index of L and R have the following relationship with the length of the array N:

N        Index of L / R
1               0 / 0
2               0 / 1
3               1 / 1  
4               1 / 2      
5               2 / 2
6               2 / 3
7               3 / 3
8               3 / 4

It is not hard to conclude that index of L = (N-1)/2, and R is at N/2. Thus, the median can be represented as

(L + R)/2 = (A[(N-1)/2] + A[N/2])/2

To get ready for the two array situation, let's add a few imaginary 'positions' (represented as #'s) in between numbers, and treat numbers as 'positions' as well.

[6 9 13 18]  ->   [# 6 # 9 # 13 # 18 #]    (N = 4)
position index     0 1 2 3 4 5  6 7  8     (N_Position = 9)
		  
[6 9 11 13 18]->   [# 6 # 9 # 11 # 13 # 18 #]   (N = 5)
position index      0 1 2 3 4 5  6 7  8 9 10    (N_Position = 11)

As you can see, there are always exactly 2*N+1 'positions' regardless of length N. Therefore, the middle cut should always be made on the Nth position (0-based). Since index(L) = (N-1)/2 and index(R) = N/2 in this situation, we can infer that index(L) = (CutPosition-1)/2, index(R) = (CutPosition)/2.


Now for the two-array case:

A1: [# 1 # 2 # 3 # 4 # 5 #]    (N1 = 5, N1_positions = 11)

A2: [# 1 # 1 # 1 # 1 #]     (N2 = 4, N2_positions = 9)

Similar to the one-array problem, we need to find a cut that divides the two arrays each into two halves such that

"any number in the two left halves" <= "any number in the two right
halves".

We can also make the following observations:

  1. There are 2N1 + 2N2 + 2 position altogether. Therefore, there must be exactly N1 + N2 positions on each side of the cut, and 2 positions directly on the cut.

  2. Therefore, when we cut at position C2 = K in A2, then the cut position in A1 must be C1 = N1 + N2 - k. For instance, if C2 = 2, then we must have C1 = 4 + 5 - C2 = 7.

     [# 1 # 2 # 3 # (4/4) # 5 #]    
    
     [# 1 / 1 # 1 # 1 #]   
    
  3. When the cuts are made, we'd have two L's and two R's. They are

     L1 = A1[(C1-1)/2]; R1 = A1[C1/2];
     L2 = A2[(C2-1)/2]; R2 = A2[C2/2];
    

In the above example,

    L1 = A1[(7-1)/2] = A1[3] = 4; R1 = A1[7/2] = A1[3] = 4;
    L2 = A2[(2-1)/2] = A2[0] = 1; R2 = A1[2/2] = A1[1] = 1;

Now how do we decide if this cut is the cut we want? Because L1, L2 are the greatest numbers on the left halves and R1, R2 are the smallest numbers on the right, we only need

L1 <= R1 && L1 <= R2 && L2 <= R1 && L2 <= R2

to make sure that any number in lower halves <= any number in upper halves. As a matter of fact, since
L1 <= R1 and L2 <= R2 are naturally guaranteed because A1 and A2 are sorted, we only need to make sure:

L1 <= R2 and L2 <= R1.

Now we can use simple binary search to find out the result.

If we have L1 > R1, it means there are too many large numbers on the left half of A1, then we must move C1 to the left (i.e. move C2 to the right); 
If L2 > R1, then there are too many large numbers on the left half of A2, and we must move C2 to the left.
Otherwise, this cut is the right one. 
After we find the cut, the medium can be computed as (max(L1, L2) + min(R1, R2)) / 2;

Two side notes:

A. since C1 and C2 can be mutually determined from each other, we might as well select the shorter array (say A2) and only move C2 around, and calculate C1 accordingly. That way we can achieve a run-time complexity of O(log(min(N1, N2)))

B. The only edge case is when a cut falls on the 0th(first) or the 2Nth(last) position. For instance, if C2 = 2N2, then R2 = A2[2*N2/2] = A2[N2], which exceeds the boundary of the array. To solve this problem, we can imagine that both A1 and A2 actually have two extra elements, INT_MAX at A[-1] and INT_MAX at A[N]. These additions don't change the result, but make the implementation easier: If any L falls out of the left boundary of the array, then L = INT_MIN, and if any R falls out of the right boundary, then R = INT_MAX.


I know that was not very easy to understand, but all the above reasoning eventually boils down to the following concise code:

 double findMedianSortedArrays(vector<int>& nums1, vector<int>& nums2) {
    int N1 = nums1.size();
    int N2 = nums2.size();
    if (N1 < N2) return findMedianSortedArrays(nums2, nums1);	// Make sure A2 is the shorter one.
    
    if (N2 == 0) return ((double)nums1[(N1-1)/2] + (double)nums1[N1/2])/2;  // If A2 is empty
    
    int lo = 0, hi = N2 * 2;
    while (lo <= hi) {
        int mid2 = (lo + hi) / 2;   // Try Cut 2 
        int mid1 = N1 + N2 - mid2;  // Calculate Cut 1 accordingly
        
        double L1 = (mid1 == 0) ? INT_MIN : nums1[(mid1-1)/2];	// Get L1, R1, L2, R2 respectively
        double L2 = (mid2 == 0) ? INT_MIN : nums2[(mid2-1)/2];
        double R1 = (mid1 == N1 * 2) ? INT_MAX : nums1[(mid1)/2];
        double R2 = (mid2 == N2 * 2) ? INT_MAX : nums2[(mid2)/2];
        
        if (L1 > R2) lo = mid2 + 1;		// A1's lower half is too big; need to move C1 left (C2 right)
        else if (L2 > R1) hi = mid2 - 1;	// A2's lower half too big; need to move C2 left.
        else return (max(L1,L2) + min(R1, R2)) / 2;	// Otherwise, that's the right cut.
    }
    return -1;
} 

If you have any suggestions to make the logic and implementation even more cleaner. Please do let me know!



  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值