No. 05 - The Least k Numbers


Question: Please find the least k numbers out of n numbers. For example, given the 8 numbers 4, 5, 1, 6, 2, 7, 3 and 8, return the least 4 numbers: 1, 2, 3 and 4.

Analysis: The naïve solution is to sort the n input numbers in increasing order, so that the least k numbers are the first k numbers. Since it needs to sort, its time complexity is O(n log n). Interviewers will ask us to explore more efficient solutions.
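
As a baseline for comparison, the naïve approach can be sketched as follows. This is only an illustration: the function name LeastKBySorting is hypothetical, and it sorts a copy so that the caller's data is left untouched.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Baseline: sort a copy of the input and keep the first k values.
// Returns an empty vector when k is larger than the input size.
std::vector<int> LeastKBySorting(std::vector<int> data, std::size_t k)
{
    if (k > data.size())
        return {};

    std::sort(data.begin(), data.end());  // O(n log n)
    data.resize(k);                       // keep only the k smallest values
    return data;
}
```

For the 8 numbers in the question and k = 4, this returns 1, 2, 3 and 4.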

Solution 1: O(n log k) time efficiency, suitable for data with huge size

A data container with capacity k is first created to store the least k numbers, and then the n input numbers are read one at a time. If the container has fewer than k numbers, the number read in the current round (denoted as num) is inserted into the container directly. If it already contains k numbers, num cannot be inserted directly any more, but it may replace an existing number in the container. We get the maximum of the k numbers in the container and compare it with num. If num is less than the maximum, we replace the maximum with num. Otherwise we discard num: the container already holds k numbers that are no greater than num, so num cannot be one of the least k numbers.

Up to three steps are required when a number is read and the container is full: first we find the maximum number, then we may delete the maximum number, and finally we may insert the new number. The second and third steps are optional; they are needed only when the number read in the current round is less than the maximum number in the container. If we implement the data container as a balanced binary tree, these three steps cost O(log k) time. Therefore, the overall time complexity is O(n log k) for n input numbers.

We have different choices for the data container. Since we need to get the maximum number out of k numbers, the intuitive choice is a maximum heap. In a maximum heap the root is never less than its children, so it costs O(1) time to get the maximum number. However, it takes O(log k) time to insert or delete a number.
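
If standard-library containers are allowed, the maximum heap does not have to be written by hand. The following is a minimal sketch, assuming std::priority_queue is acceptable to the interviewer. The function name LeastKWithHeap is hypothetical, and unlike the multiset code in this post it returns every input number when k exceeds the input size.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Keeps the k smallest values seen so far in a max-heap of size at most k.
std::vector<int> LeastKWithHeap(const std::vector<int>& data, std::size_t k)
{
    if (k == 0)
        return {};

    std::priority_queue<int> heap;  // max-heap by default: top() is the largest

    for (int num : data)
    {
        if (heap.size() < k)
            heap.push(num);          // O(log k)
        else if (num < heap.top())   // O(1) look at the current maximum
        {
            heap.pop();              // O(log k)
            heap.push(num);          // O(log k)
        }
    }

    // Drain the heap; values come out largest first, so fill from the back.
    std::vector<int> result(heap.size());
    for (std::size_t i = result.size(); i > 0; --i)
    {
        result[i - 1] = heap.top();
        heap.pop();
    }
    return result;
}
```

Each input number costs at most O(log k) work, which matches the O(n log k) bound derived above.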

Writing a maximum heap by hand takes a lot of code, which is difficult to finish in an interview that lasts only dozens of minutes. We can also implement the container as a red-black tree. A red-black tree classifies its nodes into red and black categories, and a set of rules ensures that it stays roughly balanced. Therefore, it costs O(log k) time to find, insert and delete a number. The classes set and multiset in STL are typically implemented with red-black trees. We may use data containers in STL directly if our interviewers are not against it. The following sample code is based on the multiset in STL:

#include <cstddef>
#include <functional>
#include <set>
#include <vector>

using namespace std;

// Ordered with greater<int>, so begin() always points at the largest stored number.
typedef multiset<int, greater<int> >            intSet;
typedef multiset<int, greater<int> >::iterator  setIterator;

void GetLeastNumbers(const vector<int>& data, intSet& leastNumbers, int k)
{
    leastNumbers.clear();

    if(k < 1 || data.size() < static_cast<size_t>(k))
        return;

    vector<int>::const_iterator iter = data.begin();
    for(; iter != data.end(); ++iter)
    {
        if(leastNumbers.size() < static_cast<size_t>(k))
            leastNumbers.insert(*iter);
        else
        {
            // The container is full: replace its largest number if *iter is smaller.
            setIterator iterGreatest = leastNumbers.begin();

            if(*iter < *iterGreatest)
            {
                leastNumbers.erase(iterGreatest);
                leastNumbers.insert(*iter);
            }
        }
    }
}

Solution 2: O(n) time efficiency, suitable only when we can reorder the input

We can also use the function Partition from quick sort to solve this problem, under one assumption: the n input numbers are stored in an array that we are allowed to modify. Partition picks a pivot and rearranges the array so that the numbers less than the pivot are at its left side and the greater ones are at its right side. If the pivot ends up at index k - 1, that is, at the k-th position, then the first k numbers of the array are the least k numbers, although they are not necessarily sorted. Otherwise we partition again on the side that contains index k - 1. We can develop the following code according to this solution:

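#include <cstddef>  // NULL

// Partition is the partition routine of quick sort and is not shown in this post.
// As called below, it partitions data[start..end] and returns the pivot's final index.
int Partition(int* data, int length, int start, int end);

void GetLeastNumbers(int* input, int n, int* output, int k)
{
    if(input == NULL || output == NULL || k > n || n <= 0 || k <= 0)
        return;

    int start = 0;
    int end = n - 1;
    int index = Partition(input, n, start, end);
    while(index != k - 1)
    {
        if(index > k - 1)
        {
            // The pivot landed to the right of index k - 1: continue in the left part.
            end = index - 1;
            index = Partition(input, n, start, end);
        }
        else
        {
            // The pivot landed to the left of index k - 1: continue in the right part.
            start = index + 1;
            index = Partition(input, n, start, end);
        }
    }

    // input[0..k-1] now holds the least k numbers (not necessarily sorted).
    for(int i = 0; i < k; ++i)
        output[i] = input[i];
}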
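
The code above calls Partition without showing it. For readers who want to run it, here is one possible implementation that matches the call Partition(input, n, start, end). It is an assumption, not the author's version: it uses the last number of the range as the pivot (the Lomuto scheme) and throws on an invalid range.

```cpp
#include <stdexcept>
#include <utility>

// Hypothetical Partition: rearranges data[start..end] around the pivot data[end]
// and returns the final index of the pivot.
int Partition(int* data, int length, int start, int end)
{
    if (data == nullptr || length <= 0 || start < 0 || end >= length || start > end)
        throw std::invalid_argument("Invalid range for Partition");

    int pivot = data[end];
    int boundary = start;  // next slot for a number smaller than the pivot
    for (int i = start; i < end; ++i)
    {
        if (data[i] < pivot)
        {
            std::swap(data[i], data[boundary]);
            ++boundary;
        }
    }
    std::swap(data[boundary], data[end]);  // put the pivot between the two groups
    return boundary;
}
```

A fixed pivot such as data[end] degrades to O(n^2) time on already sorted input. Picking the pivot at random is a common way to make that case unlikely.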

Comparison between two solutions

The second solution, based on the function Partition, costs only O(n) time on average, so it is more efficient than the first one. However, it has two obvious limitations: it needs to load all input numbers into an array, and it has to reorder the input numbers.
Even though the first solution takes more time, it does not have these two limitations. It is not required to reorder the input numbers (data in the code above). We read a number from data in each round, and all write operations take place in the container leastNumbers. It does not require loading all input numbers into memory at one time, so it is suitable for huge-size data. Suppose our interviewer asks us to get the least k numbers from a huge-size input. Obviously we cannot load all of the data into limited memory at one time. With the first solution we can read one number from auxiliary storage (such as disk) in each round and determine whether we need to insert it into the container leastNumbers. It works as long as memory can accommodate leastNumbers, so it works especially well when n is huge and k is small.

The characteristics of these two solutions are summarized in Table 1:


                                First Solution    Second Solution
Time complexity                 O(n log k)        O(n)
Reorder input numbers?          No                Yes
Suitable for huge-size data?    Yes               No

Table 1: Pros and cons of two solutions

Since each solution has its own pros and cons, candidates had better ask their interviewers for more requirements and details before choosing the most suitable solution, including the input data size and whether reordering the input numbers is allowed.

The author Harry He owns all the rights of this post. If you are going to use part of or the whole of this article in your blog or webpages, please add a reference to http://codercareer.blogspot.com/. If you are going to use it in your books, please contact me (zhedahht@gmail.com). Thanks.
