Sorting in parallel is hard. You have two options in dask.dataframe:
set_index
As it stands now, you can call set_index with a single column as the index:
In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: df = pd.DataFrame({'x': [3, 2, 1], 'y': ['a', 'b', 'c']})
In [4]: ddf = dd.from_pandas(df, npartitions=2)
In [5]: ddf.set_index('x').compute()
Out[5]:
   y
x
1  c
2  b
3  a
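To see why this counts as a sort, and why it is expensive, here is a minimal pure-Python sketch of the general idea of range partitioning: pick division points, move every value to the bucket that covers its range, then sort each bucket independently. This is only an illustration and not dask's actual implementation; the function name `range_partition_sort` and the hand-picked `divisions` are assumptions made for the example.

```python
from bisect import bisect_right


def range_partition_sort(chunks, divisions):
    """Sort values spread over `chunks` by routing each one to a range bucket."""
    # One bucket per interval: below divisions[0], between consecutive
    # divisions, and from the last division upwards.
    buckets = [[] for _ in range(len(divisions) + 1)]
    # The "shuffle": every value may have to move to a different bucket.
    for chunk in chunks:
        for value in chunk:
            buckets[bisect_right(divisions, value)].append(value)
    # Each bucket can now be sorted on its own (in parallel, in a real system).
    return [sorted(bucket) for bucket in buckets]


chunks = [[3, 9, 1], [7, 2, 8], [5, 4, 6]]
sorted_buckets = range_partition_sort(chunks, divisions=[4, 7])
print(sorted_buckets)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The costly part is the shuffle in the middle, since in the worst case every row has to move between partitions.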
Unfortunately, dask.dataframe does not (as of November 2016) support multi-column indexes:
In [6]: ddf.set_index(['x', 'y']).compute()
NotImplementedError: Dask dataframe does not yet support multi-indexes.
You tried to index with this index: ['x', 'y']
Indexes must be single columns only.
nlargest
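Given how you phrased your question I suspect that this doesn't apply to you, but cases that use sorting can often get by with the much cheaper solution, nlargest.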
In [7]: ddf.x.nlargest(2).compute()
Out[7]:
0    3
1    2
Name: x, dtype: int64
In [8]: ddf.nlargest(2, 'x').compute()
Out[8]:
   x  y
0  3  a
1  2  b
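The reason nlargest is so much cheaper than a full sort can be sketched in plain Python: each partition only has to report its own top n values, and the final answer comes from the small list of candidates. This illustrates the general technique, not dask's internal code; `parallel_nlargest` is a name invented for the example.

```python
import heapq


def parallel_nlargest(partitions, n):
    # Step 1 (parallelisable): each partition keeps only its own n largest values.
    candidates = [heapq.nlargest(n, part) for part in partitions]
    # Step 2: merge the small candidate lists and take the n largest overall.
    merged = [value for cand in candidates for value in cand]
    return heapq.nlargest(n, merged)


partitions = [[3, 9, 1], [7, 2, 8], [5, 4, 6]]
top2 = parallel_nlargest(partitions, 2)
print(top2)  # [9, 8]
```

No rows move between partitions here; only n values per partition are sent to the final step, which is why this avoids the shuffle that set_index needs.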