Reposted from: pandas: Shuffle rows/elements of DataFrame/Series
For a more detailed introduction, see: How to Shuffle Pandas Dataframe Rows in Python
You can randomly shuffle the rows of a pandas.DataFrame and the elements of a pandas.Series with the sample() method. There are other ways to shuffle, but sample() is convenient because it does not require importing other modules.
This article describes the following contents.
- Specify frac=1 for sample() to shuffle
- Reset index: ignore_index, reset_index()
- Update original object
In the sample code, the following CSV file is used.
import pandas as pd
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
The example uses pandas.DataFrame, but you can shuffle pandas.Series in the same way.
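As a minimal sketch of the Series case, the snippet below shuffles a small Series with the same sample(frac=1) call. The Series `s` and its labels are hypothetical, built inline only for illustration, and random_state=0 is set just to make the run repeatable.

```python
import pandas as pd

# Hypothetical Series built for illustration (not loaded from the article's CSV).
s = pd.Series(
    [64, 92, 70, 70, 88, 57],
    index=['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    name='point',
)

# frac=1 samples every element, so the result is the same data in random order.
shuffled = s.sample(frac=1, random_state=0)
print(shuffled)

# Sorting the shuffled Series by label restores the original Series.
print(shuffled.sort_index().equals(s))
# True
```

Each value stays attached to its label, so only the order changes.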
Note that if you want to sort rows by column values or by index rather than shuffle them, you can use sort_values() and sort_index(), respectively.
Specify frac=1 for sample() to shuffle
If the frac parameter is set to 1, all rows are sampled in random order, which is equivalent to shuffling the rows.
print(df.sample(frac=1))
# name age state point
# 2 Charlie 18 CA 70
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
# 0 Alice 24 NY 64
# 5 Frank 30 NY 57
# 4 Ellen 24 CA 88
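The following sketch checks that sample(frac=1) really is a shuffle: every original row appears exactly once, because sample() draws without replacement by default. The DataFrame is rebuilt inline from the table shown above, on the assumption that the CSV file is not available.

```python
import pandas as pd

# Same data as the CSV shown above, built inline so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

shuffled = df.sample(frac=1)

# The number of rows is unchanged.
print(len(shuffled) == len(df))
# True

# No index label is repeated, so no row was drawn twice.
print(shuffled.index.is_unique)
# True

# The shuffled index is a rearrangement of the original labels 0..5.
print(sorted(shuffled.index.tolist()) == df.index.tolist())
# True
```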
You can initialize the random number generator with a fixed seed using the random_state parameter. With the same seed, the rows are always shuffled in the same order.
print(df.sample(frac=1, random_state=0))
# name age state point
# 5 Frank 30 NY 57
# 2 Charlie 18 CA 70
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
# 0 Alice 24 NY 64
# 4 Ellen 24 CA 88
print(df.sample(frac=1, random_state=0))
# name age state point
# 5 Frank 30 NY 57
# 2 Charlie 18 CA 70
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
# 0 Alice 24 NY 64
# 4 Ellen 24 CA 88
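Instead of comparing the printed output by eye, the reproducibility can also be checked in code. This is a small sketch that rebuilds the DataFrame inline (an assumption, in place of reading the CSV) and compares two shuffles that use the same seed.

```python
import pandas as pd

# Same data as the CSV shown above, built inline so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Two independent calls with the same seed.
first = df.sample(frac=1, random_state=0)
second = df.sample(frac=1, random_state=0)

# Both results have the same rows in the same order.
print(first.equals(second))
# True
print(first.index.tolist() == second.index.tolist())
# True
```

A fixed seed is useful when a result has to be repeatable, for example in tests. Omit random_state when a fresh order is wanted on every run.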
Reset index: ignore_index, reset_index()
If you want to reindex the result (0, 1, ..., n-1), set the ignore_index parameter to True.
print(df.sample(frac=1, ignore_index=True))
# name age state point
# 0 Ellen 24 CA 88
# 1 Frank 30 NY 57
# 2 Bob 42 CA 92
# 3 Dave 68 TX 70
# 4 Alice 24 NY 64
# 5 Charlie 18 CA 70
The ignore_index parameter was added in pandas 1.3.0. For earlier versions, you can use the reset_index() method. Set the drop parameter to True to discard the original index instead of keeping it as a column.
print(df.sample(frac=1).reset_index(drop=True))
# name age state point
# 0 Bob 42 CA 92
# 1 Dave 68 TX 70
# 2 Alice 24 NY 64
# 3 Charlie 18 CA 70
# 4 Frank 30 NY 57
# 5 Ellen 24 CA 88
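The two approaches can be compared directly. The sketch below assumes pandas 1.3.0 or later (so that ignore_index is available) and rebuilds the DataFrame inline instead of reading the CSV. A fixed random_state is used only so that both calls shuffle in the same order.

```python
import pandas as pd

# Same data as the CSV shown above, built inline so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})

# Same seed, so both calls produce the same row order.
a = df.sample(frac=1, random_state=0, ignore_index=True)
b = df.sample(frac=1, random_state=0).reset_index(drop=True)

# Both results carry a fresh 0..n-1 index and identical contents.
print(a.index.tolist())
# [0, 1, 2, 3, 4, 5]
print(a.equals(b))
# True

# Without drop=True, the old index is kept as a column named 'index'.
c = df.sample(frac=1, random_state=0).reset_index()
print(c.columns.tolist())
# ['index', 'name', 'age', 'state', 'point']
```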
Update original object
If you want to update the original object, assign the shuffled result back to the original variable to overwrite it.
df = df.sample(frac=1)
print(df)
# name age state point
# 0 Alice 24 NY 64
# 5 Frank 30 NY 57
# 1 Bob 42 CA 92
# 4 Ellen 24 CA 88
# 3 Dave 68 TX 70
# 2 Charlie 18 CA 70
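sample() returns a new object and leaves its input unchanged, which is why the reassignment is needed. The sketch below shows this and combines the reassignment with reset_index(drop=True), so that the variable ends up holding shuffled rows with a clean 0..n-1 index. The DataFrame is rebuilt inline as an assumption, and random_state=0 is only there to make the run repeatable.

```python
import pandas as pd

# Same data as the CSV shown above, built inline so the snippet is self-contained.
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, 42, 18, 68, 24, 30],
    'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
    'point': [64, 92, 70, 70, 88, 57],
})
original_names = df['name'].tolist()

# Calling sample() without assigning the result does not modify df.
df.sample(frac=1, random_state=0)
print(df['name'].tolist() == original_names)
# True

# Reassign to overwrite the variable, resetting the index at the same time.
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(df.index.tolist())
# [0, 1, 2, 3, 4, 5]

# The same names are still present after shuffling.
print(sorted(df['name'].tolist()) == sorted(original_names))
# True
```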