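Method 1: Map
Build a lookup table that assigns each size its rank, add that rank as a helper column with map(), and sort on the helper column.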
import pandas as pd
df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})
| cloth_id | size |
0 | 1001 | S |
1 | 1002 | XL |
2 | 1003 | M |
3 | 1004 | XS |
4 | 1005 | L |
5 | 1006 | S |
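# Reference table listing the sizes in the desired order
df_mapping = pd.DataFrame({
    'size': ['XS', 'S', 'M', 'L', 'XL'],
})
# Turn the positional index into a column, then key the table by size
sort_mapping = df_mapping.reset_index().set_index('size')
sort_mapping
# Look up the rank of each size and sort on it
df['size_num'] = df['size'].map(sort_mapping['index'])
df.sort_values('size_num')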
| cloth_id | size | size_num |
3 | 1004 | XS | 0 |
0 | 1001 | S | 1 |
5 | 1006 | S | 1 |
2 | 1003 | M | 2 |
4 | 1005 | L | 3 |
1 | 1002 | XL | 4 |
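A possible variation on Method 1, sketched under the assumption that pandas 1.1 or later is available: the key argument of sort_values() can apply the same mapping at sort time, so no helper column is left in the result. The name size_rank is a hypothetical helper introduced for this sketch, and kind='mergesort' is chosen here only so that rows with equal sizes keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({
    'cloth_id': [1001, 1002, 1003, 1004, 1005, 1006],
    'size': ['S', 'XL', 'M', 'XS', 'L', 'S'],
})

# Hypothetical helper: size label -> rank in the desired order
size_rank = {label: rank for rank, label in enumerate(['XS', 'S', 'M', 'L', 'XL'])}

# The key function receives the 'size' Series and returns the ranks to sort by
sorted_df = df.sort_values(
    'size',
    key=lambda s: s.map(size_rank),
    kind='mergesort',  # stable sort, so ties keep their original order
)
print(sorted_df)
```

One caveat with any map()-based approach: a label missing from the mapping becomes NaN, and sort_values() places NaN rows last by default.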
Method 2: CategoricalDtype
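Sorting by a single variable
Import CategoricalDtype, then create a custom categorical type, cat_size_order:
- The first argument, ['XS', 'S', 'M', 'L', 'XL'], lists the unique size values in the desired order.
- The second argument, ordered=True, tells pandas to treat the variable as ordered.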
from pandas.api.types import CategoricalDtype
cat_size_order = CategoricalDtype(
    ['XS', 'S', 'M', 'L', 'XL'],
    ordered=True
)
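Then call astype(cat_size_order) to cast the size data to the custom categorical type. Running df['size'] shows that the size column has been converted to a categorical type with the order [XS < S < M < L < XL].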
df['size'] = df['size'].astype(cat_size_order)
df['size']
0 S
1 XL
2 M
3 XS
4 L
5 S
Name: size, dtype: category
Categories (5, object): [XS < S < M < L < XL]
df.sort_values('size')
| cloth_id | size | size_num |
3 | 1004 | XS | 0 |
0 | 1001 | S | 1 |
5 | 1006 | S | 1 |
2 | 1003 | M | 2 |
4 | 1005 | L | 3 |
1 | 1002 | XL | 4 |
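Because the categorical type is ordered, the same ordering is also available for comparisons and for min()/max(), not only for sort_values(). The following is a minimal sketch on a standalone Series; the variable names sizes, larger_than_s, smallest, and largest are introduced only for this illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)
sizes = pd.Series(['S', 'XL', 'M', 'XS', 'L', 'S']).astype(cat_size_order)

# Comparisons follow the category order, not alphabetical order
larger_than_s = sizes[sizes > 'S']

# min() and max() are defined for ordered categoricals
smallest, largest = sizes.min(), sizes.max()

print(larger_than_s.tolist())
print(smallest, largest)
```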
df['codes'] = df['size'].cat.codes
df
| cloth_id | size | size_num | codes |
0 | 1001 | S | 1 | 1 |
1 | 1002 | XL | 4 | 4 |
2 | 1003 | M | 2 | 2 |
3 | 1004 | XS | 0 | 0 |
4 | 1005 | L | 3 | 3 |
5 | 1006 | S | 1 | 1 |
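We can see that the codes for XS, S, M, L, and XL are 0, 1, 2, 3, and 4, respectively. The codes are the integer values actually stored for each category. Running df.info() shows that the codes column is in fact int8.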
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cloth_id 6 non-null int64
1 size 6 non-null category
2 size_num 6 non-null int64
3 codes 6 non-null int8
dtypes: category(1), int64(2), int8(1)
memory usage: 436.0 bytes
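A related detail worth checking on your own data: as far as the pandas documentation describes it, a value that is not listed in the categories is converted to NaN when cast with astype(), and its code is -1. The sketch below uses a made-up label, 'XXL', purely to illustrate this; the variable names are assumptions for the example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cat_size_order = CategoricalDtype(['XS', 'S', 'M', 'L', 'XL'], ordered=True)

# 'XXL' is a hypothetical label that is NOT among the declared categories
sizes = pd.Series(['S', 'XXL', 'M']).astype(cat_size_order)

missing_mask = sizes.isna().tolist()  # which entries became NaN
codes = sizes.cat.codes.tolist()      # -1 marks a missing category

print(missing_mask)
print(codes)
```

Checking isna() after the cast is a cheap way to catch typos or unexpected labels before sorting.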
Sorting by multiple variables
df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    'customer_id': [10, 12, 12, 12, 10, 10, 10],
    'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})
| order_id | customer_id | month | day_of_week |
0 | 1001 | 10 | Feb | Mon |
1 | 1002 | 12 | Jan | Wed |
2 | 1003 | 12 | Jan | Sun |
3 | 1004 | 12 | Feb | Tue |
4 | 1005 | 10 | Feb | Sat |
5 | 1006 | 10 | Jan | Mon |
6 | 1007 | 10 | Feb | Thu |
Similarly, let's create two custom categorical types, cat_day_of_week and cat_month, and pass them to astype().
cat_day_of_week = CategoricalDtype(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
    ordered=True
)
cat_month = CategoricalDtype(
    ['Jan', 'Feb', 'Mar', 'Apr'],
    ordered=True,
)
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)
To sort by multiple variables, we only need to pass a list of column names to sort_values(). For example, sort by customer_id, then month, then day_of_week:
df.sort_values(['customer_id', 'month', 'day_of_week'])
| order_id | customer_id | month | day_of_week |
5 | 1006 | 10 | Jan | Mon |
0 | 1001 | 10 | Feb | Mon |
6 | 1007 | 10 | Feb | Thu |
4 | 1005 | 10 | Feb | Sat |
1 | 1002 | 12 | Jan | Wed |
2 | 1003 | 12 | Jan | Sun |
3 | 1004 | 12 | Feb | Tue |
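The sort direction can also differ per column. As a minimal sketch, assuming we want the most recent month first within each customer while keeping the days in calendar order, a list can be passed to the ascending parameter of sort_values(); the name sorted_orders is introduced only for this example.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({
    'order_id': [1001, 1002, 1003, 1004, 1005, 1006, 1007],
    'customer_id': [10, 12, 12, 12, 10, 10, 10],
    'month': ['Feb', 'Jan', 'Jan', 'Feb', 'Feb', 'Jan', 'Feb'],
    'day_of_week': ['Mon', 'Wed', 'Sun', 'Tue', 'Sat', 'Mon', 'Thu'],
})

cat_day_of_week = CategoricalDtype(
    ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True
)
cat_month = CategoricalDtype(['Jan', 'Feb', 'Mar', 'Apr'], ordered=True)
df['day_of_week'] = df['day_of_week'].astype(cat_day_of_week)
df['month'] = df['month'].astype(cat_month)

# One ascending flag per sort column: customer up, month down, day up
sorted_orders = df.sort_values(
    ['customer_id', 'month', 'day_of_week'],
    ascending=[True, False, True],
)
print(sorted_orders)
```

The descending flag on month reverses the custom category order (Feb before Jan), not the alphabetical order.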
Reference: 如何对Pandas DataFrame进行自定义排序 (How to custom-sort a Pandas DataFrame)