【Python】kaggle官方24道pandas练习题(附完整答案)！

最新推荐文章于 2024-07-28 15:01:02 发布

风度78

最新推荐文章于 2024-07-28 15:01:02 发布

阅读量1.1k

点赞数 1

文章标签： python pandas numpy 数据分析开发语言

原文链接：https://mp.weixin.qq.com/s?__biz=MzIwODI2NDkxNQ==&mid=2247509798&idx=3&sn=f2bb53847ea7f4ac34c36675bc33d71e&chksm=97072c9aa070a58cad4cf1fe69a216f4e0b0e5bfc533e71415b3b48796b65cb993168402dffa&scene=126&sessionid=0

版权

公众号：尤而小屋
作者：Peter
编辑：Peter

本文是一篇基于Kaggle上面对24道pandas练习题的讲解，原项目学习地址：

https://www.kaggle.com/code/icarofreire/pandas-24-useful-exercises-with-solutions/notebook

import pandas as pd
import numpy as np

# 允许多个print输出在一个单元格中
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

1. How to create a series from a list, numpy array and dict?

题目：如何基于列表、numpy数组或者字典创建Series型数据。

先创建3种类型的数据：

a_list = list("abcdefg")
numpy_array = np.arange(1, 10)
dictionary = {"A":  0, "B":1, "C":2, "D":3, "E":5}

a_list

['a', 'b', 'c', 'd', 'e', 'f', 'g']

numpy_array

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

dictionary

{'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 5}

基于pd.Series直接生成：

s1 = pd.Series(a_list)
s1

0    a
1    b
2    c
3    d
4    e
5    f
6    g
dtype: object

s2 = pd.Series(numpy_array)
s2

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int32

s3 = pd.Series(dictionary)
s3

A    0
B    1
C    2
D    3
E    5
dtype: int64

在字典的创建中，字典的键当做行索引，字典的值当做Series数据的值

2. How to combine many series to form a dataframe?

如何连接多个Series数据形成一个DataFrame数据

ser1 = pd.Series(list('abcedfghijklmnopqrstuvwxyz'))

ser2 = pd.Series(np.arange(26))

方式1：基于pd.DataFrame直接创建

# 方式1

ser_df = pd.DataFrame(ser1, ser2).reset_index()
ser_df.head()

	index	0
0	0	a
1	1	b
2	2	c
3	3	e
4	4	d

方式2：基于字典形式，指定每个Series数据的列名

# 方式2

ser_df = pd.DataFrame({"col1":ser1, "col2":ser2})
ser_df.head(5)

	col1	col2
0	a	0
1	b	1
2	c	2
3	e	3
4	d	4

方式3：基于concat函数创建

# 方式3

ser_df = pd.concat([ser1, ser2], axis = 1)
ser_df.head()

	0	1
0	a	0
1	b	1
2	c	2
3	e	3
4	d	4

3. How to get the items of series A not present in series B?

如何找出存在但是不存在B中的数据；AB都是Series数据

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

通过成员判断函数：isin

ser1[~ser1.isin(ser2)]

0    1
1    2
2    3
dtype: int64

4. How to get the items not common to both series A and series B?

如何找到不同时存在AB中的数据

# 模拟数据

ser1 = pd.Series([1, 2, 3, 4, 5])
ser2 = pd.Series([4, 5, 6, 7, 8])

数据4和5同时在AB中，我们希望找到的数据就是1,2,3,6,7,8

方式1：基于pandas的成员判断函数isin

# 找出AB中各自的独有部分
a_not_b = ser1[~ser1.isin(ser2)]  # 结果为123
b_not_a = ser2[~ser2.isin(ser1)]  # 结果为678

a_not_b

0    1
1    2
2    3
dtype: int64

b_not_a

2    6
3    7
4    8
dtype: int64

再将两个结果合并：

pd.concat([a_not_b,b_not_a],ignore_index=True)

0    1
1    2
2    3
3    6
4    7
5    8
dtype: int64

也可以使用append函数：

a_not_b.append(b_not_a, ignore_index = True)

0    1
1    2
2    3
3    6
4    7
5    8
dtype: int64

方式2：基于numpy的交集和并集函数，配合pandas的成员判断函数

ser_u = pd.Series(np.union1d(ser1, ser2))  # 求并集
ser_i = pd.Series(np.intersect1d(ser1, ser2))  # 求交集

ser_u  # 并集

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
dtype: int64

ser_i  # 交集

0    4
1    5
dtype: int64

从并集的数据中删除掉交集的4和5即可：

ser_u[~ser_u.isin(ser_i)]  # 成员判断

0    1
1    2
2    3
5    6
6    7
7    8
dtype: int64

5. How to get useful infos?

主要是获取数据的最值，中位数，四分位数等统计信息。

state = np.random.RandomState(100)
ser = pd.Series(state.normal(10, 5, 25))
ser

0      1.251173
1     11.713402
2     15.765179
3      8.737820
4     14.906604
5     12.571094
6     11.105898
7      4.649783
8      9.052521
9     11.275007
10     7.709865
11    12.175817
12     7.082025
13    14.084235
14    13.363604
15     9.477944
16     7.343598
17    15.148663
18     7.809322
19     4.408409
20    18.094908
21    17.708026
22     8.740604
23     5.787821
24    10.922593
dtype: float64

# 使用describe

ser.describe()

count    25.000000
mean     10.435437
std       4.253118
min       1.251173
25%       7.709865
50%      10.922593
75%      13.363604
max      18.094908
dtype: float64

6. How to get frequency counts of unique items of a series?

如何获取Series数据中每个唯一值的频次

ser = pd.Series(np.take(list('abcdefgh'), np.random.randint(8, size=30)))
ser

0     d
1     h
2     f
3     g
4     f
5     f
6     d
7     e
8     a
9     a
10    a
11    b
12    f
13    h
14    e
15    e
16    e
17    e
18    d
19    h
20    h
21    c
22    f
23    h
24    f
25    b
26    d
27    c
28    b
29    c
dtype: object

通过value_counts函数来获取：

ser.value_counts()

f    6
h    5
e    5
d    4
b    3
c    3
a    3
g    1
dtype: int64

7. How to convert a numpy array to a dataframe of given shape? (L1)

如何将numpy数值转成指定shape的DataFrame数据

# 1到35间有放回的选择35个数据

ser = pd.Series(np.random.randint(1, 10, 35))  
ser

0     4
1     9
2     4
3     8
4     5
5     4
6     8
7     5
8     7
9     5
10    8
11    3
12    1
13    7
14    7
15    1
16    3
17    2
18    8
19    7
20    6
21    3
22    7
23    4
24    8
25    4
26    9
27    9
28    4
29    2
30    3
31    7
32    5
33    6
34    1
dtype: int32

pd.DataFrame(np.array(ser).reshape(7, 5))

	0	1	2	3	4
0	4	9	4	8	5
1	4	8	5	7	5
2	8	3	1	7	7
3	1	3	2	8	7
4	6	3	7	4	8
5	4	9	9	4	2
6	3	7	5	6	1

pd.DataFrame(ser.values.reshape(7, 5))

	0	1	2	3	4
0	4	9	4	8	5
1	4	8	5	7	5
2	8	3	1	7	7
3	1	3	2	8	7
4	6	3	7	4	8
5	4	9	9	4	2
6	3	7	5	6	1

8. How to find the positions of numbers that are multiples of 3 from a series?

如何从Series数据中找到3的倍数的数据所在位置？

# 模拟数据

np.random.RandomState(100)
ser = pd.Series(np.random.randint(1, 5, 10))
ser

RandomState(MT19937) at 0x2AEECC26440






0    3
1    1
2    2
3    1
4    4
5    4
6    3
7    1
8    2
9    1
dtype: int32

每次随机运行的结果可能不同。

使用where函数：不同被3整除的数据用NaN表示

ser.where(lambda x: x%3 == 0)

0    3.0
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    3.0
7    NaN
8    NaN
9    NaN
dtype: float64

函数dropna()删除空值：

ser.where(lambda x: x%3 == 0).dropna()

0    3.0
6    3.0
dtype: float64

9. How to extract items at given positions from a series

如何从指定位置提取Series中的数据

ser = pd.Series(list('abcdefghijklmnopqrstuvwxyz'))
pos = [0, 4, 8, 14, 20]  # 指定位置

使用loc函数：

ser.loc[pos]

0     a
4     e
8     i
14    o
20    u
dtype: object

使用take函数：

ser.take(pos)

0     a
4     e
8     i
14    o
20    u
dtype: object

10. How to stack two series vertically and horizontally ?

水平或者垂直方向上合并两个Series数据

ser1 = pd.Series(range(5))
ser2 = pd.Series(list('abcde'))

ser1

0    0
1    1
2    2
3    3
4    4
dtype: int64

ser2

0    a
1    b
2    c
3    d
4    e
dtype: object

垂直方向上：

ser1.append(ser2)

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object

pd.concat([ser1, ser2], axis = 0)

0    0
1    1
2    2
3    3
4    4
0    a
1    b
2    c
3    d
4    e
dtype: object

水平方向上：

pd.concat([ser1, ser2], axis = 1)

	0	1
0	0	a
1	1	b
2	2	c
3	3	d
4	4	e

个人补充：对比numpy几个与合并堆叠相关的函数

np.stack([ser1,ser2])

array([[0, 1, 2, 3, 4],
       ['a', 'b', 'c', 'd', 'e']], dtype=object)

np.vstack([ser1,ser2])

array([[0, 1, 2, 3, 4],
       ['a', 'b', 'c', 'd', 'e']], dtype=object)

np.hstack([ser1,ser2])

array([0, 1, 2, 3, 4, 'a', 'b', 'c', 'd', 'e'], dtype=object)

np.dstack([ser1,ser2])

array([[[0, 'a'],
        [1, 'b'],
        [2, 'c'],
        [3, 'd'],
        [4, 'e']]], dtype=object)

np.concatenate([ser1,ser2],axis=0)

array([0, 1, 2, 3, 4, 'a', 'b', 'c', 'd', 'e'], dtype=object)

np.append(ser1,ser2)

array([0, 1, 2, 3, 4, 'a', 'b', 'c', 'd', 'e'], dtype=object)

11. How to get the positions of items of series A in another series B?

如何获取存在B中的元素同时在A中存在的位置？

ser1 = pd.Series([10, 9, 6, 5, 3, 1, 12, 8, 13])
ser2 = pd.Series([1, 3, 10, 13])

获取两个Series中相同元素在ser1中的位置index：

list(ser1[ser1.isin(ser2)].index)

[0, 4, 5, 8]

基于np.where函数：

[np.where(i == ser1)[0].tolist()[0] for i in ser2]

[5, 4, 0, 8]

基于pandas中的get_loc函数：

[pd.Index(ser1).get_loc(i) for i in ser2]

[5, 4, 0, 8]

pd.Index(ser1)  # 创建Int64Index类型的索引数据

Int64Index([10, 9, 6, 5, 3, 1, 12, 8, 13], dtype='int64')

## 12. How to compute difference of differences between consequtive numbers of a series?

计算Series型数组的一阶差分和二阶差分。

ser = pd.Series([1, 3, 6, 10, 15, 21, 27, 35])

ser.diff(periods = 1).tolist()

[nan, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 8.0]

在一阶差分的基础继续做差分：

ser.diff(periods = 1).diff(periods = 1).tolist()

[nan, nan, 1.0, 1.0, 1.0, 1.0, 0.0, 2.0]

13. How to convert a series of date-strings to a timeseries?

如何将字符类型的伪data类型数据转成timeseries数据

ser = pd.Series(['01 Jan 2020', '02-02-2021', '20220303', 
                 '2023/04/04', '2020-05-05', '2022-06-06T12:20'])
ser.dtype

dtype('O')

# 1、直接使用pandas.to_datetime

pd.to_datetime(ser)

0   2020-01-01 00:00:00
1   2021-02-02 00:00:00
2   2022-03-03 00:00:00
3   2023-04-04 00:00:00
4   2020-05-05 00:00:00
5   2022-06-06 12:20:00
dtype: datetime64[ns]

# 使用python的dateutil.parser解析包

from dateutil.parser import parse
ser.map(lambda x: parse(x))

0   2020-01-01 00:00:00
1   2021-02-02 00:00:00
2   2022-03-03 00:00:00
3   2023-04-04 00:00:00
4   2020-05-05 00:00:00
5   2022-06-06 12:20:00
dtype: datetime64[ns]

14. How to filter words that contain atleast 2 vowels from a series?

如何从Series数据中找到至少包含两个元音字母的单词？元音字母指的是aeiou

方式1：使用循环

ser = pd.Series(['Apple', 'Orange', 'Plan', 'Python', 'Money'])

vowels = list("aeiou")

list_ = []

for i in ser:
    c = 0  # 计数器
    for l in list(i.lower()):  # 将遍历的数据全部转成小写
        if l in vowels:  #  如果数据在元音列表中
            c += 1  # 计数器加1
    if c >= 2:  # 循环完后计数器大于1
        list_.append(i)  # 将满足要求的数据添加到列表中
list_

['Apple', 'Orange', 'Money']

ser[ser.isin(list_)]

0     Apple
1    Orange
4     Money
dtype: object

方式2：使用Counter类

from collections import Counter

mask = ser.map(lambda x: sum([Counter(x.lower()).get(i,0) for i in list("aeiou")]) >= 2)
ser[mask]

0     Apple
1    Orange
4     Money
dtype: object

[Counter("Apple".lower()).get(i,0) for i in list("aeiou")]

[1, 1, 0, 0, 0]

[Counter("Python".lower()).get(i,0) for i in list("aeiou")]

[0, 0, 0, 1, 0]

15. How to replace missing spaces in a string with the least frequent character?

如何使用最低频的字符替换字符串中的空格

my_str = 'dbc deb abed ggade'

上面的字符串中最低频的字符是c，所以两个空格用c填充，期待的结果是：dbccdebcabedcggade

方式1：基于pandas统计字符串中每个字符的出现次数，用出现次数最少的替换空格即可

先把原字符串my_str中的空格去掉

ser = pd.Series(list(my_str.replace(" ", "")))
ser

0     d
1     b
2     c
3     d
4     e
5     b
6     a
7     b
8     e
9     d
10    g
11    g
12    a
13    d
14    e
dtype: object

统计每个唯一值的次数，找出最低频的数组（倒数第一位）：

ser.value_counts()

d    4
b    3
e    3
a    2
g    2
c    1
dtype: int64

mini = ser.value_counts().index[-1]
mini

'c'

用mini替换空格：

my_str.replace(" ", mini)

'dbccdebcabedcggade'

方式2：使用Counter类

from collections import Counter

my_str_ = my_str  # 副本

去掉空格，然后统计每个元素的出现次数：

Counter_ = Counter(list(my_str_.replace(" ", "")))
Counter_

Counter({'d': 4, 'b': 3, 'c': 1, 'e': 3, 'a': 2, 'g': 2})

找出次数的最小数据：

mini = min(Counter_, key = Counter_.get)
mini

'c'

my_str.replace(" ", mini)

'dbccdebcabedcggade'

16. How to change column values when importing csv to a dataframe?

利用pandas读取csv文件如何修改列的数据

df = pd.read_csv(
    "housing_preprocessed.csv",  
    index_col = 0, 
    skipfooter=1,  
    converters = {"MEDV": lambda x: "HIGH" if float(x) >= 25 else "LOW"}  # 关键代码
    )

在【关键代码】这行，当MEDV字段中的大于25则表示为HIGH，否则表示为LOW。

17. How to import only specified columns from a csv file?

从csv文件中导入指定的列名；可以通过索引号或者直接指定列名的方式。

df = pd.read_csv(file, usecols = [1, 2, 4], skipfooter=1)  # 索引号
df = pd.read_csv(file, usecols = ["CRIM", "ZN", "CHAS"])  # 列名

18. How to check if a dataframe has any missing values?

如何检查DataFrame中是否有缺失值

df.isnull()  # 查看每个位置是否为空值；如果是用True，否则是False

df.isnull().sum()  # 判断每列有多少个空值

df.isnull().values.any()  # 判断数据中是否至少存在一个空值

df.isnull().values.any(axis=0)  # 显示每列是否至少有一个空值

df.isnull().values.any(axis=1)  # 显示每行是否至少存在一个空值

19. How to replace missing values of multiple numeric columns with the mean?

数值型字段中的缺失值如何用均值填充。

个人补充：下面模拟一份简单的数据进行说明

df = pd.DataFrame({"col":[1,2,3,None]})
df

	col
0	1.0
1	2.0
2	3.0
3	NaN

# 均值

(1 + 2 + 3) / 3

2.0

df[["col"]] = df[["col"]].apply(lambda x: x.fillna(x.mean()))

df

	col
0	1.0
1	2.0
2	3.0
3	2.0

20. How to change the order of columns of a dataframe?

如何改变DataFrame中列名的顺序

# 模拟数据

df = pd.DataFrame(np.arange(20).reshape(-1, 5), 
                  columns=list('abcde'))
df

	a	b	c	d	e
0	0	1	2	3	4
1	5	6	7	8	9
2	10	11	12	13	14
3	15	16	17	18	19

# 方式1：直接认为指定顺序

df[["c", "b", "a", "d", "e"]]

	c	b	a	d	e
0	2	1	0	3	4
1	7	6	5	8	9
2	12	11	10	13	14
3	17	16	15	18	19

# 方式2：交换两个列的位置
def change_cols(df, col1, col2):
    df_columns = df.columns.to_list()
    index1 = df_columns.index(col1)
    index2 = df_columns.index(col2)
    
    df_columns[index1], df_columns[index2] = col2, col1
    
    return df[df_columns]

# 交换be两列的数据的位置
df = change_cols(df, "b", "e")
df

	a	e	c	d	b
0	0	4	2	3	1
1	5	9	7	8	6
2	10	14	12	13	11
3	15	19	17	18	16

# 方式3：翻转列名

df_columns = df.columns
df_columns

Index(['a', 'e', 'c', 'd', 'b'], dtype='object')

df_columns_reversed = df_columns[::-1]  # 翻转
df_columns_reversed

Index(['b', 'd', 'c', 'e', 'a'], dtype='object')

df[df_columns_reversed]  # 翻转后的数据

	b	d	c	e	a
0	1	3	2	4	0
1	6	8	7	9	5
2	11	13	12	14	10
3	16	18	17	19	15

21. How to filter every nth row in a dataframe?

从DataFrame数据中间隔n行取出数据。

比如我们读取本地的iris数据集：

df = pd.read_csv("iris.csv",usecols=['sepal_length', 'sepal_width', 'petal_length'])
df

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
1	4.9	3.0	1.4
2	4.7	3.2	1.3
3	4.6	3.1	1.5
4	5.0	3.6	1.4
...	...	...	...
145	6.7	3.0	5.2
146	6.3	2.5	5.0
147	6.5	3.0	5.2
148	6.2	3.4	5.4
149	5.9	3.0	5.1

150 rows × 3 columns

df.columns

Index(['sepal_length', 'sepal_width', 'petal_length'], dtype='object')

比如我们想间隔20行取出数据：

df[::20]

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

# 等价
df.iloc[::20,:]  # 行方向是间隔20行，全部列

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

22. How to get the last n rows of a dataframe with row sum > 100?

如何确定最后n行；前提：该行的和大于100

df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
df1 = df.copy(deep = True)  # 副本

df

	0	1	2	3
0	20	10	24	25
1	22	30	24	10
2	32	21	37	21
3	28	21	14	29
4	33	19	36	26
5	39	18	28	25
6	16	35	14	17
7	29	13	21	30
8	22	26	28	29
9	37	28	16	28
10	24	33	37	33
11	30	30	29	39
12	13	19	21	28
13	36	38	34	30
14	23	10	22	21

方式1：基于pandas按照行求和

df["sum"] = df.sum(axis=1)
df

	0	1	2	3	sum
0	20	10	24	25	79
1	22	30	24	10	86
2	32	21	37	21	111
3	28	21	14	29	92
4	33	19	36	26	114
5	39	18	28	25	110
6	16	35	14	17	82
7	29	13	21	30	93
8	22	26	28	29	105
9	37	28	16	28	109
10	24	33	37	33	127
11	30	30	29	39	128
12	13	19	21	28	81
13	36	38	34	30	138
14	23	10	22	21	76

(df[df["sum"] > 100].index).to_list()[-2:]  # 最后两行

[11, 13]

print("The index of the rows that are greater than 100 are {}".format((df[df["sum"] > 100].index).to_list()[-2:]))

The index of the rows that are greater than 100 are [11, 13]

df.iloc[(df[df["sum"] > 100].index).to_list()[-2:]]  # 最后两行

	0	1	2	3	sum
11	30	30	29	39	128
13	36	38	34	30	138

方式2：基于numpy的求和

rowsums = df1.apply(np.sum, axis=1)  # 安装行求和
rowsums

0      79
1      86
2     111
3      92
4     114
5     110
6      82
7      93
8     105
9     109
10    127
11    128
12     81
13    138
14     76
dtype: int64

每行的和大于100的索引号：

np.where(rowsums > 100)

(array([ 2,  4,  5,  8,  9, 10, 11, 13], dtype=int64),)

np.where(rowsums > 100)[0]

array([ 2,  4,  5,  8,  9, 10, 11, 13], dtype=int64)

np.where(rowsums > 100)[0][-2:]

array([11, 13], dtype=int64)

df1.iloc[np.where(rowsums > 100)[0][-2:], :]

	0	1	2	3
11	30	30	29	39
13	36	38	34	30

23. How to find and cap outliers from a series or dataframe column?

如何定位和捕捉到Series或者DataFrame数据中离群点。

# 模拟数据
ser = pd.Series(np.logspace(-2, 2, 30))
ser1 = ser.copy(deep = True)  # 副本，深拷贝
ser2 = ser.copy(deep = True) 

ser

0       0.010000
1       0.013738
2       0.018874
3       0.025929
4       0.035622
5       0.048939
6       0.067234
7       0.092367
8       0.126896
9       0.174333
10      0.239503
11      0.329034
12      0.452035
13      0.621017
14      0.853168
15      1.172102
16      1.610262
17      2.212216
18      3.039195
19      4.175319
20      5.736153
21      7.880463
22     10.826367
23     14.873521
24     20.433597
25     28.072162
26     38.566204
27     52.983169
28     72.789538
29    100.000000
dtype: float64

指定数据的分位数：

quantiles = np.quantile(ser, [0.05, 0.95])
quantiles

array([1.60492941e-02, 6.38766722e+01])

阈值划分：

ser1[ser1 < quantiles[0]] = quantiles[0]  
ser1[ser1 > quantiles[1]] = quantiles[1]

# 等价
# ser.iloc[np.where(ser < quantiles[0])] = quantiles[0]
# ser.iloc[np.where(ser > quantiles[1])] = quantiles[1]

ser1

0      0.016049
1      0.016049
2      0.018874
3      0.025929
4      0.035622
5      0.048939
6      0.067234
7      0.092367
8      0.126896
9      0.174333
10     0.239503
11     0.329034
12     0.452035
13     0.621017
14     0.853168
15     1.172102
16     1.610262
17     2.212216
18     3.039195
19     4.175319
20     5.736153
21     7.880463
22    10.826367
23    14.873521
24    20.433597
25    28.072162
26    38.566204
27    52.983169
28    63.876672
29    63.876672
dtype: float64

将上面的过程封装成函数：

def cap_outliers(ser, low_perc, high_perc):
    low, high = ser.quantile([low_perc, high_perc]) # 指定分位数
    
    ser[ser < low] = low  # 小于low部分全部赋值为low
    ser[ser > high] = high  # 大于hight部分全部赋值为high
    return ser

capped_ser = cap_outliers(ser2, .05, .95)
capped_ser

0      0.016049
1      0.016049
2      0.018874
3      0.025929
4      0.035622
5      0.048939
6      0.067234
7      0.092367
8      0.126896
9      0.174333
10     0.239503
11     0.329034
12     0.452035
13     0.621017
14     0.853168
15     1.172102
16     1.610262
17     2.212216
18     3.039195
19     4.175319
20     5.736153
21     7.880463
22    10.826367
23    14.873521
24    20.433597
25    28.072162
26    38.566204
27    52.983169
28    63.876672
29    63.876672
dtype: float64

24. How to reverse the rows of a dataframe?

如何翻转DataFrame的行数据

df = pd.DataFrame(np.arange(15).reshape(3, -1))
df

	0	1	2	3	4
0	0	1	2	3	4
1	5	6	7	8	9
2	10	11	12	13	14

方式1：翻转行索引的列表

df.index.to_list()[::-1]  # 行索引翻转

[2, 1, 0]

df.iloc[df.index.to_list()[::-1]]

	0	1	2	3	4
2	10	11	12	13	14
1	5	6	7	8	9
0	0	1	2	3	4

方式2：基于loc或者iloc函数

df.iloc[::-1, :]

	0	1	2	3	4
2	10	11	12	13	14
1	5	6	7	8	9
0	0	1	2	3	4

df.loc[df.index[::-1], :]

	0	1	2	3	4
2	10	11	12	13	14
1	5	6	7	8	9
0	0	1	2	3	4

往期精彩回顾




适合初学者入门人工智能的路线及资料下载(图文+视频)机器学习入门系列下载机器学习及深度学习笔记等资料打印《统计学习方法》的代码复现专辑机器学习微信群请扫码

风度78

关注

1
点赞
踩
11

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
1	4.9	3.0	1.4
2	4.7	3.2	1.3
3	4.6	3.1	1.5
4	5.0	3.6	1.4
...	...	...	...
145	6.7	3.0	5.2
146	6.3	2.5	5.0
147	6.5	3.0	5.2
148	6.2	3.4	5.4
149	5.9	3.0	5.1

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

	0	1	2	3
0	20	10	24	25
1	22	30	24	10
2	32	21	37	21
3	28	21	14	29
4	33	19	36	26
5	39	18	28	25
6	16	35	14	17
7	29	13	21	30
8	22	26	28	29
9	37	28	16	28
10	24	33	37	33
11	30	30	29	39
12	13	19	21	28
13	36	38	34	30
14	23	10	22	21

	0	1	2	3	sum
0	20	10	24	25	79
1	22	30	24	10	86
2	32	21	37	21	111
3	28	21	14	29	92
4	33	19	36	26	114
5	39	18	28	25	110
6	16	35	14	17	82
7	29	13	21	30	93
8	22	26	28	29	105
9	37	28	16	28	109
10	24	33	37	33	127
11	30	30	29	39	128
12	13	19	21	28	81
13	36	38	34	30	138
14	23	10	22	21	76

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
1	4.9	3.0	1.4
2	4.7	3.2	1.3
3	4.6	3.1	1.5
4	5.0	3.6	1.4
...	...	...	...
145	6.7	3.0	5.2
146	6.3	2.5	5.0
147	6.5	3.0	5.2
148	6.2	3.4	5.4
149	5.9	3.0	5.1

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

	0	1	2	3
0	20	10	24	25
1	22	30	24	10
2	32	21	37	21
3	28	21	14	29
4	33	19	36	26
5	39	18	28	25
6	16	35	14	17
7	29	13	21	30
8	22	26	28	29
9	37	28	16	28
10	24	33	37	33
11	30	30	29	39
12	13	19	21	28
13	36	38	34	30
14	23	10	22	21

	0	1	2	3	sum
0	20	10	24	25	79
1	22	30	24	10	86
2	32	21	37	21	111
3	28	21	14	29	92
4	33	19	36	26	114
5	39	18	28	25	110
6	16	35	14	17	82
7	29	13	21	30	93
8	22	26	28	29	105
9	37	28	16	28	109
10	24	33	37	33	127
11	30	30	29	39	128
12	13	19	21	28	81
13	36	38	34	30	138
14	23	10	22	21	76

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
1	4.9	3.0	1.4
2	4.7	3.2	1.3
3	4.6	3.1	1.5
4	5.0	3.6	1.4
...	...	...	...
145	6.7	3.0	5.2
146	6.3	2.5	5.0
147	6.5	3.0	5.2
148	6.2	3.4	5.4
149	5.9	3.0	5.1

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

	sepal_length	sepal_width	petal_length
0	5.1	3.5	1.4
20	5.4	3.4	1.7
40	5.0	3.5	1.3
60	5.0	2.0	3.5
80	5.5	2.4	3.8
100	6.3	3.3	6.0
120	6.9	3.2	5.7
140	6.7	3.1	5.6

	0	1	2	3
0	20	10	24	25
1	22	30	24	10
2	32	21	37	21
3	28	21	14	29
4	33	19	36	26
5	39	18	28	25
6	16	35	14	17
7	29	13	21	30
8	22	26	28	29
9	37	28	16	28
10	24	33	37	33
11	30	30	29	39
12	13	19	21	28
13	36	38	34	30
14	23	10	22	21

	0	1	2	3	sum
0	20	10	24	25	79
1	22	30	24	10	86
2	32	21	37	21	111
3	28	21	14	29	92
4	33	19	36	26	114
5	39	18	28	25	110
6	16	35	14	17	82
7	29	13	21	30	93
8	22	26	28	29	105
9	37	28	16	28	109
10	24	33	37	33	127
11	30	30	29	39	128
12	13	19	21	28	81
13	36	38	34	30	138
14	23	10	22	21	76