Hands-On Introduction to Pandas
This course is based on the DataWhale open-source tutorial Joyful-Pandas and the Chinese translation of the open-source project ajcr/100-pandas-puzzles. Pandas is an essential library in practical data analysis, integrating a great many statistical and analytical features. Joyful-Pandas is a thorough, beginner-friendly Chinese Pandas tutorial, and it is also officially endorsed by the Pandas project. In the author's experience, though, programming knowledge "learned on paper always feels shallow": when picking up a new library, the best way to "practice" is to work through the examples in the official documentation and QuickStart yourself, and to learn along the way how to consult documentation, an essential skill for any developer. This tutorial is built on that idea: each exercise states the operation to perform and links to the relevant documentation, and learners should look up the docs themselves to complete the practice below.
Before the Meal
In this part we learn how to import Pandas and check its version information; the difficulty is low.
Please read Chapter 2 of Joyful-Pandas and complete the exercises below.
http://joyfulpandas.datawhale.club/Content/ch2.html
1. Import the pandas library under the alias `pd`
import pandas as pd
2. Print the version of the pandas library
print('pandas version: {}'.format(pd.__version__))
pandas version: 1.2.4
3. Print pandas' dependencies and their version information
pd.show_versions(True)
{'system': {'commit': '2cb96529396d93b46abab7bbc73a208e708c642e', 'python': '3.8.8.final.0', 'python-bits': 64, 'OS': 'Windows', 'OS-release': '10', 'Version': '10.0.19041', 'machine': 'AMD64', 'processor': 'Intel64 Family 6 Model 165 Stepping 5, GenuineIntel', 'byteorder': 'little', 'LC_ALL': None, 'LANG': None, 'LOCALE': {'language-code': 'Chinese (Simplified)_China', 'encoding': '936'}}, 'dependencies': {'pandas': '1.2.4', 'numpy': '1.22.3', 'pytz': '2021.1', 'dateutil': '2.8.1', 'pip': '21.0.1', 'setuptools': '52.0.0.post20210125', 'Cython': '0.29.23', 'pytest': '6.2.3', 'hypothesis': None, 'sphinx': '4.0.1', 'blosc': None, 'feather': None, 'xlsxwriter': '1.3.8', 'lxml.etree': '4.6.3', 'html5lib': '1.1', 'pymysql': None, 'psycopg2': None, 'jinja2': '2.11.3', 'IPython': '7.22.0', 'pandas_datareader': None, 'bs4': '4.9.3', 'bottleneck': '1.3.2', 'fsspec': '0.9.0', 'fastparquet': None, 'gcsfs': None, 'matplotlib': '3.3.4', 'numexpr': '2.7.3', 'odfpy': None, 'openpyxl': '3.0.7', 'pandas_gbq': None, 'pyarrow': None, 'pyxlsb': None, 's3fs': None, 'scipy': '1.6.2', 'sqlalchemy': '1.4.7', 'tables': '3.6.1', 'tabulate': None, 'xarray': None, 'xlrd': '2.0.1', 'xlwt': '1.3.0', 'numba': '0.53.1'}}
Gathering the Ingredients
This part introduces DataFrame, one of Pandas' most important data structures.
Please read Chapter 2 of Joyful-Pandas and complete the exercises below.
http://joyfulpandas.datawhale.club/Content/ch2.html
4. Create a DataFrame `df` from the data `data` and the row index `labels`
import numpy as np
Given the following data dictionary `data` and list of labels `labels`:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 2.0 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
5. Display the DataFrame and basic information about its data (hint: methods called directly on the DataFrame)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 animal 10 non-null object
1 age 8 non-null float64
2 visits 10 non-null int64
3 priority 10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
df.describe()
age | visits | |
---|---|---|
count | 8.000000 | 10.000000 |
mean | 3.437500 | 1.900000 |
std | 2.007797 | 0.875595 |
min | 0.500000 | 1.000000 |
25% | 2.375000 | 1.000000 |
50% | 3.000000 | 2.000000 |
75% | 4.625000 | 2.750000 |
max | 7.000000 | 3.000000 |
6. Return the first 4 rows of the DataFrame `df`
df.head(4)
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
7. Select the columns labelled `animal` and `age` from the DataFrame `df`
df[['animal','age']]
animal | age | |
---|---|---|
a | cat | 2.5 |
b | cat | 3.0 |
c | snake | 0.5 |
d | dog | NaN |
e | dog | 5.0 |
f | cat | 2.0 |
g | snake | 4.5 |
h | cat | NaN |
i | dog | 7.0 |
j | dog | 3.0 |
8. Select the data in rows [3, 4, 8] and columns ['animal', 'age']
df.iloc[[3,4,8]][['animal','age']]
animal | age | |
---|---|---|
d | dog | NaN |
e | dog | 5.0 |
i | dog | 7.0 |
df.loc[df.index[[3,4,8]],['animal','age']]
animal | age | |
---|---|---|
d | dog | NaN |
e | dog | 5.0 |
i | dog | 7.0 |
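Chained indexing (`df.iloc[[3,4,8]][['animal','age']]`) works here but goes through an intermediate object; `iloc` alone can do the lookup in one purely positional step. A self-contained sketch, rebuilding the same `df`:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
df = pd.DataFrame(data, index=list('abcdefghij'))

# rows 3, 4, 8 and columns 0, 1 ('animal', 'age') in a single positional lookup
result = df.iloc[[3, 4, 8], [0, 1]]
print(result)
```

The single-step form avoids the intermediate copy created by chaining two indexing operations.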
9. Select the rows where the `visits` column is greater than 1
df[df['visits']>1]
animal | age | visits | priority | |
---|---|---|---|---|
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 2.0 | 3 | no |
i | dog | 7.0 | 2 | no |
df.loc[df['visits']>1]
animal | age | visits | priority | |
---|---|---|---|---|
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 2.0 | 3 | no |
i | dog | 7.0 | 2 | no |
10. Select the rows where `age` is missing
df[df['age'].isnull()]
animal | age | visits | priority | |
---|---|---|---|---|
d | dog | NaN | 3 | yes |
h | cat | NaN | 1 | yes |
df[df['age'].isna()]
animal | age | visits | priority | |
---|---|---|---|---|
d | dog | NaN | 3 | yes |
h | cat | NaN | 1 | yes |
11. Select the rows where `animal` is cat and `age` is less than 3
df[(df['animal']=='cat') & (df['age']<3)]
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
f | cat | 2.0 | 3 | no |
df.query('animal=="cat" and age<3')
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
f | cat | 2.0 | 3 | no |
12. Select the data where `age` is between 2 and 4 (inclusive)
df[df['age'].between(2,4)]
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
f | cat | 2.0 | 3 | no |
j | dog | 3.0 | 1 | no |
df.query('2<=age<=4')
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
f | cat | 2.0 | 3 | no |
j | dog | 3.0 | 1 | no |
13. Change the `age` in row 'f' to 1.5
df.loc['f','age']=1.5
df
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 1.5 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
14. Sum the `visits` column
df['visits'].sum()
19
15. Calculate the mean `age` for each kind of `animal`
df.groupby('animal')['age'].mean()
animal
cat 2.333333
dog 5.000000
snake 2.500000
Name: age, dtype: float64
16. Append a new row 'k' with data of your choice, then delete that newly appended row
df.loc['k'] = ['snake', 6.0, 2, 'no']
df
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 1.5 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
k | snake | 6.0 | 2 | no
df.drop('k',inplace=True)
17. Count the number of each kind of `animal`
df['animal'].value_counts()
cat 4
dog 4
snake 2
Name: animal, dtype: int64
df.groupby('animal')['animal'].count()
animal
cat 4
dog 4
snake 2
Name: animal, dtype: int64
18. Sort by `age` in descending order, then by `visits` in ascending order (row `i` should end up first and row `d` last)
df.sort_values(by=['age','visits'], ascending=[False, True], inplace=True)
df
animal | age | visits | priority | |
---|---|---|---|---|
i | dog | 7.0 | 2 | no |
e | dog | 5.0 | 2 | no |
g | snake | 4.5 | 1 | no |
j | dog | 3.0 | 1 | no |
b | cat | 3.0 | 3 | yes |
a | cat | 2.5 | 1 | yes |
f | cat | 1.5 | 3 | no |
c | snake | 0.5 | 2 | no |
h | cat | NaN | 1 | yes |
d | dog | NaN | 3 | yes |
19. Replace the `yes` and `no` values in the `priority` column with `True` and `False`
df['priority'] = df['priority'].map({'yes':True,'no':False})
df
animal | age | visits | priority | |
---|---|---|---|---|
i | dog | 7.0 | 2 | False |
e | dog | 5.0 | 2 | False |
g | snake | 4.5 | 1 | False |
j | dog | 3.0 | 1 | False |
b | cat | 3.0 | 3 | True |
a | cat | 2.5 | 1 | True |
f | cat | 1.5 | 3 | False |
c | snake | 0.5 | 2 | False |
h | cat | NaN | 1 | True |
d | dog | NaN | 3 | True |
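Since the column contains only 'yes' and 'no', an equivalent one-liner is `eq('yes')`. A sketch on a standalone Series (note that, unlike the map above, `eq` turns any non-'yes' value into False rather than NaN):

```python
import pandas as pd

priority = pd.Series(['yes', 'no', 'no', 'yes'])
# True where the value equals 'yes', False everywhere else;
# identical to map({'yes': True, 'no': False}) when only 'yes'/'no' occur
as_bool = priority.eq('yes')
print(as_bool.tolist())  # → [True, False, False, True]
```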
20. Replace `snake` in the `animal` column with `python`
df['animal'] = df['animal'].replace('snake','python')
21. For each combination of `animal` and `visits`, find the mean age. In other words, each row is an animal type, each column is a visit count, and the values are the mean ages (hint: use a pivot table)
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
visits | 1 | 2 | 3 |
---|---|---|---|
animal | |||
cat | 2.5 | NaN | 2.25 |
dog | 3.0 | 6.0 | NaN |
python | 4.5 | 0.5 | NaN |
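The same table can also be produced with `groupby` plus `unstack`. A self-contained sketch, rebuilding the data as it stands at this point (with row f's age at 1.5 and `snake` renamed to `python`):

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'python', 'dog', 'dog', 'cat', 'python', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 1.5, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1]}
df = pd.DataFrame(data, index=list('abcdefghij'))

pivot = df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
# grouping on both keys, then moving 'visits' from the index to the columns,
# yields the same table as the pivot
grouped = df.groupby(['animal', 'visits'])['age'].mean().unstack()
print(grouped.equals(pivot))  # → True
```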
22. Read the file boston.csv in the data folder and try analysing it yourself with the methods above, drawing some simple conclusions
df2 = pd.read_csv('./data/boston.csv')
df2
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
505 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
506 rows × 14 columns
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null int64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null int64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
df2.describe()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
From df2.info() we can see that none of the 14 columns contains missing values. From df2.describe() we can see that the minimum values are all non-negative, i.e. there are no negative values in the data. In addition, columns such as ZN, TAX, and B span wide ranges with large variances, so they may contain outliers that warrant further statistical analysis.
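One common way to flag such suspected outliers is the 1.5×IQR rule. A minimal sketch on synthetic data with one injected extreme value (rather than boston.csv, which may not be at hand):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 100 well-behaved points plus one injected extreme value
s = pd.Series(np.append(rng.normal(10, 2, 100), 60.0))

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# values outside the classic 1.5*IQR fences are flagged as candidate outliers
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)
```

The same expression applied per column of df2 would flag the candidate outliers in ZN, TAX, and B for closer inspection.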
By now you should be able to master the basic DataFrame operations in pandas by reading the documentation. Next we cover some more advanced operations, which are clever combinations of the ones above.
Combining the Ingredients
In this part we cover a number of operations on the data in a DataFrame: deduplication, computation, value selection, and so on.
Please read Chapter 2 of Joyful-Pandas and complete the exercises below. http://joyfulpandas.datawhale.club/Content/ch2.html
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df
A | |
---|---|
0 | 1 |
1 | 2 |
2 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
6 | 5 |
7 | 5 |
8 | 6 |
9 | 7 |
10 | 7 |
23. Using pandas, output the unique values appearing in column `A` of the df above (i.e. the set of all elements that occur)
df['A'].unique()
array([1, 2, 3, 4, 5, 6, 7], dtype=int64)
24. Output the unique values in df and their corresponding frequencies
df['A'].value_counts()
5 3
2 2
7 2
1 1
3 1
4 1
6 1
Name: A, dtype: int64
25. Deduplicate df
df.drop_duplicates('A')
A | |
---|---|
0 | 1 |
1 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
8 | 6 |
9 | 7 |
26. Given a set of random data:
df = pd.DataFrame(np.random.random(size=(5, 3)))
df
0 | 1 | 2 | |
---|---|---|---|
0 | 0.580241 | 0.632042 | 0.354775 |
1 | 0.122504 | 0.784829 | 0.920090 |
2 | 0.231789 | 0.597295 | 0.149788 |
3 | 0.568011 | 0.135924 | 0.011110 |
4 | 0.353484 | 0.835216 | 0.300393 |
how do you subtract from each element the mean of its row?
df.sub(df.mean(1),axis=0)
0 | 1 | 2 | |
---|---|---|---|
0 | 0.057888 | 0.109689 | -0.167577 |
1 | -0.486637 | 0.175688 | 0.310949 |
2 | -0.094502 | 0.271004 | -0.176502 |
3 | 0.329663 | -0.102424 | -0.227239 |
4 | -0.142881 | 0.338852 | -0.195971 |
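A quick sanity check: after the subtraction, every row of the result should average to zero up to floating-point error. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(5, 3)))
# subtract each row's mean, broadcasting the means down the rows (axis=0)
centered = df.sub(df.mean(axis=1), axis=0)
# all row means of the centered frame are numerically zero
print(np.allclose(centered.mean(axis=1), 0.0))  # → True
```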
27. Return the label of the column with the smallest sum in the df below
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df
a | b | c | d | e | f | g | h | i | j | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.681472 | 0.375433 | 0.157866 | 0.844610 | 0.515149 | 0.843195 | 0.482867 | 0.019393 | 0.146855 | 0.819014 |
1 | 0.718139 | 0.829071 | 0.240966 | 0.995575 | 0.388320 | 0.178618 | 0.379783 | 0.418137 | 0.549539 | 0.939299 |
2 | 0.299162 | 0.336460 | 0.445597 | 0.935320 | 0.964780 | 0.669580 | 0.150017 | 0.983547 | 0.028738 | 0.398523 |
3 | 0.155546 | 0.860910 | 0.719335 | 0.989113 | 0.032604 | 0.180492 | 0.867443 | 0.021331 | 0.559367 | 0.466884 |
4 | 0.181180 | 0.062531 | 0.813016 | 0.982059 | 0.113649 | 0.412184 | 0.857099 | 0.564338 | 0.420419 | 0.322087 |
df.sum(axis=0).sort_values().index[0]
'i'
df.sum(axis=0).idxmin()
'i'
28. How do you count the number of distinct rows in a DataFrame?
df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
df
0 | 1 | 2 | |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 1 | 0 | 1 |
2 | 1 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 1 | 1 |
5 | 1 | 1 | 1 |
6 | 0 | 1 | 0 |
7 | 1 | 1 | 1 |
8 | 0 | 1 | 1 |
9 | 1 | 1 | 1 |
df.drop_duplicates().shape[0]
6
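The same count can be obtained without materialising the deduplicated frame, using `duplicated()`. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
# duplicated() marks every repeat of an earlier row;
# negating and summing therefore counts the distinct rows directly
n_unique = (~df.duplicated()).sum()
print(n_unique)
```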
29. Given the DataFrame below, where A and B contain values between 0 and 100 (inclusive), bin A into intervals (i.e. (0, 10], (10, 20], ...) and sum B within each group. The output should match the following:
A
(0, 10] 635
(10, 20] 360
(20, 30] 315
(30, 40] 306
(40, 50] 750
(50, 60] 284
(60, 70] 424
(70, 80] 526
(80, 90] 835
(90, 100] 852
df = pd.DataFrame(np.random.RandomState(8765).randint(1, 101,
size=(100, 2)),
columns = ["A", "B"])
df
A | B | |
---|---|---|
0 | 46 | 29 |
1 | 75 | 22 |
2 | 49 | 63 |
3 | 33 | 43 |
4 | 71 | 75 |
... | ... | ... |
95 | 60 | 87 |
96 | 57 | 40 |
97 | 86 | 19 |
98 | 50 | 56 |
99 | 97 | 94 |
100 rows × 2 columns
df.groupby(pd.cut(df['A'], bins=range(0, 110, 10), right=True))['B'].sum()
A
(0, 10] 635
(10, 20] 360
(20, 30] 315
(30, 40] 306
(40, 50] 750
(50, 60] 284
(60, 70] 424
(70, 80] 526
(80, 90] 835
(90, 100] 852
Name: B, dtype: int32
Data Cleaning with pandas
The df below is the dataset we will use:
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
'Budapest_PaRis', 'Brussels_londOn'],
'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
'12. Air France', '"Swiss Air"']})
df
From_To | FlightNumber | RecentDelays | Airline | |
---|---|---|---|---|
0 | LoNDon_paris | 10045.0 | [23, 47] | KLM(!) |
1 | MAdrid_miLAN | NaN | [] | <Air France> (12) |
2 | londON_StockhOlm | 10065.0 | [24, 43, 87] | (British Airways. ) |
3 | Budapest_PaRis | NaN | [13] | 12. Air France |
4 | Brussels_londOn | 10085.0 | [67, 32] | "Swiss Air" |
30. Some values in the FlightNumber column are missing (they are NaN). These numbers follow a pattern, increasing by 10 with each row, so the NaNs need to become 10055 and 10075. Modify df to fill in these missing numbers and make the column an integer column (rather than a float column)
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)  # the default interpolation method is linear
df
From_To | FlightNumber | RecentDelays | Airline | |
---|---|---|---|---|
0 | LoNDon_paris | 10045 | [23, 47] | KLM(!) |
1 | MAdrid_miLAN | 10055 | [] | <Air France> (12) |
2 | londON_StockhOlm | 10065 | [24, 43, 87] | (British Airways. ) |
3 | Budapest_PaRis | 10075 | [13] | 12. Air France |
4 | Brussels_londOn | 10085 | [67, 32] | "Swiss Air" |
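Because the numbers are known to increase by exactly 10, an arithmetic rebuild is an alternative to interpolation. A sketch on a standalone copy of the column, assuming the first value is present:

```python
import numpy as np
import pandas as pd

fn = pd.Series([10045, np.nan, 10065, np.nan, 10085])
start = int(fn.iloc[0])
# regenerate the whole column from the known start value and step of 10
filled = pd.Series(range(start, start + 10 * len(fn), 10), index=fn.index)
print(filled.tolist())  # → [10045, 10055, 10065, 10075, 10085]
```

Unlike interpolation, this also fills gaps at the end of the column, where there is no later value to interpolate towards.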
31. The From_To column would be better as two separate columns! Split each string on the underscore delimiter `_`. Store the two halves in a temporary DataFrame named 'temp' with the column names 'From' and 'To'.
temp = df['From_To'].str.split('_',expand=True)
temp.columns = ['From','To']
temp
From | To | |
---|---|---|
0 | LoNDon | paris |
1 | MAdrid | miLAN |
2 | londON | StockhOlm |
3 | Budapest | PaRis |
4 | Brussels | londOn |
32. Notice how the capitalisation of the city names is mixed. Standardise the strings so that only the first letter is uppercase (e.g. 'londON' should become 'London').
temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()
temp
From | To | |
---|---|---|
0 | London | Paris |
1 | Madrid | Milan |
2 | London | Stockholm |
3 | Budapest | Paris |
4 | Brussels | London |
33. Delete the From_To column from df and merge the processed data in temp into df
df = df.drop('From_To', axis=1)
df2 = pd.concat([df, temp], axis=1)
df2
FlightNumber | RecentDelays | Airline | From | To | |
---|---|---|---|---|---|
0 | 10045 | [23, 47] | KLM(!) | London | Paris |
1 | 10055 | [] | <Air France> (12) | Madrid | Milan |
2 | 10065 | [24, 43, 87] | (British Airways. ) | London | Stockholm |
3 | 10075 | [13] | 12. Air France | Budapest | Paris |
4 | 10085 | [67, 32] | "Swiss Air" | Brussels | London |
34. In the Airline column you can see some extra symbols around the airline names. Extract just the airline name. For example, '(British Airways. )' should become 'British Airways'.
df2['Airline'] = df2['Airline'].str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
df2
FlightNumber | RecentDelays | Airline | From | To | |
---|---|---|---|---|---|
0 | 10045 | [23, 47] | KLM | London | Paris |
1 | 10055 | [] | Air France | Madrid | Milan |
2 | 10065 | [24, 43, 87] | British Airways | London | Stockholm |
3 | 10075 | [13] | Air France | Budapest | Paris |
4 | 10085 | [67, 32] | Swiss Air | Brussels | London |
35. In the RecentDelays column, the values have been entered into the DataFrame as lists. We want each first value in its own column, each second value in its own column, and so on. If there is no Nth value, the value should be NaN.
Expand the Series of lists into a DataFrame named delays, rename its columns delay_1, delay_2, and so on, and replace the unwanted RecentDelays column in df with delays.
df3=df2['RecentDelays'].apply(pd.Series)
df3
0 | 1 | 2 | |
---|---|---|---|
0 | 23.0 | 47.0 | NaN |
1 | NaN | NaN | NaN |
2 | 24.0 | 43.0 | 87.0 |
3 | 13.0 | NaN | NaN |
4 | 67.0 | 32.0 | NaN |
df3.columns = ['delay_{}'.format(n) for n in range(1, len(df3.columns)+1)]
df2 = df2.drop('RecentDelays', axis=1).join(df3)
df2
FlightNumber | Airline | From | To | delay_1 | delay_2 | delay_3 | |
---|---|---|---|---|---|---|---|
0 | 10045 | KLM | London | Paris | 23.0 | 47.0 | NaN |
1 | 10055 | Air France | Madrid | Milan | NaN | NaN | NaN |
2 | 10065 | British Airways | London | Stockholm | 24.0 | 43.0 | 87.0 |
3 | 10075 | Air France | Budapest | Paris | 13.0 | NaN | NaN |
4 | 10085 | Swiss Air | Brussels | London | 67.0 | 32.0 | NaN |
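For larger frames, building the delays columns from a list of lists is usually faster than `apply(pd.Series)`. A sketch on a standalone copy of the delay data:

```python
import pandas as pd

recent = pd.Series([[23, 47], [], [24, 43, 87], [13], [67, 32]])
# the DataFrame constructor pads ragged rows with NaN automatically
delays = pd.DataFrame(recent.tolist(), index=recent.index)
delays.columns = ['delay_{}'.format(n) for n in range(1, delays.shape[1] + 1)]
print(delays)
```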