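Learning pandas, the Complete Code Collection [Very Detailed]: Data Inspection, Input/Output, Selection, Integration, Cleaning, Transformation, Reshaping, Math and Statistics, Sorting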

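This post collects the pandas code you will reach for most often in day-to-day work, together with the output of each snippet, to help you understand it better.

I have also added comments that explain what each piece of code does. If you would like the data files and code used in this post, feel free to like, favorite, and share it, then leave your email address in the comments; I will send them to you as soon as I see the comment.

You can also treat this post as a small pandas handbook: whenever you need a particular pandas feature, press Ctrl+F to search for it. If you are not going to read it now, bookmark it for later.

A follow-up post is also available, feel free to bookmark it first: Learning pandas, the Complete Code Collection [Very Detailed]: Binning, Grouping and Aggregation, Time Series, Data Visualization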


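Contents
  • Part 1: pandas Data Structures
    • 1.1 Series
    • 1.2 DataFrame
  • Part 2: Inspecting Data
  • Part 3: Data Input and Output
    • 3.1 csv
    • 3.2 Excel
    • 3.3 HDF5
    • 3.4 SQL
  • Part 4: Selecting Data
    • 4.1 Getting Data
    • 4.2 Selecting by Label
    • 4.3 Selecting by Position
    • 4.4 Boolean Indexing
    • 4.5 Assignment
  • Part 5: Data Integration
    • 5.1 Concatenating with concat
    • 5.2 Inserting Data
    • 5.3 SQL-Style Joins
  • Part 6: Data Cleaning
  • Part 7: Data Transformation
    • 7.1 Renaming Axes and Replacing Elements
    • 7.2 Element-wise Mapping with map
    • 7.3 Element-wise Mapping with apply
    • 7.4 Element-wise Transformation with transform
    • 7.5 Shuffling, Random Sampling, and Dummy Variables
  • Part 8: Data Reshaping
  • Part 9: Math and Statistics Methods
    • 9.1 Simple Statistics
    • 9.2 Getting Index Labels and Positions
    • 9.3 More Statistics
    • 9.4 Advanced Statistics
  • Part 10: Sorting
  • Closing Remarks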

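Part 1: pandas Data Structures

import numpy as np
import pandas as pd # pandas is built on top of NumPy and extends it

The main pandas data structures are Series (one-dimensional data) and DataFrame (two-dimensional data).

1.1 Series

# Series
l = np.array([1,2,3,6,9]) # NumPy array

s1 = pd.Series(data = l)
display(l,s1) # display() comes from Jupyter/IPython. A Series is a one-dimensional array; unlike a NumPy array, it also carries an index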

array([1, 2, 3, 6, 9])



0    1
1    2
2    3
3    6
4    9
dtype: int64

s2 = pd.Series(data = l,index = list('ABCDE'))
s2

A    1
B    2
C    3
D    6
E    9
dtype: int64

s3 = pd.Series(data = {'A':149,'B':130,'C':118,'D':99,'E':66})
s3

A    149
B    130
C    118
D     99
E     66
dtype: int64
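
The index is what separates a Series from a plain NumPy array, and it also drives arithmetic. The following is a minimal sketch with made-up values (`s_a`, `s_b`, and their scores are illustrative assumptions, not data from this post) showing that operations align on index labels rather than on positions:

```python
import pandas as pd

# Hypothetical scores, used only for illustration
s_a = pd.Series({'A': 149, 'B': 130, 'C': 118})
s_b = pd.Series({'B': 10, 'C': 20, 'D': 30})

# Arithmetic aligns on index labels; a label present on only one side yields NaN
total = s_a + s_b

# The same Series can be read by label or by position
by_label = s_a['B']         # value stored under label 'B'
by_position = s_a.iloc[0]   # first value, regardless of its label

print(total)
```

Because 'A' and 'D' each appear in only one of the two Series, their sums are missing values, and the result is stored as floats.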

1.2 DataFrame

# A Series is one-dimensional and offers fewer features
# A DataFrame is two-dimensional: several Series that share one index make up a DataFrame
# Much like an Excel sheet: all the data is structured
df1 = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),
                   index = list('ABCDEFHIJK'), # row index
                   columns=['Python','Math','En'],dtype=np.float16) # column index
df1

   Python   Math     En
A   113.0   37.0   70.0
B    92.0   22.0   11.0
C     0.0    9.0   66.0
D    40.0  145.0   23.0
E    25.0  133.0  108.0
F   124.0   16.0  130.0
H   121.0   85.0  133.0
I    84.0  125.0   39.0
J   111.0   36.0  137.0
K    55.0   26.0   85.0

df2 = pd.DataFrame(data = {'Python':[66,99,128],'Math':[88,65,137],'En':[100,121,45]})
df2 # the dict keys become the column index; when index is not given, a default integer index starting at 0 is generated

   Python  Math   En
0      66    88  100
1      99    65  121
2     128   137   45

Part 2: Inspecting Data

df = pd.DataFrame(data = np.random.randint(0,151,size = (100,3)),
                  columns=['Python','Math','En'])
df

    Python  Math   En
0      133   139  141
1       82    17  130
2       51    51  145
3      127    70   11
4       93    60   91
..     ...   ...  ...
95      57   133   96
96      91    21  134
97      76   109  113
98      99    82   29
99      28    54   88

100 rows × 3 columns

df.shape # shape of the DataFrame

(100, 3)

df.head(n = 3) # show the first n rows; n defaults to 5

   Python  Math   En
0     133   139  141
1      82    17  130
2      51    51  145

df.tail() # show the last n rows; n defaults to 5

    Python  Math   En
95      57   133   96
96      91    21  134
97      76   109  113
98      99    82   29
99      28    54   88

df.dtypes # data type of each column

Python    int64
Math      int64
En        int64
dtype: object

df.info() # more detailed information: index range, non-null counts, dtypes, memory usage

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Python  100 non-null    int64
 1   Math    100 non-null    int64
 2   En      100 non-null    int64
dtypes: int64(3)
memory usage: 2.5 KB

df.describe() # summary statistics: count, mean, standard deviation, min, quartiles (the 50% row is the median), max

           Python        Math          En
count  100.000000  100.000000  100.000000
mean    85.790000   77.410000   67.630000
std     41.375173   44.905309   43.883835
min      3.000000    0.000000    3.000000
25%     54.500000   40.250000   31.250000
50%     84.500000   81.000000   58.500000
75%    123.000000  113.250000  103.000000
max    149.000000  149.000000  147.000000

df.values # the underlying values, returned as a NumPy array

array([[133, 139, 141],
       [ 82,  17, 130],
       [ 51,  51, 145],
       [127,  70,  11],
       [ 93,  60,  91],
       [103, 110, 103],
       [ 27, 133,  32],
       [148,  99, 128],
       [139,  97,  44],
       [ 64,  85,  71],
       [147,  94,  37],
       [114,  12,  16],
       [ 16,  54,  44],
       [123,   3,  76],
       [137,  97, 123],
       [149, 113,  74],
       [ 69,  38,   7],
       [ 68, 122,   4],
       [ 53,  13,  47],
       [113, 127, 124],
       [ 55, 139,  47],
       [140, 114,  14],
       [ 84, 111, 115],
       [ 65,   5, 136],
       [ 96,  50,  89],
       [145, 130,  15],
       [111,  30,  66],
       [132, 122, 144],
       [ 79,   5,  45],
       [115,  29,  49],
       [ 27,  55,  83],
       [ 29,  74,  38],
       [ 87, 100,  45],
       [132, 147, 119],
       [ 66,  90,  40],
       [ 67, 108,  48],
       [ 78,  28,  46],
       [105, 137, 110],
       [132, 119,  55],
       [117,  23,  79],
       [ 12,  29,  12],
       [114,  58, 119],
       [139,   0,  42],
       [ 61,  69, 142],
       [141,  73, 107],
       [ 49,  12,  19],
       [  8,   1,  75],
       [134,  60,  25],
       [138,  80,  79],
       [112, 115,  26],
       [ 77,   4, 120],
       [140, 100,  35],
       [ 82, 129,   4],
       [100,   8,  25],
       [ 77,  97,  78],
       [ 55, 113,  53],
       [ 45,  73,  37],
       [ 44,   0,  80],
       [ 26,  74,  52],
       [ 99,  75, 147],
       [111,   8, 144],
       [ 55, 146,  15],
       [140, 106,  74],
       [ 91,  78,  92],
       [130, 108,  41],
       [ 34,  41, 136],
       [  3, 139,   4],
       [123,  93,   4],
       [ 24, 103,   3],
       [ 44, 122,  92],
       [ 83,  45,  50],
       [ 46, 149, 103],
       [ 48, 127,  92],
       [  3,  51,  57],
       [136, 136,  82],
       [ 65, 102,  16],
       [ 23,  61, 118],
       [138,  15,   6],
       [ 83,  91,   4],
       [109,  24,  54],
       [ 40,  43, 125],
       [103, 123, 141],
       [116, 113,  38],
       [137,  71, 126],
       [ 69, 143,  83],
       [  8,  60,  60],
       [ 40,  22,  95],
       [ 73,  19,  17],
       [137, 129, 103],
       [109, 142,  94],
       [ 85, 105,  10],
       [ 97, 107,  19],
       [ 79,  12,  27],
       [143,  74,  18],
       [ 32, 114,  52],
       [ 57, 133,  96],
       [ 91,  21, 134],
       [ 76, 109, 113],
       [ 99,  82,  29],
       [ 28,  54,  88]])

df.columns # column index

Index(['Python', 'Math', 'En'], dtype='object')

df.index # row index, 0 to 99

RangeIndex(start=0, stop=100, step=1)

Part 3: Data Input and Output

3.1 csv

df = pd.DataFrame(data = np.random.randint(0,151,size = (100,3)),
                  columns=['Python','Math','En'])
df # shows both the row index and the column index

PythonMathEn
01012854
1454774
210333133
3702481
490143121
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

df.to_csv('./data.csv',sep = ',',
          index = True, # write the row index
          header=True) # write the column index

df.to_csv('./data2.csv',sep = ',',
          index = False, # do not write the row index
          header=False) # do not write the column index

pd.read_csv('./data.csv',
            index_col=0) # use the first column as the row index

PythonMathEn
01012854
1454774
210333133
3702481
490143121
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

pd.read_csv('./data2.csv',header =None)

012
01012854
1454774
210333133
3702481
490143121
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns
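
The two round trips above can be checked without touching the disk. This is a small sketch that writes to an in-memory `io.StringIO` buffer instead of `./data.csv`; the buffer, the fixed seed, and the name `scores` are assumptions made for the example:

```python
import io

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the run is repeatable
scores = pd.DataFrame(rng.integers(0, 151, size=(5, 3)),
                      columns=['Python', 'Math', 'En'])

# Round trip 1: keep the row index and the header, restore the index on read
with_index = io.StringIO()
scores.to_csv(with_index, index=True, header=True)
with_index.seek(0)
restored = pd.read_csv(with_index, index_col=0)

# Round trip 2: drop both, then tell read_csv that the file has no header row
bare = io.StringIO()
scores.to_csv(bare, index=False, header=False)
bare.seek(0)
no_header = pd.read_csv(bare, header=None)  # columns are numbered 0, 1, 2

same_values = np.array_equal(restored.to_numpy(), scores.to_numpy())
print(same_values, list(no_header.columns))
```

If `header=None` is left out in the second case, the first data row is consumed as column names, which silently drops one record.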

3.2 Excel

df

PythonMathEn
01012854
1454774
210333133
3702481
490143121
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

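df.to_excel('./data.xlsx') # writing .xlsx needs the openpyxl package; the legacy .xls writer was removed in pandas 2.0

pd.read_excel('./data.xlsx',
              index_col=0) # use the first column as the row index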

PythonMathEn
01012854
1454774
210333133
3702481
490143121
9514525139
965351109
97357130
98865120
991496675

100 rows × 3 columns

3.3 HDF5

df.to_hdf('./data.h5',key = 'score') # HDF5 support needs the PyTables package (tables); one file can hold several keys

df2 = pd.DataFrame(data = np.random.randint(6,100,size = (1000,5)),
                   columns=['计算机','化工','生物','工程','教师'])
df2

计算机化工生物工程教师
06422166860
19547727637
28848925037
3753886383
46214202145
9958994889727
996526821850
9977699109256
9986631556594
999821388914

1000 rows × 5 columns

df2.to_hdf('./data.h5',key = 'salary')

pd.read_hdf('./data.h5',key = 'salary')

计算机化工生物工程教师
06422166860
19547727637
28848925037
3753886383
46214202145
9958994889727
996526821850
9977699109256
9986631556594
999821388914

1000 rows × 5 columns

3.4 SQL

from sqlalchemy import create_engine # database engine: builds the connection to the database

# Uses the PyMySQL driver
# The connection string looks much like a web URL
engine = create_engine('mysql+pymysql://root:12345678@localhost/pandas?charset=utf8')

df2.to_sql('salary',engine,index=False) # save the DataFrame to MySQL

df3 = pd.read_sql('select * from salary limit 50',con = engine)
df3

计算机化工生物工程教师
06422166860
19547727637
28848925037
3753886383
46214202145
59541843716
63445119394
79910573063
85960123793
94158156770
10198639664
117561784989
128684316827
134298242085
149515978087
155222443574
166720652410
17946416266
188676807219
19618164266
20779284187
218716751434
222382924232
236189282140
242212388914
257712468912
266045527167
272976942691
281460828860
295636446037
306377434282
312571365121
32768687838
339359862578
347340128666
351030541371
36948587585
378141611255
388068669284
395336842666
401962634745
418939913186
425743534819
436616231910
444628788121
45385376498
46559470644
475633921784
486968238790
491247328015
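
If no MySQL server is at hand, the same `to_sql` / `read_sql` round trip can be tried against SQLite, which ships with Python. This is only a sketch: the in-memory database, the English column names, and the seed are assumptions, not part of the original setup.

```python
import sqlite3

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # fixed seed, assumed for repeatability
salary = pd.DataFrame(rng.integers(6, 100, size=(100, 3)),
                      columns=['computer', 'chemical', 'biology'])  # hypothetical columns

conn = sqlite3.connect(':memory:')  # throwaway in-memory database
salary.to_sql('salary', conn, index=False, if_exists='replace')

# Read back only the first rows, as in the MySQL example
first_rows = pd.read_sql('select * from salary limit 5', con=conn)
conn.close()

print(first_rows)
```

pandas accepts a plain `sqlite3` connection directly; for other databases such as MySQL it expects a SQLAlchemy engine, as in the code above.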

Part 4: Selecting Data

4.1 Getting Data

df = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
                  index=list('ABCDEFHIJK'),columns=['Python','Math','En'])
df

   Python  Math   En
A      88    52   48
B      78    62   94
C       9    14   71
D      86    15   21
E       1    71   71
F     123   138   55
H      59    17  140
I      68     8   58
J     100    70   63
K      79    37   72

df['Python'] # get one column; the result is a Series

A     88
B     78
C      9
D     86
E      1
F    123
H     59
I     68
J    100
K     79
Name: Python, dtype: int64

df.Python # attribute access: each column label of the DataFrame is also available as an attribute

A     88
B     78
C      9
D     86
E      1
F    123
H     59
I     68
J    100
K     79
Name: Python, dtype: int64

df[['Python','En']] # get several columns

   Python   En
A      88   48
B      78   94
C       9   71
D      86   21
E       1   71
F     123   55
H      59  140
I      68   58
J     100   63
K      79   72

4.2 Selecting by Label

# Labels are the values of the row index; loc is short for location
df.loc['A']

Python    88
Math      52
En        48
Name: A, dtype: int64

df.loc[['A','F','K']]

   Python  Math  En
A      88    52  48
F     123   138  55
K      79    37  72

df.loc['A','Python']

88

df.loc[['A','C','F'],'Python']

A     88
C      9
F    123
Name: Python, dtype: int64

df.loc['A'::2,['Math','En']]

   Math   En
A    52   48
C    14   71
E    71   71
H    17  140
J    70   63

df.loc['A':'D',:]

   Python  Math  En
A      88    52  48
B      78    62  94
C       9    14  71
D      86    15  21

4.3 Selecting by Position

df.iloc[0]

Python    88
Math      52
En        48
Name: A, dtype: int64

df.iloc[[0,2,4]]

   Python  Math  En
A      88    52  48
C       9    14  71
E       1    71  71

df.iloc[0:4,[0,2]]

   Python  En
A      88  48
B      78  94
C       9  71
D      86  21

df.iloc[3:8:2]

   Python  Math  En
D      86    15  21
F     123   138  55
I      68     8  58
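
One detail in the slices above is easy to miss: `loc` slices include the end label, while `iloc` slices exclude the end position, like ordinary Python slicing. A minimal sketch, using the first four rows of the table above as a stand-in (`df_demo` is a name introduced here for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'Python': [88, 78, 9, 86], 'Math': [52, 62, 14, 15]},
                       index=list('ABCD'))

by_label = df_demo.loc['A':'C']   # label slice: 'C' is included
by_position = df_demo.iloc[0:2]   # position slice: position 2 is excluded

print(list(by_label.index), list(by_position.index))
```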

4.4 Boolean Indexing

cond = df.Python > 80 # select the rows whose Python score is above 80
df[cond]

   Python  Math  En
A      88    52  48
D      86    15  21
F     123   138  55
J     100    70  63

cond = df.mean(axis = 1) > 75 # rows whose average score is above 75 count as excellent; filter them out
df[cond]

   Python  Math  En
B      78    62  94
F     123   138  55
J     100    70  63

cond = (df.Python > 70) & (df.Math > 70)
df[cond]

   Python  Math  En
F     123   138  55

cond = df.index.isin(['C','E','H','K']) # test whether each index label is in the list
df[cond] # keeps only the rows that satisfy the condition

   Python  Math   En
C       9    14   71
E       1    71   71
H      59    17  140
K      79    37   72
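
Conditions can also be combined with `|` (or) and `~` (not), and `DataFrame.query` offers a string form of the same filters. The sketch below reuses four rows from the table above; `df_demo` and the thresholds are illustrative choices, not the author's code:

```python
import pandas as pd

df_demo = pd.DataFrame({'Python': [88, 78, 123, 100],
                        'Math': [52, 62, 138, 70],
                        'En': [48, 94, 55, 63]},
                       index=list('ABFJ'))

# Each condition needs its own parentheses, because & | ~ bind tighter than comparisons
either = df_demo[(df_demo.Python > 100) | (df_demo.Math > 60)]
not_high = df_demo[~(df_demo.Python > 80)]

# query() expresses the same kind of filter as a string
both = df_demo.query('Python > 70 and Math > 70')

print(list(either.index), list(not_high.index), list(both.index))
```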

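4.5 Assignment

df.loc['A','Python'] = 150 # modify a single value; .loc avoids chained assignment (df['Python']['A'] = 150), which may not update df in newer pandas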
df

   Python  Math   En
A     150    52   48
B      78    62   94
C       9    14   71
D      86    15   21
E       1    71   71
F     123   138   55
H      59    17  140
I      68     8   58
J     100    70   63
K      79    37   72

df['Java'] = np.random.randint(0,151,size = 10) # add a new column
df

   Python  Math   En  Java
A     150    52   48    65
B      78    62   94    25
C       9    14   71    82
D      86    15   21   139
E       1    71   71    67
F     123   138   55   145
H      59    17  140    53
I      68     8   58   141
J     100    70   63    11
K      79    37   72   127

df.loc[['C','D','E'],'Math'] = 147 # modify the scores of several students at once
df

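   Python  Math   En  Java
A     150    52   48    65
B      78    62   94    25
C       9   147   71    82
D      86   147   21   139
E       1   147   71    67
F     123   138   55   145
H      59    17  140    53
I      68     8   58   141
J     100    70   63    11
K      79    37   72   127

cond = df < 60
df[cond] = 60 # conditional assignment: values that satisfy the condition are replaced, all others stay unchanged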

df

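   Python  Math   En  Java
A     150    60   60    65
B      78    62   94    60
C      60   147   71    82
D      86   147   60   139
E      60   147   71    67
F     123   138   60   145
H      60    60  140    60
I      68    60   60   141
J     100    70   63    60
K      79    60   72   127

df.iloc[3::3,[0,2]] += 100 # add 100 to columns 0 and 2 of every third row, starting at position 3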

df

   Python  Math   En  Java
A     150    60   60    65
B      78    62   94    60
C      60   147   71    82
D     186   147  160   139
E      60   147   71    67
F     123   138   60   145
H     160    60  240    60
I      68    60   60   141
J     100    70   63    60
K     179    60  172   127
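
The conditional assignment shown earlier in this section changes the DataFrame in place. When the original should stay untouched, `mask` or `clip` return a new object instead. A minimal sketch on a small stand-in frame (`df_demo` and its values are assumptions for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame({'Python': [150, 9, 1], 'Math': [52, 14, 71]},
                       index=list('ACE'))

# mask() replaces the values where the condition is True and returns a new frame
floored = df_demo.mask(df_demo < 60, 60)

# For a simple lower bound, clip() states the intent more directly
clipped = df_demo.clip(lower=60)

# A single cell is best written through .loc with row and column given together
df_demo.loc['A', 'Python'] = 120

print(floored)
```

`where` is the mirror image of `mask`: it keeps values where the condition holds and replaces the rest.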

Part 5: Data Integration

5.1 Concatenating with concat

# The pandas counterpart of np.concatenate for NumPy arrays
df1 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
                   columns=['Python','Math','En'],
                   index = list('ABCDEFHIJK'))
df2 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
                   columns = ['Python','Math','En'],
                   index = list('QWRTUYOPLM'))
df3 = pd.DataFrame(np.random.randint(0,151,size = (10,2)),
                  columns=['Java','Chinese'],index = list('ABCDEFHIJK'))

pd.concat([df1,df2],axis = 0) # axis = 0 concatenates along rows: the number of rows grows

   Python  Math   En
A     108    74   53
B      98    16   47
C      71    77  128
D       9   123  131
E      25    90  132
F     105   106   86
H     146    42   81
I      83     4   36
J     102    79    8
K      92    11   47
Q     119    59   43
W      20    62  106
R      77    82  128
T      44   119   15
U      49   149   62
Y      94    90   88
O     105    72  133
P      87   109  123
L     125   140  149
M     148    22  102

pd.concat([df1,df3],axis = 1) # axis = 1 concatenates along columns: the number of columns grows

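   Python  Math   En  Java  Chinese
A     108    74   53    61       81
B      98    16   47   117      117
C      71    77  128    48        4
D       9   123  131   149      115
E      25    90  132   113       73
F     105   106   86   140       26
H     146    42   81   117      118
I      83     4   36   103       91
J     102    79    8    43       20
K      92    11   47    93       72

pd.concat([df1,df2]) # row-wise append. The original post used df1.append(df2); DataFrame.append was removed in pandas 2.0. The output below includes a C++ column because it was captured after the insert in section 5.2 had been run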

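   Python    C++  Math   En
A     108   59.0    74   53
B      98    4.0    16   47
C      71   27.0    77  128
D       9   17.0   123  131
E      25   60.0    90  132
F     105  136.0   106   86
H     146  112.0    42   81
I      83  120.0     4   36
J     102   28.0    79    8
K      92   53.0    11   47
Q     119    NaN    59   43
W      20    NaN    62  106
R      77    NaN    82  128
T      44    NaN   119   15
U      49    NaN   149   62
Y      94    NaN    90   88
O     105    NaN    72  133
P      87    NaN   109  123
L     125    NaN   140  149
M     148    NaN    22  102

pd.concat([df1,df3]) # originally df1.append(df3), removed in pandas 2.0. Missing values appear because the column labels of df1 and df3 differ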

   Python    C++   Math     En   Java  Chinese
A   108.0   59.0   74.0   53.0    NaN      NaN
B    98.0    4.0   16.0   47.0    NaN      NaN
C    71.0   27.0   77.0  128.0    NaN      NaN
D     9.0   17.0  123.0  131.0    NaN      NaN
E    25.0   60.0   90.0  132.0    NaN      NaN
F   105.0  136.0  106.0   86.0    NaN      NaN
H   146.0  112.0   42.0   81.0    NaN      NaN
I    83.0  120.0    4.0   36.0    NaN      NaN
J   102.0   28.0   79.0    8.0    NaN      NaN
K    92.0   53.0   11.0   47.0    NaN      NaN
A     NaN    NaN    NaN    NaN   61.0     81.0
B     NaN    NaN    NaN    NaN  117.0    117.0
C     NaN    NaN    NaN    NaN   48.0      4.0
D     NaN    NaN    NaN    NaN  149.0    115.0
E     NaN    NaN    NaN    NaN  113.0     73.0
F     NaN    NaN    NaN    NaN  140.0     26.0
H     NaN    NaN    NaN    NaN  117.0    118.0
I     NaN    NaN    NaN    NaN  103.0     91.0
J     NaN    NaN    NaN    NaN   43.0     20.0
K     NaN    NaN    NaN    NaN   93.0     72.0

pd.concat([df1,df3],axis = 0) # the same row-wise concatenation with axis = 0 written out

   Python    C++   Math     En   Java  Chinese
A   108.0   59.0   74.0   53.0    NaN      NaN
B    98.0    4.0   16.0   47.0    NaN      NaN
C    71.0   27.0   77.0  128.0    NaN      NaN
D     9.0   17.0  123.0  131.0    NaN      NaN
E    25.0   60.0   90.0  132.0    NaN      NaN
F   105.0  136.0  106.0   86.0    NaN      NaN
H   146.0  112.0   42.0   81.0    NaN      NaN
I    83.0  120.0    4.0   36.0    NaN      NaN
J   102.0   28.0   79.0    8.0    NaN      NaN
K    92.0   53.0   11.0   47.0    NaN      NaN
A     NaN    NaN    NaN    NaN   61.0     81.0
B     NaN    NaN    NaN    NaN  117.0    117.0
C     NaN    NaN    NaN    NaN   48.0      4.0
D     NaN    NaN    NaN    NaN  149.0    115.0
E     NaN    NaN    NaN    NaN  113.0     73.0
F     NaN    NaN    NaN    NaN  140.0     26.0
H     NaN    NaN    NaN    NaN  117.0    118.0
I     NaN    NaN    NaN    NaN  103.0     91.0
J     NaN    NaN    NaN    NaN   43.0     20.0
K     NaN    NaN    NaN    NaN   93.0     72.0
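
How `pd.concat` treats columns that do not match can be controlled with `join`, and the row labels can be renumbered with `ignore_index`. A small sketch on two hypothetical frames (`left` and `right` are names and values made up for the example):

```python
import pandas as pd

left = pd.DataFrame({'Python': [108, 98], 'Math': [74, 16]}, index=['A', 'B'])
right = pd.DataFrame({'Python': [119, 20], 'En': [43, 106]}, index=['Q', 'W'])

# Default join='outer': the union of the columns, with NaN where a frame lacks one
stacked = pd.concat([left, right])

# join='inner': keep only the columns that both frames share
shared = pd.concat([left, right], join='inner')

# ignore_index=True: discard the old row labels and number the rows 0..n-1
renumbered = pd.concat([left, right], ignore_index=True)

print(stacked)
```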

5.2 Inserting Data

df1

   Python  Math   En
A     108    74   53
B      98    16   47
C      71    77  128
D       9   123  131
E      25    90  132
F     105   106   86
H     146    42   81
I      83     4   36
J     102    79    8
K      92    11   47

df1.insert(loc = 1, # position of the new column
           column='C++', # name of the inserted column
           value = np.random.randint(0,151,size = 10)) # values to insert

df1

   Python  C++  Math   En
A     108   59    74   53
B      98    4    16   47
C      71   27    77  128
D       9   17   123  131
E      25   60    90  132
F     105  136   106   86
H     146  112    42   81
I      83  120     4   36
J     102   28    79    8
K      92   53    11   47

5.3 SQL-Style Joins

df1 = pd.DataFrame(data = {'name':['softpo','Brandon','Ella','Daniel','张三'],
                           'height':[175,180,169,177,168]}) # height
df2 = pd.DataFrame(data = {'name':['softpo','Brandon','Ella','Daniel','李四'],
                           'weight':[70,65,74,63,88]}) # weight
df3 = pd.DataFrame(data = {'名字':['softpo','Brandon','Ella','Daniel','张三'],
                           'salary':np.random.randint(20,100,size = 5)}) # salary; the key column '名字' also means "name"
display(df1,df2,df3)

      name  height
0   softpo     175
1  Brandon     180
2     Ella     169
3   Daniel     177
4       张三     168

      name  weight
0   softpo      70
1  Brandon      65
2     Ella      74
3   Daniel      63
4       李四      88

        名字  salary
0   softpo      64
1  Brandon      48
2     Ella      25
3   Daniel      26
4       张三      96

pd.concat([df1,df2],axis = 1)

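      name  height     name  weight
0   softpo     175   softpo      70
1  Brandon     180  Brandon      65
2     Ella     169     Ella      74
3   Daniel     177   Daniel      63
4       张三     168       李四      88

# Merge data on a shared attribute
# df1 and df2 share the column: name
# Like a database join on a common key
# inner join
pd.merge(df1,df2,how = 'inner') # merge on the shared name column, like joining two tables on a foreign key; only names present in both are kept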

      name  height  weight
0   softpo     175      70
1  Brandon     180      65
2     Ella     169      74
3   Daniel     177      63

pd.merge(df1,df2,how = 'outer') # outer join: every row is kept, and positions without a match are filled with NaN

      name  height  weight
0   softpo   175.0    70.0
1  Brandon   180.0    65.0
2     Ella   169.0    74.0
3   Daniel   177.0    63.0
4       张三   168.0     NaN
5       李四     NaN    88.0

pd.merge(df1,df2,how = 'left') # left join: keep every row of df1

      name  height  weight
0   softpo     175    70.0
1  Brandon     180    65.0
2     Ella     169    74.0
3   Daniel     177    63.0
4       张三     168     NaN

pd.merge(df1,df3,left_on='name',right_on='名字') # the key columns have different names, so name each side explicitly

      name  height       名字  salary
0   softpo     175   softpo      64
1  Brandon     180  Brandon      48
2     Ella     169     Ella      25
3   Daniel     177   Daniel      26
4       张三     168       张三      96

df4 = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),
                   columns=['Python','Math','En'],index = list('ABCDEFHIJK'))
df4

   Python  Math   En
A      71     7   89
B     145   116   40
C      56   150  139
D      88    66   41
E      87   139  117
F     141    45   18
H      93   119  114
I     110    89    2
J       2    35   96
K     125    59    9

score_mean = df4.mean(axis = 1).round(1) # average score of each row, rounded to one decimal place
score_mean

A     55.7
B    100.3
C    115.0
D     65.0
E    114.3
F     68.0
H    108.7
I     67.0
J     44.3
K     64.3
dtype: float64

df4.insert(loc = 2,column='平均分',value=score_mean)

df4

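   Python  Math    平均分   En
A      71     7   55.7   89
B     145   116  100.3   40
C      56   150  115.0  139
D      88    66   65.0   41
E      87   139  114.3  117
F     141    45   68.0   18
H      93   119  108.7  114
I     110    89   67.0    2
J       2    35   44.3   96
K     125    59   64.3    9

df5 = df4.iloc[:,[0,1,3]] # take the three score columns again, leaving out 平均分 (the average)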
df5

   Python  Math   En
A      71     7   89
B     145   116   40
C      56   150  139
D      88    66   41
E      87   139  117
F     141    45   18
H      93   119  114
I     110    89    2
J       2    35   96
K     125    59    9

score_mean.name = '平均分' # a Series needs a name before it can be merged as a column
score_mean

A     55.7
B    100.3
C    115.0
D     65.0
E    114.3
F     68.0
H    108.7
I     67.0
J     44.3
K     64.3
Name: 平均分, dtype: float64

df5

   Python  Math   En
A      71     7   89
B     145   116   40
C      56   150  139
D      88    66   41
E      87   139  117
F     141    45   18
H      93   119  114
I     110    89    2
J       2    35   96
K     125    59    9

pd.merge(df5,score_mean,
         left_index=True, # match the left data by its row index
         right_index=True) # match the right data by its row index

   Python  Math   En    平均分
A      71     7   89   55.7
B     145   116   40  100.3
C      56   150  139  115.0
D      88    66   41   65.0
E      87   139  117  114.3
F     141    45   18   68.0
H      93   119  114  108.7
I     110    89    2   67.0
J       2    35   96   44.3
K     125    59    9   64.3
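
When an outer merge produces missing values, it helps to know which side each row came from. `pd.merge` can record that with `indicator=True`. The frames below are shortened versions of df1 and df2 from this section; the names `height` and `weight` are introduced here for illustration:

```python
import pandas as pd

height = pd.DataFrame({'name': ['softpo', 'Brandon', '张三'],
                       'height': [175, 180, 168]})
weight = pd.DataFrame({'name': ['softpo', 'Brandon', '李四'],
                       'weight': [70, 65, 88]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'
merged = pd.merge(height, weight, on='name', how='outer', indicator=True)

# Look up the origin of each row by name
origin = merged.set_index('name')['_merge'].astype(str).to_dict()
print(origin)
```

Passing `on='name'` explicitly also guards against an accidental merge on every shared column, which is what happens when `on` is omitted.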

Part 6: Data Cleaning

df = pd.DataFrame(data = {'color':['red','blue','red','green','green','blue',None,np.nan,'green'],
                          'price':[20,15,20,18,18,22,30,30,22]})
df

   color  price
0    red     20
1   blue     15
2    red     20
3  green     18
4  green     18
5   blue     22
6   None     30
7    NaN     30
8  green     22

# Remove duplicate rows
df.drop_duplicates() # keeps the first occurrence of each row; rows 6 and 7 count as duplicates because None and NaN are both treated as missing

   color  price
0    red     20
1   blue     15
3  green     18
5   blue     22
6   None     30
8  green     22

df

   color  price
0    red     20
1   blue     15
2    red     20
3  green     18
4  green     18
5   blue     22
6   None     30
7    NaN     30
8  green     22

df.dropna() # drop the rows that contain missing values

   color  price
0    red     20
1   blue     15
2    red     20
3  green     18
4  green     18
5   blue     22
8  green     22

# Delete rows or columns
df.drop(labels=[2,4,6,8]) # rows are deleted by default

   color  price
0    red     20
1   blue     15
3  green     18
5   blue     22
7    NaN     30

# Delete a given column
df.drop(labels='color',axis = 1) # axis = 1 deletes columns

   price
0     20
1     15
2     20
3     18
4     18
5     22
6     30
7     30
8     22

df.filter(items=['price']) # items lists the columns to keep, here price

   price
0     20
1     15
2     20
3     18
4     18
5     22
6     30
7     30
8     22

df['size'] = 1024 # the scalar is broadcast to every row
df

   color  price  size
0    red     20  1024
1   blue     15  1024
2    red     20  1024
3  green     18  1024
4  green     18  1024
5   blue     22  1024
6   None     30  1024
7    NaN     30  1024
8  green     22  1024

df.filter(like = 'i') # substring match: keeps the columns whose label contains the letter i

   price  size
0     20  1024
1     15  1024
2     20  1024
3     18  1024
4     18  1024
5     22  1024
6     30  1024
7     30  1024
8     22  1024

df['hello'] = 512
df

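   color  price  size  hello
0    red     20  1024    512
1   blue     15  1024    512
2    red     20  1024    512
3  green     18  1024    512
4  green     18  1024    512
5   blue     22  1024    512
6   None     30  1024    512
7    NaN     30  1024    512
8  green     22  1024    512

# Regular expressions allow many kinds of matching
df.filter(regex = 'e$') # regular expression: the column label must end with e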

   price  size
0     20  1024
1     15  1024
2     20  1024
3     18  1024
4     18  1024
5     22  1024
6     30  1024
7     30  1024
8     22  1024

df.filter(regex='e') # selects every column whose label contains e

   price  size  hello
0     20  1024    512
1     15  1024    512
2     20  1024    512
3     18  1024    512
4     18  1024    512
5     22  1024    512
6     30  1024    512
7     30  1024    512
8     22  1024    512

# Filtering outliers
a = np.random.randint(0,1000,size = 200)
a

array([647, 871,  35, 738, 789, 587, 413, 559, 648, 993, 579, 129, 825,
       904, 356, 316, 997, 800,  35, 601,   1, 208, 465, 614, 680, 619,
       922, 346, 994, 135,   5, 650, 165, 475,  95, 194, 225, 455, 634,
       717, 836, 678, 156, 203, 263, 180, 143, 248, 407,  56, 202, 947,
        46, 408, 686, 530, 545, 273, 125, 964, 323, 775, 313, 238, 242,
       804, 228, 322, 322, 768, 556,   9, 629, 938, 932, 859, 955, 707,
       729, 541, 280, 493, 255, 681, 428, 992, 420, 650, 267,  32, 662,
       185, 756, 319, 313, 271, 229, 711, 803,  85, 527, 853, 670, 685,
       423, 458, 628, 701, 253, 495, 548, 879, 503, 115,  90, 978, 665,
       532, 198, 482, 412, 850, 879, 913,  96, 177, 778, 337, 502, 128,
        49, 747, 591,  22, 557, 105, 136, 775, 626, 515, 959, 869, 245,
       437,  51, 236, 438, 489, 854,  49, 163, 687, 488, 175, 428, 517,
       493, 377, 100, 728, 717, 926, 689, 186, 777, 639,  79,  83, 620,
       623, 931, 918, 721, 315, 133, 423, 161, 999, 341,  55, 837, 582,
       530, 805,  22, 301, 177, 322, 708,  14,  50, 864, 889, 929, 967,
       497, 624, 127, 539,  14])

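# Outliers: values above 800 or below 100 are treated as abnormal. The thresholds are chosen by hand and depend on the actual situation.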
cond = (a <=800) & (a >=100)
a[cond]

array([647, 738, 789, 587, 413, 559, 648, 579, 129, 356, 316, 800, 601,
       208, 465, 614, 680, 619, 346, 135, 650, 165, 475, 194, 225, 455,
       634, 717, 678, 156, 203, 263, 180, 143, 248, 407, 202, 408, 686,
       530, 545, 273, 125, 323, 775, 313, 238, 242, 228, 322, 322, 768,
       556, 629, 707, 729, 541, 280, 493, 255, 681, 428, 420, 650, 267,
       662, 185, 756, 319, 313, 271, 229, 711, 527, 670, 685, 423, 458,
       628, 701, 253, 495, 548, 503, 115, 665, 532, 198, 482, 412, 177,
       778, 337, 502, 128, 747, 591, 557, 105, 136, 775, 626, 515, 245,
       437, 236, 438, 489, 163, 687, 488, 175, 428, 517, 493, 377, 100,
       728, 717, 689, 186, 777, 639, 620, 623, 721, 315, 133, 423, 161,
       341, 582, 530, 301, 177, 322, 708, 497, 624, 127, 539])

# Standard normal distribution: mean 0, standard deviation 1
b = np.random.randn(100000)
b

array([-1.17335196,  2.02215212, -0.29891071, ..., -1.6762474 ,
       -1.27071523, -1.15187761])

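# Find the outliers with the 3-sigma rule
cond = np.abs(b) > 3*1 # more than 3 standard deviations (std = 1) from the mean: these are the outliers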
b[cond]

array([ 3.46554243,  3.08127362,  3.55119821,  3.62774922,  3.11823028,
        3.22620922, -3.10381164, -3.20067563, -3.04607325, -3.04427703,
        3.09111414, -3.28220862,  3.00499105, -3.06179762, -3.17331972,
       -3.37172359,  3.93766782, -3.22895232, -3.13737479,  3.07612751,
       -3.43215209, -3.27660651, -3.35116041,  4.74328695,  3.25586636,
       -3.54090785,  3.08881127,  3.00635551,  3.5018534 , -3.14463788,
       -3.0182886 , -3.12145648, -3.24276219,  3.08087834,  3.04820238,
       -3.24173442, -3.14648209,  3.87748281, -3.07660111, -3.16083928,
        3.32641202, -3.05228179,  3.04924043,  3.02825131, -3.08360056,
       -3.04890894, -3.27258041, -3.07339115, -3.38375287, -3.14267022,
       -3.7207377 ,  3.4813841 , -3.12866105, -3.17122631,  3.0599701 ,
        3.12393087,  3.20253178, -3.05221958, -3.35532417,  3.02450167,
       -3.28385568,  3.3422833 , -3.11052755, -3.09647003,  3.32353664,
       -3.70215812, -3.07916575, -3.13546874,  3.20575826, -3.67982084,
       -3.17055893,  3.4836615 , -3.30039879, -3.27774497,  3.02125912,
        3.12332885,  3.01456477,  3.15958151, -3.34101369,  3.32444673,
        3.06479889,  3.14506863,  3.15670827,  3.15066995,  3.14705869,
       -3.20526898, -3.0761338 ,  3.20716127, -3.20941307, -3.7212859 ,
       -3.51785834, -3.06096986, -3.05425748, -3.47049261,  3.22285172,
       -3.32233224, -3.04630606,  3.41215312, -3.16482337, -3.01813609,
       -3.05441573, -3.10394416,  3.03469642,  3.01493847, -3.11901071,
        3.5996865 ,  3.48194227, -3.77734847,  3.04588004,  3.10611158,
       -3.20473003, -3.4377999 ,  3.22680244, -3.1536921 , -3.22798726,
        3.34569796,  3.06046948, -3.16955677,  3.12613756,  3.04286964,
        3.01148054,  3.18525226, -4.08971624, -3.55427596, -5.39879049,
        3.05203254,  3.08944491, -3.02258209,  3.17316913, -3.1615401 ,
        3.17205118, -3.24221772, -3.14421237, -3.74675036,  3.61678522,
        3.59097443, -3.0302881 ,  3.23236707, -3.00850012,  3.33608986,
       -3.02859152, -3.7000766 , -3.10992575, -3.00412636, -3.05657102,
       -3.05208781,  3.14017797,  3.46457731,  3.15619413, -3.43236114,
        3.08259529, -3.84578168,  3.04203424, -3.29444028, -3.01764756,
        3.11300256,  3.23071233,  3.20785451, -3.15668756,  3.44176099,
       -3.19985577, -3.14126853, -3.26482841, -3.62208271, -3.55305069,
        3.09639491, -3.18178713, -3.03662021,  3.17247227,  3.3908074 ,
       -3.63563705, -3.56417097,  3.02823554, -3.06955375,  3.74305364,
        3.63993306, -3.14193492, -3.04032527, -3.28310908, -3.37949723,
       -3.25915912, -3.01206123, -3.10871377, -3.22982732,  3.8136103 ,
        3.48893313,  3.9918267 ,  3.4526763 , -3.46595488, -3.29996013,
       -3.42965097,  3.151502  ,  3.10548689, -3.44707735,  3.21881565,
        3.50932999, -3.12410382,  3.30296386,  3.02454576, -3.20072608,
        3.54339754, -3.17847739, -3.21475045,  3.03546088, -3.06225619,
        3.48158164,  3.15243123, -3.06358376,  3.27300242,  3.32577453,
        3.23535167, -3.04681725,  3.33439387,  3.10620079,  3.52883469,
       -3.1790272 ,  3.02641222, -3.45636819,  3.21009424,  3.08045954,
       -3.59721754,  3.24693695,  3.05920919, -3.43674159, -3.00370946,
       -3.48031594, -3.28748467,  3.42581649,  3.46912521, -3.28384157,
        3.76358974, -3.34035865,  3.12978233,  3.44856854, -3.04074246,
        3.50018071,  3.33188267, -3.09775514, -3.49356906, -3.09902374,
        3.12068562, -3.1776565 , -3.44282129,  3.19286374, -3.28304596,
       -3.10080963, -3.37189709,  3.77743156,  3.03547536,  3.22045459,
       -3.44007263,  3.01331408,  3.49733677,  3.28831922,  3.62147013,
        3.03458981,  3.15447237, -3.33931478,  3.09858431, -3.23592306,
        3.3144797 ,  3.37067342, -3.18749118,  3.09319307, -3.34390567,
        3.29819563,  3.3120354 ,  3.04166958, -3.00975323,  3.0347423 ,
       -3.82502331, -3.13125028, -3.0876424 ,  3.13929221,  3.570775  ,
       -3.37420738,  3.17527797,  3.13396148, -3.70088631, -3.04054948,
        3.05399103,  3.24908851,  3.19666266, -3.64071456, -3.85271081,
        3.06864652,  3.53367592,  3.54650649,  3.6355438 ,  3.657715  ,
        4.03831601,  3.61651925])
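
The 3-sigma rule carries over to a DataFrame with several columns. In this sketch a row is dropped when any of its columns lies more than three standard deviations from that column's mean; the seed, the frame size, and the column names `a`, `b`, `c` are assumptions for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed, assumed for repeatability
data = pd.DataFrame(rng.standard_normal(size=(10000, 3)), columns=list('abc'))

# Standardize every column: distance from the column mean in units of its std
z = (data - data.mean()) / data.std()

# A row is an outlier if any of its columns is more than 3 std away
is_outlier = (z.abs() > 3).any(axis=1)
cleaned = data[~is_outlier]

print(len(data), len(cleaned))
```

Using `.all(axis=1)` instead of `.any(axis=1)` would drop only the rows where every column is extreme, which is a much stricter test.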

Part 7: Data Transformation

7.1 Renaming Axes and Replacing Elements

import numpy as np
import pandas as pd

df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
                  columns=['Python','Tensorflow','Keras'],
                  index = list('ABCDEFHIJK'))
df

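   Python  Tensorflow  Keras
A       2           5      3
B       5           0      0
C       7           0      4
D       0           4      7
E       8           6      9
F       8           2      6
H       6           7      8
I       7           6      9
J       4           7      9
K       6           7      1

df.rename(index = {'A':'X','K':'Y'}, # rename row index labels
          columns={'Python':'人工智能'}, # rename a column label ('人工智能' means "artificial intelligence")
          inplace=True) # modify the original data in place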

df.replace(5,50,inplace=True)
df

   人工智能  Tensorflow  Keras
X     2          50      3
B    50           0      0
C     7           0      4
D     0           4      7
E     8           6      9
F     8           2      6
H     6           7      8
I     7           6      9
J     4           7      9
Y     6           7      1

df.replace([2,7],1024,inplace=True) # replace several values with one value
df

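   人工智能  Tensorflow  Keras
X  1024          50      3
B    50           0      0
C  1024           0      4
D     0           4   1024
E     8           6      9
F     8        1024      6
H     6        1024      8
I  1024           6      9
J     4        1024      9
Y     6        1024      1

df.iloc[4,2] = np.nan # set a missing value (np.NaN was removed in NumPy 2.0)

df.replace({0:2048,np.nan:-100},inplace=True) # a dict maps each old value to its replacement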
df

   人工智能  Tensorflow   Keras
X  1024          50     3.0
B    50        2048  2048.0
C  1024        2048     4.0
D  2048           4  1024.0
E     8           6  -100.0
F     8        1024     6.0
H     6        1024     8.0
I  1024           6     9.0
J     4        1024     9.0
Y     6        1024     1.0

df.replace({'Tensorflow':1024},-1024) # restrict the replacement to one column

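   人工智能  Tensorflow   Keras
X  1024          50     3.0
B    50        2048  2048.0
C  1024        2048     4.0
D  2048           4  1024.0
E     8           6  -100.0
F     8       -1024     6.0
H     6       -1024     8.0
I  1024           6     9.0
J     4       -1024     9.0
Y     6       -1024     1.0

7.2 Element-wise Mapping with map

# map works on a single column only, that is, on a Series
# Values without an entry in the mapping come back as missing values
df['人工智能'].map({1024:3.14,2048:2.718,6:1108}) # change the data according to the dict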

X       3.140
B         NaN
C       3.140
D       2.718
E         NaN
F         NaN
H    1108.000
I       3.140
J         NaN
Y    1108.000
Name: 人工智能, dtype: float64

df['Keras'].map(lambda x :True if x > 0 else False) # True if the value is greater than 0, otherwise False

X     True
B     True
C     True
D     True
E    False
F     True
H     True
I     True
J     True
Y     True
Name: Keras, dtype: bool

def convert(x):
    if x >= 1024:
        return True
    else:
        return False
df['level'] = df['Tensorflow'].map(convert) # map passes every value of the Tensorflow column to the function
df

   人工智能  Tensorflow   Keras  level
X  1024          50     3.0  False
B    50        2048  2048.0   True
C  1024        2048     4.0   True
D  2048           4  1024.0  False
E     8           6  -100.0  False
F     8        1024     6.0   True
H     6        1024     8.0   True
I  1024           6     9.0  False
J     4        1024     9.0   True
Y     6        1024     1.0   True
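
As the dict example in this section shows, `map` turns every value without an entry into NaN. If unmapped values should keep their original value instead, one common pattern is to fill the gaps from the source Series. A sketch with a shortened copy of the column (`s` and `mapping` are illustrative names):

```python
import pandas as pd

s = pd.Series([1024, 50, 2048, 6], index=list('XBDH'))
mapping = {1024: 3.14, 2048: 2.718, 6: 1108}

mapped = s.map(mapping)            # 50 has no entry, so it becomes NaN
kept = s.map(mapping).fillna(s)    # unmapped values fall back to the original

print(kept)
```

`Series.replace(mapping)` is an alternative that leaves unmatched values alone by design.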

7.3 Element-wise Mapping with apply

# apply works on both Series and DataFrame
df['人工智能'].apply(lambda x : x + 100)

X    1124
B     150
C    1124
D    2148
E     108
F     108
H     106
I    1124
J     104
Y     106
Name: 人工智能, dtype: int64

df['level'].apply(lambda x:1 if x else 0)

X    0
B    1
C    1
D    0
E    0
F    1
H    1
I    0
J    1
Y    1
Name: level, dtype: int64

df.apply(lambda x : x + 1000) # on a DataFrame, apply passes each column to the function, so every value is shifted

   人工智能  Tensorflow   Keras  level
X  2024        1050  1003.0   1000
B  1050        3048  3048.0   1001
C  2024        3048  1004.0   1001
D  3048        1004  2024.0   1000
E  1008        1006   900.0   1000
F  1008        2024  1006.0   1001
H  1006        2024  1008.0   1001
I  2024        1006  1009.0   1000
J  1004        2024  1009.0   1001
Y  1006        2024  1001.0   1001

def convert(x):
    return (x.median(),x.count(),x.min(),x.max(),x.std()) # returns the median, count, minimum, maximum, and standard deviation
df.apply(convert).round(1) # operates on columns by default

    人工智能  Tensorflow   Keras     level
0    29.0      1024.0     7.0         1
1    10.0        10.0    10.0        10
2     4.0         4.0  -100.0     False
3  2048.0      2048.0  2048.0      True
4   717.8       800.4   694.9  0.516398

df

   人工智能  Tensorflow   Keras  level
X  1024          50     3.0  False
B    50        2048  2048.0   True
C  1024        2048     4.0   True
D  2048           4  1024.0  False
E     8           6  -100.0  False
F     8        1024     6.0   True
H     6        1024     8.0   True
I  1024           6     9.0  False
J     4        1024     9.0   True
Y     6        1024     1.0   True

df.apply(convert,axis = 1) # axis = 1: the function receives each row

X     (26.5, 4, False, 1024, 503.68732033541073)
B    (1049.0, 4, True, 2048, 1167.8622564326668)
C      (514.0, 4, True, 2048, 979.1007353689405)
D     (514.0, 4, False, 2048, 979.3623776042588)
E        (3.0, 4, -100.0, 8, 52.443620520834884)
F        (7.0, 4, True, 1024, 509.5085049993441)
H        (7.0, 4, True, 1024, 509.5085049993441)
I         (7.5, 4, False, 1024, 509.51373877453)
J        (6.5, 4, True, 1024, 509.6773489179208)
Y         (3.5, 4, 1.0, 1024, 510.6721061503164)
dtype: object
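
The tuple returned by `convert` produces rows labeled 0 to 4, which are hard to read. One possible alternative, sketched here on a small stand-in frame, is to return a labeled Series from the function, or to ask `agg` for the statistics by name; `df_demo` and `summarize` are hypothetical names:

```python
import pandas as pd

df_demo = pd.DataFrame({'Python': [2, 5, 7, 0], 'Keras': [3, 0, 4, 7]})

def summarize(col):
    # Returning a Series gives every statistic a readable row label
    return pd.Series({'median': col.median(), 'count': col.count(),
                      'min': col.min(), 'max': col.max()})

summary = df_demo.apply(summarize)

# agg accepts the names of built-in reductions and builds the same kind of table
same = df_demo.agg(['median', 'count', 'min', 'max'])

print(summary)
```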

7.4 Element-wise Transformation with transform

df = pd.DataFrame(np.random.randint(0,10,size = (10,3)),
                  columns=['Python','Tensorflow','Keras'],
                  index = list('ABCDEFHIJK'))
display(df)
# transform can operate on a single column, that is, on a Series
df['Python'].transform(lambda x : 1024 if x > 5 else -1024) # behaves much like map and apply here

   Python  Tensorflow  Keras
A       1           1      9
B       6           9      1
C       1           4      6
D       9           5      1
E       4           1      8
F       2           5      7
H       7           2      3
I       7           8      9
J       5           4      2
K       7           0      7

A   -1024
B    1024
C   -1024
D    1024
E   -1024
F   -1024
H    1024
I    1024
J   -1024
K    1024
Name: Python, dtype: int64

df['Tensorflow'].apply([np.sqrt,np.square,np.cumsum]) # apply several different operations to one column

       sqrt  square  cumsum
A  1.000000       1       1
B  3.000000      81      10
C  2.000000      16      14
D  2.236068      25      19
E  1.000000       1      20
F  2.236068      25      25
H  1.414214       4      27
I  2.828427      64      35
J  2.000000      16      39
K  0.000000       0      39

df['Tensorflow'].transform([np.sqrt,np.square,np.cumsum]) # transform gives the same result for one column

       sqrt  square  cumsum
A  1.000000       1       1
B  3.000000      81      10
C  2.000000      16      14
D  2.236068      25      19
E  1.000000       1      20
F  2.236068      25      25
H  1.414214       4      27
I  2.828427      64      35
J  2.000000      16      39
K  0.000000       0      39

def convert(x):
    if x > 5:
        return True
    else:
        return False
# transform can also operate on a whole DataFrame
df.transform({'Python':np.cumsum,'Tensorflow':np.square,'Keras':convert}) # run a different operation on each column

   Python  Tensorflow  Keras
A       1           1   True
B       7          81  False
C       8          16   True
D      17          25  False
E      21           1   True
F      23          25   True
H      30           4  False
I      37          64   True
J      42          16  False
K      49           0   True

df.apply({'Python':np.cumsum,'Tensorflow':np.square,'Keras':convert}) # apply accepts the same dict and gives the same result

   Python  Tensorflow  Keras
A       1           1   True
B       7          81  False
C       8          16   True
D      17          25  False
E      21           1   True
F      23          25   True
H      30           4  False
I      37          64   True
J      42          16  False
K      49           0   True

7.5 Shuffling, Random Sampling, and Dummy Variables

df

   Python  Tensorflow  Keras
A       1           1      9
B       6           9      1
C       1           4      6
D       9           5      1
E       4           1      8
F       2           5      7
H       7           2      3
I       7           8      9
J       5           4      2
K       7           0      7

index = np.random.permutation(10) # returns the positions 0 to 9 in shuffled order
index

array([3, 4, 1, 2, 7, 9, 0, 8, 5, 6])

# Reorder the rows using the shuffled positions
df.take(index)

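   Python  Tensorflow  Keras
D       9           5      1
E       4           1      8
B       6           9      1
C       1           4      6
I       7           8      9
K       7           0      7
A       1           1      9
J       5           4      2
F       2           5      7
H       7           2      3

# Draw random samples from a larger data set
df.take(np.random.randint(0,10,size = 20)) # draw 20 rows at random, with replacement, so rows can repeat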

   Python  Tensorflow  Keras
J       5           4      2
J       5           4      2
D       9           5      1
K       7           0      7
H       7           2      3
I       7           8      9
J       5           4      2
A       1           1      9
C       1           4      6
J       5           4      2
I       7           8      9
D       9           5      1
I       7           8      9
K       7           0      7
A       1           1      9
B       6           9      1
H       7           2      3
D       9           5      1
B       6           9      1
H       7           2      3

df2 = pd.DataFrame(data = {'key':['a','b','a','b','c','b','c']})
df2

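  key
0   a
1   b
2   a
3   b
4   c
5   b
6   c

# one-hot encoding, also called dummy variables
# String data can be represented with numbers after the dummy-variable transformation
pd.get_dummies(df2,prefix='',prefix_sep='') # 1 means present, 0 means absent (pandas 2.0 and later return True/False unless dtype=int is passed)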

   a  b  c
0  1  0  0
1  0  1  0
2  1  0  0
3  0  1  0
4  0  0  1
5  0  1  0
6  0  0  1
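
pandas also has a dedicated method for the shuffling and sampling shown in this section: `DataFrame.sample`. The sketch below uses it in place of `take` with random positions and passes `dtype=int` to `get_dummies` so the result is 0/1 on any pandas version; `df_demo` and the `random_state` values are assumptions for the example:

```python
import pandas as pd

df_demo = pd.DataFrame({'key': list('abacbc'), 'value': range(6)})

# frac=1 returns all rows in random order, i.e. a shuffle
shuffled = df_demo.sample(frac=1, random_state=0)

# replace=True allows more draws than there are rows
with_replacement = df_demo.sample(n=10, replace=True, random_state=0)

# dtype=int forces 0/1 instead of the boolean default of recent pandas versions
dummies = pd.get_dummies(df_demo['key'], dtype=int)

print(dummies)
```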

Part 8: Data Reshaping

df

   Python  Tensorflow  Keras
A       1           1      9
B       6           9      1
C       1           4      6
D       9           5      1
E       4           1      8
F       2           5      7
H       7           2      3
I       7           8      9
J       5           4      2
K       7           0      7

df.T # transpose: rows become columns and columns become rows

            A  B  C  D  E  F  H  I  J  K
Python      1  6  1  9  4  2  7  7  5  7
Tensorflow  1  9  4  5  1  5  2  8  4  0
Keras       9  1  6  1  8  7  3  9  2  7

df2 = pd.DataFrame(np.random.randint(0,10,size = (20,3)),
                   columns=['Python','Math','En'],
                   index = pd.MultiIndex.from_product([list('ABCDEFHIJK'),['期中','期末']])) # multi-level index; 期中 = midterm, 期末 = final

df2

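      Python  Math  En
A 期中       3     3   0
  期末       6     5   8
B 期中       5     5   9
  期末       7     5   2
C 期中       0     7   9
  期末       9     7   5
D 期中       5     6   5
  期末       7     9   6
E 期中       7     3   9
  期末       9     1   4
F 期中       9     9   5
  期末       0     8   9
H 期中       7     0   0
  期末       1     6   6
I 期中       8     1   8
  期末       7     9   9
J 期中       5     0   8
  期末       3     6   6
K 期中       8     2   2
  期末       3     5   2

df2.unstack(level = 1) # move row-index level 1 (the inner level, 期中/期末) into the columns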

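  Python      Math        En
    期中  期末   期中  期末   期中  期末
A      3    6     3    5     0    8
B      5    7     5    5     9    2
C      0    9     7    7     9    5
D      5    7     6    9     5    6
E      7    9     3    1     9    4
F      9    0     9    8     5    9
H      7    1     0    6     0    6
I      8    7     1    9     8    9
J      5    3     0    6     8    6
K      8    3     2    5     2    2

df2.unstack(level = -1) # move a row-index level into the columns; -1 means the last level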

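  Python      Math        En
    期中  期末   期中  期末   期中  期末
A      3    6     3    5     0    8
B      5    7     5    5     9    2
C      0    9     7    7     9    5
D      5    7     6    9     5    6
E      7    9     3    1     9    4
F      9    0     9    8     5    9
H      7    1     0    6     0    6
I      8    7     1    9     8    9
J      5    3     0    6     8    6
K      8    3     2    5     2    2

df2.stack() # the columns become the innermost level of the row index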

A  期中  Python    3
       Math      3
       En        0
   期末  Python    6
       Math      5
       En        8
B  期中  Python    5
       Math      5
       En        9
   期末  Python    7
       Math      5
       En        2
C  期中  Python    0
       Math      7
       En        9
   期末  Python    9
       Math      7
       En        5
D  期中  Python    5
       Math      6
       En        5
   期末  Python    7
       Math      9
       En        6
E  期中  Python    7
       Math      3
       En        9
   期末  Python    9
       Math      1
       En        4
F  期中  Python    9
       Math      9
       En        5
   期末  Python    0
       Math      8
       En        9
H  期中  Python    7
       Math      0
       En        0
   期末  Python    1
       Math      6
       En        6
I  期中  Python    8
       Math      1
       En        8
   期末  Python    7
       Math      9
       En        9
J  期中  Python    5
       Math      0
       En        8
   期末  Python    3
       Math      6
       En        6
K  期中  Python    8
       Math      2
       En        2
   期末  Python    3
       Math      5
       En        2
dtype: int64

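df2.unstack().stack(level = 0) # move the exam level into the columns, then the subject level into the rows

          期中  期末
A En         0     8
  Math       3     5
  Python     3     6
B En         9     2
  Math       5     5
  Python     5     7
C En         9     5
  Math       7     7
  Python     0     9
D En         5     6
  Math       6     9
  Python     5     7
E En         9     4
  Math       3     1
  Python     7     9
F En         5     9
  Math       9     8
  Python     9     0
H En         0     6
  Math       0     6
  Python     7     1
I En         8     9
  Math       1     9
  Python     8     7
J En         8     6
  Math       0     6
  Python     5     3
K En         2     2
  Math       2     5
  Python     8     3
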
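The following sketch illustrates that `stack` and `unstack` are inverse operations, using a tiny two-student frame. The frame `grades` and the labels `'mid'` and `'final'` are illustrative assumptions standing in for the data above.

```python
import pandas as pd

# Hypothetical two-level frame: (student, exam) rows, subject columns
idx = pd.MultiIndex.from_product([['A', 'B'], ['mid', 'final']])
grades = pd.DataFrame({'Python': [3, 6, 5, 7], 'Math': [3, 5, 5, 5]}, index=idx)

# unstack: the innermost row level (exam) becomes the innermost column level
wide = grades.unstack(level=-1)

# stack: the column labels become the innermost row level, giving a Series
long = grades.stack()

# Stacking the exam level back restores the original rows. Row and column
# order may differ between pandas versions, so align before comparing.
restored = wide.stack(level=-1).sort_index()[grades.columns]
```

After the round trip, `restored` holds the same values as `grades.sort_index()`.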
df2.mean() # mean of each column (the default, axis = 0)

Python    5.45
Math      4.85
En        5.60
dtype: float64

df2.mean(axis = 1) # mean of each row

A  期中    2.000000
   期末    6.333333
B  期中    6.333333
   期末    4.666667
C  期中    5.333333
   期末    7.000000
D  期中    5.333333
   期末    7.333333
E  期中    6.333333
   期末    4.666667
F  期中    7.666667
   期末    5.666667
H  期中    2.333333
   期末    4.333333
I  期中    5.666667
   期末    8.333333
J  期中    4.333333
   期末    5.000000
K  期中    4.000000
   期末    3.333333
dtype: float64

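df2.groupby(level=1).mean() # mean over all students for each exam (期中 = midterm, 期末 = final); mean(level=1) was removed in pandas 2.0

      Python  Math   En
期中     5.7   3.6  5.5
期末     5.2   6.1  5.7
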
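df2.groupby(level=0).mean() # mean of each student's midterm and final scores; mean(level=0) was removed in pandas 2.0

   Python  Math   En
A     4.5   4.0  4.0
B     6.0   5.0  5.5
C     4.5   7.0  7.0
D     6.0   7.5  5.5
E     8.0   2.0  6.5
F     4.5   8.5  7.0
H     4.0   3.0  3.0
I     7.5   5.0  8.5
J     4.0   3.0  7.0
K     5.5   3.5  2.0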
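Per-level averages like the two above can be written with `groupby(level=...)`, which works on current pandas versions (the `level` argument of `mean` was removed in pandas 2.0). This is a minimal sketch; the frame `grades` and the level names `student` and `exam` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame with named index levels
idx = pd.MultiIndex.from_product([['A', 'B'], ['mid', 'final']],
                                 names=['student', 'exam'])
grades = pd.DataFrame({'Python': [3, 6, 5, 7], 'Math': [3, 5, 5, 5]}, index=idx)

# Average over all students for each exam
per_exam = grades.groupby(level='exam').mean()

# Average of each student's two exams
per_student = grades.groupby(level='student').mean()
```

Naming the levels lets the grouping refer to `'exam'` or `'student'` instead of the positions 1 and 0.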

Part 9: Mathematical and Statistical Methods

9.1 Simple Statistics

df = pd.DataFrame(np.random.randint(0,10,size = (20,3)),
                  columns=['Python','Math','En'],index = list('QWERTYUIOPASDFGHJKLZ'))
df

   Python  Math  En
Q       1     4   3
W       8     2   0
E       3     0   9
R       9     8   9
T       9     1   3
Y       5     3   1
U       5     8   0
I       1     8   3
O       0     3   5
P       6     6   1
A       0     4   0
S       3     9   4
D       5     2   8
F       2     9   2
G       1     9   8
H       9     5   2
J       5     7   5
K       2     6   5
L       2     7   3
Z       2     9   8

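df.iloc[6,2] = np.nan # mark En in row U as missing (the np.NAN alias was removed in NumPy 2.0)
display(df)

   Python  Math   En
Q       1     4  3.0
W       8     2  0.0
E       3     0  9.0
R       9     8  9.0
T       9     1  3.0
Y       5     3  1.0
U       5     8  NaN
I       1     8  3.0
O       0     3  5.0
P       6     6  1.0
A       0     4  0.0
S       3     9  4.0
D       5     2  8.0
F       2     9  2.0
G       1     9  8.0
H       9     5  2.0
J       5     7  5.0
K       2     6  5.0
L       2     7  3.0
Z       2     9  8.0
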
df.count() # number of non-missing values in each column

Python    20
Math      20
En        19
dtype: int64

display(df.mean(),df.median()) # mean and median; missing values are skipped

Python    3.900000
Math      5.500000
En        4.157895
dtype: float64



Python    3.0
Math      6.0
En        3.0
dtype: float64
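The En column above has 19 counted values and its mean is taken over those 19, because pandas skips NaN by default. A small sketch of that behaviour, using a hypothetical four-element Series `en`:

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value
en = pd.Series([3.0, 0.0, np.nan, 9.0])

n = en.count()                      # NaN is not counted, so this is 3
mean_skip = en.mean()               # (3 + 0 + 9) / 3, the NaN is skipped
mean_keep = en.mean(skipna=False)   # NaN, because one value is missing
```

Use `skipna=False` when a missing value should invalidate the statistic rather than be silently ignored.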

display(df.min(),df.max()) # minimum and maximum

Python    0.0
Math      0.0
En        0.0
dtype: float64



Python    9.0
Math      9.0
En        9.0
dtype: float64

df['Python'].unique() # distinct values, in order of first appearance

array([1, 8, 3, 9, 5, 0, 6, 2])

df['Math'].value_counts() # how often each value occurs, most frequent first

9    4
8    3
7    2
6    2
4    2
3    2
2    2
5    1
1    1
0    1
Name: Math, dtype: int64

df.quantile(q = [0,0.25,0.5,0.75,1]) # quantiles: 0 is the minimum, 0.5 the median, 1 the maximum

      Python  Math   En
0.00    0.00   0.0  0.0
0.25    1.75   3.0  2.0
0.50    3.00   6.0  3.0
0.75    5.25   8.0  6.5
1.00    9.00   9.0  9.0

df.describe().round(1) # summary statistics, rounded to one decimal place

       Python  Math    En
count    20.0  20.0  19.0
mean      3.9   5.5   4.2
std       3.0   2.9   3.0
min       0.0   0.0   0.0
25%       1.8   3.0   2.0
50%       3.0   6.0   3.0
75%       5.2   8.0   6.5
max       9.0   9.0   9.0

9.2 Getting Index Labels and Positions

df['Python'].argmax() # integer position (not the label) of the first maximum

3

df['En'].argmin() # integer position of the first minimum

1

df.idxmax() # index label of the maximum in each column

Python    R
Math      S
En        E
dtype: object

df.idxmin() # index label of the minimum in each column

Python    O
Math      E
En        W
dtype: object
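To make the difference between position and label concrete, here is a small sketch on a hypothetical Series `s` whose index is made of letters. `argmax` returns an integer position, while `idxmax` returns the index label at that position.

```python
import pandas as pd

# Hypothetical scores indexed by letter; the maximum 9 occurs twice
s = pd.Series([1, 8, 3, 9, 9], index=list('QWERT'))

pos = s.argmax()     # integer position of the first maximum
label = s.idxmax()   # index label of the first maximum

# The two answers point at the same element
same_element = s.iloc[pos] == s.loc[label]
```

When the index is the default 0, 1, 2, ... the two results coincide, which is why the distinction is easy to miss.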

9.3 More Statistics

df.cumsum() # cumulative sum down each column

   Python  Math    En
Q       1     4   3.0
W       9     6   3.0
E      12     6  12.0
R      21    14  21.0
T      30    15  24.0
Y      35    18  25.0
U      40    26   NaN
I      41    34  28.0
O      41    37  33.0
P      47    43  34.0
A      47    47  34.0
S      50    56  38.0
D      55    58  46.0
F      57    67  48.0
G      58    76  56.0
H      67    81  58.0
J      72    88  63.0
K      74    94  68.0
L      76   101  71.0
Z      78   110  79.0

df.cumprod() # cumulative product down each column

   Python  Math   En
Q       1     4  3.0
W       8     8  0.0
E      24     0  0.0
R     216     0  0.0
T    1944     0  0.0
Y    9720     0  0.0
U   48600     0  NaN
I   48600     0  0.0
O       0     0  0.0
P       0     0  0.0
A       0     0  0.0
S       0     0  0.0
D       0     0  0.0
F       0     0  0.0
G       0     0  0.0
H       0     0  0.0
J       0     0  0.0
K       0     0  0.0
L       0     0  0.0
Z       0     0  0.0

df.cummin() # running minimum down each column

   Python  Math   En
Q       1     4  3.0
W       1     2  0.0
E       1     0  0.0
R       1     0  0.0
T       1     0  0.0
Y       1     0  0.0
U       1     0  NaN
I       1     0  0.0
O       0     0  0.0
P       0     0  0.0
A       0     0  0.0
S       0     0  0.0
D       0     0  0.0
F       0     0  0.0
G       0     0  0.0
H       0     0  0.0
J       0     0  0.0
K       0     0  0.0
L       0     0  0.0
Z       0     0  0.0

df.cummax() # running maximum down each column

   Python  Math   En
Q       1     4  3.0
W       8     4  3.0
E       8     4  9.0
R       9     8  9.0
T       9     8  9.0
Y       9     8  9.0
U       9     8  NaN
I       9     8  9.0
O       9     8  9.0
P       9     8  9.0
A       9     8  9.0
S       9     9  9.0
D       9     9  9.0
F       9     9  9.0
G       9     9  9.0
H       9     9  9.0
J       9     9  9.0
K       9     9  9.0
L       9     9  9.0
Z       9     9  9.0

df.std() # sample standard deviation (ddof = 1 by default)

Python    3.041814
Math      2.946898
En        3.004869
dtype: float64

df.var() # sample variance, the square of the standard deviation

Python    9.252632
Math      8.684211
En        9.029240
dtype: float64

df.diff() # first difference: each value minus the value in the previous row

   Python  Math    En
Q     NaN   NaN   NaN
W     7.0  -2.0  -3.0
E    -5.0  -2.0   9.0
R     6.0   8.0   0.0
T     0.0  -7.0  -6.0
Y    -4.0   2.0  -2.0
U     0.0   5.0   NaN
I    -4.0   0.0   NaN
O    -1.0  -5.0   2.0
P     6.0   3.0  -4.0
A    -6.0  -2.0  -1.0
S     3.0   5.0   4.0
D     2.0  -7.0   4.0
F    -3.0   7.0  -6.0
G    -1.0   0.0   6.0
H     8.0  -4.0  -6.0
J    -4.0   2.0   3.0
K    -3.0  -1.0   0.0
L     0.0   1.0  -2.0
Z     0.0   2.0   5.0

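df.pct_change().round(3) # relative change from the previous row; in this output the NaN in En was forward-filled first (the default in older pandas versions), so row U shows 0.000

   Python    Math      En
Q     NaN     NaN     NaN
W   7.000  -0.500  -1.000
E  -0.625  -1.000     inf
R   2.000     inf   0.000
T   0.000  -0.875  -0.667
Y  -0.444   2.000  -0.667
U   0.000   1.667   0.000
I  -0.800   0.000   2.000
O  -1.000  -0.625   0.667
P     inf   1.000  -0.800
A  -1.000  -0.333  -1.000
S     inf   1.250     inf
D   0.667  -0.778   1.000
F  -0.600   3.500  -0.750
G  -0.500   0.000   3.000
H   8.000  -0.444  -0.750
J  -0.444   0.400   1.500
K  -0.600  -0.143   0.000
L   0.000   0.167  -0.400
Z   0.000   0.286   1.667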
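As a worked check of what `diff` and `pct_change` compute, the sketch below rebuilds the percentage change by hand as the difference divided by the previous value. The Series `s` is a hypothetical example without missing values or zeros, so no NaN filling or infinite ratios are involved.

```python
import pandas as pd

# Hypothetical values, chosen so that every ratio is exact
s = pd.Series([4.0, 2.0, 8.0, 1.0])

d = s.diff()          # s[i] - s[i-1]; the first entry is NaN
p = s.pct_change()    # (s[i] - s[i-1]) / s[i-1]; the first entry is NaN

# The same quantity computed by hand from diff and shift
manual = s.diff() / s.shift(1)
```

A previous value of 0 makes the ratio infinite, which is where the `inf` entries in the table above come from.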

9.4 Advanced Statistics

df.cov() # covariance matrix: each column against every other column

          Python      Math        En
Python  9.252632 -2.157895 -0.695906
Math   -2.157895  8.684211  1.160819
En     -0.695906  1.160819  9.029240

df.var() # variance: the covariance of a column with itself (the diagonal of df.cov())

Python    9.252632
Math      8.684211
En        9.029240
dtype: float64

df['Python'].cov(df['Math']) # covariance between two specific columns

-2.157894736842105

df.corr() # correlation coefficients, each between -1 and 1

          Python      Math        En
Python  1.000000 -0.240731 -0.074376
Math   -0.240731  1.000000  0.130217
En     -0.074376  0.130217  1.000000

df.corrwith(df['En']) # correlation of every column with one column

Python   -0.074376
Math      0.130217
En        1.000000
dtype: float64
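The relationship between these quantities can be checked numerically: the Pearson correlation is the covariance divided by the product of the two standard deviations, and the variance is the covariance of a column with itself. A minimal sketch, assuming a hypothetical two-column frame `data`:

```python
import pandas as pd

# Hypothetical data, for illustration only
data = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                     'y': [2.0, 4.0, 6.0, 9.0]})

cov_xy = data['x'].cov(data['y'])

# Pearson correlation rebuilt from covariance and standard deviations
r_manual = cov_xy / (data['x'].std() * data['y'].std())
r_pandas = data['x'].corr(data['y'])

# Variance is the covariance of a column with itself
var_from_cov = data['x'].cov(data['x'])
```

Dividing by the standard deviations removes the units, which is why correlation always lies between -1 and 1 while covariance does not.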

Part 10: Sorting

df = pd.DataFrame(np.random.randint(0,20,size = (20,3)),
                  columns=['Python','Tensorflow','Keras'],index = list('QWERTYUIOPASDFGHJKLZ'))
df

   Python  Tensorflow  Keras
Q      17           3      4
W      13          18      7
E      12          11      0
R       3           5     14
T      11          15      7
Y       5          15      4
U      18           2      7
I       7           3      6
O       1          18      5
P      12           6      0
A       4          18      4
S      15           5      8
D       8          11     14
F       3           2     17
G       4          17      8
H      12           1      4
J       1           2      6
K      17           9     16
L      11          14      4
Z      16          13      4

df.sort_index(axis = 0,ascending=False) # sort rows by index label, descending

   Python  Tensorflow  Keras
Z      16          13      4
Y       5          15      4
W      13          18      7
U      18           2      7
T      11          15      7
S      15           5      8
R       3           5     14
Q      17           3      4
P      12           6      0
O       1          18      5
L      11          14      4
K      17           9     16
J       1           2      6
I       7           3      6
H      12           1      4
G       4          17      8
F       3           2     17
E      12          11      0
D       8          11     14
A       4          18      4

df.sort_index(ascending=True) # sort rows by index label, ascending

   Python  Tensorflow  Keras
A       4          18      4
D       8          11     14
E      12          11      0
F       3           2     17
G       4          17      8
H      12           1      4
I       7           3      6
J       1           2      6
K      17           9     16
L      11          14      4
O       1          18      5
P      12           6      0
Q      17           3      4
R       3           5     14
S      15           5      8
T      11          15      7
U      18           2      7
W      13          18      7
Y       5          15      4
Z      16          13      4

df.sort_values(by = 'Python',ascending=True) # sort rows by the Python column, ascending

   Python  Tensorflow  Keras
J       1           2      6
O       1          18      5
R       3           5     14
F       3           2     17
A       4          18      4
G       4          17      8
Y       5          15      4
I       7           3      6
D       8          11     14
T      11          15      7
L      11          14      4
P      12           6      0
E      12          11      0
H      12           1      4
W      13          18      7
S      15           5      8
Z      16          13      4
K      17           9     16
Q      17           3      4
U      18           2      7

df.sort_values(by = ['Python','Tensorflow'],ascending=True)
# Sort by Python first; rows with equal Python values are then ordered by Tensorflow

   Python  Tensorflow  Keras
J       1           2      6
O       1          18      5
F       3           2     17
R       3           5     14
G       4          17      8
A       4          18      4
Y       5          15      4
I       7           3      6
D       8          11     14
L      11          14      4
T      11          15      7
H      12           1      4
P      12           6      0
E      12          11      0
W      13          18      7
S      15           5      8
Z      16          13      4
Q      17           3      4
K      17           9     16
U      18           2      7

df.nlargest(n = 5,columns='Python') # the 5 rows with the largest Python values

   Python  Tensorflow  Keras
U      18           2      7
Q      17           3      4
K      17           9     16
Z      16          13      4
S      15           5      8

df.nsmallest(5,columns='Keras') # the 5 rows with the smallest Keras values; ties keep their original order

   Python  Tensorflow  Keras
E      12          11      0
P      12           6      0
Q      17           3      4
Y       5          15      4
A       4          18      4
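Two variations on the sorting calls above, sketched on a hypothetical five-row frame `df_s`: `nlargest` gives the same rows as sorting in descending order and taking the head, and `sort_values` accepts one `ascending` flag per column when the sort directions differ.

```python
import pandas as pd

# Hypothetical scores, for illustration only
df_s = pd.DataFrame({'Python': [17, 13, 12, 12, 18],
                     'Keras': [4, 7, 0, 4, 7]}, index=list('QWEHU'))

# Top 2 rows by Python, written two ways
top2 = df_s.nlargest(2, 'Python')
top2_sorted = df_s.sort_values('Python', ascending=False).head(2)

# Python ascending, with ties broken by Keras descending
mixed = df_s.sort_values(by=['Python', 'Keras'], ascending=[True, False])
```

In `mixed`, rows H and E share Python = 12, so H (Keras 4) comes before E (Keras 0).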

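Closing Remarks

The code in this post was run in Jupyter, though where you run it makes little difference.

Thanks for reading, and good luck with your studies and your work! If you need the materials for this post, feel free to follow and leave a comment.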