Python Basics (11): A Super-Detailed 30,000-Word Summary of the Pandas Library


                   A         C
2019-01-04  0.305088 -0.978434
2019-01-05  0.313383  0.163155
2019-01-06  0.250613 -0.858240
  • Scattered rows with a column slice
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]

In the pandas version used here, the form above raises an error; use positional indexing with iloc instead:

df.iloc[[1, 5], 0: 3]

                   A         B         C
2019-01-02 -0.234414 -1.194674  1.080779
2019-01-06  0.250613 -0.904400 -0.858240
  • Scattered values in both rows and columns
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]

Again, this form fails in the pandas version used here, so fall back to iloc:

df.iloc[[1, 5], [0, 3]]

                   A         D
2019-01-02 -0.234414 -2.294395
2019-01-06  0.250613 -1.573342
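The positional selections above can be sketched on a small deterministic frame (the values below are made up, not the article's random data):

```python
import numpy as np
import pandas as pd

# A minimal sketch with fixed values (not the article's random data)
dates = pd.date_range("2019-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=dates, columns=list("ABCD"))

# Scattered rows plus a positional column slice
sub1 = df.iloc[[1, 5], 0:3]       # rows 2019-01-02 and 2019-01-06, columns A..C

# Scattered rows and scattered columns
sub2 = df.iloc[[1, 5], [0, 3]]    # columns A and D
```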

4. Boolean indexing

This is the equivalent of NumPy's masking operations.

df

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726
2019-01-04  0.305088  0.535920 -0.978434  0.177251
2019-01-05  0.313383  0.234041  0.163155 -0.296649
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342
df > 0

                A      B      C      D
2019-01-01  False  False   True  False
2019-01-02  False  False   True  False
2019-01-03  False   True   True   True
2019-01-04   True   True  False   True
2019-01-05   True   True   True  False
2019-01-06   True  False  False  False
df[df > 0]

                   A         B         C         D
2019-01-01       NaN       NaN  0.925984       NaN
2019-01-02       NaN       NaN  1.080779       NaN
2019-01-03       NaN  0.058118  1.102248  1.207726
2019-01-04  0.305088  0.535920       NaN  0.177251
2019-01-05  0.313383  0.234041  0.163155       NaN
2019-01-06  0.250613       NaN       NaN       NaN

Positions where the mask is True keep their values; positions where it is False become NaN.

df.A > 0

2019-01-01    False
2019-01-02    False
2019-01-03    False
2019-01-04     True
2019-01-05     True
2019-01-06     True
Freq: D, Name: A, dtype: bool

df[df.A > 0]

                   A         B         C         D
2019-01-04  0.305088  0.535920 -0.978434  0.177251
2019-01-05  0.313383  0.234041  0.163155 -0.296649
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342
  • The isin() method
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

                   A         B         C         D      E
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969    one
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395    one
2019-01-03 -0.141572  0.058118  1.102248  1.207726    two
2019-01-04  0.305088  0.535920 -0.978434  0.177251  three
2019-01-05  0.313383  0.234041  0.163155 -0.296649   four
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  three
ind = df2["E"].isin(["two", "four"])
ind     

2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
Freq: D, Name: E, dtype: bool

df2[ind]

                   A         B         C         D     E
2019-01-03 -0.141572  0.058118  1.102248  1.207726   two
2019-01-05  0.313383  0.234041  0.163155 -0.296649  four
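The isin() filter can be reproduced on a toy column (hypothetical data, mirroring column E above):

```python
import pandas as pd

# Toy frame mirroring the article's column E
df2 = pd.DataFrame({"E": ["one", "one", "two", "three", "four", "three"]})

mask = df2["E"].isin(["two", "four"])   # boolean Series: True where E is 'two' or 'four'
filtered = df2[mask]                    # keeps only the matching rows
```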

(5) Assignment

df

  • Adding a new column to a DataFrame
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1

2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64

df["E"] = s1
df

                   A         B         C         D  E
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969  1
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395  2
2019-01-03 -0.141572  0.058118  1.102248  1.207726  3
2019-01-04  0.305088  0.535920 -0.978434  0.177251  4
2019-01-05  0.313383  0.234041  0.163155 -0.296649  5
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  6
  • Modifying values by assignment
df.loc["2019-01-01", "A"] = 0
df

                   A         B         C         D  E
2019-01-01  0.000000 -0.190742  0.925984 -0.818969  1
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395  2
2019-01-03 -0.141572  0.058118  1.102248  1.207726  3
2019-01-04  0.305088  0.535920 -0.978434  0.177251  4
2019-01-05  0.313383  0.234041  0.163155 -0.296649  5
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  6
df.iloc[0, 1] = 0
df

                   A         B         C         D  E
2019-01-01  0.000000  0.000000  0.925984 -0.818969  1
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395  2
2019-01-03 -0.141572  0.058118  1.102248  1.207726  3
2019-01-04  0.305088  0.535920 -0.978434  0.177251  4
2019-01-05  0.313383  0.234041  0.163155 -0.296649  5
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  6
df["D"] = np.array([5]*len(df))   # equivalent shorthand: df["D"] = 5
df

                   A         B         C  D  E
2019-01-01  0.000000  0.000000  0.925984  5  1
2019-01-02 -0.234414 -1.194674  1.080779  5  2
2019-01-03 -0.141572  0.058118  1.102248  5  3
2019-01-04  0.305088  0.535920 -0.978434  5  4
2019-01-05  0.313383  0.234041  0.163155  5  5
2019-01-06  0.250613 -0.904400 -0.858240  5  6
  • Replacing the index and columns
df.index = [i for i in range(len(df))]
df

          A         B         C  D  E
0  0.000000  0.000000  0.925984  5  1
1 -0.234414 -1.194674  1.080779  5  2
2 -0.141572  0.058118  1.102248  5  3
3  0.305088  0.535920 -0.978434  5  4
4  0.313383  0.234041  0.163155  5  5
5  0.250613 -0.904400 -0.858240  5  6
df.columns = [i for i in range(df.shape[1])]
df

          0         1         2  3  4
0  0.000000  0.000000  0.925984  5  1
1 -0.234414 -1.194674  1.080779  5  2
2 -0.141572  0.058118  1.102248  5  3
3  0.305088  0.535920 -0.978434  5  4
4  0.313383  0.234041  0.163155  5  5
5  0.250613 -0.904400 -0.858240  5  6
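Assigning to df.index or df.columns replaces every label at once. For selective relabeling, rename() is the usual alternative (a sketch with made-up labels; this method is not shown in the article):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Wholesale replacement, as in the text
df.columns = list(range(df.shape[1]))

# Selective relabeling; rename() returns a new frame by default
df2 = df.rename(columns={0: "first", 1: "second"})
```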

11.3 Numerical Operations and Statistical Analysis

1. Inspecting the data

import pandas as pd
import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051 -0.048964
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402

(1) Viewing the first rows

df.head()    # defaults to 5 rows; the count is configurable

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051 -0.048964
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
df.head(2)

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051 -0.048964
2019-01-02  1.371364 -0.121454 -0.299653  1.095375

(2) Viewing the last rows

df.tail()    # defaults to 5 rows

                   A         B         C         D
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402
df.tail(3) 

                   A         B         C         D
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402

(3) Viewing summary information

df.iloc[0, 3] = np.nan
df

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051       NaN
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes

2. NumPy universal functions also work on Pandas objects

(1) Vectorized operations

x = pd.DataFrame(np.arange(4).reshape(1, 4))
x

   0  1  2  3
0  0  1  2  3
x+5

   0  1  2  3
0  5  6  7  8
np.exp(x)

     0         1         2          3
0  1.0  2.718282  7.389056  20.085537
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y

   0  1  2  3
0  4  5  6  7
x*y

   0  1   2   3
0  0  5  12  21

(2) Matrix operations

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x

[30 rows × 30 columns of random single-digit integers; output abridged]

  • Transpose
z = x.T
z

[30 rows × 30 columns, the transpose of x; output abridged]

np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y

[30 rows × 30 columns of random single-digit integers; output abridged]

x.dot(y)

[30 rows × 30 columns, the matrix product of x and y; output abridged]

%timeit x.dot(y)

218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.dot(x, y)

81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  • The same computation compared between NumPy and Pandas
x1 = np.array(x)
x1

y1 = np.array(y)
y1

%timeit x1.dot(y1)

22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.dot(x1, y1)

22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.dot(x.values, y.values)

42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)

def f(x, y):
    # naive triple-loop matrix multiplication on nested lists
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res

%timeit f(x3, y3)

4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Generally speaking, pure computation runs faster in NumPy.

NumPy is geared toward computation; Pandas is geared toward data handling.

(3) Broadcasting

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x

   A  B  C
0  6  3  7
1  4  6  9
2  2  6  7
  • Row-wise broadcasting
x.iloc[0]

A    6
B    3
C    7
Name: 0, dtype: int32

x/x.iloc[0]

          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000
  • Column-wise broadcasting
x.A

0    6
1    4
2    2
Name: A, dtype: int32

x.div(x.A, axis=0)             # add sub div mul

     A    B         C
0  1.0  0.5  1.166667
1  1.0  1.5  2.250000
2  1.0  3.0  3.500000
x.div(x.iloc[0], axis=1)

          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000
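Both broadcasting directions can be sketched with the same values as the seeded example, written out literally:

```python
import numpy as np
import pandas as pd

# Same values as the seeded example above, as literals
x = pd.DataFrame(np.array([[6, 3, 7],
                           [4, 6, 9],
                           [2, 6, 7]]), columns=list("ABC"))

by_row = x / x.iloc[0]          # each row divided by the first row
by_col = x.div(x["A"], axis=0)  # each column divided by column A (aligned on the index)
```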

3. New usage patterns

(1) Index alignment

A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A

   A  B
0  3  7
1  2  1
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B

   A  B  C
0  7  5  1
1  4  0  9
2  5  8  0
  • Pandas automatically aligns the indexes of the two objects; positions missing from either side become np.nan
A+B

      A     B   C
0  10.0  12.0 NaN
1   6.0   1.0 NaN
2   NaN   NaN NaN
  • Missing values can instead be filled via fill_value
A.add(B, fill_value=0)

      A     B    C
0  10.0  12.0  1.0
1   6.0   1.0  9.0
2   5.0   8.0  0.0
A*B

      A     B   C
0  21.0  35.0 NaN
1   8.0   0.0 NaN
2   NaN   NaN NaN
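Index alignment and fill_value can be sketched as follows (literal values taken from the example above):

```python
import pandas as pd

A = pd.DataFrame({"A": [3, 2], "B": [7, 1]})
B = pd.DataFrame({"A": [7, 4, 5], "B": [5, 0, 8], "C": [1, 9, 0]})

plain = A + B                    # unmatched cells become NaN
filled = A.add(B, fill_value=0)  # missing entries are treated as 0 before adding
```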

(2) Statistics

  • Counting distinct values
y = np.random.randint(3, size=20)
y

array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])

np.unique(y)

array([0, 1, 2])

Counting with collections.Counter:

from collections import Counter
Counter(y)

Counter({2: 11, 1: 5, 0: 4})

y1 = pd.DataFrame(y, columns=["A"])
y1

    A
0   2
1   2
2   2
3   1
4   2
5   1
6   1
7   2
8   1
9   2
10  2
11  0
12  2
13  0
14  2
15  2
16  0
17  0
18  2
19  1

np.unique(y1)

Pandas also provides the value_counts() method:

y1["A"].value_counts()

2    11
1     5
0     4
Name: A, dtype: int64

  • Deriving a new column and sorting
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict) 

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }
GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info

          population    GDP
BeiJing         2154  30320
ShangHai        2424  32680
ShenZhen        1303  24222
HangZhou         981  13468
city_info["per_GDP"] = city_info["GDP"]/city_info["population"]
city_info

          population    GDP    per_GDP
BeiJing         2154  30320  14.076137
ShangHai        2424  32680  13.481848
ShenZhen        1303  24222  18.589409
HangZhou         981  13468  13.728848

Ascending sort:

city_info.sort_values(by="per_GDP")

          population    GDP    per_GDP
ShangHai        2424  32680  13.481848
HangZhou         981  13468  13.728848
BeiJing         2154  30320  14.076137
ShenZhen        1303  24222  18.589409

Descending sort:

city_info.sort_values(by="per_GDP", ascending=False)

          population    GDP    per_GDP
ShenZhen        1303  24222  18.589409
BeiJing         2154  30320  14.076137
HangZhou         981  13468  13.728848
ShangHai        2424  32680  13.481848

Sorting along an axis:

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data

    C   B   A  D
2   3  13  17  8
1   1  19  14  6
0  11   7  14  2

Sorting rows by index:

data.sort_index()

    C   B   A  D
0  11   7  14  2
1   1  19  14  6
2   3  13  17  8

Sorting columns by label:

data.sort_index(axis=1)

    A   B   C  D
2  17  13   3  8
1  14  19   1  6
0  14   7  11  2
data.sort_index(axis=1, ascending=False)

   D   C   B   A
2  8   3  13  17
1  6   1  19  14
0  2  11   7  14
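The sorting calls above can be sketched with the same values written out literally:

```python
import pandas as pd

# Same values as the example above, as literals
data = pd.DataFrame({"C": [3, 1, 11], "B": [13, 19, 7],
                     "A": [17, 14, 14], "D": [8, 6, 2]},
                    index=[2, 1, 0])

by_index = data.sort_index()                          # rows reordered 0, 1, 2
by_cols = data.sort_index(axis=1)                     # columns reordered A, B, C, D
by_value = data.sort_values(by="B", ascending=False)  # rows ordered by column B
```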
  • Statistical methods
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df

           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588

Count of non-null values:

df.count()

A    6
B    6
C    6
D    6
dtype: int64

Sum:

df.sum()

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

df.sum(axis=1)

0     7.947086
1    28.289408
2    -1.427283
3    -4.893571
4    20.476806
5     3.043402
dtype: float64

Maximum and minimum:

df.min()

A   -3.386712
B   -4.333177
C   -4.032613
D   -0.152567
dtype: float64

df.max(axis=1)

0     6.367969
1    13.113252
2     5.094802
3     2.537716
4     5.577329
5     6.398588
dtype: float64

df

           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588
df.idxmax()

A    1
B    1
C    4
D    5
dtype: int64

Mean:

df.mean()

A    2.378704
B    1.588916
C    0.655231
D    4.283124
dtype: float64

Variance:

df.var()

A    34.980702
B    19.110656
C    18.948144
D     6.726776
dtype: float64

Standard deviation:

df.std()

A    5.914449
B    4.371574
C    4.352947
D    2.593603
dtype: float64

Median:

df.median()

A    1.126767
B    1.531743
C    0.176172
D    5.273518
dtype: float64

Mode:

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data

   A  B
0  4  2
1  3  2
2  2  0
3  2  4
4  2  0
5  4  1
6  2  0
7  1  1
8  3  4
9  2  0
data.mode()

   A  B
0  2  0

75th percentile:

df.quantile(0.75)

A    3.539202
B    5.052594
C    4.539740
D    6.157738
Name: 0.75, dtype: float64

  • describe() returns all the summary statistics at once
df.describe()

               A         B         C         D
count   6.000000  6.000000  6.000000  6.000000
mean    2.378704  1.588916  0.655231  4.283124
std     5.914449  4.371574  4.352947  2.593603
min    -3.386712 -4.333177 -4.032613 -0.152567
25%    -1.256706 -1.265251 -2.925910  3.158284
50%     1.126767  1.531743  0.176172  5.273518
75%     3.539202  5.052594  4.539740  6.157738
max    13.113252  6.774559  5.577329  6.398588
data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2

   A  B  C  D
0  a  a  c  d
1  c  a  c  b
2  a  a  d  c
  • describe() for string columns
data_2.describe()

        A  B  C  D
count   3  3  3  3
unique  2  1  2  3
top     a  a  c  d
freq    2  3  2  1

Correlation coefficients and covariance:

df.corr()

          A         B         C         D
A  1.000000  0.831063  0.331060  0.510821
B  0.831063  1.000000  0.179244  0.719112
C  0.331060  0.179244  1.000000 -0.450365
D  0.510821  0.719112 -0.450365  1.000000
df.corrwith(df["A"])

A    1.000000
B    0.831063
C    0.331060
D    0.510821
dtype: float64

Custom output

apply(method) applies the given method to each column by default (use axis=1 for rows).

df

           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588
df.apply(np.cumsum)

           A          B         C          D
0   1.082198   3.557396 -3.060476   6.367969
1  14.195450  10.331955 -0.185923  11.895013
2  12.159109   5.998778  4.908878  11.742447
3   8.772397   4.476413  2.386669  14.280162
4  13.100888  10.027406  7.963999  19.300153
5  14.272224   9.533497  3.931385  25.698741
df.apply(np.cumsum, axis=1)

           A          B          C          D
0   1.082198   4.639594   1.579117   7.947086
1  13.113252  19.887811  22.762364  28.289408
2  -2.036341  -6.369518  -1.274717  -1.427283
3  -3.386712  -4.909077  -7.431287  -4.893571
4   4.328491   9.879485  15.456814  20.476806
5   1.171336   0.677427  -3.355186   3.043402
df.apply(sum)

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

df.sum()

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

df.apply(lambda x: x.max()-x.min())

A    16.499965
B    11.107736
C     9.609942
D     6.551155
dtype: float64

def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)

                A         B         C         D
Count    6.000000  6.000000  6.000000  6.000000
mean     2.378704  1.588916  0.655231  4.283124
max     13.113252  6.774559  5.577329  6.398588
idxmin   3.000000  2.000000  5.000000  2.000000
std      5.914449  4.371574  4.352947  2.593603

11.4 Handling Missing Values

1. Detecting missing values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data

     A    B     C
0    1  NaN     2
1  NaN    3     4
2    5    6  None

Note: when None, strings, and similar objects are present, every column's dtype becomes object, which consumes more resources than int or float.

np.nan is a special floating-point value, so NaN is the preferred way to represent missing data.

data.dtypes

A    object
B    object
C    object
dtype: object

data.isnull()

       A      B      C
0  False   True  False
1   True  False  False
2  False  False   True
data.notnull()

       A      B      C
0   True  False   True
1  False   True   True
2   True   True  False
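Detection can be sketched on a toy frame with two missing cells (made-up values):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing cells
data = pd.DataFrame({"A": [1.0, np.nan, 5.0],
                     "B": [np.nan, 3.0, 6.0]})

mask = data.isnull()              # True marks a missing value
n_missing = int(mask.sum().sum())  # total count of NaN cells
```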

2. Dropping missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data

      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   4.0   5.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0

Note: np.nan is a special floating-point value.

data.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

(1) Dropping whole rows

data.dropna()

      A     B     C     D
3  10.0  11.0  12.0  13.0

(2) Dropping whole columns

data.dropna(axis="columns")

      D
0   3.0
1   6.0
2   9.0
3  13.0
data["D"] = np.nan
data

      A     B     C   D
0   1.0   NaN   2.0 NaN
1   NaN   4.0   5.0 NaN
2   7.0   8.0   NaN NaN
3  10.0  11.0  12.0 NaN
data.dropna(axis="columns", how="all")

      A     B     C
0   1.0   NaN   2.0
1   NaN   4.0   5.0
2   7.0   8.0   NaN
3  10.0  11.0  12.0

how="all" drops a column only when every value in it is missing.

data.dropna(axis="columns", how="any")

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
data.loc[3] = np.nan
data

     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  4.0  5.0 NaN
2  7.0  8.0  NaN NaN
3  NaN  NaN  NaN NaN
data.dropna(how="all")

     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  4.0  5.0 NaN
2  7.0  8.0  NaN NaN
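The two column-dropping modes can be sketched with literal values:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, np.nan, 2.0, np.nan],
                     [np.nan, 4.0, 5.0, np.nan],
                     [7.0, 8.0, np.nan, np.nan]], columns=list("ABCD"))

drop_any = data.dropna(axis="columns", how="any")  # drops every column containing a NaN
drop_all = data.dropna(axis="columns", how="all")  # drops only the all-NaN column D
```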

3. Filling missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data

      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   4.0   5.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0
data.fillna(value=5)

      A     B     C     D
0   1.0   5.0   2.0   3.0
1   5.0   4.0   5.0   6.0
2   7.0   8.0   5.0   9.0
3  10.0  11.0  12.0  13.0
  • Filling with the column means
fill = data.mean()
fill

A    6.000000
B    7.666667
C    6.333333
D    7.750000
dtype: float64

data.fillna(value=fill)

      A          B          C     D
0   1.0   7.666667   2.000000   3.0
1   6.0   4.000000   5.000000   6.0
2   7.0   8.000000   6.333333   9.0
3  10.0  11.000000  12.000000  13.0

To fill with the mean of all the data, first flatten with stack(), then fill:

fill = data.stack().mean()
fill

7.0

data.fillna(value=fill)

      A     B     C     D
0   1.0   7.0   2.0   3.0
1   7.0   4.0   5.0   6.0
2   7.0   8.0   7.0   9.0
3  10.0  11.0  12.0  13.0
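The two filling strategies can be sketched on a 2×2 toy frame (made-up values):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, np.nan],
                     [3.0, 4.0]], columns=list("AB"))

col_filled = data.fillna(data.mean())  # each column filled with its own mean

# stack() drops NaN while flattening, so its mean covers every non-missing cell
overall = data.stack().mean()          # (1 + 3 + 4) / 3
all_filled = data.fillna(overall)
```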

11.5 Merging Data

  • A helper function that builds a DataFrame
import pandas as pd
import numpy as np

def make_df(cols, ind):
    "A simple DataFrame"
    data = {c: [str(c)+str(i) for i in ind]  for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC", range(3))

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2
  • Vertical concatenation
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4

pd.concat([df_1, df_2])

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
  • Horizontal concatenation
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)

    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1

pd.concat([df_3, df_4], axis=1)

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
  • Overlapping indexes

Overlapping rows:

df_5 = make_df("AB", [1, 2])
df_6 = make_df("AB", [1, 2])
print(df_5)
print(df_6)

    A   B
1  A1  B1
2  A2  B2
    A   B
1  A1  B1
2  A2  B2

pd.concat([df_5, df_6])

    A   B
1  A1  B1
2  A2  B2
1  A1  B1
2  A2  B2
pd.concat([df_5, df_6],ignore_index=True)

    A   B
0  A1  B1
1  A2  B2
2  A1  B1
3  A2  B2

Overlapping columns:

df_7 = make_df("ABC", [1, 2])
df_8 = make_df("BCD", [1, 2])
print(df_7)
print(df_8)

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
1  B1  C1  D1
2  B2  C2  D2

pd.concat([df_7, df_8], axis=1)

    A   B   C   B   C   D
1  A1  B1  C1  B1  C1  D1
2  A2  B2  C2  B2  C2  D2
pd.concat([df_7, df_8], axis=1, ignore_index=True)

    0   1   2   3   4   5
1  A1  B1  C1  B1  C1  D1
2  A2  B2  C2  B2  C2  D2
  • Aligned merging with merge()
df_9 = make_df("AB", [1, 2])
df_10 = make_df("BC", [1, 2])
print(df_9)
print(df_10)

    A   B
1  A1  B1
2  A2  B2
    B   C
1  B1  C1
2  B2  C2

pd.merge(df_9, df_10)

    A   B   C
0  A1  B1  C1
1  A2  B2  C2
df_9 = make_df("AB", [1, 2])
df_10 = make_df("CB", [2, 1])
print(df_9)
print(df_10)

    A   B
1  A1  B1
2  A2  B2
    C   B
2  C2  B2
1  C1  B1

pd.merge(df_9, df_10)

    A   B   C
0  A1  B1  C1
1  A2  B2  C2

[Example] Merging city information

population_dict = {"city": ("BeiJing", "HangZhou", "ShenZhen"),
                   "pop": (2154, 981, 1303)}
population = pd.DataFrame(population_dict)
population

       city   pop
0   BeiJing  2154
1  HangZhou   981
2  ShenZhen  1303
GDP_dict = {"city": ("BeiJing", "ShangHai", "HangZhou"),
            "GDP": (30320, 32680, 13468)}
GDP = pd.DataFrame(GDP_dict)
GDP

       city    GDP
0   BeiJing  30320
1  ShangHai  32680
2  HangZhou  13468
city_info = pd.merge(population, GDP)
city_info

       city   pop    GDP
0   BeiJing  2154  30320
1  HangZhou   981  13468

The default merge keeps only the intersection of the keys; how="outer" takes the union:

city_info = pd.merge(population, GDP, how="outer")
city_info

       city     pop      GDP
0   BeiJing  2154.0  30320.0
1  HangZhou   981.0  13468.0
2  ShenZhen  1303.0      NaN
3  ShangHai     NaN  32680.0
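The inner/outer distinction can be sketched end to end with the same city data:

```python
import pandas as pd

population = pd.DataFrame({"city": ["BeiJing", "HangZhou", "ShenZhen"],
                           "pop": [2154, 981, 1303]})
GDP = pd.DataFrame({"city": ["BeiJing", "ShangHai", "HangZhou"],
                    "GDP": [30320, 32680, 13468]})

inner = pd.merge(population, GDP)               # default how="inner": key intersection
outer = pd.merge(population, GDP, how="outer")  # key union; unmatched cells become NaN
```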

11.6 Grouping and Pivot Tables

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
                  "data1": range(6),
                  "data2": np.random.randint(0, 10, size=6)})
df

  key  data1  data2
0   A      0      1
1   B      1      4
2   C      2      9
3   C      3      9
4   B      4      1
5   A      5      9

(1) Grouping

  • Lazy evaluation
df.groupby("key")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>

The grouping has been set up; nothing is computed or displayed until we apply a method to it.

df.groupby("key").sum()

     data1  data2
key
A        5     10
B        5      6
C        5     11
df.groupby("key").mean()

     data1  data2
key
A      2.5    5.0
B      2.5    3.0
C      2.5    5.5

We can print the groupby object's contents to see what it holds:

for i in df.groupby("key"):
    print(str(i))

('A',   key  data1  data2
0   A      0      2
5   A      5      8)
('B',   key  data1  data2
1   B      1      2
4   B      4      4)
('C',   key  data1  data2
2   C      2      8
3   C      3      3)

  • Selecting a column
df.groupby("key")["data2"].sum()

key
A    10
B     6
C    11
Name: data2, dtype: int32

  • Iterating over groups
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))

A     shape=(2, 3)
B     shape=(2, 3)
C     shape=(2, 3)

  • Calling methods
df.groupby("key")["data1"].describe()

     count  mean       std  min   25%  50%   75%  max
key
A      2.0   2.5  3.535534  0.0  1.25  2.5  3.75  5.0
B      2.0   2.5  2.121320  1.0  1.75  2.5  3.25  4.0
C      2.0   2.5  0.707107  2.0  2.25  2.5  2.75  3.0
  • More complex operations are supported
df.groupby("key").aggregate(["min", "median", "max"])

    data1            data2
      min median max   min median max
key
A       0    2.5   5     2    5.0   8
B       1    2.5   4     2    3.0   4
C       2    2.5   3     3    5.5   8
  • Filtering
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()

key
A    4.242641
B    1.414214
C    3.535534
Name: data2, dtype: float64

df.groupby("key").filter(filter_func)

  key  data1  data2
0   A      0      2
2   C      2      8
3   C      3      3
5   A      5      8
  • Transformation
df

  key  data1  data2
0   A      0      2
1   B      1      2
2   C      2      8
3   C      3      3
4   B      4      4
5   A      5      8
df.groupby("key").transform(lambda x: x-x.mean())

   data1  data2
0   -2.5   -3.0
1   -1.5   -1.0
2   -0.5    2.5
3    0.5   -2.5
4    1.5    1.0
5    2.5    3.0
df

  key  data1  data2
0   A      0      1
1   B      1      4
2   C      2      9
3   C      3      9
4   B      4      1
5   A      5      9
df.groupby("key").apply(lambda x: x-x.mean())

   data1  data2
0   -2.5   -4.0
1   -1.5    1.5
2   -0.5    0.0
3    0.5    0.0
4    1.5   -1.5
5    2.5    4.0
  • The apply() method
df

  key  data1  data2
0   A      0      2
1   B      1      2
2   C      2      8
3   C      3      3
4   B      4      4
5   A      5      8
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x

df.groupby("key").apply(norm_by_data2)

  key     data1  data2
0   A  0.000000      2
1   B  0.166667      2
2   C  0.181818      8
3   C  0.272727      3
4   B  0.666667      4
5   A  0.500000      8
  • Using a list or array as the grouping key

Here L acts as a new set of group labels replacing the original row labels.

L = [0, 1, 0, 1, 2, 0]
df

  key  data1  data2
0   A      0      2
1   B      1      2
2   C      2      8
3   C      3      3
4   B      4      4
5   A      5      8
df.groupby(L).sum()

   data1  data2
0      7     18
1      4      5
2      4      4
  • Mapping the index to groups with a dict
df2 = df.set_index("key")
df2

     data1  data2
key
A        0      2
B        1      2
C        2      8
C        3      3
B        4      4
A        5      8
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()

          data1  data2
constant     10     17
first         5     10
  • Any Python function of the index
df2.groupby(str.lower).mean()

   data1  data2
a    2.5    5.0
b    2.5    3.0
c    2.5    5.5
  • A list of multiple valid keys

Rows fall into the same group only when all the keys match.

df2.groupby([str.lower, mapping]).mean()

            data1  data2
a first       2.5    5.0
b constant    2.5    3.0
c constant    2.5    5.5
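The different grouping keys above can be sketched together on a toy frame mirroring df2:

```python
import pandas as pd

# Toy frame mirroring df2 above (key column moved to the index)
df2 = pd.DataFrame({"data1": [0, 1, 2, 3, 4, 5],
                    "data2": [2, 2, 8, 3, 4, 8]},
                   index=["A", "B", "C", "C", "B", "A"])

mapping = {"A": "first", "B": "constant", "C": "constant"}
by_dict = df2.groupby(mapping).sum()                # dict maps index labels to group names
by_func = df2.groupby(str.lower).mean()             # any function of the label works too
by_both = df2.groupby([str.lower, mapping]).mean()  # rows must match on every key
```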

[Example 1] Processing planet observation data

import seaborn as sns

planets = sns.load_dataset("planets")

planets.shape

(1035, 6)

planets.head()

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
planets.describe()

            number  orbital_period        mass     distance         year
count  1035.000000      992.000000  513.000000   808.000000  1035.000000
mean      1.785507     2002.917596    2.638161   264.069282  2009.070531
std       1.240976    26014.728304    3.818617   733.116493     3.972567
min       1.000000        0.090706    0.003600     1.350000  1989.000000
25%       1.000000        5.442540    0.229000    32.560000  2007.000000
50%       1.000000       39.979500    1.260000    55.250000  2010.000000
75%       2.000000      526.005000    3.040000   178.500000  2012.000000
max       7.000000   730000.000000   25.000000  8500.000000  2014.000000
planets.head()

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
decade = 10 * (planets["year"] // 10)
decade.head()

0    2000
1    2000
2    2010
3    2000
4    2000
Name: year, dtype: int64

decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()

0    2000s
1    2000s
2    2010s
3    2000s
4    2000s
Name: decade, dtype: object

planets.head()

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009
planets.groupby(["method", decade]).sum()

                                      number  orbital_period       mass   distance    year
method                        decade
Astrometry                    2010s        2    1.262360e+03    0.00000      35.75    4023
Eclipse Timing Variations     2000s        5    1.930800e+04    6.05000     261.44    6025
                              2010s       10    2.345680e+04    4.20000    1000.00   12065
Imaging                       2000s       29    1.350935e+06    0.00000     956.83   40139
                              2010s       21    6.803750e+04    0.00000    1210.08   36208
Microlensing                  2000s       12    1.732500e+04    0.00000       0.00   20070
                              2010s       15    4.750000e+03    0.00000   41440.00   26155
Orbital Brightness Modulation 2010s        5    2.127920e+00    0.00000    2360.00    6035
Pulsar Timing                 1990s        9    1.900153e+02    0.00000       0.00    5978
                              2000s        1    3.652500e+04    0.00000       0.00    2003
                              2010s        1    9.070629e-02    0.00000    1200.00    2011
Pulsation Timing Variations   2000s        1    1.170000e+03    0.00000       0.00    2007
Radial Velocity               1980s        1    8.388800e+01   11.68000      40.57    1989
                              1990s       52    1.091561e+04   68.17820     723.71   55943
                              2000s      475    2.633526e+05  945.31928   15201.16  619775
                              2010s      424    1.809630e+05  316.47890   11382.67  432451
Transit                       2000s       64    2.897102e+02    0.00000   31823.31  124462
                              2010s      712    8.087813e+03    1.47000  102419.46  673999
Transit Timing Variations     2010s        9    2.393505e+02    0.00000    3313.00    8050

Double brackets [[]] return a DataFrame, while single brackets [] return a Series; the DataFrame form displays more nicely.

planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)

                              number
decade                         1980s  ...
method
Astrometry                       0.0
Eclipse Timing Variations        0.0
Imaging                          0.0
Microlensing                     0.0
Orbital Brightness Modulation    0.0
Pulsar Timing                    0.0
Pulsation Timing Variations      0.0
Radial Velocity                  1.0
Transit                          0.0
Transit Timing Variations        0.0

(2) Pivot tables

[Example 2] Titanic passenger data analysis

import seaborn as sns

titanic = sns.load_dataset("titanic")

titanic.head()

   survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone
0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False
1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False
2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True
3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False
4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True
T = titanic[titanic.age.notnull()].copy()

T.age = T.age.apply(lambda x: 60 if x >= 60 else x)   # cap ages at 60 (assign the result back)
T.age.value_counts()

24.00    30
22.00    27
60.00    26
18.00    26
28.00    25
30.00    25
19.00    25
21.00    24
25.00    23
36.00    22
29.00    20
35.00    18
32.00    18
27.00    18
26.00    18
31.00    17
16.00    17
34.00    15
20.00    15
33.00    15
23.00    15
39.00    14
40.00    13
17.00    13
42.00    13
45.00    12
38.00    11
4.00     10
50.00    10
2.00     10
         ..
8.00      4
5.00      4
11.00     4
6.00      3
7.00      3
46.00     3
30.50     2
57.00     2
0.83      2
55.00     2
10.00     2
59.00     2
13.00     2
28.50     2
40.50     2
45.50     2
0.75      2
32.50     2
34.50     1
55.50     1
0.92      1
36.50     1
12.00     1
53.00     1
14.50     1
0.67      1
20.50     1
23.50     1
24.50     1
0.42      1
Name: age, Length: 77, dtype: int64

Age = 10*(T["age"]//10)
Age = Age.astype(int)
Age.head()
Age.value_counts()

20    220
30    167
10    102
40     89
0      62
50     48
60     26
Name: age, dtype: int64

Age.astype(str)+"s"

0      20s
1      30s
2      20s
3      30s
4      30s
6      50s
7       0s
8      20s
9      10s
10      0s
11     50s
12     20s
13     30s
14     10s
15     50s
16      0s
18     30s
20     30s
21     30s
22     10s
23     20s
24      0s
25     30s
27     10s
30     40s
33     60s
34     20s
35     40s
37     20s
38     10s
      ... 
856    40s
857    50s
858    20s
860    40s
861    20s
862    40s
864    20s
865    40s
866    20s
867    30s
869     0s
870    20s
871    40s
872    30s
873    40s
874    20s
875    10s
876    20s
877    10s
879    50s
880    20s
881    30s
882    20s
883    20s
884    20s
885    30s
886    20s
887    10s
889    20s
890    30s
Name: age, Length: 714, dtype: object

T.groupby(["sex", Age])["survived"].mean().unstack()

age            0        10        20        30        40        50        60
sex
female  0.633333  0.755556  0.722222  0.833333  0.687500  0.888889  1.000000
male    0.593750  0.122807  0.168919  0.214953  0.210526  0.133333  0.136364
T.age = Age
T.pivot_table("survived", index="sex", columns="age")

age            0        10        20        30        40        50        60
sex
female  0.633333  0.755556  0.722222  0.833333  0.687500  0.888889  1.000000
male    0.593750  0.122807  0.168919  0.214953  0.210526  0.133333  0.136364
titanic.describe()

         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
titanic.groupby("sex")[["survived"]].mean()

        survived
sex
female  0.742038
male    0.188908
titanic.groupby("sex")["survived"].mean()

sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

titanic.groupby(["sex", "class"])["survived"].aggregate("mean").unstack()

class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
  • Pivot tables: a more intuitive way to get the same result.
titanic.pivot_table("survived", index="sex", columns="class")   # returns the mean by default

class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447
titanic.pivot_table("survived", index="sex", columns="class", aggfunc="mean", margins=True)   # aggfunc="mean" is the default; margins=True adds overall row/column totals

class      First    Second     Third       All
sex
female  0.968085  0.921053  0.500000  0.742038
male    0.368852  0.157407  0.135447  0.188908
All     0.629630  0.472826  0.242363  0.383838
titanic.pivot_table(index="sex", columns="class", aggfunc={"survived": "sum", "fare": "mean"})   # map each value column to its aggregation method

              fare             ...  survived
class        First     Second  ...
sex
female  106.125798  21.970121  ...
male     67.226127  19.741782  ...
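The pivot_table calls above can be sketched on a hypothetical four-row mini-Titanic (made-up values, just to show the API shape):

```python
import pandas as pd

# Hypothetical mini-Titanic, not the seaborn dataset
df = pd.DataFrame({"sex": ["female", "male", "female", "male"],
                   "class": ["First", "First", "Third", "Third"],
                   "survived": [1, 0, 1, 0],
                   "fare": [71.28, 8.05, 7.93, 8.05]})

# aggfunc defaults to "mean": mean survival rate per sex/class cell
table = df.pivot_table("survived", index="sex", columns="class")

# A dict assigns a different aggregation to each value column
multi = df.pivot_table(index="sex", aggfunc={"survived": "sum", "fare": "mean"})
```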

11.7 Miscellaneous

(1) Vectorized string operations

(2) Time series handling

(3) Multi-level indexing for multi-dimensional data

base_data = np.array([[1771, 11115 ],
                      [2154, 30320],
                      [2141, 14070],
                      [2424, 32680],
                      [1077, 7806],
                      [1303, 24222],
                      [798, 4789],
                      [981, 13468]]) 
data = pd.DataFrame(base_data, index=[["BeiJing","BeiJing","ShangHai","ShangHai","ShenZhen","ShenZhen","HangZhou","HangZhou"]\
                                     , [2008, 2018]*4], columns=["population", "GDP"])
data

               population    GDP
BeiJing  2008        1771  11115
         2018        2154  30320
ShangHai 2008        2141  14070
         2018        2424  32680
ShenZhen 2008        1077   7806
         2018        1303  24222
HangZhou 2008         798   4789
         2018         981  13468
