Python Basics (11): A Detailed 30,000-Word Summary of the Pandas Library

          population    GDP
BeiJing         2154  30320
ShangHai        2424  32680
ShenZhen        1303  24222
HangZhou         981  13468

Note: a value that is shorter than the others (such as a scalar) is automatically broadcast to fill the whole column.

pd.DataFrame({"population": population,
              "GDP": GDP,
              "country": "China"})

          population    GDP country
BeiJing         2154  30320   China
ShangHai        2424  32680   China
ShenZhen        1303  24222   China
HangZhou         981  13468   China
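As a quick check (a minimal sketch, assuming the population and GDP Series defined earlier): a scalar such as "China" is broadcast to every row, while a plain list must match the index length.

pd.DataFrame({"population": population, "GDP": GDP, "country": "China"})   # the scalar "China" fills every row
# pd.DataFrame({"population": population, "country": ["China", "USA"]})    # a 2-element list would raise a ValueError here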

3. Creating from a list of dictionaries

  • The list positions become the index, and the dictionary keys become the columns
import numpy as np
import pandas as pd

data = [{"a": i, "b": 2\*i} for i in range(3)]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

data = pd.DataFrame(data)
data

   a  b
0  0  0
1  1  2
2  2  4

No row labels were given, so the rows default to integers starting at 0; the column labels come from the dictionary keys.

  • Extracting a single column
data1 = data["a"].copy()
data1

0    0
1    1
2    2
Name: a, dtype: int64

data1[0] = 10
data1

0    10
1     1
2     2
Name: a, dtype: int64

data

   a  b
0  0  0
1  1  2
2  2  4

Because data1 was created with .copy(), modifying it does not affect the original DataFrame.

  • Missing keys default to NaN
data = [{"a": 1, "b":1},{"b": 3, "c":4}]
data

[{'a': 1, 'b': 1}, {'b': 3, 'c': 4}]

pd.DataFrame(data)

     a  b    c
0  1.0  1  NaN
1  NaN  3  4.0

4. Creating from a 2-D NumPy array

data = np.random.randint(10, size=(3, 2))
data

array([[1, 6],
       [2, 9],
       [4, 0]])

pd.DataFrame(data, columns=["foo", "bar"], index=["a", "b", "c"])

   foo  bar
a    1    6
b    2    9
c    4    0

11.2 Properties of DataFrame

1. Attributes

data = pd.DataFrame({"pop": population, "GDP": GDP})
data

           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468

(1) df.values returns the underlying data as a NumPy array

data.values

array([[ 2154, 30320],
       [ 2424, 32680],
       [ 1303, 24222],
       [  981, 13468]], dtype=int64)

(2) df.index returns the row index

data.index

Index(['BeiJing', 'ShangHai', 'ShenZhen', 'HangZhou'], dtype='object')

(3) df.columns returns the column index

data.columns

Index(['pop', 'GDP'], dtype='object')

(4) df.shape — the shape

data.shape

(4, 2)

(5) df.size — the number of elements

data.size

8

(6) df.dtypes returns the data type of each column

data.dtypes

pop    int64
GDP    int64
dtype: object

2. Indexing

data

           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468

(1) Selecting columns

  • Dictionary-style access
data["pop"]

BeiJing     2154
ShangHai    2424
ShenZhen    1303
HangZhou     981
Name: pop, dtype: int64

data[["GDP", "pop"]]

            GDP   pop
BeiJing   30320  2154
ShangHai  32680  2424
ShenZhen  24222  1303
HangZhou  13468   981

  • Attribute-style access
data.GDP

BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
Name: GDP, dtype: int64

(2) Selecting rows

  • Label-based (absolute) indexing: df.loc
data.loc["BeiJing"]

pop     2154
GDP    30320
Name: BeiJing, dtype: int64

data.loc[["BeiJing", "HangZhou"]]

           pop    GDP
BeiJing   2154  30320
HangZhou   981  13468

  • Position-based (relative) indexing: df.iloc
data

           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
data.iloc[0]

pop     2154
GDP    30320
Name: BeiJing, dtype: int64

data.iloc[[1, 3]]

           pop    GDP
ShangHai  2424  32680
HangZhou   981  13468

(3) Selecting a scalar

data

           pop    GDP
BeiJing   2154  30320
ShangHai  2424  32680
ShenZhen  1303  24222
HangZhou   981  13468
data.loc["BeiJing", "GDP"]

30320

data.iloc[0, 1]

30320

data.values[0][1]

30320
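For single-element access, the faster scalar accessors .at (label-based) and .iat (position-based) can also be used; a minimal sketch with the same data:

data.at["BeiJing", "GDP"]    # 30320, label-based scalar access
data.iat[0, 1]               # 30320, position-based scalar access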

(4) Indexing a Series object

type(data.GDP)

pandas.core.series.Series

GDP

BeiJing     30320
ShangHai    32680
ShenZhen    24222
HangZhou    13468
dtype: int64

GDP["BeiJing"]

30320

3. Slicing

dates = pd.date_range(start='2019-01-01', periods=6)
dates

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726
2019-01-04  0.305088  0.535920 -0.978434  0.177251
2019-01-05  0.313383  0.234041  0.163155 -0.296649
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342

(1) Row slices

df["2019-01-01": "2019-01-03"]

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726
df.loc["2019-01-01": "2019-01-03"]

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726
df.iloc[0: 3]

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726

Note: with iloc the endpoint 3 is excluded (position-based slices are half-open), whereas label-based slices with loc include both endpoints.

(2) Column slices

df

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726
2019-01-04  0.305088  0.535920 -0.978434  0.177251
2019-01-05  0.313383  0.234041  0.163155 -0.296649
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342
df.loc[:, "A": "C"]

                   A         B         C
2019-01-01 -0.935378 -0.190742  0.925984
2019-01-02 -0.234414 -1.194674  1.080779
2019-01-03 -0.141572  0.058118  1.102248
2019-01-04  0.305088  0.535920 -0.978434
2019-01-05  0.313383  0.234041  0.163155
2019-01-06  0.250613 -0.904400 -0.858240
df.iloc[:, 0: 3]

                   A         B         C
2019-01-01 -0.935378 -0.190742  0.925984
2019-01-02 -0.234414 -1.194674  1.080779
2019-01-03 -0.141572  0.058118  1.102248
2019-01-04  0.305088  0.535920 -0.978434
2019-01-05  0.313383  0.234041  0.163155
2019-01-06  0.250613 -0.904400 -0.858240

(3) Mixed selection styles

df

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726
2019-01-04  0.305088  0.535920 -0.978434  0.177251
2019-01-05  0.313383  0.234041  0.163155 -0.296649
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342

  • Slicing rows and columns at the same time
df.loc["2019-01-02": "2019-01-03", "C":"D"]

                   C         D
2019-01-02  1.080779 -2.294395
2019-01-03  1.102248  1.207726
df.iloc[1: 3, 2:]

                   C         D
2019-01-02  1.080779 -2.294395
2019-01-03  1.102248  1.207726

  • Row slice with scattered column selection
df.loc["2019-01-04": "2019-01-06", ["A", "C"]]

                   A         C
2019-01-04  0.305088 -0.978434
2019-01-05  0.313383  0.163155
2019-01-06  0.250613 -0.858240
df.iloc[3:, [0, 2]]

                   A         C
2019-01-04  0.305088 -0.978434
2019-01-05  0.313383  0.163155
2019-01-06  0.250613 -0.858240

  • Scattered row selection with a column slice
df.loc[["2019-01-02", "2019-01-06"], "C": "D"]

This approach does not work here: the plain date strings in a list are not matched against the DatetimeIndex.

df.iloc[[1, 5], 0: 3]

                   A         B         C
2019-01-02 -0.234414 -1.194674  1.080779
2019-01-06  0.250613 -0.904400 -0.858240

  • Scattered selection of both rows and columns
df.loc[["2019-01-04", "2019-01-06"], ["A", "D"]]

Again, this approach does not work, for the same reason; a workaround is sketched after this example.

df.iloc[[1, 5], [0, 3]]

                   A         D
2019-01-02 -0.234414 -2.294395
2019-01-06  0.250613 -1.573342
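If label-based scattered selection on the DatetimeIndex is what you want, converting the date strings to timestamps first should work (a minimal sketch, assuming the df above):

rows = pd.to_datetime(["2019-01-04", "2019-01-06"])
df.loc[rows, ["A", "D"]]     # scattered label-based selection after converting the strings to Timestamps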

4. Boolean indexing

This is equivalent to masking in NumPy.

df

                   A         B         C         D
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395
2019-01-03 -0.141572  0.058118  1.102248  1.207726
2019-01-04  0.305088  0.535920 -0.978434  0.177251
2019-01-05  0.313383  0.234041  0.163155 -0.296649
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342
df > 0

                A      B      C      D
2019-01-01  False  False   True  False
2019-01-02  False  False   True  False
2019-01-03  False   True   True   True
2019-01-04   True   True  False   True
2019-01-05   True   True   True  False
2019-01-06   True  False  False  False
df[df > 0]

                   A         B         C         D
2019-01-01       NaN       NaN  0.925984       NaN
2019-01-02       NaN       NaN  1.080779       NaN
2019-01-03       NaN  0.058118  1.102248  1.207726
2019-01-04  0.305088  0.535920       NaN  0.177251
2019-01-05  0.313383  0.234041  0.163155       NaN
2019-01-06  0.250613       NaN       NaN       NaN

As you can see, values where the mask is True are kept, while the False positions become NaN.

df.A > 0

2019-01-01    False
2019-01-02    False
2019-01-03    False
2019-01-04     True
2019-01-05     True
2019-01-06     True
Freq: D, Name: A, dtype: bool

df[df.A > 0]

                   A         B         C         D
2019-01-04  0.305088  0.535920 -0.978434  0.177251
2019-01-05  0.313383  0.234041  0.163155 -0.296649
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342
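Several boolean conditions can be combined with & (and) and | (or), each wrapped in parentheses (a minimal sketch, assuming the df above):

df[(df.A > 0) & (df.D < 0)]    # rows where A is positive and D is negative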
  • The isin() method
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
df2

                   A         B         C         D      E
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969    one
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395    one
2019-01-03 -0.141572  0.058118  1.102248  1.207726    two
2019-01-04  0.305088  0.535920 -0.978434  0.177251  three
2019-01-05  0.313383  0.234041  0.163155 -0.296649   four
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  three
ind = df2["E"].isin(["two", "four"])
ind     

2019-01-01    False
2019-01-02    False
2019-01-03     True
2019-01-04    False
2019-01-05     True
2019-01-06    False
Freq: D, Name: E, dtype: bool

df2[ind]

                   A         B         C         D     E
2019-01-03 -0.141572  0.058118  1.102248  1.207726   two
2019-01-05  0.313383  0.234041  0.163155 -0.296649  four

5. Assignment

df

  • Adding a new column to a DataFrame
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20190101', periods=6))
s1

2019-01-01    1
2019-01-02    2
2019-01-03    3
2019-01-04    4
2019-01-05    5
2019-01-06    6
Freq: D, dtype: int64

df["E"] = s1
df

                   A         B         C         D  E
2019-01-01 -0.935378 -0.190742  0.925984 -0.818969  1
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395  2
2019-01-03 -0.141572  0.058118  1.102248  1.207726  3
2019-01-04  0.305088  0.535920 -0.978434  0.177251  4
2019-01-05  0.313383  0.234041  0.163155 -0.296649  5
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  6

  • Modifying values by assignment
df.loc["2019-01-01", "A"] = 0
df

                   A         B         C         D  E
2019-01-01  0.000000 -0.190742  0.925984 -0.818969  1
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395  2
2019-01-03 -0.141572  0.058118  1.102248  1.207726  3
2019-01-04  0.305088  0.535920 -0.978434  0.177251  4
2019-01-05  0.313383  0.234041  0.163155 -0.296649  5
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  6
df.iloc[0, 1] = 0
df

                   A         B         C         D  E
2019-01-01  0.000000  0.000000  0.925984 -0.818969  1
2019-01-02 -0.234414 -1.194674  1.080779 -2.294395  2
2019-01-03 -0.141572  0.058118  1.102248  1.207726  3
2019-01-04  0.305088  0.535920 -0.978434  0.177251  4
2019-01-05  0.313383  0.234041  0.163155 -0.296649  5
2019-01-06  0.250613 -0.904400 -0.858240 -1.573342  6
df["D"] = np.array([5]\*len(df))   # 可简化成df["D"] = 5
df

                   A         B         C  D  E
2019-01-01  0.000000  0.000000  0.925984  5  1
2019-01-02 -0.234414 -1.194674  1.080779  5  2
2019-01-03 -0.141572  0.058118  1.102248  5  3
2019-01-04  0.305088  0.535920 -0.978434  5  4
2019-01-05  0.313383  0.234041  0.163155  5  5
2019-01-06  0.250613 -0.904400 -0.858240  5  6

  • Replacing the index and columns
df.index = [i for i in range(len(df))]
df

          A         B         C  D  E
0  0.000000  0.000000  0.925984  5  1
1 -0.234414 -1.194674  1.080779  5  2
2 -0.141572  0.058118  1.102248  5  3
3  0.305088  0.535920 -0.978434  5  4
4  0.313383  0.234041  0.163155  5  5
5  0.250613 -0.904400 -0.858240  5  6
df.columns = [i for i in range(df.shape[1])]
df

          0         1         2  3  4
0  0.000000  0.000000  0.925984  5  1
1 -0.234414 -1.194674  1.080779  5  2
2 -0.141572  0.058118  1.102248  5  3
3  0.305088  0.535920 -0.978434  5  4
4  0.313383  0.234041  0.163155  5  5
5  0.250613 -0.904400 -0.858240  5  6
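To rename only some labels instead of replacing the whole index or columns, DataFrame.rename accepts mapping dictionaries (a minimal sketch; the new names are just an illustration):

df.rename(columns={0: "A", 1: "B"}, index={0: "row0"})    # returns a renamed copy; df itself is unchanged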

11.3 Numerical Operations and Statistical Analysis


1. Inspecting the data

import pandas as pd
import numpy as np

dates = pd.date_range(start='2019-01-01', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=["A", "B", "C", "D"])
df

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051 -0.048964
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402

(1) Viewing the first rows

df.head()    # first 5 rows by default; pass a number to change it

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051 -0.048964
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
df.head(2)

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051 -0.048964
2019-01-02  1.371364 -0.121454 -0.299653  1.095375

(2) Viewing the last rows

df.tail()    # last 5 rows by default

                   A         B         C         D
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402
df.tail(3) 

                   A         B         C         D
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402

(3) Viewing overall information

df.iloc[0, 3] = np.nan
df

                   A         B         C         D
2019-01-01 -0.854043  0.412345 -2.296051       NaN
2019-01-02  1.371364 -0.121454 -0.299653  1.095375
2019-01-03 -0.714591 -1.103224  0.979250  0.319455
2019-01-04 -1.397557  0.426008  0.233861 -1.651887
2019-01-05  0.434026  0.459830 -0.095444  1.220302
2019-01-06 -0.133876  0.074500 -1.028147  0.605402
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6 entries, 2019-01-01 to 2019-01-06
Freq: D
Data columns (total 4 columns):
A    6 non-null float64
B    6 non-null float64
C    6 non-null float64
D    5 non-null float64
dtypes: float64(4)
memory usage: 240.0 bytes
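A common follow-up to df.info() is counting the missing values per column (a minimal sketch, assuming the df above):

df.isnull().sum()    # number of NaN values in each column; here only D has one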

2. NumPy universal functions also work on Pandas objects

(1) Vectorized operations

x = pd.DataFrame(np.arange(4).reshape(1, 4))
x

   0  1  2  3
0  0  1  2  3
x+5

   0  1  2  3
0  5  6  7  8
np.exp(x)

     0         1         2          3
0  1.0  2.718282  7.389056  20.085537
y = pd.DataFrame(np.arange(4,8).reshape(1, 4))
y

   0  1  2  3
0  4  5  6  7
x*y

   0  1   2   3
0  0  5  12  21

(2) Matrix operations

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(30, 30)))
x

[Output: a 30 × 30 DataFrame of random integers; display truncated — 30 rows × 30 columns]

  • Transpose
z = x.T
z

[Output: the transposed 30 × 30 DataFrame; display truncated — 30 rows × 30 columns]

np.random.seed(1)
y = pd.DataFrame(np.random.randint(10, size=(30, 30)))
y

[Output: a second 30 × 30 DataFrame of random integers; display truncated — 30 rows × 30 columns]

x.dot(y)

[Output: the 30 × 30 matrix product as a DataFrame; display truncated — 30 rows × 30 columns]

%timeit x.dot(y)

218 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit np.dot(x, y)

81.1 µs ± 2.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

  • Comparing NumPy and Pandas on the same computation
x1 = np.array(x)
x1

y1 = np.array(y)
y1

%timeit x1.dot(y1)

22.1 µs ± 992 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.dot(x1, y1)

22.6 µs ± 766 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit np.dot(x.values, y.values)

42.9 µs ± 1.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

x2 = list(x1)
y2 = list(y1)
x3 = []
y3 = []
for i in x2:
    res = []
    for j in i:
        res.append(int(j))
    x3.append(res)
for i in y2:
    res = []
    for j in i:
        res.append(int(j))
    y3.append(res)

def f(x, y):
    res = []
    for i in range(len(x)):
        row = []
        for j in range(len(y[0])):
            sum_row = 0
            for k in range(len(x[0])):
                sum_row += x[i][k]*y[k][j]
            row.append(sum_row)
        res.append(row)
    return res          

%timeit f(x3, y3)

4.29 ms ± 207 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In general, pure numerical computation runs faster in NumPy.

NumPy is geared toward computation, while Pandas is geared toward data handling.
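A practical consequence: when one step of a Pandas workflow is computation-heavy, it can pay to do the arithmetic on the underlying NumPy arrays and wrap the result back into a DataFrame (a minimal sketch, assuming x and y above):

result = pd.DataFrame(np.dot(x.values, y.values),
                      index=x.index, columns=y.columns)    # compute in NumPy, keep the Pandas labels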

(3) Broadcasting

np.random.seed(42)
x = pd.DataFrame(np.random.randint(10, size=(3, 3)), columns=list("ABC"))
x

   A  B  C
0  6  3  7
1  4  6  9
2  2  6  7

  • Broadcasting by row
x.iloc[0]

A    6
B    3
C    7
Name: 0, dtype: int32

x/x.iloc[0]

          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000

  • Broadcasting by column
x.A

0    6
1    4
2    2
Name: A, dtype: int32

x.div(x.A, axis=0)             # related methods: add, sub, mul, div

     A    B         C
0  1.0  0.5  1.166667
1  1.0  1.5  2.250000
2  1.0  3.0  3.500000
x.div(x.iloc[0], axis=1)

          A    B         C
0  1.000000  1.0  1.000000
1  0.666667  2.0  1.285714
2  0.333333  2.0  1.000000

3. Additional features

(1) Index alignment

A = pd.DataFrame(np.random.randint(0, 20, size=(2, 2)), columns=list("AB"))
A

   A  B
0  3  7
1  2  1
B = pd.DataFrame(np.random.randint(0, 10, size=(3, 3)), columns=list("ABC"))
B

   A  B  C
0  7  5  1
1  4  0  9
2  5  8  0

  • Pandas automatically aligns the indexes of the two objects; positions present in only one of them become np.nan
A+B

      A     B   C
0  10.0  12.0 NaN
1   6.0   1.0 NaN
2   NaN   NaN NaN

  • Missing positions can also be filled with fill_value
A.add(B, fill_value=0)

      A     B    C
0  10.0  12.0  1.0
1   6.0   1.0  9.0
2   5.0   8.0  0.0
A*B

      A     B   C
0  21.0  35.0 NaN
1   8.0   0.0 NaN
2   NaN   NaN NaN

(2) Statistics

  • Counting distinct values
y = np.random.randint(3, size=20)
y

array([2, 2, 2, 1, 2, 1, 1, 2, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 1])

np.unique(y)

array([0, 1, 2])

Counting occurrences with Counter:

from collections import Counter
Counter(y)

Counter({2: 11, 1: 5, 0: 4})

y1 = pd.DataFrame(y, columns=["A"])
y1

    A
0   2
1   2
2   2
3   1
4   2
5   1
6   1
7   2
8   1
9   2
10  2
11  0
12  2
13  0
14  2
15  2
16  0
17  0
18  2
19  1

np.unique(y1)

Pandas also provides the value_counts() method:

y1["A"].value_counts()

2    11
1     5
0     4
Name: A, dtype: int64

  • Deriving a new column and sorting the result
population_dict = {"BeiJing": 2154,
                   "ShangHai": 2424,
                   "ShenZhen": 1303,
                   "HangZhou": 981 }
population = pd.Series(population_dict) 

GDP_dict = {"BeiJing": 30320,
            "ShangHai": 32680,
            "ShenZhen": 24222,
            "HangZhou": 13468 }
GDP = pd.Series(GDP_dict)

city_info = pd.DataFrame({"population": population,"GDP": GDP})
city_info

          population    GDP
BeiJing         2154  30320
ShangHai        2424  32680
ShenZhen        1303  24222
HangZhou         981  13468
city_info["per\_GDP"] = city_info["GDP"]/city_info["population"]
city_info

          population    GDP    per_GDP
BeiJing         2154  30320  14.076137
ShangHai        2424  32680  13.481848
ShenZhen        1303  24222  18.589409
HangZhou         981  13468  13.728848

Sorting in ascending order:

city_info.sort_values(by="per\_GDP")

          population    GDP    per_GDP
ShangHai        2424  32680  13.481848
HangZhou         981  13468  13.728848
BeiJing         2154  30320  14.076137
ShenZhen        1303  24222  18.589409

Sorting in descending order:

city_info.sort_values(by="per\_GDP", ascending=False)

          population    GDP    per_GDP
ShenZhen        1303  24222  18.589409
BeiJing         2154  30320  14.076137
HangZhou         981  13468  13.728848
ShangHai        2424  32680  13.481848

Sorting along an axis:

data = pd.DataFrame(np.random.randint(20, size=(3, 4)), index=[2, 1, 0], columns=list("CBAD"))
data

   C   B   A  D
2  3  13  17  8
1  1  19  14  6
0  1  17  14  2

Sorting by row index:

data.sort_index()

   C   B   A  D
0  1  17  14  2
1  1  19  14  6
2  3  13  17  8

Sorting by column index:

data.sort_index(axis=1)

    A   B  C  D
2  17  13  3  8
1  14  19  1  6
0  14  17  1  2
data.sort_index(axis=1, ascending=False)

   D  C   B   A
2  8  3  13  17
1  6  1  19  14
0  2  1  17  14

  • Statistical methods
df = pd.DataFrame(np.random.normal(2, 4, size=(6, 4)),columns=list("ABCD"))
df

           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588

Count of non-null values:

df.count()

A    6
B    6
C    6
D    6
dtype: int64

Sum:

df.sum()

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

df.sum(axis=1)

0     7.947086
1    28.289408
2    -1.427283
3    -4.893571
4    20.476806
5     3.043402
dtype: float64

Maximum and minimum:

df.min()

A   -3.386712
B   -4.333177
C   -4.032613
D   -0.152567
dtype: float64

df.max(axis=1)

0     6.367969
1    13.113252
2     5.094802
3     2.537716
4     5.577329
5     6.398588
dtype: float64

df

           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588
df.idxmax()

A    1
B    1
C    4
D    5
dtype: int64

Mean:

df.mean()

A    2.378704
B    1.588916
C    0.655231
D    4.283124
dtype: float64

Variance:

df.var()

A    34.980702
B    19.110656
C    18.948144
D     6.726776
dtype: float64

Standard deviation:

df.std()

A    5.914449
B    4.371574
C    4.352947
D    2.593603
dtype: float64

Median:

df.median()

A    1.126767
B    1.531743
C    0.176172
D    5.273518
dtype: float64

Mode:

data = pd.DataFrame(np.random.randint(5, size=(10, 2)), columns=list("AB"))
data

   A  B
0  4  2
1  3  2
2  2  0
3  2  4
4  2  0
5  4  1
6  2  0
7  1  1
8  3  4
9  2  0
data.mode()

   A  B
0  2  0

75% quantile:

df.quantile(0.75)

A    3.539202
B    5.052594
C    4.539740
D    6.157738
Name: 0.75, dtype: float64

  • describe() returns all of these statistics at once
df.describe()

               A         B         C         D
count   6.000000  6.000000  6.000000  6.000000
mean    2.378704  1.588916  0.655231  4.283124
std     5.914449  4.371574  4.352947  2.593603
min    -3.386712 -4.333177 -4.032613 -0.152567
25%    -1.256706 -1.265251 -2.925910  3.158284
50%     1.126767  1.531743  0.176172  5.273518
75%     3.539202  5.052594  4.539740  6.157738
max    13.113252  6.774559  5.577329  6.398588
data_2 = pd.DataFrame([["a", "a", "c", "d"],
                       ["c", "a", "c", "b"],
                       ["a", "a", "d", "c"]], columns=list("ABCD"))
data_2

   A  B  C  D
0  a  a  c  d
1  c  a  c  b
2  a  a  d  c

  • describe() for string (object) columns
data_2.describe()

        A  B  C  D
count   3  3  3  3
unique  2  1  2  3
top     a  a  c  d
freq    2  3  2  1

Correlation coefficients and covariance:

df.corr()

          A         B         C         D
A  1.000000  0.831063  0.331060  0.510821
B  0.831063  1.000000  0.179244  0.719112
C  0.331060  0.179244  1.000000 -0.450365
D  0.510821  0.719112 -0.450365  1.000000
df.corrwith(df["A"])

A    1.000000
B    0.831063
C    0.331060
D    0.510821
dtype: float64

Custom output:

Usage of apply(func): by default the given function is applied to each column; with axis=1 it is applied to each row.

df

           A         B         C         D
0   1.082198  3.557396 -3.060476  6.367969
1  13.113252  6.774559  2.874553  5.527044
2  -2.036341 -4.333177  5.094802 -0.152567
3  -3.386712 -1.522365 -2.522209  2.537716
4   4.328491  5.550994  5.577329  5.019991
5   1.171336 -0.493910 -4.032613  6.398588
df.apply(np.cumsum)

           A          B         C          D
0   1.082198   3.557396 -3.060476   6.367969
1  14.195450  10.331955 -0.185923  11.895013
2  12.159109   5.998778  4.908878  11.742447
3   8.772397   4.476413  2.386669  14.280162
4  13.100888  10.027406  7.963999  19.300153
5  14.272224   9.533497  3.931385  25.698741
df.apply(np.cumsum, axis=1)

           A          B          C          D
0   1.082198   4.639594   1.579117   7.947086
1  13.113252  19.887811  22.762364  28.289408
2  -2.036341  -6.369518  -1.274717  -1.427283
3  -3.386712  -4.909077  -7.431287  -4.893571
4   4.328491   9.879485  15.456814  20.476806
5   1.171336   0.677427  -3.355186   3.043402
df.apply(sum)

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

df.sum()

A    14.272224
B     9.533497
C     3.931385
D    25.698741
dtype: float64

df.apply(lambda x: x.max()-x.min())

A    16.499965
B    11.107736
C     9.609942
D     6.551155
dtype: float64

def my_describe(x):
    return pd.Series([x.count(), x.mean(), x.max(), x.idxmin(), x.std()], \
                     index=["Count", "mean", "max", "idxmin", "std"])
df.apply(my_describe)

                A         B         C         D
Count    6.000000  6.000000  6.000000  6.000000
mean     2.378704  1.588916  0.655231  4.283124
max     13.113252  6.774559  5.577329  6.398588
idxmin   3.000000  2.000000  5.000000  2.000000
std      5.914449  4.371574  4.352947  2.593603

11.4 Handling Missing Values

1. Detecting missing values

import pandas as pd
import numpy as np

data = pd.DataFrame(np.array([[1, np.nan, 2],
                              [np.nan, 3, 4],
                              [5, 6, None]]), columns=["A", "B", "C"])
data

     A    B     C
0    1  NaN     2
1  NaN    3     4
2    5    6  None

Note: when the data contains None, strings, or other non-numeric values, every column's dtype becomes object, which is more memory- and compute-expensive than int or float.

np.nan is a special floating-point value (its type is float), so NaN is the preferred way to represent missing data.

data.dtypes

A    object
B    object
C    object
dtype: object
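Converting such an object DataFrame to a numeric dtype also turns None into NaN (a minimal sketch, assuming the data above):

data.astype(float).dtypes    # every column becomes float64, and the None in column C becomes NaN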

data.isnull()

       A      B      C
0  False   True  False
1   True  False  False
2  False  False   True
data.notnull()

       A      B      C
0   True  False   True
1  False   True   True
2   True   True  False

2. Dropping missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data

      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   4.0   5.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0

Note: np.nan is a special floating-point value, so these columns are float64.

data.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

(1) Dropping entire rows

data.dropna()

      A     B     C     D
3  10.0  11.0  12.0  13.0

(2) Dropping entire columns

data.dropna(axis="columns")

      D
0   3.0
1   6.0
2   9.0
3  13.0
data["D"] = np.nan
data

      A     B     C   D
0   1.0   NaN   2.0 NaN
1   NaN   4.0   5.0 NaN
2   7.0   8.0   NaN NaN
3  10.0  11.0  12.0 NaN
data.dropna(axis="columns", how="all")

      A     B     C
0   1.0   NaN   2.0
1   NaN   4.0   5.0
2   7.0   8.0   NaN
3  10.0  11.0  12.0

how="all" drops a column (or row) only when all of its values are missing.

data.dropna(axis="columns", how="any")

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
data.loc[3] = np.nan
data

     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  4.0  5.0 NaN
2  7.0  8.0  NaN NaN
3  NaN  NaN  NaN NaN
data.dropna(how="all")

     A    B    C   D
0  1.0  NaN  2.0 NaN
1  NaN  4.0  5.0 NaN
2  7.0  8.0  NaN NaN
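dropna also takes a thresh argument, keeping only rows (or columns) with at least that many non-missing values (a minimal sketch, assuming the data above):

data.dropna(thresh=2)    # keep rows with at least 2 non-NaN values; the all-NaN row 3 is dropped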

3. Filling missing values

data = pd.DataFrame(np.array([[1, np.nan, 2, 3],
                              [np.nan, 4, 5, 6],
                              [7, 8, np.nan, 9],
                              [10, 11 , 12, 13]]), columns=["A", "B", "C", "D"])
data

      A     B     C     D
0   1.0   NaN   2.0   3.0
1   NaN   4.0   5.0   6.0
2   7.0   8.0   NaN   9.0
3  10.0  11.0  12.0  13.0
data.fillna(value=5)

      A     B     C     D
0   1.0   5.0   2.0   3.0
1   5.0   4.0   5.0   6.0
2   7.0   8.0   5.0   9.0
3  10.0  11.0  12.0  13.0

  • Filling with the column means
fill = data.mean()
fill

A    6.000000
B    7.666667
C    6.333333
D    7.750000
dtype: float64

data.fillna(value=fill)

      A          B          C     D
0   1.0   7.666667   2.000000   3.0
1   6.0   4.000000   5.000000   6.0
2   7.0   8.000000   6.333333   9.0
3  10.0  11.000000  12.000000  13.0

To fill with the mean of all the data, first stack (flatten) the DataFrame, take the mean, and then use that value for filling.

fill = data.stack().mean()
fill

7.0

data.fillna(value=fill)

      A     B     C     D
0   1.0   7.0   2.0   3.0
1   7.0   4.0   5.0   6.0
2   7.0   8.0   7.0   9.0
3  10.0  11.0  12.0  13.0
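Missing values can also be filled from neighboring rows: forward fill propagates the previous valid value, backward fill takes the next one (a minimal sketch, assuming the data above):

data.ffill()    # fill each NaN with the last valid value above it in the column
data.bfill()    # fill each NaN with the next valid value below it instead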

11.5 Combining Data

  • A helper function that generates DataFrames
import pandas as pd
import numpy as np

def make_df(cols, ind):
    "A simple DataFrame"
    data = {c: [str(c)+str(i) for i in ind]  for c in cols}
    return pd.DataFrame(data, ind)

make_df("ABC", range(3))

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2

  • Vertical concatenation
df_1 = make_df("AB", [1, 2])
df_2 = make_df("AB", [3, 4])
print(df_1)
print(df_2)

    A   B
1  A1  B1
2  A2  B2
    A   B
3  A3  B3
4  A4  B4

pd.concat([df_1, df_2])

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4

  • Horizontal concatenation
df_3 = make_df("AB", [0, 1])
df_4 = make_df("CD", [0, 1])
print(df_3)
print(df_4)

    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1

pd.concat([df_3, df_4], axis=1)

    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
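Note that concat keeps the original row labels, so vertical concatenation can yield duplicate index values; passing ignore_index=True renumbers the result (a minimal sketch, assuming the DataFrames above):

pd.concat([df_3, df_1], ignore_index=True)    # rows are relabeled 0..n-1 instead of keeping 0, 1, 1, 2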

