pandas Quick Start (notes from a 拜师网 video; for reference only)

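This post covers the basics of the pandas library: creating Series and DataFrame objects; viewing, selecting, and modifying data; handling missing values; basic statistics; merging, grouping, reshaping, and pivoting data; and time series operations. It also touches on reading and writing data and on plotting.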

Creating Series and DataFrame objects

import numpy as np
import pandas as pd

pandas has two basic data structures: Series and DataFrame.

s = pd.Series([1, 3, 5, 7, np.nan])
print(s)
0    1.0
1    3.0
2    5.0
3    7.0
4    NaN
dtype: float64

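In the output, the first column is the automatically generated index and the second column holds the values.
Next, create a range of dates: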

date = pd.date_range('20200704', periods = 10)
print(date)
DatetimeIndex(['2020-07-04', '2020-07-05', '2020-07-06', '2020-07-07',
               '2020-07-08', '2020-07-09', '2020-07-10', '2020-07-11',
               '2020-07-12', '2020-07-13'],
              dtype='datetime64[ns]', freq='D')

Next, create a DataFrame. The constructor takes three arguments here: the values, the row index, and the column index.

data = pd.DataFrame(np.random.randn(10, 4)
                    ,index=date
                    ,columns=list('ABCD'))
print(data)
                   A         B         C         D
2020-07-04  1.461007  0.330546  0.069813  0.753270
2020-07-05 -1.467785  0.847787  0.689302 -0.272599
2020-07-06  0.607181 -0.681571  0.480336  1.924793
2020-07-07  0.993651 -1.429534  1.063160 -0.477975
2020-07-08  0.317019  1.150236  1.235036  1.173084
2020-07-09  2.302594  0.536270 -0.806829  0.786662
2020-07-10 -0.101381 -0.477274 -0.988975 -2.492971
2020-07-11 -1.408476 -0.708248  1.382285 -1.137585
2020-07-12  0.099601  0.570422  0.183447 -0.985196
2020-07-13  2.066843 -0.036730 -0.114365  0.233777

Another way to create a DataFrame is from a dictionary, as shown below:

dic = {'A':range(4), 'B':list('abcd')}
df = pd.DataFrame(dic)
print(df)
   A  B
0  0  a
1  1  b
2  2  c
3  3  d
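
As a small sketch of the dictionary route: each key becomes a column with its own dtype, and an explicit `index` can replace the default 0..n-1 labels. The row labels `w`..`z` below are made up for illustration and do not appear in the original notes.

```python
import pandas as pd

# Each dict key becomes a column; each value supplies that column's contents.
dic = {'A': range(4), 'B': list('abcd')}

# The row labels here are hypothetical, chosen only to show the index argument.
df_labeled = pd.DataFrame(dic, index=['w', 'x', 'y', 'z'])

print(df_labeled)
print(df_labeled.dtypes)   # A is an integer column, B holds Python strings
```

Columns are allowed to have different dtypes, which is the main difference from a plain NumPy array.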

Basic DataFrame operations

Viewing data

A column can be accessed by using its name as an attribute:

data.A
2020-07-04    1.461007
2020-07-05   -1.467785
2020-07-06    0.607181
2020-07-07    0.993651
2020-07-08    0.317019
2020-07-09    2.302594
2020-07-10   -0.101381
2020-07-11   -1.408476
2020-07-12    0.099601
2020-07-13    2.066843
Freq: D, Name: A, dtype: float64
data.B
2020-07-04    0.330546
2020-07-05    0.847787
2020-07-06   -0.681571
2020-07-07   -1.429534
2020-07-08    1.150236
2020-07-09    0.536270
2020-07-10   -0.477274
2020-07-11   -0.708248
2020-07-12    0.570422
2020-07-13   -0.036730
Freq: D, Name: B, dtype: float64

Checking the type of a column shows that it is a Series:

type(data.A)
pandas.core.series.Series

You can view the head and tail of a DataFrame, its row labels, its column labels, and its values:

# View the first rows (5 by default)
data.head()
                   A         B         C         D
2020-07-04  1.461007  0.330546  0.069813  0.753270
2020-07-05 -1.467785  0.847787  0.689302 -0.272599
2020-07-06  0.607181 -0.681571  0.480336  1.924793
2020-07-07  0.993651 -1.429534  1.063160 -0.477975
2020-07-08  0.317019  1.150236  1.235036  1.173084
# View the last rows
data.tail(2)
                   A         B         C         D
2020-07-12  0.099601  0.570422  0.183447 -0.985196
2020-07-13  2.066843 -0.036730 -0.114365  0.233777
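# View the row index (shown here for df, the DataFrame built from a dict)
df.index
RangeIndex(start=0, stop=4, step=1)
# View the column index
data.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
# View the underlying values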
data.values
array([[ 1.46100717,  0.33054584,  0.06981279,  0.75326991],
       [-1.46778504,  0.84778667,  0.6893022 , -0.27259934],
       [ 0.60718094, -0.68157073,  0.4803361 ,  1.92479259],
       [ 0.99365125, -1.42953449,  1.06315959, -0.47797516],
       [ 0.31701933,  1.1502355 ,  1.23503615,  1.17308395],
       [ 2.30259364,  0.5362702 , -0.80682917,  0.7866624 ],
       [-0.10138056, -0.47727408, -0.98897458, -2.49297128],
       [-1.4084756 , -0.7082477 ,  1.38228526, -1.13758454],
       [ 0.0996013 ,  0.57042191,  0.18344733, -0.98519631],
       [ 2.06684278, -0.03672988, -0.11436519,  0.23377719]])
# View descriptive statistics
data.describe()
               A          B          C          D
count  10.000000  10.000000  10.000000  10.000000
mean    0.487026   0.010190   0.319321  -0.049474
std     1.288896   0.816621   0.811938   1.294423
min    -1.467785  -1.429534  -0.988975  -2.492971
25%    -0.051135  -0.630497  -0.068321  -0.858391
50%     0.462100   0.146908   0.331892  -0.019411
75%     1.344168   0.561884   0.969695   0.778314
max     2.302594   1.150236   1.382285   1.924793
# Transpose the DataFrame
data.T
   2020-07-04  2020-07-05  2020-07-06  2020-07-07  2020-07-08  2020-07-09  2020-07-10  2020-07-11  2020-07-12  2020-07-13
A    1.461007   -1.467785    0.607181    0.993651    0.317019    2.302594   -0.101381   -1.408476    0.099601    2.066843
B    0.330546    0.847787   -0.681571   -1.429534    1.150236    0.536270   -0.477274   -0.708248    0.570422   -0.036730
C    0.069813    0.689302    0.480336    1.063160    1.235036   -0.806829   -0.988975    1.382285    0.183447   -0.114365
D    0.753270   -0.272599    1.924793   -0.477975    1.173084    0.786662   -2.492971   -1.137585   -0.985196    0.233777
# Sort by the row labels; ascending order is the default
data.sort_index(axis=0)
                   A         B         C         D
2020-07-04  1.461007  0.330546  0.069813  0.753270
2020-07-05 -1.467785  0.847787  0.689302 -0.272599
2020-07-06  0.607181 -0.681571  0.480336  1.924793
2020-07-07  0.993651 -1.429534  1.063160 -0.477975
2020-07-08  0.317019  1.150236  1.235036  1.173084
2020-07-09  2.302594  0.536270 -0.806829  0.786662
2020-07-10 -0.101381 -0.477274 -0.988975 -2.492971
2020-07-11 -1.408476 -0.708248  1.382285 -1.137585
2020-07-12  0.099601  0.570422  0.183447 -0.985196
2020-07-13  2.066843 -0.036730 -0.114365  0.233777
# Sort by the column labels, here in descending order
data.sort_index(axis=1, ascending=False)
                   D         C         B         A
2020-07-04  0.753270  0.069813  0.330546  1.461007
2020-07-05 -0.272599  0.689302  0.847787 -1.467785
2020-07-06  1.924793  0.480336 -0.681571  0.607181
2020-07-07 -0.477975  1.063160 -1.429534  0.993651
2020-07-08  1.173084  1.235036  1.150236  0.317019
2020-07-09  0.786662 -0.806829  0.536270  2.302594
2020-07-10 -2.492971 -0.988975 -0.477274 -0.101381
2020-07-11 -1.137585  1.382285 -0.708248 -1.408476
2020-07-12 -0.985196  0.183447  0.570422  0.099601
2020-07-13  0.233777 -0.114365 -0.036730  2.066843
# Sort the rows by the values in column A
data.sort_values(by='A')
                   A         B         C         D
2020-07-05 -1.467785  0.847787  0.689302 -0.272599
2020-07-11 -1.408476 -0.708248  1.382285 -1.137585
2020-07-10 -0.101381 -0.477274 -0.988975 -2.492971
2020-07-12  0.099601  0.570422  0.183447 -0.985196
2020-07-08  0.317019  1.150236  1.235036  1.173084
2020-07-06  0.607181 -0.681571  0.480336  1.924793
2020-07-07  0.993651 -1.429534  1.063160 -0.477975
2020-07-04  1.461007  0.330546  0.069813  0.753270
2020-07-13  2.066843 -0.036730 -0.114365  0.233777
2020-07-09  2.302594  0.536270 -0.806829  0.786662

Selecting data

# Select column A
# This can also be written as data.A
data['A']
2020-07-04    1.461007
2020-07-05   -1.467785
2020-07-06    0.607181
2020-07-07    0.993651
2020-07-08    0.317019
2020-07-09    2.302594
2020-07-10   -0.101381
2020-07-11   -1.408476
2020-07-12    0.099601
2020-07-13    2.066843
Freq: D, Name: A, dtype: float64
# loc is the recommended way to select rows by label; label slices include both endpoints
data.loc['20200704':'20200713']
                   A         B         C         D
2020-07-04  1.461007  0.330546  0.069813  0.753270
2020-07-05 -1.467785  0.847787  0.689302 -0.272599
2020-07-06  0.607181 -0.681571  0.480336  1.924793
2020-07-07  0.993651 -1.429534  1.063160 -0.477975
2020-07-08  0.317019  1.150236  1.235036  1.173084
2020-07-09  2.302594  0.536270 -0.806829  0.786662
2020-07-10 -0.101381 -0.477274 -0.988975 -2.492971
2020-07-11 -1.408476 -0.708248  1.382285 -1.137585
2020-07-12  0.099601  0.570422  0.183447 -0.985196
2020-07-13  2.066843 -0.036730 -0.114365  0.233777
# loc can select rows and columns at the same time
data.loc['20200707':'20200709', ['B', 'C']]
                   B         C
2020-07-07 -1.429534  1.063160
2020-07-08  1.150236  1.235036
2020-07-09  0.536270 -0.806829
# loc can also access a single value
data.loc['20200707', 'C']
1.0631595862205687
# iloc selects rows by integer position (the stop position is excluded)
data.iloc[1:3]
                   A         B         C         D
2020-07-05 -1.467785  0.847787  0.689302 -0.272599
2020-07-06  0.607181 -0.681571  0.480336  1.924793
# iloc can select rows and columns by position
data.iloc[0:3, 1:4]
                   B         C         D
2020-07-04  0.330546  0.069813  0.753270
2020-07-05  0.847787  0.689302 -0.272599
2020-07-06 -0.681571  0.480336  1.924793
# at also accesses a single value by label and is faster than loc for scalar access
data.at[pd.Timestamp('20200707'), 'C']
1.0631595862205687
# iat accesses a single value by position and is faster than iloc
data.iat[2, 3]
1.9247925915867081
# A boolean expression can be used as a filter (boolean indexing);
# entries that fail the condition become NaN
data[data > 0]
                   A         B         C         D
2020-07-04  1.461007  0.330546  0.069813  0.753270
2020-07-05       NaN  0.847787  0.689302       NaN
2020-07-06  0.607181       NaN  0.480336  1.924793
2020-07-07  0.993651       NaN  1.063160       NaN
2020-07-08  0.317019  1.150236  1.235036  1.173084
2020-07-09  2.302594  0.536270       NaN  0.786662
2020-07-10       NaN       NaN       NaN       NaN
2020-07-11       NaN       NaN  1.382285       NaN
2020-07-12  0.099601  0.570422  0.183447       NaN
2020-07-13  2.066843       NaN       NaN  0.233777
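
The selection methods in this section differ in two ways that are easy to miss: label slices with loc include the end label while position slices with iloc exclude the stop, and a boolean mask built from one column drops whole rows while a mask built from the whole frame keeps the shape and fills in NaN. Here is a minimal sketch on a small made-up frame (the name `demo` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('20200704', periods=5)
# Rows are [-4, -3], [-2, -1], [0, 1], [2, 3], [4, 5]
demo = pd.DataFrame(np.arange(10).reshape(5, 2) - 4, index=idx, columns=['A', 'B'])

by_label = demo.loc['20200705':'20200707']  # 3 rows: both endpoints are included
by_position = demo.iloc[1:3]                # 2 rows: position 3 is excluded

rows_with_positive_a = demo[demo['A'] > 0]  # column condition: keeps whole rows
masked = demo[demo > 0]                     # frame condition: same shape, NaN elsewhere

print(len(by_label), len(by_position))
print(rows_with_positive_a)
print(masked)
```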

Modifying data

data2 = data.copy()
data2
                   A         B         C         D
2020-07-04  1.461007  0.330546  0.069813  0.753270
2020-07-05 -1.467785  0.847787  0.689302 -0.272599
2020-07-06  0.607181 -0.681571  0.480336  1.924793
2020-07-07  0.993651 -1.429534  1.063160 -0.477975
2020-07-08  0.317019  1.150236  1.235036  1.173084
2020-07-09  2.302594  0.536270 -0.806829  0.786662
2020-07-10 -0.101381 -0.477274 -0.988975 -2.492971
2020-07-11 -1.408476 -0.708248  1.382285 -1.137585
2020-07-12  0.099601  0.570422  0.183447 -0.985196
2020-07-13  2.066843 -0.036730 -0.114365  0.233777
# Add a column to data2
new = list('abcdefghij')
data2['NEW'] = new
data2
                   A         B         C         D NEW
2020-07-04  1.461007  0.330546  0.069813  0.753270   a
2020-07-05 -1.467785  0.847787  0.689302 -0.272599   b
2020-07-06  0.607181 -0.681571  0.480336  1.924793   c
2020-07-07  0.993651 -1.429534  1.063160 -0.477975   d
2020-07-08  0.317019  1.150236  1.235036  1.173084   e
2020-07-09  2.302594  0.536270 -0.806829  0.786662   f
2020-07-10 -0.101381 -0.477274 -0.988975 -2.492971   g
2020-07-11 -1.408476 -0.708248  1.382285 -1.137585   h
2020-07-12  0.099601  0.570422  0.183447 -0.985196   i
2020-07-13  2.066843 -0.036730 -0.114365  0.233777   j
# isin keeps the rows whose NEW value is 'a' or 'c'
data2[data2.NEW.isin(['a', 'c'])]
                   A         B         C         D NEW
2020-07-04  1.461007  0.330546  0.069813  0.753270   a
2020-07-06  0.607181 -0.681571  0.480336  1.924793   c
# Values can be modified one element, one row, one column, or one block at a time.
# Unless the new value is a scalar, its shape must match the region being assigned;
# otherwise pandas raises an error.
# Modify the element in the first row and first column
data2.iat[0, 0] = 100
data2
                     A         B         C         D NEW
2020-07-04  100.000000  0.330546  0.069813  0.753270   a
2020-07-05   -1.467785  0.847787  0.689302 -0.272599   b
2020-07-06    0.607181 -0.681571  0.480336  1.924793   c
2020-07-07    0.993651 -1.429534  1.063160 -0.477975   d
2020-07-08    0.317019  1.150236  1.235036  1.173084   e
2020-07-09    2.302594  0.536270 -0.806829  0.786662   f
2020-07-10   -0.101381 -0.477274 -0.988975 -2.492971   g
2020-07-11   -1.408476 -0.708248  1.382285 -1.137585   h
2020-07-12    0.099601  0.570422  0.183447 -0.985196   i
2020-07-13    2.066843 -0.036730 -0.114365  0.233777   j
# Modify the second column (B)
data2.B = range(10)
data2
                     A  B         C         D NEW
2020-07-04  100.000000  0  0.069813  0.753270   a
2020-07-05   -1.467785  1  0.689302 -0.272599   b
2020-07-06    0.607181  2  0.480336  1.924793   c
2020-07-07    0.993651  3  1.063160 -0.477975   d
2020-07-08    0.317019  4  1.235036  1.173084   e
2020-07-09    2.302594  5 -0.806829  0.786662   f
2020-07-10   -0.101381  6 -0.988975 -2.492971   g
2020-07-11   -1.408476  7  1.382285 -1.137585   h
2020-07-12    0.099601  8  0.183447 -0.985196   i
2020-07-13    2.066843  9 -0.114365  0.233777   j
# Modify a block of elements (the last two rows of columns C and D)
data2.iloc[8:, 2:4] = np.arange(4).reshape(2, 2)
data2
                     A  B         C         D NEW
2020-07-04  100.000000  0  0.069813  0.753270   a
2020-07-05   -1.467785  1  0.689302 -0.272599   b
2020-07-06    0.607181  2  0.480336  1.924793   c
2020-07-07    0.993651  3  1.063160 -0.477975   d
2020-07-08    0.317019  4  1.235036  1.173084   e
2020-07-09    2.302594  5 -0.806829  0.786662   f
2020-07-10   -0.101381  6 -0.988975 -2.492971   g
2020-07-11   -1.408476  7  1.382285 -1.137585   h
2020-07-12    0.099601  8  0.000000  1.000000   i
2020-07-13    2.066843  9  2.000000  3.000000   j
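
To recap the assignment rules in one place, here is a small sketch on a made-up frame (`small` and its contents are assumptions, not the data2 from above). It also adds one pattern the notes do not show, conditional assignment through loc with a boolean mask:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.zeros((4, 3)), columns=list('ABC'))

small.iat[0, 0] = 100                              # one cell, by position
small['B'] = range(4)                              # whole column: length must match the row count
small.iloc[2:, 1:3] = np.arange(4).reshape(2, 2)   # 2x2 block: shape must match the region
small.loc[small['A'] > 50, 'C'] = -1               # only the rows where the mask is True

print(small)
```

A scalar on the right-hand side is broadcast to every selected cell, which is why the last line needs no shape bookkeeping.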
dates = pd.date_range('20200705', periods=6)
df2 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df2
                   A         B         C         D
2020-07-05  1.162305  0.467942  1.393984 -0.513032
2020-07-06  1.014034 -0.878029  0.457580 -0.326289
2020-07-07 -0.806825 -0.757913  1.265789  1.237268
2020-07-08  0.351479  0.591388  0.860177 -1.946902
2020-07-09 -0.938336  1.229066 -0.243945  1.210629
2020-07-10 -2.233276  0.243813  0.490214  0.591144

Handling missing values

# Reindex the DataFrame above:
# keep the first four rows and add a new column E.
# The new column is filled with missing values (NaN) by default.
df3 = df2.reindex(index=dates[0:4], columns=list(df2.columns) + ['E'])
df3
                   A         B         C         D   E
2020-07-05  1.162305  0.467942  1.393984 -0.513032 NaN
2020-07-06  1.014034 -0.878029  0.457580 -0.326289 NaN
2020-07-07 -0.806825 -0.757913  1.265789  1.237268 NaN
2020-07-08  0.351479  0.591388  0.860177 -1.946902 NaN
df3.loc[dates[1:3], 'E'] = 2
df3
                   A         B         C         D    E
2020-07-05  1.162305  0.467942  1.393984 -0.513032  NaN
2020-07-06  1.014034 -0.878029  0.457580 -0.326289  2.0
2020-07-07 -0.806825 -0.757913  1.265789  1.237268  2.0
2020-07-08  0.351479  0.591388  0.860177 -1.946902  NaN
# There are two ways to handle missing values.
# The first is to drop the rows that contain them
df3.dropna()
                   A         B         C         D    E
2020-07-06  1.014034 -0.878029  0.457580 -0.326289  2.0
2020-07-07 -0.806825 -0.757913  1.265789  1.237268  2.0
# The second is to fill them with a value
df3.fillna(5)
                   A         B         C         D    E
2020-07-05  1.162305  0.467942  1.393984 -0.513032  5.0
2020-07-06  1.014034 -0.878029  0.457580 -0.326289  2.0
2020-07-07 -0.806825 -0.757913  1.265789  1.237268  2.0
2020-07-08  0.351479  0.591388  0.860177 -1.946902  5.0
# Check which entries are missing
df3.isnull()
                A      B      C      D      E
2020-07-05  False  False  False  False   True
2020-07-06  False  False  False  False  False
2020-07-07  False  False  False  False  False
2020-07-08  False  False  False  False   True
# Check whether the DataFrame contains any missing value at all
df3.isnull().any().any()
True
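
Dropping and filling both take options that the notes do not cover. The sketch below, on a made-up frame (`frame` and its values are assumptions), shows restricting dropna to one column with `subset` and filling each column with its own mean instead of a constant:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 2.0, np.nan, 4.0],
                      'E': [np.nan, 2.0, 2.0, np.nan]})

rows_without_nan = frame.dropna()               # drop rows with a NaN in any column
rows_with_a = frame.dropna(subset=['A'])        # only require column A to be present
filled_with_mean = frame.fillna(frame.mean())   # fill each column with that column's mean
nan_per_column = frame.isnull().sum()           # number of missing entries per column

print(rows_without_nan)
print(rows_with_a)
print(filled_with_mean)
print(nan_per_column)
```

None of these calls modify `frame`; each returns a new object, just like dropna and fillna above.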

Basic statistics

# Compute the mean; missing values are skipped
print(df3)
# By default the mean is computed per column
print(df3.mean())
                   A         B         C         D    E
2020-07-05  1.162305  0.467942  1.393984 -0.513032  NaN
2020-07-06  1.014034 -0.878029  0.457580 -0.326289  2.0
2020-07-07 -0.806825 -0.757913  1.265789  1.237268  2.0
2020-07-08  0.351479  0.591388  0.860177 -1.946902  NaN
A    0.430248
B   -0.144153
C    0.994383
D   -0.387239
E    2.000000
dtype: float64
# The mean can also be computed per row
print(df3.mean(axis=1))
2020-07-05    0.627800
2020-07-06    0.453459
2020-07-07    0.587664
2020-07-08   -0.035965
Freq: D, dtype: float64
# Cumulative sum; missing values are skipped here as well
print(df3.cumsum())
print(df3.cumsum(axis=1))
                   A         B         C         D    E
2020-07-05  1.162305  0.467942  1.393984 -0.513032  NaN
2020-07-06  2.176339 -0.410087  1.851565 -0.839322  2.0
2020-07-07  1.369514 -1.167999  3.117354  0.397947  4.0
2020-07-08  1.720993 -0.576611  3.977531 -1.548955  NaN
                   A         B         C         D         E
2020-07-05  1.162305  1.630247  3.024232  2.511199       NaN
2020-07-06  1.014034  0.136005  0.593585  0.267296  2.267296
2020-07-07 -0.806825 -1.564737 -0.298948  0.938320  2.938320
2020-07-08  0.351479  0.942867  1.803044 -0.143858       NaN
s2 = pd.Series([1, 3, 5, np.nan, 6, np.nan], index=dates)
s2
2020-07-05    1.0
2020-07-06    3.0
2020-07-07    5.0
2020-07-08    NaN
2020-07-09    6.0
2020-07-10    NaN
Freq: D, dtype: float64
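# Subtract a one-dimensional Series from a DataFrame: with axis=0 the Series is
# aligned on the row index and subtracted from each of the 4 columns.
# Rows where the Series is NaN become NaN in the result.
print(df2)
# Equivalent to df2.sub(s2, axis='index')
df2.sub(s2, axis=0)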
                   A         B         C         D
2020-07-05  1.162305  0.467942  1.393984 -0.513032
2020-07-06  1.014034 -0.878029  0.457580 -0.326289
2020-07-07 -0.806825 -0.757913  1.265789  1.237268
2020-07-08  0.351479  0.591388  0.860177 -1.946902
2020-07-09 -0.938336  1.229066 -0.243945  1.210629
2020-07-10 -2.233276  0.243813  0.490214  0.591144
                   A         B         C         D
2020-07-05  0.162305 -0.532058  0.393984 -1.513032
2020-07-06 -1.985966 -3.878029 -2.542420 -3.326289
2020-07-07 -5.806825 -5.757913 -3.734211 -3.762732
2020-07-08       NaN       NaN       NaN       NaN
2020-07-09 -6.938336 -4.770934 -6.243945 -4.789371
2020-07-10       NaN       NaN       NaN       NaN
# apply passes each column (as a Series) to the given function
df2.apply(np.cumsum)
                   A         B         C         D
2020-07-05  1.162305  0.467942  1.393984 -0.513032
2020-07-06  2.176339 -0.410087  1.851565 -0.839322
2020-07-07  1.369514 -1.167999  3.117354  0.397947
2020-07-08  1.720993 -0.576611  3.977531 -1.548955
2020-07-09  0.782657  0.652454  3.733586 -0.338326
2020-07-10 -1.450619  0.896267  4.223800  0.252817
# The function passed to apply can be user-defined
def printtype(x):
    return type(x)
df2.apply(printtype)
A    <class 'pandas.core.series.Series'>
B    <class 'pandas.core.series.Series'>
C    <class 'pandas.core.series.Series'>
D    <class 'pandas.core.series.Series'>
dtype: object
# Generate 10 random integers in the range [10, 20); the upper bound 20 is excluded
s3 = pd.Series(np.random.randint(10, 20, size=10))
s3
0    15
1    15
2    12
3    18
4    12
5    13
6    18
7    15
8    18
9    18
dtype: int32
# Count how many times each value occurs
s3.value_counts()
18    4
15    3
12    2
13    1
dtype: int64
# Show the most frequent value(s)
s3.mode()
0    18
dtype: int32
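
Since apply hands each column to the function as a Series, a user-defined function can return one number per column, and `axis=1` switches to one call per row. A minimal sketch with made-up data (`scores` and `value_range` are hypothetical names):

```python
import pandas as pd

scores = pd.DataFrame({'A': [1, 4, 2], 'B': [10, 30, 20]})

def value_range(values):
    # values is a Series: one column when axis=0, one row when axis=1
    return values.max() - values.min()

per_column = scores.apply(value_range)           # one result per column
per_row = scores.apply(value_range, axis=1)      # one result per row

print(per_column)
print(per_row)
```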

Merging data

df4 = pd.DataFrame(np.random.randn(10,4), columns=list('ABCD'))
df4
          A         B         C         D
0  0.005348  0.351821 -0.708831 -0.227980
1 -0.422946  0.153891  0.418701  0.477547
2 -0.384677  0.677650 -0.741967 -0.060229
3  1.604719 -0.300600  1.207413 -0.064982
4 -0.701497 -0.862793 -1.111455 -1.332034
5 -1.265365  0.938897  0.534374  0.143901
6 -0.176843 -0.549029 -0.951518 -1.567208
7 -0.265452 -0.553400  1.133660 -0.593252
8 -2.557326  0.021411  0.444763  0.073160
9  0.794872  0.796273 -0.366471 -0.434226
# Split df4 into two parts
print(df4.iloc[0:5])
print(df4.iloc[5:10])
# Concatenate the two parts again
df5 = pd.concat([df4.iloc[0:5], df4.iloc[5:10]])
# Check whether df5 equals df4
(df5 == df4).all().all()
          A         B         C         D
0  0.005348  0.351821 -0.708831 -0.227980
1 -0.422946  0.153891  0.418701  0.477547
2 -0.384677  0.677650 -0.741967 -0.060229
3  1.604719 -0.300600  1.207413 -0.064982
4 -0.701497 -0.862793 -1.111455 -1.332034
          A         B         C         D
5 -1.265365  0.938897  0.534374  0.143901
6 -0.176843 -0.549029 -0.951518 -1.567208
7 -0.265452 -0.553400  1.133660 -0.593252
8 -2.557326  0.021411  0.444763  0.073160
9  0.794872  0.796273 -0.366471 -0.434226
True
dic2 = {'key':['foo', 'foo'], 'lval':[1, 2]}
dic3 = {'key':['foo', 'foo'], 'rval':[4, 5]}
df5 = pd.DataFrame(dic2)
df6 = pd.DataFrame(dic3)
df5
   key  lval
0  foo     1
1  foo     2
df6
   key  rval
0  foo     4
1  foo     5
# Both frames contain the key 'foo' twice, so the merge yields every pairing (2 x 2 = 4 rows)
pd.merge(df5, df6, on = 'key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
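
pd.merge performs an inner join by default, so keys present in only one frame are dropped. The `how` parameter changes that. Here is a small sketch with made-up frames (`left` and `right` are assumptions) in which the keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})

inner = pd.merge(left, right, on='key')                  # only keys found in both frames
outer = pd.merge(left, right, on='key', how='outer')     # all keys, NaN where a side is missing
left_join = pd.merge(left, right, on='key', how='left')  # every row of left is kept

print(inner)
print(outer)
print(left_join)
```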
s = pd.Series(np.random.randint(1, 5, size = 5))
s
0    4
1    3
2    4
3    2
4    2
dtype: int32
print(df2)
# The original call was df2.append(s, ignore_index=True). DataFrame.append was
# removed in pandas 2.0; pd.concat with the Series turned into a one-row frame
# is the replacement. Neither call modifies df2 itself.
pd.concat([df2, s.to_frame().T], ignore_index=True)
                   A         B         C         D
2020-07-05  1.162305  0.467942  1.393984 -0.513032
2020-07-06  1.014034 -0.878029  0.457580 -0.326289
2020-07-07 -0.806825 -0.757913  1.265789  1.237268
2020-07-08  0.351479  0.591388  0.860177 -1.946902
2020-07-09 -0.938336  1.229066 -0.243945  1.210629
2020-07-10 -2.233276  0.243813  0.490214  0.591144
          A         B         C         D    0    1    2    3    4
0  1.162305  0.467942  1.393984 -0.513032  NaN  NaN  NaN  NaN  NaN
1  1.014034 -0.878029  0.457580 -0.326289  NaN  NaN  NaN  NaN  NaN
2 -0.806825 -0.757913  1.265789  1.237268  NaN  NaN  NaN  NaN  NaN
3  0.351479  0.591388  0.860177 -1.946902  NaN  NaN  NaN  NaN  NaN
4 -0.938336  1.229066 -0.243945  1.210629  NaN  NaN  NaN  NaN  NaN
5 -2.233276  0.243813  0.490214  0.591144  NaN  NaN  NaN  NaN  NaN
6       NaN       NaN       NaN       NaN  4.0  3.0  4.0  2.0  2.0

Grouping and aggregation

df7 = pd.DataFrame({'A':['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                    'B':['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                    'C':np.random.randn(8),
                    'D':np.random.randn(8)})
df7
     A      B         C         D
0  foo    one -2.185653  0.495551
1  bar    one  0.178044  1.334638
2  foo    two  0.134376  1.051614
3  bar  three -0.533640 -0.109897
4  foo    two -0.184697 -0.930691
5  bar    two -0.408308  0.667295
6  foo    one -2.382098 -0.896218
7  foo  three -0.114952 -1.740475
# Group by A and sum within the foo group and the bar group.
# numeric_only=True restricts the result to C and D; without it,
# pandas 2.0+ would also concatenate the strings in column B.
df7.groupby('A').sum(numeric_only=True)
            C         D
A
bar -0.763904  1.892036
foo -4.733024 -2.020218
# Group by both A and B
# Note the syntax: the column names must be wrapped in a list ([])
df7.groupby(['A', 'B']).sum()
                  C         D
A   B
bar one    0.178044  1.334638
    three -0.533640 -0.109897
    two   -0.408308  0.667295
foo one   -4.567751 -0.400666
    three -0.114952 -1.740475
    two   -0.050321  0.120923
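
sum is only one of the aggregations a group can produce. As a sketch, assuming a small made-up frame (`sales` is a hypothetical name), agg computes several statistics in one pass, and selecting the column first keeps string columns out of the result:

```python
import pandas as pd

sales = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                      'B': ['one', 'one', 'two', 'two'],
                      'C': [1.0, 2.0, 3.0, 4.0]})

totals = sales.groupby('A')['C'].sum()                         # one number per group
summary = sales.groupby('A')['C'].agg(['sum', 'mean', 'count'])  # several statistics at once
by_two_keys = sales.groupby(['A', 'B'])['C'].sum()             # result has a two-level index

print(totals)
print(summary)
print(by_two_keys)
```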

Reshaping data

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
# Two-level (hierarchical) index; not covered in detail here
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
                     A         B
first second
bar   one     0.580665 -1.265347
      two    -0.844164 -0.803420
baz   one     0.144407  2.093790
      two    -0.738359 -0.243598
foo   one    -0.522004  0.542619
      two     0.077826 -0.919478
qux   one     0.719028  0.484342
      two     0.804915  0.208629
df.loc['bar']
               A         B
second
one     0.580665 -1.265347
two    -0.844164 -0.803420
df.loc['bar'].loc['one']
A    0.580665
B   -1.265347
Name: one, dtype: float64
stacked = df.stack()
stacked
first  second   
bar    one     A    0.580665
               B   -1.265347
       two     A   -0.844164
               B   -0.803420
baz    one     A    0.144407
               B    2.093790
       two     A   -0.738359
               B   -0.243598
foo    one     A   -0.522004
               B    0.542619
       two     A    0.077826
               B   -0.919478
qux    one     A    0.719028
               B    0.484342
       two     A    0.804915
               B    0.208629
dtype: float64
stacked.index
MultiIndex([('bar', 'one', 'A'),
            ('bar', 'one', 'B'),
            ('bar', 'two', 'A'),
            ('bar', 'two', 'B'),
            ('baz', 'one', 'A'),
            ('baz', 'one', 'B'),
            ('baz', 'two', 'A'),
            ('baz', 'two', 'B'),
            ('foo', 'one', 'A'),
            ('foo', 'one', 'B'),
            ('foo', 'two', 'A'),
            ('foo', 'two', 'B'),
            ('qux', 'one', 'A'),
            ('qux', 'one', 'B'),
            ('qux', 'two', 'A'),
            ('qux', 'two', 'B')],
           names=['first', 'second', None])
# unstack moves the innermost index level to the columns; calling it twice also moves 'second'
stacked.unstack().unstack()
               A                   B
second       one       two       one       two
first
bar     0.580665 -0.844164 -1.265347 -0.803420
baz     0.144407 -0.738359  2.093790 -0.243598
foo    -0.522004  0.077826  0.542619 -0.919478
qux     0.719028  0.804915  0.484342  0.208629
# unstack(1) moves index level 1 ('second') to the columns
stacked.unstack(1)
second        one       two
first
bar   A  0.580665 -0.844164
      B -1.265347 -0.803420
baz   A  0.144407 -0.738359
      B  2.093790 -0.243598
foo   A -0.522004  0.077826
      B  0.542619 -0.919478
qux   A  0.719028  0.804915
      B  0.484342  0.208629
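
stack and unstack are inverses of each other: stack folds the column labels into a new innermost index level, and unstack moves an index level back out to the columns. A minimal sketch on a made-up frame with fixed values (`wide` and its contents are assumptions), so the round trip can be checked by eye:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']],
                                   names=['first', 'second'])
wide = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}, index=index)

long_form = wide.stack()                   # Series with a three-level index
restored = long_form.unstack()             # innermost level goes back to the columns
by_second = long_form.unstack('second')    # a level can also be chosen by name

print(long_form)
print(restored)
print(by_second)
```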

Pivot tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                    'B' : ['A', 'B', 'C'] * 4,
                    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                    'D' : np.random.randn(12),
                    'E' : np.random.randn(12)})
df
        A  B    C         D         E
0     one  A  foo -1.115244 -1.477782
1     one  B  foo  0.417247 -0.389403
2     two  C  foo -0.534743  1.876607
3   three  A  bar -0.222409  1.227989
4     one  B  bar -0.904903  0.965748
5     one  C  bar -0.950457 -0.727518
6     two  A  foo  0.110705  0.760229
7   three  B  foo -0.893722  0.098459
8     one  C  foo  1.270220  1.366383
9     one  A  bar -1.050292 -0.758947
10    two  B  bar  1.070277 -1.388617
11  three  C  bar -0.752369  0.311584
# Build a pivot table from the data in column D (values).
# The combinations of A's three categories (one, two, three) and B's three
# categories (A, B, C) form the two-level row index (index).
# C's two categories (foo, bar) form the column index (columns).
# Combinations that do not occur show up as NaN; if a cell has several values, their mean is used.
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B
one   A -1.050292 -1.115244
      B -0.904903  0.417247
      C -0.950457  1.270220
three A -0.222409       NaN
      B       NaN -0.893722
      C -0.752369       NaN
two   A       NaN  0.110705
      B  1.070277       NaN
      C       NaN -0.534743
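
The averaging of duplicate cells and the NaN for missing combinations are both defaults that can be changed. A sketch with made-up numbers (`records` is a hypothetical name), using the `aggfunc` and `fill_value` parameters of pd.pivot_table:

```python
import pandas as pd

records = pd.DataFrame({'A': ['one', 'one', 'two', 'two'],
                        'C': ['foo', 'foo', 'foo', 'bar'],
                        'D': [1.0, 3.0, 5.0, 7.0]})

# Default: cells with several values are averaged, missing combinations are NaN
mean_table = pd.pivot_table(records, values='D', index='A', columns='C')

# Sum instead of mean, and show 0 instead of NaN
sum_table = pd.pivot_table(records, values='D', index='A', columns='C',
                           aggfunc='sum', fill_value=0)

print(mean_table)
print(sum_table)
```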

Time series

rng = pd.date_range('20160301', periods=600, freq='s')
rng
DatetimeIndex(['2016-03-01 00:00:00', '2016-03-01 00:00:01',
               '2016-03-01 00:00:02', '2016-03-01 00:00:03',
               '2016-03-01 00:00:04', '2016-03-01 00:00:05',
               '2016-03-01 00:00:06', '2016-03-01 00:00:07',
               '2016-03-01 00:00:08', '2016-03-01 00:00:09',
               ...
               '2016-03-01 00:09:50', '2016-03-01 00:09:51',
               '2016-03-01 00:09:52', '2016-03-01 00:09:53',
               '2016-03-01 00:09:54', '2016-03-01 00:09:55',
               '2016-03-01 00:09:56', '2016-03-01 00:09:57',
               '2016-03-01 00:09:58', '2016-03-01 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
# Build a time series
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts
2016-03-01 00:00:00    356
2016-03-01 00:00:01    169
2016-03-01 00:00:02    189
2016-03-01 00:00:03     42
2016-03-01 00:00:04    295
                      ... 
2016-03-01 00:09:55    377
2016-03-01 00:09:56    248
2016-03-01 00:09:57     70
2016-03-01 00:09:58    490
2016-03-01 00:09:59      6
Freq: S, Length: 600, dtype: int32
# The how= keyword has been removed from resample(); the original call
# ts.resample('2Min', how='sum') fails with
# TypeError: resample() got an unexpected keyword argument 'how'
# Call the aggregation on the resampler object instead:
ts.resample('2min').sum()
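
In current pandas, resample returns a resampler object and the aggregation is a method call on it, rather than a `how=` argument. A minimal sketch, using a series of ones (an assumption made so the result is predictable) instead of the random data above:

```python
import numpy as np
import pandas as pd

rng = pd.date_range('20160301', periods=600, freq='s')    # 600 seconds = 10 minutes
ones = pd.Series(np.ones(len(rng), dtype=int), index=rng)

two_min_sum = ones.resample('2min').sum()     # five bins of 120 seconds each
two_min_mean = ones.resample('2min').mean()   # any aggregation method works the same way

print(two_min_sum)
print(two_min_mean)
```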
rng = pd.date_range('20160301', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2016-03-31   -0.501975
2016-04-30   -1.704262
2016-05-31   -2.127817
2016-06-30   -1.115246
2016-07-31   -0.750171
Freq: M, dtype: float64
ps = ts.to_period()
ps
2016-03   -0.501975
2016-04   -1.704262
2016-05   -2.127817
2016-06   -1.115246
2016-07   -0.750171
Freq: M, dtype: float64
# Date arithmetic: subtracting timestamps and adding a time delta
print(pd.Timestamp('20160301') - pd.Timestamp('20160201'))
print(pd.Timestamp('20160301') + pd.Timedelta(days=5))
29 days 00:00:00
2016-03-06 00:00:00

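Categorical data

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e
# Add a grade column by converting raw_grade to the category dtype
# (not explained in detail here)
df["grade"] = df["raw_grade"].astype("category")
df
   id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e
df.grade
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
# List the categories (also not explained in detail)
df["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
# Rename the categories. The original assigned to df["grade"].cat.categories directly,
# which pandas 2.0 no longer allows; rename_categories is the supported way.
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
# Sorting follows the category order (the same order as raw_grade),
# not the alphabetical order of the labels (also not explained in detail)
df.sort_values(by='grade', ascending=True)
   id raw_grade      grade
0   1         a  very good
3   4         a  very good
4   5         a  very good
1   2         b       good
2   3         b       good
5   6         e   very bad
df.groupby("grade").size()
grade
very good    3
good         2
very bad     1
dtype: int64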
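
The sort order of a categorical column comes from the order of its categories, and that order can be set explicitly. The sketch below uses a made-up three-row frame (`grades` is a hypothetical name) and also adds a category, "bad", that never occurs in the data, to show that groupby can still report it with a count of zero:

```python
import pandas as pd

grades = pd.DataFrame({'id': [1, 2, 3], 'raw_grade': ['b', 'a', 'e']})
grades['grade'] = grades['raw_grade'].astype('category')

# Categories start out as ['a', 'b', 'e']; rename them position by position
grades['grade'] = grades['grade'].cat.rename_categories(['very good', 'good', 'very bad'])

# Declare the full set of categories in the desired order, worst to best
grades['grade'] = grades['grade'].cat.set_categories(
    ['very bad', 'bad', 'good', 'very good'], ordered=True)

sorted_grades = grades.sort_values(by='grade')                # sorted by category order
counts = grades.groupby('grade', observed=False).size()      # unused categories count as 0

print(sorted_grades)
print(counts)
```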

Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('20000101', periods=1000))
ts = ts.cumsum()
ts
2000-01-01     1.465772
2000-01-02     0.590570
2000-01-03     1.463768
2000-01-04     2.059164
2000-01-05     3.997711
                ...    
2002-09-22    21.165409
2002-09-23    21.138746
2002-09-24    21.833248
2002-09-25    22.712132
2002-09-26    23.005704
Freq: D, Length: 1000, dtype: float64
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x16bb1ec3668>

[Figure: line plot produced by ts.plot()]

Reading and writing data

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
           A         B         C         D
0  -0.276452 -0.043881  0.887262  0.573018
1   0.118645 -0.812897 -2.111207  0.335636
2   1.424447 -1.279426 -0.207269 -0.290550
3  -1.023856 -0.190659  0.371154 -0.437665
4  -0.745313 -1.551667 -0.646987  0.094488
..       ...       ...       ...       ...
95 -0.395240  0.577842  1.439321 -0.606708
96  0.373816  0.282914  1.361671 -0.080537
97 -0.852696  0.088553  0.490996  0.064302
98 -1.426436  0.619357 -1.011322  0.367762
99  0.367029 -0.756744 -0.405919 -0.834062

100 rows × 4 columns

df.to_csv('data.csv')
# %ls is an IPython magic that lists the files in the current directory;
# data.csv now appears in the listing (all other entries omitted here)
%ls
...
2020/07/05  07:53             8,258 data.csv
...
print(pd.read_csv('data.csv'))
# Use the first column (position 0) as the index
print(pd.read_csv('data.csv', index_col=0))
    Unnamed: 0         A         B         C         D
0            0 -0.276452 -0.043881  0.887262  0.573018
1            1  0.118645 -0.812897 -2.111207  0.335636
2            2  1.424447 -1.279426 -0.207269 -0.290550
3            3 -1.023856 -0.190659  0.371154 -0.437665
4            4 -0.745313 -1.551667 -0.646987  0.094488
..         ...       ...       ...       ...       ...
95          95 -0.395240  0.577842  1.439321 -0.606708
96          96  0.373816  0.282914  1.361671 -0.080537
97          97 -0.852696  0.088553  0.490996  0.064302
98          98 -1.426436  0.619357 -1.011322  0.367762
99          99  0.367029 -0.756744 -0.405919 -0.834062

[100 rows x 5 columns]
           A         B         C         D
0  -0.276452 -0.043881  0.887262  0.573018
1   0.118645 -0.812897 -2.111207  0.335636
2   1.424447 -1.279426 -0.207269 -0.290550
3  -1.023856 -0.190659  0.371154 -0.437665
4  -0.745313 -1.551667 -0.646987  0.094488
..       ...       ...       ...       ...
95 -0.395240  0.577842  1.439321 -0.606708
96  0.373816  0.282914  1.361671 -0.080537
97 -0.852696  0.088553  0.490996  0.064302
98 -1.426436  0.619357 -1.011322  0.367762
99  0.367029 -0.756744 -0.405919 -0.834062

[100 rows x 4 columns]
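
The extra "Unnamed: 0" column appears because to_csv writes the row index by default. Besides reading it back with index_col=0, the index can be left out when writing. A small sketch, using an in-memory buffer instead of a file (an assumption made to keep the example self-contained):

```python
import io
import pandas as pd

table = pd.DataFrame({'A': [1.5, 2.5], 'B': [3, 4]})

buffer = io.StringIO()
table.to_csv(buffer, index=False)   # do not write the row index
buffer.seek(0)                      # rewind before reading
round_trip = pd.read_csv(buffer)

print(buffer.getvalue())
print(round_trip)
```

With `index=False` the file contains only the data columns, so no index_col argument is needed when reading it back.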
