[Python] pandas DataFrame Usage Explained in Detail

Today I've put together some notes on DataFrame usage in pandas. If you find them helpful, please like and follow~

0. Preliminaries

import pandas as pd
import numpy as np

1. Creating a one-dimensional array (Series)

x=pd.Series([1,2,3,np.nan])

timelist = pd.date_range(start='20200101',end='20201231',freq='W')
print(timelist)
DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
               '2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
               '2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
               '2020-03-29', '2020-04-05', '2020-04-12', '2020-04-19',
               '2020-04-26', '2020-05-03', '2020-05-10', '2020-05-17',
               '2020-05-24', '2020-05-31', '2020-06-07', '2020-06-14',
               '2020-06-21', '2020-06-28', '2020-07-05', '2020-07-12',
               '2020-07-19', '2020-07-26', '2020-08-02', '2020-08-09',
               '2020-08-16', '2020-08-23', '2020-08-30', '2020-09-06',
               '2020-09-13', '2020-09-20', '2020-09-27', '2020-10-04',
               '2020-10-11', '2020-10-18', '2020-10-25', '2020-11-01',
               '2020-11-08', '2020-11-15', '2020-11-22', '2020-11-29',
               '2020-12-06', '2020-12-13', '2020-12-20', '2020-12-27'],
              dtype='datetime64[ns]', freq='W-SUN')

2. Creating a DataFrame

a1=pd.DataFrame(np.random.randn(12,4),index=list(range(5,17)),columns=list('ABCD'))
print(a1)

a2=pd.DataFrame([np.random.randint(1,100,4) for i in range(12)],columns=list('ABCD'))
print(a2)

# build a DataFrame from a dict
a3 = pd.DataFrame({'A': np.random.randint(1, 100, 4),
                   'B': pd.date_range(start='20130101', periods=4, freq='D'),
                   'C': pd.Series([1, 2, 3, 4], dtype='float64'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(['test', 'train'] * 2),
                   'F': 'f00'})
print(a3)

# If there are too many rows or columns, pandas truncates the display and shows ellipses. You can change this with the options below:

pd.set_option('display.max_rows',5)
pd.set_option('display.max_columns',5)
a3 = pd.DataFrame({'A': np.random.randint(1, 100, 4),
                   'B': pd.date_range(start='20130101', periods=4, freq='D'),
                   'C': pd.Series([1, 2, 3, 4], dtype='float64'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(['test', 'train'] * 2),
                   'F': 'f00'})
print(a3)
           A         B         C         D
5   0.704611  0.431369 -0.828452  0.169185
6   0.780436  0.938247 -1.656147 -0.243374
..       ...       ...       ...       ...
15  0.333885  1.823517  0.483210 -0.536295
16 -0.424291  0.305974  0.587208 -0.639165

[12 rows x 4 columns]
     A   B   C   D
0   13  45  74  32
1   60  43  45  50
..  ..  ..  ..  ..
10  61  98  95   7
11  74   8  24  44

[12 rows x 4 columns]
    A          B  ...      E    F
0  56 2013-01-01  ...   test  f00
1  86 2013-01-02  ...  train  f00
2  12 2013-01-03  ...   test  f00
3  45 2013-01-04  ...  train  f00

[4 rows x 6 columns]
    A          B  ...      E    F
0  71 2013-01-01  ...   test  f00
1  74 2013-01-02  ...  train  f00
2  13 2013-01-03  ...   test  f00
3  92 2013-01-04  ...  train  f00

[4 rows x 6 columns]
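If you later want to undo the display options set above, pandas can restore the defaults with reset_option; a minimal sketch using the same option names:

pd.reset_option('display.max_rows')        # restore one display option to its default
pd.reset_option('display.max_columns')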

3. Viewing two-dimensional data

3.1 Viewing rows

# sample data
df = pd.DataFrame({'A': np.random.randint(1, 100, 4),
                   'B': pd.date_range(start='20130101', periods=4, freq='D'),
                   'C': pd.Series([1, 2, 3, 4], dtype='float64'),
                   'D': np.array([3] * 4, dtype='int32'),
                   'E': pd.Categorical(['test', 'train'] * 2),
                   'F': 'f00'})
df.head()
    A          B  ...      E    F
0  81 2013-01-01  ...   test  f00
1   6 2013-01-02  ...  train  f00
2  27 2013-01-03  ...   test  f00
3  97 2013-01-04  ...  train  f00

4 rows × 6 columns

df.head(3)
    A          B  ...      E    F
0  81 2013-01-01  ...   test  f00
1   6 2013-01-02  ...  train  f00
2  27 2013-01-03  ...   test  f00

3 rows × 6 columns

df.tail(2)
    A          B  ...      E    F
2  27 2013-01-03  ...   test  f00
3  97 2013-01-04  ...  train  f00

2 rows × 6 columns

3.2 Viewing the index, column names, and values

df.index
RangeIndex(start=0, stop=4, step=1)
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df.values
array([[81, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'f00'],
       [6, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'f00'],
       [27, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'f00'],
       [97, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'f00']],
      dtype=object)

3.3 Viewing summary statistics

df.describe()
           A     C    D
count   4.00  4.00  4.0
mean   52.75  2.50  3.0
...      ...   ...  ...
75%    85.00  3.25  3.0
max    97.00  4.00  3.0

8 rows × 3 columns

4. Operations on two-dimensional data

4.1 Transposing

df.T
                     0                    1                    2                    3
A                   81                    6                   27                   97
B  2013-01-01 00:00:00  2013-01-02 00:00:00  2013-01-03 00:00:00  2013-01-04 00:00:00
..                 ...                  ...                  ...                  ...
E                 test                train                 test                train
F                  f00                  f00                  f00                  f00

6 rows × 4 columns

4.2 Sorting

df.sort_index(axis=0,ascending=False)        # sort by the index
    A          B  ...      E    F
3  97 2013-01-04  ...  train  f00
2  27 2013-01-03  ...   test  f00
1   6 2013-01-02  ...  train  f00
0  81 2013-01-01  ...   test  f00

4 rows × 6 columns

df.sort_index(axis=1,ascending=False)        # sort by the column labels
     F      E  ...          B   A
0  f00   test  ... 2013-01-01  81
1  f00  train  ... 2013-01-02   6
2  f00   test  ... 2013-01-03  27
3  f00  train  ... 2013-01-04  97

4 rows × 6 columns

df.sort_values(by='A')    # sort by the values in column A
    A          B  ...      E    F
1   6 2013-01-02  ...  train  f00
2  27 2013-01-03  ...   test  f00
0  81 2013-01-01  ...   test  f00
3  97 2013-01-04  ...  train  f00

4 rows × 6 columns

df.sort_values(by=['A','B'], ascending=[True, False])    # sort by several columns; note that ascending takes booleans, one per column
    A          B  ...      E    F
1   6 2013-01-02  ...  train  f00
2  27 2013-01-03  ...   test  f00
0  81 2013-01-01  ...   test  f00
3  97 2013-01-04  ...  train  f00

4 rows × 6 columns

4.3 Selecting data

4.3.1 Selecting columns

df.A
0    81
1     6
2    27
3    97
Name: A, dtype: int32
df['A']
0    81
1     6
2    27
3    97
Name: A, dtype: int32
6 in df['A']          # df['A'] is a Series; 'in' checks its index, not its values
False
6 in df['A'].values
True
df.loc[:,['A','C']]    
    A    C
0  81  1.0
1   6  2.0
2  27  3.0
3  97  4.0

4.3.2 Selecting rows

df[0:2]     
    A          B  ...      E    F
0  81 2013-01-01  ...   test  f00
1   6 2013-01-02  ...  train  f00

2 rows × 6 columns

df.iloc[3]     # select the row at integer position 3 (the fourth row)
A                     97
B    2013-01-04 00:00:00
            ...         
E                  train
F                    f00
Name: 3, Length: 6, dtype: object

4.3.3 Selecting rows and columns

df.loc[[0,2],['A','B']]
    A          B
0  81 2013-01-01
2  27 2013-01-03
df.iloc[0:3,0:4]           # select by integer position
    A          B    C  D
0  81 2013-01-01  1.0  3
1   6 2013-01-02  2.0  3
2  27 2013-01-03  3.0  3
df.iloc[[1,3],[2,4]]
     C      E
1  2.0  train
3  4.0  train

4.3.4 Looking up a single value

df.at[0,'A']
81
df.iloc[0,0]
81

4.3.5 Conditional selection

4.3.5.1 Simple conditions
df[df.A>50]
    A          B  ...      E    F
0  81 2013-01-01  ...   test  f00
3  97 2013-01-04  ...  train  f00

2 rows × 6 columns

df[df['E']=='test']
    A          B  ...     E    F
0  81 2013-01-01  ...  test  f00
2  27 2013-01-03  ...  test  f00

2 rows × 6 columns

df[df['A'].isin([20,81])]      # keep the rows where A is 20 or 81
    A          B  ...     E    F
0  81 2013-01-01  ...  test  f00

1 rows × 6 columns

pd.set_option('display.max_columns',10)     # widen the display so all columns are shown

df.nlargest(3,['C'])        # the 3 rows with the largest values in column C
    A          B    C  D      E    F
3  97 2013-01-04  4.0  3  train  f00
2  27 2013-01-03  3.0  3   test  f00
1   6 2013-01-02  2.0  3  train  f00
4.3.5.2 Selecting rows by their sum
# sample data
dff = pd.DataFrame({'A':[1,2,3,4],'B':[10,20,8,40]})
dff
   A   B
0  1  10
1  2  20
2  3   8
3  4  40
dff[dff.sum(axis=1)==11]      # keep the rows whose row-wise sum equals 11
   A   B
0  1  10
2  3   8

5. Modifying data

5.1 Simple in-place modification

df       # show the data before modification
    A          B    C  D      E    F
0  81 2013-01-01  1.0  3   test  f00
1   6 2013-01-02  2.0  3  train  f00
2  27 2013-01-03  3.0  3   test  f00
3  97 2013-01-04  4.0  3  train  f00
df.iat[0,2]=3    # modify the value at a given row/column position
df.loc[:,'D']=np.random.randint(50,60,4)    # replace the values of one column
df['C']=-df['C']   # negate a column
df        # the result after modification
    A          B    C   D      E    F
0  81 2013-01-01 -3.0  58   test  f00
1   6 2013-01-02 -2.0  58  train  f00
2  27 2013-01-03 -3.0  55   test  f00
3  97 2013-01-04 -4.0  54  train  f00
# copy the data into a new DataFrame named dff
from copy import deepcopy
dff = deepcopy(df)
dff
    A          B    C   D      E    F
0  81 2013-01-01 -3.0  58   test  f00
1   6 2013-01-02 -2.0  58  train  f00
2  27 2013-01-03 -3.0  55   test  f00
3  97 2013-01-04 -4.0  54  train  f00
dff["C"]=dff['C']**2
dff
    A          B     C   D      E    F
0  81 2013-01-01   9.0  58   test  f00
1   6 2013-01-02   4.0  58  train  f00
2  27 2013-01-03   9.0  55   test  f00
3  97 2013-01-04  16.0  54  train  f00
dff.loc[dff.C==9,'D']=100      # modify column D only in the rows where C equals 9
dff
    A          B     C    D      E    F
0  81 2013-01-01   9.0  100   test  f00
1   6 2013-01-02   4.0   58  train  f00
2  27 2013-01-03   9.0  100   test  f00
3  97 2013-01-04  16.0   54  train  f00

5.2 Replacing values

# sample data
data = pd.DataFrame({"k1":['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
pd.set_option('display.max_rows',10)
data.replace(1,5)         # returns a new object; data itself is not changed
    k1  k2
0  one   5
1  one   5
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
data.replace([1,2],[5,6])  # 1->5, 2->6; again returns a new object without changing data
    k1  k2
0  one   5
1  one   5
2  one   6
3  two   3
4  two   3
5  two   4
6  two   4
data.replace({1:5,'one':'ONE'})  # a dict specifies the replacement mapping; returns a new object without changing data
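To actually keep the replaced values, assign the result back or pass inplace=True; a minimal sketch (either line alone is enough):

data = data.replace({1:5, 'one':'ONE'})            # reassign to keep the result
# data.replace({1:5, 'one':'ONE'}, inplace=True)   # or replace in place instead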

6. Deleting data

# sample data
data = pd.DataFrame({"k1":['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
data.drop(5,axis=0)      # drop the row with label 5; returns a new object, data is unchanged
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
6  two   4
data.drop(3,inplace=True)    # drop in place; this returns None
data       # show the result after the deletion
    k1  k2
0  one   1
1  one   1
2  one   2
4  two   3
5  two   4
6  two   4
data.drop('k1',axis=1)   # drop a column (returns a new object)
   k2
0   1
1   1
2   2
4   3
5   4
6   4

7. Adding data

# sample data
data = pd.DataFrame({'姓名':['张三','李四','王五','赵六','刘七','孙八'],
                     'age':np.random.randint(20,50,6),
                     '成绩':[86,92,86,60,78,78]})

7.1 Adding rank columns

data['rank'] = data['age'].rank()
data
   姓名  age  成绩  rank
0  张三   27  86   1.0
1  李四   42  92   3.0
2  王五   36  86   2.0
3  赵六   49  60   6.0
4  刘七   48  78   5.0
5  孙八   46  78   4.0
data['倒数排名'] = data['成绩'].rank(method='min') # rank counted from the bottom; tied values get the smallest rank
data
   姓名  age  成绩  rank  倒数排名
0  张三   27  86   1.0   4.0
1  李四   42  92   3.0   6.0
2  王五   36  86   2.0   4.0
3  赵六   49  60   6.0   1.0
4  刘七   48  78   5.0   2.0
5  孙八   46  78   4.0   2.0
data['正数排名'] = data['成绩'].rank(method='min',ascending=False) # rank counted from the top; tied values get the smallest rank
data
   姓名  age  成绩  rank  倒数排名  正数排名
0  张三   27  86   1.0   4.0   2.0
1  李四   42  92   3.0   6.0   1.0
2  王五   36  86   2.0   4.0   2.0
3  赵六   49  60   6.0   1.0   6.0
4  刘七   48  78   5.0   2.0   4.0
5  孙八   46  78   4.0   2.0   4.0
data['正数排名2'] = data['成绩'].rank(method='max',ascending=False) # rank counted from the top; tied values get the largest rank
data
   姓名  age  成绩  rank  倒数排名  正数排名  正数排名2
0  张三   27  86   1.0   4.0   2.0    3.0
1  李四   42  92   3.0   6.0   1.0    1.0
2  王五   36  86   2.0   4.0   2.0    3.0
3  赵六   49  60   6.0   1.0   6.0    6.0
4  刘七   48  78   5.0   2.0   4.0    5.0
5  孙八   46  78   4.0   2.0   4.0    5.0
data['排名3'] = data['成绩'].rank(method='average')
data               # rank counted from the bottom; tied values get the average rank

   姓名  age  成绩  rank  倒数排名  正数排名  正数排名2  排名3
0  张三   27  86   1.0   4.0   2.0    3.0   4.5
1  李四   42  92   3.0   6.0   1.0    1.0   6.0
2  王五   36  86   2.0   4.0   2.0    3.0   4.5
3  赵六   49  60   6.0   1.0   6.0    6.0   1.0
4  刘七   48  78   5.0   2.0   4.0    5.0   2.5
5  孙八   46  78   4.0   2.0   4.0    5.0   2.5

7.2 Row and column sums

# sample data
dff = pd.DataFrame({'A':[1,2,3,4], 'B':[10,20,8,40]})
dff
   A   B
0  1  10
1  2  20
2  3   8
3  4  40
dff['col_Sum']=dff.apply(sum,axis=1)  # sum each row and append the result as a new column; axis=1 works across the columns
dff
   A   B  col_Sum
0  1  10       11
1  2  20       22
2  3   8       11
3  4  40       44

API: DataFrame.apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwds)

apply() works on DataFrame and Series objects. It is mainly used for aggregation, and makes it easy to run both built-in and custom functions over the data or over groups.

dff
   A   B  col_Sum
0  1  10       11
1  2  20       22
2  3   8       11
3  4  40       44
dff.loc['row_Sum'] = dff.apply(sum,axis=0)    # sum each column and append it as a new row
dff
          A   B  col_Sum
0         1  10       11
1         2  20       22
2         3   8       11
3         4  40       44
row_Sum  10  78       88
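As the description of apply() above says, custom functions work just like built-ins; a minimal sketch (range_span is only an illustrative name, not part of the original example):

def range_span(col):
    # custom aggregation for one column: largest value minus smallest value
    return col.max() - col.min()

dff[['A', 'B']].apply(range_span, axis=0)   # run the custom function on each column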

8. Reindexing and adding columns

# sample data
da = pd.DataFrame({'A3':[2,3,4,5],'A1':[1,2,5,6]})
da
   A3  A1
0   2   1
1   3   2
2   4   5
3   5   6
da.reindex(columns=['A1',"A2","A3",'A4'])
   A1  A2  A3  A4
0   1 NaN   2 NaN
1   2 NaN   3 NaN
2   5 NaN   4 NaN
3   6 NaN   5 NaN
pd.concat([da, pd.DataFrame(columns=list('DE'))])
    A3   A1    D    E
0  2.0  1.0  NaN  NaN
1  3.0  2.0  NaN  NaN
2  4.0  5.0  NaN  NaN
3  5.0  6.0  NaN  NaN

9. Handling missing values

# original data
df.index=['zhang','li','zhou','wang'] # reuse the earlier df, replacing its index
df
        A          B    C   D      E    F
zhang  81 2013-01-01 -3.0  58   test  f00
li      6 2013-01-02 -2.0  58  train  f00
zhou   27 2013-01-03 -3.0  55   test  f00
wang   97 2013-01-04 -4.0  54  train  f00
df1 = df.reindex(columns=list(df.columns)+['G'])
df1
        A          B    C   D      E    F    G
zhang  81 2013-01-01 -3.0  58   test  f00  NaN
li      6 2013-01-02 -2.0  58  train  f00  NaN
zhou   27 2013-01-03 -3.0  55   test  f00  NaN
wang   97 2013-01-04 -4.0  54  train  f00  NaN
df1.iat[0,6] = 3            # set the value of a single element

9.1 Detecting missing values

pd.isnull(df1)      # test for missing values
           A      B      C      D      E      F      G
zhang  False  False  False  False  False  False  False
li     False  False  False  False  False  False   True
zhou   False  False  False  False  False  False   True
wang   False  False  False  False  False  False   True
df1.dropna()        # return only the rows without missing values; df1 itself is unchanged
        A          B    C   D     E    F    G
zhang  81 2013-01-01 -3.0  58  test  f00  3.0

9.2 Filling with a specified value

# make a copy first
df2 = deepcopy(df1)
df1['G'].fillna(5,inplace=True)     # fill missing values with a specified value
df1
        A          B    C   D      E    F    G
zhang  81 2013-01-01 -3.0  58   test  f00  3.0
li      6 2013-01-02 -2.0  58  train  f00  5.0
zhou   27 2013-01-03 -3.0  55   test  f00  5.0
wang   97 2013-01-04 -4.0  54  train  f00  5.0
df2.iat[2,5] = np.NAN
df2
        A          B    C   D      E    F    G
zhang  81 2013-01-01 -3.0  58   test  f00  3.0
li      6 2013-01-02 -2.0  58  train  f00  NaN
zhou   27 2013-01-03 -3.0  55   test  NaN  NaN
wang   97 2013-01-04 -4.0  54  train  f00  NaN
df2.dropna(thresh=6)  # keep only the rows with at least 6 non-missing values
        A          B    C   D      E    F    G
zhang  81 2013-01-01 -3.0  58   test  f00  3.0
li      6 2013-01-02 -2.0  58  train  f00  NaN
wang   97 2013-01-04 -4.0  54  train  f00  NaN
df2.iat[3, 6] = 8
df2
        A          B    C   D      E    F    G
zhang  81 2013-01-01 -3.0  58   test  f00  3.0
li      6 2013-01-02 -2.0  58  train  f00  NaN
zhou   27 2013-01-03 -3.0  55   test  NaN  NaN
wang   97 2013-01-04 -4.0  54  train  f00  8.0

9.3 Filling with the mean

df2.fillna({'F':'foo', 'G':df2['G'].mean()}) # fill F with 'foo' and G with its column mean
        A          B    C   D      E    F    G
zhang  81 2013-01-01 -3.0  58   test  f00  3.0
li      6 2013-01-02 -2.0  58  train  f00  5.5
zhou   27 2013-01-03 -3.0  55   test  foo  5.5
wang   97 2013-01-04 -4.0  54  train  f00  8.0

9.4 Forward and backward filling

# sample data
dft = pd.DataFrame({'a':[1,np.NaN, np.NaN,3]})
dft
     a
0  1.0
1  NaN
2  NaN
3  3.0
dft.fillna(method='pad')            # forward fill: propagate the last valid value before each gap
     a
0  1.0
1  1.0
2  1.0
3  3.0
dft.fillna(method='bfill')    # back fill: propagate the first valid value after each gap
     a
0  1.0
1  3.0
2  3.0
3  3.0
dft.fillna(method='bfill',limit = 1)    # fill at most one value per gap
     a
0  1.0
1  NaN
2  3.0
3  3.0

10. Handling duplicates

# sample data
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,
                     'k2':[1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
data.duplicated()   # flag the rows that duplicate an earlier row
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool
data.drop_duplicates()                     # returns a new object with duplicate rows removed
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4
data.drop_duplicates(['k1'])               # drop duplicates judged by column k1, keeping the first occurrence
    k1  k2
0  one   1
3  two   3
data.drop_duplicates(['k1'], keep='last')  # keep the last occurrence instead
    k1  k2
2  one   2
6  two   4
data = pd.Series([3,3,3,2,1,1,1,0])
data.drop_duplicates(keep=False)  # keep only the values that appear exactly once
3    2
7    0
dtype: int64

11. Handling outliers

# sample data
data = pd.DataFrame(np.random.randn(500, 4))
data.describe()
                0           1           2           3
count  500.000000  500.000000  500.000000  500.000000
mean     0.000003   -0.055441   -0.144464    0.115541
std      1.040113    1.050816    1.024364    0.993715
min     -2.682324   -3.093345   -3.251424   -2.436486
25%     -0.714944   -0.764660   -0.833083   -0.571401
50%      0.000479   -0.094025   -0.174163    0.085931
75%      0.766728    0.633535    0.580219    0.798915
max      2.868074    3.444375    3.026216    3.491998
col2 = data[2]      # the column labeled 2
col2[col2>2.5]
454    3.026216
Name: 2, dtype: float64
data[(data > 3).any(axis=1)]      # rows in which any column is greater than 3
            0         1         2         3
9    0.215147  0.142431  0.975679  3.491998
318 -1.018064  3.367820 -0.875176 -0.007749
350 -0.807584  3.444375  1.055223  0.759119
454 -0.260401  0.478256  3.026216 -1.251252
456  0.951752 -0.646577 -1.006280  3.082105
data[np.abs(data)>2.5] = np.sign(data) * 2.5    # clip: values beyond ±2.5 are replaced by ±2.5
data.describe()
                0           1           2           3
count  500.000000  500.000000  500.000000  500.000000
mean    -0.000456   -0.056737   -0.143212    0.111439
std      1.034136    1.032042    1.015410    0.982327
min     -2.500000   -2.500000   -2.500000   -2.436486
25%     -0.714944   -0.764660   -0.833083   -0.571401
50%      0.000479   -0.094025   -0.174163    0.085931
75%      0.766728    0.633535    0.580219    0.798915
max      2.500000    2.500000    2.500000    2.500000

About the sign() function
sign() is a NumPy function that returns the sign of a number: -1 for negative values, 0 for zero, and +1 for positive values.
(figure: illustration of the sign function)
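In place of the figure, a quick sketch of what np.sign returns:

np.sign([-5.2, 0, 3])
array([-1.,  0.,  1.])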

12. Mapping

# sample data
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,
                     'k2':[1, 1, 2, 3, 3, 4, 4]})
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

12.1 Changing values with map

data['k1'] = data['k1'].map(str.upper)   # map using a function
data
    k1  k2
0  ONE   1
1  ONE   1
2  ONE   2
3  TWO   3
4  TWO   3
5  TWO   4
6  TWO   4

map() is a Series method; a DataFrame has no map() (for element-wise operations on a whole DataFrame, see applymap() in section 12.4). map() applies a given function to every element of a Series.
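A quick sketch of mapping a custom named function over a Series (first_letter is only an illustrative name):

def first_letter(s):
    # element-wise custom function: keep the first character of each string
    return s[0]

data['k1'].map(first_letter)   # first_letter is applied to every element of the Series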

data['k1'] = data['k1'].map({'ONE':'one','TWO':'two'})   # map using a dict
data
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4
data['k2'] = data['k2'].map(lambda x: x+5)    # map using a lambda expression
data
    k1  k2
0  one   6
1  one   6
2  one   7
3  two   8
4  two   8
5  two   9
6  two   9

12.2 Changing the index and column names with map

data.index = data.index.map(lambda x:x+5)   # modify the index
data
     k1  k2
5   one   6
6   one   6
7   one   7
8   two   8
9   two   8
10  two   9
11  two   9
data.columns = data.columns.map(str.upper)  # modify the column names
data
     K1  K2
5   one   6
6   one   6
7   one   7
8   two   8
9   two   8
10  two   9
11  two   9
data.rename(index=lambda x:x+5 , columns= str.lower ,inplace=True)   # rename in place
data
     k1  k2
10  one   6
11  one   6
12  one   7
13  two   8
14  two   8
15  two   9
16  two   9

The pandas rename() method renames index labels or column names. Columns can also be renamed by assigning dataframe.columns = [list], but that is less flexible: even if only one column changes, the complete list of column names must be supplied, and this approach cannot be used for index labels at all.

Usage: DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)
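A minimal sketch of rename() with explicit mappings (the demo DataFrame and labels are only illustrative); only the labels you list are changed, and a new object is returned unless inplace=True is passed:

demo = pd.DataFrame({'old_a': [1, 2], 'old_b': [3, 4]})
demo.rename(columns={'old_a': 'a'}, index={0: 'first'})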

12.3 Computing deviations with apply

# sample data
dff = pd.DataFrame({'A':[1,2,3,4], 'B':[10,20,8,40]})
dff
   A   B
0  1  10
1  2  20
2  3   8
3  4  40
dff.apply(lambda x:x-x.mean(),axis=1)   # deviation from the row mean (axis=1 works across columns)
      A     B
0  -4.5   4.5
1  -9.0   9.0
2  -2.5   2.5
3 -18.0  18.0
dff.apply(lambda x:x-x.mean(),axis=0)   # deviation from the column mean (axis=0 works down the rows)
     A     B
0 -1.5  -9.5
1 -0.5   0.5
2  0.5 -11.5
3  1.5  20.5

12.4 Formatting data in bulk

dff.applymap(lambda x: '%.1f'%x)        # returns a new object; the original values are unchanged
     A     B
0  1.0  10.0
1  2.0  20.0
2  3.0   8.0
3  4.0  40.0
dff.B=dff.B.map(lambda x: '%.0f'%x)      # assigning the result back changes dff itself
dff
   A   B
0  1  10
1  2  20
2  3   8
3  4  40

13. Discretizing data

13.1 Binning

# sample data
data13 = np.random.randint(0,100,10)
data13
array([33, 92, 42, 42, 57, 99,  4, 11, 57, 94])
category = [0,25,50,100]
pd.cut(data13,category)
[(25, 50], (50, 100], (25, 50], (25, 50], (50, 100], (50, 100], (0, 25], (0, 25], (50, 100], (50, 100]]
Categories (3, interval[int64]): [(0, 25] < (25, 50] < (50, 100]]
pd.cut(data13, category, right=False)  # left-closed, right-open intervals
[[25, 50), [50, 100), [25, 50), [25, 50), [50, 100), [50, 100), [0, 25), [0, 25), [50, 100), [50, 100)]
Categories (3, interval[int64]): [[0, 25) < [25, 50) < [50, 100)]
labels = ['low', 'middle', 'high']
pd.cut(data13, category, right=False, labels=labels)
[middle, high, middle, middle, high, high, low, low, high, high]
Categories (3, object): [low < middle < high]
pd.cut(data13, 4)                           # 4 equal-width bins
[(27.75, 51.5], (75.25, 99.0], (27.75, 51.5], (27.75, 51.5], (51.5, 75.25], (75.25, 99.0], (3.905, 27.75], (3.905, 27.75], (51.5, 75.25], (75.25, 99.0]]
Categories (4, interval[float64]): [(3.905, 27.75] < (27.75, 51.5] < (51.5, 75.25] < (75.25, 99.0]]
pd.qcut(data13, 4)                          # quartile bins (equal-frequency)
[(3.999, 35.25], (83.25, 99.0], (35.25, 49.5], (35.25, 49.5], (49.5, 83.25], (83.25, 99.0], (3.999, 35.25], (3.999, 35.25], (49.5, 83.25], (83.25, 99.0]]
Categories (4, interval[float64]): [(3.999, 35.25] < (35.25, 49.5] < (49.5, 83.25] < (83.25, 99.0]]

14. Frequency counts and shifting

pd.value_counts([1,1,3,3,3,3,2,1])
3    4
1    3
2    1
dtype: int64
pd.value_counts([1,1,3,3,3,3,2,1], sort=False)      # do not sort by count
1    3
2    1
3    4
dtype: int64
pd.value_counts([1,1,3,3,3,3,2,1], ascending=True)       # sort by count, ascending
2    1
1    3
3    4
dtype: int64
df1.shift(1)                      # shift the data down one row; a negative argument shifts up
          A          B    C     D      E    F    G
zhang   NaN        NaT  NaN   NaN    NaN  NaN  NaN
li     81.0 2013-01-01 -3.0  58.0   test  f00  3.0
zhou    6.0 2013-01-02 -2.0  58.0  train  f00  5.0
wang   27.0 2013-01-03 -3.0  55.0   test  f00  5.0
df1
        A          B    C   D      E    F    G
zhang  81 2013-01-01 -3.0  58   test  f00  3.0
li      6 2013-01-02 -2.0  58  train  f00  5.0
zhou   27 2013-01-03 -3.0  55   test  f00  5.0
wang   97 2013-01-04 -4.0  54  train  f00  5.0
df1['D'].value_counts()           # frequency count of each value in column D
58    2
55    1
54    1
Name: D, dtype: int64

15. Splitting and concatenating data

df14 = pd.DataFrame(np.random.randn(10, 4))
df14
          0         1         2         3
0  1.356177 -1.594548 -0.744250 -0.561444
1 -0.359151 -0.638286 -0.297279  0.598555
2 -1.528712 -0.893813 -1.922199  0.444479
3 -0.159882 -0.082586  0.467607 -2.322559
4  0.696892 -0.015499 -0.587565 -0.612759
5  1.539294  0.220061  1.313642  0.621169
6 -1.078099 -0.965604  0.580988  1.752221
7 -0.557355 -0.440118  0.874022  0.910304
8  2.523452 -0.010282 -0.078567 -0.588690
9  0.177979  0.432169 -0.433072  1.379981
p1 = df14[:3]                   # split off a block of rows
p1
          0         1         2         3
0  1.356177 -1.594548 -0.744250 -0.561444
1 -0.359151 -0.638286 -0.297279  0.598555
2 -1.528712 -0.893813 -1.922199  0.444479
p2 = df14[3:7]
p2
          0         1         2         3
3 -0.159882 -0.082586  0.467607 -2.322559
4  0.696892 -0.015499 -0.587565 -0.612759
5  1.539294  0.220061  1.313642  0.621169
6 -1.078099 -0.965604  0.580988  1.752221
p3 = df14[7:]
df14_ = pd.concat([p1, p2, p3])  # concatenate the pieces back together
df14_ == df14     # element-wise equality test; returns a True/False array
      0     1     2     3
0  True  True  True  True
1  True  True  True  True
2  True  True  True  True
3  True  True  True  True
4  True  True  True  True
5  True  True  True  True
6  True  True  True  True
7  True  True  True  True
8  True  True  True  True
9  True  True  True  True

16. Grouped computations

# sample data
df4 = pd.DataFrame({'A':np.random.randint(1,5,8),
                    'B':np.random.randint(10,15,8),
                    'C':np.random.randint(20,30,8),
                    'D':np.random.randint(80,100,8)})
df4
   A   B   C   D
0  3  13  21  86
1  4  12  27  85
2  4  12  25  95
3  4  11  26  98
4  1  11  20  85
5  4  14  27  92
6  2  10  27  82
7  2  14  27  88
df4.groupby('A').sum()          # group by A and sum each group
    B    C    D
A
1  11   20   85
2  24   54  170
3  13   21   86
4  49  105  370
df4.groupby(by=['A', 'B']).mean()
       C   D
A B
1 11  20  85
2 10  27  82
  14  27  88
3 13  21  86
4 11  26  98
  12  26  90
  14  27  92
df4.groupby(by=['A','B'],as_index=False).mean()
   A   B   C   D
0  1  11  20  85
1  2  10  27  82
2  2  14  27  88
3  3  13  21  86
4  4  11  26  98
5  4  12  26  90
6  4  14  27  92
df4.groupby(by=['A', 'B']).aggregate({'C':np.mean, 'D':np.min})
                               # after grouping, aggregate C with the mean and D with the minimum

       C   D
A B
1 11  20  85
2 10  27  82
  14  27  88
3 13  21  86
4 11  26  98
  12  26  85
  14  27  92

17. Dummy/indicator matrices

# sample data
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data':[3,4,5,6,7,8]})
df
  key  data
0   b     3
1   b     4
2   a     5
3   c     6
4   a     7
5   b     8
pd.get_dummies(df)               # indicator matrix 1
   data  key_a  key_b  key_c
0     3      0      1      0
1     4      0      1      0
2     5      1      0      0
3     6      0      0      1
4     7      1      0      0
5     8      0      1      0
pd.get_dummies(df['key'])        # indicator matrix 2: which value is present
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0
dummies = pd.get_dummies(df['key'], prefix='key')  # add a prefix to the column names
dummies
   key_a  key_b  key_c
0      0      1      0
1      0      1      0
2      1      0      0
3      0      0      1
4      1      0      0
5      0      1      0
df[['data']].join(dummies)                         # join the indicators back to the data
   data  key_a  key_b  key_c
0     3      0      1      0
1     4      0      1      0
2     5      1      0      0
3     6      0      0      1
4     7      1      0      0
5     8      0      1      0

18. Pivot tables and crosstabs (contingency tables)

# sample data
df = pd.DataFrame({'a':[1,2,3,4],
                   'b':[2,3,4,5],
                   'c':[3,4,5,6],
                   'd':[3,3,3,3]})
df
   a  b  c  d
0  1  2  3  3
1  2  3  4  3
2  3  4  5  3
3  4  5  6  3

18.1 Pivot tables

df.pivot(index='a', columns='b', values='c')
b    2    3    4    5
a
1  3.0  NaN  NaN  NaN
2  NaN  4.0  NaN  NaN
3  NaN  NaN  5.0  NaN
4  NaN  NaN  NaN  6.0
df.pivot(index='a', columns='b', values='d')
b    2    3    4    5
a
1  3.0  NaN  NaN  NaN
2  NaN  3.0  NaN  NaN
3  NaN  NaN  3.0  NaN
4  NaN  NaN  NaN  3.0
df.pivot(index='a', columns='b')
     c                   d
b    2    3    4    5    2    3    4    5
a
1  3.0  NaN  NaN  NaN  3.0  NaN  NaN  NaN
2  NaN  4.0  NaN  NaN  NaN  3.0  NaN  NaN
3  NaN  NaN  5.0  NaN  NaN  NaN  3.0  NaN
4  NaN  NaN  NaN  6.0  NaN  NaN  NaN  3.0
df.pivot(index='a', columns='b')['c']
b    2    3    4    5
a
1  3.0  NaN  NaN  NaN
2  NaN  4.0  NaN  NaN
3  NaN  NaN  5.0  NaN
4  NaN  NaN  NaN  6.0

18.2 Crosstabs

pd.crosstab(index=df.a, columns=df.b)
b  2  3  4  5
a
1  1  0  0  0
2  0  1  0  0
3  0  0  1  0
4  0  0  0  1
pd.crosstab(index=df.a, columns=df.b, margins=True)
b    2  3  4  5  All
a
1    1  0  0  0    1
2    0  1  0  0    1
3    0  0  1  0    1
4    0  0  0  1    1
All  1  1  1  1    4
pd.crosstab(index=df.a, columns=df.b, values=df.c, aggfunc='sum', margins=True)
b      2    3    4    5  All
a
1    3.0  NaN  NaN  NaN    3
2    NaN  4.0  NaN  NaN    4
3    NaN  NaN  5.0  NaN    5
4    NaN  NaN  NaN  6.0    6
All  3.0  4.0  5.0  6.0   18
pd.crosstab(index=df.a, columns=df.b, values=df.c, aggfunc='mean', margins=True)
b      2    3    4    5  All
a
1    3.0  NaN  NaN  NaN  3.0
2    NaN  4.0  NaN  NaN  4.0
3    NaN  NaN  5.0  NaN  5.0
4    NaN  NaN  NaN  6.0  6.0
All  3.0  4.0  5.0  6.0  4.5

19. Differencing

# sample data
df = pd.DataFrame({'a':np.random.randint(1, 100, 10),
                   'b':np.random.randint(1, 100, 10)},
                  index=map(str, range(10)))
df
    a   b
0  14  63
1  20  60
2  89  12
3  28  78
4  21  99
5  27  41
6  83  64
7  91  94
8  60  51
9  69  29
df.diff()            # first difference down the rows (each value minus the one above it)
      a     b
0   NaN   NaN
1   6.0  -3.0
2  69.0 -48.0
3 -61.0  66.0
4  -7.0  21.0
5   6.0 -58.0
6  56.0  23.0
7   8.0  30.0
8 -31.0 -43.0
9   9.0 -22.0
df.diff(axis=1)       # first difference across the columns
    a     b
0 NaN  49.0
1 NaN  40.0
2 NaN -77.0
3 NaN  50.0
4 NaN  78.0
5 NaN  14.0
6 NaN -19.0
7 NaN   3.0
8 NaN  -9.0
9 NaN -40.0
df.diff(periods=2)      # difference with a lag of 2 (each value minus the one two rows above it)
      a     b
0   NaN   NaN
1   NaN   NaN
2  75.0 -51.0
3   8.0  18.0
4 -68.0  87.0
5  -1.0 -37.0
6  62.0 -35.0
7  64.0  53.0
8 -23.0 -13.0
9 -22.0 -65.0

20. Correlation coefficients

# sample data
df = pd.DataFrame({'A':np.random.randint(1, 100, 10),
                   'B':np.random.randint(1, 100, 10),
                   'C':np.random.randint(1, 100, 10)})
df
df.corr()                       # Pearson correlation
          A         B         C
A  1.000000 -0.360141  0.232945
B -0.360141  1.000000 -0.476603
C  0.232945 -0.476603  1.000000
df.corr('kendall')              # Kendall tau correlation
          A         B         C
A  1.000000 -0.340909  0.179787
B -0.340909  1.000000 -0.269680
C  0.179787 -0.269680  1.000000
df.corr('spearman')             # Spearman rank correlation
          A         B         C
A  1.000000 -0.503049  0.255320
B -0.503049  1.000000 -0.468087
C  0.255320 -0.468087  1.000000

21. Plotting with matplotlib

import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
# cumsum() returns the cumulative sum along the given axis
df
             B          C
0    -1.434415  -1.363257
1    -1.988009  -2.407375
2    -1.722312  -3.656220
3    -1.622777  -2.629169
4    -3.002289  -2.623288
..         ...        ...
995 -39.633091 -14.478918
996 -40.224034 -14.200355
997 -40.242355 -14.070692
998 -40.567508 -12.884639
999 -40.407534 -13.452417

1000 rows × 2 columns
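As a quick check of what cumsum() means, here is a tiny hand-checkable example:

pd.Series([1, 2, 3, 4]).cumsum()   # running totals
0     1
1     3
2     6
3    10
dtype: int64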

df['A'] = pd.Series(list(range(len(df))))
plt.figure()
df.plot(x='A')
plt.show()
<Figure size 432x288 with 0 Axes>

(figure: line plot of the cumulative sums, with column A on the x-axis)

df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='bar')
plt.show()

df = pd.DataFrame({'height':[180,170,172,183,179,178,160],
		    'weight':[85,80,85,75,78,78,70]})
df.plot(x='height', y='weight', kind='scatter',
            marker='*', s=60, label='height-weight') # draw a scatter plot
plt.show()

(figure: scatter plot of height against weight)

df['weight'].plot(kind='pie', autopct='%.2f%%',
                      labels=df['weight'].values,
                      shadow=True)       # pie chart
plt.show()

(figure: pie chart of the weight column)

22. Reading and writing files

df.to_excel('pd_to_xlsx.xlsx', sheet_name='dfg')       # save the data to an Excel file
df = pd.read_excel('pd_to_xlsx.xlsx', 'dfg', index_col=None, na_values=['NA'])
df3 = pd.read_excel('pd_to_xlsx.xlsx', 'dfg',skiprows=3)     # read the sheet, skipping the first 3 rows
df3
   2  172  85
0  3  183  75
1  4  179  78
2  5  178  78
3  6  160  70
df44 = pd.read_excel('pd_to_xlsx.xlsx', 'dfg',skiprows=[2,4]) # skip the rows at positions 2 and 4
df44
   Unnamed: 0  height  weight
0           0     180      85
1           2     172      85
2           4     179      78
3           5     178      78
4           6     160      70
df.to_csv('df_2_csv.csv')                            # save the data as a CSV file
dfdd = pd.read_csv('df_2_csv.csv')                     # read the data back from the CSV file
dfdd
   Unnamed: 0  Unnamed: 0.1  height  weight
0           0             0     180      85
1           1             1     170      80
2           2             2     172      85
3           3             3     183      75
4           4             4     179      78
5           5             5     178      78
6           6             6     160      70
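The 'Unnamed: 0' columns above appear because the row index is written into the file each time the data is saved; passing index=False avoids this (a minimal sketch, reusing the same file name):

df.to_csv('df_2_csv.csv', index=False)     # do not write the row index into the file
pd.read_csv('df_2_csv.csv')                # now only height and weight come back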
