pandas Basic Operations (Updated)


import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

Create a Series, letting pandas generate the index automatically


s = pd.Series([1,3,5,np.nan,6,8])

# view s
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
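An explicit index can also be supplied instead of the auto-generated one. A minimal sketch (the series s2 and its labels are invented for illustration):

```python
import pandas as pd

# pass an explicit index instead of the default RangeIndex
s2 = pd.Series([1, 3, 5], index=['a', 'b', 'c'])
val = s2['b']  # label-based access
```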

Creating a DataFrame


### Create a DataFrame: you can pass a numpy array and supply your own index and column labels
dates = pd.date_range('2018-11-01',periods=7)
#### e.g. generate a time series of 7 dates starting at 2018-11-01; the dtype is datetime64[ns]

dates
DatetimeIndex(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04',
               '2018-11-05', '2018-11-06', '2018-11-07'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(7,4),index= dates,columns=list('ABCD'))
df
# random standard-normal data, 7 rows x 4 columns, matching the lengths of the index and columns
                   A         B         C         D
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-03 -0.518426  0.730866 -1.033830  0.712624
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208
2018-11-06 -0.030580  0.545561  1.091127 -0.131579
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027

### A DataFrame can also be created from a dict
df2 = pd.DataFrame({"A":1,
                   "B":"20181101",
                   'C':np.array([3]*4,dtype='int32'),
                   'D':pd.Categorical(['test','train','test','train']),
                   "E":1.5},
                  )
df2
   A         B  C      D    E
0  1  20181101  3   test  1.5
1  1  20181101  3  train  1.5
2  1  20181101  3   test  1.5
3  1  20181101  3  train  1.5

df2.dtypes
### check the dtypes of the columns; other common dtypes include datetime and float types
A       int64
B      object
C       int32
D    category
E     float64
dtype: object

Viewing data



# e.g. view the first 5 rows
df.head()
                   A         B         C         D
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-03 -0.518426  0.730866 -1.033830  0.712624
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208

# last 4 rows
df.tail(4)
                   A         B         C         D
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208
2018-11-06 -0.030580  0.545561  1.091127 -0.131579
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027

# view the DataFrame's index
df.index
DatetimeIndex(['2018-11-01', '2018-11-02', '2018-11-03', '2018-11-04',
               '2018-11-05', '2018-11-06', '2018-11-07'],
              dtype='datetime64[ns]', freq='D')

# view the DataFrame's column index
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')

# view the underlying data, converting the DataFrame to a numpy array
df.values
array([[-0.1703643 , -0.23754121,  0.52990284,  0.66007285],
       [-0.15844565, -0.48853537,  0.08296043, -1.91357255],
       [-0.51842554,  0.73086567, -1.03382969,  0.71262388],
       [ 1.01352712,  0.27016714,  0.08180539,  0.17819344],
       [-0.89749689, -0.01627937, -0.23499323,  0.08120819],
       [-0.03058032,  0.54556063,  1.09112723, -0.13157934],
       [-0.31334198, -0.68817881, -0.41775393,  0.85502652]])
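In newer pandas versions, DataFrame.to_numpy() is the recommended way to get the underlying array (.values still works). A small sketch with invented fixed data, showing that mixed column dtypes are upcast to a common dtype:

```python
import numpy as np
import pandas as pd

# invented frame with one int and one float column
df_small = pd.DataFrame({'A': [1, 2], 'B': [3.5, 4.5]})
arr = df_small.to_numpy()  # both columns are upcast to float64
```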

Basic statistics


# describe() computes summary statistics for the numeric columns of a DataFrame
df.describe()
              A         B         C         D
count  7.000000  7.000000  7.000000  7.000000
mean  -0.153590  0.016580  0.014174  0.063139
std    0.590144  0.527860  0.680939  0.945526
min   -0.897497 -0.688179 -1.033830 -1.913573
25%   -0.415884 -0.363038 -0.326374 -0.025186
50%   -0.170364 -0.016279  0.081805  0.178193
75%   -0.094513  0.407864  0.306432  0.686348
max    1.013527  0.730866  1.091127  0.855027

df2.describe()
### columns with other (non-numeric) dtypes are automatically excluded by describe()
         A    C    E
count  4.0  4.0  4.0
mean   1.0  3.0  1.5
std    0.0  0.0  0.0
min    1.0  3.0  1.5
25%    1.0  3.0  1.5
50%    1.0  3.0  1.5
75%    1.0  3.0  1.5
max    1.0  3.0  1.5

### Transposing a DataFrame swaps the row and column indices, and rows with columns
df.T
   2018-11-01  2018-11-02  2018-11-03  2018-11-04  2018-11-05  2018-11-06  2018-11-07
A   -0.170364   -0.158446   -0.518426    1.013527   -0.897497   -0.030580   -0.313342
B   -0.237541   -0.488535    0.730866    0.270167   -0.016279    0.545561   -0.688179
C    0.529903    0.082960   -1.033830    0.081805   -0.234993    1.091127   -0.417754
D    0.660073   -1.913573    0.712624    0.178193    0.081208   -0.131579    0.855027

df
                   A         B         C         D
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-03 -0.518426  0.730866 -1.033830  0.712624
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208
2018-11-06 -0.030580  0.545561  1.091127 -0.131579
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027

Sorting data


df.sort_index(ascending=False)
### sort the rows in descending order of the (date) index
                   A         B         C         D
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027
2018-11-06 -0.030580  0.545561  1.091127 -0.131579
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-03 -0.518426  0.730866 -1.033830  0.712624
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-01 -0.170364 -0.237541  0.529903  0.660073


print(df.sort_values(by=['B','A']))
# ascending by default; multiple sort keys are allowed: sort by B first, and break ties in B by A
df.sort_values(by='B')
                   A         B         C         D
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-06 -0.030580  0.545561  1.091127 -0.131579
2018-11-03 -0.518426  0.730866 -1.033830  0.712624
                   A         B         C         D
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-06 -0.030580  0.545561  1.091127 -0.131579
2018-11-03 -0.518426  0.730866 -1.033830  0.712624
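The ascending argument also accepts a list, one direction per sort key. A sketch on a small invented frame dfs:

```python
import pandas as pd

# invented frame: sort B ascending, break ties in B by A descending
dfs = pd.DataFrame({'B': [2, 1, 2], 'A': [9, 5, 3]})
out = dfs.sort_values(by=['B', 'A'], ascending=[True, False])
```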

Selecting data (similar to SQL SELECT)


df['A']
# select a single column; equivalent to df.A
2018-11-01   -0.170364
2018-11-02   -0.158446
2018-11-03   -0.518426
2018-11-04    1.013527
2018-11-05   -0.897497
2018-11-06   -0.030580
2018-11-07   -0.313342
Freq: D, Name: A, dtype: float64

# row slicing with []
df[0:3]
                   A         B         C         D
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-03 -0.518426  0.730866 -1.033830  0.712624

# with a datetime index you can slice by date strings directly, e.g.
df['2018-11-01':'2018-11-04']
                   A         B         C         D
2018-11-01 -0.170364 -0.237541  0.529903  0.660073
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-03 -0.518426  0.730866 -1.033830  0.712624
2018-11-04  1.013527  0.270167  0.081805  0.178193

Selection by label



df.loc['2018-11-01']
A   -0.170364
B   -0.237541
C    0.529903
D    0.660073
Name: 2018-11-01 00:00:00, dtype: float64

#### select on multiple axes by label
df.loc[:,["A","B"]] # equivalent to df[["A","B"]]
                   A         B
2018-11-01 -0.170364 -0.237541
2018-11-02 -0.158446 -0.488535
2018-11-03 -0.518426  0.730866
2018-11-04  1.013527  0.270167
2018-11-05 -0.897497 -0.016279
2018-11-06 -0.030580  0.545561
2018-11-07 -0.313342 -0.688179

df.loc["2018-11-01":"2018-11-03",["A","B"]]
                   A         B
2018-11-01 -0.170364 -0.237541
2018-11-02 -0.158446 -0.488535
2018-11-03 -0.518426  0.730866

#### get a scalar value
df.loc['2018-11-01','A']
-0.17036430076617162

Selection by position


df.iloc[3]  # get the fourth row
A    1.013527
B    0.270167
C    0.081805
D    0.178193
Name: 2018-11-04 00:00:00, dtype: float64

df.iloc[1:3,1:4]  #  similar to slicing a numpy ndarray
                   B         C         D
2018-11-02 -0.488535  0.082960 -1.913573
2018-11-03  0.730866 -1.033830  0.712624

# non-contiguous rows or columns can also be selected
df.iloc[[1,3],[1,3]]
                   B         D
2018-11-02 -0.488535 -1.913573
2018-11-04  0.270167  0.178193

# slicing rows
df.iloc[1:3,:]
                   A         B         C         D
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573
2018-11-03 -0.518426  0.730866 -1.033830  0.712624

# slicing columns
df.iloc[:,1:4]
                   B         C         D
2018-11-01 -0.237541  0.529903  0.660073
2018-11-02 -0.488535  0.082960 -1.913573
2018-11-03  0.730866 -1.033830  0.712624
2018-11-04  0.270167  0.081805  0.178193
2018-11-05 -0.016279 -0.234993  0.081208
2018-11-06  0.545561  1.091127 -0.131579
2018-11-07 -0.688179 -0.417754  0.855027

# get a specific value
df.iloc[1,3]
-1.9135725473596013
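For single scalar lookups, at (label-based) and iat (position-based) are faster alternatives to loc/iloc. A sketch on a small invented frame dfx:

```python
import pandas as pd

# invented frame: at/iat are the fast scalar accessors
dfx = pd.DataFrame({'A': [10, 20], 'B': [30, 40]}, index=['r1', 'r2'])
v1 = dfx.at['r2', 'B']   # by label
v2 = dfx.iat[1, 1]       # by position
```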

Boolean indexing


# filter rows using a single column as the condition
df[df.A>0]
                   A         B         C         D
2018-11-04  1.013527  0.270167  0.081805  0.178193

# rarely used: element-wise filtering of the whole frame like this is uncommon
df[df>0]
                   A         B         C         D
2018-11-01       NaN       NaN  0.529903  0.660073
2018-11-02       NaN       NaN  0.082960       NaN
2018-11-03       NaN  0.730866       NaN  0.712624
2018-11-04  1.013527  0.270167  0.081805  0.178193
2018-11-05       NaN       NaN       NaN  0.081208
2018-11-06       NaN  0.545561  1.091127       NaN
2018-11-07       NaN       NaN       NaN  0.855027

# filtering with the isin() method
df2.head()
   A         B  C      D    E
0  1  20181101  3   test  1.5
1  1  20181101  3  train  1.5
2  1  20181101  3   test  1.5
3  1  20181101  3  train  1.5

df2[df2['D'].isin(['test'])]
   A         B  C     D    E
0  1  20181101  3  test  1.5
2  1  20181101  3  test  1.5
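isin also takes several values at once, and prefixing the mask with ~ inverts the filter. A sketch with invented data:

```python
import pandas as pd

# invented frame: isin with several values, and ~ to negate the mask
dfy = pd.DataFrame({'D': ['test', 'train', 'test', 'valid'], 'E': [1, 2, 3, 4]})
kept = dfy[dfy['D'].isin(['test', 'valid'])]
dropped = dfy[~dfy['D'].isin(['test', 'valid'])]
```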

Setting values (similar to SQL UPDATE or INSERT)

  • Assign a new column

df['E'] = [1,2,3,4,5,6,7]

df
                   A         B         C         D  E
2018-11-01 -0.170364 -0.237541  0.529903  0.660073  1
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573  2
2018-11-03 -0.518426  0.730866 -1.033830  0.712624  3
2018-11-04  1.013527  0.270167  0.081805  0.178193  4
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208  5
2018-11-06 -0.030580  0.545561  1.091127 -0.131579  6
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027  7
  • Set values by label

df.loc['2018-11-01','E']= 10  # set row 1, column E to 10

df
                   A         B         C         D   E
2018-11-01 -0.170364 -0.237541  0.529903  0.660073  10
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573   2
2018-11-03 -0.518426  0.730866 -1.033830  0.712624   3
2018-11-04  1.013527  0.270167  0.081805  0.178193   4
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208   5
2018-11-06 -0.030580  0.545561  1.091127 -0.131579   6
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027   7

df.iloc[1,4]=5000  # set row 2, column 5 to 5000

df
                   A         B         C         D     E
2018-11-01 -0.170364 -0.237541  0.529903  0.660073    10
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573  5000
2018-11-03 -0.518426  0.730866 -1.033830  0.712624     3
2018-11-04  1.013527  0.270167  0.081805  0.178193     4
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208     5
2018-11-06 -0.030580  0.545561  1.091127 -0.131579     6
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027     7

df3 =df.copy()
df3[df3<0]= -df3
df3  # all values become non-negative
                   A         B         C         D     E
2018-11-01  0.170364  0.237541  0.529903  0.660073    10
2018-11-02  0.158446  0.488535  0.082960  1.913573  5000
2018-11-03  0.518426  0.730866  1.033830  0.712624     3
2018-11-04  1.013527  0.270167  0.081805  0.178193     4
2018-11-05  0.897497  0.016279  0.234993  0.081208     5
2018-11-06  0.030580  0.545561  1.091127  0.131579     6
2018-11-07  0.313342  0.688179  0.417754  0.855027     7

Handling missing values


df
                   A         B         C         D     E
2018-11-01 -0.170364 -0.237541  0.529903  0.660073    10
2018-11-02 -0.158446 -0.488535  0.082960 -1.913573  5000
2018-11-03 -0.518426  0.730866 -1.033830  0.712624     3
2018-11-04  1.013527  0.270167  0.081805  0.178193     4
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208     5
2018-11-06 -0.030580  0.545561  1.091127 -0.131579     6
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027     7

df['E']=[1,np.nan,2,np.nan,4,np.nan,6]

df.loc['2018-11-01':'2018-11-03','D']=np.nan

df
                   A         B         C         D    E
2018-11-01 -0.170364 -0.237541  0.529903       NaN  1.0
2018-11-02 -0.158446 -0.488535  0.082960       NaN  NaN
2018-11-03 -0.518426  0.730866 -1.033830       NaN  2.0
2018-11-04  1.013527  0.270167  0.081805  0.178193  NaN
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208  4.0
2018-11-06 -0.030580  0.545561  1.091127 -0.131579  NaN
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027  6.0
  • Drop rows that contain missing values

df4 = df.copy()

df4.dropna(how='any')
                   A         B         C         D    E
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208  4.0
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027  6.0

df4.dropna(how='all')
# """DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)"""
# axis: 0/'index' or 1/'columns'
# how: 'any' drops a row/column with any NaN; 'all' only when every value is NaN
# thresh: keep only rows/columns with at least that many non-NaN values
# subset: restrict the check to these labels
# inplace: whether to modify the original DataFrame in place
                   A         B         C         D    E
2018-11-01 -0.170364 -0.237541  0.529903       NaN  1.0
2018-11-02 -0.158446 -0.488535  0.082960       NaN  NaN
2018-11-03 -0.518426  0.730866 -1.033830       NaN  2.0
2018-11-04  1.013527  0.270167  0.081805  0.178193  NaN
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208  4.0
2018-11-06 -0.030580  0.545561  1.091127 -0.131579  NaN
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027  6.0
  • Fill missing values

df4.fillna(1000)
                   A         B         C            D       E
2018-11-01 -0.170364 -0.237541  0.529903  1000.000000     1.0
2018-11-02 -0.158446 -0.488535  0.082960  1000.000000  1000.0
2018-11-03 -0.518426  0.730866 -1.033830  1000.000000     2.0
2018-11-04  1.013527  0.270167  0.081805     0.178193  1000.0
2018-11-05 -0.897497 -0.016279 -0.234993     0.081208     4.0
2018-11-06 -0.030580  0.545561  1.091127    -0.131579  1000.0
2018-11-07 -0.313342 -0.688179 -0.417754     0.855027     6.0
  • Get a boolean mask of where values are missing

pd.isnull(df4)
                A      B      C      D      E
2018-11-01  False  False  False   True  False
2018-11-02  False  False  False   True   True
2018-11-03  False  False  False   True  False
2018-11-04  False  False  False  False   True
2018-11-05  False  False  False  False  False
2018-11-06  False  False  False  False   True
2018-11-07  False  False  False  False  False
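Besides a constant, fillna also accepts a Series or dict, e.g. filling each column's NaN with that column's own mean. A sketch with a small invented frame:

```python
import numpy as np
import pandas as pd

# invented frame: fillna with a Series fills each column from its own value
dfm = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, 2.0, 4.0]})
filled = dfm.fillna(dfm.mean())  # NaN in each column replaced by that column's mean
```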

Operations on data


# statistics generally exclude missing values
df4.mean()
# by default the mean is taken per column, i.e. along the rows (axis=0)
A   -0.153590
B    0.016580
C    0.014174
D    0.245712
E    3.250000
dtype: float64

df4.mean(axis=1)
# mean of each row, computed across the columns
2018-11-01    0.280499
2018-11-02   -0.188007
2018-11-03    0.294653
2018-11-04    0.385923
2018-11-05    0.586488
2018-11-06    0.368632
2018-11-07    1.087150
Freq: D, dtype: float64

# operating between objects of different dimensions that need alignment: pandas broadcasts automatically along the specified dimension:
s = pd.Series([1,3,4,np.nan,6,7,8],index=dates)
s
2018-11-01    1.0
2018-11-02    3.0
2018-11-03    4.0
2018-11-04    NaN
2018-11-05    6.0
2018-11-06    7.0
2018-11-07    8.0
Freq: D, dtype: float64

df4.sub(s,axis='index')
                   A         B         C         D    E
2018-11-01 -1.170364 -1.237541 -0.470097       NaN  0.0
2018-11-02 -3.158446 -3.488535 -2.917040       NaN  NaN
2018-11-03 -4.518426 -3.269134 -5.033830       NaN -2.0
2018-11-04       NaN       NaN       NaN       NaN  NaN
2018-11-05 -6.897497 -6.016279 -6.234993 -5.918792 -2.0
2018-11-06 -7.030580 -6.454439 -5.908873 -7.131579  NaN
2018-11-07 -8.313342 -8.688179 -8.417754 -7.144973 -2.0

df4
                   A         B         C         D    E
2018-11-01 -0.170364 -0.237541  0.529903       NaN  1.0
2018-11-02 -0.158446 -0.488535  0.082960       NaN  NaN
2018-11-03 -0.518426  0.730866 -1.033830       NaN  2.0
2018-11-04  1.013527  0.270167  0.081805  0.178193  NaN
2018-11-05 -0.897497 -0.016279 -0.234993  0.081208  4.0
2018-11-06 -0.030580  0.545561  1.091127 -0.131579  NaN
2018-11-07 -0.313342 -0.688179 -0.417754  0.855027  6.0

df4.apply(np.cumsum)
                   A         B         C         D     E
2018-11-01 -0.170364 -0.237541  0.529903       NaN   1.0
2018-11-02 -0.328810 -0.726077  0.612863       NaN   NaN
2018-11-03 -0.847235  0.004789 -0.420966       NaN   3.0
2018-11-04  0.166292  0.274956 -0.339161  0.178193   NaN
2018-11-05 -0.731205  0.258677 -0.574154  0.259402   7.0
2018-11-06 -0.761786  0.804237  0.516973  0.127822   NaN
2018-11-07 -1.075128  0.116059  0.099219  0.982849  13.0

df4.apply(lambda x: x.max()-x.min())
A    1.911024
B    1.419044
C    2.124957
D    0.986606
E    5.000000
dtype: float64
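Closely related to apply is agg, which runs one or more reduction functions per column in one call. A sketch with invented fixed data:

```python
import pandas as pd

# invented frame: agg runs several reductions per column at once
dfa = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
stats = dfa.agg(['min', 'max'])  # one result row per function
rng = dfa.max() - dfa.min()      # the max-min spread computed directly
```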

Counting values and discretization


s = pd.Series(np.random.randint(0,7,size=15))
s
0     5
1     4
2     1
3     2
4     1
5     0
6     2
7     6
8     4
9     3
10    1
11    1
12    1
13    3
14    2
dtype: int32

s.value_counts()
# count occurrences of each element, sorted by frequency; elements that never occur are not shown
1    5
2    3
4    2
3    2
6    1
5    1
0    1
dtype: int64

s.reindex(range(0,7))
# reindex returns the values at positions 0-6; to show the counts themselves in a fixed
# order, use s.value_counts().reindex(range(0,7)) instead
0    5
1    4
2    1
3    2
4    1
5    0
6    2
dtype: int32

s.mode()
#  the mode (most frequent value)
0    1
dtype: int32
  • Discretization

# continuous values can be binned into discrete categories with the cut function (bins based
# on values) or qcut (bins based on sample quantiles)
arr = np.random.randint(0,20,size=15)  # uniform random integers in [0, 20)
arr
array([ 5, 18, 13, 16, 16,  1, 15, 11,  0, 17, 16, 18, 15, 12, 13])

factor = pd.cut(arr,3)
factor
[(-0.018, 6.0], (12.0, 18.0], (12.0, 18.0], (12.0, 18.0], (12.0, 18.0], ..., (12.0, 18.0], (12.0, 18.0], (12.0, 18.0], (6.0, 12.0], (12.0, 18.0]]
Length: 15
Categories (3, interval[float64]): [(-0.018, 6.0] < (6.0, 12.0] < (12.0, 18.0]]

pd.value_counts(factor)
(12.0, 18.0]     10
(-0.018, 6.0]     3
(6.0, 12.0]       2
dtype: int64

factor1 = pd.cut(arr,[-1,5,10,15,20])

pd.value_counts(factor1)
(15, 20]    6
(10, 15]    6
(-1, 5]     3
(5, 10]     0
dtype: int64

factor2 = pd.qcut(arr,[0,0.25,0.5,0.75,1])

pd.value_counts(factor2)
(11.5, 15.0]      5
(-0.001, 11.5]    4
(16.0, 18.0]      3
(15.0, 16.0]      3
dtype: int64

Handling strings in pandas (a large chapter of its own; not covered in detail here)

Merging data

  • concat
  • merge (similar to a JOIN in SQL databases)
  • append

First, merging DataFrames with concat


df = pd.DataFrame(np.random.randn(10,4))  #  10x4 DataFrame of standard-normal values
df
          0         1         2         3
0  0.949746 -0.050767  1.478622 -0.239901
1 -0.297120 -0.562589  0.371837  1.180715
2  0.953856  0.492295  0.821156 -0.323328
3  0.016153  1.554225 -1.166304 -0.904040
4  0.204763 -0.951291 -1.317620  0.672900
5  2.241006 -0.925746 -1.961408  0.853367
6  2.217133 -0.430812  0.518926  1.741445
7 -0.571104 -0.437305 -0.902241  0.786231
8 -2.511387  0.523760  1.811622 -0.777296
9  0.252690  0.901952  0.619614 -0.006631

d1,d2,d3  = df[:3],df[3:7],df[7:]
d1,d2,d3
(          0         1         2         3
 0  0.949746 -0.050767  1.478622 -0.239901
 1 -0.297120 -0.562589  0.371837  1.180715
 2  0.953856  0.492295  0.821156 -0.323328,
           0         1         2         3
 3  0.016153  1.554225 -1.166304 -0.904040
 4  0.204763 -0.951291 -1.317620  0.672900
 5  2.241006 -0.925746 -1.961408  0.853367
 6  2.217133 -0.430812  0.518926  1.741445,
           0         1         2         3
 7 -0.571104 -0.437305 -0.902241  0.786231
 8 -2.511387  0.523760  1.811622 -0.777296
 9  0.252690  0.901952  0.619614 -0.006631)

pd.concat([d1,d2,d3])
# concatenate the three frames; they share the same structure (identical columns), which is
# the typical case: appending new rows from another data source
          0         1         2         3
0  0.949746 -0.050767  1.478622 -0.239901
1 -0.297120 -0.562589  0.371837  1.180715
2  0.953856  0.492295  0.821156 -0.323328
3  0.016153  1.554225 -1.166304 -0.904040
4  0.204763 -0.951291 -1.317620  0.672900
5  2.241006 -0.925746 -1.961408  0.853367
6  2.217133 -0.430812  0.518926  1.741445
7 -0.571104 -0.437305 -0.902241  0.786231
8 -2.511387  0.523760  1.811622 -0.777296
9  0.252690  0.901952  0.619614 -0.006631
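concat can also join column-wise with axis=1, and keys tags each input with an extra index level. A sketch with two tiny invented frames:

```python
import pandas as pd

# invented frames: axis=1 joins side by side; keys builds a MultiIndex marking the source
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'y': [3, 4]})
side = pd.concat([a, b], axis=1)
tagged = pd.concat([a, a], keys=['p', 'q'])
```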

Merging with merge (a JOIN as in databases)


left = pd.DataFrame({'key':['foo','foo'],"lval":[1,2]})
right = pd.DataFrame({'key':['foo','foo'],'rval':[4,5]})

left
   key  lval
0  foo     1
1  foo     2

right
   key  rval
0  foo     4
1  foo     5

pd.merge(left,right,on='key')
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

left = pd.DataFrame({'key':['foo','bar'],"lval":[1,2]})
right = pd.DataFrame({'key':['foo','bar'],'rval':[4,5]})
pd.merge(left,right,on='key')
   key  lval  rval
0  foo     1     4
1  bar     2     5

left
   key  lval
0  foo     1
1  bar     2

right
   key  rval
0  foo     4
1  bar     5
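merge defaults to an inner join; the how argument ('left', 'right', 'outer') controls how unmatched keys are handled. A sketch with invented keys:

```python
import pandas as pd

# invented frames: one shared key ('foo') and one unmatched key on each side
left2 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right2 = pd.DataFrame({'key': ['foo', 'baz'], 'rval': [4, 5]})
inner = pd.merge(left2, right2, on='key')               # inner join: only matching keys
outer = pd.merge(left2, right2, on='key', how='outer')  # unmatched keys kept, filled with NaN
```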

Merging data with append


#  similar to concat; see the following example for common usage
df = pd.DataFrame(np.random.randn(8,4),columns=['A','B','C','D'])
df
          A         B         C         D
0  1.825997 -0.331086 -0.067143  0.747226
1 -0.027497  0.861639  0.928621 -2.549617
2 -0.546645 -0.072253 -0.788483  0.484140
3 -0.472240 -1.776993 -1.647407  0.170596
4 -0.099453  0.380143 -0.890510  1.233741
5  0.351915  0.137522 -1.165938  1.128146
6  0.558442 -1.047060 -0.598197 -1.979876
7  0.067321 -1.037666 -1.140675 -0.098562

d1 = df.iloc[3]
df.append(d1,ignore_index= True)
          A         B         C         D
0  1.825997 -0.331086 -0.067143  0.747226
1 -0.027497  0.861639  0.928621 -2.549617
2 -0.546645 -0.072253 -0.788483  0.484140
3 -0.472240 -1.776993 -1.647407  0.170596
4 -0.099453  0.380143 -0.890510  1.233741
5  0.351915  0.137522 -1.165938  1.128146
6  0.558442 -1.047060 -0.598197 -1.979876
7  0.067321 -1.037666 -1.140675 -0.098562
8 -0.472240 -1.776993 -1.647407  0.170596
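One caveat worth knowing today: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat gives the same result. A sketch on a small invented frame:

```python
import pandas as pd

# invented frame: modern replacement for base.append(row, ignore_index=True)
base = pd.DataFrame({'A': [1, 2]})
row = base.iloc[1]
grown = pd.concat([base, row.to_frame().T], ignore_index=True)
```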

GroupBy operations


df = pd.DataFrame({"A":['foo','bar','foo','bar'],
                  "B":['one','one','two','three'],
                  "C":np.random.randn(4),
                  "D":np.random.randn(4)})
df
     A      B         C         D
0  foo    one  0.938910  0.505163
1  bar    one  0.660543  0.353860
2  foo    two  0.520309  1.157462
3  bar  three -1.054927  0.290693

df.groupby('A').sum()
            C         D
A
bar -0.394384  0.644553
foo  1.459219  1.662625

df.groupby('A').size()
A
bar    2
foo    2
dtype: int64

df.groupby(['A',"B"]).sum()
                  C         D
A   B
bar one    0.660543  0.353860
    three -1.054927  0.290693
foo one    0.938910  0.505163
    two    0.520309  1.157462

df.groupby(['A',"B"]).size()
A    B    
bar  one      1
     three    1
foo  one      1
     two      1
dtype: int64
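A groupby can also compute several aggregates in one pass with agg. A sketch with fixed invented values:

```python
import pandas as pd

# invented frame: several aggregates per group in one call
dfg = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                    'C': [1.0, 2.0, 3.0, 4.0]})
out = dfg.groupby('A')['C'].agg(['sum', 'mean'])
```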

Reshaping


tuples = list(zip(*[['bar','bar','baz','baz','foo','foo','qux','qux'],
                   ['one','two','one','two','one','two','one','two']]))

index = pd.MultiIndex.from_tuples(tuples,names=['first','second'])
df = pd.DataFrame(np.random.randn(8,2),index= index,columns=['A','B'])
df2 =  df[:4]

df2
                     A         B
first second
bar   one     0.510758  0.641370
      two     0.481230 -0.470894
baz   one    -0.076294  0.121247
      two     0.378507 -1.358932

df
                     A         B
first second
bar   one     0.510758  0.641370
      two     0.481230 -0.470894
baz   one    -0.076294  0.121247
      two     0.378507 -1.358932
foo   one    -0.873012  0.531595
      two     0.266968 -0.393124
qux   one     0.981866  1.205994
      two     0.265772  0.132489

The stack and unstack methods


df2_stacked = df2.stack()
#  move the column labels into the (row) index

df2_stacked
first  second   
bar    one     A    0.510758
               B    0.641370
       two     A    0.481230
               B   -0.470894
baz    one     A   -0.076294
               B    0.121247
       two     A    0.378507
               B   -1.358932
dtype: float64

df2_stacked.unstack()  #  restore the original shape
                     A         B
first second
bar   one     0.510758  0.641370
      two     0.481230 -0.470894
baz   one    -0.076294  0.121247
      two     0.378507 -1.358932

df2_stacked
first  second   
bar    one     A    0.510758
               B    0.641370
       two     A    0.481230
               B   -0.470894
baz    one     A   -0.076294
               B    0.121247
       two     A    0.378507
               B   -1.358932
dtype: float64

df2_stacked.unstack(1)
second        one       two
first
bar   A  0.510758  0.481230
      B  0.641370 -0.470894
baz   A -0.076294  0.378507
      B  0.121247 -1.358932

df2_stacked.unstack(0)
first        bar       baz
second
one    A  0.510758 -0.076294
       B  0.641370  0.121247
two    A  0.481230  0.378507
       B -0.470894 -1.358932

Pivot tables with pivot_table


df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})

df
        A  B    C         D         E
0     one  A  foo  0.006247 -0.894827
1     one  B  foo  1.653974 -0.340107
2     two  C  foo -1.627485 -1.011403
3   three  A  bar -0.716002  1.533422
4     one  B  bar  0.422688 -0.807675
5     one  C  bar  0.264818  0.249770
6     two  A  foo  0.643288 -1.166616
7   three  B  foo  0.348041 -0.659099
8     one  C  foo  1.593486 -1.098731
9     one  A  bar -0.389344  0.919528
10    two  B  bar -1.407450  1.269716
11  three  C  bar -0.172672  0.883970

pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.mean)
C             bar       foo
A     B
one   A -0.389344  0.006247
      B  0.422688  1.653974
      C  0.264818  1.593486
three A -0.716002       NaN
      B       NaN  0.348041
      C -0.172672       NaN
two   A       NaN  0.643288
      B -1.407450       NaN
      C       NaN -1.627485

pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.sum)
# each (A,B,C) cell contains a single observation here, so sum gives the same result as mean
C             bar       foo
A     B
one   A -0.389344  0.006247
      B  0.422688  1.653974
      C  0.264818  1.593486
three A -0.716002       NaN
      B       NaN  0.348041
      C -0.172672       NaN
two   A       NaN  0.643288
      B -1.407450       NaN
      C       NaN -1.627485

pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.mean,fill_value=0)
C             bar       foo
A     B
one   A -0.389344  0.006247
      B  0.422688  1.653974
      C  0.264818  1.593486
three A -0.716002  0.000000
      B  0.000000  0.348041
      C -0.172672  0.000000
two   A  0.000000  0.643288
      B -1.407450  0.000000
      C  0.000000 -1.627485
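pivot_table additionally supports margins=True, which appends "All" row/column totals. A sketch with a small invented frame:

```python
import pandas as pd

# invented frame: margins=True appends 'All' totals for rows and columns
dfp = pd.DataFrame({'A': ['x', 'x', 'y', 'y'],
                    'C': ['u', 'v', 'u', 'v'],
                    'D': [1.0, 2.0, 3.0, 4.0]})
pt = pd.pivot_table(dfp, values='D', index='A', columns='C',
                    aggfunc='sum', margins=True)
```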

df1 = pd.pivot_table(df,values='D',index=['A','B'],columns=['C'],aggfunc=np.mean,fill_value=0)

df1.index
MultiIndex(levels=[['one', 'three', 'two'], ['A', 'B', 'C']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['A', 'B'])

df1.stack()
A      B  C  
one    A  bar   -0.389344
          foo    0.006247
       B  bar    0.422688
          foo    1.653974
       C  bar    0.264818
          foo    1.593486
three  A  bar   -0.716002
          foo    0.000000
       B  bar    0.000000
          foo    0.348041
       C  bar   -0.172672
          foo    0.000000
two    A  bar    0.000000
          foo    0.643288
       B  bar   -1.407450
          foo    0.000000
       C  bar    0.000000
          foo   -1.627485
dtype: float64

df1.unstack()
C           bar                           foo
B             A         B         C         A         B         C
A
one   -0.389344  0.422688  0.264818  0.006247  1.653974  1.593486
three -0.716002  0.000000 -0.172672  0.000000  0.348041  0.000000
two    0.000000 -1.407450  0.000000  0.643288  0.000000 -1.627485

df1.unstack(1)
C           bar                           foo
B             A         B         C         A         B         C
A
one   -0.389344  0.422688  0.264818  0.006247  1.653974  1.593486
three -0.716002  0.000000 -0.172672  0.000000  0.348041  0.000000
two    0.000000 -1.407450  0.000000  0.643288  0.000000 -1.627485

df1.unstack(0)
C        bar                           foo
A        one     three       two       one     three       two
B
A  -0.389344 -0.716002  0.000000  0.006247  0.000000  0.643288
B   0.422688  0.000000 -1.407450  1.653974  0.348041  0.000000
C   0.264818 -0.172672  0.000000  1.593486  0.000000 -1.627485

This concludes the introduction to basic pandas usage. Follow-up posts will cover specific topics in depth (string handling, use of apply, merging data, pivot tables, time-series analysis).

posted on 2018-12-09 20:42 by 多一点
