Python Data Analysis (2): Operating on pandas Data Structures

Latest recommended article published on 2022-04-07 21:33:35

追蜗牛的coder

Category: Python Data Analysis. Tags: python, data analysis, pandas, dataframe

Article link: https://blog.csdn.net/jinxiaonian11/article/details/53143359

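pandas is a Python library that provides fast, flexible, and expressive data structures, well suited to "relational" or "labeled" data. It is a powerful tool for data analysis in Python.
pandas has two main data structures, Series and DataFrame. A Series holds one-dimensional data, that is, a single variable; a DataFrame holds two-dimensional, table-like data with several variables. Once you can operate on a DataFrame, Series operations follow naturally, so this article does not cover Series separately.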

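1. The DataFrame data structure

A DataFrame closely resembles an Excel sheet: rows are labeled by index and columns are labeled by columns, so an (index, columns) pair locates any single value. The constructor prototype is: class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
Parameters:

Parameter	Description
data	a dict or a two-dimensional array. If it is a dict, its keys become the DataFrame's columns by default
index	list of row labels; defaults to 0, 1, ..., n-1 if omitted
columns	list of column labels; defaults to 0, 1, ..., n-1 if omitted
dtype	data type
copy	whether to copy the input data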

Create a DataFrame:

import numpy as np
import pandas as pd
np.random.seed(1234)  # fix the random seed so the same numbers are generated on every run
values = np.random.randint(1, 20, (10, 10))  # 100 values in 10 rows and 10 columns
index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
columns = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten']
df = pd.DataFrame(values, index=index, columns=columns, dtype=float)
print(df)

Output:
    one   two  three  four  five   six  seven  eight  nine   ten
a  16.0   7.0   13.0  16.0  18.0  10.0   12.0   13.0  17.0   6.0
b  17.0  10.0   16.0  19.0  17.0  13.0    6.0    3.0   7.0   4.0
c   8.0  12.0    1.0  10.0  12.0  17.0    4.0    3.0  13.0   2.0
d  12.0  12.0   18.0  15.0   8.0  11.0   12.0   15.0  18.0  14.0
e   1.0  13.0    6.0  18.0   6.0  14.0   17.0   10.0   9.0  13.0
f   7.0  13.0   16.0  18.0  19.0  15.0    3.0    6.0  14.0   7.0
g   8.0   5.0    4.0   6.0  15.0  16.0   16.0   16.0   3.0  11.0
h   5.0  19.0    8.0  12.0  15.0  19.0   10.0    1.0   3.0   2.0
i  19.0  18.0    8.0   5.0   8.0  18.0    1.0   10.0  19.0  10.0
j   2.0  15.0    4.0  13.0  10.0  14.0    1.0    5.0   5.0   1.0
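The parameter table also mentions dict input, which the example above does not use. Here is a minimal sketch of that path; the name `data` and its values are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Hypothetical sample data: each dict key becomes a column label.
data = {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]}

# With an explicit index, rows are labeled a, b, c.
df_from_dict = pd.DataFrame(data, index=["a", "b", "c"])

# Without an index, rows are labeled 0, 1, 2.
df_default = pd.DataFrame(data)

print(df_from_dict)
print(df_from_dict.shape)        # (3, 2)
print(list(df_default.index))    # [0, 1, 2]
```

Passing a dict is often more readable than a bare matrix because each column's values sit next to its name.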

View basic information about the data:

print(df.index)
print(df.columns)
print(df.values)

Output:
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
Index(['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine','ten'],dtype='object')
[[ 16.   7.  13.  16.  18.  10.  12.  13.  17.   6.]
 [ 17.  10.  16.  19.  17.  13.   6.   3.   7.   4.]
 [  8.  12.   1.  10.  12.  17.   4.   3.  13.   2.]
 [ 12.  12.  18.  15.   8.  11.  12.  15.  18.  14.]
 [  1.  13.   6.  18.   6.  14.  17.  10.   9.  13.]
 [  7.  13.  16.  18.  19.  15.   3.   6.  14.   7.]
 [  8.   5.   4.   6.  15.  16.  16.  16.   3.  11.]
 [  5.  19.   8.  12.  15.  19.  10.   1.   3.   2.]
 [ 19.  18.   8.   5.   8.  18.   1.  10.  19.  10.]
 [  2.  15.   4.  13.  10.  14.   1.   5.   5.   1.]]

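2. DataFrame operations

In programming, operations on an object boil down to four kinds: add, delete, query, and modify.

2.1 Viewing data (indexing)

The most common ways to index a DataFrame are .loc[index, columns] and .iloc[numbers, numbers]. loc selects by label, while iloc selects by integer position in the underlying matrix (positions start at 0). For example:

df.loc['b', 'two'] and df.iloc[1, 1] refer to the same value: 10.0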
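To make the label versus position distinction concrete, here is a small self-contained sketch. The frame `df_small` is an assumption for illustration; it copies the first three rows and two columns of the frame above so the block runs on its own.

```python
import pandas as pd

# Illustrative frame: rows a-c and columns one/two of the article's data.
df_small = pd.DataFrame(
    [[16.0, 7.0], [17.0, 10.0], [8.0, 12.0]],
    index=["a", "b", "c"],
    columns=["one", "two"],
)

by_label = df_small.loc["b", "two"]    # select by row/column label
by_position = df_small.iloc[1, 1]      # select by 0-based position
print(by_label, by_position)           # 10.0 10.0

# Label slices include both endpoints; position slices exclude the end.
label_slice = df_small.loc["a":"b", "one"]   # rows a and b
pos_slice = df_small.iloc[0:1, 0]            # row a only
print(len(label_slice), len(pos_slice))      # 2 1
```

The slicing difference is a common source of off-by-one surprises when switching between loc and iloc.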

Indexing several values at once:

ind = ['b', 'c']
col = ['four', 'six', 'two']
print(df.loc[ind, col])

Output:
   four   six   two
b  19.0  13.0  10.0
c  10.0  17.0  12.0

In particular, columns can also be selected directly by name:

df[['three','five']]

Indexing by condition:

df[df>5]

Output:
    one   two  three  four  five   six  seven  eight  nine   ten
a  16.0   7.0   13.0  16.0  18.0  10.0   12.0   13.0  17.0   6.0
b  17.0  10.0   16.0  19.0  17.0  13.0    6.0    NaN   7.0   NaN
c   8.0  12.0    NaN  10.0  12.0  17.0    NaN    NaN  13.0   NaN
d  12.0  12.0   18.0  15.0   8.0  11.0   12.0   15.0  18.0  14.0
e   NaN  13.0    6.0  18.0   6.0  14.0   17.0   10.0   9.0  13.0
f   7.0  13.0   16.0  18.0  19.0  15.0    NaN    6.0  14.0   7.0
g   8.0   NaN    NaN   6.0  15.0  16.0   16.0   16.0   NaN  11.0
h   NaN  19.0    8.0  12.0  15.0  19.0   10.0    NaN   NaN   NaN
i  19.0  18.0    8.0   NaN   8.0  18.0    NaN   10.0  19.0  10.0
j   NaN  15.0    NaN  13.0  10.0  14.0    NaN    NaN   NaN   NaN

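The example shows that where the condition is true the value is kept, and where it is false the value is replaced with NaN, i.e. a missing value.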
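The element-wise mask above keeps the frame's shape. A related pattern, sketched below, filters whole rows with a condition on a single column. The frame `df_small` and its values are assumptions chosen for illustration (rows a, e, g of the data above).

```python
import pandas as pd

# Illustrative frame with a few values at or below the threshold.
df_small = pd.DataFrame(
    {"one": [16.0, 1.0, 8.0], "two": [7.0, 13.0, 5.0]},
    index=["a", "e", "g"],
)

# Element-wise mask: same shape, failing cells become NaN.
masked = df_small[df_small > 5]

# DataFrame.where expresses the same operation explicitly.
same_as_where = df_small.where(df_small > 5)

# Row filter: a boolean Series keeps only the rows where it is True.
rows = df_small[df_small["one"] > 5]

print(masked)
print(rows)
```

Use the element-wise form when positions matter, and the row filter when you want a smaller table with no NaN introduced.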

2.2 Adding data

Adding data covers adding rows, adding columns, and merging multiple DataFrames.
2.2.1 Adding rows: append

ind1 = ['a', 'b']
col1 = ['one', 'three', 'two']
df1 = df.loc[ind1, col1]
s = df.loc['f']
df1 = df1.append(s)
print(df1)

    one  three   two  eight  five  four  nine  seven   six  ten
a  16.0   13.0   7.0    NaN   NaN   NaN   NaN    NaN   NaN  NaN
b  17.0   16.0  10.0    NaN   NaN   NaN   NaN    NaN   NaN  NaN
f   7.0   16.0  13.0    6.0  19.0  18.0  14.0    3.0  15.0  7.0

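Notice that append adds, by default, any columns the original data lacks. It is effectively a merge of two DataFrames that takes the union of their columns, so make sure the columns line up when adding rows. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the snippet above only runs on older versions; on current versions pd.concat does the same job.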
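For readers on pandas 2.x, the same row addition can be sketched with pd.concat. This is one possible replacement, not the original author's code; `df_full` is a cut-down, hand-typed stand-in for the article's frame so the block is self-contained.

```python
import pandas as pd

# Stand-in for the article's frame: rows a, b, f and four columns.
df_full = pd.DataFrame(
    [[16.0, 7.0, 13.0, 16.0],
     [17.0, 10.0, 16.0, 19.0],
     [7.0, 13.0, 16.0, 18.0]],
    index=["a", "b", "f"],
    columns=["one", "two", "three", "four"],
)

df1 = df_full.loc[["a", "b"], ["one", "three", "two"]]
s = df_full.loc["f"]                 # a Series named "f"

# Turn the Series into a one-row DataFrame, then stack it below df1.
row = s.to_frame().T
df1 = pd.concat([df1, row])

print(df1)
```

As with append, the result contains the union of the columns, and rows a and b hold NaN in the column they did not have.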

2.2.2 Adding columns
Adding a column is straightforward; just assign to it:

ind1 = ['a', 'b']
col1 = ['one', 'three', 'two']
df1 = df.loc[ind1, col1]
df1['four'] = [1, 2]
print(df1)

    one  three   two  four
a  16.0   13.0   7.0     1
b  17.0   16.0  10.0     2
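Two variations on column assignment are worth a quick sketch: assigning a Series, which is aligned by row label rather than by order, and DataFrame.assign, which returns a new frame. The column names `five` and `total` are hypothetical and chosen only for this example.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"one": [16.0, 17.0], "two": [7.0, 10.0]},
    index=["a", "b"],
)

# A list is assigned by position; its length must match the row count.
df1["four"] = [1, 2]

# A Series is aligned on the index, so its order does not matter.
df1["five"] = pd.Series({"b": 20, "a": 10})

# assign() leaves df1 untouched and returns a copy with the new column.
df2 = df1.assign(total=df1["one"] + df1["two"])

print(df1)
print(df2)
```

Label alignment means a Series with a mismatched index produces NaN rather than an error, which is worth checking for after the assignment.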

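2.2.3 Merging DataFrames: concat
concat is the most commonly used merge method. Its prototype at the time of writing was:

pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)

(The join_axes argument was removed in pandas 1.0, so it is no longer accepted by current versions.)

Key parameters:

Parameter	Description
objs	the objects to merge, usually [df1, df2, ..., dfn]
axis	merge direction: 0 merges along the index, i.e. stacks below; 1 merges along the columns, i.e. adds to the right
join	how to combine labels on the other axis, one of {'inner', 'outer'}: inner takes the intersection, outer takes the union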

Merge along the columns, taking the union: axis=1, join='outer'

ind1 = ['a', 'b']
col1 = ['one', 'three', 'two']
df1 = df.loc[ind1, col1]
df1_col = df.loc[ind1 + ['c'], ['four']]
df1_row = df.loc[['c'], col1]
print(pd.concat([df1, df1_col], axis=1, join='outer'))

    one  three   two  four
a  16.0   13.0   7.0  16.0
b  17.0   16.0  10.0  19.0
c   NaN    NaN   NaN  10.0

Change the code to join='inner':

print(pd.concat([df1, df1_col], axis=1, join='inner'))

    one  three   two  four
a  16.0   13.0   7.0  16.0
b  17.0   16.0  10.0  19.0

Merge along the rows, taking the intersection:

ind1 = ['a', 'b']
col1 = ['one', 'three', 'two']
df1 = df.loc[ind1, col1]
df1_row = df.loc[['c'], col1+['five']]
print(pd.concat([df1, df1_row], axis=0, join='inner'))

    one  three   two
a  16.0   13.0   7.0
b  17.0   16.0  10.0
c   8.0    1.0  12.0
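The prototype also lists ignore_index, which the examples above do not exercise. The sketch below contrasts inner and outer joins on rows and shows ignore_index renumbering the result. The frames `top` and `bottom` are made up for illustration, with a deliberately repeated row label.

```python
import pandas as pd

top = pd.DataFrame({"one": [16.0, 17.0], "two": [7.0, 10.0]}, index=["a", "b"])
# Hypothetical second frame that reuses the row label "a".
bottom = pd.DataFrame({"one": [8.0], "five": [12.0]}, index=["a"])

# outer: union of the columns, missing cells filled with NaN.
outer = pd.concat([top, bottom], axis=0, join="outer")

# inner: only the columns both frames share.
inner = pd.concat([top, bottom], axis=0, join="inner")

# ignore_index=True discards the row labels and renumbers from 0.
renumbered = pd.concat([top, bottom], axis=0, join="inner", ignore_index=True)

print(outer)
print(inner)
print(renumbered)
```

Without ignore_index the result keeps the duplicate label "a", which can make later loc lookups return more than one row.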

2.3 Modifying data

Modifying data is simple; just assign to the selection:
df.loc[index, columns] = values
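The assignment pattern above can be sketched for a single cell, several cells, and a conditional selection. The frame `df_small` and the cap of 10 are assumptions for illustration only.

```python
import pandas as pd

df_small = pd.DataFrame(
    [[16.0, 7.0], [17.0, 10.0], [8.0, 12.0]],
    index=["a", "b", "c"],
    columns=["one", "two"],
)

# Single cell.
df_small.loc["b", "two"] = 0.0

# Several cells in one column, values given in row order.
df_small.loc[["a", "c"], "one"] = [1.0, 2.0]

# Conditional update: cap every value in column "two" at 10.
df_small.loc[df_small["two"] > 10, "two"] = 10.0

print(df_small)
```

Doing the selection and the assignment in one .loc call avoids the chained-indexing pitfalls that df[...][...] = value can run into.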

2.4 Deleting data

2.4.1 Deleting columns: pop

df = pd.DataFrame(values, index=index, columns=columns, dtype=float)
ind1 = ['a', 'b']
col1 = ['one', 'three', 'two']
df1 = df.loc[ind1, col1]
df1.pop('one')
print(df1)

   three   two
a   13.0   7.0
b   16.0  10.0

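Deleting a column with pop is easy to understand: columns and their data relate to each other like the keys and values of a dictionary, and a dictionary removes a key with pop.
pop is not the only way to delete a column; drop works as well: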

ind1 = ['a', 'b']
col1 = ['one', 'three', 'two']
df1 = df.loc[ind1, col1]
df1 = df1.drop('one', axis=1)
print(df1)

   three   two
a   13.0   7.0
b   16.0  10.0

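The difference between drop and pop: pop modifies the original DataFrame in place and returns the removed column, whereas drop returns a new DataFrame and leaves the original unchanged unless inplace=True is passed. pop only deletes columns; drop can delete rows or columns, controlled by the axis parameter.
2.4.2 Deleting rows (index):

When drop's axis parameter is 0, rows are deleted: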

ind1 = ['a', 'b']
col1 = ['one', 'three', 'two']
df1 = df.loc[ind1, col1]
df1 = df1.drop('a', axis=0)
print(df1)

    one  three   two
b  17.0   16.0  10.0
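The in-place versus copy behavior of pop and drop is easy to check directly. The sketch below rebuilds a small `df1` by hand (an assumption, so the block does not depend on earlier code) and inspects the original after each call.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"one": [16.0, 17.0], "three": [13.0, 16.0], "two": [7.0, 10.0]},
    index=["a", "b"],
)

# pop removes the column from df1 itself and returns it as a Series.
popped = df1.pop("one")
print(list(df1.columns))      # ['three', 'two']

# drop returns a new frame; df1 still has both remaining columns.
dropped = df1.drop("three", axis=1)
print(list(df1.columns))      # ['three', 'two']
print(list(dropped.columns))  # ['two']

# axis=0 drops rows by label, again without touching df1.
no_a = df1.drop("a", axis=0)
print(list(no_a.index))       # ['b']
```

Because drop does not mutate by default, forgetting to reassign its result is a common reason a deletion appears not to work.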

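3. Handling missing values

In a DataFrame, absent data and NaN (not a number) values are both represented as NaN. Infinite values such as np.inf do not count as missing.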

values = [[1, 2, 3], ['jj', np.nan, 8], [np.inf, np.nan, 5]]
index = ['a', 'b', 'c']
columns = ['one', 'two', 'three']

df = pd.DataFrame(values, index=index, columns=columns)
print(df)

   one  two  three
a    1  2.0      3
b   jj  NaN      8
c  inf  NaN      5

Missing values can be handled as follows:
Detecting missing values: isnull

print(df.isnull())

     one    two  three
a  False  False  False
b  False   True  False
c  False   True  False

Filling missing values: fillna

df = df.fillna(value=6)
print(df)

   one  two  three
a    1  2.0    3.0
b   jj  6.0    8.0
c  inf  6.0    5.0
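Beyond a single fill value, two other common options are filling per column and dropping incomplete rows. The sketch below uses a made-up frame `df_na`; choosing the column mean as the fill value is an assumption for illustration, not a recommendation from the original text.

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame(
    {"one": [np.inf, 1.0, 3.0], "two": [2.0, np.nan, np.nan]},
    index=["a", "b", "c"],
)

# isnull flags NaN only; inf is a regular (non-missing) value.
mask = df_na.isnull()

# Fill per column with a dict; mean() skips NaN, so the mean of "two" is 2.0.
filled = df_na.fillna({"two": df_na["two"].mean()})

# dropna removes every row that contains at least one NaN.
complete_rows = df_na.dropna()

print(mask)
print(filled)
print(complete_rows)
```

Row a survives dropna even though it contains inf, which confirms that infinite values need separate handling if they should be treated as missing.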

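4. Computing statistical measures

The statistics APIs include:

Function	Purpose	Notes
describe()	summary statistics	computes common statistics per column, such as the mean
mean()	mean	depends on axis
cov()	covariance	pairwise between columns; NaN values are excluded
count()	number of non-NaN values per column	-
sum()	sum	-
median()	median	-
min()	minimum	-
max()	maximum	-
std()	standard deviation	-
var()	variance	-
skew()	sample skewness	-
kurt()	sample kurtosis	-
quantile()	sample quantile	-
corr()	correlation coefficient	three methods: pearson, kendall, spearman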

import numpy as np
import pandas as pd
np.random.seed(1234)
values = np.random.randint(1, 10, (5, 5))
columns = ['col'+str(n) for n in np.arange(len(values))]
index = ['index'+str(n) for n in np.arange(len(values))]

df = pd.DataFrame(values, index=index, columns=columns)

# Common summary statistics: describe
stat = df.describe()
print(stat)

           col0      col1     col2     col3      col4
count  5.000000  5.000000  5.00000  5.00000  5.000000
mean   3.800000  4.000000  6.20000  4.80000  4.600000
std    2.280351  3.316625  1.30384  2.48998  3.781534
min    1.000000  1.000000  4.00000  3.00000  1.000000
25%    2.000000  1.000000  6.00000  3.00000  1.000000
50%    4.000000  3.000000  7.00000  4.00000  4.000000
75%    6.000000  7.000000  7.00000  5.00000  8.000000
max    6.000000  8.000000  7.00000  9.00000  9.000000
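The table notes that mean depends on axis and that corr supports several methods. Here is a minimal sketch of both, using a tiny hand-made frame `df_stat` (an assumption, so the expected numbers are easy to verify by hand).

```python
import pandas as pd

df_stat = pd.DataFrame(
    {"col0": [1.0, 2.0, 3.0, 4.0], "col1": [2.0, 4.0, 6.0, 8.0]}
)

# axis=0 (the default) aggregates down each column.
col_means = df_stat.mean(axis=0)     # col0 -> 2.5, col1 -> 5.0

# axis=1 aggregates across each row.
row_means = df_stat.mean(axis=1)     # 1.5, 3.0, 4.5, 6.0

# Pearson correlation; col1 is exactly 2 * col0, so the coefficient is 1.
pearson = df_stat.corr(method="pearson")

# The 0.5 quantile of each column equals its median.
q = df_stat.quantile(0.5)

print(col_means)
print(row_means)
print(pearson)
print(q)
```

The same axis convention applies to sum, min, max, std, and the other reductions in the table.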