Pandas Basics, Part 2

Wind_know

Categories: pandas, scientific computing libraries. Tags: python

Article link: https://blog.csdn.net/Wind_know/article/details/106128540

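Pandas is one of the most important libraries in scientific computing. Its DataFrame data structure maps naturally onto JSON, so converting between the two is easy.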

The randn() function

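Purpose: returns one sample, or an array of samples, drawn from the standard normal distribution, i.e. the normal distribution with mean 0 and standard deviation 1, written N(0, 1).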

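np.random.randn(d0, d1, ..., dn)
1) With no arguments, it returns a single float.
2) With one argument, it returns a rank-1 (one-dimensional) array of shape (d0,). Such an array is neither a row vector nor a column vector, and it is not a matrix.
3) With two or more arguments, it returns an array with the corresponding dimensions, which can represent a row vector, a column vector or a matrix.
4) np.random.standard_normal() is similar to np.random.randn(), but it takes the shape as a single tuple (or an int) rather than as separate arguments.
5) The arguments to np.random.randn() must be integers. Older NumPy versions silently truncated floats to integers; current versions raise a TypeError instead.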
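
The points above can be checked with a short sketch. The variable names here are only illustrative; the values are random, so only the types and shapes are printed.

```python
import numpy as np

scalar = np.random.randn()                 # no arguments: a single Python float
vec = np.random.randn(3)                   # one argument: 1-D array, shape (3,)
mat = np.random.randn(2, 3)                # two arguments: 2-D array, shape (2, 3)
same = np.random.standard_normal((2, 3))   # same shape, passed as one tuple

print(type(scalar), vec.shape, mat.shape, same.shape)

# A rank-1 array can be turned into an explicit column vector with reshape.
column = vec.reshape(-1, 1)
print(column.shape)
```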

Concatenation (concat)

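For join/merge-type operations, pandas provides various facilities for easily combining Series and DataFrame objects, with several kinds of set logic for the indexes and relational-algebra functionality.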

In [73]: df = pd.DataFrame(np.random.randn(10, 4))  # 10 rows and 4 columns of standard-normal samples

In [74]: df
Out[74]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

In [75]: pieces = [df[:3], df[3:7], df[7:]]  # split the frame into three row slices

In [76]: pd.concat(pieces)  # concatenate the slices back together
Out[76]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495
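
As a small self-contained check of the session above, the sketch below rebuilds the frame from its slices and compares it with the original. The second part uses ignore_index=True, which is an addition not shown in the session: it discards the original row labels and renumbers the result from 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]

# Concatenating the slices in order restores the original rows and labels.
rebuilt = pd.concat(pieces)
print(rebuilt.equals(df))

# With ignore_index=True the old labels (7, 8, 9, 0, 1, 2) are dropped
# and the rows are renumbered 0..5.
renumbered = pd.concat([df[7:], df[:3]], ignore_index=True)
print(renumbered.index.tolist())
```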

Join (SQL style)

pd.merge performs SQL-style joins.

In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [79]: left
Out[79]: 
   key  lval
0  foo     1
1  foo     2

In [80]: right
Out[80]: 
   key  rval
0  foo     4
1  foo     5

In [81]: pd.merge(left, right, on='key')
Out[81]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
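
The result has four rows because the key 'foo' appears twice on each side, so every left row is paired with every matching right row (2 x 2). The sketch below repeats that merge and contrasts it with a one-to-one merge. The frames left_u and right_u are made up for illustration, and validate='one_to_one' is an optional check that raises an error if a key is duplicated on either side.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# Duplicate keys on both sides: every combination of matching rows is kept.
merged = pd.merge(left, right, on='key')
print(len(merged))  # 2 * 2 = 4

# Unique keys on both sides: one output row per key.
left_u = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right_u = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
one_to_one = pd.merge(left_u, right_u, on='key', validate='one_to_one')
print(one_to_one)
```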

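Group-by computation

Group-by computation has three steps: split -> apply -> combine

Split: what are the rows grouped by?
Apply: what computation is run on each group?
Combine: the per-group results are merged into one result.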
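
To make the three steps concrete, here is a minimal sketch that carries them out by hand and compares the result with the one-line groupby call. The fixed data values are chosen for illustration, and the manual version is only a teaching aid, not how pandas implements groupby internally.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1, 5, 4, 3, 3]})

# Split: one sub-frame per distinct value of key1.
groups = {key: sub for key, sub in df.groupby('key1')}

# Apply: compute the mean of data1 inside each sub-frame.
results = {key: sub['data1'].mean() for key, sub in groups.items()}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(results, name='data1')
manual.index.name = 'key1'

# The built-in call performs all three steps at once.
builtin = df.groupby('key1')['data1'].mean()
print(manual)
print(builtin)
```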

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                  'key2': ['one', 'two', 'one', 'two', 'one'],
                  'data1': np.random.randint(1, 10, 5),
                  'data2': np.random.randint(1, 10, 5)})
df


   data1  data2 key1 key2
0      1      6    a  one
1      5      9    a  two
2      4      7    b  one
3      3      7    b  two
4      3      5    a  one


# Group a Series
# The grouping key is matched to the data by index alignment

grouped = df['data1'].groupby(df['key1'])
grouped.mean()
key1
a    3.0
b    3.5
Name: data1, dtype: float64
df['data1'].groupby([df['key1'], df['key2']]).mean()
key1  key2
a     one     2.0
      two     5.0
b     one     4.0
      two     3.0
Name: data1, dtype: float64
# Group a DataFrame
# key2 holds strings, so numeric_only=True restricts the mean to the numeric columns
df.groupby('key1').mean(numeric_only=True)
      data1     data2
key1                 
a       3.0  6.666667
b       3.5  7.000000
means = df.groupby(['key1', 'key2']).mean()['data1']
means
key1  key2
a     one     2.0
      two     5.0
b     one     4.0
      two     3.0
Name: data1, dtype: float64
means.unstack()
key2  one  two
key1          
a     2.0  5.0
b     4.0  3.0
df.groupby(['key1', 'key2'])['data1'].mean()
key1  key2
a     one     2.0
      two     5.0
b     one     4.0
      two     3.0
Name: data1, dtype: float64
# Number of rows in each group
df.groupby(['key1', 'key2']).size()
key1  key2
a     one     2
      two     1
b     one     1
      two     1
dtype: int64
# Iterate over the groups
for name, group in df.groupby('key1'):
    print(name)
    print(group)

a
   data1  data2 key1 key2
0      1      6    a  one
1      5      9    a  two
4      3      5    a  one
b
   data1  data2 key1 key2
2      4      7    b  one
3      3      7    b  two
for name, group in df.groupby(['key1', 'key2']):
    print(name)
    print(group)
('a', 'one')
   data1  data2 key1 key2
0      1      6    a  one
4      3      5    a  one
('a', 'two')
   data1  data2 key1 key2
1      5      9    a  two
('b', 'one')
   data1  data2 key1 key2
2      4      7    b  one
('b', 'two')
   data1  data2 key1 key2
3      3      7    b  two

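Grouping by multiple columns forms a hierarchical index, and the sum function can again be applied to it. (In the session below, df is a different DataFrame with columns A, B, C and D; its definition is not shown here.)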

In [90]: df.groupby(['A', 'B']).sum()
Out[90]: 
                  C         D
A   B                        
bar one    1.511763  0.396823
    three -0.990582 -0.532532
    two    1.211526  1.208843
foo one    1.614581 -1.658537
    three  0.024580 -0.264610
    two    1.185429  1.348368

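Data aggregation

A grouped computation first splits the data according to some rule and then aggregates each part; mean() and sum(), seen earlier, are examples of aggregation. During aggregation, the data belonging to each group key is passed in turn to the aggregation function. The per-group results are then combined into the final result.

Besides the built-in aggregations such as sum(), min(), max() and mean(), you can define your own aggregation function and apply it with agg() or its alias aggregate().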

# Built-in aggregation functions
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                  'key2': ['one', 'two', 'one', 'two', 'one'],
                  'data1': np.random.randint(1, 10, 5),
                  'data2': np.random.randint(1, 10, 5)})
df
   data1  data2 key1 key2
0      9      3    a  one
1      3      8    a  two
2      9      5    b  one
3      8      5    b  two
4      9      2    a  one
df['data1'].groupby(df['key1']).sum()
key1
a    21
b    17
Name: data1, dtype: int32
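# Custom aggregation functions
def peak_verbose(s):
    # Print the type of each object passed in; how many times this is
    # printed depends on the pandas version.
    print(type(s))
    return s.max() - s.min()

def peak(s):
    # Range of the group: largest value minus smallest value
    return s.max() - s.min()
grouped = df.groupby('key1')
# Select the numeric columns: subtraction is not defined for the strings in key2
grouped[['data1', 'data2']].agg(peak_verbose)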
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
      data1  data2
key1              
a         6      6
b         1      0
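# Apply several aggregation functions at once
# (a list of columns must be passed as grouped[[...]]; grouped['data1', 'data2'] no longer works in pandas 2.x)
grouped[['data1', 'data2']].agg(['mean', 'std', peak])
     data1                    data2               
      mean       std peak      mean      std peak
key1                                              
a      7.0  3.464102    6  4.333333  3.21455    6
b      8.5  0.707107    1  5.000000  0.00000    0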
# Name the aggregated columns
grouped['data1'].agg([('average', 'mean'), ('max-range', peak)])
      average  max-range
key1                    
a         7.0          6
b         8.5          1
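
Newer pandas versions (0.25 and later) also accept keyword arguments for naming the outputs, often called named aggregation. The sketch below is an alternative to the list-of-tuples form, not the method used in the original session; it rebuilds a small frame with the same data1 values so that it runs on its own. Keyword names must be valid Python identifiers, hence max_range instead of max-range.

```python
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [9, 3, 9, 8, 9]})

def peak(s):
    # Range of the group: largest value minus smallest value
    return s.max() - s.min()

# Each keyword becomes an output column; the value is the aggregation to apply.
named = df.groupby('key1')['data1'].agg(average='mean', max_range=peak)
print(named)
```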
# Apply different aggregation functions to different columns
# This is done by passing a dict

d = {'data1': ['mean', peak, 'max', 'min'],
     'data2': 'sum'}
grouped.agg(d)
     data1              data2
      mean peak max min   sum
key1                         
a      7.0    6   9   3    13
b      8.5    1   9   8    10

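Loading data into pandas

Indexing: reading one or more columns to build the DataFrame, including whether the index and the column names are taken from the file
Type inference and data conversion: including user-defined conversions and missing-value markers
Date parsing
Iteration: reading large files chunk by chunk. pandas returns each chunk as a DataFrame, whereas Python's built-in csv module yields one row at a time
Irregular data: skipping some rows, comments, and so on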
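
Several of these options can be sketched in one read_csv call. The CSV text below is made up for illustration and is read from an in-memory buffer so that the example runs without any files on disk; the column names and the 'missing' marker are assumptions, not part of the original data files.

```python
import io
import pandas as pd

# Hypothetical CSV content: a comment line, a header and three records.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "date,city,temp\n"
    "2016-03-01,Beijing,12.5\n"
    "2016-03-02,Shanghai,missing\n"
    "2016-03-03,Shenzhen,21.0\n"
)

df = pd.read_csv(
    raw,
    comment='#',            # irregular data: skip comment lines
    na_values=['missing'],  # treat this marker as a missing value
    parse_dates=['date'],   # date parsing
    index_col='date',       # use the date column as the row index
)
print(df)
print(df.dtypes)
```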

Index and column names

%more data/ex1.csv
df = pd.read_csv('data/ex1.csv')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
df = pd.read_csv('data/ex1.csv', sep=',')
df
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
%more data/ex2.csv
# The file has no header row
pd.read_csv('data/ex2.csv', header=None)
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
# Specify the column names
pd.read_csv('data/ex2.csv', header=None, names=['a', 'b', 'c', 'd', 'msg'])
   a   b   c   d    msg
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo
# Specify the row index
pd.read_csv('data/ex2.csv', header=None, names=['a', 'b', 'c', 'd', 'msg'], index_col='msg')
       a   b   c   d
msg                 
hello  1   2   3   4
world  5   6   7   8
foo    9  10  11  12
# Multi-level row index
pd.read_csv('data/ex2.csv', header=None, names=['a', 'b', 'c', 'd', 'msg'], index_col=['msg', 'a'])
          b   c   d
msg   a            
hello 1   2   3   4
world 5   6   7   8
foo   9  10  11  12

Reading data in chunks

pd.read_csv('data/ex6.csv', nrows=10)

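# Count how many times each key occurs
tr = pd.read_csv('data/ex6.csv', chunksize=1000)

# Start from an empty Series with an explicit dtype (pd.Series([]) triggers a warning in newer pandas)
key_count = pd.Series(dtype='float64')
for pieces in tr:
    # Add this chunk's counts; keys missing on either side count as 0
    key_count = key_count.add(pieces['key'].value_counts(), fill_value=0)
key_count = key_count.sort_values(ascending=False)
key_count[:10]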
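
Since data/ex6.csv is not included with the article, here is a self-contained sketch of the same pattern on a tiny in-memory CSV. The data and the chunk size of 3 are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical data: seven rows whose keys are a, b, c, a, b, c, a.
rows = "".join(f"{k},{i}\n" for i, k in enumerate("abcabca"))
raw = io.StringIO("key,value\n" + rows)

# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(raw, chunksize=3)

key_count = pd.Series(dtype='float64')
for chunk in reader:
    # Merge this chunk's counts into the running total.
    key_count = key_count.add(chunk['key'].value_counts(), fill_value=0)

key_count = key_count.sort_values(ascending=False).astype(int)
print(key_count)
```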

Saving data to disk

df = pd.read_csv('data/ex5.csv')
df
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
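
The section stops after loading the frame, so the following is only a sketch of how such a frame could be written back out with DataFrame.to_csv. It builds a similar frame by hand and writes to an in-memory buffer instead of a file; passing a path such as 'data/out.csv' (a hypothetical name) would write to disk instead. The choices index=False and na_rep='NULL' are illustrative.

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame({'something': ['one', 'two', 'three'],
                   'a': [1, 5, 9],
                   'c': [3.0, np.nan, 11.0],
                   'message': [np.nan, 'world', 'foo']})

buffer = io.StringIO()
# index=False omits the row labels; na_rep sets the text written for missing values.
df.to_csv(buffer, index=False, na_rep='NULL')
text = buffer.getvalue()
print(text)

# Reading the text back: 'NULL' is one of the markers read_csv treats as missing by default.
restored = pd.read_csv(io.StringIO(text))
print(restored)
```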

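Dates and times

Timestamp: a fixed moment in time -> pd.Timestamp
Period: a fixed span, such as March 2016 or the sales for the year 2015 -> pd.Period
Interval: a span defined by a start time and an end time; a period is a special case of an interval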
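
The three concepts can be illustrated with a short sketch. The specific dates are made up, and pd.Interval is used here as one possible way to represent a general time interval; the article itself does not name a class for it.

```python
import pandas as pd

# Timestamp: a single moment.
ts = pd.Timestamp('2016-03-15 08:30')

# Period: the whole month of March 2016.
month = pd.Period('2016-03', freq='M')
print(month.start_time, month.end_time)

# Interval: an arbitrary span given by its two end points (left-closed here).
interval = pd.Interval(pd.Timestamp('2016-03-01'),
                       pd.Timestamp('2016-04-01'),
                       closed='left')

# The timestamp falls inside both the interval and the period.
print(ts in interval)
print(month.start_time <= ts <= month.end_time)
```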

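What dates and times are used for in pandas

Analyzing financial data, such as stock trading records
Analyzing server logs

Python datetime

The Python standard library provides date and time handling, which is the foundation for dates and times in pandas.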

from datetime import datetime
from datetime import timedelta

dates = [datetime(2016, 3, 1), datetime(2016, 3, 2), datetime(2016, 3, 3), datetime(2016, 3, 4)]
s = pd.Series(np.random.randn(4), index=dates)
s
# pandas uses Timestamp to represent points in time
2016-03-01    1.650889
2016-03-02   -0.328463
2016-03-03    1.674872
2016-03-04   -0.310849
dtype: float64
type(s.index)
pandas.core.indexes.datetimes.DatetimeIndex
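
The session imports timedelta without using it. As a small sketch of where it fits, the dates below are generated by adding timedelta offsets to a start date, and the resulting series is then selected by date. Fixed values replace the random ones so the selection is easy to follow.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

start = datetime(2016, 3, 1)
# Four consecutive days built with timedelta arithmetic.
dates = [start + timedelta(days=i) for i in range(4)]
s = pd.Series(np.arange(4.0), index=dates)

# The datetime objects are converted to a DatetimeIndex of Timestamps.
print(isinstance(s.index, pd.DatetimeIndex))
print(type(s.index[0]))

# Date strings can slice the series; both end points are included.
window = s['2016-03-02':'2016-03-03']
print(window)

# A datetime object also works as a label.
last = s[datetime(2016, 3, 4)]
print(last)
```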

# Generate a date range
pd.date_range(start='20160320', periods=10)
DatetimeIndex(['2016-03-20', '2016-03-21', '2016-03-22', '2016-03-23',
               '2016-03-24', '2016-03-25', '2016-03-26', '2016-03-27',
               '2016-03-28', '2016-03-29'],
              dtype='datetime64[ns]', freq='D')
# A sequence of periods
pd.period_range(start='2016-01', periods=12, freq='M')
PeriodIndex(['2016-01', '2016-02', '2016-03', '2016-04', '2016-05', '2016-06',
             '2016-07', '2016-08', '2016-09', '2016-10', '2016-11', '2016-12'],
            dtype='period[M]')