Basic Pandas Usage, Part 1

苦茶Fighting

Contents

Introduction to Pandas
- Series
- DataFrame
Selecting data in Pandas
Setting values in Pandas

Introduction to Pandas

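A NumPy array is like a list: its elements are addressed by integer position and carry no labels. A Pandas object is more like a dictionary: every value has a label. Pandas is built on top of NumPy and makes NumPy-centric applications simpler to write.
Pandas has two main data structures, Series and DataFrame.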
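
To make the position-versus-label contrast concrete, here is a minimal sketch. The names arr and prices and the sample values are illustrative assumptions, not taken from the original article.

```python
import numpy as np
import pandas as pd

# A NumPy array is addressed by integer position only.
arr = np.array([10, 20, 30])
second = arr[1]

# A Series attaches a label to every value, much like a dict.
prices = pd.Series([10, 20, 30], index=["apple", "pear", "plum"])
pear = prices["pear"]

# The data behind a Series can still be handed back as a NumPy array.
raw = prices.to_numpy()
print(second, pear, type(raw))
```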

Series

import pandas as pd
import numpy as np
s = pd.Series([1,3,6,np.nan,44,1])
print(s)
print(s[1])   # access by label; with the default index, label 1 is also position 1

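When printed, a Series shows the index on the left and the values on the right. Because no index was specified, a default integer index from 0 to N-1 is created. Below is a Series with an explicit index: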

grade = pd.Series([100,59,80],index=["李明","李红","王美"])
print(grade.values)
print(grade.index)
print(grade["李明"])
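
Because a labeled Series behaves partly like a dict and partly like an array, a few more operations may be worth sketching. The pass mark of 60 and the 5-point adjustment below are made-up values for illustration only.

```python
import pandas as pd

grade = pd.Series([100, 59, 80], index=["李明", "李红", "王美"])

# Membership tests check the index labels, as with dict keys.
has_li_ming = "李明" in grade

# Boolean filtering keeps the labels attached to the surviving values.
passed = grade[grade >= 60]

# Arithmetic is vectorized and preserves the index.
curved = grade + 5
print(has_li_ming, list(passed.index), curved["李红"])
```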

DataFrame

dates = pd.date_range("20160101",periods=6)
df = pd.DataFrame(np.random.randn(6,4),index = dates,columns=['a','b','c','d'])  
print(df)

                   a         b         c         d
2016-01-01 -0.378199 -0.300236 -1.207843 -1.658223
2016-01-02 -1.031397 -0.834695 -0.417703 -0.318720
2016-01-03 -2.346667  1.615651  1.726296  1.152253
2016-01-04  1.389872  0.952453 -0.737092  1.555059
2016-01-05  0.735490  0.297005 -0.542341  0.559540
2016-01-06 -1.962791  1.776028 -1.917368 -0.679542

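A DataFrame is a tabular data structure. It holds an ordered set of columns, and each column can have a different value type (numeric, string, boolean, and so on).
A DataFrame has both a row index and a column index, and it can be thought of as a large dictionary of Series that share one index.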
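
The "dictionary of Series" view can be sketched directly. The height and name data below are hypothetical sample values.

```python
import pandas as pd

# Hypothetical sample data: two Series that share the same index.
height = pd.Series([1.70, 1.62], index=["x", "y"])
name = pd.Series(["Ann", "Bob"], index=["x", "y"])

# Each dict key becomes a column label; the shared index becomes the row index.
people = pd.DataFrame({"height": height, "name": name})

# Selecting one column gives back a Series.
col = people["height"]
print(type(col).__name__, list(people.columns), list(people.index))
```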

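Next, access the data in the DataFrame. Note that with chained indexing, a specific element is addressed by column label first and row label second.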

print(df['b'])

2016-01-01   -0.300236
2016-01-02   -0.834695
2016-01-03    1.615651
2016-01-04    0.952453
2016-01-05    0.297005
2016-01-06    1.776028
Freq: D, Name: b, dtype: float64

print(df['b']['2016-01-05'])

0.2970052798746942
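
A single element can also be looked up in one step with loc or at, which take the row label first and the column label second. This is a minimal sketch; it fills the frame with integers instead of random numbers (an assumption made here so the result is reproducible).

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=["a", "b", "c", "d"])

# Chained indexing: column first, then row.
chained = df["b"]["2016-01-05"]

# loc and at take (row label, column label) in a single step.
via_loc = df.loc["2016-01-05", "b"]
via_at = df.at[pd.Timestamp("2016-01-05"), "b"]
print(chained, via_loc, via_at)
```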

Create a DataFrame without explicit row or column labels and access it:

df = pd.DataFrame(np.arange(12).reshape((3,4)))
print(df)
print(df[1][0])
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
1

Specify the type of each column

df1 = pd.DataFrame({
    'A':1,
    'B':pd.Timestamp('20180928'),
    'C':pd.Series(1,index=list(range(4)),dtype='float32'),
    'D':np.array([3]*4,dtype='int32'),
    'E':pd.Categorical(["test","train","test","train"]),
    'F':'foo'})
print(df1)
print(df1['B'])
print(df1['B'][1])

   A          B    C  D      E    F
0  1 2018-09-28  1.0  3   test  foo
1  1 2018-09-28  1.0  3  train  foo
2  1 2018-09-28  1.0  3   test  foo
3  1 2018-09-28  1.0  3  train  foo
0   2018-09-28
1   2018-09-28
2   2018-09-28
3   2018-09-28
Name: B, dtype: datetime64[ns]
2018-09-28 00:00:00
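
The per-column types can be checked with the dtypes attribute. The sketch below rebuilds the same frame and also shows astype, which is an addition for illustration and not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "A": 1,
    "B": pd.Timestamp("20180928"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# dtypes reports one dtype per column.
print(df1.dtypes)

# astype returns a converted copy; df1 itself is unchanged.
d_as_float = df1["D"].astype("float64")
```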

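View the row labels (the output below is from pandas before 2.0; Int64Index was removed in pandas 2.0, which prints Index([0, 1, 2, 3], dtype='int64') instead)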

print(df1.index)

Int64Index([0, 1, 2, 3], dtype='int64')

View the column labels

print(df1.columns)
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

View all the values

print(df1.values)

[[1 Timestamp('2018-09-28 00:00:00') 1.0 3 'test' 'foo']
 [1 Timestamp('2018-09-28 00:00:00') 1.0 3 'train' 'foo']
 [1 Timestamp('2018-09-28 00:00:00') 1.0 3 'test' 'foo']
 [1 Timestamp('2018-09-28 00:00:00') 1.0 3 'train' 'foo']]

View summary statistics

print(df1.describe())

         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0

Sort by index labels (axis=1 sorts the column labels; ascending=False gives descending order)

print(df1.sort_index(axis=1,ascending=False))
     F      E  D    C          B  A
0  foo   test  3  1.0 2018-09-28  1
1  foo  train  3  1.0 2018-09-28  1
2  foo   test  3  1.0 2018-09-28  1
3  foo  train  3  1.0 2018-09-28  1

Sort rows by the values in a column

print(df1.sort_values(by='B'))
   A          B    C  D      E    F
0  1 2018-09-28  1.0  3   test  foo
1  1 2018-09-28  1.0  3  train  foo
2  1 2018-09-28  1.0  3   test  foo
3  1 2018-09-28  1.0  3  train  foo
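
Every value in column B above is identical, so the sorted output looks the same as the input. A sketch with distinct values shows the effect more clearly; the scores frame is hypothetical sample data.

```python
import pandas as pd

scores = pd.DataFrame({"team": ["b", "a", "b", "a"], "score": [3, 9, 7, 1]})

# Sort rows by one column, largest first.
by_score = scores.sort_values(by="score", ascending=False)

# Sort by several columns: team ascending, then score descending within each team.
by_team = scores.sort_values(by=["team", "score"], ascending=[True, False])
print(by_score)
print(by_team)
```

sort_values returns a new DataFrame and leaves scores in its original order.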

Selecting data in Pandas

Simple selection

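dates = pd.date_range('20130101',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)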
            A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23

print(df['A'])
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int64

print(df.A)
2013-01-01     0
2013-01-02     4
2013-01-03     8
2013-01-04    12
2013-01-05    16
2013-01-06    20
Freq: D, Name: A, dtype: int64

print(df[0:3])
            A  B   C   D
2013-01-01  0  1   2   3
2013-01-02  4  5   6   7
2013-01-03  8  9  10  11

print(df['20130102':'20130104'])
             A   B   C   D
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
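
The two slices above follow different endpoint rules: a positional slice excludes its stop position, while a label slice includes both endpoints. The sketch below spells the same two selections with iloc and loc so that the intent is explicit.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Positional slice: the stop position is excluded, so this returns rows 0, 1, 2.
by_position = df.iloc[0:3]

# Label slice: both endpoints are included, so this also returns 3 rows.
by_label = df.loc["20130102":"20130104"]
print(len(by_position), len(by_label))
```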

Label-based selection with loc

print(df.loc['20130102'])
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int64

print(df.loc[:,['A','B']])
             A   B
2013-01-01   0   1
2013-01-02   4   5
2013-01-03   8   9
2013-01-04  12  13
2013-01-05  16  17
2013-01-06  20  21

print(df.loc['20130102',['A','B']])
A    4
B    5
Name: 2013-01-02 00:00:00, dtype: int64

Position-based selection with iloc

print(df.iloc[3,1])
13

print(df.iloc[3:5,1:3])
             B   C
2013-01-04  13  14
2013-01-05  17  18

print(df.iloc[[1,3,5],1:3])
             B   C
2013-01-02   5   6
2013-01-04  13  14
2013-01-06  21  22
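
iloc accepts the same kinds of positions as Python lists and NumPy arrays, including negative positions and stepped slices. A short sketch on the same data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# A negative position counts from the end; one row comes back as a Series.
last_row = df.iloc[-1]

# Every second row, first and last columns.
every_other = df.iloc[::2, [0, 3]]
print(last_row["A"], every_other.values.tolist())
```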

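Mixed selection (ix was removed in pandas 1.0; combine loc or iloc with an index lookup instead)

print(df.loc[df.index[:3],['A','C']])   # rows by position, columns by label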
            A   C
2013-01-01  0   2
2013-01-02  4   6
2013-01-03  8  10
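
Since ix is not available in pandas 1.0 and later, a mixed position/label selection has to be expressed through loc or iloc. The sketch below shows two equivalent spellings; which one reads better is a matter of taste.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Option 1: translate row positions to labels, then use loc.
via_loc = df.loc[df.index[:3], ["A", "C"]]

# Option 2: translate column labels to positions, then use iloc.
via_iloc = df.iloc[:3, df.columns.get_indexer(["A", "C"])]
print(via_loc.equals(via_iloc))
```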

Selection by boolean condition

print(df[df.A>8])
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
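
Conditions can be combined. The sketch below assumes the same 6x4 integer frame; note that the element-wise operators & and | are used instead of and/or, and that each condition needs its own parentheses.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Both conditions must hold.
both = df[(df.A > 8) & (df.C < 22)]

# Either condition may hold.
either = df[(df.A < 4) | (df.A > 16)]

# isin tests membership in a list of allowed values.
picked = df[df.A.isin([4, 12])]
print(both.A.tolist(), either.A.tolist(), picked.A.tolist())
```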

Setting values in Pandas


# Setting values in Pandas
dates = pd.date_range('20180901',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['A','B','C','D'])
print(df)
             A   B   C   D
2018-09-01   0   1   2   3
2018-09-02   4   5   6   7
2018-09-03   8   9  10  11
2018-09-04  12  13  14  15
2018-09-05  16  17  18  19
2018-09-06  20  21  22  23

# Set a value by label (loc) or by position (iloc)
df.loc['20180903','B'] = 100
df.iloc[5,3] = 200
print(df)
             A    B   C    D
2018-09-01   0    1   2    3
2018-09-02   4    5   6    7
2018-09-03   8  100  10   11
2018-09-04  12   13  14   15
2018-09-05  16   17  18   19
2018-09-06  20   21  22  200


# Set values by condition (a single loc call avoids chained assignment)
df.loc[df.A>9,'B'] = 0
print(df)
             A    B   C    D
2018-09-01   0    1   2    3
2018-09-02   4    5   6    7
2018-09-03   8  100  10   11
2018-09-04  12    0  14   15
2018-09-05  16    0  18   19
2018-09-06  20    0  22  200
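
A single loc call with a boolean mask and a column label assigns in one step, whereas chained indexing such as df.B[mask] = value can end up writing to a temporary copy and is best avoided. The sketch below also shows Series.where as a non-mutating alternative; the cap of 10 is an arbitrary illustrative value.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20180901", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Select rows by mask and the column by label in one indexing operation.
df.loc[df.A > 9, "B"] = 0

# Series.where keeps values where the condition holds and replaces the rest,
# returning a new Series instead of modifying df.
capped = df["D"].where(df["D"] < 10, 10)
print(df["B"].tolist(), capped.tolist())
```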

# Set a whole column at once (this adds a new column F)
df['F'] = 0
print(df)
             A    B   C    D  F
2018-09-01   0    1   2    3  0
2018-09-02   4    5   6    7  0
2018-09-03   8  100  10   11  0
2018-09-04  12    0  14   15  0
2018-09-05  16    0  18   19  0
2018-09-06  20    0  22  200  0

# Add a column from a Series (values are aligned on the index)
df['E'] = pd.Series([1,2,3,4,5,6],index = dates)
print(df)
             A    B   C    D  F  E
2018-09-01   0    1   2    3  0  1
2018-09-02   4    5   6    7  0  2
2018-09-03   8  100  10   11  0  3
2018-09-04  12    0  14   15  0  4
2018-09-05  16    0  18   19  0  5
2018-09-06  20    0  22  200  0  6
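
When a Series is assigned as a column, its values are matched to rows by index label, not by position. The sketch below uses a smaller hypothetical frame and a Series that covers only some of the dates to show what happens to the rows left without a match.

```python
import pandas as pd

dates = pd.date_range("20180901", periods=3)
df = pd.DataFrame({"A": [0, 4, 8]}, index=dates)

# Only the first two dates appear in this Series.
partial = pd.Series([1, 2], index=dates[:2])
df["E"] = partial   # matched by index label; the unmatched row becomes NaN

# A plain list has no index, so it is assigned by position and must match the length.
df["G"] = [7, 8, 9]
print(df)
```

Because of the missing value, column E is stored as float64 even though the Series held integers.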