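Getting Started with Pandas
Pandas has two core data types: Series and DataFrame.
Both bind data to an index.
Part 1: The Series Type
A Series is a one-dimensional labeled array.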
import pandas as pd
a = pd.Series([9,8,7,6])
print(a)
'''
0 9
1 8
2 7
3 6
dtype: int64
'''
b = pd.Series([9,8,7,6],index =['a','b','c','d'] )
print(b)
'''
a 9
b 8
c 7
d 6
dtype: int64
'''
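1.1 Creating a Series
- From a Python list: the index must have as many labels as the list has elements.
- From a scalar: the index determines the length of the Series.
- From a Python dict: the keys become the index, and the index argument selects (and orders) entries from the dict.
- From an ndarray: both the data and the index can be built from ndarrays.
- From other functions, such as range().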
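The list and range() cases above have no example of their own, so here is a minimal sketch. The variable names are illustrative only and do not come from the original notes.

```python
import pandas as pd

# From a list: the index needs exactly one label per element.
from_list = pd.Series([9, 8, 7], index=['a', 'b', 'c'])

# From range(): each value produced by range() becomes one element.
from_range = pd.Series(range(3), index=['x', 'y', 'z'])

print(from_list)
print(from_range)

# A length mismatch between the data and the index raises ValueError.
try:
    pd.Series([9, 8, 7], index=['a', 'b'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```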
1.1.1 Creating from a scalar
import pandas as pd
s = pd.Series(25,index =['a','b','c'] )
print(s)
'''
a 25
b 25
c 25
dtype: int64
'''
1.1.2 Creating from a dict
import pandas as pd
s = pd.Series({'a':25,'b':23,'c':1 })
print(s)
'''
a 25
b 23
c 1
dtype: int64
'''
s = pd.Series({'a':25,'b':23,'c':1 },index = ['a','d','b','c'])  # the index argument sets the order; 'd' is not a key in the dict, so it becomes NaN
print(s)
'''
a 25.0
d NaN
b 23.0
c 1.0
dtype: float64
'''
1.1.3 Creating from an ndarray
import pandas as pd
import numpy as np
n = pd.Series(np.arange(5))
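print(n)  # the dtype follows the ndarray: int64 on most platforms, int32 on Windows with NumPy versions before 2.0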
'''
0 0
1 1
2 2
3 3
4 4
dtype: int32
'''
n = pd.Series(np.arange(5),index = np.arange(9,4,-1))
print(n)
'''
9 0
8 1
7 2
6 3
5 4
dtype: int32
'''
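1.2 Basic Series operations
- A Series has two parts, an index and values. .index returns the index (an Index object) and .values returns the data (a NumPy ndarray).
- A Series behaves much like an ndarray.
- A Series also behaves much like a Python dict.
1.2.1 A Series has an index and values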
import pandas as pd
b = pd.Series([9,8,7,6],index =['a','b','c','d'] )
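print(b.index)  # Index(['a', 'b', 'c', 'd'], dtype='object')
print(b.values)  # [9 8 7 6]
print(b['b'])  # label-based lookup: 8
print(b.iloc[1])  # position-based lookup: 8 (plain b[1] is deprecated when the index holds labels)
# Labels and positions coexist but cannot be mixed: 0 is treated as a label that does not exist.
# b[['c','d',0]] raises KeyError in current pandas, so reindex() is used to show the missing label.
print(b.reindex(['c','d',0]))
'''
c 7.0
d 6.0
0 NaN
dtype: float64
'''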
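Because labels and positions are two separate lookup schemes, it can help to name the scheme explicitly. The sketch below uses the standard .loc (label) and .iloc (position) accessors. The variable names are illustrative.

```python
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

by_label = b.loc['b']         # label-based lookup
by_position = b.iloc[1]       # position-based lookup
label_slice = b.loc['b':'c']  # label slices include both endpoints
pos_slice = b.iloc[1:3]       # position slices exclude the end position

print(by_label, by_position)          # 8 8
print(label_slice.equals(pos_slice))  # True
```

Note the asymmetry: the label slice 'b':'c' and the position slice 1:3 select the same two entries, because only the position slice drops its end point.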
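1.2.2 A Series behaves like an ndarray
- Indexing works the same way, with [].
- NumPy operations and functions can be applied to a Series.
- A list of labels can be used to select several entries.
- Positional slices also work; when custom labels exist, they are sliced along with the data.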
import pandas as pd
b = pd.Series([9,8,7,6],index =['a','b','c','d'] )
print(b[:3])  # positions 0 through 2 (the end position 3 is excluded)
'''
a 9
b 8
c 7
'''
print(b[b>b.median()])  # keep the values greater than the median (7.5)
'''
a 9
b 8
dtype: int64
'''
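The bullets above also mention applying NumPy functions and selecting by a list of labels, which the example does not cover. A short sketch of both follows. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

roots = np.sqrt(b)      # a NumPy ufunc returns a Series with the same index
picked = b[['d', 'a']]  # a list of labels selects (and reorders) entries
doubled = b * 2         # arithmetic is element-wise, as with an ndarray

print(roots)
print(picked)
print(doubled)
```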
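1.2.3 A Series behaves like a Python dict
- Look up values by label.
- Use the in keyword (it tests whether a label is in the index, not whether a value is in the data).
- Use the .get() method.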
import pandas as pd
b = pd.Series([9,8,7,6],index =['a','b','c','d'] )
print('c' in b)  # True
print(0 in b)  # False: in checks labels, not positions
print(b.get('f'))  # None
print(b.get('f',100))  # 100: 'f' is missing, so the default passed as the second argument is returned
1.2.4 Index alignment
Series + Series
import pandas as pd
a = pd.Series([1,2,3],['c','d','e'])
b = pd.Series([9,8,7,6],index =['a','b','c','d'] )
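print(a+b)  # values are aligned by label; the result covers the union of both indexes, and labels found in only one Series become NaN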
'''
a NaN
b NaN
c 8.0
d 8.0
e NaN
dtype: float64
'''
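If the NaN entries produced by alignment are not wanted, one option is Series.add() with fill_value, which treats a label missing from one side as the given value. This is a sketch of that alternative, not something the original notes use.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['c', 'd', 'e'])
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

# Labels missing from one Series are treated as 0 before adding.
total = a.add(b, fill_value=0)
print(total)
'''
a    9.0
b    8.0
c    8.0
d    8.0
e    3.0
dtype: float64
'''
```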
1.2.5 The name attribute
Both a Series object and its index can have a name, stored in the .name attribute.
import pandas as pd
b = pd.Series([9,8,7,6],index =['a','b','c','d'] )
b.name = 'Series object'
b.index.name = 'index column'
print(b)
'''
index column
a 9
b 8
c 7
d 6
Name: Series object, dtype: int64
'''
1.2.6 Modifying a Series
A Series can be modified at any time, and changes take effect immediately.
import pandas as pd
b = pd.Series([9,8,7,6],index =['a','b','c','d'] )
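b.name = 'Series object'
b.name = "new series"  # reassigning .name replaces the old name
b[['b','c']] = 20  # a list of labels assigns to several entries at once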
print(b)
'''
a 9
b 20
c 20
d 6
Name: new series, dtype: int64
'''
1.2.7 Cleaning Series data
Finding missing values: for a single float, test with math.isnan(value) or np.isnan(value); for a whole Series, use .isnull() or .notnull().
import pandas as pd
b = pd.Series([9,None,7,6],index =['a','b','c','d'] )
print(b)
'''
a 9.0
b NaN
c 7.0
d 6.0
dtype: float64
'''
notNullIndex = b[b.notnull()].index
print(notNullIndex)  # labels of the non-null entries
'''
Index(['a', 'c', 'd'], dtype='object')
'''
firstNotNull = b.index.get_loc(notNullIndex[0])
print(firstNotNull)  # position (row number) of the first non-null value
'''
0
'''
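Once the missing values are located, the usual next step is to drop or fill them. The sketch below uses the standard dropna(), fillna(), and first_valid_index() methods. Filling with 0 is an assumption made for illustration; the right fill value depends on the data.

```python
import pandas as pd

b = pd.Series([9, None, 7, 6], index=['a', 'b', 'c', 'd'])

missing_count = b.isnull().sum()      # number of missing entries
first_label = b.first_valid_index()   # label of the first non-null entry
dropped = b.dropna()                  # new Series without the missing entries
filled = b.fillna(0)                  # new Series with missing entries set to 0

print(missing_count)  # 1
print(first_label)    # a
print(dropped)
print(filled)
```

Both dropna() and fillna() return new objects here, so b itself is unchanged.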
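Part 2: The DataFrame Type
A DataFrame is a two-dimensional labeled array (an index plus multiple columns of data).
- A DataFrame is a tabular data type; each column can hold a different value type.
- A DataFrame has both a row index (index) and a column index (columns).
- A DataFrame usually represents two-dimensional data, but it can also represent higher-dimensional data (for example, through a hierarchical index).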
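As a quick illustration of the points above, the sketch below builds a small DataFrame whose columns have different types and inspects its two indexes. The column names and values are made up for the example.

```python
import pandas as pd

d = pd.DataFrame({'name': ['x', 'y'],
                  'score': [1.5, 2.0],
                  'qty': [3, 4]})

print(d.dtypes)    # one dtype per column
print(d.index)     # row index, generated automatically here
print(d.columns)   # column index
print(d.shape)     # (2, 3)
```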
2. Creating a DataFrame
- From a two-dimensional ndarray
- From a dict of one-dimensional ndarrays, lists, dicts, tuples, or Series
- From a Series
- From another DataFrame
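The last two sources in the list, a single Series and another DataFrame, can be sketched as follows. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([9, 8, 7], index=['a', 'b', 'c'], name='two')

# From a Series: the result has one column, named after the Series.
from_series = pd.DataFrame(s)
also_from_series = s.to_frame()

# From another DataFrame: the constructor accepts an existing DataFrame;
# .copy() makes the new object independent of the original.
from_frame = pd.DataFrame(from_series).copy()

print(from_series)
print(from_frame.equals(from_series))  # True
```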
2.1 Creating from a two-dimensional ndarray
The result is the original data plus automatically generated row and column indexes.
import pandas as pd
import numpy as np
d = pd.DataFrame(np.arange(10).reshape(2,5))
print(d)
'''
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
'''
2.2 Creating from a dict of Series
import pandas as pd
import numpy as np
s = {'one':pd.Series([1,2,3],index = ['a','d','b']),
'two':pd.Series([9,8,7,5],index = ['a','d','b','c'])}
print(type(s))#<class 'dict'>
d = pd.DataFrame(s)
print(d)
'''
one two
a 1.0 9
b 3.0 7
c NaN 5
d 2.0 8
'''
e = pd.DataFrame(s,index = ['b','c','d'],columns=['two','three'])  # index and columns select rows and columns; 'three' is not in the dict, so it is filled with NaN
print(e)
'''
two three
b 7 NaN
c 5 NaN
d 8 NaN
'''
2.3 Creating from a dict of lists
import pandas as pd
import numpy as np
d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(d1, index=['a','b','c','d'])
print(d)
'''
one two
a 1 9
b 2 8
c 3 7
d 4 6
'''
print(d['two'])  # select a column by its label
'''
a 9
b 8
c 7
d 6
'''
print(d.loc['b'])  # select a row by its label (.ix was removed in pandas 1.0)
'''
one 2
two 8
Name: b, dtype: int64
'''
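Row and column selection can also be combined. The sketch below shows the standard .loc and .iloc accessors on the same DataFrame. The variable names are illustrative.

```python
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]},
                 index=['a', 'b', 'c', 'd'])

row = d.loc['b']                   # row by label, returned as a Series
same_row = d.iloc[1]               # the same row by position
cell = d.loc['b', 'two']           # a single value: row label, column label
sub = d.loc[['a', 'c'], ['two']]   # lists of labels return a smaller DataFrame

print(row.tolist())  # [2, 8]
print(cell)          # 8
print(sub)
```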
2.4 Visualizing a DataFrame
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
d1 = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(d1, index=['a','b','c','d'])
'''
one two
a 1 9
b 2 8
c 3 7
d 4 6
'''
# Font settings go before plotting; SimHei is only needed when titles or labels contain Chinese text
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
# Line plot
#font1 = {'family' : 'SimHei','weight' : 'normal','size' : 6 }  # font settings
xtick = list(d.index)  # x-axis tick labels
#plt.xticks(range(len(xtick)), xtick,rotation=60)  # format the x-axis ticks; rotation is the tilt angle
colors = ["red","blue"]
d.plot(kind='line', color=colors)  # d.plot() creates its own figure; legend labels come from the column names
plt.title("Line plot")
plt.show()
# Scatter plot
plt.figure()
plt.scatter(d.index, d["one"].values, label="inserted data", color="green")
plt.title("Scatter plot")
plt.tick_params(labelsize=6)
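An alternative to switching between plt.figure() calls is to create the axes first and pass them to each plot. The sketch below draws both charts side by side. It assumes the non-interactive Agg backend so that it runs without a display; the layout and variable names are illustrative choices, not part of the original notes.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]},
                 index=['a', 'b', 'c', 'd'])

# One figure with two axes: line plot on the left, scatter plot on the right.
fig, (ax_line, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Passing ax= makes pandas draw on the given axes instead of a new figure.
d.plot(kind='line', ax=ax_line, color=['red', 'blue'])
ax_line.set_title('Line plot')

# Plot against positions and then label the ticks with the row index.
positions = range(len(d))
ax_scatter.scatter(positions, d['one'].values, color='green', label='one')
ax_scatter.set_xticks(list(positions))
ax_scatter.set_xticklabels(d.index)
ax_scatter.set_title('Scatter plot')
ax_scatter.legend()

fig.tight_layout()
print(len(ax_line.lines), 'lines drawn')  # 2 lines drawn
```

To view the result, call fig.savefig() with a file name of your choice, or remove the matplotlib.use('Agg') line and call plt.show().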