Pandas
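pandas can be thought of as a dictionary-like NumPy, because you can give names to its rows and its columns.
Two main data types: Series and DataFrame
Series: a one-dimensional data type.
Composed of: labels + data
Can be viewed as a NumPy array whose elements carry labels
Labels: numbers or strings
DataFrame: a two-dimensional, tabular data structure.
Contains an ordered set of columns (the column labels behave much like the row index)
Can roughly be viewed as a collection of Series that share the same index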
import pandas as pd
import numpy as np
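As a rough illustration of the description above, the sketch below builds one small Series and one small DataFrame and checks that a column taken out of the DataFrame is a Series sharing the frame's index. The names `temps` and `table` and all values are invented for this example.

```python
import pandas as pd

# A Series: one-dimensional data plus a label for every element.
temps = pd.Series([21.5, 19.0, 23.2], index=['mon', 'tue', 'wed'])

# A DataFrame: several columns that all share one row index.
table = pd.DataFrame({'low': [15.0, 14.2, 16.8],
                      'high': [21.5, 19.0, 23.2]},
                     index=['mon', 'tue', 'wed'])

# A column pulled out of the DataFrame is itself a Series...
high = table['high']
print(type(high).__name__)              # Series
# ...and it carries the same index as the DataFrame it came from.
print(high.index.equals(table.index))   # True
# Labels work like dictionary keys.
print(temps['tue'])                     # 19.0
```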
Series
s=pd.Series([5,3,6,np.nan,'December'])
s
0 5
1 3
2 6
3 NaN
4 December
dtype: object
s1=pd.Series([5,3,6,np.nan,'December'], index=['a','b','c','d','e'])
s1
a 5
b 3
c 6
d NaN
e December
dtype: object
s1['e']
'December'
s1.index
Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')
s1.values
array([5, 3, 6, nan, 'December'], dtype=object)
s1*2
a 10
b 6
c 12
d NaN
e DecemberDecember
dtype: object
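The `DecemberDecember` result comes from the object dtype: with mixed element types, `*` is applied to each Python object by its own rules. A minimal sketch of the contrast with purely numeric data, using the made-up names `nums` and `mixed`:

```python
import numpy as np
import pandas as pd

# All-numeric data with a missing value is stored as float64.
nums = pd.Series([5, 3, 6, np.nan])
print(nums.dtype)            # float64
doubled = nums * 2           # vectorised arithmetic; NaN stays NaN
print(doubled.tolist())      # [10.0, 6.0, 12.0, nan]

# Mixing in a string forces the generic object dtype, so * acts on
# each element as plain Python would: int doubles, str repeats.
mixed = pd.Series([5, 'December'])
print(mixed.dtype)           # object
mixed_doubled = mixed * 2
print(mixed_doubled.tolist())   # [10, 'DecemberDecember']
```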
data={'Beijing':8000, "xi'an":5000, 'Hangzhou':8000, "xi'an":6000}
data_index=["xi'an",'Shenzhen','Hangzhou']
s2=pd.Series(data, data_index)
s2
xi'an 6000.0
Shenzhen NaN
Hangzhou 8000.0
dtype: float64
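Two things happen in the example above: the dict literal repeats the key "xi'an", and the index asks for a label the dict does not contain. The sketch below separates the two effects; `salaries` and `wanted` are illustrative names standing in for `data` and `data_index`.

```python
import pandas as pd

# A repeated key in a dict literal keeps only the last value,
# so "xi'an" maps to 6000 before pandas ever sees the data.
salaries = {'Beijing': 8000, "xi'an": 5000, 'Hangzhou': 8000, "xi'an": 6000}
print(len(salaries))               # 3

# The index argument selects and orders entries by label.
wanted = ["xi'an", 'Shenzhen', 'Hangzhou']
s = pd.Series(salaries, index=wanted)

print(s["xi'an"])                  # 6000.0
print(pd.isnull(s['Shenzhen']))    # True: the dict has no such key
print('Beijing' in s.index)        # False: not requested, so left out
print(s.dtype)                     # float64, because NaN is a float
```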
NaN: "not a number", the marker pandas uses for a missing value
pd.isnull(s2)
xi'an False
Shenzhen True
Hangzhou False
dtype: bool
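Once missing values are detected, they usually have to be dropped or replaced. The following sketch shows a few standard pandas calls that go beyond what these notes cover (`notnull`, `dropna`, `fillna`); the fill value 0 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([6000.0, np.nan, 8000.0],
              index=["xi'an", 'Shenzhen', 'Hangzhou'])

print(s.notnull().tolist())    # [True, False, True]
print(int(s.isnull().sum()))   # 1 missing value

cleaned = s.dropna()           # new Series without the NaN entry
print(list(cleaned.index))     # ["xi'an", 'Hangzhou']

filled = s.fillna(0)           # new Series with NaN replaced by 0
print(filled['Shenzhen'])      # 0.0

# Reductions such as mean() skip NaN by default.
print(s.mean())                # 7000.0
```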
s.head(4)
0 5
1 3
2 6
3 NaN
dtype: object
s.tail(3)
2 6
3 NaN
4 December
dtype: object
DataFrame
A DataFrame object can be created from lists, tuples, dicts, ndarrays, Series, files, and so on.
Note: by default, row numbers and column numbers both start at 0.
Creating from a dict
data={'city':['beijing','shenzhen','hangzhou'],'rank':list(range(1,4)), 'pay':[9000,8000,7500]}
df0= pd.DataFrame(data)
df0
|   | city | pay | rank |
|---|---|---|---|
| 0 | beijing | 9000 | 1 |
| 1 | shenzhen | 8000 | 2 |
| 2 | hangzhou | 7500 | 3 |
df0.index=range(1,4)
df0
|   | city | pay | rank |
|---|---|---|---|
| 1 | beijing | 9000 | 1 |
| 2 | shenzhen | 8000 | 2 |
| 3 | hangzhou | 7500 | 3 |
df0.columns
Index([u'city', u'pay', u'rank'], dtype='object')
df0.values
array([['beijing', 9000L, 1L],
['shenzhen', 8000L, 2L],
['hangzhou', 7500L, 3L]], dtype=object)
df0['city']
1 beijing
2 shenzhen
3 hangzhou
Name: city, dtype: object
df0.city
1 beijing
2 shenzhen
3 hangzhou
Name: city, dtype: object
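Attribute access (`df0.city`) is a convenience that only works when the column name does not collide with something a DataFrame already defines. In this very table, `rank` is also the name of a DataFrame method, as the sketch below shows on a fresh copy of the data (named `df` here to keep it separate from `df0`).

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'shenzhen', 'hangzhou'],
                   'rank': [1, 2, 3],
                   'pay': [9000, 8000, 7500]})

# 'pay' clashes with nothing, so both spellings give the same column.
print(df.pay.equals(df['pay']))    # True

# 'rank' is a DataFrame method, so attribute access returns the
# method rather than the column; brackets always return the column.
print(callable(df.rank))           # True
print(df['rank'].tolist())         # [1, 2, 3]
```

Bracket access is therefore the safer default, and it is the only form that can create a new column.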
df0['occupation']='IT'
df0
|   | city | pay | rank | occupation |
|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | IT |
| 2 | shenzhen | 8000 | 2 | IT |
| 3 | hangzhou | 7500 | 3 | IT |
df0['password']='admin'
df0
|   | city | pay | rank | occupation | password |
|---|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | IT | admin |
| 2 | shenzhen | 8000 | 2 | IT | admin |
| 3 | hangzhou | 7500 | 3 | IT | admin |
df0['occupation']=['python','c++','java']
df0
|   | city | pay | rank | occupation | password |
|---|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | python | admin |
| 2 | shenzhen | 8000 | 2 | c++ | admin |
| 3 | hangzhou | 7500 | 3 | java | admin |
del df0['password']
df0
|   | city | pay | rank | occupation |
|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | python |
| 2 | shenzhen | 8000 | 2 | c++ |
| 3 | hangzhou | 7500 | 3 | java |
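The assignments and `del` above all modify `df0` in place. As a hedged alternative, the sketch below uses `assign` and `drop`, which return new DataFrames and leave the original untouched, and also shows that a list of the wrong length is rejected. The variable names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'shenzhen', 'hangzhou'],
                   'pay': [9000, 8000, 7500]},
                  index=range(1, 4))

# assign() returns a new DataFrame with the extra column.
with_job = df.assign(occupation=['python', 'c++', 'java'])
# drop(columns=...) returns a new DataFrame without the named column.
without_pay = with_job.drop(columns=['pay'])

print(list(df.columns))            # ['city', 'pay']  (unchanged)
print(list(with_job.columns))      # ['city', 'pay', 'occupation']
print(list(without_pay.columns))   # ['city', 'occupation']

# A list must supply exactly one value per row.
try:
    df['occupation'] = ['python', 'c++']   # two values for three rows
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)           # True
```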
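df0[1]  # raises KeyError: [] looks up a column label and no column is named 1; use df0.loc[1] or df0.iloc[0] to get the first row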
df0.iloc[2,]
city hangzhou
pay 7500
rank 3
occupation java
Name: 3, dtype: object
df0.iloc[2]
city hangzhou
pay 7500
rank 3
occupation java
Name: 3, dtype: object
df0.iloc[1, 0]
'shenzhen'
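Because `df0` was re-indexed to start at 1, position and label no longer coincide, which is easy to trip over. A small sketch contrasting `iloc` (position) with `loc` (label), on a reduced copy of the table named `df` for this example:

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'shenzhen', 'hangzhou'],
                   'pay': [9000, 8000, 7500]},
                  index=range(1, 4))

# iloc counts positions from 0, whatever the index labels are.
print(df.iloc[0, 0])           # beijing
# loc looks up labels, and here the row labels start at 1.
print(df.loc[1, 'city'])       # beijing

print(df.iloc[1, 0])           # shenzhen
print(df.loc[2, 'city'])       # shenzhen

# Slices differ too: iloc excludes the end position,
# while loc includes the end label.
print(len(df.iloc[0:2]))       # 2 rows (positions 0 and 1)
print(len(df.loc[1:3]))        # 3 rows (labels 1, 2 and 3)
```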
df0.iloc[:2]
|   | city | pay | rank | occupation |
|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | python |
| 2 | shenzhen | 8000 | 2 | c++ |
df0.head(2)
|   | city | pay | rank | occupation |
|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | python |
| 2 | shenzhen | 8000 | 2 | c++ |
df0.iloc[:2, -2:]
|   | rank | occupation |
|---|---|---|
| 1 | 1 | python |
| 2 | 2 | c++ |
df0.T
|   | 1 | 2 | 3 |
|---|---|---|---|
| city | beijing | shenzhen | hangzhou |
| pay | 9000 | 8000 | 7500 |
| rank | 1 | 2 | 3 |
| occupation | python | c++ | java |
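Transposing swaps the row index and the column labels. The sketch below checks that on a two-column copy of the table (named `df`, an assumption of this example) and shows that each transposed column then mixes strings and numbers.

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'shenzhen', 'hangzhou'],
                   'pay': [9000, 8000, 7500]},
                  index=range(1, 4))

t = df.T
print(t.shape)             # (2, 3)
print(list(t.index))       # ['city', 'pay']  (the old column labels)
print(list(t.columns))     # [1, 2, 3]        (the old row labels)
print(t.loc['pay', 2])     # 8000

# Each column of the transposed frame holds one original row,
# so it mixes a string with a number.
print(t[1].tolist())       # ['beijing', 9000]
```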
DataFrame statistics
df0
|   | city | pay | rank | occupation |
|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | python |
| 2 | shenzhen | 8000 | 2 | c++ |
| 3 | hangzhou | 7500 | 3 | java |
df0['pay'].min()
7500
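`min()` is one of several reductions available on a column. A minimal sketch of a few others, plus `idxmax()` to find which row holds the extreme value; the data repeats the table above under the name `df`.

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'shenzhen', 'hangzhou'],
                   'pay': [9000, 8000, 7500]},
                  index=range(1, 4))
pay = df['pay']

print(pay.min())       # 7500
print(pay.max())       # 9000
print(pay.sum())       # 24500
print(pay.median())    # 8000.0

# idxmax() returns the index label of the largest value,
# which can then be used with loc to read other columns.
best = pay.idxmax()
print(best)                    # 1
print(df.loc[best, 'city'])    # beijing
```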
df0.pay>=8000
1 True
2 True
3 False
Name: pay, dtype: bool
df0[df0.pay>=8000]
|   | city | pay | rank | occupation |
|---|---|---|---|---|
| 1 | beijing | 9000 | 1 | python |
| 2 | shenzhen | 8000 | 2 | c++ |
df0[df0.pay>=8000].city
1 beijing
2 shenzhen
Name: city, dtype: object
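Boolean masks like `df0.pay>=8000` can be combined. The sketch below uses `&` and `|` (each condition needs its own parentheses) and the `loc[mask, column]` form, which filters rows and picks a column in one step. The thresholds are arbitrary and `df` is a copy of the table made for this example.

```python
import pandas as pd

df = pd.DataFrame({'city': ['beijing', 'shenzhen', 'hangzhou'],
                   'pay': [9000, 8000, 7500],
                   'rank': [1, 2, 3]},
                  index=range(1, 4))

# AND: pay at least 8000 and rank below the top spot.
both = df[(df['pay'] >= 8000) & (df['rank'] > 1)]
print(both['city'].tolist())      # ['shenzhen']

# OR: pay under 8000 or ranked first.
either = df[(df['pay'] < 8000) | (df['rank'] == 1)]
print(either['city'].tolist())    # ['beijing', 'hangzhou']

# Row filter and column selection in a single loc call.
cities = df.loc[df['pay'] >= 8000, 'city']
print(cities.tolist())            # ['beijing', 'shenzhen']
```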
Another example of creating a DataFrame
df3=pd.DataFrame({'date':pd.Timestamp('20180909'),
                  'city':['beijing',"xi'an",'shanghai','shenzhen','hangzhou','chengdu'],
                  'number':pd.Series(100, list(range(1,7))),
                  'out':'False'})
df3
|   | city | date | number | out |
|---|---|---|---|---|
| 1 | beijing | 2018-09-09 | 100 | False |
| 2 | xi'an | 2018-09-09 | 100 | False |
| 3 | shanghai | 2018-09-09 | 100 | False |
| 4 | shenzhen | 2018-09-09 | 100 | False |
| 5 | hangzhou | 2018-09-09 | 100 | False |
| 6 | chengdu | 2018-09-09 | 100 | False |
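In `df3` the scalars are broadcast to every row and the row index comes from the Series. One detail the printed table hides is that `'False'` in quotes is a string, not a boolean. A shortened three-row sketch (named `df`, an assumption of this example) makes both points checkable:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.Timestamp('20180909'),
                   'city': ['beijing', "xi'an", 'shanghai'],
                   'number': pd.Series(100, index=range(1, 4)),
                   'out': 'False'})

# The row labels come from the Series passed for 'number'.
print(list(df.index))               # [1, 2, 3]
# The scalar 100 was repeated for every row.
print(df['number'].tolist())        # [100, 100, 100]

# The quoted 'False' is text, and any non-empty string is truthy.
first_out = df['out'].iloc[0]
print(isinstance(first_out, str))   # True
print(bool(first_out))              # True

# Assigning the Python literal False gives a real boolean column.
df['out'] = False
print(df['out'].dtype)              # bool
```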
Creating from a NumPy array
df1=pd.DataFrame(np.random.randn(6,4))
df1
|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -0.600514 | 0.157380 | -1.855644 | -0.327762 |
| 1 | -1.255345 | 0.334783 | -1.703631 | -1.702041 |
| 2 | 1.465083 | 0.594967 | -2.239554 | 0.342270 |
| 3 | -1.018990 | -1.922670 | -0.077321 | -1.323573 |
| 4 | 0.109366 | -0.361239 | -0.030625 | -0.493886 |
| 5 | 1.543987 | 0.668537 | -0.869816 | 1.172763 |
df1.values
df2=pd.DataFrame(np.random.randn(6,4),index=range(1,7), columns=['a','b','c','d'])
df2
|   | a | b | c | d |
|---|---|---|---|---|
| 1 | 0.370422 | 0.761769 | -0.471956 | 0.760529 |
| 2 | 0.561101 | -0.977747 | 0.143900 | 0.201052 |
| 3 | -0.415303 | -0.143032 | 0.169098 | 0.320532 |
| 4 | 0.714336 | 0.687138 | -2.041781 | -1.778007 |
| 5 | 0.516654 | -0.294058 | -0.408824 | 0.196900 |
| 6 | -0.499240 | 0.217150 | 0.680787 | 2.619825 |
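The random tables above will differ on every run. If reproducible numbers are wanted, one option is a seeded NumPy generator, sketched below. The seed 0 and the names `rng`, `values`, `labelled` and `default` are arbitrary choices for this example, and `np.random.default_rng` is a substitute for the `np.random.randn` call used in these notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # seed chosen arbitrarily
values = rng.standard_normal((6, 4))    # 6 rows, 4 columns

# With explicit labels for rows and columns.
labelled = pd.DataFrame(values, index=range(1, 7),
                        columns=['a', 'b', 'c', 'd'])
print(labelled.shape)                            # (6, 4)
print(np.array_equal(labelled.values, values))   # True

# Without labels, rows and columns are both numbered from 0.
default = pd.DataFrame(values)
print(list(default.index))      # [0, 1, 2, 3, 4, 5]
print(list(default.columns))    # [0, 1, 2, 3]
```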