Pandas Library Basics: Creating and Accessing Data with Series and DataFrame

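Introduction

Pandas is one of the best-known data analysis packages for Python. It is built on NumPy and adds higher-level data structures and tools. Pandas is organized around two core data structures, Series and DataFrame. This article covers the basic ways to create and access them.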

Series

A Series is a one-dimensional, array-like object made up of a set of data (a one-dimensional ndarray) and an associated set of data labels (the index).

Note: NumPy (Numerical Python) provides Python's support for multidimensional arrays through ndarray, which supports vectorized operations and is fast and memory-efficient.

""" One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
"""
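The index alignment described in the docstring can be illustrated with a minimal sketch. The two Series and their labels below are made up for illustration and do not come from the article.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative values).
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['b', 'd'])

# Arithmetic aligns on labels, not positions. A label found in only one
# operand has no partner value, so its result is NaN.
total = s1 + s2
print(total)

# Statistical methods skip NaN by default, so only 'b' contributes here.
print(total.sum())
```

Only the label 'b' exists in both operands, so it is the only entry with a numeric result.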

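(2) The basic way to create a Series is shown below. The data can be an array-like object (list, ndarray), a dict, or a scalar value. The examples assume import numpy as np and import pandas as pd.

s = pd.Series(data, index=index)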

s = pd.Series([-1.55666192,-0.75414753,0.47251231,-1.37775038,-1.64899442], index=['a', 'b', 'c', 'd', 'e']).astype('int8')

a -1

b 0

c 0

d -1

e -1

dtype: int8

s = pd.Series(['a',-0.75414753,123,66666,-1.64899442], index=['a', 'b', 'c', 'd', 'e'])

a a

b -0.754148

c 123

d 66666

e -1.64899

dtype: object

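Note: a Series supports the NumPy dtypes, including integers, floats, complex numbers, booleans, and strings. As when creating an ndarray, if no dtype is specified pandas tries to infer a suitable one: in the example that mixes numbers and a string, the inferred dtype is object. Converting the float data to int8 truncates the values toward zero. Recent pandas versions raise a ValueError when non-integral floats are passed to the constructor together with an integer dtype, so the conversion is better done with astype('int8') after construction.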
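A short sketch of dtype inference and explicit conversion follows. The variable names and values are illustrative assumptions, not taken from the article.

```python
import pandas as pd

floats = pd.Series([1.5, 2.5, 3.5])   # inferred as float64
ints = pd.Series([1, 2, 3])           # inferred as an integer dtype (typically int64)
mixed = pd.Series(['a', 1.5, 3])      # mixed strings and numbers fall back to object

# Explicit conversion after construction; float -> int truncates toward zero.
truncated = pd.Series([-1.6, -0.7, 0.4]).astype('int8')

print(floats.dtype, ints.dtype, mixed.dtype)
print(truncated.tolist())
```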

s = pd.Series(np.random.randn(5))

0 0.485468

1 -0.912130

2 0.771970

3 -1.058117

4 0.926649

dtype: float64

s.index

RangeIndex(start=0, stop=5, step=1)

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

a 0.485468

b -0.912130

c 0.771970

d -1.058117

e 0.926649

dtype: float64

Note: when no index is specified, the Series automatically gets an integer index (a RangeIndex starting at 0).

s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})

a 0.0

b 1.0

c 2.0

dtype: float64

s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])

b 1.0

c 2.0

d NaN

a 0.0

dtype: float64

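Note: a Series created from a Python dict can be thought of as a fixed-length ordered dict. If only a dict is passed, the dict's keys become the index. If an index is also passed, the values are looked up by label and placed in the corresponding positions; labels with no matching key get NaN.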
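Labels without a matching key end up as NaN, and a common next step is to detect or handle them. The sketch below shows one way to do that; the sentinel value -1.0 is an arbitrary choice for illustration.

```python
import pandas as pd

values = {'a': 0.0, 'b': 1.0, 'c': 2.0}
s = pd.Series(values, index=['b', 'c', 'd', 'a'])

missing = s.isna()        # True only for label 'd', which has no dict key
filled = s.fillna(-1.0)   # replace NaN with an (arbitrary) sentinel value
present = s.dropna()      # keep only the labels that have a value

print(missing.tolist())
print(filled['d'])
print(list(present.index))
```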

s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a 5.0

b 5.0

c 5.0

d 5.0

e 5.0

dtype: float64

Note: a scalar value is repeated to match the length of the index.

(3) Accessing the elements and index of a Series

s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.}, index=['b', 'c', 'd', 'a'])

b 1.0

c 2.0

d NaN

a 0.0

dtype: float64

s.values

[ 1. 2. nan 0.]

s.index

Index(['b', 'c', 'd', 'a'], dtype='object')

Note: the values and index attributes return the Series's underlying array and its index object.

s['a']

0.0

s[['a','b']]

a 0.0

b 1.0

dtype: float64

s[['a','b','c']]

a 0.0

b 1.0

c 2.0

dtype: float64

s[:2]

b 1.0

c 2.0

dtype: float64

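Note: single values or groups of values in a Series can be selected by index label; an integer slice such as s[:2] selects by position.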
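Because plain brackets accept both labels and integer slices, it can help to state the intent explicitly. The sketch below uses loc for labels and iloc for positions on the same Series as above; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0}, index=['b', 'c', 'd', 'a'])

by_label = s.loc['a']          # lookup by label
by_position = s.iloc[0]        # first element, whatever its label is
label_slice = s.loc['b':'d']   # label slices include both endpoints
position_slice = s.iloc[:2]    # positional slices exclude the stop
positive = s[s > 0]            # boolean mask; NaN compares as False

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index), list(positive.index))
```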

DataFrame

A DataFrame is a tabular (two-dimensional) data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a dict of Series that share the same index.

""" Two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns). Arithmetic
operations align on both row and column labels. Can be thought of as a
dict-like container for Series objects. The primary pandas data
structure

Parameters
----------
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
    Index to use for resulting frame. Will default to np.arange(n) if
    no indexing information part of input data and no index provided
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    np.arange(n) if no column labels are provided
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer
copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input
"""

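(2) The basic way to create a DataFrame is shown below. The data can be a dict of lists, one-dimensional ndarrays, or Series (lists and arrays must all have the same length), a two-dimensional ndarray, a dict of dicts, and so on.

df = pd.DataFrame(data, index=index)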

df = pd.DataFrame({'one': [1., 2., 3., 5], 'two': [1., 2., 3., 4.]})

one two

0 1.0 1.0

1 2.0 2.0

2 3.0 3.0

3 5.0 4.0

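Note: when created from a dict of lists, each list becomes a column of the DataFrame. A bare collection of lists without keys, such as pd.DataFrame({[1., 2., 3., 5], [1., 2., 3., 4.]}), does not work: the braces form a set literal, and because lists are unhashable Python raises a TypeError before pandas is even called.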
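The two failure modes around dict-of-lists construction can be checked with a small sketch. The variable names are illustrative, and the exact error messages may differ between versions, so only the exception types are relied on here.

```python
import pandas as pd

length_error = None
hash_error = None

# Lists of different lengths cannot form the columns of one DataFrame.
try:
    pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [1.0, 2.0]})
except ValueError as exc:
    length_error = str(exc)

# Braces without keys build a set, and a set cannot contain lists.
try:
    bad = {[1.0, 2.0], [3.0, 4.0]}
except TypeError as exc:
    hash_error = str(exc)

print(length_error)
print(hash_error)
```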

df = pd.DataFrame([[1., 2., 3., 5],[1., 2., 3., 4.]],index=['a', 'b'],columns=['one','two','three','four'])

one two three four

a 1.0 2.0 3.0 5.0

b 1.0 2.0 3.0 4.0

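Note: a nested list creates a table with 2 rows and 4 columns; the index and columns parameters set the row labels and the column names.

data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'U10')])

[(0, 0., '') (0, 0., '')]

Note: zeros(shape, dtype=float, order='C') returns an array of the given shape and type filled with zeros. Field C uses 'U10', a Unicode string of up to 10 characters. The older 'a10' spelling denotes a byte string, which displays as b'Hello' under Python 3, and the 'a' alias is deprecated in recent NumPy versions.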

data[:] = [(1,2.,'Hello'), (2,3.,"World")]

df = pd.DataFrame(data)

A B C

0 1 2.0 Hello

1 2 3.0 World

df = pd.DataFrame(data, index=['first', 'second'])

A B C

first 1 2.0 Hello

second 2 3.0 World

df = pd.DataFrame(data, columns=['C', 'A', 'B'])

C A B

0 Hello 1 2.0

1 World 2 3.0

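Note: as with Series, a DataFrame gets a default index when none is specified; when columns is given, the columns are arranged in the specified order.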

data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),

'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(data)

one two

a 1.0 1.0

b 2.0 2.0

c 3.0 3.0

d NaN 4.0

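Note: when created from a dict of Series, each Series becomes a column. If no index is specified explicitly, the indexes of the individual Series are merged (as a union) to form the row index, and NaN fills the missing entries.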

df = pd.DataFrame(data,index=['d', 'b', 'a'])

one two

d NaN 4.0

b 2.0 2.0

a 1.0 1.0

df = pd.DataFrame(data,index=['d', 'b', 'a'], columns=['two', 'three'])

two three

d 4.0 NaN

b 2.0 NaN

a 1.0 NaN

data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data2)

a b c

0 1 2 NaN

1 5 10 20.0

Note: when created from a list of dicts, each dict becomes a row of the DataFrame, and the union of the dict keys becomes the column labels.
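One side effect of the missing key is worth seeing in code: a column that contains NaN is stored as float. A minimal sketch using the same list of dicts (the name rows is illustrative):

```python
import pandas as pd

rows = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(rows)

# 'a' and 'b' are complete and stay integer; 'c' is missing in the first
# row, and because NaN is a float the whole column becomes float64.
print(df.dtypes)
print(df.shape)
```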

df = pd.DataFrame(data2, index=['first', 'second'])

a b c

first 1 2 NaN

second 5 10 20.0

df = pd.DataFrame(data2, columns=['a', 'b'])

a b

0 1 2

1 5 10

df = pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},

('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},

('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},

('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},

('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

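       a              b
       a    b    c    a     b
A B  4.0  1.0  5.0  8.0  10.0
  C  3.0  2.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0

Note: when created from a dict of dicts, the outer keys become the column index and each inner dict becomes a column; the inner keys are merged to form the row index. Tuple keys produce hierarchical (MultiIndex) labels on both axes. Depending on the pandas version, the columns may appear in key insertion order rather than sorted as shown here.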
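A reduced version of the tuple-keyed example shows how the resulting hierarchical labels can be inspected and used for selection. This is a sketch with only two of the original columns; the variable names are illustrative.

```python
import pandas as pd

nested = {
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
}
df = pd.DataFrame(nested)

# Tuple keys become MultiIndex labels on both axes.
is_multi_cols = isinstance(df.columns, pd.MultiIndex)
is_multi_rows = isinstance(df.index, pd.MultiIndex)

cell = df.loc[('A', 'B'), ('a', 'b')]   # full row key and full column key
sub = df['a']                           # every column under top-level 'a'

print(is_multi_cols, is_multi_rows, cell)
print(sub)
```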

(3) Accessing the elements and index of a DataFrame

data = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),

'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(data)

one two

a 1.0 1.0

b 2.0 2.0

c 3.0 3.0

d NaN 4.0

df['one'] or df.one

a 1.0

b 2.0

c 3.0

d NaN

Name: one, dtype: float64

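Note: a DataFrame column can be retrieved as a Series either with dict-style notation or as an attribute. The returned Series has the same index as the DataFrame, and its name attribute is set to the column name.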
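Attribute access is convenient but has limits that bracket access does not. The sketch below uses made-up column names, one that collides with a DataFrame method and one that is not a valid Python identifier.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'count': [3, 4], 'two words': [5, 6]})

col = df['one']            # bracket access always works
same = df.one              # attribute access works for simple names
method = df.count          # resolves to the DataFrame.count method, not the column
spaced = df['two words']   # names that are not identifiers need brackets

print(col.name, col.equals(same), callable(method))
print(df['count'].tolist(), spaced.tolist())
```

For this reason bracket access is the safer default when column names are not known in advance.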

df[0:1]

one two

a 1.0 1.0

Note: slicing with integers selects rows by position, so df[0:1] returns only the first row.
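Plain brackets behave differently depending on what is passed in: a slice selects rows, while a label or list of labels selects columns. A small sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

first_row = df[0:1]     # slice -> rows by position (stop excluded)
first_two = df[0:2]     # rows 'a' and 'b'
one_col = df[['one']]   # list of labels -> columns, result stays 2-D

print(list(first_row.index), first_two.shape, one_col.shape)
```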

df.loc['a']

one 1.0

two 1.0

Name: a, dtype: float64

df.loc[:,['one','two'] ]

one two

a 1.0 1.0

b 2.0 2.0

c 3.0 3.0

d NaN 4.0

df.loc[['a',],['one','two']]

one two

a 1.0 1.0

df.loc['a','one']

1.0

Note: loc selects data by label.
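A few more properties of label-based selection, sketched on a small illustrative frame: label slices include the end label, boolean Series can be used as row selectors, and an unknown label raises KeyError.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

scalar = df.loc['a', 'one']        # single row and column label -> scalar
row = df.loc['a']                  # single row label -> Series named 'a'
block = df.loc['a':'b', ['one']]   # label slice includes 'b'
mask = df.loc[df['two'] > 4.0]     # boolean Series selects matching rows

# A label that is not in the index raises KeyError.
try:
    df.loc['z']
    missing_label_ok = True
except KeyError:
    missing_label_ok = False

print(scalar, row.name, block.shape, list(mask.index), missing_label_ok)
```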

df.iloc[0:2,0:1]

one

a 1.0

b 2.0

df.iloc[0:2]

one two

a 1.0 1.0

b 2.0 2.0

df.iloc[[0,2],[0,1]]  # pick arbitrary row positions and column positions

one two

a 1.0 1.0

c 3.0 3.0

Note: iloc selects data by position.
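Positional selection follows Python's usual indexing rules. The sketch below, on an illustrative frame, shows the excluded stop of a slice, negative positions, and the IndexError raised for an out-of-range position.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

first_two = df.iloc[0:2]        # rows 0 and 1; the stop position is excluded
last_row = df.iloc[-1]          # negative positions count from the end
corner = df.iloc[0, 0]          # row 0, column 0 -> scalar
picked = df.iloc[[0, 2], [1]]   # rows 0 and 2, column 1, result stays 2-D

# A position outside the frame raises IndexError.
try:
    df.iloc[5]
    out_of_range_ok = True
except IndexError:
    out_of_range_ok = False

print(list(first_two.index), last_row.name, corner, picked.shape, out_of_range_ok)
```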

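df.loc['a']

one 1.0

two 1.0

Name: a, dtype: float64

df.loc['a',['one','two']]

one 1.0

two 1.0

Name: a, dtype: float64

df.loc['a',df.columns[[0,1]]]

one 1.0

two 1.0

Name: a, dtype: float64

df.loc[['a','b'],df.columns[[0,1]]]

one two

a 1.0 1.0

b 2.0 2.0

df.iloc[1,[0,1]]

one 2.0

two 2.0

Name: b, dtype: float64

df.iloc[[0,1],[0,1]]

one two

a 1.0 1.0

b 2.0 2.0

Note: older pandas versions provided a mixed label/position indexer, .ix, for selections like these; it was deprecated in pandas 0.20.0 and removed in 1.0.0. Use loc for labels and iloc for positions. To combine a row label with column positions, convert the positions to labels first with df.columns[[0,1]].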

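df.loc[df.one>1,df.columns[:1]]

one

b 2.0

c 3.0

Note: rows can also be selected with a condition. This selects the rows where column one is greater than 1, together with the first column.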

df['one']=16.8

one two

a 16.8 1.0

b 16.8 2.0

c 16.8 3.0

d 16.8 4.0

val = pd.Series([2,2,2],index=['b', 'c', 'd'])

df['one']=val

one two

a NaN 1.0

b 2.0 2.0

c 2.0 3.0

d 2.0 4.0

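Note: a column can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, its values are aligned exactly to the DataFrame's index, and any missing positions are filled with NaN.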
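The difference between assigning a Series and assigning a list can be sketched as follows. The column names from_series, from_list, and too_short are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0]}, index=['a', 'b', 'c', 'd'])

# A Series is aligned on labels: 'a' has no matching label and becomes NaN.
df['from_series'] = pd.Series([2, 2, 2], index=['b', 'c', 'd'])

# A list is assigned by position and must have exactly one value per row.
df['from_list'] = [10, 20, 30, 40]

error = None
try:
    df['too_short'] = [1, 2, 3]   # 3 values for 4 rows
except ValueError as exc:
    error = str(exc)

print(df)
print(error)
```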

df['four']=[3,3,3,3]

one two four

a NaN 1.0 3

b 2.0 2.0 3

c 2.0 3.0 3

d 2.0 4.0 3

Note: assigning to a column that does not exist creates a new column.

df.index.get_loc('a')

0

df.index.get_loc('b')

1

df.columns.get_loc('one')

0

Note: get_loc returns the integer position of a row label or column label.
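One use of get_loc is to mix labels and positions in a single selection: translate the label to a position, then select with iloc. A minimal sketch on an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                  index=['a', 'b', 'c'])

row = df.index.get_loc('b')       # position of row label 'b'
col = df.columns.get_loc('two')   # position of column label 'two'
value = df.iloc[row, col]         # same cell as df.loc['b', 'two']

# Row chosen by label, column chosen by position.
mixed = df.iloc[df.index.get_loc('c'), 0]

print(row, col, value, mixed)
```

This assumes unique labels; with duplicate labels get_loc can return a slice or a boolean mask instead of a single integer.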

