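Introduction to Pandas:
A powerful toolkit for analyzing structured data.
It is built on NumPy, which provides high-performance array and matrix operations.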
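The pd.Series data structure:
1. Building a Series:
From an array/list
From a dict
2. Getting the index:
ser_obj.index
3. Getting the data:
1> Get all the data: ser_obj.values
2> Get values by label: ser_obj['idx_name'] or ser_obj.loc['idx_name'], assuming 'idx_name' is one of the labels in ser_obj's index.
3> Get values by position: ser_obj.iloc[idx], where idx is an integer.
4. Name of the Series:
ser_obj.name
5. Name of the Series index:
ser_obj.index.name
6. Missing data is handled automatically according to the data type:
e.g. object -> None, float -> NaN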
Hands-on practice:
Lesson 5: Basics of Pandas, a Data Analysis Tool
Section 1: Data Structures -- Series
In [1]:
import pandas as pd
import numpy as np
Building a Series
In [2]:
# From a list-like object (here a range)
pd.Series(range(10,20))
Out[2]:
0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64
In [3]:
# From an ndarray
pd.Series(np.random.rand(5))
Out[3]:
0    0.358297
1    0.561983
2    0.315839
3    0.411356
4    0.722184
dtype: float64
In [4]:
# From a dict
d = {'a': 0., 'b': 1., 'c': 2.}
pd.Series(d)
Out[4]:
a    0.0
b    1.0
c    2.0
dtype: float64
In [5]:
# Specify the index at construction time
pd.Series(np.random.rand(5), index=['a','b','c','d','e'])
Out[5]:
a    0.150804
b    0.384518
c    0.173143
d    0.247358
e    0.702878
dtype: float64
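When a dict is combined with an explicit index, pandas aligns the data by key rather than by position. The following is a minimal sketch that reuses the dict from the cell above; the label 'z' and the name `aligned` are made up here purely to show what happens to a label that is missing from the dict.

```python
import numpy as np
import pandas as pd

d = {'a': 0., 'b': 1., 'c': 2.}

# Labels are looked up in the dict; the result follows the order of `index`.
aligned = pd.Series(d, index=['c', 'a', 'z'])  # 'z' is hypothetical, not a key of d

print(aligned)
# 'c' and 'a' keep their values, 'z' has no entry in d and becomes NaN,
# and 'b' is dropped because it is not listed in the index.
```

This differs from passing a list or ndarray, where the index length has to match the data length.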
Previewing data
In [6]:
ser_obj = pd.Series(np.random.rand(100))
In [13]:
ser_obj.head(10)  # preview the first 10 rows; the number can be changed
Out[13]:
0    0.938141
1    0.944047
2    0.379465
3    0.516051
4    0.946966
5    0.374619
6    0.467724
7    0.758027
8    0.708764
9    0.877710
dtype: float64
In [12]:
ser_obj.tail(10)  # preview the last 10 rows
Out[12]:
90    0.528964
91    0.100904
92    0.801978
93    0.642610
94    0.923427
95    0.786526
96    0.637069
97    0.433589
98    0.651577
99    0.701476
dtype: float64
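Both preview methods fall back to five rows when called without an argument. A small sketch on a deterministic Series makes this easy to check; the names `demo`, `first_five` and `last_three` are illustrative and do not come from the notebook.

```python
import pandas as pd

demo = pd.Series(range(100))

first_five = demo.head()   # n defaults to 5
last_three = demo.tail(3)  # an explicit n works the same way as in head

print(first_five.tolist())        # [0, 1, 2, 3, 4]
print(last_three.index.tolist())  # [97, 98, 99]
```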
Getting the data and the index
In [8]:
ser_obj.index
Out[8]:
RangeIndex(start=0, stop=100, step=1)
In [14]:
ser_obj.values
Out[14]:
array([0.93814055, 0.94404708, 0.37946451, 0.51605088, 0.94696595, 0.37461863, 0.46772419, 0.75802715, 0.70876413, 0.87770967, 0.17047904, 0.00814934, 0.81984407, 0.250021 , 0.05607658, 0.15459183, 0.34117638, 0.74338738, 0.35825494, 0.56078568, 0.48168834, 0.15804645, 0.9112692 , 0.01979086, 0.0315959 , 0.18303411, 0.67122412, 0.22948137, 0.04989893, 0.64791478, 0.21144846, 0.92516375, 0.13590916, 0.54756753, 0.41681491, 0.16024143, 0.68094794, 0.39225309, 0.45668911, 0.2836108 , 0.62985304, 0.80784036, 0.75853033, 0.84305789, 0.02696868, 0.64354537, 0.97262702, 0.22533263, 0.3657996 , 0.52645998, 0.09503175, 0.56632399, 0.82656123, 0.0318454 , 0.6063451 , 0.89650837, 0.22100889, 0.05134452, 0.73132782, 0.48020783, 0.77226105, 0.42828439, 0.11965448, 0.29040443, 0.44522296, 0.48555264, 0.40059591, 0.08265515, 0.05205472, 0.0034085 , 0.69878852, 0.02179686, 0.82884667, 0.01019204, 0.07085076, 0.7862259 , 0.79653953, 0.82790031, 0.05072612, 0.95742924, 0.15304039, 0.58295421, 0.49490892, 0.2872513 , 0.93807329, 0.87487986, 0.63609513, 0.68760913, 0.9811227 , 0.01622096, 0.52896371, 0.10090407, 0.80197783, 0.64261047, 0.92342749, 0.78652616, 0.63706881, 0.43358866, 0.65157668, 0.70147566])
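The pandas documentation recommends `Series.to_numpy()` over `.values` when a NumPy array is wanted. For a plain float Series like the one above, the two should give the same array. A short sketch, with a small hand-written Series `s` standing in for `ser_obj`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=['x', 'y', 'z'])

arr = s.to_numpy()         # data as a NumPy array
labels = s.index.tolist()  # index labels as a plain Python list

print(type(arr).__name__, arr.dtype)  # ndarray float64
print(labels)                         # ['x', 'y', 'z']
```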
The name attribute
In [16]:
ser_obj = pd.Series(np.random.rand(100), name='rand_num')
In [17]:
ser_obj.name
Out[17]:
'rand_num'
In [18]:
ser_obj.index.name = 'index'
In [19]:
ser_obj.head()
Out[19]:
index
0    0.920294
1    0.766381
2    0.373396
3    0.456288
4    0.311206
Name: rand_num, dtype: float64
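One place where the two names matter is conversion to a DataFrame: the Series name becomes the column label and the index name is carried over. The sketch below assumes a small hand-written Series; `frame`, `renamed` and the name 'score' are illustrative only.

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3], name='rand_num')
s.index.name = 'index'

frame = s.to_frame()         # the Series name becomes the column label
renamed = s.rename('score')  # returns a new Series; s itself is unchanged

print(frame.columns.tolist())  # ['rand_num']
print(frame.index.name)        # index
print(s.name, renamed.name)    # rand_num score
```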
Accessing data by index
In [20]:
# Access data by index label (string)
ser_obj2 = pd.Series(np.random.rand(5),index=['a','b','c','d','e'])
In [21]:
ser_obj2
Out[21]:
a    0.032560
b    0.450495
c    0.336468
d    0.888428
e    0.857041
dtype: float64
In [22]:
ser_obj2['b']
Out[22]:
0.4504947860927033
In [23]:
ser_obj2.loc['b']
Out[23]:
0.4504947860927033
In [26]:
# Use `in` to check whether a label exists in the index (it does not search the values)
'f' in ser_obj2
Out[26]:
False
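Because `in` on a Series tests the index labels, a value that is present in the data can still give False. A minimal sketch of the difference, using a small hand-written Series `s` rather than the random data above:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print('a' in s)            # True: 'a' is an index label
print(10 in s)             # False: 10 is a value, not a label
print(10 in s.to_numpy())  # True: membership test on the values
print(s.isin([10]).any())  # True: the same check via Series.isin
```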
In [27]:
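# Access data by integer position
# Note: on a string-labelled index, ser_obj2[0] falls back to positional access; recent pandas versions deprecate this fallback, so .iloc is the safer choice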
ser_obj2[0]
Out[27]:
0.0325603141122478
In [28]:
ser_obj2.iloc[0]
Out[28]:
0.0325603141122478
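The distinction between labels and positions becomes important once the index itself holds integers. The sketch below assumes a hand-written Series `s` whose labels 10, 20, 30 deliberately differ from the positions 0, 1, 2.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=[10, 20, 30])

print(s.loc[10])   # 1.0, looked up by label 10
print(s.iloc[0])   # 1.0, looked up by position 0
print(s[10])       # 1.0, plain [] uses labels on an integer index

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
print(s.loc[10:20].tolist())  # [1.0, 2.0]
print(s.iloc[0:1].tolist())   # [1.0]
```

Using .loc and .iloc explicitly avoids having to remember which rule plain [] applies.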
Handling missing data
In [29]:
# Missing value in string data
countries = ['中国','美国','澳大利亚',None]
pd.Series(countries)
Out[29]:
0      中国
1      美国
2    澳大利亚
3    None
dtype: object
In [31]:
# Missing value in numeric data
numbers = [4, 5, 6, None]
pd.Series(numbers)
Out[31]:
0    4.0
1    5.0
2    6.0
3    NaN
dtype: float64
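Whichever placeholder pandas chooses, the missing entries can be detected and handled in the same way. The following sketch goes one step beyond the notebook and uses `isna`, `dropna` and `fillna`; the English country names and the fill value 0 are illustrative choices, not taken from the original cells.

```python
import pandas as pd

countries = pd.Series(['China', 'USA', None])
numbers = pd.Series([4, 5, None])

print(countries.isna().tolist())   # [False, False, True]
print(numbers.isna().tolist())     # [False, False, True]
print(numbers.dtype)               # float64: the integers were upcast to hold NaN
print(numbers.dropna().tolist())   # [4.0, 5.0]
print(numbers.fillna(0).tolist())  # [4.0, 5.0, 0.0]
```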