Data structures
- Series: a one-dimensional array, similar to a one-dimensional NumPy array. Both are also close to Python's built-in list; the difference is that a list may hold elements of different types, while an array or Series stores a single data type, which uses memory more efficiently and speeds up computation.
- Time-Series: a Series indexed by timestamps.
- DataFrame: a two-dimensional, tabular data structure, similar in many ways to R's data.frame. A DataFrame can be thought of as a container of Series. Most of what follows focuses on the DataFrame.
- Panel: a three-dimensional array that can be thought of as a container of DataFrames (note: Panel has since been removed from pandas; multi-indexed DataFrames are the usual replacement).
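As a quick illustration, here is a minimal sketch of constructing each structure (the variable names are arbitrary); it also performs the pandas and NumPy imports used throughout the rest of this article:
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1, 3, 5, np.nan])  # one-dimensional, single dtype (float64 here because of NaN)
>>> ts = pd.Series(np.arange(3), index=pd.date_range('20130101', periods=3))  # a Series indexed by timestamps
>>> frame = pd.DataFrame({'a': s, 'b': s * 2})  # a DataFrame as a container of Series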
Importing CSV and Excel files
>>> df = pd.read_csv('insurance.csv', header=0) # load the csv file; read_csv already returns a DataFrame
>>> df.head() # view the first 5 rows (head() defaults to 5)
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
>>> df.tail() # view the last 5 rows (tail() defaults to 5)
age sex bmi children smoker region charges
1333 50 male 30.97 3 no northwest 10600.5483
1334 18 female 31.92 0 no northeast 2205.9808
1335 18 female 36.85 0 no southeast 1629.8335
1336 21 female 25.80 0 no southwest 2007.9450
1337 61 female 29.07 0 yes northwest 29141.3603
>>> df = pd.read_excel('insurance.xlsx') # load the xlsx file; read_excel also returns a DataFrame
>>> df.head() # view the first 5 rows
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520
>>> df.tail() # view the last 5 rows
age sex bmi children smoker region charges
1333 50 male 30.97 3 no northwest 10600.5483
1334 18 female 31.92 0 no northeast 2205.9808
1335 18 female 36.85 0 no southeast 1629.8335
1336 21 female 25.80 0 no southwest 2007.9450
1337 61 female 29.07 0 yes northwest 29141.3603
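read_csv accepts many optional parameters beyond header; below is a minimal sketch of a few commonly used ones (the column names come from the insurance data shown above):
>>> df = pd.read_csv('insurance.csv',
                     sep=',',                            # field delimiter
                     usecols=['age', 'bmi', 'charges'],  # read only these columns
                     nrows=100,                          # read only the first 100 rows
                     dtype={'age': 'int64'})             # force the dtype of a column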
DataFrame
>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
"date":pd.date_range('20130102', periods=6),
"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
"age":[23,44,54,32,34,32],
"category":['100-A','100-B','110-A','110-C','210-A','130-F'],"price":[1200,np.nan,2133,5433,np.nan,4432]},
columns =['id','date','city','category','age','price'])
>>> df.head()
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
1. View the dimensions:
>>> df.shape
(6, 6)
2. Basic information about the table (dimensions, column names, data types, memory usage, etc.):
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
id 6 non-null int64
date 6 non-null datetime64[ns]
city 6 non-null object
category 6 non-null object
age 6 non-null int64
price 4 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(2)
memory usage: 368.0+ bytes
3. Data type of each column:
>>> df.dtypes
id int64
date datetime64[ns]
city object
category object
age int64
price float64
dtype: object
4. Data type of a single column:
>>> df['id'].dtype
dtype('int64')
5. Check for missing values:
>>> df.isnull()
id date city category age price
0 False False False False False False
1 False False False False False True
2 False False False False False False
3 False False False False False False
4 False False False False False True
5 False False False False False False
6. View the unique values of a column:
>>> df['age']
0 23
1 44
2 54
3 32
4 34
5 32
Name: age, dtype: int64
>>> df['age'].unique()
array([23, 44, 54, 32, 34])
7. View the values of the table:
>>> df.values
array([[1001, Timestamp('2013-01-02 00:00:00'), 'Beijing ', '100-A', 23,
1200.0],
[1002, Timestamp('2013-01-03 00:00:00'), 'SH', '100-B', 44, nan],
[1003, Timestamp('2013-01-04 00:00:00'), ' guangzhou ', '110-A',
54, 2133.0],
[1004, Timestamp('2013-01-05 00:00:00'), 'Shenzhen', '110-C', 32,
5433.0],
[1005, Timestamp('2013-01-06 00:00:00'), 'shanghai', '210-A', 34,
nan],
[1006, Timestamp('2013-01-07 00:00:00'), 'BEIJING ', '130-F', 32,
4432.0]], dtype=object)
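In recent pandas versions, to_numpy() is the recommended way to obtain the same underlying array:
>>> df.to_numpy()  # equivalent to df.values here, preferred in pandas 0.24+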
8. View the column names:
>>> df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
9. View the first N / last N rows:
>>> df.head(2) # show the first 2 rows
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
>>> df.tail(2) # show the last 2 rows
id date city category age price
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
Data cleaning
1. Fill missing values with 0:
>>> df
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
>>> df.fillna(0)
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 0.0
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 0.0
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
2. Fill the NaN values in the price column with the column mean:
>>> df['price']=df['price'].fillna(df['price'].mean())
>>> df['price']
0 1200.0
1 3299.5
2 2133.0
3 5433.0
4 3299.5
5 4432.0
Name: price, dtype: float64
3. Strip whitespace from the city column:
>>> df['city']
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
>>> df['city']=df['city'].str.strip()  # the vectorized .str.strip() is the idiomatic form
>>> df['city']
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
4. Convert case:
>>> df['city']
0 Beijing
1 SH
2 guangzhou
3 Shenzhen
4 shanghai
5 BEIJING
Name: city, dtype: object
>>> df['city']=df['city'].str.lower()
>>> df['city']
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
5. Change a column's data type (astype):
>>> df['price'].astype('int')
0 1200
1 3299
2 2133
3 5433
4 3299
5 4432
Name: price, dtype: int64
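This works only because the NaN values in price were filled above, and the decimal part is truncated (3299.5 becomes 3299). For a column that still contains missing values, here is a minimal sketch of pandas' nullable integer dtype, shown on a throwaway Series with whole-number values:
>>> pd.Series([1200.0, np.nan, 2133.0]).astype('Int64')  # 'Int64' (capital I) keeps missing values as <NA>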
6. Rename a column (rename):
>>> df.columns
Index(['id', 'date', 'city', 'category', 'age', 'price'], dtype='object')
>>> df.rename(columns={'category': 'category-size'})
id date city category-size age price
0 1001 2013-01-02 beijing 100-A 23 1200.0
1 1002 2013-01-03 sh 100-B 44 3299.5
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 3299.5
5 1006 2013-01-07 beijing 130-F 32 4432.0
7. Drop duplicate values, keeping the first occurrence:
>>> df['city']
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
>>> df['city'].drop_duplicates()
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
Name: city, dtype: object
8. Drop duplicate values, keeping the last occurrence:
>>> df['city'].drop_duplicates(keep='last')
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
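drop_duplicates also works on an entire DataFrame; the subset parameter restricts which columns are compared when deciding whether two rows are duplicates (a sketch, output not shown):
>>> df.drop_duplicates(subset=['city'], keep='first')  # drop rows whose city value has already appeared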
9. Replace values:
>>> df['city']
0 beijing
1 sh
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
>>> df['city'].replace('sh', 'shanghai')
0 beijing
1 shanghai
2 guangzhou
3 shenzhen
4 shanghai
5 beijing
Name: city, dtype: object
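replace also accepts a dict to map several values in one call; a minimal sketch (the 'bj' key is purely illustrative and does not occur in this data):
>>> df['city'].replace({'sh': 'shanghai', 'bj': 'beijing'})  # map several values at once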
Data preprocessing
1. Merging data: the merge function
The parameters of merge are listed in the table below (the suffixes parameter, which the later examples do not use, is sketched right after the table):
Parameter | Description |
---|---|
left | The left DataFrame to merge |
right | The right DataFrame to merge |
how | One of "inner", "outer", "left", "right"; defaults to "inner" |
on | Column name(s) to join on; must exist in both DataFrames |
left_on | Column(s) in the left DataFrame to use as join keys |
right_on | Column(s) in the right DataFrame to use as join keys |
left_index | Use the left DataFrame's row index as its join key |
right_index | Use the right DataFrame's row index as its join key |
sort | Sort the merged result by the join keys (defaults to False in current pandas) |
suffixes | Tuple of strings appended to overlapping column names; defaults to ('_x', '_y'). If both DataFrames have a "data" column, the result contains "data_x" and "data_y" |
copy | Defaults to True; setting it to False avoids copying data into the result |
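The suffixes parameter is the only one not exercised by the examples that follow, so here is a minimal sketch of its effect when both frames share a non-key column name:
>>> left = pd.DataFrame({'key': ['a', 'b'], 'data': [1, 2]})
>>> right = pd.DataFrame({'key': ['a', 'b'], 'data': [3, 4]})
>>> pd.merge(left, right, on='key', suffixes=('_l', '_r'))  # overlapping column 'data' becomes data_l / data_r
  key  data_l  data_r
0   a       1       3
1   b       2       4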
1.1 Merging tables: the on parameter
>>> df1=pd.DataFrame({'key':['b','b','a','a','b','a','c'],'data1':range(7)})
>>> df1
key data1
0 b 0
1 b 1
2 a 2
3 a 3
4 b 4
5 a 5
6 c 6
>>> df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
>>> df2
key data2
0 a 0
1 b 1
2 d 2
>>> pd.merge(df1,df2,on='key')
key data1 data2
0 b 0 1
1 b 1 1
2 b 4 1
3 a 2 0
4 a 3 0
5 a 5 0
1.2 Merging tables: the left_on and right_on parameters
>>> df3=pd.DataFrame({'l_key':['b','b','a','a','b','a','c'],'data1':range(7)})
>>> df3
l_key data1
0 b 0
1 b 1
2 a 2
3 a 3
4 b 4
5 a 5
6 c 6
>>> df4=pd.DataFrame({'r_key':['a','b','d'],'data2':range(3)})
>>> df4
r_key data2
0 a 0
1 b 1
2 d 2
>>> pd.merge(df3,df4,left_on='l_key',right_on='r_key')
l_key data1 r_key data2
0 b 0 b 1
1 b 1 b 1
2 b 4 b 1
3 a 2 a 0
4 a 3 a 0
5 a 5 a 0
1.3 Merging tables: the how parameter
>>> df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})
>>> df2
key data2
0 a 0
1 b 1
2 d 2
>>> df1
key data1
0 b 0
1 b 1
2 a 2
3 a 3
4 b 4
5 a 5
6 c 6
>>> pd.merge(df1,df2,on='key',how='outer')
key data1 data2
0 b 0.0 1.0
1 b 1.0 1.0
2 b 4.0 1.0
3 a 2.0 0.0
4 a 3.0 0.0
5 a 5.0 0.0
6 c 6.0 NaN
7 d NaN 2.0
>>> df_inner = pd.merge(df1,df2,on='key',how='inner')
>>> df_inner
key data1 data2
0 b 0 1
1 b 1 1
2 b 4 1
3 a 2 0
4 a 3 0
5 a 5 0
>>> pd.merge(df1,df2,on='key',how='left') # keep only the keys from the left DataFrame
key data1 data2
0 b 0 1.0
1 b 1 1.0
2 a 2 0.0
3 a 3 0.0
4 b 4 1.0
5 a 5 0.0
6 c 6 NaN
>>> pd.merge(df1,df2,on='key',how='right') # keep only the keys from the right DataFrame
key data1 data2
0 b 0.0 1
1 b 1.0 1
2 b 4.0 1
3 a 2.0 0
4 a 3.0 0
5 a 5.0 0
6 d NaN 2
1.4 Merging tables: the left_index and right_index parameters
>>> df7=pd.DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)})
>>> df7
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
>>> df8=pd.DataFrame({'group_val':[3.5,7]},index=['a','b'])
>>> df8
group_val
a 3.5
b 7.0
>>> pd.merge(df7,df8,left_on='key',right_index=True) # match df7's 'key' column against df8's index
key value group_val
0 a 0 3.5
2 a 2 3.5
3 a 3 3.5
1 b 1 7.0
4 b 4 7.0
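For this column-against-index pattern, DataFrame.join is an equivalent shorthand; note that join defaults to a left join, so unlike the merge above it would also keep the unmatched 'c' row with NaN (a sketch, output not shown):
>>> df7.join(df8, on='key')  # match df7['key'] against df8's index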
1.5 Merging tables: many-to-many merges
>>> df5=pd.DataFrame({'key':['b','b','a','c','a','b'],'data1':range(6)})
>>> df5
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
>>> df6=pd.DataFrame({'key':['a','b','a','b','d'],'data2':range(5)})
>>> df6
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
>>> pd.merge(df5,df6,how='outer') # produces the Cartesian product of matching rows: the left DataFrame has three 'b' rows and the right has two, so the result contains six 'b' rows
key data1 data2
0 b 0.0 1.0
1 b 0.0 3.0
2 b 1.0 1.0
3 b 1.0 3.0
4 b 5.0 1.0
5 b 5.0 3.0
6 a 2.0 0.0
7 a 2.0 2.0
8 a 4.0 0.0
9 a 4.0 2.0
10 c 3.0 NaN
11 d NaN 4.0
2. Set an index column
>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
                       "date":pd.date_range('20130102', periods=6),
                       "city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
                       "age":[23,44,54,32,34,32],
                       "category":['100-A','100-B','110-A','110-C','210-A','130-F'],
                       "price":[1200,np.nan,2133,5433,np.nan,4432]},
                      columns=['id','date','city','category','age','price'])
>>> df
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
>>> df.set_index('id')
date city category age price
id
1001 2013-01-02 Beijing 100-A 23 1200.0
1002 2013-01-03 SH 100-B 44 NaN
1003 2013-01-04 guangzhou 110-A 54 2133.0
1004 2013-01-05 Shenzhen 110-C 32 5433.0
1005 2013-01-06 shanghai 210-A 34 NaN
1006 2013-01-07 BEIJING 130-F 32 4432.0
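set_index returns a new DataFrame and leaves df unchanged unless the result is reassigned; reset_index moves the index back into an ordinary column:
>>> df.set_index('id').reset_index()  # round-trips back to the original column layout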
3. Sort by the values of a column:
>>> df.sort_values(by=['age'])
id date city category age price
0 1001 2013-01-02 Beijing 100-A 23 1200.0
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0
5 1006 2013-01-07 BEIJING 130-F 32 4432.0
4 1005 2013-01-06 shanghai 210-A 34 NaN
1 1002 2013-01-03 SH 100-B 44 NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0
4. Sort by the index:
>>> df.set_index('age').sort_index()
id date city category price
age
23 1001 2013-01-02 Beijing 100-A 1200.0
32 1004 2013-01-05 Shenzhen 110-C 5433.0
32 1006 2013-01-07 BEIJING 130-F 4432.0
34 1005 2013-01-06 shanghai 210-A NaN
44 1002 2013-01-03 SH 100-B NaN
54 1003 2013-01-04 guangzhou 110-A 2133.0
5. If price > 3000, the group column shows 'high'; otherwise 'low':
>>> df['group'] = np.where(df['price'] > 3000,'high','low')
>>> df['group']
0 low
1 low
2 low
3 high
4 low
5 high
Name: group, dtype: object
6. Flag rows that meet multiple conditions:
>>> df.loc[(df['city'] == 'beijing') & (df['price'] >= 4000),'sign']=1
>>> df
id date city category age price group sign
0 1001 2013-01-02 Beijing 100-A 23 1200.0 low NaN
1 1002 2013-01-03 SH 100-B 44 NaN low NaN
2 1003 2013-01-04 guangzhou 110-A 54 2133.0 low NaN
3 1004 2013-01-05 Shenzhen 110-C 32 5433.0 high NaN
4 1005 2013-01-06 shanghai 210-A 34 NaN low NaN
5 1006 2013-01-07 BEIJING 130-F 32 4432.0 high NaN
Note that in the output above the sign column is NaN for every row: this df was rebuilt without cleaning the city column, so no row satisfies city == 'beijing'; after stripping and lowercasing city, the last row (price 4432.0) would be flagged with 1.
The next two small exercises use the table df_inner, built as follows:
>>> df = pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006], "date":pd.date_range('20130102', periods=6),"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],"age":[23,44,54,32,34,32],"category":['100-A','100-B','110-A','110-C','210-A','130-F'],"price":[1200,np.nan,2133,5433,np.nan,4432]},columns =['id','date','city','category','age','price'])
>>> df1 = df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],"gender":['male','female','male','female','male','female','male','female'],"pay":['Y','N','Y','Y','N','Y','N','Y',],"m-point":[10,12,20,40,40,40,30,20]})
>>> df_inner=pd.merge(df,df1,how='inner')
>>> df_inner
id date city category ... price gender pay m-point
0 1001 2013-01-02 Beijing 100-A ... 1200.0 male Y 10
1 1002 2013-01-03 SH 100-B ... NaN female N 12
2 1003 2013-01-04 guangzhou 110-A ... 2133.0 male Y 20
3 1004 2013-01-05 Shenzhen 110-C ... 5433.0 female Y 40
4 1005 2013-01-06 shanghai 210-A ... NaN male N 40
5 1006 2013-01-07 BEIJING 130-F ... 4432.0 female Y 40
[6 rows x 9 columns]
7. Split each value of the category column and build a new table whose index matches df_inner and whose columns are named category and size:
>>> split = pd.DataFrame([x.split('-') for x in df_inner['category']], index=df_inner.index, columns=['category','size'])
>>> print(split.values)
[['100' 'A']
['100' 'B']
['110' 'A']
['110' 'C']
['210' 'A']
['130' 'F']]
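An equivalent, more idiomatic way to split the column is Series.str.split with expand=True (a sketch; the resulting columns are named 0 and 1 and would still need renaming):
>>> df_inner['category'].str.split('-', expand=True)  # returns a two-column DataFrame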
8. Merge the split table back into the original df_inner:
>>> df_inner=pd.merge(df_inner,split,right_index=True, left_index=True)
>>> print(df_inner.values)
[[1001 Timestamp('2013-01-02 00:00:00') 'Beijing ' '100-A' 23 1200.0
'male' 'Y' 10 '100' 'A']
[1002 Timestamp('2013-01-03 00:00:00') 'SH' '100-B' 44 nan 'female' 'N'
12 '100' 'B']
[1003 Timestamp('2013-01-04 00:00:00') ' guangzhou ' '110-A' 54 2133.0
'male' 'Y' 20 '110' 'A']
[1004 Timestamp('2013-01-05 00:00:00') 'Shenzhen' '110-C' 32 5433.0
'female' 'Y' 40 '110' 'C']
[1005 Timestamp('2013-01-06 00:00:00') 'shanghai' '210-A' 34 nan 'male'
'N' 40 '210' 'A']
[1006 Timestamp('2013-01-07 00:00:00') 'BEIJING ' '130-F' 32 4432.0
'female' 'Y' 40 '130' 'F']]