目录
以下部分内容仅做示例用,或无实际统计意义
(1)求均值
代码:
correct_mean_station = subway_info["on_station"].mean() #字段on_station出去缺失值外,所有值求平均
print(correct_mean_station)
输出:
44.09090909090909
(2)分类求均值
按卡种字段card_type分类计算字段on_station的均值
代码:
card_classes = [7, 101, 102] #共有三类卡种
on_stations_by_class = {} #用字典形式来记录结果
for this_class in card_classes: #按卡种分类逐一计算字段on_station的均值
card_class_rows = subway_info[subway_info["card_type"] == this_class] #挑选出当前卡种的数据信息
card_class_on_stations = card_class_rows["on_station"] #选出当前卡种数据中的on_station字段信息
station_for_class = card_class_on_stations.mean() #计算当前卡种on_station字段的均值
on_stations_by_class[this_class] = station_for_class #把结果放入字典中,键为当前卡种号,值为计算出的on_station均值
print(on_stations_by_class)
输出:
{101: 38.75, 102: 25.0, 7: 50.833333333333336}
(3)pivot_table()函数:求数据之间的关系
例1: 按卡种字段card_type分类计算字段on_station的均值
- index=”card_type” 表示以字段card_type为基准进行统计
- values=”on_station” 表示所统计的值是什么,即on_station值
- aggfunc=np.mean 表示以np.mean方式进行统计(np为numpy库导入简写),np.mean为aggfunc的默认值,因此不写也可以
代码:
passenger_on_station = subway_info.pivot_table(index="card_type", values="on_station", aggfunc=np.mean)
print(passenger_on_station) #此例实际为(2)的简化
输出:
card_type | on_station |
---|---|
7 | 50.833333 |
101 | 38.750000 |
102 | 25.000000 |
例2:以卡种card_type分类统计on_station和off_station的和
代码:
port_station = subway_info.pivot_table(index="card_type", values=["on_station", "off_station"], aggfunc=np.sum)
print(port_station)
输出:
off_station on_station
card_type
7 318 305.0
101 171 155.0
102 37 25.0
(4)dropna()函数:删去有缺失值的行或列
示例文件中编号为4和7的信息中on_station数据缺失
例1:axis=1 或 axis=”columns” 表示删掉有缺失值的列
代码:
drop_na_columns = subway_info.dropna(axis=1) #删掉有缺失值的列
print(drop_na_columns)
输出:
date off_time card_id ... device_num off_station on_time
0 2016-04-22 08:25:36 990771990514 … 22023703 37 20160422075032
1 2016-04-22 12:32:43 990772079197 … 22011011 10 20160422122620
2 2016-04-22 21:11:31 990772083185 … 22011315 13 20160422210546
3 2016-04-22 18:05:24 990772083509 … 22023701 37 20160422175633
4 2016-04-22 17:03:04 990772083471 … 22040210 2 20160422163715
5 2016-04-22 08:07:11 997169251538 … 22035506 55 20160422071314
6 2016-04-22 10:46:41 997169251538 … 22011216 12 20160422094757
7 2016-04-22 18:04:25 997169148602 … 22023802 38 20160422172720
8 2016-04-22 12:38:31 997169148602 … 22010814 8 20160422115952
9 2016-04-22 09:36:15 997169237105 … 22022121 21 20160422091548
10 2016-04-22 08:11:36 997169396171 … 22079406 94 20160422075540
11 2016-04-22 19:28:03 997169396171 … 22079006 90 20160422190423
12 2016-04-22 08:34:06 990172167480 … 29070903 109 20160422080705
[13 rows x 7 columns]
#可以看到输出结果中字段on_station列被删掉
例2:axis=0 表示删掉指定字段有缺失值所在的行
代码:
new_subway_info = subway_info.dropna(axis=0, subset=["on_station", "off_station"])
#删掉字段on_station和off_station有缺失值的行
print(new_subway_info)
输出:
date off_time card_id ... off_station on_time on_station
0 2016-04-22 08:25:36 990771990514 … 37 20160422075032 25.0
1 2016-04-22 12:32:43 990772079197 … 10 20160422122620 8.0
2 2016-04-22 21:11:31 990772083185 … 13 20160422210546 12.0
3 2016-04-22 18:05:24 990772083509 … 37 20160422175633 35.0
5 2016-04-22 08:07:11 997169251538 … 55 20160422071314 11.0
6 2016-04-22 10:46:41 997169251538 … 12 20160422094757 55.0
8 2016-04-22 12:38:31 997169148602 … 8 20160422115952 37.0
9 2016-04-22 09:36:15 997169237105 … 21 20160422091548 17.0
10 2016-04-22 08:11:36 997169396171 … 94 20160422075540 90.0
11 2016-04-22 19:28:03 997169396171 … 90 20160422190423 95.0
12 2016-04-22 08:34:06 990172167480 … 109 20160422080705 100.0
[11 rows x 8 columns]
#可以看到输出结果中编号为4和7的数据信息被删掉
(5)定位查询
代码:
row_index_10_type = subway_info.loc[10, "card_type"] #查找第10号信息的card_type值
row_index_8_on_station = subway_info.loc[8, "on_station"] #查找第8号信息的on_station值
print(row_index_10_type)
print(row_index_8_on_station)
输出:
7
37.0
(6)reset_index()函数:排序后重新编号
代码:
new_subway_info = subway_info.sort_values("on_station", ascending=False) #以on_station为基准按降序排序
print(new_subway_info)
subway_reindexed = new_subway_info.reset_index(drop=True) #重置索引编号,drop=True 表示删除掉原先的编号
print('------------------')
print(subway_reindexed)
输出:
date off_time card_id ... off_station on_time on_station
12 2016-04-22 08:34:06 990172167480 … 109 20160422080705 100.0
11 2016-04-22 19:28:03 997169396171 … 90 20160422190423 95.0
10 2016-04-22 08:11:36 997169396171 … 94 20160422075540 90.0
6 2016-04-22 10:46:41 997169251538 … 12 20160422094757 55.0
8 2016-04-22 12:38:31 997169148602 … 8 20160422115952 37.0
3 2016-04-22 18:05:24 990772083509 … 37 20160422175633 35.0
0 2016-04-22 08:25:36 990771990514 … 37 20160422075032 25.0
9 2016-04-22 09:36:15 997169237105 … 21 20160422091548 17.0
2 2016-04-22 21:11:31 990772083185 … 13 20160422210546 12.0
5 2016-04-22 08:07:11 997169251538 … 55 20160422071314 11.0
1 2016-04-22 12:32:43 990772079197 … 10 20160422122620 8.0
4 2016-04-22 17:03:04 990772083471 … 2 20160422163715 NaN
7 2016-04-22 18:04:25 997169148602 … 38 20160422172720 NaN
[13 rows x 8 columns] #可以看到按on_station降序排序后,输出结果的索引编号是乱序的
------------------
date off_time card_id ... off_station on_time on_station
0 2016-04-22 08:34:06 990172167480 … 109 20160422080705 100.0
1 2016-04-22 19:28:03 997169396171 … 90 20160422190423 95.0
2 2016-04-22 08:11:36 997169396171 … 94 20160422075540 90.0
3 2016-04-22 10:46:41 997169251538 … 12 20160422094757 55.0
4 2016-04-22 12:38:31 997169148602 … 8 20160422115952 37.0
5 2016-04-22 18:05:24 990772083509 … 37 20160422175633 35.0
6 2016-04-22 08:25:36 990771990514 … 37 20160422075032 25.0
7 2016-04-22 09:36:15 997169237105 … 21 20160422091548 17.0
8 2016-04-22 21:11:31 990772083185 … 13 20160422210546 12.0
9 2016-04-22 08:07:11 997169251538 … 55 20160422071314 11.0
10 2016-04-22 12:32:43 990772079197 … 10 20160422122620 8.0
11 2016-04-22 17:03:04 990772083471 … 2 20160422163715 NaN
12 2016-04-22 18:04:25 997169148602 … 38 20160422172720 NaN
[13 rows x 8 columns] #可以看到输出结果的索引编号被按序重置
(7)apply()函数:自定义函数
例1:输出指定行数据信息
代码:
def sixth_row(column): #定义一个输出第6条数据信息的函数
sixth_item = column.loc[5]
return sixth_item
sixth_row = subway_info.apply(sixth_row) #用apply()函数调用sixth_row()函数
print(sixth_row)
输出:
date 2016-04-22
off_time 08:07:11
card_id 997169251538
card_type 7
device_num 22035506
off_station 55
on_time 20160422071314
on_station 11
dtype: object
例2:统计各字段缺失值的个数
代码:
def not_null_count(column):
column_null = pd.isnull(column)
null = column[column_null]
return len(null)
column_null_count = subway_info.apply(not_null_count)
print(column_null_count)
输出:
date 0
off_time 0
card_id 0
card_type 0
device_num 0
off_station 0
on_time 0
on_station 2
dtype: int64
例3:变换字段的值
将卡种007, 101, 102分别变换为First Card, Second Card, Third Card,若卡种值缺失,则变换为Unknown
代码:
def which_class(row):
cardtype = row["card_type"]
if pd.isnull(cardtype):
return "Unknown"
elif cardtype == 7:
return "First Card"
elif cardtype == 101:
return "Second Card"
elif cardtype == 102:
return "Third Card"
classes = subway_info.apply(which_class, axis=1)
print(classes)
输出:
0 Third Card
1 Second Card
2 Second Card
3 Second Card
4 Second Card
5 First Card
6 First Card
7 First Card
8 First Card
9 First Card
10 First Card
11 First Card
12 Second Card
dtype: object