pandas库(3):常用函数


以下部分内容仅做示例用,或无实际统计意义

(1)求均值

代码:

correct_mean_station = subway_info["on_station"].mean()   #字段on_station出去缺失值外,所有值求平均
print(correct_mean_station)

输出:

44.09090909090909

(2)分类求均值

按卡种字段card_type分类计算字段on_station的均值
代码:

card_classes = [7, 101, 102]                 #共有三类卡种
on_stations_by_class = {}                    #用字典形式来记录结果
for this_class in card_classes:              #按卡种分类逐一计算字段on_station的均值
    card_class_rows = subway_info[subway_info["card_type"] == this_class]   #挑选出当前卡种的数据信息
    card_class_on_stations = card_class_rows["on_station"]                  #选出当前卡种数据中的on_station字段信息
    station_for_class = card_class_on_stations.mean()                       #计算当前卡种on_station字段的均值
    on_stations_by_class[this_class] = station_for_class                    #把结果放入字典中,键为当前卡种号,值为计算出的on_station均值
print(on_stations_by_class)

输出:

{101: 38.75, 102: 25.0, 7: 50.833333333333336}

(3)pivot_table()函数:求数据之间的关系

例1: 按卡种字段card_type分类计算字段on_station的均值

  • index=”card_type” 表示以字段card_type为基准进行统计
  • values=”on_station” 表示所统计的值是什么,即on_station值
  • aggfunc=np.mean 表示以np.mean方式进行统计(np为numpy库导入简写),np.mean为aggfunc的默认值,因此不写也可以

代码:

passenger_on_station = subway_info.pivot_table(index="card_type", values="on_station", aggfunc=np.mean)
print(passenger_on_station)       #此例实际为(2)的简化

输出:

card_typeon_station
750.833333
10138.750000
10225.000000

例2:以卡种card_type分类统计on_station和off_station的和
代码:

port_station = subway_info.pivot_table(index="card_type", values=["on_station", "off_station"], aggfunc=np.sum)
print(port_station)

输出:

            off_station  on_station
 card_type                         
 7                  318       305.0
 101                171       155.0
 102                 37        25.0

(4)dropna()函数:删去有缺失值的行或列

示例文件中编号为4和7的信息中on_station数据缺失

例1:axis=1 或 axis=”columns” 表示删掉有缺失值的列
代码:
drop_na_columns = subway_info.dropna(axis=1) #删掉有缺失值的列
print(drop_na_columns)

输出:

     date        off_time  card_id      ...   device_num    off_station    on_time
 0   2016-04-22  08:25:36  99077199051422023703           37  20160422075032
 1   2016-04-22  12:32:43  99077207919722011011           10  20160422122620
 2   2016-04-22  21:11:31  99077208318522011315           13  20160422210546
 3   2016-04-22  18:05:24  99077208350922023701           37  20160422175633
 4   2016-04-22  17:03:04  99077208347122040210            2  20160422163715
 5   2016-04-22  08:07:11  99716925153822035506           55  20160422071314
 6   2016-04-22  10:46:41  99716925153822011216           12  20160422094757
 7   2016-04-22  18:04:25  99716914860222023802           38  20160422172720
 8   2016-04-22  12:38:31  99716914860222010814            8  20160422115952
 9   2016-04-22  09:36:15  99716923710522022121           21  20160422091548
 10  2016-04-22  08:11:36  99716939617122079406           94  20160422075540
 11  2016-04-22  19:28:03  99716939617122079006           90  20160422190423
 12  2016-04-22  08:34:06  99017216748029070903          109  20160422080705
 [13 rows x 7 columns]
#可以看到输出结果中字段on_station列被删掉

例2:axis=0 表示删掉指定字段有缺失值所在的行
代码:

new_subway_info = subway_info.dropna(axis=0, subset=["on_station", "off_station"])
               #删掉字段on_station和off_station有缺失值的行
print(new_subway_info)

输出:

       date  off_time       card_id  ...  off_station         on_time  on_station
 0   2016-04-22  08:25:36  99077199051437  20160422075032        25.0
 1   2016-04-22  12:32:43  99077207919710  20160422122620         8.0
 2   2016-04-22  21:11:31  99077208318513  20160422210546        12.0
 3   2016-04-22  18:05:24  99077208350937  20160422175633        35.0
 5   2016-04-22  08:07:11  99716925153855  20160422071314        11.0
 6   2016-04-22  10:46:41  99716925153812  20160422094757        55.0
 8   2016-04-22  12:38:31  9971691486028  20160422115952        37.0
 9   2016-04-22  09:36:15  99716923710521  20160422091548        17.0
 10  2016-04-22  08:11:36  99716939617194  20160422075540        90.0
 11  2016-04-22  19:28:03  99716939617190  20160422190423        95.0
 12  2016-04-22  08:34:06  990172167480109  20160422080705       100.0
 [11 rows x 8 columns]
#可以看到输出结果中编号为4和7的数据信息被删掉

(5)定位查询

代码:

row_index_10_type = subway_info.loc[10, "card_type"]         #查找第10号信息的card_type值
row_index_8_on_station = subway_info.loc[8, "on_station"]    #查找第8号信息的on_station值
print(row_index_10_type)
print(row_index_8_on_station)

输出:

7
37.0

(6)reset_index()函数:排序后重新编号

代码:

new_subway_info = subway_info.sort_values("on_station", ascending=False)   #以on_station为基准按降序排序
print(new_subway_info)
subway_reindexed = new_subway_info.reset_index(drop=True)                  #重置索引编号,drop=True 表示删除掉原先的编号
print('------------------')
print(subway_reindexed)

输出:

       date  off_time       card_id  ...  off_station         on_time  on_station
 12  2016-04-22  08:34:06  990172167480  …          109  20160422080705       100.0
 11  2016-04-22  19:28:03  997169396171  …           90  20160422190423        95.0
 10  2016-04-22  08:11:36  997169396171  …           94  20160422075540        90.0
 6   2016-04-22  10:46:41  997169251538  …           12  20160422094757        55.0
 8   2016-04-22  12:38:31  997169148602  …            8  20160422115952        37.0
 3   2016-04-22  18:05:24  990772083509  …           37  20160422175633        35.0
 0   2016-04-22  08:25:36  990771990514  …           37  20160422075032        25.0
 9   2016-04-22  09:36:15  997169237105  …           21  20160422091548        17.0
 2   2016-04-22  21:11:31  990772083185  …           13  20160422210546        12.0
 5   2016-04-22  08:07:11  997169251538  …           55  20160422071314        11.0
 1   2016-04-22  12:32:43  990772079197  …           10  20160422122620         8.0
 4   2016-04-22  17:03:04  990772083471  …            2  20160422163715         NaN
 7   2016-04-22  18:04:25  997169148602  …           38  20160422172720         NaN
 [13 rows x 8 columns]           #可以看到按on_station降序排序后,输出结果的索引编号是乱序的
 ------------------ 
       date  off_time       card_id  ...  off_station         on_time  on_station
 0   2016-04-22  08:34:06  990172167480  …          109  20160422080705       100.0
 1   2016-04-22  19:28:03  997169396171  …           90  20160422190423        95.0
 2   2016-04-22  08:11:36  997169396171  …           94  20160422075540        90.0
 3   2016-04-22  10:46:41  997169251538  …           12  20160422094757        55.0
 4   2016-04-22  12:38:31  997169148602  …            8  20160422115952        37.0
 5   2016-04-22  18:05:24  990772083509  …           37  20160422175633        35.0
 6   2016-04-22  08:25:36  990771990514  …           37  20160422075032        25.0
 7   2016-04-22  09:36:15  997169237105  …           21  20160422091548        17.0
 8   2016-04-22  21:11:31  990772083185  …           13  20160422210546        12.0
 9   2016-04-22  08:07:11  997169251538  …           55  20160422071314        11.0
 10  2016-04-22  12:32:43  990772079197  …           10  20160422122620         8.0
 11  2016-04-22  17:03:04  990772083471  …            2  20160422163715         NaN
 12  2016-04-22  18:04:25  997169148602  …           38  20160422172720         NaN
 [13 rows x 8 columns]          #可以看到输出结果的索引编号被按序重置

(7)apply()函数:自定义函数

例1:输出指定行数据信息
代码:

def sixth_row(column):                         #定义一个输出第6条数据信息的函数
    sixth_item = column.loc[5]
    return sixth_item

sixth_row = subway_info.apply(sixth_row)       #用apply()函数调用sixth_row()函数
print(sixth_row)

输出:

 date               2016-04-22
 off_time             08:07:11
 card_id          997169251538
 card_type                   7
 device_num           22035506
 off_station                55
 on_time        20160422071314
 on_station                 11
 dtype: object

例2:统计各字段缺失值的个数
代码:

def not_null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = subway_info.apply(not_null_count)
print(column_null_count)

输出:

 date           0
 off_time       0
 card_id        0
 card_type      0
 device_num     0
 off_station    0
 on_time        0
 on_station     2
 dtype: int64

例3:变换字段的值
将卡种007, 101, 102分别变换为First Card, Second Card, Third Card,若卡种值缺失,则变换为Unknown
代码:

def which_class(row):
    cardtype = row["card_type"]
    if pd.isnull(cardtype):
        return "Unknown"
    elif cardtype == 7:
        return "First Card"
    elif cardtype == 101:
        return "Second Card"
    elif cardtype == 102:
        return "Third Card"

classes = subway_info.apply(which_class, axis=1)
print(classes)

输出:

 0      Third Card
 1     Second Card
 2     Second Card
 3     Second Card
 4     Second Card
 5      First Card
 6      First Card
 7      First Card
 8      First Card
 9      First Card
 10     First Card
 11     First Card
 12    Second Card
 dtype: object
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值