pandas库（3）：常用函数

最新推荐文章于 2021-03-26 15:23:32 发布

沛然ypr

最新推荐文章于 2021-03-26 15:23:32 发布

阅读量320

点赞数

分类专栏：数据分析

本文链接：https://blog.csdn.net/weixin_43824414/article/details/105372383

版权

数据分析专栏收录该内容

7 篇文章 1 订阅

订阅专栏

（1）求均值

代码：

correct_mean_station = subway_info["on_station"].mean()   #字段on_station出去缺失值外，所有值求平均
print(correct_mean_station)

输出：

44.09090909090909

（2）分类求均值

按卡种字段card_type分类计算字段on_station的均值
代码：

card_classes = [7, 101, 102]                 #共有三类卡种
on_stations_by_class = {}                    #用字典形式来记录结果
for this_class in card_classes:              #按卡种分类逐一计算字段on_station的均值
    card_class_rows = subway_info[subway_info["card_type"] == this_class]   #挑选出当前卡种的数据信息
    card_class_on_stations = card_class_rows["on_station"]                  #选出当前卡种数据中的on_station字段信息
    station_for_class = card_class_on_stations.mean()                       #计算当前卡种on_station字段的均值
    on_stations_by_class[this_class] = station_for_class                    #把结果放入字典中，键为当前卡种号，值为计算出的on_station均值
print(on_stations_by_class)

输出：

{101: 38.75, 102: 25.0, 7: 50.833333333333336}

（3）pivot_table()函数：求数据之间的关系

例1：按卡种字段card_type分类计算字段on_station的均值

index=”card_type” 表示以字段card_type为基准进行统计
values=”on_station” 表示所统计的值是什么，即on_station值
aggfunc=np.mean 表示以np.mean方式进行统计（np为numpy库导入简写），np.mean为aggfunc的默认值，因此不写也可以

代码：

passenger_on_station = subway_info.pivot_table(index="card_type", values="on_station", aggfunc=np.mean)
print(passenger_on_station)       #此例实际为（2）的简化

输出：

card_type	on_station
7	50.833333
101	38.750000
102	25.000000

例2：以卡种card_type分类统计on_station和off_station的和
代码：

port_station = subway_info.pivot_table(index="card_type", values=["on_station", "off_station"], aggfunc=np.sum)
print(port_station)

输出：

            off_station  on_station
 card_type                         
 7                  318       305.0
 101                171       155.0
 102                 37        25.0

（4）dropna()函数：删去有缺失值的行或列

示例文件中编号为4和7的信息中on_station数据缺失

例1：axis=1 或 axis=”columns” 表示删掉有缺失值的列
代码：
drop_na_columns = subway_info.dropna(axis=1) #删掉有缺失值的列
print(drop_na_columns)

输出：

     date        off_time  card_id      ...   device_num    off_station    on_time
 0   2016-04-22  08:25:36  990771990514  …    22023703           37  20160422075032
 1   2016-04-22  12:32:43  990772079197  …    22011011           10  20160422122620
 2   2016-04-22  21:11:31  990772083185  …    22011315           13  20160422210546
 3   2016-04-22  18:05:24  990772083509  …    22023701           37  20160422175633
 4   2016-04-22  17:03:04  990772083471  …    22040210            2  20160422163715
 5   2016-04-22  08:07:11  997169251538  …    22035506           55  20160422071314
 6   2016-04-22  10:46:41  997169251538  …    22011216           12  20160422094757
 7   2016-04-22  18:04:25  997169148602  …    22023802           38  20160422172720
 8   2016-04-22  12:38:31  997169148602  …    22010814            8  20160422115952
 9   2016-04-22  09:36:15  997169237105  …    22022121           21  20160422091548
 10  2016-04-22  08:11:36  997169396171  …    22079406           94  20160422075540
 11  2016-04-22  19:28:03  997169396171  …    22079006           90  20160422190423
 12  2016-04-22  08:34:06  990172167480  …    29070903          109  20160422080705
 [13 rows x 7 columns]
#可以看到输出结果中字段on_station列被删掉

例2：axis=0 表示删掉指定字段有缺失值所在的行
代码：

new_subway_info = subway_info.dropna(axis=0, subset=["on_station", "off_station"])
               #删掉字段on_station和off_station有缺失值的行
print(new_subway_info)

输出：

       date  off_time       card_id  ...  off_station         on_time  on_station
 0   2016-04-22  08:25:36  990771990514  …           37  20160422075032        25.0
 1   2016-04-22  12:32:43  990772079197  …           10  20160422122620         8.0
 2   2016-04-22  21:11:31  990772083185  …           13  20160422210546        12.0
 3   2016-04-22  18:05:24  990772083509  …           37  20160422175633        35.0
 5   2016-04-22  08:07:11  997169251538  …           55  20160422071314        11.0
 6   2016-04-22  10:46:41  997169251538  …           12  20160422094757        55.0
 8   2016-04-22  12:38:31  997169148602  …            8  20160422115952        37.0
 9   2016-04-22  09:36:15  997169237105  …           21  20160422091548        17.0
 10  2016-04-22  08:11:36  997169396171  …           94  20160422075540        90.0
 11  2016-04-22  19:28:03  997169396171  …           90  20160422190423        95.0
 12  2016-04-22  08:34:06  990172167480  …          109  20160422080705       100.0
 [11 rows x 8 columns]
#可以看到输出结果中编号为4和7的数据信息被删掉

（5）定位查询

代码：

row_index_10_type = subway_info.loc[10, "card_type"]         #查找第10号信息的card_type值
row_index_8_on_station = subway_info.loc[8, "on_station"]    #查找第8号信息的on_station值
print(row_index_10_type)
print(row_index_8_on_station)

输出：

7
37.0

（6）reset_index()函数：排序后重新编号

代码：

new_subway_info = subway_info.sort_values("on_station", ascending=False)   #以on_station为基准按降序排序
print(new_subway_info)
subway_reindexed = new_subway_info.reset_index(drop=True)                  #重置索引编号，drop=True 表示删除掉原先的编号
print('------------------')
print(subway_reindexed)

输出：

       date  off_time       card_id  ...  off_station         on_time  on_station
 12  2016-04-22  08:34:06  990172167480  …          109  20160422080705       100.0
 11  2016-04-22  19:28:03  997169396171  …           90  20160422190423        95.0
 10  2016-04-22  08:11:36  997169396171  …           94  20160422075540        90.0
 6   2016-04-22  10:46:41  997169251538  …           12  20160422094757        55.0
 8   2016-04-22  12:38:31  997169148602  …            8  20160422115952        37.0
 3   2016-04-22  18:05:24  990772083509  …           37  20160422175633        35.0
 0   2016-04-22  08:25:36  990771990514  …           37  20160422075032        25.0
 9   2016-04-22  09:36:15  997169237105  …           21  20160422091548        17.0
 2   2016-04-22  21:11:31  990772083185  …           13  20160422210546        12.0
 5   2016-04-22  08:07:11  997169251538  …           55  20160422071314        11.0
 1   2016-04-22  12:32:43  990772079197  …           10  20160422122620         8.0
 4   2016-04-22  17:03:04  990772083471  …            2  20160422163715         NaN
 7   2016-04-22  18:04:25  997169148602  …           38  20160422172720         NaN
 [13 rows x 8 columns]           #可以看到按on_station降序排序后，输出结果的索引编号是乱序的
 ------------------ 
       date  off_time       card_id  ...  off_station         on_time  on_station
 0   2016-04-22  08:34:06  990172167480  …          109  20160422080705       100.0
 1   2016-04-22  19:28:03  997169396171  …           90  20160422190423        95.0
 2   2016-04-22  08:11:36  997169396171  …           94  20160422075540        90.0
 3   2016-04-22  10:46:41  997169251538  …           12  20160422094757        55.0
 4   2016-04-22  12:38:31  997169148602  …            8  20160422115952        37.0
 5   2016-04-22  18:05:24  990772083509  …           37  20160422175633        35.0
 6   2016-04-22  08:25:36  990771990514  …           37  20160422075032        25.0
 7   2016-04-22  09:36:15  997169237105  …           21  20160422091548        17.0
 8   2016-04-22  21:11:31  990772083185  …           13  20160422210546        12.0
 9   2016-04-22  08:07:11  997169251538  …           55  20160422071314        11.0
 10  2016-04-22  12:32:43  990772079197  …           10  20160422122620         8.0
 11  2016-04-22  17:03:04  990772083471  …            2  20160422163715         NaN
 12  2016-04-22  18:04:25  997169148602  …           38  20160422172720         NaN
 [13 rows x 8 columns]          #可以看到输出结果的索引编号被按序重置

（7）apply()函数：自定义函数

例1：输出指定行数据信息
代码：

def sixth_row(column):                         #定义一个输出第6条数据信息的函数
    sixth_item = column.loc[5]
    return sixth_item

sixth_row = subway_info.apply(sixth_row)       #用apply()函数调用sixth_row()函数
print(sixth_row)

输出：

 date               2016-04-22
 off_time             08:07:11
 card_id          997169251538
 card_type                   7
 device_num           22035506
 off_station                55
 on_time        20160422071314
 on_station                 11
 dtype: object

例2：统计各字段缺失值的个数
代码：

def not_null_count(column):
    column_null = pd.isnull(column)
    null = column[column_null]
    return len(null)

column_null_count = subway_info.apply(not_null_count)
print(column_null_count)

输出：

 date           0
 off_time       0
 card_id        0
 card_type      0
 device_num     0
 off_station    0
 on_time        0
 on_station     2
 dtype: int64

例3：变换字段的值
将卡种007, 101, 102分别变换为First Card, Second Card, Third Card，若卡种值缺失，则变换为Unknown
代码：

def which_class(row):
    cardtype = row["card_type"]
    if pd.isnull(cardtype):
        return "Unknown"
    elif cardtype == 7:
        return "First Card"
    elif cardtype == 101:
        return "Second Card"
    elif cardtype == 102:
        return "Third Card"

classes = subway_info.apply(which_class, axis=1)
print(classes)

输出：

 0      Third Card
 1     Second Card
 2     Second Card
 3     Second Card
 4     Second Card
 5      First Card
 6      First Card
 7      First Card
 8      First Card
 9      First Card
 10     First Card
 11     First Card
 12    Second Card
 dtype: object

沛然ypr

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
pandas库（3）：常用函数

以下部分内容仅做示例用，或无实际统计意义（1）求均值代码：correct_mean_station = subway_info["on_station"].mean() #字段on_station出去缺失值外，所有值求平均print(correct_mean_station)输出：44.09090909090909（2）分类求均值按卡种字段card_type分类计算字段o...
复制链接

扫一扫