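Sample file:
A CSV file containing one day of Nanjing Metro card-swipe records.
It contains the fields: date (date), off_time (exit time), card_id (transit card number), card_type (card type), device_num (device number), off_station (exit station ID), on_time (entry time), on_station (entry station ID)
An excerpt of the file is shown below: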
date | off_time | card_id | card_type | device_num | off_station | on_time | on_station |
---|---|---|---|---|---|---|---|
2016-04-22 | 08:25:36 | 0000990771990514 | 102 | 22023703 | 0000037 | 20160422075032 | 25 |
2016-04-22 | 12:32:43 | 0000990772079197 | 101 | 22011011 | 0000010 | 20160422122620 | 8 |
2016-04-22 | 21:11:31 | 0000990772083185 | 101 | 22011315 | 0000013 | 20160422210546 | 12 |
2016-04-22 | 18:05:24 | 0000990772083509 | 101 | 22023701 | 0000037 | 20160422175633 | 35 |
2016-04-22 | 17:03:04 | 0000990772083471 | 101 | 22040210 | 0000002 | 20160422163715 | 44 |
2016-04-22 | 08:07:11 | 0000997169251538 | 007 | 22035506 | 0000055 | 20160422071314 | 11 |
2016-04-22 | 10:46:41 | 0000997169251538 | 007 | 22011216 | 0000012 | 20160422094757 | 55 |
2016-04-22 | 18:04:25 | 0000997169148602 | 007 | 22023802 | 0000038 | 20160422172720 | 9 |
2016-04-22 | 12:38:31 | 0000997169148602 | 007 | 22010814 | 0000008 | 20160422115952 | 37 |
2016-04-22 | 09:36:15 | 0000997169237105 | 007 | 22022121 | 0000021 | 20160422091548 | 17 |
2016-04-22 | 08:11:36 | 0000997169396171 | 007 | 22079406 | 0000094 | 20160422075540 | 90 |
2016-04-22 | 19:28:03 | 0000997169396171 | 007 | 22079006 | 0000090 | 20160422190423 | 95 |
2016-04-22 | 08:34:06 | 0000990172167480 | 101 | 29070903 | 0000109 | 20160422080705 | 100 |
Sample code:
(1) Read the file
import pandas as pd
subway_info = pd.read_csv("D:MET20160422.csv")
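For a large file it can help to load only part of it while exploring. The following is a minimal sketch, assuming an in-memory stand-in for the real file: `sample_csv` and `preview` are hypothetical names, and the three rows are adapted from the excerpt above. `usecols` and `nrows` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file (three rows adapted from the excerpt).
sample_csv = io.StringIO(
    "date,off_time,card_id,card_type,device_num,off_station,on_time,on_station\n"
    "2016-04-22,08:25:36,990771990514,102,22023703,37,20160422075032,25\n"
    "2016-04-22,12:32:43,990772079197,101,22011011,10,20160422122620,8\n"
    "2016-04-22,21:11:31,990772083185,101,22011315,13,20160422210546,12\n"
)

# Load only the two station columns and only the first two data rows.
preview = pd.read_csv(sample_csv, usecols=["on_station", "off_station"], nrows=2)

print(preview.shape)
print(preview.columns.tolist())
```

Note that `usecols` does not reorder anything: the resulting columns keep the order they have in the file.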
(2) Print the data types and the data size
Code:
print(subway_info.dtypes)  # print the data type of each field
Output:
date object
off_time object
card_id int64
card_type int64
device_num int64
off_station int64
on_time int64
on_station int64
dtype: object
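The output above shows `card_id`, `card_type` and `off_station` parsed as `int64`, which drops the leading zeros visible in the file excerpt (for example, `007` becomes `7`), and `on_time` is read as a plain integer rather than a timestamp. The following is a minimal sketch of one way to handle both, assuming a two-row in-memory sample with only some of the columns; `sample_csv`, `id_columns` and `df` are hypothetical names.

```python
import io
import pandas as pd

# Hypothetical two-row sample; values are taken from the excerpt above.
sample_csv = io.StringIO(
    "date,off_time,card_id,card_type,off_station,on_time\n"
    "2016-04-22,08:25:36,0000990771990514,102,0000037,20160422075032\n"
    "2016-04-22,08:07:11,0000997169251538,007,0000055,20160422071314\n"
)

# Read identifier-like columns as strings so leading zeros are preserved.
id_columns = ["card_id", "card_type", "off_station"]
df = pd.read_csv(sample_csv, dtype={name: str for name in id_columns})

# Convert the 14-digit entry time (YYYYMMDDHHMMSS) into a datetime column.
df["on_time"] = pd.to_datetime(df["on_time"].astype(str), format="%Y%m%d%H%M%S")

print(df.dtypes)
print(df.loc[1, "card_type"])
```

Whether the identifiers should stay as strings depends on the analysis: integers compare and sort faster, while strings keep the codes exactly as they appear in the file.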
Code:
print(subway_info.shape)  # print the data size
Output:
(1230010, 8)  # 1,230,010 records and 8 fields
(3) Print all field information
Code:
print(subway_info.head(3))  # print the first three records
Output:
         date  off_time       card_id  ...  off_station         on_time  on_station
0  2016-04-22  08:25:36  990771990514  ...           37  20160422075032          25
1  2016-04-22  12:32:43  990772079197  ...           10  20160422122620           8
2  2016-04-22  21:11:31  990772083185  ...           13  20160422210546          12
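In the output above, pandas replaces the middle columns with `...` because the frame is wider than the console. The following is a minimal sketch of showing every column, assuming a one-row stand-in frame (`demo` and `full_view` are hypothetical names). `display.max_columns` and `display.width` are standard pandas display options.

```python
import pandas as pd

# Hypothetical one-row frame with the same eight fields as the sample file.
demo = pd.DataFrame({
    "date": ["2016-04-22"], "off_time": ["08:25:36"], "card_id": [990771990514],
    "card_type": [102], "device_num": [22023703], "off_station": [37],
    "on_time": [20160422075032], "on_station": [25],
})

# Lift the column limit and widen the output only inside this block;
# the previous settings are restored when the block exits.
with pd.option_context("display.max_columns", None, "display.width", 200):
    full_view = repr(demo)

print(full_view)
```

Using `pd.option_context` instead of `pd.set_option` keeps the change local, so later prints still use the default settings.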
Code:
print(subway_info.tail(3))  # print the last three records
Output:
               date  off_time  ...         on_time  on_station
1230007  2016-04-22  10:44:09  ...  20160422101028         108
1230008  2016-04-22  12:35:02  ...  20160422121822          43
1230009  2016-04-22  20:05:27  ...  20160422192914          10
Code:
print(subway_info.loc[2])  # print the third record (index label 2)
Output:
date 2016-04-22
off_time 21:11:31
card_id 990772083185
card_type 101
device_num 22011315
off_station 13
on_time 20160422210546
on_station 12
Name: 2, dtype: object
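`loc` selects by index label, which coincides with the row position here only because the default index runs 0, 1, 2 and so on. The following is a minimal sketch of how it differs from position-based `iloc`, assuming a three-row stand-in frame (`demo`, `subset`, `by_label` and `by_position` are hypothetical names; the values come from the first rows above).

```python
import pandas as pd

# Hypothetical three-row frame using the station values of the first records.
demo = pd.DataFrame({
    "off_station": [37, 10, 13],
    "on_station": [25, 8, 12],
})

# After filtering, the surviving row keeps its original label 2.
subset = demo[demo["off_station"] == 13]

by_label = subset.loc[2]      # selects the row whose index label is 2
by_position = subset.iloc[0]  # selects the first row by position

print(by_label.equals(by_position))
print(subset.index.tolist())
```

On the filtered frame, `subset.loc[0]` would raise a `KeyError` because label 0 no longer exists, while `subset.iloc[0]` still works.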
(4) Print specific fields
Code:
ndb_col = subway_info["on_station"]
print(ndb_col)  # print every value of the on_station field
Output:
0 25
1 8
2 12
3 35
4 44
5 11
6 55
7 9
8 37
9 17
10 90
11 95
12 100
13 109
14 52
15 12
16 42
17 72
18 100
19 22
20 26
21 110
22 97
23 10
24 24
25 17
26 12
27 53
28 10
29 41
...
1229980 98
1229981 24
1229982 101
1229983 101
1229984 99
1229985 113
1229986 103
1229987 92
1229988 104
1229989 113
1229990 16
1229991 97
1229992 44
1229993 26
1229994 99
1229995 102
1229996 113
1229997 42
1229998 44
1229999 106
1230000 103
1230001 6
1230002 21
1230003 16
1230004 75
1230005 9
1230006 31
1230007 108
1230008 43
1230009 10
Name: on_station, Length: 1230010, dtype: int64
Code:
columns = ["on_station", "off_station"]
ndb_col = subway_info[columns]
print(ndb_col)  # print every value of the on_station and off_station fields
Output:
on_station off_station
0 25 37
1 8 10
2 12 13
3 35 37
4 44 2
5 11 55
6 55 12
7 9 38
8 37 8
9 17 21
10 90 94
11 95 90
12 100 109
13 109 102
14 52 55
15 12 70
16 42 12
17 72 100
18 100 75
19 22 9
20 26 110
21 110 97
22 97 26
23 10 12
24 24 17
25 17 14
26 12 53
27 53 12
28 10 41
29 41 10
... ... ...
1229980 98 111
1229981 24 111
1229982 101 111
1229983 101 107
1229984 99 107
1229985 113 108
1229986 103 108
1229987 92 108
1229988 104 108
1229989 113 108
1229990 16 108
1229991 97 109
1229992 44 109
1229993 26 109
1229994 99 109
1229995 102 109
1229996 113 109
1229997 42 110
1229998 44 110
1229999 106 111
1230000 103 110
1230001 6 111
1230002 21 107
1230003 16 108
1230004 75 108
1230005 9 108
1230006 31 108
1230007 108 108
1230008 43 109
1230009 10 109
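The two outputs above differ in kind: indexing with a single field name returns a Series, while indexing with a list of names returns a DataFrame. The following is a minimal sketch of the distinction, assuming a three-row stand-in frame (`demo` and the result names are hypothetical).

```python
import pandas as pd

# Hypothetical three-row frame using the station values of the first records.
demo = pd.DataFrame({
    "off_station": [37, 10, 13],
    "on_station": [25, 8, 12],
})

one_column = demo["on_station"]                    # single name: Series
as_frame = demo[["on_station"]]                    # one-item list: DataFrame
two_columns = demo[["on_station", "off_station"]]  # columns follow the list order

print(type(one_column).__name__, type(as_frame).__name__)
print(two_columns.columns.tolist())
```

The list form also controls the column order of the result, which is why `on_station` is printed before `off_station` above even though it comes last in the file.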
(5) Print the field names
Code:
col_names = subway_info.columns.tolist()
print(col_names)  # print all field names
Output:
['date', 'off_time', 'card_id', 'card_type', 'device_num', 'off_station', 'on_time', 'on_station']
(6) Select specific fields and print them
Code:
col_names = subway_info.columns.tolist()
station_columns = []
for i in col_names:
    if i.endswith("station"):  # keep the fields whose names end with "station"
        station_columns.append(i)
station_df = subway_info[station_columns]
print(station_df.head(3))  # print the first three records of those fields
Output:
off_station on_station
0 37 25
1 10 8
2 13 12
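The same selection can be written more compactly. The following is a minimal sketch of three equivalent alternatives to the loop, assuming a one-row stand-in frame (`demo`, `by_mask` and `by_regex` are hypothetical names). `Index.str.endswith` and `DataFrame.filter` are standard pandas methods.

```python
import pandas as pd

# Hypothetical one-row frame with a mix of station and non-station fields.
demo = pd.DataFrame({
    "date": ["2016-04-22"],
    "off_station": [37],
    "on_time": [20160422075032],
    "on_station": [25],
})

# 1. List comprehension: the same logic as the loop, in one line.
station_columns = [name for name in demo.columns if name.endswith("station")]

# 2. Boolean mask built with the string accessor of the column index.
by_mask = demo.loc[:, demo.columns.str.endswith("station")]

# 3. Regular expression matched against the column labels.
by_regex = demo.filter(regex="station$")

print(station_columns)
print(by_mask.equals(by_regex))
```

All three keep the columns in their original order; which one reads best is mostly a matter of taste.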