{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在这个教程中,你将会学到如何使用python的pandas包对出租车GPS数据进行数据清洗,识别出行OD\n",
"\n",
"
提供的基础数据是:
数据:
\n",
" 1.出租车原始GPS数据(在data-sample文件夹下,原始数据集的抽样500辆车的数据)
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[pandas包的简介](https://baike.baidu.com/item/pandas/17209606?fr=aladdin)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 读取数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"首先,读取出租车数据。"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:51:53.552930Z",
"start_time": "2020-01-18T04:51:52.397018Z"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"#读取数据\n",
"data = pd.read_csv(r'data-sample/TaxiData-Sample',header = None)\n",
"#给数据命名列\n",
"data.columns = ['VehicleNum', 'Stime', 'Lng', 'Lat', 'OpenStatus', 'Speed']"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:51:58.299239Z",
"start_time": "2020-01-18T04:51:58.271312Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
\n","
\n","
\n","
\n","
0\n","
22271\n","
22:54:04\n","
114.167000\n","
22.718399\n","
0\n","
0\n","
\n","
\n","
1\n","
22271\n","
18:26:26\n","
114.190598\n","
22.647800\n","
0\n","
4\n","
\n","
\n","
2\n","
22271\n","
18:35:18\n","
114.201401\n","
22.649700\n","
0\n","
0\n","
\n","
\n","
3\n","
22271\n","
16:02:46\n","
114.233498\n","
22.725901\n","
0\n","
24\n","
\n","
\n","
4\n","
22271\n","
21:41:17\n","
114.233597\n","
22.720900\n","
0\n","
19\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed\n",
"0 22271 22:54:04 114.167000 22.718399 0 0\n",
"1 22271 18:26:26 114.190598 22.647800 0 4\n",
"2 22271 18:35:18 114.201401 22.649700 0 0\n",
"3 22271 16:02:46 114.233498 22.725901 0 24\n",
"4 22271 21:41:17 114.233597 22.720900 0 19"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#显示数据的前5行\n",
"data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据的格式:\n",
"\n",
">VehicleNum —— 车牌 \n",
"Stime —— 时间 \n",
"Lng —— 经度 \n",
"Lat —— 纬度 \n",
"OpenStatus —— 是否有乘客(0没乘客,1有乘客) \n",
"Speed —— 速度 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 基础的数据操作"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame和Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrame和Series\n",
"\n",
" > 当我们读一个数据的时候,我们读进来的就是DataFrame格式的数据表,而一个DataFrame中的每一列,则为一个Series \n",
" 也就是说,DataFrame由多个Series组成\n"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:52:25.713432Z",
"start_time": "2020-01-18T04:52:25.708450Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果我们想取DataFrame的某一列,想得到的是Series,那么直接用以下代码\n",
"\n",
" > data[列名]"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:52:32.097575Z",
"start_time": "2020-01-18T04:52:32.090592Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data['Lng'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2019-09-06T09:22:43.642625Z",
"start_time": "2019-09-06T09:22:43.638487Z"
}
},
"source": [
"如果我们想取DataFrame的某一列或者某几列,想得到的是DataFrame,那么直接用以下代码\n",
"\n",
"> data2[[列名,列名]]"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:52:33.013124Z",
"start_time": "2020-01-18T04:52:32.990186Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(data[['Lng']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据的筛选"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"数据的筛选:\n",
"\n",
" 在筛选数据的时候,我们一般用data[条件]的格式\n",
" 其中的条件,是对data每一行数据的true和false布尔变量的Series"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" 例如,我们想得到车牌照为22271的所有数据\n",
" 首先我们要获得一个布尔变量的Series,这个Series对应的是data的每一行,如果车牌照为\"粤B4H2K8\"则为true,不是则为false\n",
" 这样子的Series很容易获得,只需要\n",
" data['VehicleNum']==22271"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:52:44.078571Z",
"start_time": "2020-01-18T04:52:44.049646Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 True\n",
"2 True\n",
"3 True\n",
"4 True\n",
"Name: VehicleNum, dtype: bool"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(data['VehicleNum']==22271).head(5)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:52:51.723416Z",
"start_time": "2020-01-18T04:52:51.688510Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
\n","
\n","
\n","
\n","
0\n","
22271\n","
22:54:04\n","
114.167000\n","
22.718399\n","
0\n","
0\n","
\n","
\n","
1\n","
22271\n","
18:26:26\n","
114.190598\n","
22.647800\n","
0\n","
4\n","
\n","
\n","
2\n","
22271\n","
18:35:18\n","
114.201401\n","
22.649700\n","
0\n","
0\n","
\n","
\n","
3\n","
22271\n","
16:02:46\n","
114.233498\n","
22.725901\n","
0\n","
24\n","
\n","
\n","
4\n","
22271\n","
21:41:17\n","
114.233597\n","
22.720900\n","
0\n","
19\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed\n",
"0 22271 22:54:04 114.167000 22.718399 0 0\n",
"1 22271 18:26:26 114.190598 22.647800 0 4\n",
"2 22271 18:35:18 114.201401 22.649700 0 0\n",
"3 22271 16:02:46 114.233498 22.725901 0 24\n",
"4 22271 21:41:17 114.233597 22.720900 0 19"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#得到车牌照为22271的所有数据\n",
"data[data['VehicleNum']==22271].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"如果我们想要对data删去所有牌照为22271的数据,所需要的操作也很简单:\n",
" \n",
"> data[-(条件)]\n",
" \n",
"注意,如果你不想要奇奇怪怪的bug出现,请按照我上面给的格式来筛选数据,建议不要用data.drop()来删数据,,data.drop()只在你想删除某一列的时候用"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:05.516849Z",
"start_time": "2020-01-18T04:53:05.403154Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
\n","
\n","
\n","
\n","
1437\n","
35807\n","
01:53:46\n","
113.809898\n","
22.626801\n","
0\n","
0\n","
\n","
\n","
1438\n","
35807\n","
01:43:46\n","
113.813301\n","
22.623600\n","
0\n","
0\n","
\n","
\n","
1439\n","
35807\n","
01:14:15\n","
113.847000\n","
22.594700\n","
0\n","
41\n","
\n","
\n","
1440\n","
35807\n","
02:01:41\n","
113.852501\n","
22.625700\n","
0\n","
22\n","
\n","
\n","
1441\n","
35807\n","
01:01:59\n","
113.897003\n","
22.551901\n","
0\n","
42\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed\n",
"1437 35807 01:53:46 113.809898 22.626801 0 0\n",
"1438 35807 01:43:46 113.813301 22.623600 0 0\n",
"1439 35807 01:14:15 113.847000 22.594700 0 41\n",
"1440 35807 02:01:41 113.852501 22.625700 0 22\n",
"1441 35807 01:01:59 113.897003 22.551901 0 42"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#删除车牌照为22271的所有数据\n",
"data[-(data['VehicleNum']==22271)].head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 获取/删除/定义DataFrame的某一列"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:06.731768Z",
"start_time": "2020-01-18T04:53:06.688883Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
Stime\n","
\n","
\n","
\n","
\n","
0\n","
22:54:04\n","
\n","
\n","
1\n","
18:26:26\n","
\n","
\n","
2\n","
18:35:18\n","
\n","
\n","
3\n","
16:02:46\n","
\n","
\n","
4\n","
21:41:17\n","
\n","
\n","
\n","
],
"text/plain": [
" Stime\n",
"0 22:54:04\n",
"1 18:26:26\n",
"2 18:35:18\n",
"3 16:02:46\n",
"4 21:41:17"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#获取列'Stime',注意,此操作不会影响到data,你在操作后必须将得到的表重新赋值给data才有影响\n",
"data[['Stime']].head(5)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:07.602784Z",
"start_time": "2020-01-18T04:53:07.471136Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
Speed1\n","
\n","
\n","
\n","
\n","
0\n","
22271\n","
22:54:04\n","
114.167000\n","
22.718399\n","
0\n","
0\n","
0\n","
\n","
\n","
1\n","
22271\n","
18:26:26\n","
114.190598\n","
22.647800\n","
0\n","
4\n","
8\n","
\n","
\n","
2\n","
22271\n","
18:35:18\n","
114.201401\n","
22.649700\n","
0\n","
0\n","
0\n","
\n","
\n","
3\n","
22271\n","
16:02:46\n","
114.233498\n","
22.725901\n","
0\n","
24\n","
48\n","
\n","
\n","
4\n","
22271\n","
21:41:17\n","
114.233597\n","
22.720900\n","
0\n","
19\n","
38\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed Speed1\n",
"0 22271 22:54:04 114.167000 22.718399 0 0 0\n",
"1 22271 18:26:26 114.190598 22.647800 0 4 8\n",
"2 22271 18:35:18 114.201401 22.649700 0 0 0\n",
"3 22271 16:02:46 114.233498 22.725901 0 24 48\n",
"4 22271 21:41:17 114.233597 22.720900 0 19 38"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#定义列'Speed1'为Speed列的两倍,注意,此操作会影响到data\n",
"data['Speed1']=data['Speed']*2\n",
"#或者\n",
"data.loc[:,'Speed1']=data['Speed']*2\n",
"\n",
"data.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:45.170336Z",
"start_time": "2020-01-18T04:53:45.109499Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
Speed1\n","
\n","
\n","
\n","
\n","
0\n","
22271\n","
114.167000\n","
22.718399\n","
0\n","
0\n","
0\n","
\n","
\n","
1\n","
22271\n","
114.190598\n","
22.647800\n","
0\n","
4\n","
8\n","
\n","
\n","
2\n","
22271\n","
114.201401\n","
22.649700\n","
0\n","
0\n","
0\n","
\n","
\n","
3\n","
22271\n","
114.233498\n","
22.725901\n","
0\n","
24\n","
48\n","
\n","
\n","
4\n","
22271\n","
114.233597\n","
22.720900\n","
0\n","
19\n","
38\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Lng Lat OpenStatus Speed Speed1\n",
"0 22271 114.167000 22.718399 0 0 0\n",
"1 22271 114.190598 22.647800 0 4 8\n",
"2 22271 114.201401 22.649700 0 0 0\n",
"3 22271 114.233498 22.725901 0 24 48\n",
"4 22271 114.233597 22.720900 0 19 38"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#删除列'Stime',注意,此操作不会影响到data,你在操作后必须将得到的表重新赋值给data才有影响\n",
"data.drop(['Stime'],axis=1).head(5)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:48.056620Z",
"start_time": "2020-01-18T04:53:47.972844Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
\n","
\n","
\n","
\n","
0\n","
22271\n","
22:54:04\n","
114.167000\n","
22.718399\n","
0\n","
0\n","
\n","
\n","
1\n","
22271\n","
18:26:26\n","
114.190598\n","
22.647800\n","
0\n","
4\n","
\n","
\n","
2\n","
22271\n","
18:35:18\n","
114.201401\n","
22.649700\n","
0\n","
0\n","
\n","
\n","
3\n","
22271\n","
16:02:46\n","
114.233498\n","
22.725901\n","
0\n","
24\n","
\n","
\n","
4\n","
22271\n","
21:41:17\n","
114.233597\n","
22.720900\n","
0\n","
19\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed\n",
"0 22271 22:54:04 114.167000 22.718399 0 0\n",
"1 22271 18:26:26 114.190598 22.647800 0 4\n",
"2 22271 18:35:18 114.201401 22.649700 0 0\n",
"3 22271 16:02:46 114.233498 22.725901 0 24\n",
"4 22271 21:41:17 114.233597 22.720900 0 19"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#删除列'Stime',注意,此操作不会影响到data,你在操作后必须将得到的表重新赋值给data才有影响\n",
"#axis=1表示的是,对列进行删除,如果axis=0,则是对行删除,但是建议不用这个功能对行删除\n",
"data = data.drop(['Speed1'],axis=1)\n",
"data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 获取某一列某一行的数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在获取某行某列的数据时,记得一定要用iloc(按表目前排列的顺序取),不能用loc(按index取) \n",
"因为很多时候我们做完筛选、排序等操作,表就不是按index来排列,用loc取列就会取错列,或者直接报错(没有这个index)"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:50.234795Z",
"start_time": "2020-01-18T04:53:50.226816Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'16:02:46'"
]
},
"execution_count": 98,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#获取Stime列的第4行数据\n",
"data['Stime'].iloc[3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 数据清洗"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"首先,我们将数据按车牌、时间排序,特别要注意,data排序后需要再赋值给data,否则没作用"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:52.863800Z",
"start_time": "2020-01-18T04:53:51.815569Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
\n","
\n","
\n","
\n","
39\n","
22271\n","
00:00:49\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
397\n","
22271\n","
00:01:48\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
1413\n","
22271\n","
00:02:47\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
244\n","
22271\n","
00:03:46\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
247\n","
22271\n","
00:04:45\n","
114.268898\n","
22.729500\n","
0\n","
11\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed\n",
"39 22271 00:00:49 114.266502 22.728201 0 0\n",
"397 22271 00:01:48 114.266502 22.728201 0 0\n",
"1413 22271 00:02:47 114.266502 22.728201 0 0\n",
"244 22271 00:03:46 114.266502 22.728201 0 0\n",
"247 22271 00:04:45 114.268898 22.729500 0 11"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 将数据排序,并把排序后的数据赋值给原来的数据\n",
"data = data.sort_values(by = ['VehicleNum','Stime'])\n",
"data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们现在要做的是,用出租车GPS数据识别OD:\n",
"数据按车牌、时间排序后,正常情况下的OpenStatus是这样的:\n",
"\n",
"|OpenStatus | |\n",
"| :-----------: |-----------|\n",
"|0||\n",
"|0||\n",
"|0||\n",
"|0||\n",
"|0||\n",
"|1|←此时乘客上车了|\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|0| ←此时乘客下车了|\n",
"|0||\n",
"|0||\n",
"|0||\n",
"\n",
"但是,也会有时候有数据异常出现,比如:\n",
"\n",
"|OpenStatus | |\n",
"| ----------- |-----------|\n",
"|0||\n",
"|0||\n",
"|0||\n",
"|0||\n",
"|0||\n",
"|1|←异常|\n",
"|0||\n",
"|0||\n",
"|0||\n",
"|0||\n",
"\n",
"或者\n",
"\n",
"|OpenStatus | |\n",
"| ----------- |-----------|\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|0|←异常|\n",
"|1||\n",
"|1||\n",
"|1||\n",
"|1||\n",
"\n",
"前后都是0,突然有一条数据变成1,或者前后都是1,突然变成0。这种异常情况我们是要排除的"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"在pandas的数据处理过程中,我们筛掉不要的数据用下面的方法是最好的\n",
"\n",
" > data[条件]是保留符合条件的数据 \n",
" data[-(条件)]是删除符合条件的数据"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Series的shift()函数能够将数据按顺序后移一位\n",
" Series的shift(-1)函数能够将数据按顺序前移一位\n",
" 因此我们要判断的是,如果:\n",
" 后一位和前一位相等,但是后一位与中间一位不等,那么中间一位的数据就要删除(前一条数据,中间一条数据,后一条数据的车牌必须相等)\n",
" \n",
"所以,我们立马开始筛选,把异常数据删除:"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:55.724226Z",
"start_time": "2020-01-18T04:53:55.717245Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1601307"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#筛选前的数据量\n",
"len(data)"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:56.442306Z",
"start_time": "2020-01-18T04:53:56.438317Z"
}
},
"outputs": [],
"source": [
"###############################你需要在下面写代码##################################\n",
"#这里,把上述异常数据清洗出来\n",
"#用到的条件是:\n",
"#1.后一位和前一位相等\n",
"#2.但是后一位与中间一位不等\n",
"#3.前一条数据,后一条数据的车牌相等\n",
"#4.中间一条数据,后一条数据的车牌相等\n",
"\n",
"#各条件之间的交集可以用\n",
"#data[(条件1)&(条件2)]\n",
"#各条件之间的并集可以用\n",
"#data[(条件1)|(条件2)]\n",
"\n",
"\n",
"#data = \n",
"\n",
"\n",
"###################################################################################"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:57.340933Z",
"start_time": "2020-01-18T04:53:57.057662Z"
}
},
"outputs": [],
"source": [
"############################### 答 案 ##################################\n",
"data = data[-((data['OpenStatus'].shift(-1) == data['OpenStatus'].shift())&\n",
"(data['OpenStatus'].shift(-1) != data['OpenStatus'])&\n",
"(data['VehicleNum'].shift(-1) == data['VehicleNum'].shift())&\n",
"(data['VehicleNum'].shift(-1) == data['VehicleNum']))]\n",
"\n",
"###################################################################################"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:53:58.198611Z",
"start_time": "2020-01-18T04:53:58.191630Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"1598866"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#如果你代码对的话,筛选完了data的数据量应该是\n",
"len(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 识别OD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 乘客上下车的状态变化识别"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:00.064621Z",
"start_time": "2020-01-18T04:54:00.050659Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
\n","
\n","
\n","
\n","
39\n","
22271\n","
00:00:49\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
397\n","
22271\n","
00:01:48\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
1413\n","
22271\n","
00:02:47\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
244\n","
22271\n","
00:03:46\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
\n","
\n","
247\n","
22271\n","
00:04:45\n","
114.268898\n","
22.729500\n","
0\n","
11\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed\n",
"39 22271 00:00:49 114.266502 22.728201 0 0\n",
"397 22271 00:01:48 114.266502 22.728201 0 0\n",
"1413 22271 00:02:47 114.266502 22.728201 0 0\n",
"244 22271 00:03:46 114.266502 22.728201 0 0\n",
"247 22271 00:04:45 114.268898 22.729500 0 11"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"接下来,我们把下一条数据的信息放到前一条数据上,这样子,就能很方便的比较这条数据和下条数据的差异\n",
"\n",
"在字段名加个1,代表后面一条数据的值"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"另外我们定义StatusChange为下一条数据的OpenStatus减去这一条数据的OpenStatus,这样就会出现:\n",
"\n",
"\n",
"\n",
"\n",
"|OpenStatus | OpenStatus1 | StatusChange||\n",
"| ----------- |-----------|||\n",
"|0 | 0 | 0||\n",
"|0 | 0 | 0||\n",
"|0 | 0 | 0||\n",
"|0 | 1 | 1 | ←此时乘客上车了|\n",
"|1 | 1 | 0 | ←此时乘客上车了|\n",
"|1 | 1 | 0||\n",
"|1 | 1 | 0||\n",
"|1 | 1 | 0||\n",
"|1 | 1 | 0||\n",
"|1 | 1 | 0||\n",
"|1 | 1 | 0||\n",
"|1 | 1 | 0||\n",
"|1 | 1 | 0||\n",
"|1 | 0 | -1| ←此时乘客下车了|\n",
"|0 | 0 | 0 | ←此时乘客下车了|\n",
"|0 | 0 | 0||\n",
"|0 | 0 | 0||\n",
"|0 | 0 | 0||\n",
"\n",
"注意到,乘客上车和下车都有两条数据,一般我们认为这两条数据的位置和时间非常接近,都可以认为是下车或者上车的地点"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:01.994461Z",
"start_time": "2020-01-18T04:54:01.989475Z"
}
},
"outputs": [],
"source": [
"###############################你需要在下面写代码##################################\n",
"#让这几个字段的下一条数据赋值给新的字段,在字段名加个1,代表后面一条数据的值\n",
"#data.loc[:,'OpenStatus1'] = \n",
"#data.loc[:,'VehicleNum1'] = \n",
"#data.loc[:,'Lng1'] = \n",
"#data.loc[:,'Lat1'] = \n",
"#data.loc[:,'Stime1'] = \n",
"\n",
"#data.loc[:,'StatusChange'] = \n",
"\n",
"\n",
"###################################################################################"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:03.862529Z",
"start_time": "2020-01-18T04:54:02.623838Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
OpenStatus1\n","
VehicleNum1\n","
Lng1\n","
Lat1\n","
Stime1\n","
StatusChange\n","
\n","
\n","
\n","
\n","
39\n","
22271\n","
00:00:49\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
0.0\n","
22271.0\n","
114.266502\n","
22.728201\n","
00:01:48\n","
0.0\n","
\n","
\n","
397\n","
22271\n","
00:01:48\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
0.0\n","
22271.0\n","
114.266502\n","
22.728201\n","
00:02:47\n","
0.0\n","
\n","
\n","
1413\n","
22271\n","
00:02:47\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
0.0\n","
22271.0\n","
114.266502\n","
22.728201\n","
00:03:46\n","
0.0\n","
\n","
\n","
244\n","
22271\n","
00:03:46\n","
114.266502\n","
22.728201\n","
0\n","
0\n","
0.0\n","
22271.0\n","
114.268898\n","
22.729500\n","
00:04:45\n","
0.0\n","
\n","
\n","
247\n","
22271\n","
00:04:45\n","
114.268898\n","
22.729500\n","
0\n","
11\n","
0.0\n","
22271.0\n","
114.272003\n","
22.731199\n","
00:05:44\n","
0.0\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed \\\n",
"39 22271 00:00:49 114.266502 22.728201 0 0 \n",
"397 22271 00:01:48 114.266502 22.728201 0 0 \n",
"1413 22271 00:02:47 114.266502 22.728201 0 0 \n",
"244 22271 00:03:46 114.266502 22.728201 0 0 \n",
"247 22271 00:04:45 114.268898 22.729500 0 11 \n",
"\n",
" OpenStatus1 VehicleNum1 Lng1 Lat1 Stime1 StatusChange \n",
"39 0.0 22271.0 114.266502 22.728201 00:01:48 0.0 \n",
"397 0.0 22271.0 114.266502 22.728201 00:02:47 0.0 \n",
"1413 0.0 22271.0 114.266502 22.728201 00:03:46 0.0 \n",
"244 0.0 22271.0 114.268898 22.729500 00:04:45 0.0 \n",
"247 0.0 22271.0 114.272003 22.731199 00:05:44 0.0 "
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"############################### 答 案 ##################################\n",
"#让这几个字段的下一条数据赋值给新的字段,在字段名加个1,代表后面一条数据的值\n",
"data.loc[:,'OpenStatus1'] = data['OpenStatus'].shift(-1)\n",
"data.loc[:,'VehicleNum1'] = data['VehicleNum'].shift(-1)\n",
"data.loc[:,'Lng1'] = data['Lng'].shift(-1)\n",
"data.loc[:,'Lat1'] = data['Lat'].shift(-1)\n",
"data.loc[:,'Stime1'] = data['Stime'].shift(-1)\n",
"\n",
"data.loc[:,'StatusChange'] = data['OpenStatus1']-data['OpenStatus']\n",
"\n",
"\n",
"###################################################################################\n",
"data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 将上下车状态整理为OD"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" 这里,我只想保留StatusChange字段为1或者-1的数据,因此要把这些数据筛出来\n",
" 不过,还得加一个条件,就是这条数据和下一条数据的车辆ID必须是相同的"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:05.800844Z",
"start_time": "2020-01-18T04:54:05.795858Z"
}
},
"outputs": [],
"source": [
"###############################你需要在下面写代码##################################\n",
"#两个条件:\n",
"#1.StatusChange字段为1或者-1\n",
"#2.这条数据和下一条数据的车辆ID必须是相同的\n",
"#data=\n",
"\n",
"###################################################################################"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:06.863033Z",
"start_time": "2020-01-18T04:54:06.800172Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
OpenStatus\n","
Speed\n","
OpenStatus1\n","
VehicleNum1\n","
Lng1\n","
Lat1\n","
Stime1\n","
StatusChange\n","
\n","
\n","
\n","
\n","
1548741\n","
22334\n","
00:00:52\n","
114.111130\n","
22.576750\n","
1\n","
13\n","
0.0\n","
22334.0\n","
114.111130\n","
22.576750\n","
00:01:04\n","
-1.0\n","
\n","
\n","
1548351\n","
22334\n","
00:07:44\n","
114.080498\n","
22.554182\n","
0\n","
11\n","
1.0\n","
22334.0\n","
114.080498\n","
22.554182\n","
00:07:57\n","
1.0\n","
\n","
\n","
1549620\n","
22334\n","
00:17:58\n","
114.084915\n","
22.540850\n","
1\n","
2\n","
0.0\n","
22334.0\n","
114.084915\n","
22.540850\n","
00:18:16\n","
-1.0\n","
\n","
\n","
1547182\n","
22334\n","
00:18:56\n","
114.084915\n","
22.540850\n","
0\n","
0\n","
1.0\n","
22334.0\n","
114.084915\n","
22.540850\n","
00:19:05\n","
1.0\n","
\n","
\n","
1547627\n","
22334\n","
00:44:47\n","
114.056236\n","
22.633383\n","
1\n","
3\n","
0.0\n","
22334.0\n","
114.056236\n","
22.633383\n","
00:44:52\n","
-1.0\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat OpenStatus Speed \\\n",
"1548741 22334 00:00:52 114.111130 22.576750 1 13 \n",
"1548351 22334 00:07:44 114.080498 22.554182 0 11 \n",
"1549620 22334 00:17:58 114.084915 22.540850 1 2 \n",
"1547182 22334 00:18:56 114.084915 22.540850 0 0 \n",
"1547627 22334 00:44:47 114.056236 22.633383 1 3 \n",
"\n",
" OpenStatus1 VehicleNum1 Lng1 Lat1 Stime1 \\\n",
"1548741 0.0 22334.0 114.111130 22.576750 00:01:04 \n",
"1548351 1.0 22334.0 114.080498 22.554182 00:07:57 \n",
"1549620 0.0 22334.0 114.084915 22.540850 00:18:16 \n",
"1547182 1.0 22334.0 114.084915 22.540850 00:19:05 \n",
"1547627 0.0 22334.0 114.056236 22.633383 00:44:52 \n",
"\n",
" StatusChange \n",
"1548741 -1.0 \n",
"1548351 1.0 \n",
"1549620 -1.0 \n",
"1547182 1.0 \n",
"1547627 -1.0 "
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"############################### 答 案 ##################################\n",
"data = data[((data['StatusChange'] == 1)|(data['StatusChange'] == -1))\n",
"&(data['VehicleNum'] == data['VehicleNum1'])]\n",
"\n",
"###################################################################################\n",
"data.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:07.783542Z",
"start_time": "2020-01-18T04:54:07.764594Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
Lng\n","
Lat\n","
StatusChange\n","
\n","
\n","
\n","
\n","
1548741\n","
22334\n","
00:00:52\n","
114.111130\n","
22.576750\n","
-1.0\n","
\n","
\n","
1548351\n","
22334\n","
00:07:44\n","
114.080498\n","
22.554182\n","
1.0\n","
\n","
\n","
1549620\n","
22334\n","
00:17:58\n","
114.084915\n","
22.540850\n","
-1.0\n","
\n","
\n","
1547182\n","
22334\n","
00:18:56\n","
114.084915\n","
22.540850\n","
1.0\n","
\n","
\n","
1547627\n","
22334\n","
00:44:47\n","
114.056236\n","
22.633383\n","
-1.0\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime Lng Lat StatusChange\n",
"1548741 22334 00:00:52 114.111130 22.576750 -1.0\n",
"1548351 22334 00:07:44 114.080498 22.554182 1.0\n",
"1549620 22334 00:17:58 114.084915 22.540850 -1.0\n",
"1547182 22334 00:18:56 114.084915 22.540850 1.0\n",
"1547627 22334 00:44:47 114.056236 22.633383 -1.0"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#data数据只保留一些我们需要的字段\n",
"data = data[['VehicleNum','Stime','Lng','Lat','StatusChange']]\n",
"data.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" 我们现在就得到了乘客哪里上车,哪里下车。\n",
" 而我们想要得到的OD数据形式是,每一行记录包括了信息:车辆ID,上车时间,上车地点,下车时间,下车地点\n",
" 这样一行数据就是一个OD\n",
" 所以,接下来的操作就是。。。"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:15.682458Z",
"start_time": "2020-01-18T04:54:15.669456Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
"
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"\n",
"
" \n",
"
\n","
\n","
VehicleNum\n","
Stime\n","
SLng\n","
SLat\n","
ELng\n","
ELat\n","
Etime\n","
\n","
\n","
\n","
\n","
1548351\n","
22334\n","
00:07:44\n","
114.080498\n","
22.554182\n","
114.084915\n","
22.540850\n","
00:17:58\n","
\n","
\n","
1547182\n","
22334\n","
00:18:56\n","
114.084915\n","
22.540850\n","
114.056236\n","
22.633383\n","
00:44:47\n","
\n","
\n","
1547511\n","
22334\n","
02:38:35\n","
114.091637\n","
22.543200\n","
114.093498\n","
22.554382\n","
02:46:52\n","
\n","
\n","
1547789\n","
22334\n","
03:58:46\n","
114.038818\n","
22.553232\n","
114.052299\n","
22.604366\n","
04:13:57\n","
\n","
\n","
1547764\n","
22334\n","
06:30:11\n","
114.031250\n","
22.519550\n","
114.067886\n","
22.521299\n","
06:41:19\n","
\n","
\n","
\n","
],
"text/plain": [
" VehicleNum Stime SLng SLat ELng ELat \\\n",
"1548351 22334 00:07:44 114.080498 22.554182 114.084915 22.540850 \n",
"1547182 22334 00:18:56 114.084915 22.540850 114.056236 22.633383 \n",
"1547511 22334 02:38:35 114.091637 22.543200 114.093498 22.554382 \n",
"1547789 22334 03:58:46 114.038818 22.553232 114.052299 22.604366 \n",
"1547764 22334 06:30:11 114.031250 22.519550 114.067886 22.521299 \n",
"\n",
" Etime \n",
"1548351 00:17:58 \n",
"1547182 00:44:47 \n",
"1547511 02:46:52 \n",
"1547789 04:13:57 \n",
"1547764 06:41:19 "
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"###############################你需要在下面写代码##################################\n",
"\n",
"\n",
"\n",
"\n",
"###################################################################################\n",
"data.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:54:12.768242Z",
"start_time": "2020-01-18T04:54:12.732310Z"
}
},
"outputs": [],
"source": [
"############################### 答 案 ##################################\n",
"data = data.rename(columns = {'Lng':'SLng','Lat':'SLat'})\n",
"data['ELng'] = data['SLng'].shift(-1)\n",
"data['ELat'] = data['SLat'].shift(-1)\n",
"data['Etime'] = data['Stime'].shift(-1)\n",
"data = data[data['StatusChange'] == 1]\n",
"data = data.drop('StatusChange',axis = 1)\n",
"###################################################################################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" 大功告成!接下来就是保存 \n",
" 数据文件夹下有一个TaxiOD.csv文件是我用1.7GB的全部数据计算的OD"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2020-01-18T04:20:14.690034Z",
"start_time": "2020-01-18T04:20:11.579344Z"
}
},
"outputs": [],
"source": [
"data.to_csv(r'data-sample\\TaxiOD-Sample.csv',index = None)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {
"height": "calc(100% - 180px)",
"left": "10px",
"top": "150px",
"width": "211px"
},
"toc_section_display": true,
"toc_window_display": true
}
},
"nbformat": 4,
"nbformat_minor": 2
}
一键复制
编辑
Web IDE
原始数据
按行查看
历史