python出租车数据_1-出租车数据的基础处理，由gps生成OD（pandas）.ipynb

最新推荐文章于 2022-11-07 21:12:40 发布

linux分享猿

最新推荐文章于 2022-11-07 21:12:40 发布

阅读量982

点赞数

文章标签： python出租车数据

本文链接：https://blog.csdn.net/weixin_35260789/article/details/113655751

版权

{

"cells": [

{

"cell_type": "markdown",

"metadata": {},

"source": [

"在这个教程中，你将会学到如何使用python的pandas包对出租车GPS数据进行数据清洗，识别出行OD\n",

"\n",

提供的基础数据是：

数据：
\n",

" 1.出租车原始GPS数据(在data-sample文件夹下，原始数据集的抽样500辆车的数据)

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"[pandas包的简介](https://baike.baidu.com/item/pandas/17209606?fr=aladdin)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# 读取数据"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"首先，读取出租车数据。"

]

{

"cell_type": "code",

"execution_count": 2,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:51:53.552930Z",

"start_time": "2020-01-18T04:51:52.397018Z"

}

"outputs": [],

"source": [

"import pandas as pd\n",

"#读取数据\n",

"data = pd.read_csv(r'data-sample/TaxiData-Sample',header = None)\n",

"#给数据命名列\n",

"data.columns = ['VehicleNum', 'Stime', 'Lng', 'Lat', 'OpenStatus', 'Speed']"

]

{

"cell_type": "code",

"execution_count": 3,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:51:58.299239Z",

"start_time": "2020-01-18T04:51:58.271312Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

\n",

0\n",

22271\n",

22:54:04\n",

114.167000\n",

22.718399\n",

0\n",

\n",

1\n",

22271\n",

18:26:26\n",

114.190598\n",

22.647800\n",

0\n",

4\n",

\n",

2\n",

22271\n",

18:35:18\n",

114.201401\n",

22.649700\n",

0\n",

\n",

3\n",

22271\n",

16:02:46\n",

114.233498\n",

22.725901\n",

0\n",

24\n",

\n",

4\n",

22271\n",

21:41:17\n",

114.233597\n",

22.720900\n",

0\n",

19\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"0 22271 22:54:04 114.167000 22.718399 0 0\n",

"1 22271 18:26:26 114.190598 22.647800 0 4\n",

"2 22271 18:35:18 114.201401 22.649700 0 0\n",

"3 22271 16:02:46 114.233498 22.725901 0 24\n",

"4 22271 21:41:17 114.233597 22.720900 0 19"

]

"execution_count": 3,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#显示数据的前5行\n",

"data.head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"数据的格式：\n",

"\n",

">VehicleNum —— 车牌 \n",

"Stime —— 时间 \n",

"Lng —— 经度 \n",

"Lat —— 纬度 \n",

"OpenStatus —— 是否有乘客(0没乘客，1有乘客) \n",

"Speed —— 速度 "

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# 基础的数据操作"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## DataFrame和Series"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"DataFrame和Series\n",

"\n",

" > 当我们读一个数据的时候，我们读进来的就是DataFrame格式的数据表，而一个DataFrame中的每一列，则为一个Series \n",

" 也就是说，DataFrame由多个Series组成\n"

]

{

"cell_type": "code",

"execution_count": 87,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:25.713432Z",

"start_time": "2020-01-18T04:52:25.708450Z"

}

"outputs": [

{

"data": {

"text/plain": [

"pandas.core.frame.DataFrame"

]

"execution_count": 87,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"type(data)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"如果我们想取DataFrame的某一列，想得到的是Series，那么直接用以下代码\n",

"\n",

" > data[列名]"

]

{

"cell_type": "code",

"execution_count": 88,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:32.097575Z",

"start_time": "2020-01-18T04:52:32.090592Z"

}

"outputs": [

{

"data": {

"text/plain": [

"pandas.core.series.Series"

]

"execution_count": 88,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"type(data['Lng'])"

]

{

"cell_type": "markdown",

"metadata": {

"ExecuteTime": {

"end_time": "2019-09-06T09:22:43.642625Z",

"start_time": "2019-09-06T09:22:43.638487Z"

}

"source": [

"如果我们想取DataFrame的某一列或者某几列，想得到的是DataFrame，那么直接用以下代码\n",

"\n",

"> data2[[列名,列名]]"

]

{

"cell_type": "code",

"execution_count": 89,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:33.013124Z",

"start_time": "2020-01-18T04:52:32.990186Z"

}

"outputs": [

{

"data": {

"text/plain": [

"pandas.core.frame.DataFrame"

]

"execution_count": 89,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"type(data[['Lng']])"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 数据的筛选"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"数据的筛选:\n",

"\n",

" 在筛选数据的时候，我们一般用data[条件]的格式\n",

" 其中的条件，是对data每一行数据的true和false布尔变量的Series"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

" 例如，我们想得到车牌照为22271的所有数据\n",

" 首先我们要获得一个布尔变量的Series，这个Series对应的是data的每一行，如果车牌照为\"粤B4H2K8\"则为true，不是则为false\n",

" 这样子的Series很容易获得，只需要\n",

" data['VehicleNum']==22271"

]

{

"cell_type": "code",

"execution_count": 90,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:44.078571Z",

"start_time": "2020-01-18T04:52:44.049646Z"

}

"outputs": [

{

"data": {

"text/plain": [

"0 True\n",

"1 True\n",

"2 True\n",

"3 True\n",

"4 True\n",

"Name: VehicleNum, dtype: bool"

]

"execution_count": 90,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"(data['VehicleNum']==22271).head(5)"

]

{

"cell_type": "code",

"execution_count": 92,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:51.723416Z",

"start_time": "2020-01-18T04:52:51.688510Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

\n",

0\n",

22271\n",

22:54:04\n",

114.167000\n",

22.718399\n",

0\n",

\n",

1\n",

22271\n",

18:26:26\n",

114.190598\n",

22.647800\n",

0\n",

4\n",

\n",

2\n",

22271\n",

18:35:18\n",

114.201401\n",

22.649700\n",

0\n",

\n",

3\n",

22271\n",

16:02:46\n",

114.233498\n",

22.725901\n",

0\n",

24\n",

\n",

4\n",

22271\n",

21:41:17\n",

114.233597\n",

22.720900\n",

0\n",

19\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"0 22271 22:54:04 114.167000 22.718399 0 0\n",

"1 22271 18:26:26 114.190598 22.647800 0 4\n",

"2 22271 18:35:18 114.201401 22.649700 0 0\n",

"3 22271 16:02:46 114.233498 22.725901 0 24\n",

"4 22271 21:41:17 114.233597 22.720900 0 19"

]

"execution_count": 92,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#得到车牌照为22271的所有数据\n",

"data[data['VehicleNum']==22271].head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"如果我们想要对data删去所有牌照为22271的数据，所需要的操作也很简单：\n",

" \n",

"> data[-(条件)]\n",

" \n",

"注意，如果你不想要奇奇怪怪的bug出现，请按照我上面给的格式来筛选数据，建议不要用data.drop()来删数据，，data.drop()只在你想删除某一列的时候用"

]

{

"cell_type": "code",

"execution_count": 93,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:05.516849Z",

"start_time": "2020-01-18T04:53:05.403154Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

\n",

1437\n",

35807\n",

01:53:46\n",

113.809898\n",

22.626801\n",

0\n",

\n",

1438\n",

35807\n",

01:43:46\n",

113.813301\n",

22.623600\n",

0\n",

\n",

1439\n",

35807\n",

01:14:15\n",

113.847000\n",

22.594700\n",

0\n",

41\n",

\n",

1440\n",

35807\n",

02:01:41\n",

113.852501\n",

22.625700\n",

0\n",

22\n",

\n",

1441\n",

35807\n",

01:01:59\n",

113.897003\n",

22.551901\n",

0\n",

42\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"1437 35807 01:53:46 113.809898 22.626801 0 0\n",

"1438 35807 01:43:46 113.813301 22.623600 0 0\n",

"1439 35807 01:14:15 113.847000 22.594700 0 41\n",

"1440 35807 02:01:41 113.852501 22.625700 0 22\n",

"1441 35807 01:01:59 113.897003 22.551901 0 42"

]

"execution_count": 93,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#删除车牌照为22271的所有数据\n",

"data[-(data['VehicleNum']==22271)].head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 获取/删除/定义DataFrame的某一列"

]

{

"cell_type": "code",

"execution_count": 94,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:06.731768Z",

"start_time": "2020-01-18T04:53:06.688883Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

Stime\n",

\n",

0\n",

22:54:04\n",

\n",

1\n",

18:26:26\n",

\n",

2\n",

18:35:18\n",

\n",

3\n",

16:02:46\n",

\n",

4\n",

21:41:17\n",

\n",

"text/plain": [

" Stime\n",

"0 22:54:04\n",

"1 18:26:26\n",

"2 18:35:18\n",

"3 16:02:46\n",

"4 21:41:17"

]

"execution_count": 94,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#获取列'Stime',注意，此操作不会影响到data，你在操作后必须将得到的表重新赋值给data才有影响\n",

"data[['Stime']].head(5)"

]

{

"cell_type": "code",

"execution_count": 95,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:07.602784Z",

"start_time": "2020-01-18T04:53:07.471136Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

Speed1\n",

\n",

0\n",

22271\n",

22:54:04\n",

114.167000\n",

22.718399\n",

0\n",

\n",

1\n",

22271\n",

18:26:26\n",

114.190598\n",

22.647800\n",

0\n",

4\n",

8\n",

\n",

2\n",

22271\n",

18:35:18\n",

114.201401\n",

22.649700\n",

0\n",

\n",

3\n",

22271\n",

16:02:46\n",

114.233498\n",

22.725901\n",

0\n",

24\n",

48\n",

\n",

4\n",

22271\n",

21:41:17\n",

114.233597\n",

22.720900\n",

0\n",

19\n",

38\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed Speed1\n",

"0 22271 22:54:04 114.167000 22.718399 0 0 0\n",

"1 22271 18:26:26 114.190598 22.647800 0 4 8\n",

"2 22271 18:35:18 114.201401 22.649700 0 0 0\n",

"3 22271 16:02:46 114.233498 22.725901 0 24 48\n",

"4 22271 21:41:17 114.233597 22.720900 0 19 38"

]

"execution_count": 95,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#定义列'Speed1'为Speed列的两倍,注意，此操作会影响到data\n",

"data['Speed1']=data['Speed']*2\n",

"#或者\n",

"data.loc[:,'Speed1']=data['Speed']*2\n",

"\n",

"data.head(5)"

]

{

"cell_type": "code",

"execution_count": 96,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:45.170336Z",

"start_time": "2020-01-18T04:53:45.109499Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

Speed1\n",

\n",

0\n",

22271\n",

114.167000\n",

22.718399\n",

0\n",

\n",

1\n",

22271\n",

114.190598\n",

22.647800\n",

0\n",

4\n",

8\n",

\n",

2\n",

22271\n",

114.201401\n",

22.649700\n",

0\n",

\n",

3\n",

22271\n",

114.233498\n",

22.725901\n",

0\n",

24\n",

48\n",

\n",

4\n",

22271\n",

114.233597\n",

22.720900\n",

0\n",

19\n",

38\n",

\n",

"text/plain": [

" VehicleNum Lng Lat OpenStatus Speed Speed1\n",

"0 22271 114.167000 22.718399 0 0 0\n",

"1 22271 114.190598 22.647800 0 4 8\n",

"2 22271 114.201401 22.649700 0 0 0\n",

"3 22271 114.233498 22.725901 0 24 48\n",

"4 22271 114.233597 22.720900 0 19 38"

]

"execution_count": 96,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#删除列'Stime',注意，此操作不会影响到data，你在操作后必须将得到的表重新赋值给data才有影响\n",

"data.drop(['Stime'],axis=1).head(5)"

]

{

"cell_type": "code",

"execution_count": 97,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:48.056620Z",

"start_time": "2020-01-18T04:53:47.972844Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

\n",

0\n",

22271\n",

22:54:04\n",

114.167000\n",

22.718399\n",

0\n",

\n",

1\n",

22271\n",

18:26:26\n",

114.190598\n",

22.647800\n",

0\n",

4\n",

\n",

2\n",

22271\n",

18:35:18\n",

114.201401\n",

22.649700\n",

0\n",

\n",

3\n",

22271\n",

16:02:46\n",

114.233498\n",

22.725901\n",

0\n",

24\n",

\n",

4\n",

22271\n",

21:41:17\n",

114.233597\n",

22.720900\n",

0\n",

19\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"0 22271 22:54:04 114.167000 22.718399 0 0\n",

"1 22271 18:26:26 114.190598 22.647800 0 4\n",

"2 22271 18:35:18 114.201401 22.649700 0 0\n",

"3 22271 16:02:46 114.233498 22.725901 0 24\n",

"4 22271 21:41:17 114.233597 22.720900 0 19"

]

"execution_count": 97,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#删除列'Stime',注意，此操作不会影响到data，你在操作后必须将得到的表重新赋值给data才有影响\n",

"#axis=1表示的是，对列进行删除，如果axis=0，则是对行删除，但是建议不用这个功能对行删除\n",

"data = data.drop(['Speed1'],axis=1)\n",

"data.head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 获取某一列某一行的数据"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"在获取某行某列的数据时，记得一定要用iloc(按表目前排列的顺序取)，不能用loc(按index取) \n",

"因为很多时候我们做完筛选、排序等操作，表就不是按index来排列，用loc取列就会取错列，或者直接报错(没有这个index)"

]

{

"cell_type": "code",

"execution_count": 98,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:50.234795Z",

"start_time": "2020-01-18T04:53:50.226816Z"

}

"outputs": [

{

"data": {

"text/plain": [

"'16:02:46'"

]

"execution_count": 98,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#获取Stime列的第4行数据\n",

"data['Stime'].iloc[3]"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# 数据清洗"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"首先，我们将数据按车牌、时间排序，特别要注意，data排序后需要再赋值给data，否则没作用"

]

{

"cell_type": "code",

"execution_count": 99,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:52.863800Z",

"start_time": "2020-01-18T04:53:51.815569Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

\n",

39\n",

22271\n",

00:00:49\n",

114.266502\n",

22.728201\n",

0\n",

\n",

397\n",

22271\n",

00:01:48\n",

114.266502\n",

22.728201\n",

0\n",

\n",

1413\n",

22271\n",

00:02:47\n",

114.266502\n",

22.728201\n",

0\n",

\n",

244\n",

22271\n",

00:03:46\n",

114.266502\n",

22.728201\n",

0\n",

\n",

247\n",

22271\n",

00:04:45\n",

114.268898\n",

22.729500\n",

0\n",

11\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"39 22271 00:00:49 114.266502 22.728201 0 0\n",

"397 22271 00:01:48 114.266502 22.728201 0 0\n",

"1413 22271 00:02:47 114.266502 22.728201 0 0\n",

"244 22271 00:03:46 114.266502 22.728201 0 0\n",

"247 22271 00:04:45 114.268898 22.729500 0 11"

]

"execution_count": 99,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"# 将数据排序,并把排序后的数据赋值给原来的数据\n",

"data = data.sort_values(by = ['VehicleNum','Stime'])\n",

"data.head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"我们现在要做的是，用出租车GPS数据识别OD：\n",

"数据按车牌、时间排序后，正常情况下的OpenStatus是这样的：\n",

"\n",

"|OpenStatus | |\n",

"| :-----------: |-----------|\n",

"|0||\n",

"|1|←此时乘客上车了|\n",

"|1||\n",

"|0| ←此时乘客下车了|\n",

"|0||\n",

"\n",

"但是，也会有时候有数据异常出现，比如：\n",

"\n",

"|OpenStatus | |\n",

"| ----------- |-----------|\n",

"|0||\n",

"|1|←异常|\n",

"|0||\n",

"\n",

"或者\n",

"\n",

"|OpenStatus | |\n",

"| ----------- |-----------|\n",

"|1||\n",

"|0|←异常|\n",

"|1||\n",

"\n",

"前后都是0，突然有一条数据变成1，或者前后都是1，突然变成0。这种异常情况我们是要排除的"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"在pandas的数据处理过程中，我们筛掉不要的数据用下面的方法是最好的\n",

"\n",

" > data[条件]是保留符合条件的数据 \n",

" data[-(条件)]是删除符合条件的数据"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

" Series的shift()函数能够将数据按顺序后移一位\n",

" Series的shift(-1)函数能够将数据按顺序前移一位\n",

" 因此我们要判断的是，如果：\n",

" 后一位和前一位相等，但是后一位与中间一位不等，那么中间一位的数据就要删除(前一条数据，中间一条数据，后一条数据的车牌必须相等)\n",

" \n",

"所以，我们立马开始筛选，把异常数据删除："

]

{

"cell_type": "code",

"execution_count": 100,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:55.724226Z",

"start_time": "2020-01-18T04:53:55.717245Z"

}

"outputs": [

{

"data": {

"text/plain": [

"1601307"

]

"execution_count": 100,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#筛选前的数据量\n",

"len(data)"

]

{

"cell_type": "code",

"execution_count": 101,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:56.442306Z",

"start_time": "2020-01-18T04:53:56.438317Z"

}

"outputs": [],

"source": [

"###############################你需要在下面写代码##################################\n",

"#这里，把上述异常数据清洗出来\n",

"#用到的条件是：\n",

"#1.后一位和前一位相等\n",

"#2.但是后一位与中间一位不等\n",

"#3.前一条数据，后一条数据的车牌相等\n",

"#4.中间一条数据，后一条数据的车牌相等\n",

"\n",

"#各条件之间的交集可以用\n",

"#data[(条件1)&(条件2)]\n",

"#各条件之间的并集可以用\n",

"#data[(条件1)|(条件2)]\n",

"\n",

"#data = \n",

"\n",

"###################################################################################"

]

{

"cell_type": "code",

"execution_count": 102,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:57.340933Z",

"start_time": "2020-01-18T04:53:57.057662Z"

}

"outputs": [],

"source": [

"############################### 答案 ##################################\n",

"data = data[-((data['OpenStatus'].shift(-1) == data['OpenStatus'].shift())&\n",

"(data['OpenStatus'].shift(-1) != data['OpenStatus'])&\n",

"(data['VehicleNum'].shift(-1) == data['VehicleNum'].shift())&\n",

"(data['VehicleNum'].shift(-1) == data['VehicleNum']))]\n",

"\n",

"###################################################################################"

]

{

"cell_type": "code",

"execution_count": 103,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:53:58.198611Z",

"start_time": "2020-01-18T04:53:58.191630Z"

}

"outputs": [

{

"data": {

"text/plain": [

"1598866"

]

"execution_count": 103,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#如果你代码对的话，筛选完了data的数据量应该是\n",

"len(data)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# 识别OD"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 乘客上下车的状态变化识别"

]

{

"cell_type": "code",

"execution_count": 104,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:00.064621Z",

"start_time": "2020-01-18T04:54:00.050659Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

\n",

39\n",

22271\n",

00:00:49\n",

114.266502\n",

22.728201\n",

0\n",

\n",

397\n",

22271\n",

00:01:48\n",

114.266502\n",

22.728201\n",

0\n",

\n",

1413\n",

22271\n",

00:02:47\n",

114.266502\n",

22.728201\n",

0\n",

\n",

244\n",

22271\n",

00:03:46\n",

114.266502\n",

22.728201\n",

0\n",

\n",

247\n",

22271\n",

00:04:45\n",

114.268898\n",

22.729500\n",

0\n",

11\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"39 22271 00:00:49 114.266502 22.728201 0 0\n",

"397 22271 00:01:48 114.266502 22.728201 0 0\n",

"1413 22271 00:02:47 114.266502 22.728201 0 0\n",

"244 22271 00:03:46 114.266502 22.728201 0 0\n",

"247 22271 00:04:45 114.268898 22.729500 0 11"

]

"execution_count": 104,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"data.head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"接下来，我们把下一条数据的信息放到前一条数据上，这样子，就能很方便的比较这条数据和下条数据的差异\n",

"\n",

"在字段名加个1，代表后面一条数据的值"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"另外我们定义StatusChange为下一条数据的OpenStatus减去这一条数据的OpenStatus，这样就会出现：\n",

"\n",

"|OpenStatus | OpenStatus1 | StatusChange||\n",

"| ----------- |-----------|||\n",

"|0 | 0 | 0||\n",

"|0 | 1 | 1 | ←此时乘客上车了|\n",

"|1 | 1 | 0 | ←此时乘客上车了|\n",

"|1 | 1 | 0||\n",

"|1 | 0 | -1| ←此时乘客下车了|\n",

"|0 | 0 | 0 | ←此时乘客下车了|\n",

"|0 | 0 | 0||\n",

"\n",

"注意到，乘客上车和下车都有两条数据，一般我们认为这两条数据的位置和时间非常接近，都可以认为是下车或者上车的地点"

]

{

"cell_type": "code",

"execution_count": 105,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:01.994461Z",

"start_time": "2020-01-18T04:54:01.989475Z"

}

"outputs": [],

"source": [

"###############################你需要在下面写代码##################################\n",

"#让这几个字段的下一条数据赋值给新的字段，在字段名加个1，代表后面一条数据的值\n",

"#data.loc[:,'OpenStatus1'] = \n",

"#data.loc[:,'VehicleNum1'] = \n",

"#data.loc[:,'Lng1'] = \n",

"#data.loc[:,'Lat1'] = \n",

"#data.loc[:,'Stime1'] = \n",

"\n",

"#data.loc[:,'StatusChange'] = \n",

"\n",

"###################################################################################"

]

{

"cell_type": "code",

"execution_count": 106,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:03.862529Z",

"start_time": "2020-01-18T04:54:02.623838Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

OpenStatus1\n",

VehicleNum1\n",

Lng1\n",

Lat1\n",

Stime1\n",

StatusChange\n",

\n",

39\n",

22271\n",

00:00:49\n",

114.266502\n",

22.728201\n",

0\n",

0.0\n",

22271.0\n",

114.266502\n",

22.728201\n",

00:01:48\n",

0.0\n",

\n",

397\n",

22271\n",

00:01:48\n",

114.266502\n",

22.728201\n",

0\n",

0.0\n",

22271.0\n",

114.266502\n",

22.728201\n",

00:02:47\n",

0.0\n",

\n",

1413\n",

22271\n",

00:02:47\n",

114.266502\n",

22.728201\n",

0\n",

0.0\n",

22271.0\n",

114.266502\n",

22.728201\n",

00:03:46\n",

0.0\n",

\n",

244\n",

22271\n",

00:03:46\n",

114.266502\n",

22.728201\n",

0\n",

0.0\n",

22271.0\n",

114.268898\n",

22.729500\n",

00:04:45\n",

0.0\n",

\n",

247\n",

22271\n",

00:04:45\n",

114.268898\n",

22.729500\n",

0\n",

11\n",

0.0\n",

22271.0\n",

114.272003\n",

22.731199\n",

00:05:44\n",

0.0\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed \\\n",

"39 22271 00:00:49 114.266502 22.728201 0 0 \n",

"397 22271 00:01:48 114.266502 22.728201 0 0 \n",

"1413 22271 00:02:47 114.266502 22.728201 0 0 \n",

"244 22271 00:03:46 114.266502 22.728201 0 0 \n",

"247 22271 00:04:45 114.268898 22.729500 0 11 \n",

"\n",

" OpenStatus1 VehicleNum1 Lng1 Lat1 Stime1 StatusChange \n",

"39 0.0 22271.0 114.266502 22.728201 00:01:48 0.0 \n",

"397 0.0 22271.0 114.266502 22.728201 00:02:47 0.0 \n",

"1413 0.0 22271.0 114.266502 22.728201 00:03:46 0.0 \n",

"244 0.0 22271.0 114.268898 22.729500 00:04:45 0.0 \n",

"247 0.0 22271.0 114.272003 22.731199 00:05:44 0.0 "

]

"execution_count": 106,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"############################### 答案 ##################################\n",

"#让这几个字段的下一条数据赋值给新的字段，在字段名加个1，代表后面一条数据的值\n",

"data.loc[:,'OpenStatus1'] = data['OpenStatus'].shift(-1)\n",

"data.loc[:,'VehicleNum1'] = data['VehicleNum'].shift(-1)\n",

"data.loc[:,'Lng1'] = data['Lng'].shift(-1)\n",

"data.loc[:,'Lat1'] = data['Lat'].shift(-1)\n",

"data.loc[:,'Stime1'] = data['Stime'].shift(-1)\n",

"\n",

"data.loc[:,'StatusChange'] = data['OpenStatus1']-data['OpenStatus']\n",

"\n",

"###################################################################################\n",

"data.head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## 将上下车状态整理为OD"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

" 这里，我只想保留StatusChange字段为1或者-1的数据，因此要把这些数据筛出来\n",

" 不过，还得加一个条件，就是这条数据和下一条数据的车辆ID必须是相同的"

]

{

"cell_type": "code",

"execution_count": 107,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:05.800844Z",

"start_time": "2020-01-18T04:54:05.795858Z"

}

"outputs": [],

"source": [

"###############################你需要在下面写代码##################################\n",

"#两个条件：\n",

"#1.StatusChange字段为1或者-1\n",

"#2.这条数据和下一条数据的车辆ID必须是相同的\n",

"#data=\n",

"\n",

"###################################################################################"

]

{

"cell_type": "code",

"execution_count": 108,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:06.863033Z",

"start_time": "2020-01-18T04:54:06.800172Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

OpenStatus\n",

Speed\n",

OpenStatus1\n",

VehicleNum1\n",

Lng1\n",

Lat1\n",

Stime1\n",

StatusChange\n",

\n",

1548741\n",

22334\n",

00:00:52\n",

114.111130\n",

22.576750\n",

1\n",

13\n",

0.0\n",

22334.0\n",

114.111130\n",

22.576750\n",

00:01:04\n",

-1.0\n",

\n",

1548351\n",

22334\n",

00:07:44\n",

114.080498\n",

22.554182\n",

0\n",

11\n",

1.0\n",

22334.0\n",

114.080498\n",

22.554182\n",

00:07:57\n",

1.0\n",

\n",

1549620\n",

22334\n",

00:17:58\n",

114.084915\n",

22.540850\n",

1\n",

2\n",

0.0\n",

22334.0\n",

114.084915\n",

22.540850\n",

00:18:16\n",

-1.0\n",

\n",

1547182\n",

22334\n",

00:18:56\n",

114.084915\n",

22.540850\n",

0\n",

1.0\n",

22334.0\n",

114.084915\n",

22.540850\n",

00:19:05\n",

1.0\n",

\n",

1547627\n",

22334\n",

00:44:47\n",

114.056236\n",

22.633383\n",

1\n",

3\n",

0.0\n",

22334.0\n",

114.056236\n",

22.633383\n",

00:44:52\n",

-1.0\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed \\\n",

"1548741 22334 00:00:52 114.111130 22.576750 1 13 \n",

"1548351 22334 00:07:44 114.080498 22.554182 0 11 \n",

"1549620 22334 00:17:58 114.084915 22.540850 1 2 \n",

"1547182 22334 00:18:56 114.084915 22.540850 0 0 \n",

"1547627 22334 00:44:47 114.056236 22.633383 1 3 \n",

"\n",

" OpenStatus1 VehicleNum1 Lng1 Lat1 Stime1 \\\n",

"1548741 0.0 22334.0 114.111130 22.576750 00:01:04 \n",

"1548351 1.0 22334.0 114.080498 22.554182 00:07:57 \n",

"1549620 0.0 22334.0 114.084915 22.540850 00:18:16 \n",

"1547182 1.0 22334.0 114.084915 22.540850 00:19:05 \n",

"1547627 0.0 22334.0 114.056236 22.633383 00:44:52 \n",

"\n",

" StatusChange \n",

"1548741 -1.0 \n",

"1548351 1.0 \n",

"1549620 -1.0 \n",

"1547182 1.0 \n",

"1547627 -1.0 "

]

"execution_count": 108,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"############################### 答案 ##################################\n",

"data = data[((data['StatusChange'] == 1)|(data['StatusChange'] == -1))\n",

"&(data['VehicleNum'] == data['VehicleNum1'])]\n",

"\n",

"###################################################################################\n",

"data.head(5)"

]

{

"cell_type": "code",

"execution_count": 109,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:07.783542Z",

"start_time": "2020-01-18T04:54:07.764594Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

Lng\n",

Lat\n",

StatusChange\n",

\n",

1548741\n",

22334\n",

00:00:52\n",

114.111130\n",

22.576750\n",

-1.0\n",

\n",

1548351\n",

22334\n",

00:07:44\n",

114.080498\n",

22.554182\n",

1.0\n",

\n",

1549620\n",

22334\n",

00:17:58\n",

114.084915\n",

22.540850\n",

-1.0\n",

\n",

1547182\n",

22334\n",

00:18:56\n",

114.084915\n",

22.540850\n",

1.0\n",

\n",

1547627\n",

22334\n",

00:44:47\n",

114.056236\n",

22.633383\n",

-1.0\n",

\n",

"text/plain": [

" VehicleNum Stime Lng Lat StatusChange\n",

"1548741 22334 00:00:52 114.111130 22.576750 -1.0\n",

"1548351 22334 00:07:44 114.080498 22.554182 1.0\n",

"1549620 22334 00:17:58 114.084915 22.540850 -1.0\n",

"1547182 22334 00:18:56 114.084915 22.540850 1.0\n",

"1547627 22334 00:44:47 114.056236 22.633383 -1.0"

]

"execution_count": 109,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"#data数据只保留一些我们需要的字段\n",

"data = data[['VehicleNum','Stime','Lng','Lat','StatusChange']]\n",

"data.head(5)"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

" 我们现在就得到了乘客哪里上车，哪里下车。\n",

" 而我们想要得到的OD数据形式是，每一行记录包括了信息：车辆ID，上车时间，上车地点，下车时间，下车地点\n",

" 这样一行数据就是一个OD\n",

" 所以，接下来的操作就是。。。"

]

{

"cell_type": "code",

"execution_count": 111,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:15.682458Z",

"start_time": "2020-01-18T04:54:15.669456Z"

}

"outputs": [

{

"data": {

"text/html": [

\n",

" .dataframe tbody tr th:only-of-type {\n",

" vertical-align: middle;\n",

" }\n",

"\n",

" .dataframe tbody tr th {\n",

" vertical-align: top;\n",

" }\n",

"\n",

" .dataframe thead th {\n",

" text-align: right;\n",

" }\n",

"\n",

" \n",

\n",

VehicleNum\n",

Stime\n",

SLng\n",

SLat\n",

ELng\n",

ELat\n",

Etime\n",

\n",

1548351\n",

22334\n",

00:07:44\n",

114.080498\n",

22.554182\n",

114.084915\n",

22.540850\n",

00:17:58\n",

\n",

1547182\n",

22334\n",

00:18:56\n",

114.084915\n",

22.540850\n",

114.056236\n",

22.633383\n",

00:44:47\n",

\n",

1547511\n",

22334\n",

02:38:35\n",

114.091637\n",

22.543200\n",

114.093498\n",

22.554382\n",

02:46:52\n",

\n",

1547789\n",

22334\n",

03:58:46\n",

114.038818\n",

22.553232\n",

114.052299\n",

22.604366\n",

04:13:57\n",

\n",

1547764\n",

22334\n",

06:30:11\n",

114.031250\n",

22.519550\n",

114.067886\n",

22.521299\n",

06:41:19\n",

\n",

"text/plain": [

" VehicleNum Stime SLng SLat ELng ELat \\\n",

"1548351 22334 00:07:44 114.080498 22.554182 114.084915 22.540850 \n",

"1547182 22334 00:18:56 114.084915 22.540850 114.056236 22.633383 \n",

"1547511 22334 02:38:35 114.091637 22.543200 114.093498 22.554382 \n",

"1547789 22334 03:58:46 114.038818 22.553232 114.052299 22.604366 \n",

"1547764 22334 06:30:11 114.031250 22.519550 114.067886 22.521299 \n",

"\n",

" Etime \n",

"1548351 00:17:58 \n",

"1547182 00:44:47 \n",

"1547511 02:46:52 \n",

"1547789 04:13:57 \n",

"1547764 06:41:19 "

]

"execution_count": 111,

"metadata": {},

"output_type": "execute_result"

}

"source": [

"###############################你需要在下面写代码##################################\n",

"\n",

"###################################################################################\n",

"data.head(5)"

]

{

"cell_type": "code",

"execution_count": 110,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:54:12.768242Z",

"start_time": "2020-01-18T04:54:12.732310Z"

}

"outputs": [],

"source": [

"############################### 答案 ##################################\n",

"data = data.rename(columns = {'Lng':'SLng','Lat':'SLat'})\n",

"data['ELng'] = data['SLng'].shift(-1)\n",

"data['ELat'] = data['SLat'].shift(-1)\n",

"data['Etime'] = data['Stime'].shift(-1)\n",

"data = data[data['StatusChange'] == 1]\n",

"data = data.drop('StatusChange',axis = 1)\n",

"###################################################################################"

]

{

"cell_type": "markdown",

"metadata": {},

"source": [

" 大功告成！接下来就是保存 \n",

" 数据文件夹下有一个TaxiOD.csv文件是我用1.7GB的全部数据计算的OD"

]

{

"cell_type": "code",

"execution_count": 45,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:20:14.690034Z",

"start_time": "2020-01-18T04:20:11.579344Z"

}

"outputs": [],

"source": [

"data.to_csv(r'data-sample\\TaxiOD-Sample.csv',index = None)"

]

}

"metadata": {

"kernelspec": {

"display_name": "Python 3",

"language": "python",

"name": "python3"

"language_info": {

"codemirror_mode": {

"name": "ipython",

"version": 3

"file_extension": ".py",

"mimetype": "text/x-python",

"name": "python",

"nbconvert_exporter": "python",

"pygments_lexer": "ipython3",

"version": "3.6.5"

"toc": {

"base_numbering": 1,

"nav_menu": {},

"number_sections": true,

"sideBar": true,

"skip_h1_title": false,

"title_cell": "Table of Contents",

"title_sidebar": "Contents",

"toc_cell": false,

"toc_position": {

"height": "calc(100% - 180px)",

"left": "10px",

"top": "150px",

"width": "211px"

"toc_section_display": true,

"toc_window_display": true

}

"nbformat": 4,

"nbformat_minor": 2

}

一键复制

编辑

Web IDE

原始数据

按行查看

历史

linux分享猿

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
python出租车数据_1-出租车数据的基础处理，由gps生成OD（pandas）.ipynb

{"cells": [{"cell_type": "markdown","metadata": {},"source": ["在这个教程中，你将会学到如何使用python的pandas包对出租车GPS数据进行数据清洗，识别出行OD\n","\n","提供的基础数据是：数据： \n"," 1.出租车原始GPS数据(在data-sample文件夹下，原始数据集的抽样500辆车的数据)"...
复制链接

扫一扫

python出租车数据_1-出租车数据的基础处理，由gps生成OD（pandas）.ipynb

提供的基础数据是：

“相关推荐”对你有帮助么？