python处理出租车轨迹数据_1-出租车数据的基础处理,由gps生成OD(pandas).ipynb...

{

"cells": [

{

"cell_type": "markdown",

"metadata": {},

"source": [

"In this tutorial you will learn how to use Python's pandas package to clean taxi GPS data and identify trip ODs (origin-destination pairs)\n",
"\n",
"The base data provided:\n",
"\n",
" 1. Raw taxi GPS data (in the data-sample folder; a sample of 500 vehicles drawn from the full dataset)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"[An introduction to the pandas package](https://baike.baidu.com/item/pandas/17209606?fr=aladdin)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# Reading the data"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"First, read in the taxi data."

]

},

{

"cell_type": "code",

"execution_count": 2,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:51:53.552930Z",

"start_time": "2020-01-18T04:51:52.397018Z"

}

},

"outputs": [],

"source": [

"import pandas as pd\n",

"# Read the data\n",

"data = pd.read_csv(r'data-sample/TaxiData-Sample',header = None)\n",

"# Name the columns\n",

"data.columns = ['VehicleNum', 'Stime', 'Lng', 'Lat', 'OpenStatus', 'Speed']"

]

},

{

"cell_type": "code",

"execution_count": 3,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:51:58.299239Z",

"start_time": "2020-01-18T04:51:58.271312Z"

}

},

"outputs": [

{

"data": {


"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"0 22271 22:54:04 114.167000 22.718399 0 0\n",

"1 22271 18:26:26 114.190598 22.647800 0 4\n",

"2 22271 18:35:18 114.201401 22.649700 0 0\n",

"3 22271 16:02:46 114.233498 22.725901 0 24\n",

"4 22271 21:41:17 114.233597 22.720900 0 19"

]

},

"execution_count": 3,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"# Show the first 5 rows of the data\n",

"data.head(5)"

]

},
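{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check (a sketch added here, not part of the original workflow): confirm how many GPS points and distinct vehicles the sample contains, and that the columns were read with sensible dtypes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sanity check (illustrative, not in the original notebook)\n",
"print(len(data))                     # number of GPS records\n",
"print(data['VehicleNum'].nunique()) # number of distinct vehicles\n",
"print(data.dtypes)                  # column types"
]
},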

{

"cell_type": "markdown",

"metadata": {},

"source": [

"The data format:\n",
"\n",
">VehicleNum —— vehicle ID (license plate) \n",
"Stime —— time \n",
"Lng —— longitude \n",
"Lat —— latitude \n",
"OpenStatus —— whether a passenger is aboard (0 vacant, 1 occupied) \n",
"Speed —— speed "

]

},
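{
"cell_type": "markdown",
"metadata": {},
"source": [
"A side note (a sketch added here, not part of the original workflow): Stime is stored as zero-padded HH:MM:SS text, so plain string sorting already matches chronological order. To put each vehicle's records in time order, one option is sort_values; data_sorted is a name introduced here purely for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: sort records by vehicle, then by time\n",
"# (zero-padded HH:MM:SS strings sort correctly as text)\n",
"data_sorted = data.sort_values(by=['VehicleNum', 'Stime'])\n",
"data_sorted.head(5)"
]
},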

{

"cell_type": "markdown",

"metadata": {},

"source": [

"# Basic data operations"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## DataFrame and Series"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"DataFrame and Series\n",
"\n",
" > When we read in a dataset, what we get is a data table in DataFrame format, and each column of that DataFrame is a Series \n",
" In other words, a DataFrame is composed of multiple Series\n"

]

},

{

"cell_type": "code",

"execution_count": 87,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:25.713432Z",

"start_time": "2020-01-18T04:52:25.708450Z"

}

},

"outputs": [

{

"data": {

"text/plain": [

"pandas.core.frame.DataFrame"

]

},

"execution_count": 87,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"type(data)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"If we want a single column of the DataFrame as a Series, we can simply use\n",
"\n",
" > data[colname]"

]

},

{

"cell_type": "code",

"execution_count": 88,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:32.097575Z",

"start_time": "2020-01-18T04:52:32.090592Z"

}

},

"outputs": [

{

"data": {

"text/plain": [

"pandas.core.series.Series"

]

},

"execution_count": 88,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"type(data['Lng'])"

]

},

{

"cell_type": "markdown",

"metadata": {

"ExecuteTime": {

"end_time": "2019-09-06T09:22:43.642625Z",

"start_time": "2019-09-06T09:22:43.638487Z"

}

},

"source": [

"If we want one or several columns of the DataFrame back as a DataFrame, we can simply use\n",
"\n",
"> data[[colname, colname]]"

]

},

{

"cell_type": "code",

"execution_count": 89,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:33.013124Z",

"start_time": "2020-01-18T04:52:32.990186Z"

}

},

"outputs": [

{

"data": {

"text/plain": [

"pandas.core.frame.DataFrame"

]

},

"execution_count": 89,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"type(data[['Lng']])"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"## Filtering data"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"Filtering data:\n",
"\n",
" When filtering, we generally use the data[condition] pattern,\n",
" where the condition is a Series of True/False boolean values, one for each row of data"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

" For example, suppose we want all rows for the vehicle with ID 22271\n",
" First we need a boolean Series aligned with the rows of data: True where the vehicle ID is 22271, False otherwise\n",
" Such a Series is easy to obtain; we only need\n",
" data['VehicleNum']==22271"

]

},

{

"cell_type": "code",

"execution_count": 90,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:44.078571Z",

"start_time": "2020-01-18T04:52:44.049646Z"

}

},

"outputs": [

{

"data": {

"text/plain": [

"0 True\n",

"1 True\n",

"2 True\n",

"3 True\n",

"4 True\n",

"Name: VehicleNum, dtype: bool"

]

},

"execution_count": 90,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"(data['VehicleNum']==22271).head(5)"

]

},

{

"cell_type": "code",

"execution_count": 92,

"metadata": {

"ExecuteTime": {

"end_time": "2020-01-18T04:52:51.723416Z",

"start_time": "2020-01-18T04:52:51.688510Z"

}

},

"outputs": [

{

"data": {


"text/plain": [

" VehicleNum Stime Lng Lat OpenStatus Speed\n",

"0 22271 22:54:04 114.167000 22.718399 0 0\n",

"1 22271 18:26:26 114.190598 22.647800 0 4\n",

"2 22271 18:35:18 114.201401 22.649700 0 0\n",

"3 22271 16:02:46 114.233498 22.725901 0 24\n",

"4 22271 21:41:17 114.233597 22.720900 0 19"

]

},

"execution_count": 92,

"metadata": {},

"output_type": "execute_result"

}

],

"source": [

"# Get all rows for vehicle ID 22271\n",

"data[data['VehicleNum']==22271].head(5)"

]

},
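{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conditions can also be combined with & (and) or | (or), wrapping each condition in its own parentheses. For example (an illustrative sketch, not part of the original notebook), the occupied records of this vehicle:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Combine boolean conditions with &; each condition needs its own parentheses\n",
"data[(data['VehicleNum']==22271) & (data['OpenStatus']==1)].head(5)"
]
},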

{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {},
"nbformat": 4,
"nbformat_minor": 2
}