Converting the Foursquare dataset_TSMC2014_NYC dataset to Gowalla format (worth bookmarking!)

No preamble, straight to the useful part!

Foursquare NYC data source: Dingqi YANG's Homepage - Foursquare Dataset (google.com)

 

The Gowalla dataset format looks like this:

0	2010-10-19T23:55:27Z	30.2359091167	-97.7951395833	22847
0	2010-10-18T22:17:43Z	30.2691029532	-97.7493953705	420315
0	2010-10-17T23:42:03Z	30.2557309927	-97.7633857727	316637
0	2010-10-17T19:26:05Z	30.2634181234	-97.7575966669	16516
0	2010-10-16T18:50:42Z	30.2742918584	-97.7405226231	5535878
0	2010-10-12T23:58:03Z	30.261599404	-97.7585805953	15372
0	2010-10-12T22:02:11Z	30.2679095833	-97.7493124167	21714
0	2010-10-12T19:44:40Z	30.2691029532	-97.7493953705	420315

From left to right, the tab-separated columns are: user, time, lat, lng, loc.
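To make the target layout concrete, here is a minimal sketch that parses one Gowalla line into typed fields. The helper name `parse_gowalla_line` is made up for illustration; it is not part of any library or of the dataset's tooling.

```python
from datetime import datetime, timezone

def parse_gowalla_line(line):
    """Split one tab-separated Gowalla check-in line into typed fields."""
    user, time_str, lat, lng, loc = line.rstrip("\n").split("\t")
    return {
        "user": int(user),
        # Gowalla timestamps are UTC, written as ISO 8601 with a trailing 'Z'
        "time": datetime.strptime(time_str, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc),
        "lat": float(lat),
        "lng": float(lng),
        "loc": int(loc),
    }

row = parse_gowalla_line("0\t2010-10-19T23:55:27Z\t30.2359091167\t-97.7951395833\t22847")
print(row["user"], row["time"].isoformat(), row["loc"])
```

Note that loc is an integer ID in Gowalla, while Foursquare venue IDs are hexadecimal strings, so code that calls `int()` on loc will not work on the converted file as-is.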


The dataset_TSMC2014_NYC data format:

470	49bbd6c0f964a520f4531fe3	4bf58dd8d48988d127951735	Arts & Crafts Store	40.719810375488535	-74.00258103213994	-240	Tue Apr 03 18:00:09 +0000 2012
979	4a43c0aef964a520c6a61fe3	4bf58dd8d48988d1df941735	Bridge	40.60679958140643	-74.04416981025437	-240	Tue Apr 03 18:00:25 +0000 2012
69	4c5cc7b485a1e21e00d35711	4bf58dd8d48988d103941735	Home (private)	40.716161684843215	-73.88307005845945	-240	Tue Apr 03 18:02:24 +0000 2012
395	4bc7086715a7ef3bef9878da	4bf58dd8d48988d104941735	Medical Center	40.7451638	-73.982518775	-240	Tue Apr 03 18:02:41 +0000 2012
87	4cf2c5321d18a143951b5cec	4bf58dd8d48988d1cb941735	Food Truck	40.74010382743943	-73.98965835571289	-240	Tue Apr 03 18:03:00 +0000 2012
484	4b5b981bf964a520900929e3	4bf58dd8d48988d118951735	Food & Drink Shop	40.69042711809854	-73.95468677509598	-240	Tue Apr 03 18:04:00 +0000 2012
642	4ab966c3f964a5203c7f20e3	4bf58dd8d48988d1e0931735	Coffee Shop	40.751591431346306	-73.9741214009634	-240	Tue Apr 03 18:04:38 +0000 2012
292	4d0cc47f903d37041864bf55	4bf58dd8d48988d12b951735	Bus Station	40.77942173066975	

From left to right, the columns mean:

1. User ID (anonymized)
2. Venue ID (Foursquare)
3. Venue category ID (Foursquare)
4. Venue category name (Foursquare)
5. Latitude
6. Longitude
7. Timezone offset in minutes (The offset in minutes between when this check-in occurred and the same time in UTC)
8. UTC time
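As a small worked example of columns 7 and 8, the sketch below recovers the local check-in time by adding the offset to the UTC time. The values come from the first sample row above; the variable names are only illustrative. The Gowalla format stores UTC times, so this step is not needed for the conversion itself.

```python
from datetime import datetime, timedelta

# Column 8: UTC time, in the dataset's own text format
utc_time = datetime.strptime("Tue Apr 03 18:00:09 +0000 2012",
                             "%a %b %d %H:%M:%S +0000 %Y")
# Column 7: timezone offset in minutes (-240 means four hours behind UTC)
offset_minutes = -240

local_time = utc_time + timedelta(minutes=offset_minutes)
print(local_time)  # 2012-04-03 14:00:09
```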

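So our task is as follows:

Keep columns 1, 2, 5, 6, and 8, then swap the positions of columns 2 and 8 (that is, loc and time). Note that the time also has to be converted from the Foursquare format (Tue Apr 03 18:00:09 +0000 2012) to the Gowalla format (2012-04-03T18:00:09Z).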
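The task can be sketched on a single line before applying it to the whole file. The function name `foursquare_to_gowalla` is a hypothetical helper for illustration, and returning `None` for rows that do not have exactly 8 fields is an assumption about how truncated lines should be handled.

```python
from datetime import datetime

def foursquare_to_gowalla(line):
    """Convert one TSMC2014 line to a Gowalla-style line, or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        return None  # assumption: skip truncated or malformed rows
    user, loc, _cat_id, _cat_name, lat, lng, _tz_offset, time_str = fields
    t = datetime.strptime(time_str, "%a %b %d %H:%M:%S +0000 %Y")
    # Gowalla column order: user, time, lat, lng, loc
    return "\t".join([user, t.strftime("%Y-%m-%dT%H:%M:%SZ"), lat, lng, loc])

sample = ("470\t49bbd6c0f964a520f4531fe3\t4bf58dd8d48988d127951735\t"
          "Arts & Crafts Store\t40.719810375488535\t-74.00258103213994\t"
          "-240\tTue Apr 03 18:00:09 +0000 2012")
converted = foursquare_to_gowalla(sample)
print(converted)
```

Latitude and longitude are passed through as strings here, so they keep exactly the digits found in the source file.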

Code:

from datetime import datetime
import pandas as pd

# Raw file: 227428 rows x 8 columns, tab-separated, no header line
with open('dataset_TSMC2014_NYC.txt', 'r', encoding='latin-1') as f:
    lines = [line.strip().split('\t') for line in f]

df = pd.DataFrame(lines, columns=['user', 'loc', 'a', 'b', 'lat', 'lng', 'c', 'time'])

# Drop the columns we do not need (category ID, category name, timezone offset)
df = df.drop(['a', 'b', 'c'], axis=1)

# Drop rows with missing values
df = df.dropna()

# Convert the time format,
# e.g. 'Tue Apr 03 18:00:09 +0000 2012' -> '2012-04-03T18:00:09Z'
def to_gowalla_time(s):
    t = datetime.strptime(s, '%a %b %d %H:%M:%S +0000 %Y')
    return t.strftime('%Y-%m-%dT%H:%M:%SZ')

df['time'] = df['time'].map(to_gowalla_time)

# Convert the user column from str to int so that sorting is numeric
df['user'] = df['user'].astype(int)

# Reorder the columns
df = df.reindex(columns=['user', 'time', 'lat', 'lng', 'loc'])

# Sort by user ID
df = df.sort_values(by=['user'], ascending=True)

# Write the result
df.to_csv('data.txt', sep='\t', index=False)

Of course, if you want to understand what each part does, you can print the intermediate results and take a look.

user	time	lat	lng	loc
1	2012-04-14T17:45:23Z	40.75673068586622	-73.97406974625711	428d2880f964a520b5231fe3
1	2012-05-12T01:20:06Z	40.786765544538504	-73.9757341687067	4ea8ab9cf790328ae1736bac
1	2012-12-13T03:05:17Z	40.76963366209763	-73.99434602137363	50c7104fe4b0860fb5620c72
1	2012-04-29T20:45:25Z	40.78401843692215	-73.97452399134636	4d4ac10da0ef54814b6ffff6
1	2012-07-15T22:14:46Z	40.78085	-73.976243	42814b00f964a520dc211fe3
1	2012-04-12T17:19:21Z	40.720087	-74.003961	49d2b43ef964a520cb5b1fe3
1	2012-06-16T21:29:30Z	40.74588914242651	-73.98852104404719	4cc9a5760f7bef3b5f9c7fdd
1	2012-04-20T01:49:26Z	40.7337627	-74.006264	3fd66200f964a520e9e61ee3

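This is what the processed data looks like; all you need to do is delete the first (header) line. Bookmark this now so you can find it again later!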
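Instead of deleting the header line by hand, pandas can skip it at write time with `header=False` on `to_csv`. The sketch below writes to an in-memory buffer with a one-row stand-in DataFrame (taken from the sample output above) just to show the effect; in the real script you would pass the output filename instead.

```python
import io
import pandas as pd

# One-row stand-in for the processed DataFrame
df = pd.DataFrame({
    "user": [1],
    "time": ["2012-04-14T17:45:23Z"],
    "lat": ["40.75673068586622"],
    "lng": ["-73.97406974625711"],
    "loc": ["428d2880f964a520b5231fe3"],
})

buf = io.StringIO()
# header=False leaves out the 'user time lat lng loc' line
df.to_csv(buf, sep="\t", index=False, header=False)
out_lines = buf.getvalue().splitlines()
print(out_lines[0])
```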

With the data processed, go run your model. Good luck!
