No preamble, let's get straight to the useful part!
Foursquare NYC dataset source: Dingqi YANG's Homepage - Foursquare Dataset (google.com)
The Gowalla dataset format looks like this:
0 2010-10-19T23:55:27Z 30.2359091167 -97.7951395833 22847
0 2010-10-18T22:17:43Z 30.2691029532 -97.7493953705 420315
0 2010-10-17T23:42:03Z 30.2557309927 -97.7633857727 316637
0 2010-10-17T19:26:05Z 30.2634181234 -97.7575966669 16516
0 2010-10-16T18:50:42Z 30.2742918584 -97.7405226231 5535878
0 2010-10-12T23:58:03Z 30.261599404 -97.7585805953 15372
0 2010-10-12T22:02:11Z 30.2679095833 -97.7493124167 21714
0 2010-10-12T19:44:40Z 30.2691029532 -97.7493953705 420315
From left to right, the columns are: user, time, lat, lng, loc.
The dataset_TSMC2014_NYC data format:
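To make the five Gowalla columns concrete, here is a minimal sketch that parses one of the sample lines above into typed fields. The helper name `parse_gowalla_line` is made up for illustration, and treating `user` and `loc` as integers is an assumption based on the samples shown.

```python
from datetime import datetime, timezone

def parse_gowalla_line(line):
    """Split one whitespace-separated Gowalla record into typed fields."""
    user, time, lat, lng, loc = line.split()
    return {
        "user": int(user),
        # The trailing 'Z' marks UTC, so attach the UTC timezone explicitly.
        "time": datetime.strptime(time, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc),
        "lat": float(lat),
        "lng": float(lng),
        "loc": int(loc),
    }

record = parse_gowalla_line("0 2010-10-19T23:55:27Z 30.2359091167 -97.7951395833 22847")
print(record["user"], record["time"].isoformat(), record["loc"])
```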
470 49bbd6c0f964a520f4531fe3 4bf58dd8d48988d127951735 Arts & Crafts Store 40.719810375488535 -74.00258103213994 -240 Tue Apr 03 18:00:09 +0000 2012
979 4a43c0aef964a520c6a61fe3 4bf58dd8d48988d1df941735 Bridge 40.60679958140643 -74.04416981025437 -240 Tue Apr 03 18:00:25 +0000 2012
69 4c5cc7b485a1e21e00d35711 4bf58dd8d48988d103941735 Home (private) 40.716161684843215 -73.88307005845945 -240 Tue Apr 03 18:02:24 +0000 2012
395 4bc7086715a7ef3bef9878da 4bf58dd8d48988d104941735 Medical Center 40.7451638 -73.982518775 -240 Tue Apr 03 18:02:41 +0000 2012
87 4cf2c5321d18a143951b5cec 4bf58dd8d48988d1cb941735 Food Truck 40.74010382743943 -73.98965835571289 -240 Tue Apr 03 18:03:00 +0000 2012
484 4b5b981bf964a520900929e3 4bf58dd8d48988d118951735 Food & Drink Shop 40.69042711809854 -73.95468677509598 -240 Tue Apr 03 18:04:00 +0000 2012
642 4ab966c3f964a5203c7f20e3 4bf58dd8d48988d1e0931735 Coffee Shop 40.751591431346306 -73.9741214009634 -240 Tue Apr 03 18:04:38 +0000 2012
292 4d0cc47f903d37041864bf55 4bf58dd8d48988d12b951735 Bus Station 40.77942173066975
From left to right, the columns are:
1. User ID (anonymized)
2. Venue ID (Foursquare)
3. Venue category ID (Foursquare)
4. Venue category name (Foursquare)
5. Latitude
6. Longitude
7. Timezone offset in minutes (the offset in minutes between when this check-in occurred and the same time in UTC)
8. UTC time
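Because the venue category name can itself contain spaces (for example "Arts & Crafts Store"), a record has to be split on the field delimiter rather than on whitespace. The sketch below assumes the raw file is tab-separated; `FIELDS` and `parse_checkin` are hypothetical names used only for illustration.

```python
# Illustrative field names for the eight columns listed above.
FIELDS = ["user", "loc", "category_id", "category_name",
          "lat", "lng", "tz_offset", "time"]

def parse_checkin(line):
    """Return a dict for a complete record, or None if fields are missing."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))

sample = "\t".join([
    "470", "49bbd6c0f964a520f4531fe3", "4bf58dd8d48988d127951735",
    "Arts & Crafts Store", "40.719810375488535", "-74.00258103213994",
    "-240", "Tue Apr 03 18:00:09 +0000 2012",
])
checkin = parse_checkin(sample)
print(checkin["category_name"], checkin["time"])
```

Returning `None` for short records is one possible way to flag incomplete lines; dropping them later, as a `dropna()` step does, is another.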
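So our task is as follows:
Keep columns 1, 2, 5, 6, and 8, then swap the positions of columns 2 and 8 (that is, loc and time). Note that the time format also has to be converted to match the Gowalla one.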
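The time conversion is the only non-trivial step, so here it is in isolation as a small sketch. The names `SRC_FORMAT`, `DST_FORMAT`, and `to_gowalla_time` are illustrative. Note that `%a` and `%b` depend on the locale, so this assumes English day and month names.

```python
from datetime import datetime

SRC_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"   # Foursquare style, always UTC
DST_FORMAT = "%Y-%m-%dT%H:%M:%SZ"           # Gowalla style

def to_gowalla_time(utc_time):
    """Reformat a Foursquare UTC timestamp string into the Gowalla layout."""
    return datetime.strptime(utc_time, SRC_FORMAT).strftime(DST_FORMAT)

print(to_gowalla_time("Tue Apr 03 18:00:09 +0000 2012"))  # 2012-04-03T18:00:09Z
```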
Code:
from datetime import datetime
import pandas as pd

# 227428 rows x 8 columns
with open('dataset_TSMC2014_NYC.txt', 'r', encoding='latin-1') as f:
    lines = []
    for line in f:
        line = line.strip().split('\t')
        lines.append(line)
df = pd.DataFrame(lines, columns=['user', 'loc', 'a', 'b', 'lat', 'lng', 'c', 'time'])
# Drop the unused columns
df = df.drop(['a', 'b', 'c'], axis=1)
# Drop rows with missing values
df = df.dropna()
# Convert the time format
df['time'] = df['time'].map(
    lambda t: datetime.strptime(t, '%a %b %d %H:%M:%S +0000 %Y').strftime('%Y-%m-%dT%H:%M:%SZ')
)
# Convert the user column from str to int so that the sort below is numeric
df['user'] = df['user'].astype(int)
# Reorder the columns
df = df.reindex(columns=['user', 'time', 'lat', 'lng', 'loc'])
# Sort by user id
df = df.sort_values(by=['user'], ascending=True)
# Write the result
df.to_csv('data.txt', sep='\t', index=False)
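As an alternative to reading the file line by line, the same steps can be sketched with pandas' own parser. This is only one possible variant, not a drop-in replacement: `convert_checkins` and `COLUMNS` are hypothetical names, lat/lng are kept as strings so their digits are written back unchanged, and a stable sort is chosen so each user's check-ins stay in file order.

```python
import csv
import io
import pandas as pd

COLUMNS = ["user", "loc", "a", "b", "lat", "lng", "c", "time"]

def convert_checkins(src):
    """Read tab-separated check-ins and return them in Gowalla column order."""
    df = pd.read_csv(
        src,
        sep="\t",
        header=None,
        names=COLUMNS,
        usecols=["user", "loc", "lat", "lng", "time"],
        dtype={"loc": str, "lat": str, "lng": str},
        quoting=csv.QUOTE_NONE,   # do not treat quotes in category names specially
        encoding="latin-1",       # only relevant when src is a file path
    )
    df = df.dropna()
    parsed = pd.to_datetime(df["time"], format="%a %b %d %H:%M:%S +0000 %Y")
    df["time"] = parsed.dt.strftime("%Y-%m-%dT%H:%M:%SZ")
    df["user"] = df["user"].astype(int)
    df = df[["user", "time", "lat", "lng", "loc"]]
    return df.sort_values("user", kind="stable")

sample = io.StringIO(
    "470\t49bbd6c0f964a520f4531fe3\t4bf58dd8d48988d127951735\tArts & Crafts Store\t"
    "40.719810375488535\t-74.00258103213994\t-240\tTue Apr 03 18:00:09 +0000 2012\n"
    "69\t4c5cc7b485a1e21e00d35711\t4bf58dd8d48988d103941735\tHome (private)\t"
    "40.716161684843215\t-73.88307005845945\t-240\tTue Apr 03 18:02:24 +0000 2012\n"
)
result = convert_checkins(sample)
print(result.to_string(index=False))
```

For the real file, `convert_checkins('dataset_TSMC2014_NYC.txt')` followed by `to_csv` would play the role of the script above.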
Of course, if you want to see what each step does, you can print the intermediate results yourself.
user time lat lng loc
1 2012-04-14T17:45:23Z 40.75673068586622 -73.97406974625711 428d2880f964a520b5231fe3
1 2012-05-12T01:20:06Z 40.786765544538504 -73.9757341687067 4ea8ab9cf790328ae1736bac
1 2012-12-13T03:05:17Z 40.76963366209763 -73.99434602137363 50c7104fe4b0860fb5620c72
1 2012-04-29T20:45:25Z 40.78401843692215 -73.97452399134636 4d4ac10da0ef54814b6ffff6
1 2012-07-15T22:14:46Z 40.78085 -73.976243 42814b00f964a520dc211fe3
1 2012-04-12T17:19:21Z 40.720087 -74.003961 49d2b43ef964a520cb5b1fe3
1 2012-06-16T21:29:30Z 40.74588914242651 -73.98852104404719 4cc9a5760f7bef3b5f9c7fdd
1 2012-04-20T01:49:26Z 40.7337627 -74.006264 3fd66200f964a520e9e61ee3
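This is what the processed data looks like; all that is left is to delete the first line (the header). Bookmark this post now so you can find it again later!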
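Instead of deleting the header line by hand, one option is to ask pandas not to write it in the first place by passing `header=False` to `to_csv`. The sketch below uses a one-row frame built from the first output line above and writes to an in-memory buffer rather than to `data.txt`.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "user": [1],
    "time": ["2012-04-14T17:45:23Z"],
    "lat": ["40.75673068586622"],
    "lng": ["-73.97406974625711"],
    "loc": ["428d2880f964a520b5231fe3"],
})

buffer = io.StringIO()
# header=False suppresses the "user time lat lng loc" line.
df.to_csv(buffer, sep="\t", index=False, header=False)
print(buffer.getvalue())
```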
With the data processed, go run your model. Good luck!