Precisely locating which shop a user is in inside a shopping mall

I took part in this Alibaba Cloud (Tianchi) competition a while ago. It was my first competition and I worked on it alone, so the result was not great: I narrowly missed the finals. I am writing down the code and my approach here for future reference.

The competition page is [here](https://tianchi.aliyun.com/competition/introduction.htm?spm=5176.100066.0.0.71b16a1d6AHhKg&raceId=231620).
The core idea is that a phone's wifi can scan all nearby connectable access points and report them to a server. When someone enters a mall and pays in a shop, the wifi the phone is connected to and the wifi positioning information are recorded on the server; if another user later connects to a similar set of wifi signals, that user is probably shopping in the same shop.
In the end, the goal is to recommend products a user might buy based on the wifi they are connected to.

The input data covers more than 90 malls. For each mall, the records include the user ID, the user's wifi connection information, wifi positioning information, the transaction time, and each shop's price level and category; the task is to predict which shop the user is currently in.
This is essentially a wifi indoor-positioning problem, and since the target is the shop the user is in, it can be treated as a classification problem. The first thing that came to mind was xgboost: the data volume is small, and it saves a lot of feature-engineering work.
Below is my understanding of the input data (plus one derived signal):

  1. User wifi connection info:
    First drop wifi access points that very few users connect to, and also drop those that very many users connect to (these are probably the mall's public wifi).
  2. Transaction time:
    Split time more finely, e.g. divide the week into weekends and weekdays, and the day into daytime and nighttime (a minimal sketch of ideas 1 and 2 follows this list).
  3. Wifi positioning info:
    Originally longitude/latitude; take a confidence interval and convert it to planar coordinates.
  4. Repeat purchases by one user:
    Some users buy once and leave, others make several purchases in a row; consecutive purchases should be of similar product categories, and the price levels of the shops involved should be close.
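
To make ideas 1 and 2 concrete, here is a minimal standalone sketch. The toy records and the count thresholds are made up for illustration; the actual scripts below work on the competition's wifi_infos and time_stamp columns and drop bssids that appear in fewer than 20 records.

import pandas as pd
from collections import Counter

# toy records: wifi_infos is a ';'-separated list of entries whose first two
# '|'-separated fields are the bssid and the signal strength (as parsed in script 1)
df = pd.DataFrame({
    'time_stamp': pd.to_datetime(['2017-08-05 13:10', '2017-08-07 21:40']),
    'wifi_infos': ['b_101|-55;b_202|-80', 'b_101|-60;b_303|-72'],
})

# idea 1: count in how many records each bssid appears, then keep only the "middling"
# ones (drop rarely-seen noise as well as the mall-wide public wifi)
counts = Counter(w.split('|')[0] for row in df['wifi_infos'] for w in row.split(';'))
keep = {b for b, c in counts.items() if 2 <= c <= 1000}  # placeholder thresholds

# idea 2: coarse time features, weekday vs weekend and daytime vs evening
df['lweek'] = (df['time_stamp'].dt.weekday < 5).astype(int)   # 1 = weekday, as in addfea() below
df['lday'] = df['time_stamp'].dt.hour.between(8, 18).astype(int)  # 1 = daytime

print(keep)
print(df[['lweek', 'lday']])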

To keep the data processing separate from the training (and make training easier to rerun), I split the processing into steps:

  1. First split the data by mall and save each mall's records separately.
  2. Read each mall's data, add the features described above, and save the result.
  3. Read the step-2 data and, for each mall, grid-search the best training parameters with cross-validation, train, and write out the predictions.

The three scripts for these steps are listed below.

# -*- coding: utf-8 -*-
"""
将输入数据处理成每个mall的文件,添加wifi连接数量的筛选
"""
import pandas as pd
import pickle
import gc

path = './'
spath = './data/'
df = pd.read_csv(path + u'训练数据-ccf_first_round_user_shop_behavior.csv')
shop = pd.read_csv(path + u'训练数据-ccf_first_round_shop_info.csv')
test = pd.read_csv(path + u'AB榜测试集-evaluation_public.csv')
df = pd.merge(df, shop[['shop_id', 'mall_id']], how='left', on='shop_id')
df['time_stamp'] = pd.to_datetime(df['time_stamp'])
test['time_stamp'] = pd.to_datetime(test['time_stamp'])
train = pd.concat([df, test])
mall_list = list(set(list(shop.mall_id)))
result = pd.DataFrame()
count = 0

for mall in mall_list:
    train1 = train[train.mall_id == mall].reset_index(drop=True)
    l = []          # one row per record, widened with a column per wifi bssid
    wifi_dict = {}  # number of records in which each bssid appears
    for index, row in train1.iterrows():
        wifi_list = [wifi.split('|') for wifi in row['wifi_infos'].split(';')]
        for i in wifi_list:
            row[i[0]] = int(i[1])  # bssid -> signal strength
            if i[0] not in wifi_dict:
                wifi_dict[i[0]] = 1
            else:
                wifi_dict[i[0]] += 1
        l.append(row)
    delate_wifi = []  # bssids that appear in fewer than 20 records are dropped
    for i in wifi_dict:
        if wifi_dict[i] < 20:
            delate_wifi.append(i)
    m = []
    for row in l:
        new = {}
        for n in row.keys():
            if n not in delate_wifi:
                new[n] = row[n]
        m.append(new)
    traino1 = pd.DataFrame(m)

    count = count + 1
    print(traino1.shape)
    print(mall, 'mall index', count)

    with open((spath + mall + '_1.txt'), 'wb') as f:
        pickle.dump(traino1, f)
    # del traino1
    # del m
    # del m2
    # del l
    # del wifi_dict
    # gc.collect()
print('finish')

# -*- coding: utf-8 -*-
"""
Read each per-mall file produced by the first script and add the extra features
(planar coordinates, time features, repeat-purchase flag, per-shop distance flags).
"""
import os
import math
import pickle
from math import radians, cos, sin, asin, sqrt, pi

import numpy as np
import pandas as pd
import scipy.stats
from sklearn import preprocessing
from sklearn.cluster import KMeans

path = '../'


def geodistance(lng1, lat1, lng2, lat2):
    '''Great-circle (haversine) distance between two lon/lat points, in metres'''
    lng1, lat1, lng2, lat2 = map(radians, [lng1, lat1, lng2, lat2])
    dlon = lng2 - lng1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    dis = 2 * asin(sqrt(a)) * 6371 * 1000
    return dis
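
# Rough sanity check: one degree of latitude is about 111 km, so
# geodistance(120.0, 30.0, 120.0, 31.0) should come out near 111e3 metres.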


def millerxy(lon, lat):
    '''Convert lon/lat (degrees) to planar (x, y) via a Miller cylindrical projection'''
    l = 6381372 * pi * 2   # approximate circumference of the earth, in metres
    w = l                  # width of the projected plane
    h = l / 2              # height of the projected plane
    mill = 2.3             # Miller projection constant
    x = lon * pi / 180     # to radians
    y = lat * pi / 180
    y = 1.25 * math.log(math.tan(0.25 * pi + 0.4 * y))
    x = (w / 2) + (w / (2 * pi)) * x
    y = (h / 2) - (h / (2 * mill)) * y
    return x, y
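
# millerxy returns coordinates roughly in metres on the projected plane; only relative
# positions matter here, since newlon/newlat are later rescaled with
# mean_confidence_interval() before training.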


def addfea(train1):
    '''Add planar coordinates, a weekday/weekend flag, time-of-day features
    and a repeat-purchase flag'''
    addfeature = []
    flagtwice = 0  # the same user appears in adjacent records (repeat purchase)
    tshape = train1.shape
    for ix, row in train1.iterrows():
        # time of day, in 10-minute bins (0-143)
        ts = pd.Timestamp(row['time_stamp'])  # robust to string timestamps
        l2tmp = ts.strftime('%H:%M').split(':')
        tmplday = int(int(l2tmp[0]) * 6 + int(l2tmp[1]) / 10)
        lday = tmplday < 100  # before roughly 16:40 counts as "daytime"
        # weekday (1) vs weekend (0)
        lweek = ts.weekday()
        if lweek == 6 or lweek == 5:
            lweek = 0
        else:
            lweek = 1
        # repeat purchase: the same user appears in the previous or next record
        if ix < tshape[0] - 1 and ix > 2:
            nexttrain = train1.iloc[ix + 1]
            befortrain = train1.iloc[ix - 1]
            if row['user_id'] == nexttrain['user_id'] or row['user_id'] == befortrain['user_id']:
                flagtwice = 1
            else:
                flagtwice = 0
        else:
            flagtwice = 0
        # coarse time-of-day bucket: before 10:00 / between 10:00 and 15:00 / after 15:00
        if tmplday < 60:
            flagtime = 0
        elif tmplday > 90:
            flagtime = 2
        else:
            flagtime = 1
        # planar coordinates from the Miller projection
        newlon, newlat = millerxy(row['longitude'], row['latitude'])
        addfeature.append([lday, lweek, flagtwice, flagtime, newlon, newlat])

    addfea = pd.DataFrame(np.array(addfeature), index=[x for x in range(tshape[0])],
                          columns=['lday', 'lweek', 'flagtwice', 'flagtime', 'newlon', 'newlat'])
    train2 = pd.concat([train1, addfea], axis=1)
    return train2


def addfea2(train1, shoplist):
    '''For each shop, cluster the positions of its training transactions into a single
    centre, then for every record mark whether it lies within 50 m of that centre'''
    df_train = train1[train1.shop_id.notnull()]
    df_test = train1[train1.shop_id.isnull()]
    addfeature = []
    # registered shop coordinates from the shop-info file, used to initialise the centres
    shop = pd.read_csv(path + u'训练数据-ccf_first_round_shop_info.csv')
    shop = shop[shop.mall_id == df_train['mall_id'].iloc[0]]
    shopdic = {}
    for i in shoplist:
        tmp1 = shop[shop.shop_id == i]
        shopdic[i] = [tmp1['longitude'].values[0], tmp1['latitude'].values[0]]

    # positions of the training transactions at each shop
    posit_sit = dict(zip(shoplist, [[] for x in range(len(shoplist))]))
    for index, row in df_train.iterrows():
        posit_sit[row['shop_id']].append([row['longitude'], row['latitude']])
    # one centre per shop: with a single record use it directly, otherwise the centroid
    # of its transactions (KMeans with one cluster is just the mean)
    for i in posit_sit:
        tmp1 = posit_sit[i]
        if len(tmp1) == 1:
            shopdic[i] = tmp1[0]
        else:
            shopdic[i] = KMeans(n_clusters=1).fit(posit_sit[i]).cluster_centers_[0].tolist()

    # binary "within 50 m of the shop centre" features, training rows first
    for ix, row in df_train.iterrows():
        tmp3 = []
        for i in shoplist:
            if geodistance(shopdic[i][0], shopdic[i][1], row['longitude'], row['latitude']) < 50:
                tmp3 = tmp3 + [True]
            else:
                tmp3 = tmp3 + [False]
                # tmp3 = tmp3 + [geodistance(shopdic[i][0], shopdic[i][1], row['longitude'], row['latitude'])]
        addfeature.append(tmp3)

    # the same features for the test rows
    for ix, row in df_test.iterrows():
        tmp3 = []
        for i in shoplist:
            if geodistance(shopdic[i][0], shopdic[i][1], row['longitude'], row['latitude']) < 50:
                tmp3 = tmp3 + [True]
            else:
                tmp3 = tmp3 + [False]
                # tmp3 = tmp3 + [geodistance(shopdic[i][0], shopdic[i][1], row['longitude'], row['latitude'])]
        addfeature.append(tmp3)

    addfea = pd.DataFrame(np.array(addfeature), index=[x for x in range(train1.shape[0])],
                          columns=['dis' + x for x in shoplist])
    train2 = pd.concat([train1, addfea], axis=1)
    return train2
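# addfea2 adds one boolean column 'dis<shop_id>' per shop, marking whether the record
# lies within 50 m of that shop's centre; these columns are computed here but only
# optionally included as features in the training script.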


def addlabel(train1):
    '''Encode shop_id as an integer label and return the list of shop classes'''
    df_train = train1[train1.shop_id.notnull()].copy()
    df_test = train1[train1.shop_id.isnull()]
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df_train['shop_id'].values))
    df_train['label'] = lbl.transform(list(df_train['shop_id'].values))
    # num_class = df_train['label'].max() + 1
    listshop = list(lbl.classes_)
    train2 = df_train.append(df_test, ignore_index=True)
    return train2, listshop


def mean_confidence_interval(data, confidence=0.95):
    '''Rescale data by its 95% confidence interval: m is the mean, (m-h, m+h) the
    interval bounds; the returned values map that interval onto roughly [0, 1]'''
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n - 1)
    outdata = (np.array(data) - (m - h)) / (2 * h)
    return [outdata, m, h]
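
# Example: values at the lower bound of the 95% confidence interval map to roughly 0
# and values at the upper bound to roughly 1; the main loop below applies this to the
# projected newlon/newlat columns.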


if __name__ == '__main__':
    allfile = os.listdir('../data')
    result = pd.DataFrame()
    bestscore = []
    kk = 0
    for infile in allfile:
        with open('../data/' + infile, 'rb') as f:
            m = pickle.load(f)

        m_2, mshop = addlabel(m)  #
        m_21 = addfea(m_2)  #
        m_3 = addfea2(m_21, mshop)
        m_3['newlon'] = mean_confidence_interval(m_3['newlon'], confidence=0.95)[0]
        m_3['newlat'] = mean_confidence_interval(m_3['newlat'], confidence=0.95)[0]
        m_3['lweek'] = m_3['lweek'].astype('bool')
        m_3['flagtwice'] = m_3['flagtwice'].astype('bool')
        m_3['lday'] = m_3['lday'].astype('int')  # daytime flag as 0/1
        with open('../data2/m' + infile, 'wb') as f:
            pickle.dump(m_3, f)
        with open('../data2/mshop' + infile, 'wb') as f:
            pickle.dump(mshop, f)
        kk = kk + 1
        print(kk, 'malls done:', infile)

    print('f')

# -*- coding: utf-8 -*-
"""
For each mall: grid-search the xgboost parameters with cross-validation,
train on the full training set, and write out the test predictions.
"""
import os
import csv
import pickle
import itertools

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn import preprocessing


def crover(inparams):
    '''Expand an xgboost param dict into a grid for grid search.
    inparams: param dict in which at least one value is a list of candidate values.
    Returns alldic, one complete param dict per grid point, and outvar, the list of
    varied values for each grid point.
    '''
    listpara = []
    invir = []
    virdic = {}
    alldic = []
    outvar = []
    for i in inparams:
        if isinstance(inparams[i], list):
            listpara.append(inparams[i])
            invir.append(i)
        else:
            virdic[i] = inparams[i]
    if len(listpara) == 0:
        print('no parameter is given as a list of candidates')
        os._exit(1)
    listpara2 = list(itertools.product(*listpara))
    for i in listpara2:
        outdic = virdic.copy()
        outlist = []
        for j in range(len(invir)):
            outdic[invir[j]] = i[j]
            outlist.append(i[j])
        # outvar.append()
        # print(outdic,outlist)
        # yield outdic,outlist
        alldic.append(outdic)
        outvar.append(outlist)
    return alldic, outvar
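
# Example: crover({'eta': 0.08, 'max_depth': [6, 10], 'subsample': [0.5, 0.7]}) yields
# four complete param dicts (one per max_depth/subsample combination) together with
# the pair of varied values for each combination.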


def parti(df_train, feature, testradio, count):
    '''Hold-out split for cross-validation.
    df_train: input data
    feature: feature column names
    testradio: fraction of the data used as the validation fold
    count: which fold (1-based) to hold out
    e.g. testradio=0.2, count=1 holds out the first 20% as the validation set.'''
    lendata = df_train.shape[0]
    begin = int(lendata * testradio * (count - 1))
    end = int(lendata * testradio * count)
    test = df_train.iloc[begin:end, :]
    # train=pd.concat([df_train.iloc[:begin,:],df_train.iloc[end:,:] ], axis=1)
    train = df_train.iloc[:begin, :].append(df_train.iloc[end:, :])
    return xgb.DMatrix(train[feature], train['label']), xgb.DMatrix(test[feature], test['label'])
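
# Example: parti(df_train, feature, 0.2, 3) holds out the third 20% chunk (rows 40%-60%)
# as the validation DMatrix and trains on the remaining 80%.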






def genfeature(intrain, fea):
    '''Assemble the feature list: featall1 collects the wifi (bssid) columns,
    fea supplies the extra engineered features to include'''
    featall1 = []
    featall2 = fea
    for i in intrain.columns:
        if 'b_' in i:
            featall1.append(i)
    feature = featall1 + featall2
    return feature
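
# Example: genfeature(m_3, ['newlat', 'newlon', 'lweek', 'lday']) returns every wifi
# column whose name contains 'b_' plus the four extra engineered features.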


def trainxgbcro(train1, feature, result):
    '''Training with cross-validation: grid-search the list-valued entries in params,
    pick the combination with the lowest mean validation error, then retrain on all
    training rows and predict the test rows.'''
    df_train = train1[train1.shop_id.notnull()].copy()
    df_test = train1[train1.shop_id.isnull()].copy()
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df_train['shop_id'].values))
    df_train['label'] = lbl.transform(list(df_train['shop_id'].values))
    num_class = df_train['label'].max() + 1

    params = {
        'objective': 'multi:softmax',
        'eta': 0.08,
        'max_depth': 10,
        'eval_metric': 'merror',
        # 'seed': 0,
        'missing': -999,
        'num_class': num_class,
        # 'nthread': 4,
        # 'min_child_weight': [1,5],
        'subsample': [0.5, 0.7],  # row subsampling ratio per tree
        'colsample_bytree': [0.301, 0.501],  # column subsampling ratio per tree, e.g. [0.301, 0.501, 0.701]
        # 'seed': 1000,
        # 'gamma': [1e-5, 1e-2, 0.1, 1, 100],
        # 'lambda': [1e-5, 1e-2, 0.1, 1, 100],
        'silent': 1
    }

    num_rounds = 400
    allvar = []
    paralist = []
    allpara, listvar = crover(params)
    for k in range(len(allpara)):
        # print (i,j)
        i = allpara[k]
        j = listvar[k]
        paralist.append(i)
        meaniter = 0
        meanscor = []
        # three validation folds: the 1st, 3rd and 5th 20% chunks of the training data
        for cr in [1, 3, 5]:
            xgbtrain, xgbtrain2 = parti(df_train, feature, 0.2, cr)
            watchlist = [(xgbtrain, 'train'), (xgbtrain2, 'test')]
            model = xgb.train(i, xgbtrain, num_rounds, watchlist, early_stopping_rounds=15)
            # meanscor=meanscor+model.best_score
            meanscor.append(model.best_score)
            meaniter = meaniter + model.best_iteration
        # allvar.append([model.best_score, model.best_iteration, i,j])
        allvar.append([meanscor, sum(meanscor) / len(meanscor), meaniter / 3] + j)
        # print([xx[3] for xx in allvar])
    # pick the grid point with the lowest mean validation error
    allvar2 = np.array(allvar)
    tmpsit = [x[0] for x in allvar]
    params = paralist[allvar2[:, 1].argmin()]

    # final fit on all training rows (early stopping is monitored on the training set itself)
    xgbtrain = xgb.DMatrix(df_train[feature], df_train['label'])
    xgbtest = xgb.DMatrix(df_test[feature])
    watchlist = [(xgbtrain, 'train'), (xgbtrain, 'test')]
    num_rounds = int(allvar[allvar2[:, 1].argmin()][2]) + 10  # mean best iteration plus a margin
    model = xgb.train(params, xgbtrain, num_rounds, watchlist, early_stopping_rounds=15)
    # print('best score and params for this mall:', model.best_score, listvar[tmpsit.index(min(tmpsit))])

    df_test['label'] = model.predict(xgbtest)
    df_test['shop_id'] = df_test['label'].apply(lambda x: lbl.inverse_transform(int(x)))
    r = df_test[['row_id', 'shop_id']]
    result = pd.concat([result, r])
    result['row_id'] = result['row_id'].astype('int')
    result.to_csv('jcsub.csv', index=False)
    # bestscore and infile are globals set in the __main__ loop below
    bestscore.append([infile, model.best_iteration, model.best_score, allvar[tmpsit.index(min(tmpsit))][3],
                      allvar[tmpsit.index(min(tmpsit))]])

    with open('jcbest.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(bestscore)

    return result, bestscore


if __name__ == '__main__':
    # mall='m_6167'
    allfile = os.listdir('../data')  # file names from step 1; the step-2 files are saved as 'm' + name
    result = pd.DataFrame()
    bestscore = []
    kk = 0
    for infile in allfile:
        with open('../data2/m' + infile, 'rb') as f:
            m_3 = pickle.load(f)
        with open('../data2/mshop' + infile, 'rb') as f:
            mshop = pickle.load(f)
        feature = genfeature(m_3,
                             ['newlat', 'newlon', 'lweek',
                              'lday'])  # , 'flagtwice', 'lday'])# + ['dis' + x for x in mshop]
        result, bestscore = trainxgbcro(m_3, feature, result)
        kk = kk + 1
        print(kk, 'malls done:', infile)
    print('f')