Author: Zhang Felix
Introduction
In the internet age, online shopping has created a revolution for consumers, growing rapidly year by year and offering a golden age of shopping. Account sharing is a common phenomenon in online shopping, and people share accounts for many purposes: sellers share their accounts with employees to sell more efficiently, and buyers share their accounts with family, friends, and roommates for a more convenient user experience. In this paper, we present a logic-based study that defines account-sharing behavior in the eBay marketplace. Using data from a large eBay behavioral data platform, we show that there are discernible and noteworthy patterns of account sharing. We also build an account-sharing classification model that captures shared accounts using user demographic, pre-switch, post-switch, switch-transition, and meta-session-level features. Our findings show that the model attains strong account-sharing detection accuracy, can improve the personalization experience on eBay, and enables analytics for device switching.
RELATED WORK
In this paper our focus is on identifying account-sharing users. Related work falls into two areas: (1) connecting sessions across device switches, and (2) classifying account-sharing behavior.
Multi-screen usage is key to the account-sharing model. The concept of multi-screen behavior was introduced by Google several years ago, and device switching has since been studied intensively; behavioral data from server logs have proven extremely valuable for studying how people switch from one device to another. In the eBay marketplace there are three device categories: desktop, cellphone, and tablet. We connect sessions that switch from one device category to another, and in this project we call such connected sessions a "meta session." There are three types of meta session.
- Sequential: session 2 starts within 30 minutes after session 1 ends.
- Overlapping: session 2 starts before session 1 ends and ends after it.
- Subsuming: session 1 fully contains session 2.
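The three meta-session types above can be sketched as a small classification function over session intervals. This is an illustrative sketch, not eBay's implementation; session times and the function name are assumptions, while the 30-minute sequential window comes from the text.

```python
# Classify the relationship between a pre-switch session (session 1)
# and a post-switch session (session 2) into the three meta-session
# types: sequential, overlapping, or subsuming.

SEQUENTIAL_GAP_SECS = 30 * 60  # 30-minute window for a sequential switch

def meta_session_type(s1_start, s1_end, s2_start, s2_end):
    """Return 'sequential', 'overlapping', 'subsuming', or None.

    Times are in seconds; session 1 is the pre-switch session and
    session 2 is the post-switch session.
    """
    if s1_start <= s2_start and s2_end <= s1_end:
        return "subsuming"      # session 1 fully contains session 2
    if s2_start < s1_end < s2_end:
        return "overlapping"    # session 2 begins before session 1 ends
    if 0 <= s2_start - s1_end <= SEQUENTIAL_GAP_SECS:
        return "sequential"     # session 2 starts within 30 minutes
    return None                 # too far apart to form a meta session
```

Sessions whose gap exceeds the 30-minute threshold are not linked into a meta session at all.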
Modeling
In order to build the account-sharing classification model, we created five groups of features.
Feature Group 1 - User Demographic Features
Specification: This group of features captures basic attributes of an eBay user.
- Buyer/Seller Indicator
- Gender
- Age Group
- Number of Children
- Single/Family Indicator
- Seller Level (CSS)
Feature Group 2 - Pre-Switch Features
Specification: This group of features describes the pre-switch session.
- Device category
- Avg leaf male fraction
- Avg leaf female fraction
- Session Duration
Feature Group 3 - Post-Switch Features
Specification: This group of features describes the post-switch session.
- Device category
- Avg leaf male fraction
- Avg leaf female fraction
- Session duration
Feature Group 4 - Switch Transition Features
Specification: This group of features describes transition attributes between the pre-switch session and the post-switch session.
- Distance
- Moving Speed
- Sequential gap duration
- Overlap gap duration
- Event Gap Count for 1 & 2 second threshold
- Switch hour bucket
- Device pair frequent count
- Leaf gender variance score
- Meta Category similarity Score
- Notification pair indicator
Feature Group 5 - Meta Session Features
Specification: This group of features describes the account's overall daily activity within the meta session.
- # unique devices
- # unique cellphones
- # unique desktops
- # unique tablets
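The meta-session counts above are per-account, per-day distinct-device tallies. A minimal sketch of how they could be derived from event logs, assuming a simple tabular schema (the column names and sample rows here are illustrative, not eBay's actual schema):

```python
import pandas as pd

# Illustrative event log: one row per device event, with the account,
# day, device identifier, and device category.
events = pd.DataFrame({
    "account_id": ["a1", "a1", "a1", "a2"],
    "date": ["2015-06-01", "2015-06-01", "2015-06-01", "2015-06-01"],
    "device_id": ["d1", "d2", "d3", "d9"],
    "device_category": ["cellphone", "desktop", "cellphone", "tablet"],
})

# Total unique devices per account per day.
totals = events.groupby(["account_id", "date"])["device_id"].nunique()

# Unique devices per account per day, broken out by device category
# (cellphone / desktop / tablet); missing categories become 0.
per_cat = (events.groupby(["account_id", "date", "device_category"])["device_id"]
                 .nunique()
                 .unstack(fill_value=0))
```

Each row of `per_cat` then supplies the per-category counts, and `totals` the overall unique-device count, for one account-day.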
We added the feature groups step by step to train our GBM model and compared the resulting models on specificity versus sensitivity.
- M1: user demographic features + pre-switch session features
- M2: M1 + post-switch features
- M3: M2 + between-switch transition features
- M4: M3 + meta-session-level features
- M5: ensemble model based on 8 M4 models
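The incremental design of M1 through M4 can be sketched by training the same gradient-boosting classifier on a growing set of feature columns and comparing ROC AUC. This sketch uses scikit-learn on synthetic data; the column counts per group and all data are illustrative assumptions standing in for the real feature groups.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 24-predictor dataset.
X, y = make_classification(n_samples=1000, n_features=24,
                           n_informative=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Hypothetical cumulative column counts for M1..M4:
# demographic + pre-switch, + post-switch, + transition, + meta-session.
cumulative_cols = [8, 12, 20, 24]

aucs = []
for n_cols in cumulative_cols:
    gbm = GradientBoostingClassifier(n_estimators=50, random_state=0)
    gbm.fit(X_tr[:, :n_cols], y_tr)
    proba = gbm.predict_proba(X_te[:, :n_cols])[:, 1]
    aucs.append(roc_auc_score(y_te, proba))   # one AUC per model M1..M4
```

Comparing `aucs` across the four models mirrors the ROC comparison reported below for M1 through M4.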
GBM Parameter Tuning
We tuned the model parameters by grid search and obtained the best values: trees = 350, depth = 3, and shrinkage = 0.15. For all further model tuning, we applied these three parameters to all models.
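The original tuning output below comes from R's caret package; an equivalent grid search can be sketched in scikit-learn, where `n_estimators`, `max_depth`, and `learning_rate` correspond to the number of trees, interaction depth, and shrinkage. The data and the reduced parameter grid here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 24-predictor training data.
X, y = make_classification(n_samples=1000, n_features=24, random_state=0)

# 3-fold CV with ROC AUC scoring, matching the resampling setup below.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 350],      # number of trees
        "max_depth": [1, 3],             # interaction depth
        "learning_rate": [0.05, 0.15],   # shrinkage
    },
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
best = grid.best_params_   # best combination found on this data
```

The full search in the paper sweeps wider ranges (trees 100-500, depth 1-7, shrinkage 0.05-0.2), as the tables below show.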
Number of Trees = 350:
Stochastic gradient boosting; 67,500 samples; 24 predictors; 2 classes ('No', 'Yes'); no pre-processing; resampling: 3-fold cross-validation (sample sizes 44999, 45001, 45000).
Resampling results across tuning parameters:
n.trees | ROC | Sens | Spec | ROC SD | Sens SD | Spec SD |
100 | 0.958 | 0.973 | 0.472 | 0.00064 | 0.000983 | 0.00284 |
150 | 0.961 | 0.971 | 0.512 | 0.000583 | 0.00121 | 0.0046 |
200 | 0.962 | 0.969 | 0.55 | 0.000406 | 0.000684 | 0.00861 |
250 | 0.963 | 0.968 | 0.568 | 0.000547 | 5.69e-05 | 0.0103 |
300 | 0.963 | 0.967 | 0.582 | 0.000506 | 0.000591 | 0.00743 |
350 | 0.964 | 0.966 | 0.595 | 0.000618 | 0.000402 | 0.00395 |
400 | 0.964 | 0.966 | 0.599 | 0.00063 | 0.000421 | 0.00409 |
450 | 0.964 | 0.966 | 0.599 | 0.000767 | 0.000427 | 0.00686 |
500 | 0.964 | 0.966 | 0.605 | 0.000733 | 0.000733 | 0.0112 |
Interaction Depth = 3:
67,500 samples; 24 predictors; 2 classes ('No', 'Yes'); no pre-processing; resampling: 3-fold cross-validation (sample sizes 45000, 44999, 45001).
Resampling results across tuning parameters:
interaction.depth | ROC | Sens | Spec | ROC SD | Sens SD | Spec SD |
1 | 0.943 | 0.978 | 0.338 | 0.00238 | 0.00233 | 0.0177 |
3 | 0.963 | 0.965 | 0.597 | 0.00136 | 0.000994 | 0.00711 |
5 | 0.964 | 0.964 | 0.628 | 0.00159 | 0.00108 | 0.00653 |
7 | 0.964 | 0.963 | 0.629 | 0.00155 | 0.00158 | 0.00479 |
Shrinkage = 0.15:
Stochastic gradient boosting; 67,500 samples; 24 predictors; 2 classes ('No', 'Yes'); no pre-processing; resampling: 3-fold cross-validation (sample sizes 45000, 44999, 45001).
Resampling results across tuning parameters:
shrinkage | ROC | Sens | Spec | ROC SD | Sens SD | Spec SD |
0.05 | 0.961 | 0.97 | 0.531 | 0.000171 | 0.00197 | 0.0166 |
0.1 | 0.963 | 0.967 | 0.588 | 0.000417 | 0.00118 | 0.0107 |
0.15 | 0.964 | 0.965 | 0.597 | 0.0004 | 0.00115 | 0.00869 |
0.2 | 0.964 | 0.964 | 0.607 | 0.000871 | 0.00155 | 0.00758 |
Ensemble Model
Under-Sampling
Because the positive and negative classes are imbalanced (the ratio is nearly 1:8), we used under-sampling to partition the data into 8 training sets and built 8 GBM models.
Mixture Model
We blend the 8 models together to generate a predicted positive-label probability, and set the cutoff at 90% to improve precision.
Mixing with the Rule Engine
To compensate for the sacrificed recall, we combined the rule engine's output with the model predictions to produce the final labels.
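The under-sampling and blending steps can be sketched as follows: split the negatives into 8 disjoint folds, pair each fold with all positives to train 8 balanced GBMs, then average their predicted positive probabilities and apply the 0.9 cutoff. This is an illustrative scikit-learn sketch on synthetic data; the fold construction and model sizes are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced data, roughly 1 positive to 8 negatives.
X, y = make_classification(n_samples=1800, weights=[8 / 9, 1 / 9],
                           random_state=0)
pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]

# Under-sampling: partition the negatives into 8 disjoint folds.
rng = np.random.default_rng(0)
neg_folds = np.array_split(rng.permutation(neg), 8)

# Train one GBM per fold, each on all positives + one negative fold.
models = []
for fold in neg_folds:
    idx = np.concatenate([pos, fold])
    m = GradientBoostingClassifier(n_estimators=50, random_state=0)
    m.fit(X[idx], y[idx])
    models.append(m)

# Blend: mean positive-class probability across the 8 models,
# then apply a 0.9 cutoff to favor precision.
blended = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
labels = (blended >= 0.9).astype(int)
```

The high cutoff trades recall for precision, which is why the rule engine is mixed back in to recover recall in the final labels.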
ROC comparison:
Model | Model Description | ROC |
M1 | user demographics + pre_switch features | 0.6273 |
M2 | M1 + post_switch features | 0.6558 |
M3 | M2 + between_switch transition features | 0.9381 |
M4 | M3 + meta session level features | 0.938 |
M5 | 8 M4 Ensemble Model | 0.9696 |
Variable Importance Chart:
Feature | Overall |
mob_dev_cnt | 100 |
pc_dev_cnt | 71.4528 |
seq_gap_dur | 24.3231 |
notif_pair_as_label | 20.7925 |
tab_dev_cnt | 15.6279 |
device_pair_cnt | 9.9352 |
overlap_gap_dur | 7.6 |
meta_categ_similarity_score | 6.8111 |
ttl_dev_cnt | 5.7397 |
sec_gap_pct | 3.5903 |
sec_gap_cnt | 2.308 |
to_sess_dur | 1.5664 |
from_sess_dur | 1.1491 |
leaf_gender_diff_score | 1.0783 |
to_avg_female_pct | 1.0188 |
is_buyer | 0.5936 |
to_avg_male_pct | 0.424 |
user_age | 0.3801 |
from_avg_female_pct | 0.3113 |
switch_hour_bucket | 0.2983 |