Machine Learning Method In eBay Bot Detection
Zhao Kevin, Pengju Yan
Data Services and Solutions
September 4, 2014
A bot (Internet bot) is a software application that runs automated tasks over the Internet, for example to scrape site content or to attack a site. For analysis purposes, a large site like eBay needs to filter out bot sessions before doing any analysis. Bot detection is extremely important at eBay: every day, more than 65% of sessions are filtered out as bot sessions. In this article, we first describe the current eBay bot rules, then introduce a dimension-reduction machine learning method called PCA, which helps generate useful features for separating bot and non-bot sessions. Finally, we evaluate these features with a support vector machine.
Keywords: machine learning, bot detection, PCA, SVM
Introduction to current eBay bot rules
In eBay's everyday log records, more than 65% of sessions are bot sessions. This number might sound unbelievable to some people, but it is the truth. Thanks to our bot filters, we are able to remove those sessions from the total traffic, so site analysts get a much cleaner data set. In this first section, we give a brief introduction to the current bot detection system.
Sojourner bot rules
The Sojourner platform is where eBay stores sessionized logs used for analysis. If you use the ubi session table of Sojourner records at eBay, you will be familiar with the column "BOT_FLAG"; it is used as a filter to remove all bot sessions before any analysis.
Figure 1: SOJ bot rules
In Figure 1, we list all 14 Sojourner bot rules. Bot rules 1, 3, 4, 9, 10, 12, and 15 are intraday-level bot rules, which means they can detect whether a session is a bot as soon as the session begins. The other 7 rules are EOD (end-of-day) bot rules: those bots can only be marked based on aggregated summaries at the end of the day.
Bot rules 1 and 5 together make up 97% of total bot traffic. Bot rule 1 covers the so-called self-declared bot sessions. Large companies such as Google or Baidu crawl content from our site so they can build their indexes and users can find eBay through search. These sessions do no harm to the site, but for user-behavior analysis we still need to filter them out.
Bot rule 5 is an EOD rule. At the end of each day, all sessions are grouped by their agent and IP combination. If every session in a combination is a single-click session (i.e., it has only one valid event), the single-click session count for the combination is larger than 50, and the IP is not empty, then all sessions in that agent and IP combination are marked as bot sessions.
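As a rough illustration (not the production implementation), the rule can be sketched in R; the data frame sessions and its columns agent, ip, and valid.event.count are assumptions made for the example:

# Hedged sketch of EOD bot rule 5; column names are assumed, not the real
# Sojourner schema. 'sessions' holds one row per session for a single day.
library(dplyr)

rule5.combos <- sessions %>%
  filter(ip != "") %>%                                  # no empty IP
  group_by(agent, ip) %>%
  summarise(n.sessions       = n(),
            all.single.click = all(valid.event.count == 1),
            .groups = "drop") %>%
  filter(all.single.click, n.sessions > 50)

# Every session whose (agent, ip) pair appears in rule5.combos
# would be marked as a bot session.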
There are also other bot rules: for example, if within one session the search count or view count is larger than 400, the session is marked as a bot; if any 30 consecutive events have an average dwell time of less than 0.75 and the valid event click ID count is larger than 30, the session is also marked as a bot. These filters remove a large number of bots and help build a healthy analytical environment for eBay analysts.
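A minimal sketch of how the dwell-time check could look in R, assuming dwell.times is a vector of per-event dwell times for one session (the real rule also checks the valid event click ID count, approximated here by the vector length):

# Hypothetical sketch of the intraday dwell-time rule: flag a session if any
# 30 consecutive events have an average dwell time below 0.75.
is.dwell.bot <- function(dwell.times, window = 30, threshold = 0.75) {
  if (length(dwell.times) < window) return(FALSE)
  # trailing moving average over every run of 30 consecutive events
  rolling.mean <- stats::filter(dwell.times, rep(1 / window, window), sides = 1)
  any(rolling.mean < threshold, na.rm = TRUE)
}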
CLAV bot rules
These bot rules were created by Caleb and his team. If you use CLAV_SESSION a lot, the data you get has been filtered by both the Sojourner bot rules and the CLAV bot rules. This set of rules complements the current Sojourner bot rules and filters out additional bots.
Some of the CLAV bot rules deal with CS piggybacking and others deal with auto-bidding. You can find detailed information about all 22 bot rules at the end of this article (Figure 3) for reference.
Difference between Sojourner bot rules and CLAV bot rules
CLAV bot rules will apply filters to sessions which have already been filtered by Sojourner bot rules
This means CLAV bot rules will apply additional bot filters to data which has been filtered by Sojourner bot rules.
CLAV bot rules have no order, while Sojourner bot rules are ordered
Sojourner bot rules are applied in order: bot rule 15 takes action first, then all sessions that pass rule 15 are checked by bot rule 1, then bot rule 3, and so on. The CLAV bot rules, in contrast, have no order: all rules are stored in a 64-bit binary field, and one session can be marked by multiple bot rules. For example, if a session satisfies bot rules 1 and 3 but not bot rule 2, it is marked as "...000101" in that field, where the last three bits correspond to bot rules 1 to 3.
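A toy illustration of this encoding in R (the exact bit layout is an assumption used only to make the example concrete):

# Pack fired CLAV bot rules into one integer bit field:
# rule 1 -> least significant bit, rule 2 -> next bit, and so on.
fired.rules <- c(1L, 3L)                 # session satisfies rules 1 and 3
clav.flag <- 0L
for (r in fired.rules) {
  clav.flag <- bitwOr(clav.flag, bitwShiftL(1L, r - 1L))
}
clav.flag                                          # 5, i.e. binary ...000101
bitwAnd(clav.flag, bitwShiftL(1L, 2L - 1L)) != 0   # FALSE: rule 2 not set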
Machine learning system for bot detection
In this section, we apply a machine learning method to bot detection that helps separate bot and non-bot sessions.
Motivation
The current bot detection system is good, but can we do better? All current bot rules are based on previous experience and metric analysis. We wonder whether we can build a machine learning system that combines many bot rules and adds more features, so that it can be more accurate than the current rule-based system.
We have a set of candidate features, such as combinations of Sojourner and CLAV bot rules. In this article, we describe a feature we created ourselves using a machine learning method called principal component analysis (PCA). This method can be applied in many other areas whenever you have too many features and want to capture the main directions along which the largest variance lies.
After a preliminary analysis, we found that bot sessions and non-bot sessions differ a lot in their page ID sequences. So we tried to find a way to separate bot and non-bot sessions based on their different page ID structures.
Big Rocks
The main task is to create a measure or feature that best describes the page ID sequence differences between sessions. This is not easy because the number of events per session is not fixed and there are thousands of page IDs. It is impractical to count all page IDs and then compare the occurrence of each page ID between sessions.
Methodology
Feature Requirements:
The aim is to create a feature that fully represents the page ID structure and is also suitable for a machine learning model. We think the feature needs to meet the following requirements:
1. It shall be a fixed-length vector;
2. It should capture not only the occurrence count of each page ID, but also the sequence/order of the page IDs.
Algorithm
Based on this idea, we implement our algorithm as below (an R sketch of steps 1 to 3 follows the list):
1. Input: a sequence of page IDs. We grab all page IDs in order from each session in Sojourner and use them as input. For example, we can have a page ID sequence (p5, p1, p1, p9, ..., p3), where pi denotes a page ID.
2. Transfer the page ID sequence to a raw vector: raw-vector = (#p1, #p2, ..., #pN), where #pi denotes the number of occurrences of page ID pi. This can be normalized.
3. Append to the raw-vector the bi-gram page ID occurrence counts (#(p1, p1), #(p1, p2), ..., #(pN, pN)), where #(pi, pj) denotes the number of occurrences of the adjacent page ID pair (pi, pj).
4. Similarly, append the tri-gram (normalized) counts to the raw-vector.
5. Since there are too many dimensions in the resulting vector, we reduce its size and keep only the principal components of our features as model features. We can apply dimension reduction, say PCA, on the raw-vector to obtain a final fixed-length feature vector; its dimension can be as small as 20.
6. Evaluation: we can train a classifier, say an SVM, using this feature and check whether the classification accuracy is satisfactory.
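As noted above, here is a minimal R sketch of steps 1 to 3; the example sequence and the 10-page vocabulary are made up for illustration:

# Turn one session's page ID sequence into normalized unigram and bi-gram counts.
page.ids <- c("p5", "p1", "p1", "p9", "p3")        # example input sequence
vocab    <- paste0("p", 1:10)                       # assumed page ID vocabulary

# unigram counts over the fixed vocabulary (fixed-length by construction)
uni <- table(factor(page.ids, levels = vocab))

# bi-gram counts over adjacent page ID pairs
pairs    <- paste(head(page.ids, -1), tail(page.ids, -1), sep = "->")
bi.vocab <- as.vector(outer(vocab, vocab, paste, sep = "->"))
bi       <- table(factor(pairs, levels = bi.vocab))

# normalized, fixed-length raw vector (tri-grams would be appended the same way)
raw.vector <- c(uni / sum(uni), bi / sum(bi))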
Additional Implementation details:
The number of unique page IDs on the eBay site is greater than 8,000, so if we computed the raw-vector over the full page ID space, PCA would take too long to finish. Caleb maintains all page IDs in a table called P_SOJ_CL_V.PAGES; that table contains 4 kinds of page families that group the page IDs according to certain criteria. We chose page family 4 in our model.
PCA algorithm code analysis
PCA algorithm brief description
Definition:
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
We will skip the tedious math of PCA here; it relies on SVD (singular value decomposition), a form of matrix decomposition. If you want the details, you can find everything about the algorithm on Wikipedia.
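For reference, a minimal sketch of the decomposition the code below relies on (notation only; the code applies it to the uncentered raw-vector matrix): SVD factors the feature matrix $X$ as

$$X = U \Sigma V^\top,$$

where the columns of $V$ are the principal directions and the squared singular values $\sigma_i^2$ measure the variance captured along each direction. Keeping the first $k$ columns of $V$ whose $\sigma_i^2$ sum to 99% of $\sum_i \sigma_i^2$, and projecting as $X V_k$, yields the reduced features; this mirrors what the following code does with svd.d and svd.v.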
Code analysis:
You can directly use the svd function in R to decompose a feature matrix. If your raw feature matrix is stored in raw.matrix, the following R commands compute the SVD:
#svd decomposition
svd.500k <- svd(raw.matrix)
svd.d <- svd.500k$d
svd.v <- svd.500k$v
After we obtain d and v, we first calculate how many dimensions to keep so that we do not lose too much information after PCA. We use the following R code to determine the number of dimensions needed to retain 99% of the total variance:
# total variance = sum of squared singular values
total.variance <- as.vector(svd.d %*% svd.d)
accumulated.variance <- 0
dimension.kept <- 0
for (i in 1:length(svd.d)) {
  accumulated.variance <- accumulated.variance + svd.d[i] * svd.d[i]
  if (accumulated.variance >= 0.99 * total.variance) {
    dimension.kept <- i
    writeLines(sprintf("Dimension reduced to %d (99%% variance)", dimension.kept))
    break
  }
}
The dimension we kept in our experiment is 18.
Finally, we multiply the original data matrix by the retained columns of v to get a dimension-reduced data matrix. This matrix provides the features we use to separate bot and non-bot sessions.
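Reusing the objects defined above, the projection step is a one-liner (a sketch; it assumes the first dimension.kept columns of v are retained):

# project the raw feature matrix onto the retained principal directions
pca.features <- raw.matrix %*% svd.v[, 1:dimension.kept]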
Evaluation
It is highly possible that the bot and non-bot sessions are not linearly separable, so we used an SVM (support vector machine) with a Gaussian kernel to evaluate the effectiveness of the PCA feature. The SVM package we used is "kernlab".
We selected the radial basis function kernel when training the SVM classifier. There is a free parameter σ that needs to be tuned; fortunately, kernlab implements a smart mechanism to select σ automatically, so we only need to tune the cost parameter C. The parameter is tuned by 5-fold cross-validation on a 1% (9,796-session) sub-dataset. The classification error rates are given in Figure 2. Based on these results, we set C = 1 for the final experiment.
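A minimal sketch of how such a grid search could look with kernlab; C.grid, x.sub (the 1% subsample as a data frame with a group column), and the grid values are assumptions, and only C = 1 was used in the end:

library(kernlab)

# candidate values for the cost parameter C (assumed grid)
C.grid <- c(0.1, 1, 10, 100)

cv.error <- sapply(C.grid, function(C.value) {
  fit <- ksvm(group ~ ., data = x.sub, type = "C-svc", kernel = "rbfdot",
              C = C.value, cross = 5)
  cross(fit)   # 5-fold cross-validation error reported by kernlab
})

C.grid[which.min(cv.error)]   # pick the C with the lowest CV error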
We then trained a model with 5-fold cross-validation on 48,980 sessions; the error rate is 0.14, which means that the page ID sequence PCA features alone achieve a classification accuracy of 86%. These features therefore look promising for bot detection.
Figure 2: classification error rates
The R code for SVM model training is:
# load the PCA-reduced page ID sequence features
data <- load("data/spids-normalized/session-page-id-sequence-pca.RData")
session.page.id.sequence <- get(data)

# columns used as predictors: the PCA features plus the event count
x.indexes <- c()
x.indexes <- c(x.indexes,
               grep("spids.pca", colnames(session.page.id.sequence), fixed = TRUE))
x.indexes <- c(x.indexes,
               grep("^events$", colnames(session.page.id.sequence)))
y.index <- grep("^group$", colnames(session.page.id.sequence))

# sample 5% of the sessions as the training set
num.training.samples <- round(0.05 * nrow(session.page.id.sequence))
row.indexes <- sample(nrow(session.page.id.sequence), num.training.samples)

x <- as.data.frame(session.page.id.sequence)[row.indexes, x.indexes]
y <- as.data.frame(session.page.id.sequence)[row.indexes, y.index]
x <- cbind(x, group = y)
x$events <- log(x$events)

library(kernlab)
svp <- ksvm(group ~ ., data = x, type = "C-svc", kernel = "rbfdot",
            C = 1, cross = 5)
Summary
In this article, we first introduced the current eBay bot rules, including the Sojourner bot rules and the CLAV bot rules. We then described our effort to build a machine learning system that combines the current bot rules with additional features to identify bot and non-bot sessions. We used page ID sequence feature generation as an example of how we build the model: PCA for dimension reduction and an SVM to evaluate the result. The model achieves 86% accuracy, which is high; the remaining 14% of misclassifications may be due to flaws in the current bot rules or in our system, and adding more features to the model should improve it.
Figure 3: CLAV bot rules