[Reading Notes] Fraud Analytics, Chapter 3

Chp3 Descriptive Analytics for Fraud Detection

3.1 Introduction

Descriptive analytics / unsupervised learning aims at finding unusual, anomalous behavior that deviates from the average behavior or norm.
The norm can mean:

  • the behavior of the average customer at a snapshot in time
  • the average behavior of a given customer across a particular time period
  • a combination of both

In fraud detection, unsupervised learning is often referred to as anomaly detection.

Problem 1: defining the average behavior or norm.
This depends highly on the application field considered.
The boundary between the norm and the outliers is not clear-cut:

  • Reason 1: fraudsters will try to blend into the average or norm as well as possible.
  • Reason 2: the norm may change over time, so the analytical models built need to be continuously monitored and updated in real time.
  • Reason 3: anomalies do not necessarily represent fraudulent observations.

3.2 Graphical Outlier Detection Procedures

  • methods:

    • histogram / box plot: one-dimensional outliers
    • scatter plot: two/three-dimensional outliers
  • complemented by:

    • multidimensional data analysis / online analytical processing (OLAP) facilities
  • OLAP operations, once the cube has been populated from a data warehouse:

    • Roll-up: aggregate data along a dimension to a coarser level (e.g., from month to year)
    • Drill-down: the reverse of roll-up, navigating to more detailed data
    • Slicing: fix one dimension at a particular value and look at the resulting sub-cube
    • Dicing: fix values for several dimensions to select a sub-cube
  • OLAP tools:

    • pivot tables (see the sketch below)
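
A minimal sketch of this kind of OLAP-style exploration with a pandas pivot table; the transaction DataFrame and its column names (merchant, month, amount) are made up for illustration.

# OLAP-style roll-up / slicing / dicing with a pandas pivot table (hypothetical data)
import pandas as pd

df = pd.DataFrame({
    'merchant': ['grocery', 'grocery', 'electronics', 'electronics', 'travel'],
    'month':    ['Jan', 'Feb', 'Jan', 'Feb', 'Jan'],
    'amount':   [120.0, 95.0, 780.0, 20.0, 1500.0],
})

# Roll-up: total amount per merchant category and month
cube = pd.pivot_table(df, values='amount', index='merchant', columns='month', aggfunc='sum')

# Slicing: fix one dimension (keep only January)
january = cube['Jan']

# Dicing: select a sub-cube on several dimensions
sub_cube = cube.loc[['electronics', 'travel'], ['Jan']]
print(cube)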

3.3 Statistical Outlier Detection Procedures

z-scores:

  • Grubbs' test: a more formal test
    • H0: There are no outliers in the data set.

    • H1: There is at least one outlier in the data set.

      Calculate the z-score of every observation; G is the maximum absolute z-score.
      If $G>\frac{N-1}{\sqrt{N}}\sqrt{\frac{t^2_{\frac{\alpha}{2N},N-2}}{N-2+t^2_{\frac{\alpha}{2N},N-2}}}$,
      then the corresponding observation is considered an outlier at significance level $\frac{\alpha}{2N}$.
      When an outlier is detected, it is removed from the data set and the test can be run again.

The test can also be run for multivariate outliers, whereby the z-score is replaced by the Mahalanobis distance:
$\sqrt{(x-\bar{x})^T S^{-1}(x-\bar{x})}$

Key weakness: it assumes an underlying normal distribution.
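
A small sketch of Grubbs' test based on z-scores; the transaction amounts are made up and the helper name grubbs_outlier is not from the book.

# Grubbs' test: flag the most extreme observation if G exceeds the critical value
import numpy as np
from scipy import stats

def grubbs_outlier(x, alpha=0.05):
    x = np.asarray(x, dtype=float)
    N = len(x)
    z = (x - x.mean()) / x.std(ddof=1)       # z-scores
    G = np.max(np.abs(z))                    # Grubbs statistic: maximum absolute z-score
    t = stats.t.ppf(1 - alpha / (2 * N), N - 2)
    G_crit = (N - 1) / np.sqrt(N) * np.sqrt(t**2 / (N - 2 + t**2))
    return G, G_crit, G > G_crit             # True: the most extreme value is an outlier

amounts = [12, 15, 14, 10, 13, 11, 16, 95]   # one suspicious amount
print(grubbs_outlier(amounts))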

3.3.1 Break-point Analysis

Break-point analysis is an intra-account fraud detection method: it indicates a sudden change in account behavior.
First, define a fixed time window.
Second, split the time window into an old and a new part; the old part represents the local model or profile against which the new observations will be compared.
Then, a Student's t-test can be used to compare the averages of the new and old parts, as in the sketch below.
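
A minimal sketch of this idea, assuming made-up daily spending amounts for a single account and using scipy's two-sample t-test.

# Break-point analysis: compare the new part of the window against the old part (local profile)
import numpy as np
from scipy import stats

amounts = np.array([22, 25, 19, 24, 21, 23, 20, 26, 80, 95, 110, 70])  # oldest to newest

old, new = amounts[:8], amounts[8:]           # old part = local profile, new part = recent behavior
t_stat, p_value = stats.ttest_ind(new, old, equal_var=False)

print(t_stat, p_value)
if p_value < 0.01:
    print("Break point: the recent behavior deviates from the account's own profile.")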

3.3.2 Peer-Group Analysis

A peer group is a group of accounts that behave similarly to the target account.

  • Step 1: identify the peer group of a particular account,
    either by prior business knowledge
    or in a statistical way, using statistical similarity metrics / Euclidean-based metrics.
  • Step 2: contrast the behavior of the target account with its peers using a statistical test such as a Student's t-test, or a distance metric (e.g., the Mahalanobis distance), as sketched below.
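
One possible sketch of these two steps; the account features and spending figures are made up, and a one-sample t-test stands in for the peer contrast.

# Peer-group analysis: find the k most similar accounts, then test whether the target deviates from them
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3))            # behavioral features, one row per account
peer_spending = rng.normal(100, 15, size=100)   # recent spending of all accounts
target_spending = 250.0                         # recent spending of the target account

target_idx, k = 0, 10
dist = np.linalg.norm(features - features[target_idx], axis=1)   # Euclidean similarity
peers = np.argsort(dist)[1:k + 1]               # k nearest accounts, excluding the target itself

# Contrast the target's recent spending with its peer group
t_stat, p_value = stats.ttest_1samp(peer_spending[peers], popmean=target_spending)
print(p_value)                                  # a small p-value flags the target as deviating from its peers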

3.3.3 Association Rule Analysis

  • Step 1: identify the frequent item sets:
    $support(X)=\frac{\text{number of transactions supporting } X}{\text{total number of transactions}}$
  • Step 2: quantify the strength of an association rule by means of its confidence (both measures are computed in the sketch below):
    $confidence(X \rightarrow Y) = P(Y|X) = \frac{support(X \cup Y)}{support(X)}$
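
A small sketch computing support and confidence for one candidate rule; the item sets below are made-up insurance-claim examples.

# Support and confidence of the rule X -> Y over a list of transactions (item sets)
transactions = [
    {'claim', 'lawyer A', 'garage B'},
    {'claim', 'lawyer A', 'garage B'},
    {'claim', 'lawyer C'},
    {'claim', 'garage B'},
    {'claim', 'lawyer A'},
]

def support(itemset):
    # fraction of transactions that contain the whole item set
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {'lawyer A'}, {'garage B'}
print('support(X)       =', support(X))
print('support(X and Y) =', support(X | Y))
print('confidence(X->Y) =', support(X | Y) / support(X))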

3.4 Clustering

3.4.1 Introduction

The aim of clustering is to split up a set of observations into segments such that the homogeneity within a segment is maximized(cohesive), and the heterogeneity between segments is maximized(separated).

  • Data used: various types of clustering data can be used:
    • customer/account/transaction characteristics…
    • RFM variables…
    • unstructured information: emails / call records / social media information
  • Data selection:
    avoid excessive amounts of correlated data by applying unsupervised feature selection methods (e.g., based on the Pearson correlation).
    When used for fraud detection,
    the aim is that anomalies end up in small, sparse clusters.
  • Types of clustering:
    • Hierarchical
    • Nonhierarchical

3.4.2 Distance Metrics

  • Continuous data (see the sketch after this list):

    1. Minkowski distance or Lp norm:

      when p = 1 , it’s Manhattan or City Block distance
      when p = 2 , it’s Euclidean distance

    2. Pearson correlation

    3. Cosine measure

  • Categorical data:

    • Binary variables:

      1. Simple matching coefficient (SMC) = (number of matching variables) / (total number of variables)
        A tacit assumption behind the SMC is that both states of the variable are equally important.

      2. Jaccard index = (number of variables where both values are 1) / (number of variables where at least one value is 1)

    • Other categorical variables:

      1. code the categorical variables as 0/1 dummies and apply the Manhattan distance metric.
        coarse classification (based on experience) can be considered to reduce the number of dummy variables.
      2. SMC
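
A minimal sketch computing the measures listed above for two made-up observations.

# Distance/similarity measures for continuous and binary data
import numpy as np

# Continuous variables
x = np.array([1.0, 3.0, 5.0])
y = np.array([2.0, 1.0, 4.0])
manhattan = np.abs(x - y).sum()                        # Minkowski with p = 1
euclidean = np.sqrt(((x - y) ** 2).sum())              # Minkowski with p = 2
pearson = np.corrcoef(x, y)[0, 1]
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Binary variables
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
smc = (a == b).mean()                                  # simple matching: same / total
jaccard = (a & b).sum() / (a | b).sum()                # both 1 / at least one 1

print(manhattan, euclidean, pearson, cosine, smc, jaccard)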

3.4.3 Hierarchical Clustering

Two methods:

  • Divisive: start from the whole data set and iteratively split it into smaller clusters.
  • Agglomerative: start from the individual observations and iteratively merge the closest clusters.

Distance between two clusters:

  • single linkage method: smallest possible distance
  • complete linkage method: largest possible distance
  • average linkage method: average of all possible distances
  • centroid method: distance between the centroids of both clusters
  • Ward's distance: compare the within-cluster sums of squared deviations of $C_i$ and $C_j$ with that of the merged cluster $C_{ij}$:
    $D_{ward}(C_i,C_j)=\sum_{x\in C_i}(x-c_i)^2 + \sum_{x\in C_j}(x-c_j)^2 - \sum_{x\in C_{ij}}(x-c_{ij})^2$

Dendrogram / Scree plot

Advantage: the number of clusters does not need to be specified prior to the analysis.
Disadvantages:
the methods do not scale very well to large data sets;
the interpretation of the clusters is often subjective and depends on the business expert / data scientist.

# Hierarchical (agglomerative) clustering with sklearn
from sklearn import cluster

# df is a DataFrame of numeric features (e.g., the USArrests columns loaded below)
clustering = cluster.AgglomerativeClustering(linkage='ward', n_clusters=3)
clustering.fit(df)

# Plotting a dendrogram with scipy
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

# Import Data
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

# Plot
plt.figure(figsize=(16, 10), dpi=80)
plt.title("USArrests Dendrograms", fontsize=22)
dend = shc.dendrogram(shc.linkage(df[['Murder', 'Assault', 'UrbanPop', 'Rape']], method='ward'),
                      labels=df.State.values, color_threshold=100)
plt.xticks(fontsize=12)

plt.show()

3.4.4 Example of Hierarchical Clustering Procedures

  • single linkage results in thin, long, elongated clusters.
  • complete linkage makes the clusters tighter, more balanced, and spherical.
  • average linkage prefers to merge clusters with small variances.
  • the centroid method yields yet another, different result.
  • Ward's method prefers to merge clusters with a small number of observations and often results in balanced clusters.

3.4.5 k-means Clustering

k-means clustering is a nonhierarchical procedure:

  • step1: select k observations as initial cluster centroids(seeds)
  • step2: assign each observation to the cluster that has the closest centroid
  • step3: when all observations have been assigned, recalculate the positions of the k centroids
  • step4: repeat until the cluster centroids no longer change or a fixed number of iterations is reached.

Practical advice:

  • try multiple values of k
    and try out different seeds to verify the stability of the clustering solution
  • use the median instead of the mean, since the mean is sensitive to outliers, especially in a fraud detection setting
  • k-means is most often used in combination with a Euclidean distance, which typically results in spherical or ball-shaped clusters.

k-means with sklearn

from __future__ import print_function
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

k-means with scipy
The scipy.cluster.vq module provides four functions:
whiten(obs[, check_finite]): normalize a group of observations on a per-feature basis.
vq(obs, code_book[, check_finite]): assign codes from a code book to observations.
kmeans(obs, k_or_guess[, iter, thresh, …]): performs k-means on a set of observation vectors forming k clusters.
kmeans2(data, k[, iter, thresh, minit, …]): classify a set of observations into k clusters using the k-means algorithm.
Note that whiten simply rescales each feature column to unit standard deviation, i.e.: np.dot(features, np.diag(1/np.array([np.std(features[:,0]), np.std(features[:,1])]))). A short usage sketch follows.
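
A short usage sketch on made-up 2-D data.

# k-means with scipy.cluster.vq
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

rng = np.random.default_rng(1)
features = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

w = whiten(features)               # rescale each column to unit standard deviation
codebook, distortion = kmeans(w, 2)
labels, dist = vq(w, codebook)     # assign each observation to the nearest centroid
print(codebook, distortion)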

3.4.6 Self-Organizing Maps

A SOM (self-organizing map) is a feedforward neural network with two layers: an input and an output layer.
The neurons of the output layer are usually ordered in a two-dimensional rectangular or hexagonal grid.
Each input is connected to all neurons in the output layer with a weight vector $w=[w_1,...,w_N]$ (N is the number of variables).
All weights are randomly initialized.
For a training vector $\boldsymbol{x}$, the weight vector $w_c$ of each neuron c is compared with $\boldsymbol{x}$, using, for example, the Euclidean distance metric.
The neuron whose weight vector is most similar to $\boldsymbol{x}$ in the Euclidean sense is called the best matching unit (BMU).
The weight vectors of the BMU and its neighbors in the grid are then adapted using the following learning rule:
$w_i(t+1)=w_i(t)+h_{ci}(t)[x(t)-w_i(t)]$

  • $t$ represents the time index during training,
  • $h_{ci}(t)$ defines the neighborhood of the BMU c, specifying the region of influence.
    • $h_{ci}$, the neighborhood function, should be a nonincreasing function of time and of the distance from the BMU; some popular choices are:
      $h_{ci}(t) = \alpha(t)$ if $||r_c-r_i||^2 \leq \text{threshold}$, and $0$ otherwise
      $h_{ci}(t) = \alpha(t)\,\exp\left(\frac{-||r_c-r_i||^2}{2\sigma^2(t)}\right)$
    • $\sigma^2(t)$ represents the decreasing radius
    • $\alpha(t) \in [0,1]$ is the learning rate; it can be chosen, for example, as:
      $\alpha(t) = A/(t+B)$ or $\alpha(t) = \exp(-At)$

SOMs can be visualized by:

  • a U-matrix (unified distance matrix)
  • a component plane
'''
A SOM is also called a Kohonen network; supplementary material (in Chinese):
https://www.cnblogs.com/LittleHann/p/7101992.html
Algorithm steps for the toy example below:
1. Each input sample has 2 features (its x and y coordinates); there are 8 input samples, so the input layer has 8 nodes.
2. The data should end up in two classes, so two output nodes are defined, each also with 2 features (x, y).
3. Randomly initialize the two output nodes W.
4. for each input node INPUT {
        for each output node W {
           compute the Euclidean distance between the current input node i and output node w;
        }
        take the output node w closest to the current input node i (smallest Euclidean distance) as the winning node;
        adjust the features of w toward the current input node (a step size (learning rate) controls the magnitude);
    }
    decay the step size (learning rate);
5. Repeat step 4 until the output nodes W stabilize (the step size becomes very small).
'''
#---------- Algorithm implementation in code -----------

# -*- coding:utf-8 -*-
# Simple SOM example (original author: 自由爸爸)
import random
import math

input_layer = [[39, 281], [18, 307], [24, 242], [54, 333], [322, 35], [352, 17], [278, 22], [382, 48]]  # input nodes
category = 2


class Som_simple_zybb():
    def __init__(self, category):
        self.input_layer = input_layer  # input samples
        self.output_layer = []  # output nodes
        self.step_alpha = 0.5  # step size (learning rate), initialized to 0.5
        self.step_alpha_del_rate = 0.95  # decay rate of the step size
        self.category = category  # number of clusters
        self.output_layer_length = len(self.input_layer[0])  # number of features per output node (2)
        self.d = [0.0] * self.category

    # Initialize output_layer
    def initial_output_layer(self):
        for i in range(self.category):
            self.output_layer.append([])
            for _ in range(self.output_layer_length):
                self.output_layer[i].append(random.randint(0, 400))

    # Core SOM logic:
    # compute the distance between one input sample and all output nodes, stored in self.d
    def calc_distance(self, a_input):
        self.d = [0.0] * self.category
        for i in range(self.category):
            w = self.output_layer[i]
            for j in range(len(a_input)):
                self.d[i] += math.pow((a_input[j] - w[j]), 2)  # the square root is omitted

    # Return the index of the minimum value in a list
    def get_min(self, a_list):
        min_index = a_list.index(min(a_list))
        return min_index

    # Move the winning output node toward the current input node (all dimensions, here x and y, are updated)
    def move(self, a_input, min_output_index):
        for i in range(len(self.output_layer[min_output_index])):
            # The decay of the update rate with the distance from the winner neuron is not considered here
            self.output_layer[min_output_index][i] = self.output_layer[min_output_index][i] + self.step_alpha * (a_input[i] - self.output_layer[min_output_index][i])

    # One SOM training pass over all inputs
    def train(self):
        for a_input in self.input_layer:
            self.calc_distance(a_input)
            min_output_index = self.get_min(self.d)
            self.move(a_input, min_output_index)

    # Repeat train() until the output nodes stabilize
    def som_looper(self):
        generate = 0
        while self.step_alpha >= 0.0001:  # runs for about 167 generations
            self.train()
            generate += 1
            print("Generation: {0}  step size: {1}  output nodes: {2}".format(generate, self.step_alpha, self.output_layer))
            self.step_alpha *= self.step_alpha_del_rate  # decay the step size


if __name__ == '__main__':
    som_zybb = Som_simple_zybb(category)
    som_zybb.initial_output_layer()

    som_zybb.som_looper()

3.4.7 Clustering with Constraints

semi-supervised clustering / clustering with constraints:

  • observation-level constraints:
    • a must-link constraint: enforces that two observations are assigned to the same cluster
    • a cannot-link constraint: separates them into different clusters
      EXAMPLE: in a fraud detection setting, one can force a few observations that are known to be fraudulent into the same cluster; a small sketch for checking such constraints follows this list.
  • cluster-level constraints:
    • a minimum separation / delta constraint: specifies that the distance between any pair of observations in two different clusters must be at least delta
    • an epsilon constraint: specifies that each observation in a cluster with more than one observation must have another observation within a distance of at most epsilon.
      EXAMPLE: another constraint is the requirement to have balanced clusters, whereby each cluster contains the same number of observations.
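
A minimal sketch, not a full constrained-clustering algorithm: it only checks whether a given cluster assignment respects must-link and cannot-link constraints; the constraints and labels are made up.

# Check observation-level constraints against a cluster assignment
must_link = [(0, 1), (2, 3)]      # pairs that must end up in the same cluster (e.g., known fraud cases)
cannot_link = [(0, 4)]            # pairs that must end up in different clusters

def violations(labels, must_link, cannot_link):
    v = [('must-link', i, j) for i, j in must_link if labels[i] != labels[j]]
    v += [('cannot-link', i, j) for i, j in cannot_link if labels[i] == labels[j]]
    return v

labels = [0, 0, 1, 0, 0]          # hypothetical clustering result
print(violations(labels, must_link, cannot_link))   # -> [('must-link', 2, 3), ('cannot-link', 0, 4)]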

3.4.8 Evaluating and Interpreting Clustering Solutions

Evaluation: there exists no universal criterion.

  • A statistical evaluation: the SSE (sum of squared errors)
    $SSE = \sum_{i=1}^K \sum_{x\in C_i} dist^2(x,m_i)$, where $m_i$ is the centroid of cluster $i$
  • Besides statistical evaluation, a clustering solution will also be evaluated in terms of its interpretation. Various options are available (see the sketch below):
    1. compare cluster distributions with population distributions across all variables on a cluster-by-cluster basis.
    2. build a decision tree with the ClusterID as the target variable.
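
A minimal sketch of both ideas with sklearn: the SSE of a k-means solution (its inertia_ attribute) and a decision tree on the ClusterID; the data and feature names are made up.

# SSE of a clustering solution and a decision tree to interpret the clusters
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("SSE:", km.inertia_)                       # inertia_ equals the SSE defined above

tree = DecisionTreeClassifier(max_depth=2).fit(X, km.labels_)   # ClusterID as target
print(export_text(tree, feature_names=['recency', 'monetary']))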

3.5 One-Class SVMs

One-class SVMs try to maximize the distance between a hyperplane and the origin.
The idea is to separate the majority of the observations from the origin.
The observations that lie on the other side of the hyperplane, closest to the origin, are then considered as outliers.

A hyperplane is defined as: $w^T\varphi(x) = \rho$

Normal observations lie above the hyperplane and outliers below it,
or in other words normal observations / outliers will return a positive / negative value for: $w^T\varphi(x) - \rho$

One-class SVMs then aim at solving the following optimization problem (the standard $\nu$-formulation):
$\min_{w,e,\rho}\ \frac{1}{2}\sum_{i=1}^{N} w_i^2 + \frac{1}{\nu n}\sum_{i=1}^{n} e_i - \rho$, subject to $w^T\varphi(x_i) \geq \rho - e_i,\ e_i \geq 0$

The error variables $e_i$ are introduced to allow observations to lie on the side of the hyperplane closest to the origin.
The parameter $\nu$ is a regularization term.

Mathematically, it can be shown that the distance between the hyperplane and the origin equals $\rho/||w||$.
This distance is then maximized by minimizing $\frac{1}{2}\sum_{i=1}^N w_i^2 - \rho$.

As with SVMs for supervised learning, the optimization problem can be solved by formulating its dual variant, which also here yields a quadratic programming (QP) problem, and by applying the kernel trick.
By again using Lagrangian optimization, the following decision function is obtained:
$f(x) = \text{sign}\left(\sum_{i=1}^{n}\alpha_i K(x_i, x) - \rho\right)$
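
A minimal usage sketch with sklearn's OneClassSVM on made-up transaction features (the regularization parameter ν corresponds to nu below).

# One-class SVM: fit on mostly normal behavior, then score new observations
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (200, 2))             # mostly "normal" behavior
X_new = np.array([[0.1, -0.2], [6.0, 6.5]])      # one normal and one anomalous observation

oc_svm = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_train)
print(oc_svm.predict(X_new))             # +1 = normal side of the hyperplane, -1 = outlier
print(oc_svm.decision_function(X_new))   # positive for normal observations, negative for outliers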

