Chapter 2 Data Collection, Sampling, and Preprocessing
2.1 Introduction
GIGO: the garbage in, garbage out principle (messy data leads to messy analytical models)
2.2 Types of data sources
- Transactional data:
  stored in OLTP (online transaction processing) relational databases
  often summarized into RFM (recency, frequency, monetary) variables
- Contractual, subscription, or account data:
  stored in a CRM (customer relationship management) database
  a source of sociodemographic information: slow-moving data dimensions
  sources to retrieve sociodemographic or factual data: subscription data, data poolers, survey data, publicly available data sources
- Data poolers:
  gather data and sell it to interested customers
  build predictive models and sell the output of these models as risk scores
  e.g. Experian, Equifax, CIFAS, Dun & Bradstreet, Thomson Reuters
  the FICO score uses data from Experian, Equifax, and TransUnion
- Surveys:
  online: Facebook, LinkedIn, Twitter
  offline
- Behavioral information:
  fast-moving data / dynamic characteristics
  includes: customer preferences / usage information / frequencies of events / trend variables … turnover / solvency / number of employees
- Unstructured data: text documents
- Unstructured data: contextual or network information
- Qualitative, expert-based data:
  a popular example of applying expert-based validation is checking the univariate signs of a regression model
- Publicly available data:
  such as macroeconomic data (GDP, inflation, unemployment) / weather observations / social media data from Facebook, Twitter, LinkedIn …
2.3 Merging data sources
rows: instances / observations / lines
columns: variables / fields / characteristics / attributes / indicators / features
keys: shared identifier fields used to join the different data sources
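A minimal sketch of merging two data sources on a shared key with pandas; the table and column names (transactions, crm, customer_id) are hypothetical:

```python
import pandas as pd

# Hypothetical transactional and CRM extracts sharing a customer_id key
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 35.5, 900.0, 15.0],
})
crm = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "age": [34, 51, 28],
})

# Left join keeps every transaction row; customers missing from the CRM
# extract get NaN for the sociodemographic columns
merged = transactions.merge(crm, on="customer_id", how="left")
print(merged)
```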
2.4 Sampling
a good sample should be representative of the future entities on which the analytical model will be run → this comes down to choosing the optimal time window
stratified sampling: sample separately within predefined strata (e.g., fraud vs. no fraud) so the sample preserves the desired proportions (see the sketch below)
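A minimal sketch of stratified sampling with pandas, assuming a hypothetical binary fraud label as the stratification variable:

```python
import pandas as pd

# Hypothetical data set with a rare "fraud" class
df = pd.DataFrame({
    "amount": range(1000),
    "fraud": [1 if i % 100 == 0 else 0 for i in range(1000)],
})

# Stratified sample: draw 10% from each class so the fraud rate
# in the sample matches the fraud rate in the full data set
sample = df.groupby("fraud", group_keys=False).sample(frac=0.10, random_state=42)
print(sample["fraud"].value_counts(normalize=True))
```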
2.5 Types of data elements
- continuous data: defined on an interval (e.g., income, amount)
- categorical data:
  - nominal: no meaningful ordering among the values (e.g., marital status, payment type)
  - ordinal: the values can be ordered (e.g., credit rating)
  - binary: only two possible values (e.g., fraud yes/no)
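A minimal sketch of how these element types might be declared in pandas; the data frame and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical claims data illustrating the different element types
df = pd.DataFrame({
    "amount": [120.0, 35.5, 900.0],                      # continuous
    "marital_status": ["single", "married", "single"],   # nominal
    "credit_rating": ["AA", "B", "A"],                   # ordinal
    "fraud": [0, 1, 0],                                  # binary
})

# Declare the categorical columns explicitly; the ordinal one gets an order
df["marital_status"] = df["marital_status"].astype("category")
df["credit_rating"] = pd.Categorical(
    df["credit_rating"], categories=["B", "A", "AA"], ordered=True
)
print(df.dtypes)
```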
2.6 Visual data exploration and exploratory statistical analysis
2.7 Benford’s Law
the frequency distribution of the first digit in many real-life data sets complies with Benford's law, which is expressed as follows:
$$p(d) = \log_{10}\left(\frac{d+1}{d}\right) = \log_{10}(d+1) - \log_{10}(d), \qquad d \in \{1, \dots, 9\}$$
a partially negative rule: if the law is not satisfied, it is probable that the involved data were manipulated or tampered with, and further investigation or testing is required. Conversely, a data set that complies with Benford's law can still be fraudulent.
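A minimal sketch, in Python, of comparing observed first-digit frequencies against the Benford distribution; the amounts series is hypothetical:

```python
import numpy as np
import pandas as pd

# Theoretical Benford probabilities for first digits 1..9
digits = np.arange(1, 10)
benford = np.log10(1 + 1 / digits)

# Hypothetical positive amounts to test
amounts = pd.Series([132.5, 18.0, 1450.0, 23.9, 310.0, 99.0, 12.4, 560.0])

# Extract the leading digit of each amount and compute observed frequencies
first_digit = amounts.astype(str).str.lstrip("0.").str[0].astype(int)
observed = first_digit.value_counts(normalize=True).reindex(digits, fill_value=0)

# Large deviations from the Benford frequencies flag data for further inspection
print(pd.DataFrame({"benford": benford, "observed": observed.values}, index=digits))
```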
2.8 Descriptive statistics
- continuous variables:
  - basic descriptive statistics:
    - mean
    - median
    - variation / standard deviation
    - percentile values
  - specific descriptive statistics:
    - skewness: symmetry/asymmetry of a distribution
    - kurtosis: peakedness/flatness of a distribution
- categorical variables:
  - basic descriptive statistics:
    - mode: the most frequently occurring value
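A minimal sketch of computing these descriptive statistics with pandas on hypothetical continuous and categorical variables:

```python
import pandas as pd

# Hypothetical continuous and categorical variables
df = pd.DataFrame({
    "amount": [12.0, 15.5, 14.2, 250.0, 13.8, 16.1],
    "payment_type": ["card", "cash", "card", "card", "transfer", "card"],
})

# Basic and specific descriptive statistics for a continuous variable
print(df["amount"].mean(), df["amount"].median(), df["amount"].std())
print(df["amount"].quantile([0.25, 0.5, 0.75]))  # percentile values
print(df["amount"].skew(), df["amount"].kurt())  # skewness, kurtosis

# Mode of a categorical variable
print(df["payment_type"].mode())
```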
2.9 Missing values
- origins: nonapplicable information / undisclosed information / an error during merging
- dealing with missing values (see the sketch below):
  - replace (impute): with the average or median of the known values, or with regression-based imputation
  - delete: if the information is missing at random and has no meaningful interpretation or relationship to the target
  - keep: if the fact that a value is missing means something in itself
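A minimal sketch of the replace and delete options using pandas; the data frame and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical data with missing ages and a missing amount
df = pd.DataFrame({
    "age": [34, None, 51, None, 28],
    "amount": [120.0, 35.5, None, 900.0, 15.0],
})

# Replace (impute) missing ages with the median of the known values
df["age"] = df["age"].fillna(df["age"].median())

# Delete rows where the remaining missing values carry no information
df = df.dropna(subset=["amount"])
print(df)
```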
2.10 Outlier detection and treatment
- valid observations (e.g., a salary of $1,000,000) vs. invalid observations (e.g., an age of 300)
- unidimensional vs. multivariate outliers
- detection:
  - univariate outlier detection (see the sketch below):
    - calculate the minimum/maximum values, histograms, box plots
    - box plot rule: an observation lies too far from the edges of the box, i.e., more than 1.5 * IQR beyond Q1 or Q3, with IQR = Q3 - Q1 (the interquartile range)
    - z-score rule: compute $z_i = (x_i - \mu)/\sigma$ and flag the observation if $|z_i| > 3$
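A minimal sketch of the two univariate detection rules (z-score and box plot/IQR) on a hypothetical, synthetically generated variable:

```python
import numpy as np
import pandas as pd

# Hypothetical variable: 20 regular values plus one extreme value
rng = np.random.default_rng(0)
x = pd.Series(np.append(rng.normal(15.0, 1.0, 20), 250.0))

# z-score rule: flag observations with |z| > 3
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

# Box-plot (IQR) rule: flag observations more than 1.5 * IQR beyond Q1 or Q3
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```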