[Reading Notes] Fraud Analysis, Chp2

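This post walks through the data collection, sampling, and preprocessing steps of fraud analytics, covering key concepts and methods such as the different types of data sources, merging data, sampling methods, types of data elements, data visualization, Benford's Law, descriptive statistics, missing-value handling, outlier detection, standardization, categorization, weights-of-evidence coding, variable selection, principal component analysis, RIDITs, and PRIDIT analytics.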

Chp2 Data Collection, Sampling, and Preprocessing

2.1 Introduction

GIGO: the garbage in, garbage out principle

2.2 Types of data sources

  • Transactional data:
    stored in an OLTP (online transaction processing) relational database
    RFM (recency, frequency, monetary) variables
  • Contractual, subscription, or account data:
    stored in a CRM (customer relationship management) database
    a source of sociodemographic information: slow-moving data dimensions
    sources of sociodemographic or factual data: subscription data, data poolers, survey data, publicly available data sources
  • Data poolers:
    gather data and sell it to interested customers
    build predictive models and sell the output of these models as risk scores
    e.g., Experian, Equifax, CIFAS, Dun & Bradstreet, Thomson Reuters
    FICO score: uses data from Experian, Equifax, and TransUnion
  • Surveys:
    online: Facebook, LinkedIn, Twitter
    offline
  • Behavioral information:
    fast-moving data / dynamic characteristics
    includes: customer preferences / usage information / frequencies of events / trend variables … turnover / solvency / number of employees
  • Unstructured data: text documents
  • Unstructured data: contextual or network information
  • Qualitative, expert-based data:
    a popular example of expert-based validation is checking the univariate signs of a regression model
  • Publicly available data:
    such as macroeconomic data (GDP, inflation, unemployment) / weather observations / social media data from Facebook, Twitter, LinkedIn …
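
The RFM variables listed under transactional data can be derived directly from a transaction log. The sketch below is only an illustration: the field names (`customer_id`, `date`, `amount`), the sample rows, and the choice of "total amount" as the monetary value are all assumptions, not taken from the book.

```python
from datetime import date

# Hypothetical transaction log; field names and values are made up for illustration.
transactions = [
    {"customer_id": "A", "date": date(2024, 1, 5), "amount": 40.0},
    {"customer_id": "A", "date": date(2024, 3, 1), "amount": 25.0},
    {"customer_id": "B", "date": date(2024, 2, 10), "amount": 300.0},
]

def rfm(transactions, as_of):
    """Summarize each customer by recency, frequency, and monetary value."""
    summary = {}
    for t in transactions:
        s = summary.setdefault(
            t["customer_id"],
            {"last": t["date"], "frequency": 0, "monetary": 0.0},
        )
        s["last"] = max(s["last"], t["date"])  # most recent transaction date
        s["frequency"] += 1                    # number of transactions
        s["monetary"] += t["amount"]           # assumption: monetary = total amount
    return {
        cid: {
            "recency_days": (as_of - s["last"]).days,
            "frequency": s["frequency"],
            "monetary": s["monetary"],
        }
        for cid, s in summary.items()
    }

rfm_table = rfm(transactions, as_of=date(2024, 3, 31))
```

Other definitions (for example, average amount per transaction as the monetary value) fit the same structure.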

2.3 Merging data sources

rows: instances / observations / lines
columns: variables / fields / characteristics / attributes / indicators / features
keys: the identifier fields used to link rows from different sources
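
As a rough illustration of merging two sources on a key, here is a minimal left join in plain Python. The tables, the `customer_id` key, and the `left_join` helper are hypothetical. Rows without a match get `None`, which is one way missing values can appear after a merge.

```python
# Hypothetical tables: each row is a dict, and "customer_id" is the shared key.
transactions = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 2, "amount": 35.5},
    {"customer_id": 1, "amount": 60.0},
    {"customer_id": 3, "amount": 990.0},
]
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
]

def left_join(left, right, key):
    """Attach the columns of `right` to every row of `left` that shares `key`."""
    lookup = {row[key]: row for row in right}  # assumes the key is unique in `right`
    right_columns = {col for row in right for col in row if col != key}
    merged = []
    for row in left:
        match = lookup.get(row[key], {})
        combined = dict(row)
        for col in right_columns:
            combined[col] = match.get(col)  # None marks a value missing after the merge
        merged.append(combined)
    return merged

merged = left_join(transactions, customers, "customer_id")
```

Customer 3 has no entry in `customers`, so its `age` comes back as `None` rather than the row being dropped.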

2.4 Sampling

A good sample should be representative of the future entities on which the analytical model will be run, so choosing the optimal time window matters.
stratified sampling: sample within each class of the target so the class proportions are preserved
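
A minimal sketch of stratified sampling, assuming the stratification variable is a binary `fraud` label (the field names, the 1% fraud rate, and the helper are illustrative assumptions). Each stratum is sampled at the same fraction, so the rare class keeps its share.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=0):
    """Draw the same fraction from every stratum defined by `label_key`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[row[label_key]].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row per stratum so a rare class never vanishes.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data set: 10 fraud cases among 1000 observations.
rows = [{"id": i, "fraud": 1 if i < 10 else 0} for i in range(1000)]
sample = stratified_sample(rows, "fraud", fraction=0.1)
```

With these numbers the sample holds 100 rows, one of them fraudulent, which matches the 1% rate of the full set.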

2.5 Types of data elements

  1. continuous data
  2. categorical data
    2.1 nominal
    2.2 ordinal
    2.3 binary

2.6 Visual data exploration and exploratory statistical analysis

2.7 Benford’s Law

The frequency distribution of the first digit in many real-life data sets complies with Benford’s law, expressed as follows:
$p(d)=\log_{10}\left(\frac{d+1}{d}\right)=\log_{10}(d+1)-\log_{10}(d), \quad d=1,\dots,9$
A partially negative rule: if the law is not satisfied, it is probable that the data involved were manipulated or tampered with, and further investigation or testing is required. Conversely, a data set that complies with Benford’s law can still be fraudulent.
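
The formula above can be checked against data with a few lines of Python. This is only a sketch: the powers of 2 stand in for real amounts (they are a classic sequence that follows the law closely), and the "largest gap" summary is an illustrative choice, not a formal test.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities p(d) = log10((d + 1) / d)."""
    return {d: math.log10((d + 1) / d) for d in range(1, 10)}

def first_digit(value):
    """Return the leading non-zero digit of a non-zero number."""
    return int(next(ch for ch in str(abs(value)) if ch in "123456789"))

def first_digit_frequencies(values):
    """Observed relative frequency of each leading digit 1..9."""
    counts = Counter(first_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Stand-in data: powers of 2 (an assumption for illustration, not fraud data).
amounts = [2 ** k for k in range(1, 101)]
expected = benford_expected()
observed = first_digit_frequencies(amounts)
# Largest absolute deviation between observed and expected frequencies.
max_gap = max(abs(observed[d] - expected[d]) for d in range(1, 10))
```

A large `max_gap` on real data would be a reason to investigate further, in line with the partially negative reading of the rule.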

2.8 Descriptive statistics

  • continuous variables:
    • basic descriptive statistics:
      • mean
      • median
      • variation/ standard deviation
      • percentile values
    • specific descriptive statistics:
      • skewness: symmetry/asymmetry of a distribution
      • kurtosis: peakedness/flatness of a distribution
  • categorical variables:
    • mode: the most frequently occurring value

2.9 Missing values

  • causes: nonapplicable information / undisclosed information / an error during merging
  • dealing with missing values:
    • replace (impute): with the mean or median of the known values, or with regression-based imputation
    • delete: if the information is missing at random and has no meaningful interpretation or relationship to the target
    • keep: if a missing value is itself meaningful
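
A small sketch combining two of the options above: median imputation (replace) plus an indicator column that records which values were missing (keep). The `income` column, the `_missing` suffix, and the use of `None` for a missing value are assumptions made for the example.

```python
import statistics

def impute_median(rows, column):
    """Fill missing values with the median of the known ones and flag them."""
    known = [row[column] for row in rows if row[column] is not None]
    fill = statistics.median(known)
    imputed = []
    for row in rows:
        new_row = dict(row)  # copy, so the original rows stay untouched
        # Keep the information that the value was missing as a separate flag.
        new_row[column + "_missing"] = row[column] is None
        if row[column] is None:
            new_row[column] = fill
        imputed.append(new_row)
    return imputed

# Hypothetical records; None stands for a missing income.
rows = [
    {"id": 1, "income": 3000},
    {"id": 2, "income": None},
    {"id": 3, "income": 4200},
    {"id": 4, "income": 3900},
    {"id": 5, "income": None},
]
imputed = impute_median(rows, "income")
```

Keeping the flag lets a later model still pick up on missingness when it is related to the target.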

2.10 Outlier detection and treatment

  • valid observations (salary $1,000,000) / invalid observations (age 300)
  • univariate / multivariate

  • detection:
    • univariate outliers detection:
      • inspect the minimum/maximum values, histograms, and box plots
      • box plot: flag values too far away from the edges of the box (more than 1.5 × IQR), where IQR = Q3 − Q1 (interquartile range)
      • calculate z-scores, $z_i=(x_i-\mu)/\sigma$, and check whether $z_i>3$
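
Both univariate checks can be sketched with the standard library. The code makes several assumptions: the quartiles come from `statistics.quantiles` with its default method, the z-score uses the population standard deviation, and the absolute value of the z-score is compared with 3 so that unusually low values are flagged too. The ages are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR beyond the quartiles (the box edges)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

def z_score_outliers(values, threshold=3.0):
    """Values whose z-score (x - mu) / sigma exceeds the threshold in absolute value."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical ages, including the invalid value 300.
ages = [23, 25, 31, 35, 38, 41, 44, 47, 52, 58, 61, 300]
by_iqr = iqr_outliers(ages)
by_z = z_score_outliers(ages)
```

On this sample both rules flag only the age of 300. On small samples the z-score rule is less sensitive, because the outlier itself inflates the standard deviation.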