Chapter 2 Data Collection, Sampling, and Preprocessing
2.1 Introduction
GIGO: the garbage in, garbage out principle (messy data leads to messy analytical models)
2.2 Types of data sources
- Transactional data:
  stored in OLTP (online transaction processing) relational databases
  often summarized into RFM (recency, frequency, monetary) variables
- Contractual, subscription, or account data:
  stored in a CRM (customer relationship management) database
  a source of sociodemographic information: slow-moving data dimensions
  sources to retrieve sociodemographic or factual data: subscription data, data poolers, survey data, publicly available data sources
- Data poolers:
  gather data and sell it to interested customers
  build predictive models and sell the output of these models as risk scores
  e.g. Experian, Equifax, CIFAS, Dun & Bradstreet, Thomson Reuters
  the FICO score uses data from Experian, Equifax, and TransUnion
- Surveys:
  online: Facebook, LinkedIn, Twitter
  offline
- Behavioral information:
  fast-moving data / dynamic characteristics
  includes: customer preferences / usage information / frequencies of events / trend variables … turnover / solvency / number of employees
- Unstructured data: text documents
- Unstructured data: contextual or network information
- Qualitative, expert-based data:
  a popular example of applying expert-based validation is checking the univariate signs of a regression model
- Publicly available data:
  such as macroeconomic data (GDP, inflation, unemployment) / weather observations / social media data from Facebook, Twitter, LinkedIn …
2.3 Merging data sources
rows: instances / observations / lines
columns: variables / fields / characteristics / attributes / indicators / features
keys: shared identifier fields used to join the different data sources
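A minimal sketch of merging two data sources on a shared key with pandas; the table and column names (transactions, crm, customer_id) are hypothetical:

```python
import pandas as pd

# Hypothetical transactional and CRM extracts sharing a customer_id key
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 35.5, 900.0, 15.0],
})
crm = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "age": [34, 51, 28],
})

# Left join keeps every transaction row; customers missing from the CRM
# extract get NaN for the sociodemographic columns
merged = transactions.merge(crm, on="customer_id", how="left")
print(merged)
```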
2.4 Sampling
a good sample should be representative of the future entities on which the analytical model will be run → this comes down to choosing the optimal time window
stratified sampling: sample separately within predefined strata (e.g., fraud vs. no fraud) so the sample preserves the desired proportions (see the sketch below)
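A minimal sketch of stratified sampling with pandas, assuming a hypothetical binary fraud label as the stratification variable:

```python
import pandas as pd

# Hypothetical data set with a rare "fraud" class
df = pd.DataFrame({
    "amount": range(1000),
    "fraud": [1 if i % 100 == 0 else 0 for i in range(1000)],
})

# Stratified sample: draw 10% from each class so the fraud rate
# in the sample matches the fraud rate in the full data set
sample = df.groupby("fraud", group_keys=False).sample(frac=0.10, random_state=42)
print(sample["fraud"].value_counts(normalize=True))
```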
2.5 Types of data elements
- continuous data: defined on an interval (e.g., income, amount)
- categorical data:
  - nominal: no meaningful ordering among the values (e.g., marital status, payment type)
  - ordinal: the values can be ordered (e.g., credit rating)
  - binary: only two possible values (e.g., fraud yes/no)
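A minimal sketch of how these element types might be declared in pandas; the data frame and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical claims data illustrating the different element types
df = pd.DataFrame({
    "amount": [120.0, 35.5, 900.0],                      # continuous
    "marital_status": ["single", "married", "single"],   # nominal
    "credit_rating": ["AA", "B", "A"],                   # ordinal
    "fraud": [0, 1, 0],                                  # binary
})

# Declare the categorical columns explicitly; the ordinal one gets an order
df["marital_status"] = df["marital_status"].astype("category")
df["credit_rating"] = pd.Categorical(
    df["credit_rating"], categories=["B", "A", "AA"], ordered=True
)
print(df.dtypes)
```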
2.6 Visual data exploration and exploratory statistical analysis
2.7 Benford’s Law
the frequency distribution of the first digit in many real-life data sets complies with Benford's law, which is expressed as follows:
$$p(d) = \log_{10}\left(\frac{d+1}{d}\right) = \log_{10}(d+1) - \log_{10}(d), \qquad d \in \{1, \dots, 9\}$$
a partially negative rule: if the law is not satisfied, it is probable that the involved data were manipulated or tampered with, and further investigation or testing is required. Conversely, a data set that complies with Benford's law can still be fraudulent.
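A minimal sketch, in Python, of comparing observed first-digit frequencies against the Benford distribution; the amounts series is hypothetical:

```python
import numpy as np
import pandas as pd

# Theoretical Benford probabilities for first digits 1..9
digits = np.arange(1, 10)
benford = np.log10(1 + 1 / digits)

# Hypothetical positive amounts to test
amounts = pd.Series([132.5, 18.0, 1450.0, 23.9, 310.0, 99.0, 12.4, 560.0])

# Extract the leading digit of each amount and compute observed frequencies
first_digit = amounts.astype(str).str.lstrip("0.").str[0].astype(int)
observed = first_digit.value_counts(normalize=True).reindex(digits, fill_value=0)

# Large deviations from the Benford frequencies flag data for further inspection
print(pd.DataFrame({"benford": benford, "observed": observed.values}, index=digits))
```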
2.8 Descriptive statistics
- continuous variables:
  - basic descriptive statistics:
    - mean
    - median
    - variation / standard deviation
    - percentile values
  - specific descriptive statistics:
    - skewness: symmetry/asymmetry of a distribution
    - kurtosis: peakedness/flatness of a distribution
- categorical variables:
  - basic descriptive statistics:
    - mode: the most frequently occurring value
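A minimal sketch of computing these descriptive statistics with pandas on hypothetical continuous and categorical variables:

```python
import pandas as pd

# Hypothetical continuous and categorical variables
df = pd.DataFrame({
    "amount": [12.0, 15.5, 14.2, 250.0, 13.8, 16.1],
    "payment_type": ["card", "cash", "card", "card", "transfer", "card"],
})

# Basic and specific descriptive statistics for a continuous variable
print(df["amount"].mean(), df["amount"].median(), df["amount"].std())
print(df["amount"].quantile([0.25, 0.5, 0.75]))  # percentile values
print(df["amount"].skew(), df["amount"].kurt())  # skewness, kurtosis

# Mode of a categorical variable
print(df["payment_type"].mode())
```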
2.9 Missing values
- origins: nonapplicable information / undisclosed information / an error during merging
- dealing with missing values (see the sketch below):
  - replace (impute): with the average or median of the known values, or with regression-based imputation
  - delete: if the information is missing at random and has no meaningful interpretation or relationship to the target
  - keep: if the fact that a value is missing means something in itself
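A minimal sketch of the replace and delete options using pandas; the data frame and its columns are hypothetical:

```python
import pandas as pd

# Hypothetical data with missing ages and a missing amount
df = pd.DataFrame({
    "age": [34, None, 51, None, 28],
    "amount": [120.0, 35.5, None, 900.0, 15.0],
})

# Replace (impute) missing ages with the median of the known values
df["age"] = df["age"].fillna(df["age"].median())

# Delete rows where the remaining missing values carry no information
df = df.dropna(subset=["amount"])
print(df)
```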
2.10 Outlier detection and treatment
- valid observations (e.g., a salary of $1,000,000) vs. invalid observations (e.g., an age of 300)
- unidimensional vs. multivariate outliers
- detection:
  - univariate outlier detection (see the sketch below):
    - calculate the minimum/maximum values, histograms, box plots
    - box plot rule: an observation lies too far from the edges of the box, i.e., more than 1.5 * IQR beyond Q1 or Q3, with IQR = Q3 - Q1 (the interquartile range)
    - z-score rule: compute $z_i = (x_i - \mu)/\sigma$ and flag the observation if $|z_i| > 3$
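A minimal sketch of the two univariate detection rules (z-score and box plot/IQR) on a hypothetical, synthetically generated variable:

```python
import numpy as np
import pandas as pd

# Hypothetical variable: 20 regular values plus one extreme value
rng = np.random.default_rng(0)
x = pd.Series(np.append(rng.normal(15.0, 1.0, 20), 250.0))

# z-score rule: flag observations with |z| > 3
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

# Box-plot (IQR) rule: flag observations more than 1.5 * IQR beyond Q1 or Q3
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```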