Data Mining Fundamentals: A Summary

Data Engineering Outlines

Content

0. Background

Before the Start: 6 Objectives in this module

  1. Understanding of data lifecycle
  2. Understanding of the need for data lifecycle management
  3. Understanding of data engineering techniques
  4. Understanding of technical, ethical and societal issues related to data collection, storage, access, dissemination and maintenance
  5. Understanding the main principles of data analysis and how they can be applied in various application areas
  6. Demonstration of relevant standards and typical practice in data engineering

1. Data Lifecycle Management

1.1 [Why] Motivation - (8 reasons)
  • Why DLM? 8 reasons
    1. Nowadays, businesses deal with data rather than paper-based documents
    2. The demands of a complex information ecosystem are rising
      • ==> Data needs to be accessed, edited and transmitted by a great variety of applications, which challenges our understanding and results
    3. As techniques develop, new issues concerning information have arisen
      • ==> Due to the sudden evolution of information
    4. An enterprise can only meet demand as far as its ability to control information allows
      • ==> Information is a main asset of any company that survives in this competitive world; an enterprise has no choice but to adapt and find mechanisms to control its share of the complex information ecosystem
    5. Data generation and storage have become simple and cheap
      • We are overwhelmed with electronic information
    6. Businesses are interested in generating, storing and analyzing data to improve competitiveness
    7. For legal and business purposes, the retention of certain data sets has become an issue of urgency
    8. Not all data is useful; it needs to be processed
1.2 [What] Definition - (2 aspects)

What is DLM? 2 concepts

  • DLM is the process of managing data throughout its lifecycle, from conception until disposal
  • DLM encompasses the strategies of filtering, discarding and utilizing data
1.3 [How] Develop a DLM plan - (4 aspects)

How to implement a DLM plan? 4 aspects

A DLM plan should cover the procedures for generating, organizing, resourcing and managing data, and consider the following issues:

  1. Identification of Objectives in Maintaining Data
    • Decide on the objectives in maintaining its electronic records
    • Categorize the records
    • Devise retention and security policies that meet those objectives
  2. Minimalism
    • Principle: data should be discarded unless there is a good reason for retaining it
    • Strategy: take a hard look at both the types of data retained and the regulatory constraints relating to them
  3. Information Security
    • Protecting its information assets is fundamental for the survival of any organization. Appropriate information security procedures must ensure that shared records are not improperly accessed or edited, or stolen and sold to spammers
  4. Distribution control
    • Unauthorized access can largely be eliminated by employing network security controls
1.4 [How] The Phases of the Data Lifecycle - (4 phases)

(Figure: the data lifecycle)
As the figure shows, there are 4 major phases in the data lifecycle:

  1. Data collection
  2. Repository and Storage
  3. Pre-processing
  4. Dissemination
a) Data collection methods

[What] Data collection methods are wide and varied, and no one method of collection is inherently better than any other. The pros and cons of an approach can only be weighed up in view of a rich and complex context.

[How] Some popular methods:

  1. Surveys
  2. Interviews
  3. Observations
  4. Unobtrusive Measures
  5. Experimentation
b) Data Storage

[What] Storage is the container resource for data objects.

[How] Data could be stored in several ways:

  1. servers
  2. laptop computers
  3. handheld devices
  4. old back-up tapes
  5. CD-ROM disks
  6. thumb drives
  7. cellular phones and pagers

(Figure: data storage)

[Access] Logically, data is collected from clients. After processing on a server or another administering node, the data is sent on and eventually stored in databases.

c) Pre-processing

[Background] Normally, an automated or semi-automated method is applied to process commercial or scientific data. Typically, this uses relatively simple, repetitive activities to process large volumes of similar information.

[Purpose] To obtain high-quality data

  1. This procedure is the preparation for data dissemination (formatting), improving the efficiency and ease of later steps
  2. This process may involve obtaining a descriptive summarization of the data to help people identify its general characteristics, as well as the presence of noise and outliers
  3. Data will be inspected in terms of consistency, completeness and accuracy.

[How] Data Pre-processing Techniques

  1. data cleaning
  2. data integration
  3. data transformation
  4. data reduction
d) Data Dissemination methods
| Item | Instance |
|---|---|
| Digital Content for sale / List/report/queries | Web Forms |
| Inter/Intra Net | PDA |
| Business intelligence | Personalized Email |
| Forecasting | Publications |
| New Product/Service | File Transfer |
| Knowledge Creation/Mining | Conversation |
(Figure: an example of a Complex Information Ecosystem)

1.5 Issues in the Development of DLM Procedures
  • Identification of Objectives in Maintaining Data
  • Application of the Minimalism Principle
  • Consideration of Information Security Issues
  • Consideration of Distribution Control Issues
  • Etc.
1.6 [Conclusion] Data Lifecycle Focus

The data lifecycle focus is on the following functions:

  1. Data Acquisition and Extraction
  2. Descriptive Data Summarization
  3. Data Pre-processing
    • Data cleaning (and quality)
    • Data integration
  4. Data Quality Management
    • Overview of data quality and existing methodologies
1.7 Reading Materials

2. Data Quality and Cleaning

2.1 [BG] Data Quality
  • DQ plays a critical role in all business and governmental applications
  • It is a relevant performance issue for operating processes, decision-making activities, and inter-organizational cooperation requirements
  • The evolution of information systems into network-based structures has made the issue of DQ more complex
2.2 [BG] DQ in the Data Lifecycle
  • DQ should be considered in all phases of the data lifecycle, but it is typically addressed during the data pre-processing step, which precedes knowledge extraction and decision-making activities (data dissemination)
  • The DQ techniques applied to data during the pre-processing phase are what we call Data Cleaning
2.3 Data Cleaning

[What is Data Cleaning?] Data Cleaning involves the filling in of missing values, the smoothing of noisy data, the identification or removal of outliers, and the resolving of inconsistencies and redundancies caused by data integration.

[Relevance] The Relevance of Data Cleaning

No quality data, no quality mining results!

  • High-quality decisions must be based on high-quality data
  • Data cleaning is the number one problem in data warehousing
    • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

[BG] Data Quality Methodologies

  • The literature provides a wide range of techniques to assess and improve the quality of data
  • Due to the diversity and complexity of these techniques, research has focused on defining methodologies that help select, customise and apply DQ assessment and improvement techniques
  • A DQ methodology is a set of guidelines and techniques that defines a rational process to assess and improve the quality of data

[How] How to compare different types of DQ methodologies

  1. Phases and steps
    • i.e., state reconstruction, assessment/measurement and improvement
  2. Strategies and techniques: can be data-driven or process-driven
    • Data-driven strategies improve the quality of data by directly modifying the values of the data, for example through the acquisition of new data, record linkage, etc.
    • Process-driven strategies improve quality by redesigning the processes that create or modify data, for example through process control and process redesign.
  3. Types of data
    • For example: structured, unstructured and semi-structured
  4. Types of Information systems
    • For example: distributed, peer-to-peer, data warehouse, etc.
  5. Dimensions and metrics
  6. Etc.

[How] How to measure the quality of data

A well-accepted multidimensional view:

  1. Accuracy
  2. Completeness
  3. Consistency
  4. Timeliness
  5. Believability
  6. Value added
  7. Interpretability
  8. Accessibility

[Why] Why is data in the real world dirty?

“Duplicate records also need data cleaning!”

| Item | Reason | Cause |
|---|---|---|
| incomplete data | measuring Completeness | 1. not-applicable data values when collected; 2. different considerations between the time when the data was collected and when it is analyzed; 3. human, hardware or software problems |
| noisy data | measuring Accuracy | 1. faulty data collection instruments; 2. human or computer error at data entry; 3. errors in data transmission |
| inconsistent data | measuring Consistency | 1. different data sources; 2. functional dependency violations |
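
To make a few of these dimensions concrete, here is a minimal measurement sketch in Python, assuming a small pandas DataFrame; the column names, the plausible age range and the email pattern are illustrative assumptions rather than rules from the notes.

```python
import pandas as pd

# Hypothetical example data; column names and values are illustrative only.
customers = pd.DataFrame({
    "name":  ["Ann", "Bob", None, "Dan"],
    "age":   [34, -5, 28, None],            # -5 violates an accuracy rule
    "email": ["ann@x.com", "bob@x", "cee@x.com", "dan@x.com"],
})

# Completeness: fraction of non-missing cells per column.
completeness = customers.notna().mean()

# Accuracy (rule-based): share of ages falling in a plausible range.
accuracy_age = customers["age"].between(0, 120).mean()

# Consistency (format rule): share of emails matching a simple pattern.
consistency_email = customers["email"].str.contains(r"^\S+@\S+\.\S+$", na=False).mean()

print("Completeness per column:\n", completeness)
print("Age accuracy:", accuracy_age)
print("Email consistency:", consistency_email)
```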
2.4 Reading Materials
  • C. Batini, C. Cappiello, C. Francalanci and A. Maurino: Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, Vol. 41, No. 3 Article 16, July 2009.
  • H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan and C. Shahabi: Big data and its technical challenges. Communications of the ACM, Volume 57, Issue 7, 2014, Pages 86-94.
  • Naumann, F. (2014). Data Profiling Revisited. ACM SIGMOD Record, 42(4), 40-49.

3. Data Acquisition and Information Extraction


3.1 Data Acquisition

[Where] Sensors, simulations and scientific experiments can produce large volumes of data.

[What] Applying filtering and compression criteria to the incoming data.

[How] For static data, build an initial profile of the data coming from the relevant sources at first acquisition. The profile should cover both data and metadata, including any values inserted, deleted or updated, any entities inserted or deleted, any properties that were added or re-purposed, etc.

  • The decision regarding what data to filter is tied to the intended purpose of the data in intimate ways (so a fixed filtering strategy will often not work well)

    Where possible, the filtering decision should be pushed, based on the common uses of the data, out to the edge devices where the data is generated

The changes of data and metadata:

  1. what (e.g. a value)
  2. where (e.g. record and column information)
  3. when (e.g. timestamp)
  4. how (e.g. by update - presence of old and new values)

    • Information about such changes often indicates the level of quality and volatility of the data.

3.2 Data Ingestion Techniques

[What] The process of obtaining, importing, and processing data for later use or storage in a database

[Major Challenge] Maintain reasonable speed and efficiency in data ingestion, when numerous data sources exist in diverse formats.

[Solution for big data] Ingest data incrementally as it arrives, while performing data filtering and compression (see the sketch below).
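
A minimal sketch of that incremental filter-and-compress idea, assuming records arrive as Python dicts; the sensor-style fields (`device_id`, `temp`) and the acceptance range are invented for illustration.

```python
import gzip
import json
from typing import Iterable

def ingest(records: Iterable[dict], out_path: str) -> int:
    """Incrementally filter and compress records as they arrive."""
    kept = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for rec in records:
            # Filtering criterion (hypothetical): drop implausible readings.
            if rec.get("temp") is None or not (-50 <= rec["temp"] <= 60):
                continue
            # Compression happens transparently through the gzip file handle.
            out.write(json.dumps(rec) + "\n")
            kept += 1
    return kept

# A small in-memory stream standing in for a real data source.
stream = [{"device_id": 1, "temp": 21.5}, {"device_id": 2, "temp": 999.0}]
print(ingest(stream, "ingested.jsonl.gz"))  # -> 1 record kept
```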

3.3 Information Extraction

[What] A process to pull out the required information from the underlying source and to express this information in a structured form suitable for analysis

[Challenges]

  1. Specifying information extraction tasks
  2. Optimizing the execution when processing new data (see the sketch below)
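
As an illustration of the first challenge (specifying an extraction task), here is a tiny sketch that pulls dates and amounts out of free text with regular expressions and emits structured rows; the patterns and field names are assumptions, not a method prescribed by the notes.

```python
import re
from typing import Dict, List

# Hypothetical extraction rules; patterns and field names are illustrative.
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
AMOUNT = re.compile(r"\$(\d+(?:\.\d{2})?)")

def extract(texts: List[str]) -> List[Dict[str, object]]:
    rows = []
    for text in texts:
        date = DATE.search(text)
        amount = AMOUNT.search(text)
        rows.append({
            "date": date.group(1) if date else None,
            "amount": float(amount.group(1)) if amount else None,
            "source": text,
        })
    return rows

print(extract(["Invoice 2014-07-01 for $86.94 received."]))
# [{'date': '2014-07-01', 'amount': 86.94, 'source': 'Invoice 2014-07-01 for $86.94 received.'}]
```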
3.4 Data Lake

[What] A data lake is an architecture focused on data storage and access whose main feature is a repository capable of storing vast quantities of data in various formats

[Resources] Examples of data sources: Web server logs, databases, social media and third-party data, ingested through either batch processing or real-time processing of streaming data.

[How] Data is no longer restrained by initial schema decisions, and can be exploited more freely by its consumers

[DAaaS] The capabilities of a data lake allow IT to provide Data and Analytics as a Service (DAaaS) linked to the lake's repository, where IT acts as the data provider (supplier) and analysts act as data consumers


4. Data Pre-processing

4.1 Basic ideas of Data Pre-processing

[What] Some basic ideas about data pre-processing

The main objective of data pre-processing is to prepare the data for dissemination (particularly in the form of knowledge creation or analysis through mining). This process may involve obtaining a descriptive summarization (profile) of the data to help us identify its general characteristics, as well as the presence of noise or outliers.

[Result] High-quality data, which eases the mining process (a step in dissemination)

4.2 Data Pre-processing Techniques (4 kinds)

These techniques can be used together; a combined sketch follows the list below.

  1. Data Cleaning: associated with the removal of noise and the correction of inconsistencies in the data.
  2. Data Integration: associated with the merging of data from multiple sources into a coherent data store, e.g., a data warehouse.
  3. Data Transformation: associated with the transformation of the data into a format which improves the accuracy and efficiency of dissemination methods.
  4. Data Reduction: associated with the reduction of data size by, for example, aggregating, eliminating redundant features, or clustering.
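
The combined sketch below chains the four techniques on two hypothetical pandas DataFrames; the column names and the min-max normalization choice are illustrative assumptions.

```python
import pandas as pd

# Hypothetical source tables holding the same logical content.
sales_eu = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [10.0, None, 15.0]})
sales_us = pd.DataFrame({"cust_id": [3, 4],    "amount": [20.0, 1000.0]})

# 1. Data integration: merge the sources into one coherent store.
sales = pd.concat([sales_eu, sales_us], ignore_index=True)

# 2. Data cleaning: fill missing values and drop exact duplicates.
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())
sales = sales.drop_duplicates()

# 3. Data transformation: min-max normalize amount into [0, 1].
lo, hi = sales["amount"].min(), sales["amount"].max()
sales["amount_norm"] = (sales["amount"] - lo) / (hi - lo)

# 4. Data reduction: aggregate per customer to shrink the data set.
reduced = sales.groupby("cust_id", as_index=False)["amount_norm"].mean()
print(reduced)
```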
4.3 Descriptive Data Summarization

[What] DDS techniques identify the typical properties of your data and highlight which data values should be treated as noise or outliers. Basically, they identify errors and outliers.

[Aim] Provides an overall picture of your data

[Techniques]

  1. measures of the central tendency of data, e.g., mean, median, mode and midrange
  2. measures of data dispersion, e.g., quantiles, interquartile range (IQR, the spread of the middle half of the data) and variance

[How] How can DDS techniques be applied efficiently over large datasets?

  1. Choose the most efficient implementation of each technique
  2. Categorize each technique according to its type (see the next subsection)
4.4 Types of DDS Techniques
  1. Distributive Measures: the measure can be computed for subsets of the data and the partial results then merged into the overall result.

    • sum, count, max and min

  2. Algebraic Measures: computed by applying an algebraic function to one or more distributive measures.

    • mean, weighted average

  3. Holistic Measures: must be computed over the entire data set and cannot be assembled from partial results.

    • e.g. median

I.e., range, the k-th percentile, the median and quartiles. A sketch contrasting distributive, algebraic and holistic computation follows the quartile table below.

[What] IQR = Q3 − Q1

| Quartile | Quantile | Percentile |
|---|---|---|
| 1st quartile (Q1) | 0.25 | 25% |
| 2nd quartile (Q2) | 0.50 | 50% |
| 3rd quartile (Q3) | 0.75 | 75% |
| 4th quartile (Q4) | 1.00 | 100% |
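
The sketch below illustrates why this distinction matters for large data sets: distributive and algebraic measures can be merged from per-partition results, whereas a holistic measure such as the median needs the entire data set. The three partitions are purely illustrative.

```python
import statistics

# Pretend the data set is split across three partitions (e.g. three nodes).
partitions = [[2, 4, 9], [1, 7], [5, 3, 8, 6]]

# Distributive: compute per partition, then merge the partial results.
total = sum(sum(p) for p in partitions)        # sum of sums
count = sum(len(p) for p in partitions)        # sum of counts
overall_max = max(max(p) for p in partitions)  # max of maxes

# Algebraic: derived from a fixed number of distributive measures.
mean = total / count

# Holistic: the median cannot be merged from per-partition medians;
# it must be computed over the entire data set.
all_values = [v for p in partitions for v in p]
median = statistics.median(all_values)

print(total, count, overall_max, mean, median)  # 45 9 9 5.0 5
```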
4.5 Dealing with Skewed Data
a) Numerical Analysis

No single numerical measure of spread, such as the IQR, is very useful for describing skewed distributions, as the spreads of the two sides of such a distribution are unequal.

[How to describe the shape of a distribution (five-number summary)] Use Q1, Q3, the median and the range (min and max) of the data

[How to identify outliers] Use boxplots based on the five-number summary (a sketch follows the table below).

| Problem | Solution |
|---|---|
| shape of distribution | five-number summary (minimum, maximum, Q1, Q3 and median) |
| identify outliers | boxplots, which are based on the five-number summary |
| central tendency | a) mean; b) median; c) mode (most popular value); d) midrange ((Max + Min)/2) |
| data dispersion | quartiles, IQR (Q3 − Q1), variance, standard deviation and standard error |
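
A minimal sketch, on made-up skewed values, that computes the five-number summary and flags outliers using the common 1.5 × IQR boxplot rule.

```python
import numpy as np

# Hypothetical right-skewed sample.
x = np.array([3, 4, 4, 5, 5, 6, 7, 8, 9, 12, 31])

q1, med, q3 = np.percentile(x, [25, 50, 75])
five_number = {"min": x.min(), "Q1": q1, "median": med, "Q3": q3, "max": x.max()}

iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # boxplot whisker limits
outliers = x[(x < lower) | (x > upper)]

print(five_number)
print("IQR:", iqr, "outliers:", outliers)       # 31 is flagged as an outlier
```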
b) Visual Analysis
  1. Histograms: display the distribution of a given attribute

  2. Quantile plots: display the distribution of a given attribute against its quantiles

  3. Scatter plots: display relationships, patterns or trends between two numerical attributes

  4. Loess curves: add a smooth curve to a scatter plot to provide a better perception of the pattern of dependence (a plotting sketch follows this list)
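
A small plotting sketch with made-up data; the moving-average trend line is a crude stand-in for a loess curve, used only to keep the example dependency-free.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)      # made-up noisy relationship

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single attribute.
axes[0].hist(y, bins=20)
axes[0].set_title("Histogram of y")

# Scatter plot with a smoothed trend (moving average) standing in for loess.
axes[1].scatter(x, y, s=10, alpha=0.5)
window = 15
smooth = np.convolve(y, np.ones(window) / window, mode="same")
axes[1].plot(x, smooth, linewidth=2)
axes[1].set_title("Scatter plot with smoothed trend")

plt.tight_layout()
plt.show()
```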


4.6 Data Profiling

(Figure: the data profiling workflow)

| Perspective | Objectives |
|---|---|
| Analysis of values in a single column | analysis of data distributions; discovery of patterns |
| Analysis of inter-value dependencies across columns | application of association rules; clustering and outlier detection algorithms |
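
A minimal single-column profiling sketch in the spirit of the first perspective, assuming a pandas Series; the digit/letter pattern inference is a deliberately simple illustration.

```python
import re
import pandas as pd

def profile_column(col: pd.Series) -> dict:
    """Summarize one column: nulls, distinct values and coarse value patterns."""
    values = col.dropna().astype(str)
    # Map each value to a coarse pattern: digits -> 9, letters -> A.
    patterns = values.map(lambda v: re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)))
    return {
        "rows": len(col),
        "nulls": int(col.isna().sum()),
        "distinct": int(col.nunique()),
        "top_values": col.value_counts().head(3).to_dict(),
        "top_patterns": patterns.value_counts().head(3).to_dict(),
    }

# Usage on a hypothetical postcode column.
postcodes = pd.Series(["M1 1AA", "M60 2LA", "m1 1aa", None, "99999"])
print(profile_column(postcodes))
```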

5. Data Cleaning

[What - 3 aspects] Prescribes routines to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

  1. fill in missing values
  2. identify outliers
  3. correct errors
5.1 Dealing with missing values
[How] How to deal with missing values - (6 ways)

a) By ignoring a tuple that has no recorded value for a number of attributes, such as customer income

b) By filling in the missing value manually

c) By using a global constant to fill in the missing value, for example "Unknown" or -∞

d) By using the attribute mean to fill in the missing value

e) By using the attribute mean of all samples belonging to the same class as the given tuple

f) By using the most probable value to fill in the missing value, which may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

Is a missing value always an error? (A small imputation sketch follows.)
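
A minimal sketch of options c) to e), assuming a hypothetical pandas DataFrame with an `income` attribute and a `segment` class label; the column names and values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "income":  [30000.0, None, 52000.0, None, 48000.0],
})

# c) Fill with a global constant.
df["income_const"] = df["income"].fillna(-1)

# d) Fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# e) Fill with the mean of samples in the same class (segment).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("segment")["income"].transform("mean")
)

print(df)
```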

[How] How to deal with noisy data - (3 ways)

a) By binning a stored data value, consulting its "neighborhood" (i.e., the values around it); see the binning sketch after this list

  • Data values are sorted and partitioned into a number of bins.
    • Smoothing can be done by bin means, bin medians or bin boundaries.

b) By fitting the data to a function, such as with regression

  • It is similar to finding the “best” line to fit two attributes, so that one attribute can be used to predict the other

c) By clustering the data values, where similar values are organized into groups, or “clusters”

  • Values that fall outside of the set of clusters may be considered outliers
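
A minimal equal-depth binning sketch showing smoothing by bin means and by bin boundaries; three bins is an arbitrary illustrative choice.

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3

# Equal-depth (equal-frequency) binning: sort, then split into equal-size bins.
bins = np.array_split(np.sort(prices), n_bins)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed_by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: snap each value to the nearest bin edge.
smoothed_by_boundary = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])

print(smoothed_by_mean)      # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(smoothed_by_boundary)  # [ 4  4 15 21 21 24 25 25 34]
```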
5.2 Data Cleaning as a Process - (2 Steps)
  • Step 1: Discrepancy detection
    • Examine sources of discrepancies: poorly designed data-entry forms, human error in data entry, information denial (deliberate errors), data decay, inconsistent data representation, inconsistent use of codes, errors in instrumentation devices that record data, system errors, data inadequately used for purposes other than originally intended, data integration, etc.
    • Proceeding with discrepancy detection: use metadata (knowledge about the properties of the data); use unique rules, consecutive rules and null rules; use tools for data scrubbing, auditing, migration and ETL (see the rule-checking sketch below).
  • Step 2: Data Transformation
    • An error-prone and time-consuming process.
    • Some transformations may introduce new discrepancies.
    • This means that numerous iterations (of Steps 1 and 2) are required.
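
A minimal sketch of rule-based discrepancy detection, assuming a hypothetical orders table; the unique rule, null rule and value rule shown stand in for real metadata and are not taken from the notes.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102, 102, 104],      # duplicate id -> unique-rule violation
    "country":  ["UK", None, "UK", "DE"],  # missing value -> null-rule violation
    "quantity": [1, 3, -2, 5],             # negative -> value-rule violation
})

discrepancies = {
    "unique_rule_order_id": orders[orders["order_id"].duplicated(keep=False)],
    "null_rule_country":    orders[orders["country"].isna()],
    "value_rule_quantity":  orders[orders["quantity"] < 0],
}

for rule, rows in discrepancies.items():
    print(rule, "->", len(rows), "violating row(s)")
```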
5.3 Limitations of Data Cleaning as a Process - (3 aspects)
  1. No integration of the discrepancy detection and transformation steps
  2. Lack of languages to specify how to detect and clean the data
  3. Lack of mechanisms for updating metadata as more about the data is discovered
5.4 Reading Materials
  1. J. Han and M. Kamber: Data Mining: Concepts and Techniques.
  2. Naumann, F. (2014). Data Profiling Revisited. ACM SIGMOD Record, 42(4), 40-49.

6. Data Integration

[What] It is the combination of data from multiple sources into a coherent data store, as in data warehousing

  • The sources may include multiple databases, flat files, data cubes, etc.
  • It needs to consider the following issues:
    1. Schema integration
    2. Conflict at the data level
      • Entity Resolution
      • Redundancy
6.1 Schema Integration
  • Different schemas may record different information, represent the same data in different ways or different data in similar ways
  • Examples of schema conflicts:
    • Table vs Table
    • Table vs Attribute
    • Attribute vs Attribute
a) Table vs Table Conflicts
a-1 One to One table conflicts
  1. Table name conflicts
    • different name for equivalent tables
    • same name for different tables
  2. Table structure conflicts
    • missing attribute
    • missing but implicit attributes
  3. table constraints
  4. table inclusion
a-2 Many to Many table conflicts
b) Attribute vs Attribute Conflicts
b-1 One to One attribute conflicts
  1. Attribute Name Conflicts
    • Different names for equivalent attribute
    • Same name for different attribute
  2. Attribute constraints
    • Integrity Constraints
    • Data type
b-2 Many to Many attribute conflicts
  1. one attribute is represented by a few attributes in another table
c) Conflicts at the Data Level

Different representations of the same data:

  1. Data type conflicts: char vs integer

  2. Different units: week vs hours

  3. Different levels of precision (bin size)

  4. Different encodings denoting the same information (see the sketch below)

    Employee.bonus(%) <=> EmpTax.Bonus($)

    age <=> dateOfBirth
6.2 Views for Reconciliation
  • Some conflicts can be overcome using views, but it is tricky


6.3 Entity Resolution (another strategy)
  • The process of identifying records that represent the same real-world entity
    • For example, two companies that merge may want to combine their customer records. In such a case, the same customer may be represented by multiple records, so these matching records must be identified and combined (into a cluster).
  • Extremely expensive, due to the large scale of the data set and the complex logic that decides whether records represent the same entity
  • In practice, entity resolution is a continuously improved process, based on a better understanding of the data and of the logic that examines and compares records (a small sketch follows)
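
A deliberately small sketch of the idea (not the full, expensive process): block on a cheap key, then compare candidate pairs with a string-similarity threshold. The blocking key, the threshold and the similarity measure are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records from two merged companies.
customers = [
    {"id": 1, "name": "Jon Smith",  "zip": "M1 1AA"},
    {"id": 2, "name": "John Smith", "zip": "M1 1AA"},
    {"id": 3, "name": "Ann Li",     "zip": "SW1A 2AA"},
]

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Blocking: only compare records sharing a cheap key (here, the zip code),
# which avoids the quadratic blow-up over the full data set.
blocks = {}
for rec in customers:
    blocks.setdefault(rec["zip"], []).append(rec)

matches = []
for block in blocks.values():
    for r1, r2 in combinations(block, 2):
        if similar(r1["name"], r2["name"]) > 0.8:   # illustrative threshold
            matches.append((r1["id"], r2["id"]))

print(matches)  # [(1, 2)] -> records 1 and 2 likely refer to the same customer
```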


7. Summary

  • There is a growing need for the management of the whole data lifecycle, particularly with the evolution of information systems into complex networked structures and the continuous increase in the sizes of data sets.
  • While general management principles are available, more specific and precise methods are harder to devise.
  • At each phase of the data lifecycle, numerous data management challenges remain to be solved before Big Data fulfills its potential.