Data Mining Fundamentals: A Summary

Data Engineering Outlines

Content

0. Background

Before the Start: 6 Objectives in this module

  1. Understanding of data lifecycle
  2. Understanding of the need for data lifecycle management
  3. Understanding of data engineering techniques
  4. Understanding of technical, ethical and societal issues related to data collection, storage, access, dissemination and maintenance
  5. Understanding the main principles of data analysis and how they can be applied in various application areas
  6. Demonstration of relevant standards and typical practice in data engineering

1. Data Lifecycle Management

1.1 [Why] Motivation - (8 reasons)
  • Why DLM? 8 reasons
    1. Nowadays, businesses deal with data rather than paper-based documents
    2. The demands of a complex information ecosystem are rising
      • ==> Data needs to be accessed, edited and transmitted by a great variety of applications, which challenges our understanding and results
    3. As techniques develop, new issues concerning information have arisen
      • ==> Due to the sudden evolution of information
    4. An enterprise can only meet demand as far as its ability to control information allows
      • ==> Information is a main asset of any company that survives in this competitive world; an enterprise has no choice but to adapt and find mechanisms to control its share of the complex information ecosystem
    5. Data generation and storage have become simple and cheap
      • We are overwhelmed with electronic information
    6. Businesses are interested in generating, storing and analyzing data to improve competitiveness
    7. For legal and business purposes, the retention of certain data sets has become an issue of urgency
    8. Not all data is useful; it needs to be processed
1.2 [What] Definition - (2 aspects)

What is DLM? 2 concepts

  • DLM is the process of managing data throughout its lifecycle, from conception until disposal
  • DLM encompasses the strategies of filtering, discarding and utilizing data
1.3 [How] Develop a DLM plan - (4 aspects)

How to implement a DLM plan? 4 aspects

A DLM plan should cover the procedures for generating, organizing, resourcing and managing data, and consider the following issues:

  1. Identification of Objectives in Maintaining Data
    • Decide on the objectives in maintaining its electronic records
    • Categorize the records
    • Devise retention and security policies that meet those objectives
  2. Minimalism
    • Principle: data should be discarded unless there is a good reason for retaining it
    • Strategy: take a hard look at both the types of data retained and the regulatory constraints relating to them
  3. Information Security
    • Protecting its information assets is fundamental for the survival of any organization. Appropriate information security procedures must ensure that shared records are not improperly accessed or edited, or stolen and sold to spammers
  4. Distribution control
    • Unauthorized access can largely be eliminated by employing network security controls
1.4 [How] The Phases of the Data Lifecycle - (4 phases)

(Figure: the data lifecycle)
As the figure shows, there are 4 major phases in the data lifecycle:

  1. Data collection
  2. Repository and Storage
  3. Pre-processing
  4. Dissemination
a) Data collection methods

[What] Data collection methods are wide and varied, and no one method of collection is inherently better than any other. The pros and cons of an approach can only be weighed up in view of a rich and complex context.

[How] Some popular methods:

  1. Surveys
  2. Interviews
  3. Observations
  4. Unobtrusive Measures
  5. Experimentation
b) Data Storage

[What] Storage is the container resource for data objects.

[How] Data could be stored in several ways:

  1. servers
  2. laptop computers
  3. handheld devices
  4. old back-up tapes
  5. CD-ROM disks
  6. thumb drives
  7. cellular phones and pagers

(Figure: data storage)

[Access] Logically, data is collected from clients. After processing on a server or another administering node, the data is sent on and eventually stored in databases.

c) Pre-processing

[Background] Normally, an automated or semi-automated method is applied to process commercial or scientific data. Typically, this uses relatively simple, repetitive activities to process large volumes of similar information.

[Purpose] To obtain high-quality data

  1. This procedure is the preparation for data dissemination (formatting), improving the efficiency and ease of later steps
  2. This process may involve obtaining a descriptive summarization of the data to help people identify its general characteristics, as well as the presence of noise and outliers
  3. Data will be inspected in terms of consistency, completeness and accuracy.

[How] Data Pre-processing Techniques

  1. data cleaning
  2. data integration
  3. data transformation
  4. data reduction
d) Data Dissemination methods
| Item | Instance |
|---|---|
| Digital Content for sale / List/report/queries | Web Forms |
| Inter/Intra Net | PDA |
| Business intelligence | Personalized Email |
| Forecasting | Publications |
| New Product/Service | File Transfer |
| Knowledge Creation/Mining | Conversation |
(Figure: an example of a Complex Information Ecosystem)

1.5 Issues in the Development of DLM Procedures
  • Identification of Objectives in Maintaining Data
  • Application of the Minimalism Principle
  • Consideration of Information Security Issues
  • Consideration of Distribution Control Issues
  • Etc.
1.6 [Conclusion] Data Lifecycle Focus

The data lifecycle focus is on the following functions:

  1. Data Acquisition and Extraction
  2. Descriptive Data Summarization
  3. Data Pre-processing
    • Data cleaning (and quality)
    • Data integration
  4. Data Quality Management
    • Overview of data quality and existing methodologies
1.7 Reading Materials

2. Data Quality and Cleaning

2.1 [BG] Data Quality
  • DQ plays a critical role in all business and governmental applications
  • It is a relevant performance issue for operating processes, decision-making activities, and inter-organizational cooperation requirements
  • The evolution of information systems into network-based structures has made the issue of DQ more complex
2.2 [BG] DQ in the Data Lifecycle
  • DQ should be considered in all phases of the data lifecycle, but it is typically addressed during the data pre-processing step, which precedes knowledge extraction and decision-making activities (data dissemination)
  • The DQ techniques applied to data during the pre-processing phase are what we call Data Cleaning
2.3 Data Cleaning

[What is Data Cleaning?] Data Cleaning involves the filling in of missing values, the smoothing of noisy data, the identification or removal of outliers, and the resolving of inconsistencies and redundancies caused by data integration.

[Relevance] The Relevance of Data Cleaning

No quality data, no quality mining results!

  • High-quality decisions must be based on high-quality data
  • Data cleaning is the number one problem in data warehousing
    • Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

[BG] Data Quality Methodologies

  • The literature provides a wide range of techniques to assess and improve the quality of data
  • Due to the diversity and complexity of these techniques, research has focused on defining methodologies that help select, customise and apply DQ assessment and improvement techniques
  • A DQ methodology is a set of guidelines and techniques that defines a rational process to assess and improve the quality of data

[How] How to compare different types of DQ methodologies

  1. Phases and steps
    • i.e., state reconstruction, assessment/measurement and improvement
  2. Strategies and techniques: can be data-driven or process-driven
    • Data-driven strategies improve the quality of data by directly modifying the values of the data, for example through the acquisition of new data, record linkage, etc.
    • Process-driven strategies improve quality by redesigning the processes that create or modify data, for example through process control and process redesign.
  3. Types of data
    • For example: structured, unstructured and semi-structured
  4. Types of Information systems
    • For example: distributed, peer-to-peer, data warehouse, etc.
  5. Dimensions and metrics
  6. Etc.

[How] How to measure the quality of data

A well-accepted multidimensional view:

  1. Accuracy
  2. Completeness
  3. Consistency
  4. Timeliness
  5. Believability
  6. Value added
  7. Interpretability
  8. Accessibility

[Why] Why is data in the real world dirty?

“Duplicate records also need data cleaning!”

| Item | Reason | Cause |
|---|---|---|
| incomplete data | measuring Completeness | 1. not-applicable data values when collected; 2. different considerations between the time when the data was collected and when it is analyzed; 3. human, hardware or software problems |
| noisy data | measuring Accuracy | 1. faulty data collection instruments; 2. human or computer error at data entry; 3. errors in data transmission |
| inconsistent data | measuring Consistency | 1. different data sources; 2. functional dependency violations |
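
To make a few of these dimensions concrete, here is a minimal measurement sketch in Python, assuming a small pandas DataFrame; the column names, the plausible age range and the email pattern are illustrative assumptions rather than rules from the notes.

```python
import pandas as pd

# Hypothetical example data; column names and values are illustrative only.
customers = pd.DataFrame({
    "name":  ["Ann", "Bob", None, "Dan"],
    "age":   [34, -5, 28, None],            # -5 violates an accuracy rule
    "email": ["ann@x.com", "bob@x", "cee@x.com", "dan@x.com"],
})

# Completeness: fraction of non-missing cells per column.
completeness = customers.notna().mean()

# Accuracy (rule-based): share of ages falling in a plausible range.
accuracy_age = customers["age"].between(0, 120).mean()

# Consistency (format rule): share of emails matching a simple pattern.
consistency_email = customers["email"].str.contains(r"^\S+@\S+\.\S+$", na=False).mean()

print("Completeness per column:\n", completeness)
print("Age accuracy:", accuracy_age)
print("Email consistency:", consistency_email)
```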
2.4 Reading Materials
  • C. Batini, C. Cappiello, C. Francalanci and A. Maurino: Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys, Vol. 41, No. 3 Article 16, July 2009.
  • H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan and C. Shahabi: Big data and its technical challenges. Communications of the ACM, Volume 57, Issue 7, 2014, Pages 86-94.
  • Naumann, F. (2014). Data Profiling Revisited. ACM SIGMOD Record, 42(4), 40-49.

3. Data Acquisition and Information Extraction


3.1 Data Acquisition

[Where] Sensors, simulations and scientific experiments can produce large volumes of data.

[What] Applying filtering and compression criteria to the incoming data.

[How] For static data, build an initial profile of the data coming from the relevant sources at first acquisition. The profile should cover both data and metadata, including any values inserted, deleted or updated, any entities inserted or deleted, any properties that were added or re-purposed, etc.

  • The decision regarding what data to filter is tied to the intended purpose of the data in intimate ways (so a fixed filtering strategy will often not work well)

    Where possible, the filtering decision should be pushed, based on the common uses of the data, out to the edge devices where the data is generated

The changes of data and metadata:

  1. what (e.g. a value)
  2. where (e.g. record and column information)
  3. when (e.g. timestamp)
  4. how (e.g. by update - presence of old and new values)

    • Information about such changes often indicates the level of quality and volatility of the data.

3.2 Data Ingestion Techniques

[What] The process of obtaining, importing, and processing data for later use or storage in a database

[Major Challenge] Maintain reasonable speed and efficiency in data ingestion, when numerous data sources exist in diverse formats.

[Solution for big data] Ingest data incrementally as it arrives, while performing data filtering and compression (see the sketch below).
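
A minimal sketch of that incremental filter-and-compress idea, assuming records arrive as Python dicts; the sensor-style fields (`device_id`, `temp`) and the acceptance range are invented for illustration.

```python
import gzip
import json
from typing import Iterable

def ingest(records: Iterable[dict], out_path: str) -> int:
    """Incrementally filter and compress records as they arrive."""
    kept = 0
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for rec in records:
            # Filtering criterion (hypothetical): drop implausible readings.
            if rec.get("temp") is None or not (-50 <= rec["temp"] <= 60):
                continue
            # Compression happens transparently through the gzip file handle.
            out.write(json.dumps(rec) + "\n")
            kept += 1
    return kept

# A small in-memory stream standing in for a real data source.
stream = [{"device_id": 1, "temp": 21.5}, {"device_id": 2, "temp": 999.0}]
print(ingest(stream, "ingested.jsonl.gz"))  # -> 1 record kept
```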

3.3 Information Extraction

[What] A process to pull out the required information from the underlying source and to express this information in a structured form suitable for analysis

[Challenges]

  1. Specifying information extraction tasks
  2. Optimizing the execution when processing new data (see the sketch below)
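
As an illustration of the first challenge (specifying an extraction task), here is a tiny sketch that pulls dates and amounts out of free text with regular expressions and emits structured rows; the patterns and field names are assumptions, not a method prescribed by the notes.

```python
import re
from typing import Dict, List

# Hypothetical extraction rules; patterns and field names are illustrative.
DATE = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")
AMOUNT = re.compile(r"\$(\d+(?:\.\d{2})?)")

def extract(texts: List[str]) -> List[Dict[str, object]]:
    rows = []
    for text in texts:
        date = DATE.search(text)
        amount = AMOUNT.search(text)
        rows.append({
            "date": date.group(1) if date else None,
            "amount": float(amount.group(1)) if amount else None,
            "source": text,
        })
    return rows

print(extract(["Invoice 2014-07-01 for $86.94 received."]))
# [{'date': '2014-07-01', 'amount': 86.94, 'source': 'Invoice 2014-07-01 for $86.94 received.'}]
```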
3.4 Data Lake

[What] A data lake is an architecture focused on data storage and access whose main feature is a repository capable of storing vast quantities of data in various formats

[Resources] Examples of data sources: Web server logs, databases, social media and third-party data, ingested through either batch processing or real-time processing of streaming data.

[How] Data is no longer restrained by initial schema decisions, and can be exploited more freely by its consumers

[DAaaS] The capabilities of a data lake allow IT to provide Data and Analytics as a Service (DAaaS) linked to the lake's repository, where IT acts as the data provider (supplier) and analysts act as data consumers


4. Data Pre-processing

4.1 Basic ideas of Data Pre-processing

[What] Some basic ideas about data pre-processing

The main objective of data pre-processing is to prepare the data for dissemination (particularly in the form of knowledge creation or analysis through mining). This process may involve obtaining a descriptive summarization (profile) of the data to help us identify its general characteristics, as well as the presence of noise or outliers.

[Result] High-quality data, which eases the mining process (a step in dissemination)

4.2 Data Pre-processing Techniques (4 kinds)

These techniques can be used together; a combined sketch follows the list below.

  1. Data Cleaning: associated with the removal of noise and the correction of inconsistencies in the data.
  2. Data Integration: associated with the merging of data from multiple sources into a coherent data store, e.g., a data warehouse.
  3. Data Transformation: associated with the transformation of the data into a format which improves the accuracy and efficiency of dissemination methods.
  4. Data Reduction: associated with the reduction of data size by, for example, aggregating, eliminating redundant features, or clustering.
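
The combined sketch below chains the four techniques on two hypothetical pandas DataFrames; the column names and the min-max normalization choice are illustrative assumptions.

```python
import pandas as pd

# Hypothetical source tables holding the same logical content.
sales_eu = pd.DataFrame({"cust_id": [1, 2, 2], "amount": [10.0, None, 15.0]})
sales_us = pd.DataFrame({"cust_id": [3, 4],    "amount": [20.0, 1000.0]})

# 1. Data integration: merge the sources into one coherent store.
sales = pd.concat([sales_eu, sales_us], ignore_index=True)

# 2. Data cleaning: fill missing values and drop exact duplicates.
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())
sales = sales.drop_duplicates()

# 3. Data transformation: min-max normalize amount into [0, 1].
lo, hi = sales["amount"].min(), sales["amount"].max()
sales["amount_norm"] = (sales["amount"] - lo) / (hi - lo)

# 4. Data reduction: aggregate per customer to shrink the data set.
reduced = sales.groupby("cust_id", as_index=False)["amount_norm"].mean()
print(reduced)
```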
4.3 Descriptive Data Summarization

[What] DDS techniques identify the typical properties of your data and highlight which data values should be treated as noise or outliers. Basically, they identify errors and outliers.

[Aim] Provides an overall picture of your data

[Techniques]

  1. measures of the central tendency of data, e.g., mean, median, mode and midrange
  2. measures of data dispersion, e.g., quantiles, interquartile range (IQR, the spread of the middle half of the data) and variance

[How] How can DDS techniques be applied efficiently over large datasets?

  1. Choose the most efficient implementation of each technique
  2. Categorize each technique according to its type (see the next subsection)
4.4 Types of DDS Techniques
  1. Distributive Measures: the measure can be computed for subsets of the data and the partial results then merged into the overall result.

    • sum, count, max and min

  2. Algebraic Measures: computed by applying an algebraic function to one or more distributive measures.

    • mean, weighted average

  3. Holistic Measures: must be computed over the entire data set and cannot be assembled from partial results.

    • e.g. median

I.e., range, the k-th percentile, the median and quartiles. A sketch contrasting distributive, algebraic and holistic computation follows the quartile table below.

[What] IQR = Q3 − Q1

| Quartile | Quantile | Percentile |
|---|---|---|
| 1st quartile (Q1) | 0.25 | 25% |
| 2nd quartile (Q2) | 0.50 | 50% |
| 3rd quartile (Q3) | 0.75 | 75% |
| 4th quartile (Q4) | 1.00 | 100% |
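
The sketch below illustrates why this distinction matters for large data sets: distributive and algebraic measures can be merged from per-partition results, whereas a holistic measure such as the median needs the entire data set. The three partitions are purely illustrative.

```python
import statistics

# Pretend the data set is split across three partitions (e.g. three nodes).
partitions = [[2, 4, 9], [1, 7], [5, 3, 8, 6]]

# Distributive: compute per partition, then merge the partial results.
total = sum(sum(p) for p in partitions)        # sum of sums
count = sum(len(p) for p in partitions)        # sum of counts
overall_max = max(max(p) for p in partitions)  # max of maxes

# Algebraic: derived from a fixed number of distributive measures.
mean = total / count

# Holistic: the median cannot be merged from per-partition medians;
# it must be computed over the entire data set.
all_values = [v for p in partitions for v in p]
median = statistics.median(all_values)

print(total, count, overall_max, mean, median)  # 45 9 9 5.0 5
```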
4.5 Dealing with Skewed Data
a) Numerical Analysis

No single numerical measure of spread, such as the IQR, is very useful for describing skewed distributions, as the spreads of the two sides of such a distribution are unequal.

[How to describe the shape of a distribution (five-number summary)] Use Q1, Q3, the median and the range (min and max) of the data

[How to identify outliers] Use boxplots based on the five-number summary (a sketch follows the table below).

| Problem | Solution |
|---|---|
| shape of distribution | five-number summary (minimum, maximum, Q1, Q3 and median) |
| identify outliers | boxplots, which are based on the five-number summary |
| central tendency | a) mean; b) median; c) mode (most popular value); d) midrange ((Max + Min)/2) |
| data dispersion | quartiles, IQR (Q3 − Q1), variance, standard deviation and standard error |
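
A minimal sketch, on made-up skewed values, that computes the five-number summary and flags outliers using the common 1.5 × IQR boxplot rule.

```python
import numpy as np

# Hypothetical right-skewed sample.
x = np.array([3, 4, 4, 5, 5, 6, 7, 8, 9, 12, 31])

q1, med, q3 = np.percentile(x, [25, 50, 75])
five_number = {"min": x.min(), "Q1": q1, "median": med, "Q3": q3, "max": x.max()}

iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # boxplot whisker limits
outliers = x[(x < lower) | (x > upper)]

print(five_number)
print("IQR:", iqr, "outliers:", outliers)       # 31 is flagged as an outlier
```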
b) Visual Analysis
  1. Histograms: display the distribution of a given attribute

  2. Quantile plots: display the distribution of a given attribute against its quantiles

  3. Scatter plots: display relationships, patterns or trends between two numerical attributes

  4. Loess curves: add a smooth curve to a scatter plot to provide a better perception of the pattern of dependence (a plotting sketch follows this list)
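
A small plotting sketch with made-up data; the moving-average trend line is a crude stand-in for a loess curve, used only to keep the example dependency-free.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)      # made-up noisy relationship

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single attribute.
axes[0].hist(y, bins=20)
axes[0].set_title("Histogram of y")

# Scatter plot with a smoothed trend (moving average) standing in for loess.
axes[1].scatter(x, y, s=10, alpha=0.5)
window = 15
smooth = np.convolve(y, np.ones(window) / window, mode="same")
axes[1].plot(x, smooth, linewidth=2)
axes[1].set_title("Scatter plot with smoothed trend")

plt.tight_layout()
plt.show()
```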


4.6 Data Profiling

(Figure: the data profiling workflow)

| Perspective | Objectives |
|---|---|
| Analysis of values in a single column | analysis of data distributions; discovery of patterns |
| Analysis of inter-value dependencies across columns | application of association rules; clustering and outlier detection algorithms |
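
A minimal single-column profiling sketch in the spirit of the first perspective, assuming a pandas Series; the digit/letter pattern inference is a deliberately simple illustration.

```python
import re
import pandas as pd

def profile_column(col: pd.Series) -> dict:
    """Summarize one column: nulls, distinct values and coarse value patterns."""
    values = col.dropna().astype(str)
    # Map each value to a coarse pattern: digits -> 9, letters -> A.
    patterns = values.map(lambda v: re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v)))
    return {
        "rows": len(col),
        "nulls": int(col.isna().sum()),
        "distinct": int(col.nunique()),
        "top_values": col.value_counts().head(3).to_dict(),
        "top_patterns": patterns.value_counts().head(3).to_dict(),
    }

# Usage on a hypothetical postcode column.
postcodes = pd.Series(["M1 1AA", "M60 2LA", "m1 1aa", None, "99999"])
print(profile_column(postcodes))
```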

5. Data Cleaning

[What - 3 aspects] Prescribes routines to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

  1. fill in missing values
  2. identify outliers
  3. correct errors
5.1 Dealing with missing values
[How] How to deal with missing values - (6 ways)

a) By ignoring a tuple that has no recorded value for a number of attributes, such as customer income

b) By filling in the missing value manually

c) By using a global constant to fill in the missing value, for example "Unknown" or -∞

d) By using the attribute mean to fill in the missing value

e) By using the attribute mean of all samples belonging to the same class as the given tuple

f) By using the most probable value to fill in the missing value, which may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction.

Is a missing value always an error? (A small imputation sketch follows.)
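
A minimal sketch of options c) to e), assuming a hypothetical pandas DataFrame with an `income` attribute and a `segment` class label; the column names and values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "income":  [30000.0, None, 52000.0, None, 48000.0],
})

# c) Fill with a global constant.
df["income_const"] = df["income"].fillna(-1)

# d) Fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# e) Fill with the mean of samples in the same class (segment).
df["income_class_mean"] = df["income"].fillna(
    df.groupby("segment")["income"].transform("mean")
)

print(df)
```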

[How] How to deal with noisy data - (3 ways)

a) By binning a stored data value, consulting its "neighborhood" (i.e., the values around it); see the binning sketch after this list

  • Data values are sorted and partitioned into a number of bins.
    • Smoothing can be done by bin means, bin medians or bin boundaries.

b) By fitting the data to a function, such as with regression

  • It is similar to finding the “best” line to fit two attributes, so that one attribute can be used to predict the other

c) By clustering the data values, where similar values are organized into groups, or “clusters”

  • Values that fall outside of the set of clusters may be considered outliers
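
A minimal equal-depth binning sketch showing smoothing by bin means and by bin boundaries; three bins is an arbitrary illustrative choice.

```python
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3

# Equal-depth (equal-frequency) binning: sort, then split into equal-size bins.
bins = np.array_split(np.sort(prices), n_bins)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed_by_mean = np.concatenate([np.full(len(b), b.mean()) for b in bins])

# Smoothing by bin boundaries: snap each value to the nearest bin edge.
smoothed_by_boundary = np.concatenate([
    np.where(np.abs(b - b.min()) <= np.abs(b - b.max()), b.min(), b.max())
    for b in bins
])

print(smoothed_by_mean)      # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(smoothed_by_boundary)  # [ 4  4 15 21 21 24 25 25 34]
```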
5.2 Data Cleaning as a Process - (2 Steps)
  • Step 1: Discrepancy detection
    • Examine sources of discrepancies: poorly designed data-entry forms, human error in data entry, information denial (deliberate errors), data decay, inconsistent data representation, inconsistent use of codes, errors in instrumentation devices that record data, system errors, data inadequately used for purposes other than originally intended, data integration, etc.
    • Proceeding with discrepancy detection: use metadata (knowledge about the properties of the data); use unique rules, consecutive rules and null rules; use tools for data scrubbing, auditing, migration and ETL (see the rule-checking sketch below).
  • Step 2: Data Transformation
    • An error-prone and time-consuming process.
    • Some transformations may introduce new discrepancies.
    • This means that numerous iterations (of Steps 1 and 2) are required.
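
A minimal sketch of rule-based discrepancy detection, assuming a hypothetical orders table; the unique rule, null rule and value rule shown stand in for real metadata and are not taken from the notes.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [101, 102, 102, 104],      # duplicate id -> unique-rule violation
    "country":  ["UK", None, "UK", "DE"],  # missing value -> null-rule violation
    "quantity": [1, 3, -2, 5],             # negative -> value-rule violation
})

discrepancies = {
    "unique_rule_order_id": orders[orders["order_id"].duplicated(keep=False)],
    "null_rule_country":    orders[orders["country"].isna()],
    "value_rule_quantity":  orders[orders["quantity"] < 0],
}

for rule, rows in discrepancies.items():
    print(rule, "->", len(rows), "violating row(s)")
```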
5.3 Limitations of Data Cleaning as a Process - (3 aspects)
  1. No integration of the discrepancy detection and transformation steps
  2. Lack of languages to specify how to detect and clean the data
  3. Lack of mechanisms for updating metadata as more about the data is discovered
5.4 Reading Materials
  1. J. Han and M. Kamber: Data Mining: Concepts and Techniques.
  2. Naumann, F. (2014). Data Profiling Revisited. ACM SIGMOD Record, 42(4), 40-49.

6. Data Integration

[What] It is the combination of data from multiple sources into a coherent data store, as in data warehousing

  • The sources may include multiple databases, flat files, data cubes, etc.
  • It needs to consider the following issues:
    1. Schema integration
    2. Conflict at the data level
      • Entity Resolution
      • Redundancy
6.1 Schema Integration
  • Different schemas may record different information, represent the same data in different ways or different data in similar ways
  • Examples of schema conflicts:
    • Table vs Table
    • Table vs Attribute
    • Attribute vs Attribute
a) Table vs Table Conflicts
a-1 One to One table conflicts
  1. Table name conflicts
    • different name for equivalent tables
    • same name for different tables
  2. Table structure conflicts
    • missing attribute
    • missing but implicit attributes
  3. table constraints
  4. table inclusion
a-2 Many to Many table conflicts
b) Attribute vs Attribute Conflicts
b-1 One to One attribute conflicts
  1. Attribute Name Conflicts
    • Different names for equivalent attribute
    • Same name for different attribute
  2. Attribute constraints
    • Integrity Constraints
    • Data type
b-2 Many to Many attribute conflicts
  1. one attribute is represented by a few attributes in another table
c) Conflicts at the Data Level

Different representations of the same data:

  1. Data type conflicts: char vs integer

  2. Different units: week vs hours

  3. Different levels of precision (bin size)

  4. Different encodings denoting the same information (see the sketch below)

    Employee.bonus(%) <=> EmpTax.Bonus($)

    age <=> dateOfBirth
6.2 Views for Reconciliation
  • Some conflicts can be overcome using views, but it is tricky


6.3 Entity Resolution (another strategy)
  • The process of identifying records that represent the same real-world entity
    • For example, two companies that merge may want to combine their customer records. In such a case, the same customer may be represented by multiple records, so these matching records must be identified and combined (into a cluster).
  • Extremely expensive, due to the large scale of the data set and the complex logic that decides whether records represent the same entity
  • In practice, entity resolution is a continuously improved process, based on a better understanding of the data and of the logic that examines and compares records (a small sketch follows)
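
A deliberately small sketch of the idea (not the full, expensive process): block on a cheap key, then compare candidate pairs with a string-similarity threshold. The blocking key, the threshold and the similarity measure are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records from two merged companies.
customers = [
    {"id": 1, "name": "Jon Smith",  "zip": "M1 1AA"},
    {"id": 2, "name": "John Smith", "zip": "M1 1AA"},
    {"id": 3, "name": "Ann Li",     "zip": "SW1A 2AA"},
]

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Blocking: only compare records sharing a cheap key (here, the zip code),
# which avoids the quadratic blow-up over the full data set.
blocks = {}
for rec in customers:
    blocks.setdefault(rec["zip"], []).append(rec)

matches = []
for block in blocks.values():
    for r1, r2 in combinations(block, 2):
        if similar(r1["name"], r2["name"]) > 0.8:   # illustrative threshold
            matches.append((r1["id"], r2["id"]))

print(matches)  # [(1, 2)] -> records 1 and 2 likely refer to the same customer
```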


7. Summary

  • There is a growing need for the management of the whole data lifecycle, particularly with the evolution of information systems into complex networked structures and the continuous increase in the sizes of data sets.
  • While general management principles are available, more specific and precise methods are harder to devise.
  • At each phase of the data lifecycle, numerous data management challenges remain to be solved before Big Data fulfills its potential.