Big data 001 -- what is big data?

KnightHacker2077

Last modified: 2022-09-01 08:09:48

Category: Artificial Intelligence. Tags: big data

First published: 2022-08-26 02:38:12

Article link: https://blog.csdn.net/DOITJT/article/details/126535529

Why is big data so popular?

Easy data collection
Cheap data storage
Cheap parallel computing

What is big data?

To understand and predict phenomena
Create models from algorithms and data
Create detectors / classifiers
Ultimately, produce a diagnosis

Prediction methods

Model-first: start with a model, then simulate and predict a phenomenon
Data-first: in data analytics, we start with many samples or observations, build a model from them, and then make predictions.

** Actual research uses a spectrum of approaches between model-first and data-first.
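
As a minimal sketch of the two ends of that spectrum, the example below predicts how far an object falls. The free-fall law, the synthetic noisy observations, and all function names are assumptions chosen for illustration; they are not part of the course material.

```python
import random

G = 9.81  # gravitational acceleration in m/s^2, known in advance by the model-first path

def model_first_distance(t):
    """Model-first: predict directly from the known law d = 0.5 * g * t^2."""
    return 0.5 * G * t ** 2

def fit_data_first(times, distances):
    """Data-first: estimate the coefficient a in d = a * t^2 from observations.

    One-parameter least squares through the origin: a = sum(x*y) / sum(x*x), x = t^2.
    """
    xs = [t ** 2 for t in times]
    return sum(x * d for x, d in zip(xs, distances)) / sum(x * x for x in xs)

# Synthetic observations with measurement noise (made up for this sketch).
rng = random.Random(0)
times = [0.1 * i for i in range(1, 31)]
observed = [model_first_distance(t) + rng.gauss(0, 0.05) for t in times]

a_hat = fit_data_first(times, observed)

def data_first_distance(t):
    """Predict with the coefficient learned from data instead of the known law."""
    return a_hat * t ** 2

print("model-first prediction at t=2s:", model_first_distance(2.0))
print("data-first prediction at t=2s: ", data_first_distance(2.0))
```

The data-first path never sees the value of g; it recovers a close estimate of 0.5 * g purely from the samples, but only because the shape d = a * t^2 was assumed up front, which is itself a small step toward the model-first end.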

The course focuses on --

How to manage and create predictive models
- Converting raw data into the right form
- Building the actual machine learning model
Related topics: machine learning, cloud systems, distributed computing, etc.

What makes data big?

The data is complex, with so many dimensions and parameters that people cannot understand it directly
It cannot be stored in the memory of a single machine (e.g. 16 GB)
It cannot be analyzed with brute-force methods
It changes too rapidly (velocity)
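
One common response to the memory limit is to process the data as a stream instead of loading it all at once. The sketch below uses Welford's online algorithm to compute a mean and variance in a single pass with constant memory. The `sensor_readings` generator is a hypothetical stand-in for a data source that is too large to hold in memory.

```python
def running_stats(stream):
    """Welford's online algorithm: one pass over the data, O(1) memory."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # accumulates the sum of squared deviations
    variance = m2 / count if count else 0.0  # population variance
    return count, mean, variance

def sensor_readings(n):
    """Hypothetical source: yields values one at a time, never building a list."""
    for i in range(n):
        yield float(i % 10)

count, mean, variance = running_stats(sensor_readings(100_000))
print(count, mean, variance)
```

Because the generator yields one value at a time, peak memory use stays the same whether the stream has a thousand values or a billion.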

What to do with raw data?

Data needs to be cleaned and integrated before analyzing.

Data is usually taken from various places
We need to extract, model, clean and link it into one coherent form
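
A minimal sketch of cleaning and linking, assuming two hypothetical sources (a CRM export and an order log) that spell customer names differently. The field names and sample rows are invented for illustration.

```python
def clean_name(raw):
    """Normalize whitespace and case so the same customer matches across sources."""
    return " ".join(raw.split()).lower()

def parse_total(raw):
    """Convert a raw amount to float; return None for unparseable values."""
    try:
        return float(raw)
    except ValueError:
        return None

crm_rows = [  # hypothetical source A
    {"name": "  Alice Smith ", "email": "alice@example.com"},
    {"name": "BOB JONES", "email": "bob@example.com"},
]
order_rows = [  # hypothetical source B
    {"customer": "alice  smith", "total": "120.50"},
    {"customer": "Bob Jones", "total": "80"},
    {"customer": "Carol White", "total": "n/a"},  # dirty value with no CRM match
]

def link(crm, orders):
    """Join the two sources on the cleaned name, producing one coherent record form."""
    by_name = {clean_name(r["name"]): r for r in crm}
    linked = []
    for order in orders:
        key = clean_name(order["customer"])
        total = parse_total(order["total"])
        person = by_name.get(key)
        if person is None or total is None:
            continue  # dropped in this sketch; a real pipeline would log or repair these
        linked.append({"name": key, "email": person["email"], "total": total})
    return linked

records = link(crm_rows, order_rows)
print(records)
```

Matching on a normalized name is deliberately naive; it only shows why cleaning has to happen before linking.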

Data analytics pipeline --
- Acquisition of raw data
  - Input raw data
- Data wrangling
  - Put data in the right form
- Provide interpretation
  - Raw data is meaningless without good context
  - It is important to have metadata/typing to interpret data
  - e.g. add table headers and column names
- Provide provenance
  - The source of the data: who collected it, how, and why
- Process data
  - e.g. linking: connect a geo-coordinate with its address and street view
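
The interpretation and provenance steps can be sketched as a small wrapper that attaches column names, types, and source information to otherwise anonymous rows. The `Dataset` and `Provenance` classes and all field values are hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class Provenance:
    source: str        # where the data came from
    collected_by: str  # who collected it
    method: str        # how it was collected
    purpose: str       # why it was collected

@dataclass
class Dataset:
    columns: list      # column names (the "table headers")
    types: list        # one converter per column, e.g. float or str
    rows: list         # raw rows: lists of strings with no meaning on their own
    provenance: Provenance

    def records(self):
        """Interpret each raw row using the column names and types."""
        return [
            {name: cast(value) for name, cast, value in zip(self.columns, self.types, row)}
            for row in self.rows
        ]

# Without headers, these strings could be anything.
raw_rows = [["47.6062", "-122.3321", "Seattle"], ["40.7128", "-74.0060", "New York"]]

dataset = Dataset(
    columns=["latitude", "longitude", "city"],
    types=[float, float, str],
    rows=raw_rows,
    provenance=Provenance(
        source="hypothetical field survey",
        collected_by="example team",
        method="handheld GPS",
        purpose="illustration only",
    ),
)
print(dataset.records()[0])
print(dataset.provenance.source)
```

Keeping the metadata next to the rows means later pipeline stages can check types and trace any record back to its origin.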

How to pull features and build models?

What is a feature? A value used to predict classes or groups

Structuring data: identify features (e.g. for a duck: beak, quacking) associated with a certain class
- text: keywords
- image: shapes, patterns
- social data, network data: clusters, paths, etc.
Machine learning
- take raw data
- extract features
- map features to classes
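
The three machine learning steps above can be sketched with keyword counts as text features and a nearest-centroid rule as the mapping to classes. The keyword vocabulary, the training sentences, and the choice of nearest-centroid are assumptions made for this example; the notes do not prescribe a specific algorithm.

```python
KEYWORDS = ["quack", "beak", "bark", "fur"]  # hypothetical feature vocabulary

def extract_features(text):
    """Raw text -> feature vector: how often each keyword appears."""
    words = text.lower().split()
    return [words.count(k) for k in KEYWORDS]

def train_centroids(examples):
    """Average the feature vectors of each class to get one centroid per class."""
    sums, counts = {}, {}
    for text, label in examples:
        feats = extract_features(text)
        if label not in sums:
            sums[label] = [0.0] * len(KEYWORDS)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], feats)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(text, centroids):
    """Map features to the class whose centroid is closest (squared Euclidean distance)."""
    feats = extract_features(text)
    def distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(feats, centroid))
    return min(centroids, key=lambda label: distance(centroids[label]))

training = [  # made-up labeled examples
    ("quack quack it has a beak", "duck"),
    ("a flat beak and a loud quack", "duck"),
    ("bark bark it has fur", "dog"),
    ("soft fur and a loud bark", "dog"),
]
centroids = train_centroids(training)

print(classify("it has a beak and says quack", centroids))
print(classify("thick fur and a bark", centroids))
```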

What is the goal of data analytics (DA)?

Pattern detection: categorizing data, i.e. descriptive statistics
- e.g. put data into clusters, or put sales data into groups by region
- shows trends
Hypothesis testing: data significance, i.e. inferential statistics
- e.g. whether some data leads to a particular result
- once we know which data is significant, we can use it as features and build models
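
Both goals can be sketched on a tiny, made-up sales table. Grouping by region and averaging is the descriptive part; a permutation test on the difference between two regions is one possible inferential check (chosen here for simplicity, not taken from the notes). All numbers are invented.

```python
import random
from collections import defaultdict
from statistics import mean

sales = [  # (region, amount): made-up data
    ("north", 120.0), ("north", 130.0), ("north", 125.0), ("north", 135.0),
    ("south", 90.0), ("south", 95.0), ("south", 85.0), ("south", 100.0),
]

def group_by_region(rows):
    """Descriptive step: put sales into groups by region."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return dict(groups)

def permutation_p_value(a, b, trials=2000, seed=0):
    """Inferential step: how often does a random relabeling produce a
    difference in means at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / trials

groups = group_by_region(sales)
summary = {region: mean(values) for region, values in groups.items()}
p_value = permutation_p_value(groups["north"], groups["south"])

print(summary)
print("p-value:", p_value)
```

A small p-value suggests that region is related to sales in this toy data, which is the kind of evidence that would justify using region as a feature in a model.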