Big data 001 -- what is big data?

Why is big data so popular?

  • Easy data collection
  • Cheap data storage
  • Cheap parallel computing

What is big data?

  • Understand and predict phenomena
  • Create models from algorithms and data
  • Create detectors / classifiers
  • Ultimately, produce a diagnosis

Prediction methods

  • Model-first: start with a model, then simulate and predict a phenomenon
  • Data-first: in data analytics, we start with many samples or observations, build a model from them, and then make predictions.

** Actual research uses a spectrum of approaches between model-first and data-first.
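
As a rough illustration of the two ends of that spectrum, the sketch below contrasts a model-first prediction (the model is assumed up front) with a data-first one (the model is fitted from observations). The linear model, the function names, and the sample values are all hypothetical and chosen only to keep the example small.

```python
def model_first_predict(x, slope=2.0, intercept=1.0):
    """Model-first: the model (y = slope * x + intercept) is assumed in advance."""
    return slope * x + intercept


def fit_line(xs, ys):
    """Data-first: estimate slope and intercept from observations (least squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations of the phenomenon.
observed_x = [0, 1, 2, 3]
observed_y = [1, 3, 5, 7]

slope, intercept = fit_line(observed_x, observed_y)
data_first_prediction = slope * 4 + intercept
model_first_prediction = model_first_predict(4)
```

Here both routes end up with the same prediction because the made-up observations follow the assumed model exactly; with real data the fitted model would generally differ.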

The course focuses on --

  • How to manage and create predictive models
    • Converting raw data into the right form
    • Building the actual machine learning model
  • Machine learning, cloud systems, distributed computing, etc.

What makes data big?

  • The data is too complex, with too many dimensions and parameters, for people to understand directly
  • It cannot fit in the memory of a single machine (e.g. 16 GB)
  • It cannot be analyzed with brute-force methods
  • It changes too rapidly (velocity)
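
One common way to cope with data that does not fit in memory is to process it as a stream and keep only a small running summary. The sketch below is one possible illustration, not something taken from the course: it uses Welford's one-pass method for the mean and variance, and the function name and sample values are hypothetical.

```python
def running_mean_variance(stream):
    """Compute count, mean, and population variance in one pass.

    Only three numbers are kept in memory, no matter how long the stream is.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator stands in for a source too large to load at once.
values = (x for x in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = running_mean_variance(values)
```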

What to do with raw data?

Data needs to be cleaned and integrated before analyzing.

  • Because data are usually taken from various places
  • We need to extract, model, clean and link them into one coherent form
  • Data analytics pipeline --
    • Acquisition of raw data
      • Input raw data
    • Data wrangling
      • Put data in the right form
    • Provide interpretation
      • Raw data is meaningless without good context
      • It is important to have metadata/typing to interpret data
      • e.g. add table headers and column names
    • Provide provenance
      • The source of the data: who collected it, how, and why
  • Process data
    • e.g. linking -- a geo-coordinate with an address and street view
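
The pipeline steps above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the raw rows, the column names, the type table, the address lookup, and the source label are all made up for the example, and a real pipeline would link against an actual geocoding source.

```python
import csv
import io

# Hypothetical headerless raw data: latitude, longitude, visit count.
RAW = "40.7128,-74.0060,12\n34.0522,-118.2437,7\n"

# Interpretation: column names and types give the raw values meaning.
COLUMNS = ["lat", "lon", "visits"]
TYPES = {"lat": float, "lon": float, "visits": int}

# Hypothetical lookup table used to link a geo-coordinate to an address.
ADDRESSES = {
    (40.7128, -74.0060): "New York, NY",
    (34.0522, -118.2437): "Los Angeles, CA",
}


def wrangle(raw_text, source):
    """Turn raw rows into typed, named records with provenance and a linked address."""
    records = []
    for row in csv.reader(io.StringIO(raw_text)):
        # Wrangling + interpretation: name each value and convert it to its type.
        record = {name: TYPES[name](value) for name, value in zip(COLUMNS, row)}
        # Provenance: remember where the record came from.
        record["provenance"] = source
        # Linking: attach the address for this coordinate, if one is known.
        record["address"] = ADDRESSES.get((record["lat"], record["lon"]))
        records.append(record)
    return records


records = wrangle(RAW, source="sensor-A export")
```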

How to pull features and build models?

What is a feature? Values used to predict classes or groups

  • Structure the data: identify features (e.g. for a duck: beak, quacks, etc.) associated with a certain class
    • Text: keywords
    • Image: shapes, patterns
    • Social data, network data: clusters, paths, etc.
  • Machine learning
    • Take raw data
    • Extract features
    • Map features to classes
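
The three machine learning steps above (take raw data, extract features, map features to classes) can be sketched with keyword features and a nearest-centroid rule. The keyword list, the training sentences, and the choice of classifier are assumptions made for illustration; the notes do not prescribe a particular method.

```python
# Hypothetical keyword vocabulary used as features.
KEYWORDS = ["beak", "quacks", "fur", "barks"]


def extract_features(text):
    """Extract features: 1 if the keyword appears in the text, else 0."""
    words = set(text.lower().split())
    return [1 if keyword in words else 0 for keyword in KEYWORDS]


def train_centroids(examples):
    """Average the feature vectors of each class to get one centroid per class."""
    sums = {}
    counts = {}
    for text, label in examples:
        features = extract_features(text)
        acc = sums.setdefault(label, [0.0] * len(KEYWORDS))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(text, centroids):
    """Map features to a class: pick the class with the closest centroid."""
    features = extract_features(text)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical labeled raw data.
training = [
    ("it has a beak and quacks", "duck"),
    ("beak quacks loudly", "duck"),
    ("it has fur and barks", "dog"),
    ("fur barks often", "dog"),
]
centroids = train_centroids(training)
label = classify("something that quacks", centroids)
```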

What is the goal of data analytics (DA)?

  • Pattern detection: categorizing data, i.e. descriptive statistics
    • e.g. put data into clusters, group sales data by region
    • Shows trends
  • Hypothesis testing: determining whether data is significant, i.e. inferential statistics
    • e.g. whether a piece of data leads to a particular result
    • Once we know which data are significant, we can use them as features and build models
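
To make the two goals concrete, the sketch below groups sales by region (descriptive) and then checks whether the difference between two groups is likely to be chance (inferential). The exact permutation test is one possible choice of significance test, not one named in the notes, and the sales figures and function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def group_means(rows):
    """Descriptive: average sales amount per region."""
    groups = {}
    for region, amount in rows:
        groups.setdefault(region, []).append(amount)
    return {region: mean(values) for region, values in groups.items()}


def permutation_p_value(a, b):
    """Inferential: exact two-sided permutation test on the difference of means.

    Enumerates every way to split the pooled values into groups of the
    original sizes, so it is only practical for small samples.
    """
    pooled = a + b
    total = sum(pooled)
    n_a = len(a)
    n_b = len(pooled) - n_a
    observed = abs(mean(a) - mean(b))
    hits = 0
    splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
        splits += 1
    return hits / splits


# Hypothetical sales figures for two regions.
sales = [("north", 10), ("north", 11), ("north", 12), ("north", 13),
         ("south", 20), ("south", 21), ("south", 22), ("south", 23)]

means = group_means(sales)
north = [amount for region, amount in sales if region == "north"]
south = [amount for region, amount in sales if region == "south"]
p_value = permutation_p_value(north, south)
```

A small p-value suggests that region matters for sales, which is the kind of evidence that would justify using region as a feature in a model.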