Why is big data so popular?
- Easy data collection
- Cheap data storage
- Cheap parallel computing
What is big data?
- To understand and predict phenomena
- Create models with algorithms and data
- Create detectors / classifiers
- Ultimately, produce a diagnosis
Prediction methods
- Model-first: start with a model, then simulate and predict a phenomenon
- Data-first: in data analytics, we start with many samples or observations, build a model from them, and then make predictions.
** Actual research uses a spectrum of approaches between model-first and data-first.
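The contrast can be sketched in a few lines of Python. This is only an illustration under assumptions that are not from the notes: the phenomenon is free fall, the model is d = 0.5 * g * t**2, and the measurements are hypothetical. The data-first half still assumes the quadratic form and only learns the coefficient, so it sits somewhere in the middle of the spectrum.

```python
# Model-first: assume a known physical model, d = 0.5 * g * t**2 (free fall).
G = 9.81  # m/s^2, standard gravity


def predict_model_first(t):
    """Predict the fall distance from the assumed physical model."""
    return 0.5 * G * t ** 2


# Data-first: start from observations and learn the coefficient from them.
# Hypothetical measurements, made up for illustration only.
times = [0.5, 1.0, 1.5, 2.0]             # seconds
distances = [1.23, 4.91, 11.04, 19.62]   # metres


def fit_coefficient(ts, ds):
    """Least-squares fit of c in d = c * t**2 (no intercept term)."""
    xs = [t ** 2 for t in ts]
    return sum(x * d for x, d in zip(xs, ds)) / sum(x * x for x in xs)


c = fit_coefficient(times, distances)


def predict_data_first(t):
    """Predict the fall distance using the coefficient learned from data."""
    return c * t ** 2
```

With these made-up numbers the fitted coefficient lands close to 0.5 * G, so both predictors agree closely; with noisier or sparser data they would diverge.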
The course focuses on --
- How to manage and create predictive models
- Converting raw data into the right form
- Building the actual machine learning model
- machine learning, cloud systems, distributed computing, etc.
What makes data big?
- The data is complex, with so many dimensions and parameters that people cannot make sense of it unaided
- It cannot be stored in the memory of a single machine (e.g. 16 GB)
- It cannot be analyzed with brute-force methods
- It changes too rapidly (velocity)
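One common response to the memory limit is to process the data in chunks, so that only a small piece is held in memory at a time. The sketch below assumes the data can be read sequentially; the helper names `read_in_chunks` and `streaming_mean` are hypothetical, and `range()` stands in for a file or stream that is too large to load at once.

```python
def read_in_chunks(values, chunk_size):
    """Yield successive chunks so only one chunk is held in memory at a time."""
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # emit the final, possibly shorter, chunk
        yield chunk


def streaming_mean(chunks):
    """Compute the mean by keeping only a running count and a running total."""
    count = 0
    total = 0.0
    for chunk in chunks:
        count += len(chunk)
        total += sum(chunk)
    return total / count if count else float("nan")


# range() is lazy, so the full sequence is never materialised in memory.
mean_value = streaming_mean(read_in_chunks(range(1, 1_000_001), chunk_size=10_000))
```

The same pattern extends to parallel computing: each chunk's count and total can be computed on a different machine and combined afterwards.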
What to do with raw data?
Data needs to be cleaned and integrated before analyzing.
- Because data are usually taken from various places
- We need to extract, model, clean and link them into one coherent form
- Data analytics pipeline --
- Acquisition of raw data
- Input raw data
- Data wrangling
- put data in the right form
- Provide interpretation
- Raw data is meaningless without a good context
- It is important to have meta-data/typing to interpret data
- e.g. add table headers and column names
- Provide provenance
- The source of the data, who, how and why these data are collected
- Process data
- e.g. linking -- associating a geo-coordinate with an address and a street view
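The pipeline steps above can be sketched on a toy scale. Everything in this example is an assumption made for illustration: the raw rows, the coordinates, the column names, and the source name "sales.csv" are all made up. It shows cleaning (normalising text, dropping incomplete rows), interpretation (column names as metadata), provenance (a source field), and linking (joining each record to a coordinate).

```python
# Hypothetical raw records and lookup table; all values are made up.
raw_sales = [" north ,120", "South,80", "north,", "EAST,45"]
region_coords = {"north": (60.0, 10.0), "south": (40.0, 10.0)}

COLUMNS = ("region", "amount")  # metadata: names give the raw fields meaning


def clean(rows, source):
    """Parse rows, normalise text, drop incomplete rows, record provenance."""
    cleaned = []
    for row in rows:
        parts = [p.strip() for p in row.split(",")]
        if len(parts) != len(COLUMNS) or not all(parts):
            continue  # skip rows with missing fields
        record = dict(zip(COLUMNS, parts))
        record["region"] = record["region"].lower()
        record["amount"] = float(record["amount"])
        record["source"] = source  # provenance: where the row came from
        cleaned.append(record)
    return cleaned


def link(records, coords):
    """Attach coordinates to each record; None when no match exists."""
    return [{**r, "coords": coords.get(r["region"])} for r in records]


linked = link(clean(raw_sales, source="sales.csv"), region_coords)
```

Keeping unmatched records with `None` rather than dropping them is a design choice: it makes gaps in the linked data visible instead of silently shrinking the dataset.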
How to pull features and build models?
What is a feature? A value used to predict a class or group.
- Structured data: identify the features (for a duck: beak, quacks, etc.) associated with a certain class
- text: keywords
- image: shapes, patterns
- social data, network data: clusters, paths, etc.
- Machine learning
- take raw data
- extract features
- map features to classes
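The three machine learning steps can be sketched with text as the raw data and keyword counts as the features. This is a minimal illustration, not a recommended method: the keyword list and training sentences are made up, and the nearest-centroid classifier is just one simple way to map features to classes.

```python
KEYWORDS = ("beak", "quacks", "fur", "barks")  # hypothetical feature vocabulary


def extract_features(text):
    """Turn raw text into a vector of keyword counts."""
    words = text.lower().split()
    return [words.count(k) for k in KEYWORDS]


def train_centroids(examples):
    """Average the feature vectors of each class (nearest-centroid model)."""
    sums, counts = {}, {}
    for text, label in examples:
        feats = extract_features(text)
        acc = sums.setdefault(label, [0.0] * len(KEYWORDS))
        for i, f in enumerate(feats):
            acc[i] += f
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [s / counts[lbl] for s in vec] for lbl, vec in sums.items()}


def classify(text, centroids):
    """Map features to the class whose centroid is closest."""
    feats = extract_features(text)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(feats, centroid))

    return min(centroids, key=lambda lbl: squared_distance(centroids[lbl]))


# Hypothetical labelled examples, made up for illustration.
training = [
    ("it has a beak and quacks", "duck"),
    ("it quacks loudly", "duck"),
    ("it has fur and barks", "dog"),
    ("it barks at night", "dog"),
]
centroids = train_centroids(training)
label = classify("something with a beak that quacks", centroids)
```

For images or network data the structure stays the same; only `extract_features` changes to produce shapes, patterns, clusters, or paths instead of keyword counts.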
What is the goal of data analytics (DA)?
- Pattern detection: categorizing data, i.e. descriptive statistics
- e.g. put data into clusters, or group sales data by region
- shows trends
- Hypothesis testing: assessing the significance of data, i.e. inferential statistics
- e.g. whether a piece of data leads to a particular result
- Knowing which data are significant, we can use them as features and build models
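Both goals can be illustrated on the same toy dataset. The sales figures below are hypothetical. The first function is descriptive (mean sales per region); the second is inferential, using a permutation test, which is one of several possible significance tests and is chosen here only because it needs nothing beyond the standard library.

```python
import random
from statistics import mean

# Hypothetical sales records (region, amount); values are made up.
sales = [("north", 120), ("north", 135), ("north", 128), ("north", 140),
         ("south", 80), ("south", 95), ("south", 88), ("south", 91)]


def group_means(records):
    """Descriptive statistics: group the amounts by region and average them."""
    groups = {}
    for region, amount in records:
        groups.setdefault(region, []).append(amount)
    return {region: mean(vals) for region, vals in groups.items()}


def permutation_p_value(a, b, trials=2000, seed=0):
    """Inferential statistics: share of random relabellings whose mean gap
    is at least as large as the observed gap between a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return hits / trials


means = group_means(sales)
north = [amount for region, amount in sales if region == "north"]
south = [amount for region, amount in sales if region == "south"]
p = permutation_p_value(north, south)
```

A small `p` suggests that region carries real information about sales, which would make it a reasonable candidate feature for a predictive model.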