Chapter 1: Data Analytics and Big Data

The Global Datasphere is a measure of all new data that is captured, created, and replicated in any given year across the globe.

  • One Terabyte (TB) = 1,000 Gigabytes (GB)
    A single TB could hold 1,000 copies of the Encyclopedia Britannica
  • All the X-rays in a large hospital

1.1 Four data analytic capabilities

Data: Any piece of information stored and/or processed by a computer or mobile device

Data Analytics refers to the technologies and processes that turn raw data into
insight for making decisions and that facilitate drawing conclusions from data


1.1.1 Descriptive Analytics

What has happened?
It is estimated that 80% of generated analytics results are descriptive in nature.
Descriptive analytics are often carried out via ad hoc reporting or dashboards


  • What was the sales volume over the past 12 months?
  • What is the number of support calls received as categorized by severity and geographic location?
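
The second question above is a plain counting exercise over past records. A minimal sketch, assuming a hypothetical call log in which each entry is a (severity, region) pair; the data and the count_calls helper are invented for illustration:

```python
from collections import Counter

# Hypothetical support-call log: one (severity, region) pair per call.
calls = [
    ("high", "East"),
    ("low", "East"),
    ("high", "West"),
    ("low", "East"),
    ("medium", "West"),
]

def count_calls(records):
    """Count calls for each (severity, region) pair."""
    return Counter(records)

summary = count_calls(calls)
for (severity, region), n in sorted(summary.items()):
    print(f"{region:5} {severity:7} {n}")
```

A dashboard or ad hoc report would typically present the same counts as a table or chart.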

1.1.2 Diagnostic Analytics

Diagnostic analytics aim to determine the cause of a phenomenon that occurred in the past using questions that focus on the reason behind the event.

Sample questions

  • Why were Q2 sales less than Q1 sales?
  • Why have there been more support calls originating from the Eastern region than from the Western region?

1.1.3 Predictive Analytics

Generate future predictions based upon past events.

Sample questions

  • What are the chances that a customer will default on a loan if they have
    missed a monthly payment?
  • What will be the patient survival rate if Drug B is administered instead of
    Drug A?

1.1.4 Prescriptive Analytics

What should I do if “x” happens?

Prescriptive analytics provide specific (prescriptive) recommendations to the user.
Various outcomes are calculated, and the best course of action for each outcome is suggested.


  • When is the best time to trade a particular stock?

1.2 Big Data

The 4 Vs of Big Data

  • Volume
    A huge amount of data
  • Velocity
    High speed and continuous flow of data
  • Variety
    Different types of structured, semi-structured and unstructured data coming from heterogeneous sources
  • Veracity
    Data may be inconsistent, incomplete and messy

1.3 Structured vs. Unstructured data

Structured data
Data conforms to a data model or schema and is often stored in tabular form.

Unstructured data
Data that does not conform to a data model or data schema is known as unstructured data.
Estimated to make up 80% of the data within any given enterprise.

Semi-structured data
Data that is not tabular but still conforms to some level of structure.
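
The three categories can be contrasted with a small sketch. The sample values are invented, and CSV and JSON are used only as familiar stand-ins for tabular and semi-structured data:

```python
import csv
import io
import json

# Structured: every row follows the same schema (the same columns).
table_text = "id,name,city\n1,Ann,Leeds\n2,Bob,York\n"
rows = list(csv.DictReader(io.StringIO(table_text)))

# Semi-structured: JSON has keys and nesting, but records may differ in shape.
json_text = '[{"id": 1, "name": "Ann", "tags": ["vip"]}, {"id": 2, "name": "Bob"}]'
records = json.loads(json_text)

# Unstructured: free text with no data model at all.
note = "Ann called on Monday to ask about her late delivery."

print(rows[0]["city"])         # a column lookup works for every row
print(records[0].get("tags"))  # present on the first record
print(records[1].get("tags"))  # absent on the second record
print(len(note.split()))       # only generic operations apply, e.g. a word count
```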


1.4 The big data processing cycle

1.4.1 Collect

Collect raw data from sources such as transactions, logs, and mobile devices.
Permits developers to ingest a wide variety of data.

1.4.2 Store

Requires a secure, scalable, and durable repository to store data before or after the processing tasks.

1.4.3 Process and analyze

Data is transformed from its raw state into a consumable format, usually by means of sorting, aggregating, joining, and performing more advanced functions and algorithms.
The resulting datasets are then stored for further processing or made available for consumption with business intelligence and data visualization tools.

1.4.4 Consume and visualize

Data is made available to stakeholders through self-service business intelligence and data visualization tools to allow fast and easy exploration of datasets.
Users might also consume the resulting data in the form of statistical predictions (in the case of predictive analytics) or recommended actions (in the case of prescriptive analytics).


1.5 Databases

Designed to store and handle transaction data (live, real time data)

Relational databases (e.g. MySQL) store data in tables with fixed rows and columns.

Non-relational databases (NoSQL) store data in a variety of data models (e.g. JSON).
They offer a more flexible schema (how the data is organized).


1.6 Data Warehouse

A data warehouse is a giant database storing highly structured information that is optimized for analytics.

It typically stores current and historical data from one or more systems and disparate data sources.
It may not reflect the most up-to-date state of the data.
Business analysts and data scientists can connect to data warehouses to explore the data, look for insights, and generate reports for business stakeholders.

Google BigQuery, Amazon

1.7 Extract, transform, load (ETL)

The ETL processes move data from its original source (e.g. database or other sources) to the data warehouse on a regular schedule (e.g., hourly or daily)

Extract: Extract data from homogeneous/heterogeneous sources
Transform: Clean the data and transform it into an appropriate format
Load: Insert data into the target data warehouse
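
The three steps above can be sketched end to end. This is only an illustration under stated assumptions: the CSV text, the orders table and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source export; a real pipeline would read from a database or files.
SOURCE_CSV = "order_id,amount,country\n1, 19.99 ,uk\n2,,us\n3,5.00,UK\n"

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, then normalise types and casing."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record
        clean.append((int(row["order_id"]), float(amount), row["country"].strip().upper()))
    return clean

def load(conn, rows):
    """Load: insert the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT country, COUNT(*), SUM(amount) FROM orders GROUP BY country"
).fetchall()
print(loaded)
```

In practice a scheduler would run such a job hourly or daily, as described above.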

1.8 Solving the big data challenges

  • Scaling up (Vertical scaling)
    Have a supercomputer with enormous amounts of storage attached to an extremely fast network.
  • Scaling out (Horizontal scaling)[A BETTER WAY]
    Have a lot of smaller computers, each with a modest amount of storage, connected by networking.


1.9 Processing of Big Data

The challenges of Big Data cannot be handled easily by traditional storage technology, e.g. databases

Hadoop
A framework that allows for storing a large amount of data and the distributed processing of
large data sets across clusters of computers

MapReduce
A programming paradigm that enables massive scalability across hundreds or thousands of
servers in a Hadoop cluster.

Apache Spark
An open source unified analytics engine for large scale data processing


2.0 Distributed File Systems

A cluster is a tightly coupled collection of servers, or nodes.
A distributed file system allows us to store large files that are spread across the nodes of a cluster

E.g. Hadoop Distributed File System (HDFS).

2.1 Hadoop

Splitting large datasets

A large dataset is split into smaller data blocks, which are stored on different nodes.

In Hadoop, each block contains 128 MB of data and is replicated three times by default.

Replication Factor: The number of times the Hadoop framework replicates each data block.
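
As a quick arithmetic check of the defaults above, the sketch below estimates how many blocks a file occupies and how much raw storage its replicas need. The function name and the 1,024 MB example file are illustrative, not part of Hadoop.

```python
import math

BLOCK_SIZE_MB = 128       # default block size stated above
REPLICATION_FACTOR = 3    # default replication factor stated above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION_FACTOR):
    """Return (number of blocks, stored block copies, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    copies = blocks * replication
    # The final block only takes up as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, copies, raw_storage_mb

print(hdfs_footprint(1024))  # (8, 24, 3072)
```

A 1,024 MB file is therefore split into 8 blocks, stored as 24 block copies across the cluster.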

Traditional approach

Moving a huge amount of data to the processing unit is costly.
The processing unit becomes the bottleneck.

Map Function

Instead of moving data to the processing unit, we are moving the processing unit to the data

MapReduce consists of two distinct tasks: Map and Reduce.
Map: Process data to create key-value pairs in parallel


Reduce function

Map: process data by workers based on where data is stored
Reduce: Aggregate results by the “reduce workers”
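
A word count is the usual way to illustrate the two tasks. The sketch below runs everything sequentially in one process, so it only mimics the data flow; in Hadoop the map tasks would run in parallel on the nodes that hold each block. The sample blocks and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical data blocks, as if each were stored on a different node.
blocks = [
    "big data big cluster",
    "data moves to the cluster",
]

def map_block(block):
    """Map: emit a (word, 1) key-value pair for every word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group all values that share a key, ready for the reduce step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_group(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)

mapped = [pair for block in blocks for pair in map_block(block)]
counts = dict(reduce_group(k, v) for k, v in shuffle(mapped).items())
print(counts)
```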



Data Visualization

Creation and study of the visual representation of data
One of the most important tools for data analytics/science.



A dashboard is a read-only snapshot of an analysis that you can share with other users for reporting purposes.

Chapter 2: Overview of AI and Machine Learning

Autonomous Driving Car

Self-driving vehicles or “driverless” cars

Combine sensors and software to control, navigate, and drive the vehicle.

Drivers are NOT required to take control to safely operate the vehicle.

Vehicle/Object Detection

Classify and detect the objects in the image.

Assign a class to each object and draw a bounding box around it.

Disease Detection

1. Subfields of Artificial Intelligence

AI is concerned with developing machines that can perform tasks usually done by humans using natural intelligence

Computer Vision: Enabling computers to derive information from images and videos

Natural Language Processing (NLP): Giving computers the ability to understand text and spoken words

Speech Recognition
Machine Learning
Deep Learning

1.1 Image Classification

Image classification models take an image as input and return a prediction about which class the image belongs to.

Images are expected to have only one class for each image.

1.2 Object Detection

Takes an image as input and outputs the image with bounding boxes and labels on the detected objects.

For example, Google Lens.

Automated Face analysis tasks

Face detection: Detect if there is a face in images/videos.

Face classification: Determine the kind of face
E.g. the age, gender and emotion of a person from the face

Face verification: One to one
Is it the same face (e.g. unlock your mobile phone)?

Face identification: One to many
E.g. Police search

1.3 Natural language processing (NLP)

The branch of artificial intelligence (AI) concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

Language Translation
Sentiment analysis

Extract subjective qualities (e.g. attitude, emotion) from text.

Predict whether a movie review is positive or negative, based on the words in the review

Named Entity Recognition (NER)

Identify specific entities in a text, such as dates, individuals and places


1.4 Chatbots

Software application built to simulate a human-like conversation.

Involve speech recognition, natural language processing and speech synthesis

Text to speech

Text to Speech (TTS) is the task of generating natural-sounding speech given
text input.

May generate speech for multiple speakers and multiple languages.

Speech to text

Convert voice to text

AI, Machine Learning and Deep Learning


Example: Recognizing a digit

Let’s say that we want to teach a computer to recognize the number 7

Rules for distinguishing 7 from other characters

7s have a mostly horizontal line near the top of the figure

They have a mostly northeast-southwest diagonal line

Those two lines meet in the upper right.

2. Problem: Rule-based approach


Finding a good and complete set of rules is frequently an overwhelmingly
difficult task.

The rules human experts follow are often not explicit

Easy to overlook exceptions and special cases

The technology, laws, and social conventions around human activities are
constantly changing

This tangled web of interconnecting rules must be constantly monitored, updated, and repaired.

Learning by examples


Provide many examples of each class of image

The computer looks at these examples and learns about the visual appearance and
features of each type of image

Learning the rules instead of coding the rules

3. Machine Learning

In ML, features are any property or characteristic of the data that the
model can use to make predictions

Spam Classifier

Spam : junk or unwanted email, such as chain letters, promotions, etc
Ham: non spam emails.
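
To make the idea of features concrete for the spam example, the sketch below turns an email into a few numeric properties a model could use. The four features chosen here are assumptions for illustration only; a real spam classifier would use many more.

```python
def extract_features(email_text):
    """Turn raw email text into a dictionary of numeric features."""
    words = email_text.lower().split()
    return {
        "num_words": len(words),
        "num_exclamations": email_text.count("!"),
        "mentions_free": int("free" in words),
        "share_uppercase": sum(c.isupper() for c in email_text) / max(len(email_text), 1),
    }

# Hypothetical example emails.
spam = "WIN a FREE prize now!!!"
ham = "Can we move the meeting to Tuesday?"

print(extract_features(spam))
print(extract_features(ham))
```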


ImageNet

A large visual database designed for use in visual object recognition software research

More than 14 million images have been hand annotated by the project to indicate what objects are pictured, covering 100,000 classes

ImageNet contains more than 20,000 categories

E.g. “balloon” or “strawberry”, each consisting of several hundred images

4. ML models and algorithms

ML Model
A representation of reality using a set of rules that mimic the existing data as closely as possible

Training
Giving examples to a model so it can learn.
Split the dataset into two parts

Training set: Used to train the model
Test set: Used to test the accuracy of the model on data the model has never seen before during training
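
A minimal sketch of such a split, assuming a random shuffle and a hypothetical 80/20 ratio (the notes do not prescribe one); the function name and its parameters are illustrative:

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # placeholder samples
train, test = train_test_split(data)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split reproducible, so accuracy figures can be compared between runs.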

Algorithm
A procedure, or a set of steps, used to solve a problem or perform a computation
The goal of machine learning algorithms is to build a model for prediction

5. K-nearest neighbor


The nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6).
If a point is close to another in the scatter plot, then the perimeter and concavity values
are similar.
We may expect that they would have the same diagnosis.

Classifying unlabeled examples by assigning them the class of similar labeled examples
“k” is a parameter that specifies the number of neighbors to consider when making
the classification.

Recommendation systems that predict whether a person will enjoy a movie or song
Identifying patterns in genetic data to detect specific proteins or diseases
Computer vision applications, including optical character recognition and facial recognition in
both still images and video.

3-nearest neighbor

To improve the prediction we can consider several
neighboring points
Among those 3 closest points, we use the majority class as our prediction for the new observation
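
The majority vote described above can be sketched in a few lines. Apart from the point (2.1, 3.6) mentioned earlier, all coordinates, labels and the new observation are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical (perimeter, concavity) values; only (2.1, 3.6) comes from the text.
points = [(2.1, 3.6), (1.8, 3.9), (2.6, 3.2), (0.2, 0.5), (0.4, 0.1)]
labels = ["malignant", "malignant", "benign", "benign", "benign"]
new_observation = (2.0, 3.5)

print(knn_predict(points, labels, new_observation, k=1))
print(knn_predict(points, labels, new_observation, k=3))
```

With k=3 the three closest points are two malignant and one benign, so the majority class is malignant.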

Euclidean Distance
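
The straight-line distance between two points a and b is the square root of the sum of the squared differences of their coordinates: d(a, b) = sqrt((a1 - b1)^2 + ... + (an - bn)^2). A small sketch, with an illustrative function name:

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

Python's standard library offers the same calculation as math.dist (Python 3.8 and later).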


Boundary method




Handwritten digit recognition

MNIST handwritten digit database

Ground Truth

Ground truth is information that is known to be real or true.

Training and Loss



The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

In each epoch, each sample in the training dataset has had an opportunity to update the internal model parameters.
• In the first epoch, AI may make large prediction errors
• Feed the training data to AI multiple times to learn from the mistakes and reduce the prediction errors

Batch size

Due to computational and memory limits, we generally don’t feed the entire training set to the AI model at once
• Break down the training data into smaller batches which are fed to the model individually
• The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters
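
How epochs and batch size interact can be seen by counting the batches a training loop would process. This sketch contains no model; the sample count and the hyperparameter values are arbitrary choices for illustration.

```python
import math

def iterate_batches(samples, batch_size):
    """Yield consecutive batches of at most `batch_size` samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # placeholder training set
num_epochs = 3             # hyperparameter
batch_size = 4             # hyperparameter

steps = 0
for epoch in range(num_epochs):
    for batch in iterate_batches(samples, batch_size):
        # A real training loop would compute the loss on `batch`
        # and update the model here.
        steps += 1

print(steps, num_epochs * math.ceil(len(samples) / batch_size))  # 9 9
```

Ten samples in batches of four give three batches per epoch (4, 4 and 2 samples), so three epochs process nine batches.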



Any quantity that the model creates or modifies during the training process is a parameter
• We can twist many other knobs before training a model
• E.g. the number of epochs, batch size, the “k” value in k nearest neighbor, learning rate (more about it later), etc
• Any quantity that you set before the training process is a hyperparameter


Overfitting refers to a model that fits the training data well but fails to generalize to new data

6. Types of Machine Learning

Supervised Learning

Classification (Binary/Multiclass)

Use attributes (𝑥1, 𝑥2, …) to predict a categorical variable (𝑦), e.g. yes/no, rain/no rain

Evaluation of Model
Positive vs. Negative Class
Confusion Matrix
Validation Set
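
For a binary classifier, the confusion matrix listed above counts how predictions for the positive and negative class line up with the true labels. A minimal sketch, assuming hypothetical spam/ham labels with "spam" as the positive class:

```python
def confusion_matrix(y_true, y_pred, positive):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

# Hypothetical labels: "spam" is treated as the positive class.
y_true = ["spam", "ham", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "ham", "spam"]

cm = confusion_matrix(y_true, y_pred, positive="spam")
accuracy = (cm["TP"] + cm["TN"]) / len(y_true)
print(cm, accuracy)
```

Accuracy is the share of correct predictions, (TP + TN) divided by the number of samples.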

To validate which model to use: Cross Validation

In the case of small data sets (say, fewer than 1,000 observations), a very popular scheme is cross-validation
• The data is split into K folds (e.g., 5). A model is trained on K − 1 training folds and tested on the remaining validation fold.
• This is repeated for all possible validation folds resulting in K performance estimates that can then be averaged
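
The fold bookkeeping can be sketched as follows. The splitter works on sample indices without shuffling, and the evaluate callback (which would train on the training folds and return a score on the validation fold) is a hypothetical placeholder.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, validation) over all folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, k=5))
print([val for _, val in folds])
```

Each sample lands in exactly one validation fold, so every observation is used for both training and validation across the K runs.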



Simple Linear Regression

y = f(x_1, x_2, x_3, ...)

Use attributes (𝑥1, 𝑥2, …) to predict a numerical variable (𝑦).

The output of a regression model is a number, e.g. prices, sizes, or weights.
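
For a single attribute, the simplest choice of f is a straight line, y = slope * x + intercept, fitted by ordinary least squares. The sketch below uses the closed-form solution; the data (floor area against price) is invented for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one attribute x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: floor area (square metres) against price (thousands).
areas = [50, 60, 80, 100]
prices = [150, 180, 240, 300]

slope, intercept = fit_simple_linear_regression(areas, prices)
print(slope, intercept)
print(slope * 70 + intercept)  # predicted price for an area of 70
```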




Unsupervised Learning

Unsupervised learning is also a common type of machine learning.



