DSCI550: Data Science at Scale

Homework 3, Spring 2024

SHOW EACH STEP OF COMPUTATION.

1. (25 pts) (Decision Tree) Using the training dataset below, construct a decision tree using Information Gain and Entropy as discussed in class. Use attributes V1, V2, V3, and V4 to predict the class C.
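As a computational aid, here is a minimal Python sketch of the entropy and information-gain calculations used to choose each split. The training table is not reproduced above, so the `rows` values below are hypothetical placeholders; substitute the actual records before computing.

```python
# Entropy / information-gain helpers for decision-tree construction.
from collections import Counter
from math import log2

def entropy(labels):
    """H(C) = -sum over classes c of p(c) * log2 p(c)."""
    n = len(labels)
    return -sum((cnt / n) * log2(cnt / n) for cnt in Counter(labels).values())

def information_gain(rows, attr, target="C"):
    """Gain(attr) = H(C) - sum over values v of (|rows_v|/|rows|) * H(C | attr=v)."""
    total = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Hypothetical rows; replace with the actual training set from the table.
rows = [
    {"V1": 0, "V2": 1, "V3": 0, "V4": 1, "C": "yes"},
    {"V1": 1, "V2": 1, "V3": 0, "V4": 0, "C": "no"},
    {"V1": 0, "V2": 0, "V3": 1, "V4": 1, "C": "yes"},
]
best = max(["V1", "V2", "V3", "V4"], key=lambda a: information_gain(rows, a))
print("root split on:", best)
```

Apply the same gain computation recursively within each branch to grow the full tree, and record each H(...) and Gain(...) value to show every step as required.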

2. (20 pts) (Naïve Bayes Classifier) We have data on 1000 patients. Each was diagnosed with Flu, Allergy, or Other Disease based on three symptoms, as shown. This is our 'training set'; we will use it to predict the diagnosis of any new patient we encounter.

A new patient reports “High Fever, No Sneezing, and Runny Nose”. Is this Flu, Allergy, or Other? Use the Naïve Bayes classifier.
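A minimal sketch of the posterior computation follows. The 1000-patient count table is not reproduced above, so the priors and conditional probabilities below are hypothetical placeholders; replace each with the corresponding count divided by its class total.

```python
import math

# Hypothetical class priors P(class); replace with class counts / 1000.
priors = {"Flu": 0.4, "Allergy": 0.3, "Other": 0.3}

# Hypothetical conditionals P(symptom value | class); replace with table counts.
cond = {
    "Flu":     {"fever=high": 0.8, "sneezing=no": 0.3, "runny_nose=yes": 0.7},
    "Allergy": {"fever=high": 0.1, "sneezing=no": 0.2, "runny_nose=yes": 0.9},
    "Other":   {"fever=high": 0.4, "sneezing=no": 0.6, "runny_nose=yes": 0.3},
}
evidence = ["fever=high", "sneezing=no", "runny_nose=yes"]

# Naive Bayes: unnormalized posterior = P(class) * product of P(symptom | class).
scores = {c: priors[c] * math.prod(cond[c][e] for e in evidence) for c in priors}
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print("prediction:", max(posterior, key=posterior.get), posterior)
```

The predicted class is the one with the largest unnormalized score; normalizing by the total only rescales the scores into probabilities.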

3. (15 pts) (Regression) A company is investigating the relationship between its advertising expenditures and the sales of its products. The following data represent a sample of 10 products. Note that AD = advertising dollars in $K and S = sales in thousands of dollars.

1) (5 pts) Find the equation of the regression line, using Advertising dollars as the independent variable and Sales as the response variable.

2) (3 pts) Plot the scatter diagram and the regression line.

3) (5 pts) Find r² and interpret it in the context of the problem.

4) (2 pts) Use the regression line to predict the Sales if Advertising dollars = $50K (a computational sketch follows the sub-parts).
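Here is a minimal least-squares sketch covering parts 1), 3), and 4). The (AD, S) pairs from the table are not reproduced above, so the two lists below are hypothetical placeholders for the 10 products.

```python
# Placeholder data: replace with the actual 10 (AD, S) pairs from the table.
ad = [10, 15, 20, 25, 30, 35, 40, 45, 55, 60]   # advertising, $K
s  = [25, 31, 38, 41, 47, 55, 60, 66, 75, 82]   # sales, $K

n = len(ad)
mean_x, mean_y = sum(ad) / n, sum(s) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad, s))
sxx = sum((x - mean_x) ** 2 for x in ad)
syy = sum((y - mean_y) ** 2 for y in s)

b1 = sxy / sxx                 # slope
b0 = mean_y - b1 * mean_x      # intercept
r2 = sxy ** 2 / (sxx * syy)    # coefficient of determination

print(f"S_hat = {b0:.3f} + {b1:.3f} * AD,  r^2 = {r2:.3f}")
print("predicted sales at AD = 50:", b0 + b1 * 50)
```

r² is the fraction of the variation in sales explained by the linear relationship with advertising dollars; plot the (AD, S) points and the fitted line together for part 2).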

4. (20 pts) (Hierarchical Clustering) Five two-dimensional data points are shown below, together with their distance matrix, i.e., the symmetric matrix that gives the pairwise distance between any two points.

Use the distance matrix to perform the following two types of hierarchical clustering: MIN (single-link) and MAX (complete-link) distance. Show your results by drawing a dendrogram. Note: the dendrogram should clearly show the order in which the points are merged.
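A minimal agglomerative-clustering sketch follows, driven directly by a distance matrix. The homework's 5×5 matrix is not reproduced above, so `D` below is a hypothetical symmetric placeholder; the printed merge sequence gives the order needed for the dendrogram.

```python
# Placeholder symmetric distance matrix; replace with the matrix from the problem.
D = [
    [0.0, 0.2, 0.9, 0.8, 0.7],
    [0.2, 0.0, 0.6, 0.9, 0.5],
    [0.9, 0.6, 0.0, 0.3, 0.4],
    [0.8, 0.9, 0.3, 0.0, 0.6],
    [0.7, 0.5, 0.4, 0.6, 0.0],
]

def linkage(a, b, kind):
    """Inter-cluster distance: MIN = single link, MAX = complete link."""
    pairs = [D[i][j] for i in a for j in b]
    return min(pairs) if kind == "MIN" else max(pairs)

def agglomerate(kind):
    clusters = [{i} for i in range(len(D))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], kind))
        d = linkage(clusters[i], clusters[j], kind)
        print(kind, "merge", sorted(clusters[i]), "+", sorted(clusters[j]),
              "at distance", round(d, 3))
        clusters[i] |= clusters.pop(j)

agglomerate("MIN")
agglomerate("MAX")
```

Each printed merge corresponds to one junction in the dendrogram, at the height given by the merge distance.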

5. (20 pts) (k-Means Clustering) For the following six points,

1) Use the k-means algorithm to show the final clustering result, assuming A1 and A6 are initially assigned as the two cluster centers.

2) Use the k-means algorithm to show the final clustering result, assuming A3 and A4 are initially assigned as the two cluster centers.

3) Compute the quality of the k-means clustering using the Sum of Squared Error (SSE), a cohesion measure of how near the data points in a cluster are to the cluster centroid (a sketch follows the formula below). Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the intra-cluster sum of squares:

SSE = Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖², where μ_i is the mean of points in S_i.
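A minimal sketch of the full problem follows: Lloyd's assign/update iteration plus the SSE formula above. The coordinates of A1–A6 are not reproduced above, so the `points` values below are hypothetical placeholders.

```python
# Placeholder coordinates; replace with the actual six points from the problem.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
          "A4": (5, 8), "A5": (7, 5), "A6": (6, 4)}

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(centers):
    while True:
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for name, p in points.items():
            k = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[k].append(name)
        # Update step: recompute each center as the mean of its cluster.
        new = [tuple(sum(points[n][d] for n in c) / len(c) for d in (0, 1))
               for c in clusters]
        if new == centers:  # converged: assignments no longer change
            sse = sum(dist2(points[n], centers[i])
                      for i, c in enumerate(clusters) for n in c)
            return clusters, centers, sse
        centers = new

# 1) initial centers A1 and A6;  2) initial centers A3 and A4.
for init in (["A1", "A6"], ["A3", "A4"]):
    cl, ce, sse = kmeans([points[n] for n in init])
    print("init", init, "->", cl, " SSE =", round(sse, 3))
```

Comparing the SSE values from the two initializations shows how sensitive k-means is to the choice of starting centers: the lower SSE indicates the more cohesive clustering.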
