DSCI550: Data Science at Scale

Homework 3, Spring 2024

SHOW EACH STEP OF COMPUTATION.

1. (25 pts) (Decision Tree) Using the training dataset below, construct a decision tree using Information Gain and Entropy as discussed in class. Use attributes V1, V2, V3, and V4 to predict the class C.
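As a computational aid, here is a minimal Python sketch of the entropy and information-gain calculations used to choose each split. The training table is not reproduced above, so the `rows` values below are hypothetical placeholders; substitute the actual records before computing.

```python
# Entropy / information-gain helpers for decision-tree construction.
from collections import Counter
from math import log2

def entropy(labels):
    """H(C) = -sum over classes c of p(c) * log2 p(c)."""
    n = len(labels)
    return -sum((cnt / n) * log2(cnt / n) for cnt in Counter(labels).values())

def information_gain(rows, attr, target="C"):
    """Gain(attr) = H(C) - sum over values v of (|rows_v|/|rows|) * H(C | attr=v)."""
    total = entropy([r[target] for r in rows])
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder

# Hypothetical rows; replace with the actual training set from the table.
rows = [
    {"V1": 0, "V2": 1, "V3": 0, "V4": 1, "C": "yes"},
    {"V1": 1, "V2": 1, "V3": 0, "V4": 0, "C": "no"},
    {"V1": 0, "V2": 0, "V3": 1, "V4": 1, "C": "yes"},
]
best = max(["V1", "V2", "V3", "V4"], key=lambda a: information_gain(rows, a))
print("root split on:", best)
```

Apply the same gain computation recursively within each branch to grow the full tree, and record each H(...) and Gain(...) value to show every step as required.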

2. (20 pts) (Naïve Bayes Classifier) We have data on 1000 patients. Each was diagnosed with Flu, Allergy, or Other Disease based on three symptoms, as shown. This is our 'training set'; we will use it to predict the diagnosis of any new patient we encounter.

A new patient reports “High Fever, No Sneezing, and Runny Nose”. Is this Flu, Allergy, or Other? Use the Naïve Bayes classifier.
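A minimal sketch of the posterior computation follows. The 1000-patient count table is not reproduced above, so the priors and conditional probabilities below are hypothetical placeholders; replace each with the corresponding count divided by its class total.

```python
import math

# Hypothetical class priors P(class); replace with class counts / 1000.
priors = {"Flu": 0.4, "Allergy": 0.3, "Other": 0.3}

# Hypothetical conditionals P(symptom value | class); replace with table counts.
cond = {
    "Flu":     {"fever=high": 0.8, "sneezing=no": 0.3, "runny_nose=yes": 0.7},
    "Allergy": {"fever=high": 0.1, "sneezing=no": 0.2, "runny_nose=yes": 0.9},
    "Other":   {"fever=high": 0.4, "sneezing=no": 0.6, "runny_nose=yes": 0.3},
}
evidence = ["fever=high", "sneezing=no", "runny_nose=yes"]

# Naive Bayes: unnormalized posterior = P(class) * product of P(symptom | class).
scores = {c: priors[c] * math.prod(cond[c][e] for e in evidence) for c in priors}
total = sum(scores.values())
posterior = {c: s / total for c, s in scores.items()}
print("prediction:", max(posterior, key=posterior.get), posterior)
```

The predicted class is the one with the largest unnormalized score; normalizing by the total only rescales the scores into probabilities.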

3. (15 pts) (Regression) A company is investigating the relationship between its advertising expenditures and the sales of its products. The following data represent a sample of 10 products. Note that AD = advertising dollars in $K and S = sales in thousands of dollars.

1) (5 pts) Find the equation of the regression line, using Advertising dollars as the independent variable and Sales as the response variable.

2) (3 pts) Plot the scatter diagram and the regression line.

3) (5 pts) Find r² and interpret it in the context of the problem.

4) (2 pts) Use the regression line to predict the Sales if Advertising dollars = $50K (a computational sketch follows the sub-parts).
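Here is a minimal least-squares sketch covering parts 1), 3), and 4). The (AD, S) pairs from the table are not reproduced above, so the two lists below are hypothetical placeholders for the 10 products.

```python
# Placeholder data: replace with the actual 10 (AD, S) pairs from the table.
ad = [10, 15, 20, 25, 30, 35, 40, 45, 55, 60]   # advertising, $K
s  = [25, 31, 38, 41, 47, 55, 60, 66, 75, 82]   # sales, $K

n = len(ad)
mean_x, mean_y = sum(ad) / n, sum(s) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad, s))
sxx = sum((x - mean_x) ** 2 for x in ad)
syy = sum((y - mean_y) ** 2 for y in s)

b1 = sxy / sxx                 # slope
b0 = mean_y - b1 * mean_x      # intercept
r2 = sxy ** 2 / (sxx * syy)    # coefficient of determination

print(f"S_hat = {b0:.3f} + {b1:.3f} * AD,  r^2 = {r2:.3f}")
print("predicted sales at AD = 50:", b0 + b1 * 50)
```

r² is the fraction of the variation in sales explained by the linear relationship with advertising dollars; plot the (AD, S) points and the fitted line together for part 2).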

4. (20 pts) (Hierarchical Clustering) Five two-dimensional data points are shown below, together with their distance matrix, i.e., the symmetric matrix that gives the pairwise distance between any two points.

Use the distance matrix to perform the following two types of hierarchical clustering: MIN (single-link) and MAX (complete-link) distance. Show your results by drawing a dendrogram. Note: the dendrogram should clearly show the order in which the points are merged.
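A minimal agglomerative-clustering sketch follows, driven directly by a distance matrix. The homework's 5×5 matrix is not reproduced above, so `D` below is a hypothetical symmetric placeholder; the printed merge sequence gives the order needed for the dendrogram.

```python
# Placeholder symmetric distance matrix; replace with the matrix from the problem.
D = [
    [0.0, 0.2, 0.9, 0.8, 0.7],
    [0.2, 0.0, 0.6, 0.9, 0.5],
    [0.9, 0.6, 0.0, 0.3, 0.4],
    [0.8, 0.9, 0.3, 0.0, 0.6],
    [0.7, 0.5, 0.4, 0.6, 0.0],
]

def linkage(a, b, kind):
    """Inter-cluster distance: MIN = single link, MAX = complete link."""
    pairs = [D[i][j] for i in a for j in b]
    return min(pairs) if kind == "MIN" else max(pairs)

def agglomerate(kind):
    clusters = [{i} for i in range(len(D))]
    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], kind))
        d = linkage(clusters[i], clusters[j], kind)
        print(kind, "merge", sorted(clusters[i]), "+", sorted(clusters[j]),
              "at distance", round(d, 3))
        clusters[i] |= clusters.pop(j)

agglomerate("MIN")
agglomerate("MAX")
```

Each printed merge corresponds to one junction in the dendrogram, at the height given by the merge distance.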

5. (20 pts) (k-Means Clustering) For the following six points,

1) Use the k-means algorithm to show the final clustering result, assuming A1 and A6 are initially assigned as the two cluster centers.

2) Use the k-means algorithm to show the final clustering result, assuming A3 and A4 are initially assigned as the two cluster centers.

3) Compute the quality of the k-means clustering using the Sum of Squared Error (SSE), a cohesion measure of how near the data points in a cluster are to the cluster centroid (a sketch follows the formula below). Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the intra-cluster sum of squares:

SSE = Σ_{i=1}^{k} Σ_{x ∈ S_i} ‖x − μ_i‖², where μ_i is the mean of points in S_i.
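A minimal sketch of the full problem follows: Lloyd's assign/update iteration plus the SSE formula above. The coordinates of A1–A6 are not reproduced above, so the `points` values below are hypothetical placeholders.

```python
# Placeholder coordinates; replace with the actual six points from the problem.
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
          "A4": (5, 8), "A5": (7, 5), "A6": (6, 4)}

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(centers):
    while True:
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for name, p in points.items():
            k = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[k].append(name)
        # Update step: recompute each center as the mean of its cluster.
        new = [tuple(sum(points[n][d] for n in c) / len(c) for d in (0, 1))
               for c in clusters]
        if new == centers:  # converged: assignments no longer change
            sse = sum(dist2(points[n], centers[i])
                      for i, c in enumerate(clusters) for n in c)
            return clusters, centers, sse
        centers = new

# 1) initial centers A1 and A6;  2) initial centers A3 and A4.
for init in (["A1", "A6"], ["A3", "A4"]):
    cl, ce, sse = kmeans([points[n] for n in init])
    print("init", init, "->", cl, " SSE =", round(sse, 3))
```

Comparing the SSE values from the two initializations shows how sensitive k-means is to the choice of starting centers: the lower SSE indicates the more cohesive clustering.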
