An Introduction to Statistical Learning: Chapter 2 Exercises

Preface

Self-studying statistical learning is a lonely process, and the related resources available in Chinese are scarce and incomplete. I have organized the exercises I completed while studying, together with some of my own thoughts, in the hope that they will help others working through the same material. Since the textbook is in English, I answer in English as much as possible, to practice my English and strengthen my academic skills. If you find mistakes, please point them out; discussion in the comments is welcome.

All of this material has been uploaded to GitHub: An-Introduction-to-Statistical-Learning-with-Python/Exercises/Chap-02 at main · Austinggg/An-Introduction-to-Statistical-Learning-with-Python (github.com)

Reference: jooolia.github.io/IntroStatLearning/Exercises/chapter_2/chapter_2_questions.html

Conceptual

  1. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.

    (a) The sample size n is extremely large, and the number of predictors p is small.

    Expectation: Better performance of flexible methods.

    • With a large sample size n, there is enough data to support the estimation of a more complex model without the risk of overfitting.

    • Since the number of predictors p is small, the model can be more flexible without the curse of dimensionality hurting performance.

    We would therefore expect a flexible statistical learning method to perform better than an inflexible one, because with a large n the flexible fit can approach the true relationship without overfitting.

    (b) The number of predictors p is extremely large, and the number of observations n is small.

    Expectation: Better performance of inflexible methods.

    • When the number of predictors is large relative to the number of observations, flexible methods are likely to overfit the data due to the high risk of model complexity exceeding the information provided by the limited data.

    • Inflexible methods, with fewer parameters, are less likely to overfit in such "wide" data scenarios.

    The performance of a flexible statistical learning method would be worse as the probability of overfitting would be very high.

    (c) The relationship between the predictors and response is highly non-linear.

    Expectation: Better performance of flexible methods.

    • Flexible methods are better at capturing non-linear relationships because they can model complex interactions and non-linear effects.

    • Inflexible methods, which often assume linearity, would not perform well in capturing such relationships and could lead to poor predictions.

    Flexible statistical learning methods are better suited to non-linear relationships than inflexible methods: a flexible method has more capacity to approximate the true underlying relationship.

    (d) The variance of the error terms, i.e. $\sigma^2 = \mathrm{Var}(\epsilon)$, is extremely high.

    Expectation: Better performance of inflexible methods.

    The performance of a flexible statistical method would be worse when the variance of the error term is very high. Overfitting would be a large worry, i.e. the model would follow the errors (noise) in the data, so the flexible approach would likely have lower test performance.
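    To make this concrete, here is a small simulation sketch (assuming scikit-learn is available; the data-generating process and the degree-15 polynomial are illustrative choices, not from the book). The true relationship is linear but the error variance is huge, and the flexible fit typically ends up with the higher test MSE because it chases the noise:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)

    # True f is linear; the irreducible error is very large (sigma = 10)
    n = 50
    X = rng.uniform(0, 10, size=(n, 1))
    y = 2 * X.ravel() + rng.normal(scale=10.0, size=n)
    X_test = rng.uniform(0, 10, size=(1000, 1))
    y_test = 2 * X_test.ravel() + rng.normal(scale=10.0, size=1000)

    # Inflexible: plain linear regression. Flexible: degree-15 polynomial.
    inflexible = LinearRegression().fit(X, y)
    flexible = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X, y)

    print("linear test MSE:   ", mean_squared_error(y_test, inflexible.predict(X_test)))
    print("degree-15 test MSE:", mean_squared_error(y_test, flexible.predict(X_test)))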

  2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.

    (a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

    regression, inference, n = 500, p = 3 (profit, number of employees, industry).

    (b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

    classification, prediction, n = 20, p = 13 (price charged for the product, marketing budget, competition price, and ten other variables).

    (c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.

    regression, prediction, n = 52 (we collect weekly data for all of 2012, and one year consists of 52 weeks), p = 3 (the % change in the US market, the % change in the British market, and the % change in the German market; the % change in the USD/Euro is the response rather than a predictor).

  3. We now revisit the bias-variance decomposition.

    (a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

    (A sketch generated in code is shown after part (b).)

    (b) Explain why each of the five curves has the shape displayed in part (a).

    • (squared) bias - decreases with flexibility, because a more flexible model can fit the underlying structure of the data more closely

    • variance - increases with flexibility, because a wobblier fit follows the particular training data more closely

    • training error - decreases monotonically with flexibility, because a more flexible model can always follow the training data more closely

    • test error - decreases and then increases with flexibility (a U-shape): past the optimum, the model is following noise in the training set that the test data do not share

    • irreducible error Var(ε) - stays constant regardless of the method, because it is inherent in the data
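    Since part (a) asks for a sketch, here is a minimal matplotlib sketch that draws the five curves. The functional forms are illustrative inventions chosen only to have the right shapes; nothing here is fitted to data.

    import numpy as np
    import matplotlib.pyplot as plt

    flexibility = np.linspace(0.5, 10, 200)        # arbitrary flexibility scale
    bias_sq = 1.0 / flexibility                    # squared bias: decreasing
    variance = 0.05 * flexibility ** 1.5           # variance: increasing
    irreducible = np.full_like(flexibility, 0.5)   # Var(eps): constant
    test_error = bias_sq + variance + irreducible  # U-shaped sum of the three
    train_error = 1.2 / flexibility                # keeps decreasing, below test error

    plt.plot(flexibility, bias_sq, label='squared bias')
    plt.plot(flexibility, variance, label='variance')
    plt.plot(flexibility, train_error, label='training error')
    plt.plot(flexibility, test_error, label='test error')
    plt.plot(flexibility, irreducible, linestyle='--', label='irreducible error')
    plt.xlabel('Flexibility')
    plt.ylabel('Error')
    plt.legend()
    plt.show()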

  4. You will now think of some real-life applications for statistical learning.

    (a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

    Credit Scoring

    • Response: The creditworthiness of an individual, typically categorized as 'High Risk', 'Medium Risk', or 'Low Risk'.

    • Predictors: Variables such as credit history, income, employment status, debt-to-income ratio, and existing loans.

    • Goal: prediction. The lender mainly wants to predict the risk class of a new applicant rather than interpret individual coefficients.

    Medical Diagnosis

    • Response: The presence or absence of a disease, such as 'Disease X' or 'No Disease X'.

    • Predictors: Symptoms, patient history, test results, age, family history, and genetic information.

    • Goal: Both inference and prediction. We want to predict whether a patient has the disease, but also to understand which symptoms and risk factors drive the diagnosis.

    Spam Email Detection

    • Response: Classification of an email as 'Spam' or 'Not Spam'.

    • Predictors: The content of the email, sender information, the presence of certain keywords, the structure of the email, and the use of certain phrases.

    • Goal: prediction. What matters is filtering spam accurately, not explaining why a given feature indicates spam.

    (b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

    Economic Forecasting

    • Response: Economic indicators such as GDP growth rate, unemployment rate, or inflation rate.

    • Predictors: Variables like consumer spending, government spending, investment levels, interest rates, and global economic conditions.

    • Goal: prediction. Forecasters mainly want accurate future values of the indicators.

    Real Estate Pricing

    • Response: The sale price of a property.

    • Predictors: Square footage, number of bedrooms and bathrooms, location, age of the property, local amenities, and market conditions.

    • Goal: Both inference and prediction. Buyers want predicted prices, while sellers and analysts also want to know which features drive the price.

    Educational Research

    • Response: Student performance, often measured by standardized test scores.

    • Predictors: Variables such as socio-economic status, quality of education, student attendance, teacher qualifications, and classroom size.

    • Goal: prediction (although inference about which factors affect performance is often also of interest).

    (c) Describe three real-life applications in which cluster analysis might be useful.

    • microarray or gene expression data - samples with similar patterns.

    • microbial communities - samples with similar functional pathways.

    • people with similar behaviours in financial transaction data.

  5. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

    Advantages of a Very Flexible Approach

    • Complexity Capture: Flexible models can capture complex, non-linear relationships and interactions between variables.

    • Accuracy: They often achieve higher accuracy on the training data due to their ability to fit the data closely.

    • Adaptability: They can adapt to a wide range of data distributions and are less constrained by assumptions about the data.

    • Discovery: They can reveal underlying patterns and structures in the data that might not be apparent with simpler models.

    Disadvantages of a Very Flexible Approach

    • Overfitting: There is a high risk of overfitting, especially with small or noisy datasets, where the model learns the noise in the training data.

    • Interpretability: Flexible models can become black boxes, making it difficult to understand the influence of individual predictors.

    • Computational Cost: They often require more computational resources and time for training and prediction.

    • Sensitivity to Data Changes: Highly flexible models may be sensitive to small changes in the data, leading to less stable predictions.

    Advantages of a Less Flexible Approach

    • Interpretability: Simpler models are usually easier to understand and explain, which is important for decision-making.

    • Robustness: They tend to be more robust to small variations in the data and can generalize better to new, unseen data.

    • Computational Efficiency: Less flexible models are typically faster to train and make predictions.

    • Stability: They are less sensitive to changes in the training data, providing more stable estimates.

    Disadvantages of a Less Flexible Approach

    • Missed Complexity: They may not capture all the complexities of the data, leading to underfitting.

    • Limited Representation: They may be too constrained by their simplicity to accurately represent the data's relationships.

    • Poor Fit: In cases where the true relationship is complex, a less flexible model may not fit the data well, leading to lower accuracy.

    When to Prefer a More Flexible Approach

    • When the data is complex and exhibits non-linear relationships.

    • When the dataset is large enough to support the complexity of the model without overfitting.

    • When the goal is to discover underlying patterns in the data.

    • When interpretability and model simplicity are less of a concern.

    When to Prefer a Less Flexible Approach

    • When the data is simple or linear relationships are sufficient to describe it.

    • When the dataset is small, and a simpler model is less likely to overfit.

    • When interpretability and understanding the impact of individual variables are important.

    • When computational efficiency and model stability are priorities.

  6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)? What are its disadvantages?

    Differences between a parametric and a non-parametric statistical learning approach.

    • Parametric methods assume a specific functional form for f (for example, that it is linear), which reduces the problem to estimating a fixed set of parameters.

    • Non-parametric methods make no explicit assumption about the functional form of f when estimating it from the data.

    Advantage of parametric

    Needs far fewer observations than a non-parametric method, and the resulting model is simpler to estimate and easier to interpret.

    Disadvantage of parametric

    If the assumed functional form is far from the true f, the model will fit poorly no matter how much data is available.
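    A small comparison sketch of the two approaches, assuming scikit-learn is available (LinearRegression as the parametric method, KNeighborsRegressor as the non-parametric one; the sine-shaped f is an illustrative choice). When the true f is non-linear, the parametric model's linear assumption costs it accuracy:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(1)

    # True f is non-linear, so the linear (parametric) assumption is wrong
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(2 * X.ravel()) + rng.normal(scale=0.2, size=200)
    X_test = np.linspace(-3, 3, 500).reshape(-1, 1)
    y_test = np.sin(2 * X_test.ravel())

    parametric = LinearRegression().fit(X, y)                      # assumes f is linear
    nonparametric = KNeighborsRegressor(n_neighbors=10).fit(X, y)  # no assumption on f

    print("linear model test MSE:", mean_squared_error(y_test, parametric.predict(X_test)))
    print("KNN (K=10) test MSE:  ", mean_squared_error(y_test, nonparametric.predict(X_test)))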

  7. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

    | Obs. | X1 | X2 | X3 | Y |
    |------|----|----|----|-------|
    | 1 | 0 | 3 | 0 | Red |
    | 2 | 2 | 0 | 0 | Red |
    | 3 | 0 | 1 | 3 | Red |
    | 4 | 0 | 1 | 2 | Green |
    | 5 | -1 | 0 | 1 | Green |
    | 6 | 1 | 1 | 1 | Red |

    Suppose we wish to use this data set to make a prediction for Y when X_1 = X_2 = X_3 = 0 using K-nearest neighbors.

    (a) Compute the Euclidean distance between each observation and the test point, X_1 = X_2 = X_3 = 0.

    The Euclidean distance is


    $$d=\sqrt{(x_{1,\mathrm{test}}-x_{1,\mathrm{obs}})^2+(x_{2,\mathrm{test}}-x_{2,\mathrm{obs}})^2+(x_{3,\mathrm{test}}-x_{3,\mathrm{obs}})^2}$$

    Since the test point has all coordinates equal to 0, this simplifies to


    $$d=\sqrt{x_{1,\mathrm{obs}}^2+x_{2,\mathrm{obs}}^2+x_{3,\mathrm{obs}}^2}$$
    | Obs. | X1 | X2 | X3 | Y | Euclidean distance |
    |------|----|----|----|-------|--------------------|
    | 1 | 0 | 3 | 0 | Red | $d_1=\sqrt{9}=3.000000$ |
    | 2 | 2 | 0 | 0 | Red | $d_2=\sqrt{4}=2.000000$ |
    | 3 | 0 | 1 | 3 | Red | $d_3=\sqrt{1+9}=\sqrt{10}\approx 3.162278$ |
    | 4 | 0 | 1 | 2 | Green | $d_4=\sqrt{1+4}=\sqrt{5}\approx 2.236068$ |
    | 5 | -1 | 0 | 1 | Green | $d_5=\sqrt{1+1}=\sqrt{2}\approx 1.414214$ |
    | 6 | 1 | 1 | 1 | Red | $d_6=\sqrt{1+1+1}=\sqrt{3}\approx 1.732051$ |

    (b) What is our prediction with K = 1? Why?

    With K = 1, our prediction is Green.

    Since KNN with K = 1 predicts the response Y from the single nearest neighbor, the prediction for Y when $X_1 = X_2 = X_3 = 0$ is the Y value of the nearest observation. Observation 5 is nearest ($d_5 = \sqrt{2} \approx 1.414$), and its Y value is "Green".

    (c) What is our prediction with K = 3? Why?

    With K = 3, our prediction is Red,

    because the three nearest neighbors are Observations 5 (Green, $d \approx 1.414$), 6 (Red, $d \approx 1.732$), and 2 (Red, $d = 2$), and the majority class among them is Red.
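    A minimal numpy sketch that reproduces the distances in the table and the K = 1 and K = 3 predictions (the observations are typed in from the exercise):

    import numpy as np
    from collections import Counter

    X = np.array([[0, 3, 0],
                  [2, 0, 0],
                  [0, 1, 3],
                  [0, 1, 2],
                  [-1, 0, 1],
                  [1, 1, 1]])
    y = np.array(['Red', 'Red', 'Red', 'Green', 'Green', 'Red'])
    test_point = np.zeros(3)

    # Euclidean distance from every observation to the test point
    distances = np.sqrt(((X - test_point) ** 2).sum(axis=1))
    order = np.argsort(distances)

    print(np.round(distances, 6))                                         # matches the table above
    print('K=1 prediction:', y[order[0]])                                 # observation 5 -> Green
    print('K=3 prediction:', Counter(y[order[:3]]).most_common(1)[0][0])  # Red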

    (d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?

    We would expect the best value of K to be small if the Bayes decision boundary is highly non-linear: a small K gives a more flexible classifier, while a large K averages over many neighbors and would smooth out the non-linear boundary.

Applied

Full details have been uploaded to GitHub: An-Introduction-to-Statistical-Learning-with-Python/Exercises/Chap-02 at main · Austinggg/An-Introduction-to-Statistical-Learning-with-Python (github.com)

Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Question-08

a) Use the `pd.read_csv()` function to read the data into `Python`. Call the loaded data `college`. Make sure that you have the directory set to the correct location for the data.

college = pd.read_csv('../Datasets/College.csv')

(b) Look at the data used in the notebook by creating and running a new cell with just the code `college` in it. You should notice that the first column is just the name of each university in a column named something like `Unnamed: 0`. We don't really want `pandas` to treat this as data. However, it may be handy to have these names for later. Try the following commands and similarly look at the resulting data frames:

# Read 'College.csv', treating the first column (index 0) as the index column;
# its contents become the row index of the DataFrame
college2 = pd.read_csv('../Datasets/College.csv', index_col=0)

# Rename the first column to 'College'
college3 = college.rename({'Unnamed: 0': 'College'}, axis=1)

# Set the 'College' column as the index
college3 = college3.set_index('College')

This has used the first column in the file as an index for the data frame. This means that pandas has given each row a name corresponding to the appropriate university. Now you should see that the first data column is Private. Note that the names of the colleges appear on the left of the table. We also introduced a new python object above: a *dictionary*, which is specified by `(key, value)` pairs. Keep your modified version of the data with the following:

college = college3
college.head() # Display the first 5 rows of the data frame

(c) Use the `describe()` method to produce a numerical summary of the variables in the data set.

college.describe()

(d) Use the `pd.plotting.scatter_matrix()` function to produce a scatterplot matrix of the first columns `[Top10perc, Apps, Enroll]`. Recall that you can reference a list `C` of columns of a data frame `A` using `A[C]`.

pd.plotting.scatter_matrix(college[['Top10perc', 'Apps', 'Enroll']])
# In a Jupyter notebook (and some IDEs) the figure is usually displayed automatically, without calling plt.show().
# In a plain script the figure is not shown automatically; call plt.show() to display it.

(e) Use the `boxplot()` method of `college` to produce side-by-side boxplots of `Outstate` versus `Private`.

college.boxplot(column='Outstate', by='Private', figsize=(8, 6))

(f)  Create a new qualitative variable, called `Elite`, by binning the `Top10perc` variable into two groups based on whether or not the proportion of students coming from the top 10% of their high school classes exceeds 50%.

college['Elite'] = pd.cut(college['Top10perc'], [0,50,100], labels=['No', 'Yes'])

Use the value_counts() method of college['Elite'] to see how many elite universities there are.

elite_counts = college['Elite'].value_counts()
print(elite_counts)

 Finally, use the boxplot() method again to produce side-by-side boxplots of Outstate versus Elite.

college.boxplot(column='Outstate', by='Elite', figsize=(8, 6))

(g) Use the `plot.hist()` method of `college` to produce some histograms with differing numbers of bins for a few of the quantitative variables. The command `plt.subplots(2, 2)` may be useful: it will divide the plot window into four regions so that four plots can be made simultaneously. By changing the arguments you can divide the screen up in other combinations.

fig, axs = plt.subplots(2, 2, figsize=(12, 10)) # create a 2x2 grid of subplots

# First subplot - 'Apps'
college['Apps'].plot.hist(ax=axs[0, 0], bins=20, color='blue')
axs[0, 0].set_title('Histogram of Apps')
axs[0, 0].set_xlabel('Number of Applications')
axs[0, 0].set_ylabel('Frequency')

# Second subplot - 'Accept'
college['Accept'].plot.hist(ax=axs[0, 1], bins=15, color='green')
axs[0, 1].set_title('Histogram of Accept')
axs[0, 1].set_xlabel('Number of Acceptances')
axs[0, 1].set_ylabel('Frequency')

# Third subplot - 'Enroll'
college['Enroll'].plot.hist(ax=axs[1, 0], bins=10, color='red')
axs[1, 0].set_title('Histogram of Enroll')
axs[1, 0].set_xlabel('Number of Enrollments')
axs[1, 0].set_ylabel('Frequency')

# Fourth subplot - 'Outstate'
college['Outstate'].plot.hist(ax=axs[1, 1], bins=25, color='purple')
axs[1, 1].set_title('Histogram of Outstate Tuition')
axs[1, 1].set_xlabel('Outstate Tuition')
axs[1, 1].set_ylabel('Frequency')

# Adjust the layout so the subplots do not overlap
plt.tight_layout()

# Display the figure
plt.show()

Question-09

This exercise involves the `Auto` data set studied in the lab. Make sure that the missing values have been removed from the data.

auto = pd.read_csv('../Datasets/Auto.csv') # read the dataset Auto.csv
print(auto.isnull().sum()) # check for missing values (in Auto.csv missing values are coded as '?', so they are only caught after pd.to_numeric below)
auto = auto.dropna() # remove missing values
auto.head()

(a) Which of the predictors are quantitative, and which are qualitative?

| Predictor | Type | Description |
|-----------|------|-------------|
| mpg | Quantitative | Miles per gallon |
| cylinders | Qualitative | Number of cylinders |
| displacement | Quantitative | Engine displacement (cubic inches) |
| horsepower | Quantitative | Engine horsepower |
| weight | Quantitative | Vehicle weight (lbs.) |
| acceleration | Quantitative | Time to accelerate from 0 to 60 mph (seconds) |
| year | Quantitative | Model year |
| origin | Qualitative | Origin of the vehicle (1 = American, 2 = European, 3 = Japanese) |
| name | Qualitative | Name of the vehicle |

# List of quantitative variables (cylinders is kept here so its summary statistics can still be computed below)
quantitative_vars = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
# Extract the subset of quantitative variables
quantitative_data = auto[quantitative_vars]

# Check the dtype of each column
print(quantitative_data.dtypes)
# Coerce every column to numeric; values that cannot be parsed (such as '?') become NaN
quantitative_data = quantitative_data.apply(pd.to_numeric, errors='coerce')
# Check the dtypes again
print(quantitative_data.dtypes)

(b) What is the range of each quantitative predictor? You can answer this using the `min()` and `max()` methods in `numpy`.

# Compute the range of each quantitative variable
stats = pd.DataFrame({
    'Range': quantitative_data.max() - quantitative_data.min(),
})

# Print the table of statistics
print(stats)

(c) What is the mean and standard deviation of each quantitative predictor?

# Compute the mean and standard deviation of each quantitative variable
stats = pd.DataFrame({
    'Mean': quantitative_data.mean(),
    'Standard Deviation': quantitative_data.std()
})

# Print the table of statistics
print(stats)

(d) Now remove the 10th through 85th observations. What is the range, mean, and standard deviation of each predictor in the subset of the data that remains?

# Remove the 10th through 85th observations (indexing starts at 0, so this drops rows with index 9 through 84)
auto_subset = auto.drop(range(9, 85))
auto_subset_data = auto_subset[quantitative_vars]

# Coerce every column to numeric; unparseable values become NaN
auto_subset_data = auto_subset_data.apply(pd.to_numeric, errors='coerce')

# Build a DataFrame with the range, mean and standard deviation of each predictor
stats_sub = pd.DataFrame({
    'Range': auto_subset_data[quantitative_vars].max() - auto_subset_data[quantitative_vars].min(),
    'Mean': auto_subset_data[quantitative_vars].mean(),
    'Standard Deviation': auto_subset_data[quantitative_vars].std()
})

# Print the table of statistics
print(stats_sub)

(e) Using the full data set, investigate the predictors graphically, using scatterplots or other tools of your choice. Create some plots highlighting the relationships among the predictors. Comment on your findings.

Explore the relationships between the variables with pairwise scatterplots.

# Choose a few variables for the scatterplot matrix
selected_vars = ['mpg', 'horsepower', 'weight', 'acceleration']  # example variables

# Draw the scatterplot matrix
num_vars = len(selected_vars)
fig, axes = plt.subplots(nrows=num_vars, ncols=num_vars, figsize=(10, 10))

for i in range(num_vars):
    for j in range(num_vars):
        if i != j:
            axes[i, j].scatter(quantitative_data[selected_vars[j]], quantitative_data[selected_vars[i]], alpha=0.5)
        if i == 0:
            axes[i, j].set_title(selected_vars[j])
        if j == 0:
            axes[i, j].set_ylabel(selected_vars[i])

plt.tight_layout()
plt.show()

Examine the distributions of the individual variables with histograms.

# Choose a few variables for the histograms: 'mpg', 'weight', 'horsepower', 'acceleration'
selected_vars = ['mpg', 'weight', 'horsepower', 'acceleration']

# Create a 2x2 grid of subplots
fig, axs = plt.subplots(2, 2, figsize=(10, 8))

# Loop over the variables and draw a histogram for each
for i, var in enumerate(selected_vars):
    row = i // 2  # row index
    col = i % 2   # column index
    axs[row, col].hist(quantitative_data[var], bins=20, color='blue', alpha=0.7)
    axs[row, col].set_title(f'Histogram of {var}')
    axs[row, col].set_xlabel(var)
    axs[row, col].set_ylabel('Frequency')

# Adjust the layout
plt.tight_layout()
plt.show()

Findings: mpg falls sharply and non-linearly as horsepower and weight increase, horsepower and weight are strongly positively correlated with each other, and the distributions of mpg, horsepower and weight are all right-skewed.

(f) Suppose that we wish to predict gas mileage (`mpg`) on the basis of the other variables. Do your plots suggest that any of the other variables might be useful in predicting `mpg`? Justify your answer.

# Set the figure size
plt.figure(figsize=(10, 10))

# mpg vs horsepower
plt.subplot(2, 2, 1)
plt.scatter(quantitative_data['horsepower'], quantitative_data['mpg'])
plt.xlabel('Horsepower')
plt.ylabel('MPG')
plt.title('MPG vs Horsepower')

# mpg vs weight
plt.subplot(2, 2, 2)
plt.scatter(quantitative_data['weight'], quantitative_data['mpg'])
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.title('MPG vs Weight')

# mpg vs acceleration
plt.subplot(2, 2, 3)
plt.scatter(quantitative_data['acceleration'], quantitative_data['mpg'])
plt.xlabel('Acceleration')
plt.ylabel('MPG')
plt.title('MPG vs Acceleration')

# mpg vs displacement
plt.subplot(2, 2, 4)
plt.scatter(quantitative_data['displacement'], quantitative_data['mpg'])
plt.xlabel('Displacement')
plt.ylabel('MPG')
plt.title('MPG vs Displacement')

plt.tight_layout()
plt.show()

Yes: the scatterplots show clear, strong negative (and visibly non-linear) relationships between mpg and horsepower, weight, and displacement, so all three look useful for predicting mpg; the relationship between mpg and acceleration is much weaker.

Question-10

This exercise involves the `Boston` housing data set.

(a) To begin, load in the `Boston` data set.

# Load the data set
boston = pd.read_csv('../Datasets/boston.csv')

(b) How many rows are in this data set? How many columns? What do the rows and columns represent?

# Look at the first few rows
print(boston.head())
# Rename the first column to 'id'
boston = boston.rename({'Unnamed: 0': 'id'}, axis=1)

# Set the 'id' column as the index
boston = boston.set_index('id')
  • Rows: Each row represents an observation of a specific residential area.
  • Columns: Each column represents a specific attribute or feature of that residential area, such as crime rate, average number of rooms, housing price, etc.

| Column | Description |
|--------|-------------|
| crim | Per capita crime rate by town |
| zn | Proportion of residential land zoned for lots over 25,000 sq. ft. |
| indus | Proportion of non-retail business acres per town |
| chas | Charles River dummy variable (1 if tract bounds the river; 0 otherwise) |
| nox | Nitric oxides concentration (parts per 10 million) |
| rm | Average number of rooms per dwelling |
| age | Proportion of owner-occupied units built prior to 1940 |
| dis | Weighted distances to five Boston employment centers |
| rad | Index of accessibility to radial highways |
| tax | Full-value property tax rate per $10,000 |
| ptratio | Pupil-teacher ratio by town |
| lstat | Percentage of the population with lower socioeconomic status |
| medv | Median value of owner-occupied homes (in $1000s) |

# Number of rows and columns in the data set
rows, cols = boston.shape
print(f"The data set has {rows} rows and {cols} columns.")

(c) Make some pairwise scatterplots of the predictors (columns) in this data set. Describe your findings.

# Draw pairwise scatterplots of a subset of the predictors
sns.pairplot(boston[['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age']])
plt.show()

(d) Are any of the predictors associated with per capita crime rate? If so, explain the relationship.

# Compute the correlation matrix and extract the correlations with crim
correlations = boston.corr()['crim'].sort_values(ascending=False)
print(correlations)

# Scatterplots of crim against every other variable
for column in boston.columns:
    if column != 'crim':
        plt.figure(figsize=(5, 4))
        plt.scatter(boston[column], boston['crim'])
        plt.title(f'crim vs {column}')
        plt.xlabel(column)
        plt.ylabel('crim')
        plt.show()

In these plots, crim shows its strongest positive associations with rad and tax (high crime rates concentrate in areas with good highway access and high property tax rates), and it tends to fall as dis and medv increase.

(e) Do any of the suburbs of Boston appear to have particularly high crime rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.

# Use the 75th percentile as the threshold
crime_threshold = boston['crim'].quantile(0.75)
tax_threshold = boston['tax'].quantile(0.75)
ptratio_threshold = boston['ptratio'].quantile(0.75)

high_crime_suburbs = boston[boston['crim'] > crime_threshold]
high_tax_suburbs = boston[boston['tax'] > tax_threshold]
high_ptratio_suburbs = boston[boston['ptratio'] > ptratio_threshold]

print(f"Suburbs with crime rates above the 75th percentile:\n{high_crime_suburbs}\n")
print(f"Suburbs with tax rates above the 75th percentile:\n{high_tax_suburbs}\n")
print(f"Suburbs with pupil-teacher ratios above the 75th percentile:\n{high_ptratio_suburbs}\n")

(f) How many of the suburbs in this data set bound the Charles river?

charles_river_bound = boston[boston['chas'] == 1].shape[0]
print(f"Number of suburbs bounding the Charles River: {charles_river_bound}")

(g) What is the median pupil-teacher ratio among the towns in this data set?

# Median pupil-teacher ratio across the towns
median_ptratio = boston['ptratio'].median()
print(f"The median pupil-teacher ratio is: {median_ptratio}")

(h) Which suburb of Boston has the lowest median value of owner-occupied homes? What are the values of the other predictors for that suburb, and how do those values compare to the overall ranges for those predictors? Comment on your findings.

lowest_medv_suburb = boston.loc[boston['medv'].idxmin()]
print(f"Suburb with the lowest median value of owner-occupied homes:\n{lowest_medv_suburb}")

# Compare against the overall range of each predictor
range_comparison = boston.describe().loc[['min', 'max']]
print(f"Comparison with the overall ranges:\n{range_comparison}")

(i) In this data set, how many of the suburbs average more than seven rooms per dwelling? More than eight rooms per dwelling? Comment on the suburbs that average more than eight rooms per dwelling.

more_than_7_rooms = boston[boston['rm'] > 7].shape[0]
more_than_8_rooms = boston[boston['rm'] > 8].shape[0]

print(f"Number of suburbs averaging more than 7 rooms per dwelling: {more_than_7_rooms}")
print(f"Number of suburbs averaging more than 8 rooms per dwelling: {more_than_8_rooms}")

The suburbs averaging more than eight rooms per dwelling are a small group, and they generally combine high median home values (medv) with a low percentage of lower-status residents (lstat).

If you find any mistakes or inaccuracies, please point them out in the comments. If this post helped you, a like or a share would be much appreciated 😀
