Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Also, suppose that a hospital measured the age and body fat of selected sample subjects whose ages appear in the list above and obtained the following results:
age | 13 | 15 | 16 | 19 | 20 | 21 | 22 | 25 | 30 | 33 | 35 | 36 | 40 | 45 | 46 | 52 | 70 |
%fat | 9.5 | 26.5 | 7.8 | 17.8 | 31.4 | 25.9 | 27.4 | 27.2 | 31.2 | 34.6 | 42.5 | 28.8 | 33.4 | 30.2 | 34.1 | 32.9 | 41.2 |
Perform the following activities in Python and answer questions:
(a) Save the above data in a CSV file.
(b) Read data in the CSV file into variables in Python.
(c) What is the mean, median, and standard deviation of age and fat?
(d) What is the mode of age? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
(e) What is the range of age?
(f) What are the first quartile (Q1) and the third quartile (Q3) of age?
(g) Give the five-number summary of age and fat.
(h) Draw the boxplots for age and fat.
(i) Show the histograms of age and fat.
(j) Draw a scatter plot based on the two variables.
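Parts (d) and (e) can be checked by hand before running the full script. The following is a minimal sketch that uses only Python's standard library and the 26 age values listed above. The variable names are illustrative, and `statistics.multimode` assumes Python 3.8 or later.

```python
from statistics import multimode

# The 26 age values from the problem statement, already in increasing order.
ages = [13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# multimode returns every value that is tied for the highest frequency.
modes = multimode(ages)

# The range is the difference between the largest and the smallest value.
age_range = max(ages) - min(ages)

print("modes:", modes)
print("range:", age_range)
```

If `modes` holds two values, the data is bimodal; three values would make it trimodal.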
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
age = [13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2]
dict4DF = {"age": age, "fat": fat}
pd_data = pd.DataFrame(dict4DF)
# (a) Save the above data in a CSV file.
# DataFrame.to_csv writes the "age,fat" header row itself; index=False omits the row index.
pd_data.to_csv("mydata.csv", index=False)
print("(a) Save the above data in a CSV file: see mydata.csv")
print("________________________________________")
# (b) Read data in the CSV file into variables in Python.
mydata1 = pd.read_csv("mydata.csv")
# (c) What is the mean, median, and standard deviation of age and %fat?
# Note: np.std computes the population standard deviation (ddof=0) by default;
# pass ddof=1 if the sample standard deviation is wanted.
mean_age = np.mean(mydata1['age'])
median_age = np.median(mydata1['age'])
std_age = np.std(mydata1['age'])
mean_fat = np.mean(mydata1['fat'])
median_fat = np.median(mydata1['fat'])
std_fat = np.std(mydata1['fat'])
print("(c) What is the mean, median, and standard deviation of age and %fat?")
print("mean:{}, median:{}, and standard deviation:{} of age".format(mean_age, median_age, std_age))
print("mean:{}, median:{}, and standard deviation:{} of fat".format(mean_fat, median_fat, std_fat))
print("________________________________________")
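# (d) What is the mode of the age? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).
# The full age list is bimodal: 25 and 35 each occur four times.
from collections import Counter  # counts how often each value occurs
data1 = [13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
data2 = Counter(data1)
highest_count = max(data2.values())
# Keep every value whose count ties for the maximum.
modes = sorted(value for value, count in data2.items() if count == highest_count)
print("(d) What is the mode of the age? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).")
print("The most frequent values are {} ({} occurrences each), so the data is bimodal".format(modes, highest_count))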
print("________________________________________")
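# (e) What is the range of the age?
# The range is max - min; the name age_range avoids shadowing the built-in range().
age_min, age_max = min(mydata1['age']), max(mydata1['age'])
age_range = age_max - age_min
print("(e) the range of the age:{} (min {}, max {})".format(age_range, age_min, age_max))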
print("________________________________________")
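# (f) What are the first quartile (Q1) and the third quartile (Q3) of the age?
# Note: these are computed on the 17 sampled ages read from the CSV, not on the full 26-value list.
from numpy import percentile  # a percentile is the value below which the given percentage of observations falls
quartiles = percentile(mydata1['age'], [0, 25, 50, 75, 100])
"""
Quartiles are the three cut points that divide the values, sorted in increasing order, into four equal parts.
"""
Q1 = quartiles[1]
Q3 = quartiles[3]
print("(f) the first quartile (Q1):{} and the third quartile (Q3):{} of the age.".format(Q1, Q3))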
print("________________________________________")
# (g) Give the five-number summary of the age and the %fat.
age_five_number_summary = percentile(mydata1['age'], [0, 25, 50, 75, 100])
fat_five_number_summary = percentile(mydata1['fat'], [0, 25, 50, 75, 100])
print("(g) age_five_number_summary:{} , fat_five_number_summary:{}".format(age_five_number_summary,
fat_five_number_summary))
print("________________________________________")
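# (h) Draw the boxplots for age and %fat.
# Each boxplot gets its own figure; otherwise the second file would contain both plots.
plt.figure()
plt.boxplot(mydata1['age'])
plt.savefig("boxplots for age.png")
plt.close()
plt.figure()
plt.boxplot(mydata1['fat'])
plt.savefig("boxplots for fat.png")
plt.close()
"""
A boxplot is a statistical chart that summarizes the distribution of the data; it also gives a rough view of whether the data is symmetric and how widely it is spread.
"""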
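# (i) Show the histograms of age and %fat.
# plt.savefig saves only the current figure, so save each histogram right after drawing it.
age_hist = pd_data.hist(column='age')
plt.savefig("histograms of age.png")
plt.close()
fat_hist = pd_data.hist(column='fat')
plt.savefig("histograms of fat.png")
plt.close()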
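# (j) Draw a scatter plot based on the two variables.
plt.figure()  # start a new figure so earlier plots are not drawn over
# plt.scatter draws unconnected points; plt.plot would join them with a line.
plt.scatter(pd_data['age'], pd_data['fat'])
plt.title('Scatterplot of age vs. fat')
plt.xlabel('age')
plt.ylabel('fat')
plt.savefig("Scatterplot of age vs. fat.png")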
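The answer to part (c) depends on which standard deviation is meant. `np.std` uses the population formula (ddof=0) by default, whereas pandas' `Series.std` defaults to the sample formula (ddof=1). The following is a minimal sketch of the difference, using only the standard library and the 17 ages from the table; the variable names are illustrative.

```python
import math
from statistics import pstdev, stdev

# The 17 sampled ages from the age/%fat table.
ages = [13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70]
n = len(ages)

# Population standard deviation: divides the squared deviations by n.
population_sd = pstdev(ages)

# Sample standard deviation: divides the squared deviations by n - 1.
sample_sd = stdev(ages)

# The two differ by the constant factor sqrt(n / (n - 1)).
ratio = sample_sd / population_sd

print("population sd:", population_sd)
print("sample sd:", sample_sd)
print("ratio:", ratio, "expected:", math.sqrt(n / (n - 1)))
```

With only 17 observations the two values differ noticeably, so it is worth stating which one is reported.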