Machine learning and data mining (1)

Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Also, suppose that a hospital measured the age and body fat percentage of selected sample subjects with the above ages and obtained the following results:

age   13    15    16    19    20    21    22    25    30    33    35    36    40    45    46    52    70
%fat  9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2  34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2

Perform the following activities in Python and answer questions:

(a) Save the above data in a CSV file.

(b) Read data in the CSV file into variables in Python.

(c) What are the mean, median, and standard deviation of age and fat?

(d) What is the mode of age? Comment on the data’s modality (i.e., bimodal, trimodal, etc.).

(e) What is the range of age?

(f) What are the first quartile (Q1) and the third quartile (Q3) of age?

(g) Give the five-number summary of age and fat.

(h) Draw the boxplots for age and fat.

(i) Show the histograms of age and fat.

(j) Draw a scatter plot based on the two variables.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

age = [13, 15, 16, 19, 20, 21, 22, 25, 30, 33, 35, 36, 40, 45, 46, 52, 70]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2]
dict4DF = {"age": age, "fat": fat}
pd_data = pd.DataFrame(dict4DF)
# (a) Save the above data in a CSV file.
# DataFrame.to_csv writes the header row itself; index=False leaves out the row index.
pd_data.to_csv("mydata.csv", index=False)
print("(a) Save the above data in a CSV file:", "see mydata.csv")
print("________________________________________")
# (b)Read data in the CSV file into variables in Python.
mydata1 = pd.read_csv("mydata.csv")

# (c) What are the mean, median, and standard deviation of age and %fat?
# Note: np.std defaults to ddof=0, i.e. the population standard deviation.
mean_age = np.mean(mydata1['age'])
median_age = np.median(mydata1['age'])
std_age = np.std(mydata1['age'])

mean_fat = np.mean(mydata1['fat'])
median_fat = np.median(mydata1['fat'])
std_fat = np.std(mydata1['fat'])
print("(c) What are the mean, median, and standard deviation of age and %fat?")
print("mean:{}, median:{}, and standard deviation:{} of age".format(mean_age, median_age, std_age))
print("mean:{}, median:{}, and standard deviation:{} of fat".format(mean_fat, median_fat, std_fat))
print("________________________________________")
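# (d) What is the mode of age? Comment on the data's modality (i.e., bimodal, trimodal, etc.).
# This part uses the full list of 26 age values from the problem statement.
from collections import Counter  # counts how often each value occurs

data1 = [13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
data2 = Counter(data1)
# Collect every value that is tied for the highest count.
max_count = max(data2.values())
modes = [value for value, count in data2.items() if count == max_count]
print("(d) What is the mode of age? Comment on the data's modality (i.e., bimodal, trimodal, etc.).")
print("The most frequent values are {} ({} occurrences each), so the data is bimodal".format(modes, max_count))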
print("________________________________________")

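# (e) What is the range of age?
# The range is the difference between the largest and smallest values.
# A separate name avoids shadowing the built-in range().
age_range = max(mydata1['age']) - min(mydata1['age'])
print("(e) the range of age: {}".format(age_range))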
print("________________________________________")
# (f) What are the first quartile (Q1) and the third quartile (Q3) of age?
from numpy import percentile  # a percentile is the value below which a given percentage of observations fall

quartiles = percentile(mydata1['age'], [0, 25, 50, 75, 100])
# Quartiles are the three cut points that split the sorted values into four equal parts.
Q1 = quartiles[1]
Q3 = quartiles[3]
print("(f) the first quartile (Q1):{} and the third quartile (Q3):{} of age.".format(Q1, Q3))
print("________________________________________")
# (g)	Give the five-number summary of the age and the %fat.
age_five_number_summary = percentile(mydata1['age'], [0, 25, 50, 75, 100])
fat_five_number_summary = percentile(mydata1['fat'], [0, 25, 50, 75, 100])
print("(g)  age_five_number_summary:{} , fat_five_number_summary:{}".format(age_five_number_summary,
                                                                            fat_five_number_summary))
print("________________________________________")
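# (h) Draw the boxplots for age and %fat.
# A boxplot summarizes the distribution of the data and gives a rough view of
# whether it is symmetric and how spread out it is.
# Each boxplot gets its own figure so the second is not drawn over the first.
plt.figure()
plt.boxplot(mydata1['age'])
plt.savefig("boxplots for age.png")
plt.figure()
plt.boxplot(mydata1['fat'])
plt.savefig("boxplots for fat.png")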
# (i) Show the histograms of age and %fat.
# DataFrame.hist opens a new figure on each call, so save each one before creating the next.
age_hist = pd_data.hist(column='age')
plt.savefig("histograms of age.png")
fat_hist = pd_data.hist(column='fat')
plt.savefig("histograms of fat.png")
# (j) Draw a scatter plot based on the two variables.
# plt.scatter draws unconnected points; plt.plot would join them with a line.
plt.figure()
plt.scatter(pd_data['age'], pd_data['fat'])
plt.title('Scatterplot of age vs. fat')
plt.xlabel('age')
plt.ylabel('fat')
plt.savefig("Scatterplot of age vs. fat.png")
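
The script above computes most of the age statistics from the 17 distinct ages stored in the CSV file, while part (d) uses the full list of 26 values. As a cross-check, here is a minimal sketch that recomputes the age summary on the full list using only Python's standard `statistics` module (Python 3.8 or later). The variable names are illustrative and not part of the original solution. The `inclusive` quantile method is chosen on the assumption that it corresponds to the linear interpolation that NumPy's `percentile` applies by default.

```python
import statistics

# Full list of 26 age values from the problem statement.
ages = [13, 15, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
modes = statistics.multimode(ages)  # every value tied for the highest count
age_range = max(ages) - min(ages)   # largest value minus smallest value

# Population (divide by n) versus sample (divide by n - 1) standard deviation.
pop_std = statistics.pstdev(ages)
sample_std = statistics.stdev(ages)

# Three quartile cut points; "inclusive" treats the data as containing
# its own minimum and maximum (an assumption made for this sketch).
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
five_number_summary = (min(ages), q1, q2, q3, max(ages))

print("mean:", mean_age, "median:", median_age)
print("modes:", modes, "range:", age_range)
print("population std:", pop_std, "sample std:", sample_std)
print("five-number summary:", five_number_summary)
```

On this list the two modes are 25 and 35, which is consistent with the bimodal comment in part (d). The other figures can differ from the script's output, because the script works on the 17 values in the CSV file.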
