Reading notes on "Hands-On Machine Learning with Scikit-Learn and TensorFlow": Chapter 1, The Machine Learning Landscape

Latest recommended article published 2022-04-08 01:29:14

_helen_520

Article link: https://blog.csdn.net/haronchou/article/details/114696363

Chapter 1: The Machine Learning Landscape

Contents:

01 Hands-on code

OECD data: 3292 × 17, i.e. 3292 rows and 17 columns.

Keep only the rows where INEQUALITY is TOT. Some of the other rows cover only women or only men, so mixing them in could bias the statistics. This leaves 888 × 17.
oecd_bli=oecd_bli.pivot(index="Country",columns="Indicator",values="Value")

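After this pivot the table is 37 × 24: 37 countries as rows and 24 indicators as columns (educational attainment, rooms per person, and so on). Each cell holds the Value of that indicator for that country.
Keeping the part where the indicators have values: 37 × 24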
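
To make the filter-then-pivot step concrete, here is a minimal sketch on a made-up long-format table. The country labels and numbers are invented for illustration and are not taken from the OECD file; only the column names mirror the real data.

```python
import pandas as pd

# Toy long-format table; values are invented, only the column names mirror the OECD file.
toy = pd.DataFrame({
    "Country":    ["A", "A", "A", "B", "B", "B"],
    "INEQUALITY": ["TOT", "TOT", "MN", "TOT", "TOT", "WMN"],
    "Indicator":  ["Life satisfaction", "Rooms per person", "Life satisfaction",
                   "Life satisfaction", "Rooms per person", "Life satisfaction"],
    "Value":      [7.0, 2.0, 6.9, 5.5, 1.2, 5.6],
})

# Keep only the totals, then turn each Indicator into its own column.
totals = toy[toy["INEQUALITY"] == "TOT"]
wide = totals.pivot(index="Country", columns="Indicator", values="Value")

print(wide.shape)  # (2, 2): one row per country, one column per indicator
```

In this toy table the filter is also what makes the pivot possible: without it, the pair (A, Life satisfaction) would appear twice and `pivot` would raise a ValueError about duplicate entries.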

GDP data: 190 × 7.

Indexed by Country, keeping the rows that have GDP data: 190 × 6 (Country moves from a column into the index).

# OECD life satisfaction data combined with IMF GDP data
def prepare_country_stats(oecd_bli, gdp_per_capita):
    # Keep only the TOT rows; the rows for women only or men only are dropped
    oecd_bli = oecd_bli[oecd_bli["INEQUALITY"]=="TOT"]
    # Index by Country, one column per Indicator, cells filled from Value
    oecd_bli = oecd_bli.pivot(index="Country", columns="Indicator", values="Value")
    # Rename the "2015" column to "GDP per capita"; inplace=True modifies the original DataFrame
    gdp_per_capita.rename(columns={"2015": "GDP per capita"}, inplace=True)
    # Use Country as the index
    gdp_per_capita.set_index("Country", inplace=True)
    # Match the countries in the OECD table with those in the GDP table
    full_country_stats = pd.merge(left=oecd_bli, right=gdp_per_capita,
                                  left_index=True, right_index=True)
    # Sort by GDP per capita
    full_country_stats.sort_values(by="GDP per capita", inplace=True)
    # Deliberately drop a few rows by position (some unusual data points) before fitting
    remove_indices = [0, 1, 6, 8, 33, 34, 35]
    keep_indices = list(set(range(36)) - set(remove_indices))
    # Return only the GDP per capita and Life satisfaction columns
    return full_country_stats[["GDP per capita", 'Life satisfaction']].iloc[keep_indices]
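
The merge and row-selection steps in this function can be tried on toy data. The sketch below uses invented country labels and numbers. It shows that merging on the index keeps only the countries present in both tables (`pd.merge` defaults to an inner join) and that `iloc` selects by position after sorting.

```python
import pandas as pd

# Invented values; the indexes play the role of the Country index in the real tables.
satisfaction = pd.DataFrame({"Life satisfaction": [6.0, 7.0, 5.0]},
                            index=["A", "B", "C"])
gdp = pd.DataFrame({"GDP per capita": [30000.0, 10000.0, 20000.0]},
                   index=["B", "C", "D"])

# Inner join on the index: only B and C appear in both tables.
merged = pd.merge(left=satisfaction, right=gdp, left_index=True, right_index=True)
merged = merged.sort_values(by="GDP per capita")

# Drop rows by position, the same way remove_indices / keep_indices are used.
remove_indices = [0]
keep_indices = sorted(set(range(len(merged))) - set(remove_indices))
subset = merged.iloc[keep_indices]

print(list(merged.index))  # ['C', 'B']
print(list(subset.index))  # ['B']
```

The sketch sorts `keep_indices` explicitly so that the positional order does not depend on how a set happens to iterate.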


import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Path to the data files
datapath = os.path.join("datasets", "lifesat", "")

# Load the data
oecd_bli = pd.read_csv(datapath + "oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv(datapath + "gdp_per_capita.csv", thousands=',', delimiter='\t', encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new)) # outputs [[ 5.96242338]]
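
As a rough check of what fitting and predicting amount to for a linear model, the sketch below fits scikit-learn's `LinearRegression` to synthetic points that lie exactly on a line and reads back the learned parameters. The data, slope and intercept are invented for illustration; `intercept_` and `coef_` are the scikit-learn attributes that hold the fitted parameters.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data on the line y = 4.0 + 0.5 * x (invented, not the GDP data).
X_toy = np.array([[0.0], [2.0], [4.0], [6.0]])
y_toy = 4.0 + 0.5 * X_toy

model_toy = LinearRegression()
model_toy.fit(X_toy, y_toy)

theta0 = model_toy.intercept_[0]   # fitted intercept
theta1 = model_toy.coef_[0][0]     # fitted slope

# A prediction is just theta0 + theta1 * x.
x_new = 10.0
manual = theta0 + theta1 * x_new
predicted = model_toy.predict([[x_new]])[0][0]
print(np.isclose(manual, predicted))  # True
```

The same reading applies to the Cyprus prediction: the printed value is the fitted intercept plus the fitted slope times 22587.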

The OECD life satisfaction data looks like this:

LOCATION is the country code; Country is the full country name.
The Indicator column takes 24 values: student skills, self-reported health, personal earnings, voter turnout, employment rate, long-term unemployment rate, household net adjusted disposable income, life satisfaction, quality of support network, time devoted to leisure and personal care, assault rate, educational attainment, homicide rate, employees working very long hours, job security, water quality, life expectancy, years in education, household net financial wealth, housing expenditure, air pollution, dwelling without basic facilities, rooms per person, consultation on rule-making.