Datacamp 笔记&代码 Supervised Learning with scikit-learn 第二章 Regression

最新推荐文章于 2023-11-01 22:44:20 发布

JinnyR

最新推荐文章于 2023-11-01 22:44:20 发布

阅读量2.4k

点赞数

分类专栏： datacamp 文章标签： datacamp python sklearn machine learning

本文链接：https://blog.csdn.net/u011292816/article/details/97046882

版权

这篇博客通过Datacamp课程讲解了使用scikit-learn进行监督学习中的回归问题，特别是针对Gapminder数据集。内容包括数据导入、数据探索、线性回归、训练测试拆分、交叉验证、正则化（Lasso和Ridge）等步骤。通过数据可视化和模型评估指标如R2分数和RMSE来分析模型性能。

摘要由CSDN通过智能技术生成

更多原始数据文档和JupyterNotebook
Github: https://github.com/JinnyR/Datacamp_DataScienceTrack_Python

Datacamp track: Data Scientist with Python - Course 21 (2)

Exercise

Importing data for supervised learning

In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country’s GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy’s .reshape() method. Don’t worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.

Instruction

Import numpy and pandas as their standard aliases.
Read the file 'gapminder.csv' into a DataFrame dfusing the read_csv() function.
Create array X for the 'fertility' feature and array y for the 'life' target variable.
Reshape the arrays by using the .reshape() method and passing in -1 and 1.

# Modified/Added by Jinny

fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_2433/datasets/gapminder-clean.csv'
from urllib.request import urlretrieve
urlretrieve(fn, 'gapminder.csv')

('gapminder.csv', <http.client.HTTPMessage at 0x120c05b00>)

# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1, 1)
X = X.reshape(-1, 1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))

Dimensions of y before reshaping: (139,)
Dimensions of X before reshaping: (139,)
Dimensions of y after reshaping: (139, 1)
Dimensions of X after reshaping: (139, 1)

Exercise

Exploring the Gapminder data

As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), .head().

In case you are curious, the heatmap was generated using Seaborn’s heatmap function and the following line of code, where df.corr()computes the pairwise correlation between columns:

sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!

Instruction

Possible Answers
- The DataFrame has 139 samples (or rows) and 9 columns.
- life and fertility are negatively correlated.
- The mean of life is 69.602878
- fertility is of type int64.
- GDP and life are positively correlated.

# Modified/Added by Jinny

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from urllib.request import urlretrieve

fn = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_2433/datasets/gapminder-clean.csv'
urlretrieve(fn, 'gapminder.csv')
df = pd.read_csv('gapminder.csv')
sns.heatmap(df.corr(

最低0.47元/天解锁文章

JinnyR

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Datacamp 笔记&代码 Supervised Learning with scikit-learn 第二章 Regression

更多原始数据文档和JupyterNotebookGithub: https://github.com/JinnyR/Datacamp_DataScienceTrack_PythonDatacamp track: Data Scientist with Python - Course 21 (2)ExerciseImporting data for supervised learningI...
复制链接

扫一扫