Model Validation in Python
Scikit-learn is an open-source machine learning library that provides tools for building, training and testing models. Its model selection module (sklearn.model_selection) has many functions that are useful for model testing and validation. In this post, we will discuss some of the important model selection functions in scikit-learn.
Let’s get started!
For our purposes, we will be working with the Wine Reviews data set, which can be found here.
To start, let’s read our data into a Pandas data frame:
import pandas as pd
df = pd.read_csv("winemag-data-130k-v2.csv")
Next, let’s print the first five rows of data:
print(df.head())
![Image for post](https://i-blog.csdnimg.cn/blog_migrate/9b5fc724de3cc3287ea19f0ca1d6cbf3.png)
Let’s consider the task of predicting whether a wine costs more than $50 based on its variety, winery, country and review points. We can build a random forest classifier to perform this task. First, let’s convert the categorical features into integer category codes that a random forest can handle:
# Encode each categorical column as integer codes (missing values become -1)
df['country_cat'] = df['country'].astype('category').cat.codes
df['winery_cat'] = df['winery'].astype('category').cat.codes
df['variety_cat'] = df['variety'].astype('category').cat.codes
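To make the encoding step concrete, here is a minimal sketch on a toy data frame. The frame `toy` and its values are made up for illustration and are not part of the wine data; the point is only to show how pandas assigns category codes. Categories are sorted before codes are assigned, and a missing value gets the code -1 rather than its own category.

```python
import pandas as pd

# Toy stand-in for the wine reviews frame (values are made up for illustration).
toy = pd.DataFrame({"country": ["Italy", "US", None, "France", "US"]})

# Categories are sorted, so France -> 0, Italy -> 1, US -> 2.
country_cat = toy["country"].astype("category")
toy["country_cat"] = country_cat.cat.codes

# Map each code back to its label; the missing value has no category (code -1).
code_to_label = dict(enumerate(country_cat.cat.categories))

print(toy)
print(code_to_label)
```

Keeping a mapping like `code_to_label` around is optional, but it makes it easier to interpret the encoded columns later. Note that the codes are arbitrary integers, so their ordering carries no meaning about the underlying categories.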
Let’s also impute missing values. We won’t do any fancy imputing here but check out Predicting Missing Values with Python for a more reliable method of imputation. Here, let’s replace missing values with 0: