Learning Python from Scratch: Kaggle Machine Learning 002 Code Reference
1. Basic Data Exploration
Step 1: Loading Data
import pandas as pd
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
# Fill in the line below to read the file into a variable home_data
home_data = pd.read_csv(iowa_file_path)
# Call line below with no argument to check that you've loaded the data correctly
step_1.check()
Step 2: Review The Data
# Print summary statistics in next line
home_data.describe()
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(home_data['LotArea'].mean())
# As of today, how old is the newest home (current year - the year in which it was built)
newest_home_age = 2024 - home_data['YearBuilt'].max()
# Checks your answers
step_2.check()
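The two answers in this step can be sanity-checked on a tiny hand-made table. The sketch below uses a hypothetical three-row DataFrame (the values are invented for illustration and do not come from train.csv) and assumes 2024 as the current year, as in the code above.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative, not taken from train.csv
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11001],
    "YearBuilt": [1995, 2003, 2010],
})

# Mean lot size, rounded to the nearest integer
toy_avg_lot_size = round(toy["LotArea"].mean())

# Age of the newest home, assuming the current year is 2024
toy_newest_home_age = 2024 - toy["YearBuilt"].max()

print(toy_avg_lot_size, toy_newest_home_age)
```

Note that the age is computed from the construction year column, not from the year of sale.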
2. Selecting Data for Modeling
Step 1: Specify Prediction Target
# print the list of columns in the dataset to find the name of the prediction target
home_data.columns
y = home_data.SalePrice
# Check your answer
step_1.check()
Step 2: Create X
# Create the list of features below
feature_names = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']
# Select data corresponding to features in feature_names
X = home_data[feature_names]
# Check your answer
step_2.check()
# Review data
# print description or statistics from X
print(X.describe())
# print the top few lines
print(X.head())
Step 3: Specify and Fit Model
# from _ import _
from sklearn.tree import DecisionTreeRegressor
#specify the model.
#For model reproducibility, set a numeric value for random_state when specifying the model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit the model
iowa_model.fit(X,y)
# Check your answer
step_3.check()
Step 4: Make Predictions
predictions = iowa_model.predict(X)
print(predictions)
# Check your answer
step_4.check()
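Predictions made on the same rows the model was fitted on say little about how it will do on new homes. As a minimal sketch, assuming a hypothetical four-row dataset with invented values, an unrestricted decision tree simply reproduces the training targets:

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data; every row has distinct feature values
toy_X = pd.DataFrame({
    "LotArea": [5000, 7000, 9000, 12000],
    "YearBuilt": [1960, 1980, 2000, 2010],
})
toy_y = pd.Series([100000, 150000, 200000, 260000])

# No size limit, so the tree can grow until each leaf holds one training row
toy_model = DecisionTreeRegressor(random_state=1)
toy_model.fit(toy_X, toy_y)

# In-sample predictions: the same rows that were used for fitting
in_sample = toy_model.predict(toy_X)
print(in_sample)
```

This is why the next section holds out validation data before measuring error.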
3. Model Validation
Step 1: Split Your Data
# Import the train_test_split function and uncomment
# from _ import _
from sklearn.model_selection import train_test_split
# fill in and uncomment
# train_X, val_X, train_y, val_y = ____
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
# Check your answer
step_1.check()
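When neither `test_size` nor `train_size` is passed, `train_test_split` holds out 25% of the rows for validation. The sketch below shows this on a hypothetical eight-row dataset (the names `toy_X`, `toy_y` and the values are made up for illustration).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 8 rows
toy_X = pd.DataFrame({"LotArea": list(range(1000, 9000, 1000))})
toy_y = pd.Series(list(range(8)))

# Default split: 75% training rows, 25% validation rows
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=1)

print(len(tr_X), len(va_X))
# Features and targets stay aligned row by row after shuffling
print(list(tr_X.index) == list(tr_y.index))
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in each part on every run.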
Step 2: Specify and Fit the Model
# You imported DecisionTreeRegressor in your last exercise
# and that code has been copied to the setup code above. So, no need to
# import it again
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X,train_y)
# Check your answer
step_2.check()
Step 3: Make Predictions with Validation data
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)
# Check your answer
step_3.check()
Step 4: Calculate the Mean Absolute Error in Validation Data
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y,val_predictions)
# uncomment following line to see the validation_mae
# print(val_mae)
# Check your answer
step_4.check()
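For intuition, the mean absolute error is the average of the absolute differences between actual and predicted values. Here is a hand-rolled sketch in plain Python; `manual_mae` is a hypothetical helper and the prices are invented, so this illustrates the metric rather than replacing `mean_absolute_error`.

```python
def manual_mae(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)


# Hypothetical sale prices and predictions, each off by 10,000
toy_actual = [200000, 150000, 310000]
toy_predicted = [210000, 140000, 300000]

toy_mae = manual_mae(toy_actual, toy_predicted)
print(toy_mae)
```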
4. Underfitting and Overfitting
Step 1: Compare Different Tree Sizes
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
best_mae=[]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
for ideal_size in candidate_max_leaf_nodes:
    my_mae = get_mae(ideal_size, train_X, val_X, train_y, val_y)
    best_mae.append(my_mae)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (ideal_size, my_mae))
# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = candidate_max_leaf_nodes[best_mae.index(min(best_mae))]
# Check your answer
step_1.check()
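The loop above relies on a `get_mae` helper that the exercise setup provides and that is not shown in these notes. The sketch below is an assumption about what such a helper might look like, run on synthetic data (the data and the shorter candidate list are invented for illustration). It also shows a dictionary-based way to pick the best size.

```python
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Assumed shape of the helper: fit a size-limited tree, return validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    return mean_absolute_error(val_y, preds_val)


# Synthetic data for illustration only
toy_X = [[x] for x in range(200)]
toy_y = [3 * x + (x % 7) for x in range(200)]
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=1)

# Map each candidate size to its validation MAE, then take the smallest
scores = {size: get_mae(size, tr_X, va_X, tr_y, va_y) for size in [5, 25, 50]}
toy_best_size = min(scores, key=scores.get)
print(scores, toy_best_size)
```

Keeping the scores in a dictionary avoids the parallel-list bookkeeping of `best_mae.index(min(best_mae))`.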
Step 2: Fit Model Using All Data
# Fill in argument to make optimal size and uncomment
# final_model = DecisionTreeRegressor(____)
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
# fit the final model and uncomment the next two lines
# final_model.fit(____, ____)
final_model.fit(X, y)
# Check your answer
step_2.check()
5. Random Forests
Step 1: Use a Random Forest
from sklearn.ensemble import RandomForestRegressor
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
# fit your model
rf_model.fit(train_X,train_y)
# Calculate the mean absolute error of your Random Forest model on the validation data
rf_val_mae = mean_absolute_error(val_y, rf_model.predict(val_X))
print("Validation MAE for Random Forest Model: {}".format(rf_val_mae))
# Check your answer
step_1.check()
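A random forest regressor predicts by averaging the predictions of its individual trees. As a rough illustration on synthetic data (the data and the small `n_estimators=10` are choices made here, not part of the exercise), the fitted trees in `estimators_` can be averaged by hand and compared with the forest's own prediction:

```python
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only
toy_X = [[x, x % 5] for x in range(60)]
toy_y = [2.0 * x + (x % 5) for x in range(60)]

toy_forest = RandomForestRegressor(n_estimators=10, random_state=1)
toy_forest.fit(toy_X, toy_y)

sample = [[30, 0]]
forest_pred = toy_forest.predict(sample)[0]

# Average the individual trees' predictions by hand
tree_preds = [tree.predict(sample)[0] for tree in toy_forest.estimators_]
mean_tree_pred = sum(tree_preds) / len(tree_preds)

print(forest_pred, mean_tree_pred)
```

Averaging many trees trained on different bootstrap samples is what usually makes the forest less sensitive to overfitting than a single tree.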
6. Machine Learning Competitions
Train a model for the competition
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=1)
# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(X, y)
# path to file you will use for predictions
test_data_path = '../input/test.csv'
# read test data file using pandas
test_data = pd.read_csv(test_data_path)
# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]
# make predictions which we will submit.
test_preds = rf_model_on_full_data.predict(test_X)
# Check your answer (To get credit for completing the exercise, you must get a "Correct" result!)
step_1.check()
# step_1.solution()
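Once `test_preds` exists, the predictions still have to be written to a file before they can be submitted. These notes do not show that step, so the sketch below is an assumption: it supposes the submission needs an `Id` column and a `SalePrice` column, uses invented ids and prices, and writes to a temporary directory rather than the real working directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for test_data and test_preds
toy_test_data = pd.DataFrame({"Id": [1461, 1462, 1463]})
toy_test_preds = [120000.0, 155000.5, 180250.0]

# One row per test home: its id and the predicted price (assumed column names)
output = pd.DataFrame({"Id": toy_test_data["Id"], "SalePrice": toy_test_preds})

# index=False keeps the DataFrame index out of the file
submission_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
output.to_csv(submission_path, index=False)

# Read the file back to confirm its layout
reloaded = pd.read_csv(submission_path)
print(reloaded)
```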
7. Conclusion
After completing all the tutorials and exercises: