About Random Forest

Decision trees can suffer from high variance, which makes their results fragile to the specific training data used. Building multiple models from samples of your training data, called bagging, can reduce this variance, but the trees remain highly correlated.

Random Forest is an extension of bagging that, in addition to building trees from multiple samples of your training data, also constrains the features that can be used to build the trees, forcing the trees to be different.

After completing this tutorial, you will know:

  • The difference between bagged decision trees and the random forest algorithm.
  • How to construct bagged decision trees with more variance.
  • How to apply the random forest algorithm to a predictive modeling problem.

1.1.1 Random Forest Algorithm

Decision trees involve the greedy selection of the best split point from the dataset at each step.

Random Forest algorithm: force the decision trees to be different by limiting the features (columns) that the greedy algorithm can evaluate at each split point when creating the tree.
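
As a minimal sketch of this idea, the snippet below samples a random subset of column indices before any split points are evaluated. The helper name sample_features is used only for illustration; the full get_split() function later in this tutorial embeds the same loop directly.

# Illustrative sketch: restrict the columns a split is allowed to consider.
from random import randrange

def sample_features(n_columns, n_features):
    # Randomly pick n_features distinct column indices from the n_columns inputs.
    features = list()
    while len(features) < n_features:
        index = randrange(n_columns)
        if index not in features:
            features.append(index)
    return features

print(sample_features(10, 3))  # e.g. [7, 2, 5]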

1.2 Tutorial

  • Calculating Splits
  • Sonar Case Study

These steps provide the foundation you need to implement and apply the Random Forest algorithm to your own predictive modeling problems.

1.2.1 Calculating Splits

Split points are chosen by finding the attribute and the value of that attribute that results in the lowest cost.

Finding the best split point in a decision tree involves evaluating the cost of each value in the training dataset for each input variable.
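
To make this cost concrete, here is a small worked example of the Gini index calculation performed by the gini_index() helper described below; the two groups of rows are made up purely for illustration.

# Worked example (illustrative data): Gini index of a candidate split.
# Each row ends with its class label; the split produced two groups of rows.
groups = ([[2.7, 0], [1.3, 0]],        # left group: both rows are class 0 (pure)
          [[3.6, 1], [3.1, 0]])        # right group: one row of each class (mixed)
classes = [0, 1]
n_instances = float(sum(len(group) for group in groups))
gini = 0.0
for group in groups:
    size = float(len(group))
    if size == 0:
        continue
    # proportion of each class in the group, squared and summed
    score = sum(([row[-1] for row in group].count(c) / size) ** 2 for c in classes)
    # weight the group's impurity by its relative size
    gini += (1.0 - score) * (size / n_instances)
print(gini)  # 0.25: the pure left group contributes 0, the mixed right group 0.25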

Below is a function named get_split() that implements this procedure. It relies on two helper functions:

test_split(): split the dataset by a candidate split point.

gini_index(): evaluate the cost of a given split by the groups of rows it creates.

features: a list of features is created by randomly selecting feature indices and adding them to a list.

split points: this list of features is then enumerated and specific values in the training dataset are evaluated as candidate split points.

# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index,b_value,b_score,b_groups = 999,999,999,None
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index],dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index,b_value,b_score,b_groups = index,row[index],gini,groups
    return {'index':b_index,'value':b_value,'groups':b_groups}

Now that we know how a decision tree algorithm can be modified for use with the Random Forest algorithm, we can piece this together with an implementation of bagging and apply it to a real-world dataset.

1.2.2 Sonar Case Study

We will use the following helper functions to load and prepare the dataset:

load_csv(): load the dataset from file.

str_column_to_float(): convert the string input values to numeric.

str_column_to_int(): convert the output column from strings to the integer values 0 and 1.

We will use k-fold cross-validation to estimate the performance of the learned model on unseen data. This means that we will construct and evaluate k models and estimate the performance as the mean model error. Classification accuracy will be used to evaluate each model. The following helper functions provide this behavior:

cross_validation_split(): split the dataset into k folds.

accuracy_metric(): calculate classification accuracy as a percentage.

evaluate_algorithm(): evaluate an algorithm using the cross-validation split.
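
As a rough sanity check, assuming the Sonar dataset's 208 rows, five folds gives just over 40 rows per fold, matching the fold_size calculation inside cross_validation_split() below; the snippet is purely illustrative.

# Illustrative: fold size used by cross_validation_split() for the Sonar data.
n_rows, n_folds = 208, 5            # 208 rows is the size of the Sonar dataset
fold_size = int(n_rows / n_folds)
print(fold_size)                    # 41 rows evaluated in each fold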

We will also use an implementation of the Classification and Regression Trees (CART) algorithm adapted for bagging, along with these helper functions:

test_split(): split a dataset into groups based on an attribute and an attribute value.

gini_index(): evaluate a split point.

get_split(): select the best split point for a dataset.

to_terminal(), split() and build_tree(): used to create a single decision tree.

predict(): make a prediction with a decision tree.

subsample(): make a random subsample of the training dataset.

bagging_predict(): make a prediction with a list of decision trees.

random_forest(): create a list of decision trees from subsamples of the training dataset and then use them to make predictions.
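
To illustrate the final step, bagging_predict() simply takes a majority vote over the individual tree predictions; the values below are made up for illustration.

# Illustrative: majority vote over per-tree predictions (made-up values).
predictions = [0, 1, 1, 0, 1]                          # predictions from 5 trees
print(max(set(predictions), key=predictions.count))    # 1 wins the vote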

The key difference between Random Forest and bagged decision trees is the one small change to the way trees are created, here in the get_split() function. The complete example is listed below.

# Random Forest Algorithm on Sonar Dataset
from random import seed
from random import randrange
from csv import reader
from math import sqrt

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename,'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
        
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i,value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual,predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm,n_folds,*args):
    folds = cross_validation_split(dataset,n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set,test_set,*args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual,predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index,value,dataset):
    left,right = list(),list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left,right

# Calculate the Gini index for a split dataset
def gini_index(groups,classes):
    # count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini
    
# Select the best split point for a dataset
def get_split(dataset, n_features):
    class_values = list(set(row[-1] for row in dataset))
    b_index,b_value,b_score,b_groups = 999,999,999,None
    features = list()
    while len(features) < n_features:
        index = randrange(len(dataset[0])-1)
        if index not in features:
            features.append(index)
    for index in features:
        for row in dataset:
            groups = test_split(index, row[index],dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index,b_value,b_score,b_groups = index,row[index],gini,groups
    return {'index':b_index,'value':b_value,'groups':b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes),key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth,min_size,n_features, depth):
    left,right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    #check for max depth
    if depth >= max_depth:
        node['left'],node['right'] = to_terminal(left),to_terminal(right)
        return
    # process left child
    if len(left) < min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left,n_features)
        split(node['left'],max_depth,min_size,n_features,depth+1)
    # process right child
    if len(right) < min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right,n_features)
        split(node['right'],max_depth,min_size,n_features,depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size,n_features):
    root = get_split(train,n_features)
    split(root, max_depth, min_size,n_features, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'],dict):
            return predict(node['left'],row)
        else:
            return node['left']
    else:
        if isinstance(node['right'],dict):
            return predict(node['right'],row)
        else:
            return node['right']
        
# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

# Make a prediction with a list of bagged trees
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions),key=predictions.count)

# Random Forest Algorithm
def random_forest(train, test, max_depth, min_size, sample_size,n_trees,n_features):
    trees = list()
    for i in range(n_trees):
        sample = subsample(train, sample_size)
        tree = build_tree(sample, max_depth, min_size, n_features)
        trees.append(tree)
    predictions = [bagging_predict(trees, row) for row in test]
    return (predictions)

# Test the random forest algorithm on the sonar dataset
seed(2)

# load and prepare data
filename = 'sonar.all-data.csv'
dataset = load_csv(filename)

# convert string attributes to integers
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset,i)

# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)

# evaluate algorithm
n_folds = 5
max_depth = 10
min_size = 1
sample_size = 1.0
n_features = int(sqrt(len(dataset[0])-1))
for n_trees in [1, 5, 10]:
    scores = evaluate_algorithm(dataset, random_forest,n_folds,max_depth,min_size,sample_size,n_trees,n_features)
    print('Trees: %d' % n_trees)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))
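
Note that the example assumes the Sonar dataset has been downloaded and saved in the current working directory as sonar.all-data.csv. Running the example prints the cross-validation scores and the mean classification accuracy for forests of 1, 5 and 10 trees, so you can see how accuracy changes as more trees are added.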

 
