About Bootstrap Aggregation

Decision trees are a simple and powerful predictive modeling technique, but they suffer from high variance. A technique to make decision trees more robust and to achieve better performance is called bootstrap aggregation, or bagging for short. After working through this section, you will know:

  • How to create a bootstrap sample of your dataset.
  • How to make predictions with bootstrapped models.
  • How to apply bagging to your own predictive modeling problems.

1.1.1 Bootstrap Aggregation Algorithm

A bootstrap is a sample of a dataset taken with replacement. This means that a new dataset is created from a random sample of an existing dataset, where a given row may be selected and added to the sample more than once. It is a useful approach when estimating a value such as the mean of a broader population and you only have a limited dataset available.
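As a quick illustration (a hypothetical snippet, not part of the tutorial code), sampling with replacement means that duplicates are expected in a bootstrap sample. The standard library's choices() function (Python 3.6+) makes this easy to see; the tutorial code below uses randrange() instead:

# Illustration only: draw a bootstrap sample with replacement from a small list
from random import seed, choices

seed(1)
data = [1, 2, 3, 4, 5]
# choices() selects k items with replacement, so values can repeat
bootstrap = choices(data, k=len(data))
print(bootstrap)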

The same approach can be used with machine learning algorithms that have a high variance. A separate model is trained on each bootstrap sample of the data, and the average output of those models is used to make predictions. This technique is called bootstrap aggregation, or bagging for short. Variance means that an algorithm's performance is sensitive to the training data, with high variance suggesting that the more the training data is changed, the more the performance of the algorithm will vary.

The performance of high-variance machine learning algorithms like unpruned decision trees can be improved by training many trees and taking the average of their predictions. The results are often better than those of a single decision tree. Another benefit of bagging, in addition to improved performance, is that the bagged decision trees resist overfitting the problem: trees can continue to be added until a maximum in performance is achieved.
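Before implementing bagging with decision trees, here is a minimal sketch of the general resample-train-combine loop. The train_model() and model_predict() callables are hypothetical placeholders for any high-variance learner; they are not part of the tutorial code:

# Sketch of bagging: train one model per bootstrap sample, average the outputs
from random import randrange

def bagged_predictions(train, test, train_model, model_predict, n_models, ratio=1.0):
    models = list()
    for _ in range(n_models):
        # bootstrap sample: rows drawn with replacement from the training data
        n_sample = round(len(train) * ratio)
        sample = [train[randrange(len(train))] for _ in range(n_sample)]
        models.append(train_model(sample))
    # combine by averaging each model's output for every test row
    predictions = list()
    for row in test:
        outputs = [model_predict(model, row) for model in models]
        predictions.append(sum(outputs) / float(len(outputs)))
    return predictions

For classification problems, the average is replaced by a majority vote over the predicted class labels, which is exactly what the bagging_predict() function does later in this tutorial.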

1.2 Tutorial

This tutorial is broken down into two parts:

  • Bootstrap Resample
  • Sonar Case Study

These steps provide the foundation that you need to implement and apply bootstrap aggregation with decision trees to your own predictive modeling problems.

1.2.1 Bootstrap Resample

Creating a bootstrap sample is as simple as repeatedly selecting random rows from the dataset and adding them to a new list until the sample reaches a desired size. Below is a function named subsample() that implements this procedure. The randrange() function from the random module is used to select a random row index to add to the sample on each iteration of the loop.

# create a random subsample from the dataset with replacement
def subsample(dataset, ratio=1.0):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

The complete example is listed below. Each bootstrap sample is created as a 10% sample of the original 20-observation dataset, and the example compares the mean estimated from 1, 10 and 100 bootstrap samples to the true mean of the dataset.

# Example of subsampling a dataset
from random import seed
from random import randrange


# create a random subsample from the dataset with replacement
def subsample(dataset, ratio=1.0):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

# Calculate the mean of a list of numbers
def mean(numbers):
    return sum(numbers) / float(len(numbers))

# Test subsample a dataset
seed(1)

# True mean
dataset = [[randrange(10)] for i in range(20)]
print('True Mean: %.3f' % mean([row[0] for row in dataset]))

# Estimated mean
ratio = 0.10
for size in [1, 10, 100]:
    sample_means = list()
    for i in range(size):
        sample = subsample(dataset, ratio)
        sample_mean = mean([row[0] for row in sample])
        sample_means.append(sample_mean)
    print('Samples=%d, Estimated Mean: %.3f' % (size, mean(sample_means)))

We can then see the estimated mean for each of the different numbers of bootstrap samples. As more bootstrap samples are used, the estimate moves closer to the true mean of the dataset.

1.2.2 Sonar Case Study

We will apply the Bagging algorithm to the Sonar dataset. The example uses the following helper functions for data preparation and model evaluation:

load_csv(): load the dataset from the CSV file.

str_column_to_float(): convert the string input values to floating point numbers.

str_column_to_int(): convert the output column from strings to the integer values 0 and 1.

cross_validation_split(): split the dataset into k folds for cross-validation.

accuracy_metric(): calculate classification accuracy, used to evaluate each model.

evaluate_algorithm(): construct and evaluate k models with k-fold cross-validation and estimate the performance as the mean classification accuracy on unseen data.

We will also use an implementation of the Classification and Regression Trees (CART) algorithm adapted for bagging, with the following helper functions:

test_split(): split a dataset into two groups based on an attribute and an attribute value.

gini_index(): evaluate a candidate split point.

get_split(): select the best split point for a dataset.

to_terminal(), split() and build_tree(): used together to create a single decision tree.

predict(): make a prediction with a decision tree.

subsample(): the function described in the previous step, used to make a subsample of the training dataset.

bagging_predict(): make a prediction with each decision tree and combine the predictions into a single return value (the most common prediction among the bagged trees).

bagging(): create bootstrap samples of the training dataset, train a decision tree on each, then make predictions on the test dataset using the list of bagged trees.

The complete example is listed below:

# Bagging Algorithm on the Sonar dataset
from random import seed
from random import randrange
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename,'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert  string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Split a dataset into k folds
def cross_validation_split(dataset,n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset)/n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set,[])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set,*args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left,right = list(),list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

# Calculate the Gini index for a split dataset
def gini_index(groups, classes):
    # count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini


# Select the best split point for a dataset
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score,b_groups = 999, 999, 999, None
    for index in range(len(dataset[0])-1):
        for row in dataset:
            groups = test_split(index, row[index],dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index,b_value,b_score,b_groups = index,row[index],gini,groups
    return {'index':b_index,'value':b_value,'groups': b_groups}

# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes),key=outcomes.count)

# Create child splits for a node or make terminal
def split(node, max_depth,min_size, depth):
    left,right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    #check for max depth
    if depth >= max_depth:
        node['left'],node['right'] = to_terminal(left),to_terminal(right)
        return
    # process left child
    if len(left) < min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'],max_depth,min_size,depth+1)
    # process right child
    if len(right) < min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'],max_depth,min_size,depth+1)

# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'],dict):
            return predict(node['left'],row)
        else:
            return node['left']
    else:
        if isinstance(node['right'],dict):
            return predict(node['right'],row)
        else:
            return node['right']
        
# Create a random subsample from the dataset with replacement
def subsample(dataset, ratio):
    sample = list()
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample

# Make a prediction with a list of bagged trees
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions),key=predictions.count)

# Bootstrap Aggregation Algorithm
def bagging(train, test, max_depth, min_size, sample_size,n_trees):
    trees = list()
    for i in range(n_trees):
        sample = subsample(train, sample_size)
        tree = build_tree(sample, max_depth, min_size)
        trees.append(tree)
    predictions = [bagging_predict(trees, row) for row in test]
    return (predictions)

# Test bagging on the sonar dataset
seed(1)

# load and prepare data
filename = 'sonar.all-data.csv'
dataset = load_csv(filename)

# convert string attributes to integers
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset,i)

# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)

# evaluate algorithm
n_folds = 5
max_depth = 6
min_size = 2
sample_size = 0.50
for n_trees in [1, 5, 10, 50]:
    scores = evaluate_algorithm(dataset, bagging,n_folds,max_depth,min_size,sample_size,n_trees)
    print('Trees: %d' % n_trees)
    print('Scores: %s' % scores)
    print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))
