Leaf Classification: Getting Started with Kaggle

Cookly94

Article link: https://blog.csdn.net/Cookly94/article/details/52554662

SITE-LINK Kaggle: Leaf Classification

DATA DESC
The dataset consists of approximately 1,584 images of leaf specimens (16 samples each of 99 species), which have been converted to binary black leaves against white backgrounds. Three sets of features are also provided per image: a shape contiguous descriptor, an interior texture histogram, and a fine-scale margin histogram. For each feature, a 64-attribute vector is given per leaf sample.

Note that of the original 100 species, we have eliminated one on account of incomplete associated data in the original dataset.

File descriptions

train.csv - the training set
test.csv - the test set
sample_submission.csv - a sample submission file in the correct format
images - the image files (each image is named with its corresponding id)
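
To make the "correct format" of the submission file concrete, here is a minimal sketch using only the standard library. It assumes the layout that the code in this post relies on: an `id` column followed by one probability column per species. The species labels, ids and probabilities below are hypothetical placeholders, not values from the real files.

```python
import csv
import io

# Hypothetical placeholder labels; the real header lists the 99 species names.
species = ["species_a", "species_b", "species_c"]

# Hypothetical predictions: image id -> one probability per species.
predictions = {
    4: [0.7, 0.2, 0.1],
    7: [0.1, 0.1, 0.8],
}

def write_submission(species, predictions):
    """Return CSV text with an 'id' column followed by one column per species."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id"] + species)
    for image_id, probs in sorted(predictions.items()):
        if len(probs) != len(species):
            raise ValueError("expected one probability per species")
        writer.writerow([image_id] + probs)
    return buffer.getvalue()

submission_text = write_submission(species, predictions)
print(submission_text)
```

The length check is the useful part: a row with a missing or extra probability would otherwise shift every column after it.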

Data fields

id - an anonymous id unique to an image
margin_1, margin_2, margin_3, ..., margin_64 - the 64 attributes of the margin feature vector
shape_1, shape_2, shape_3, ..., shape_64 - the 64 attributes of the shape feature vector
texture_1, texture_2, texture_3, ..., texture_64 - the 64 attributes of the texture feature vector
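
The column naming scheme above can be generated rather than typed out. The following is a small sketch; the helper name `feature_columns` is made up for illustration, and it assumes the CSV columns are named exactly as listed in the data fields.

```python
# Feature groups and vector length, as stated in the data description.
FEATURE_GROUPS = ("margin", "shape", "texture")
VECTOR_LENGTH = 64

def feature_columns(group):
    """Column names for one feature group, e.g. margin_1 ... margin_64."""
    return ["{}_{}".format(group, i) for i in range(1, VECTOR_LENGTH + 1)]

# All feature columns: 3 groups x 64 attributes each.
all_features = [name for group in FEATURE_GROUPS for name in feature_columns(group)]

print(len(all_features))  # 192
print(all_features[0], all_features[-1])  # margin_1 texture_64
```

With pandas, a list such as `feature_columns("shape")` could be used to select a single feature group from the training frame, which is handy for checking how much each descriptor contributes on its own.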

CODE

'''
@author: Cookly
'''
import pandas as pd
import numpy as np

path_name = r'~/DATA/LeafClassification/data/'
file_train = r'train.csv'
file_test =  r'test.csv'
file_submission = r'sample_submission.csv'

df_train = pd.read_csv(path_name + file_train, sep=',')
df_test = pd.read_csv(path_name + file_test, sep=',')

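# Read only the header row of the sample submission: the first column is
# 'id', the remaining columns are the species names in submission order.
species = pd.read_csv(path_name + file_submission, nrows=0).columns[1:].tolist()
# Map each species name to an integer label 1..N (N = number of species).
dict_species_num = dict(zip(species, range(1, len(species) + 1)))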

from sklearn import preprocessing
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

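# Columns 0 and 1 of train.csv are 'id' and 'species'; the rest are features.
X_train = df_train.iloc[:, 2:].values
y_train = df_train.species.map(dict_species_num).values

# test.csv has no 'species' column, so only 'id' is dropped.
X_test = df_test.iloc[:, 1:].values
model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)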

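# Class probabilities; columns follow the sorted integer labels 1..N,
# which is the same order as the `species` list.
y_test_proba = model.predict_proba(X_test)
# y_test = model.predict(X_test)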

result = pd.DataFrame(y_test_proba,columns=species,index=df_test.id)
result.to_csv(path_name + 'submission.csv')

# Choose the best n_estimators for RandomForestClassifier
X_split_train, X_split_test, y_split_train, y_split_test = train_test_split(X_train, y_train, test_size=0.3)

def get_best_n_estimators(num_list):
    """Return (n_estimators, mean hold-out accuracy over 5 fits) pairs."""
    result = []
    for i in num_list:
        result_one = 0.0
        # Refit 5 times to average out the randomness of the forest.
        for j in range(5):
            clf = RandomForestClassifier(n_estimators=i)
            clf.fit(X_split_train, y_split_train)
            result_one += clf.score(X_split_test, y_split_test)
        result.append(result_one / 5.0)
    return list(zip(num_list, result))

num_list = [100, 120, 150, 200, 300, 500]
print(get_best_n_estimators(num_list))
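
The selection loop above scores every candidate on a single random 70/30 split, so the averages smooth out the randomness of the forest but not the randomness of the split itself. One possible alternative, sketched here and not part of the original script, is stratified k-fold cross-validation from `sklearn.model_selection`. The data is a synthetic stand-in from `make_classification`, and the helper name `cv_accuracy` and the small candidate list are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the leaf features (an assumption for illustration only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=0,
)

def cv_accuracy(n_estimators_list, X, y, n_splits=5):
    """Mean cross-validated accuracy for each candidate n_estimators."""
    # Stratification keeps the class proportions similar in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for n in n_estimators_list:
        clf = RandomForestClassifier(n_estimators=n, random_state=0)
        fold_scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
        scores.append((n, fold_scores.mean()))
    return scores

results = cv_accuracy([10, 50], X, y)
for n, acc in results:
    print(n, round(acc, 3))
```

On the real data, `X_train` and `y_train` would replace the synthetic arrays. Since each species has only a handful of samples, the number of folds has to stay at or below the smallest per-class sample count for stratification to work.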