Leaf Classification--Getting Started with Kaggle

SITE-LINK kaggle–Leaf Classification


DATA DESC
The dataset consists of approximately 1,584 images of leaf specimens (16 samples each of 99 species), which have been converted to binary black leaves against white backgrounds. Three sets of features are also provided per image: a shape contiguous descriptor, an interior texture histogram, and a fine-scale margin histogram. For each feature, a 64-attribute vector is given per leaf sample, so each leaf is described by 192 numeric values in total.

Note that of the original 100 species, we have eliminated one on account of incomplete associated data in the original dataset.

File descriptions

train.csv - the training set
test.csv - the test set
sample_submission.csv - a sample submission file in the correct format
images - the image files (each image is named with its corresponding id)
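
The header of sample_submission.csv fixes the column order that a submission must follow: an id column, then one probability column per species. The sketch below shows one way to reorder a block of predicted probabilities to match that header. It uses a tiny in-memory stand-in for the file, and the species names, ids, and the `classes` list are hypothetical placeholders rather than values from the real dataset.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for sample_submission.csv:
# an id column followed by one column per species.
sample_csv = io.StringIO(
    "id,species_a,species_b,species_c\n"
    "4,0.33,0.33,0.34\n"
    "7,0.33,0.33,0.34\n"
)
sample = pd.read_csv(sample_csv)
species = list(sample.columns[1:])  # column order required by the header

# Pretend model output: one row per test image, one probability per class.
# `classes` is the label order the model used, which may differ from the header.
classes = ["species_c", "species_a", "species_b"]
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])

submission = pd.DataFrame(proba, columns=classes, index=sample["id"])
submission = submission[species]  # reorder columns to match the header
print(submission)
```

Building the frame with the model's own class order and then selecting columns by name avoids silently attaching probabilities to the wrong species.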

Data fields

id - an anonymous id unique to an image
margin_1, margin_2, margin_3, ..., margin_64 - the 64 attributes of the margin feature vector
shape_1, shape_2, shape_3, ..., shape_64 - the 64 attributes of the shape feature vector
texture_1, texture_2, texture_3, ..., texture_64 - the 64 attributes of the texture feature vector
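
Because the column names carry the feature group as a prefix, the three 64-attribute blocks can be separated by name instead of by position. The following is a minimal sketch under that assumption. The helper names `make_feature_names` and `split_feature_groups` are made up for illustration, and the frame is filled with random numbers and placeholder species labels so that it runs without the Kaggle files.

```python
import numpy as np
import pandas as pd


def make_feature_names(groups=("margin", "shape", "texture"), n_attrs=64):
    """Build column names such as margin_1 ... margin_64 for each group."""
    return [f"{g}_{i}" for g in groups for i in range(1, n_attrs + 1)]


def split_feature_groups(df, groups=("margin", "shape", "texture")):
    """Return a dict mapping each group name to its block of columns."""
    return {
        g: df[[c for c in df.columns if c.startswith(g + "_")]]
        for g in groups
    }


# Synthetic stand-in for train.csv: random values, placeholder species labels.
rng = np.random.default_rng(0)
feature_names = make_feature_names()
demo = pd.DataFrame(rng.random((4, len(feature_names))), columns=feature_names)
demo.insert(0, "species", ["species_a", "species_b", "species_a", "species_b"])
demo.insert(0, "id", [1, 2, 3, 5])

blocks = split_feature_groups(demo)
for name, block in blocks.items():
    print(name, block.shape)
```

Selecting by prefix keeps the code independent of the exact column positions, which is convenient when trying a model on only one or two of the feature groups.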

CODE

'''
@author: Cookly
'''
import pandas as pd
import numpy as np

path_name = r'~/DATA/LeafClassification/data/'
file_train = r'train.csv'
file_test =  r'test.csv'
file_submission = r'sample_submission.csv'

df_train = pd.read_csv(path_name + file_train, sep=',')
df_test = pd.read_csv(path_name + file_test, sep=',')

with open(path_name + file_submission) as fl:
    fl_first_line = fl.readline()
    species = fl_first_line.split('\n')[0].split(',')[1:]
# Map each species name to an integer label, in the order given by the submission header.
dict_species_num = dict(zip(species, range(1, len(species) + 1)))

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

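# train.csv columns: id, species, then the 192 feature columns.
X_train = df_train.drop(columns=['id', 'species']).values
y_train = df_train.species.map(dict_species_num).values

# test.csv has no species column, so only id is dropped.
X_test = df_test.drop(columns=['id']).values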
model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)

y_test_proba = model.predict_proba(X_test)
# y_test = model.predict(X_test)

result = pd.DataFrame(y_test_proba,columns=species,index=df_test.id)
result.to_csv(path_name + 'submission.csv')

######## choose the best n_estimators for RandomForestClassifier
# Hold out 30% of the training data for validation.
X_split_train, X_split_test, y_split_train, y_split_test = train_test_split(X_train, y_train, test_size=0.3)
def get_best_n_estimators(num_list):
    result = []
    for i in num_list:
        result_one = 0.0
        # Average over 5 fits to smooth out the randomness of the forest.
        for j in range(5):
            clf = RandomForestClassifier(n_estimators=i)
            clf.fit(X_split_train, y_split_train)
            result_one += clf.score(X_split_test, y_split_test)
        result.append(result_one / 5.0)
    return list(zip(num_list, result))

num_list = [100, 120, 150, 200, 300, 500]
print(get_best_n_estimators(num_list))
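
The search above refits the forest five times on the same 70/30 split, so the average smooths out the randomness of the forest but not the randomness of the split itself. One possible alternative, sketched here and not part of the original code, is stratified k-fold cross-validation with scikit-learn's `cross_val_score`. The function name `score_n_estimators` and the synthetic data from `make_classification` are assumptions made so the example runs on its own; with the real data, `X_train` and `y_train` would be passed in instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def score_n_estimators(X, y, candidates, n_splits=5, seed=0):
    """Return {n_estimators: mean cross-validated accuracy} for each candidate."""
    # Stratified folds keep the class proportions similar in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for n in candidates:
        clf = RandomForestClassifier(n_estimators=n, random_state=seed)
        scores[n] = cross_val_score(clf, X, y, cv=cv).mean()
    return scores


# Synthetic stand-in data (4 classes, 20 features), not the leaf dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_classes=4, random_state=0
)
scores = score_n_estimators(X_demo, y_demo, [10, 50])
best_n = max(scores, key=scores.get)
print(scores, best_n)
```

With only about ten training samples per species, the number of folds has to stay at or below the smallest class count, so a small `n_splits` is the safer choice on the real data.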