Train/test splitting (reposted from http://blog.sina.com.cn/s/blog_6a90ae320101a5rc.html)

Study notes on the Python scikit-learn machine learning toolkit: the cross_validation module

(2013-02-19 20:39:36)
As its name suggests, the sklearn.cross_validation module is used to perform cross-validation.

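The rough idea of cross-validation is this: the original data is split into two parts, train data and test data. The train data is used to fit the model, and the test data is used to measure its accuracy. The error measured on the test data is called the validation error. When evaluating an algorithm on a dataset, a single random split into train and test data, yielding a single validation error, is not a reliable measure of how good the algorithm is, because the result depends on the luck of that one split. Instead, the data should be split into train and test data many times, with a validation error computed for each split. The resulting set of validation errors gives a much more reliable picture of the algorithm's quality.
Cross-validation is a very good way to evaluate performance when the amount of data is limited.
There are many ways to split the original data into train data and test data, which is why there are many variants of cross-validation.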
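
To make the idea of "many splits, one validation error per split" concrete, here is a minimal plain-Python sketch. It does not use scikit-learn. The 1-D data is made up, and the threshold classifier (`fit_threshold`, `predict_threshold`) and the helper `repeated_validation_errors` are hypothetical names introduced only for this illustration.

```python
import random

def fit_threshold(xs, ys):
    """Hypothetical toy classifier: threshold halfway between the class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

def repeated_validation_errors(xs, ys, n_rounds=5, test_fraction=0.3, seed=0):
    """Return one validation error per random train/test split."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    n_test = int(round(test_fraction * len(xs)))
    errors = []
    for _ in range(n_rounds):
        rng.shuffle(indices)  # a fresh random split each round
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        threshold = fit_threshold([xs[i] for i in train_idx],
                                  [ys[i] for i in train_idx])
        wrong = sum(predict_threshold(threshold, xs[i]) != ys[i]
                    for i in test_idx)
        errors.append(wrong / n_test)
    return errors

# Made-up data: class 0 clusters near 0.0, class 1 near 1.0, with some overlap.
data_rng = random.Random(42)
xs = ([data_rng.gauss(0.0, 0.4) for _ in range(20)]
      + [data_rng.gauss(1.0, 0.4) for _ in range(20)])
ys = [0] * 20 + [1] * 20

errors = repeated_validation_errors(xs, ys)
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```

The individual errors usually differ from split to split, which is exactly why a single split is not trusted on its own; the mean (and the spread) of the list is what gets reported.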

The most important function in sklearn's cross_validation module is
sklearn.cross_validation.cross_val_score. It is called as scores = cross_validation.cross_val_score(clf, raw_data, raw_target, cv=5, score_func=None)
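Parameters:
clf is the classifier; any classifier can be used, for example a support vector machine classifier: clf = svm.SVC(kernel='linear', C=1)
cv selects the cross-validation method. If cv is an int, it is the number of folds: when raw_target is provided, StratifiedKFold splitting is used; when raw_target is not provided, KFold splitting is used.
The return value of cross_val_score is an array of scores, one per split of the raw data, each measured on the test data of that split. How the score is computed can be specified through score_func; if it is not given, the classifier's own default scoring method is used.
The remaining parameters are less important.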
A concrete example of using cross_val_score:
>>> from sklearn import cross_validation, svm
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_validation.cross_val_score(
...     clf, raw_data, raw_target, cv=5)
...
>>> scores
array([ 1.  ...,  0.96...,  0.9 ...,  0.96...,  1.        ])
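
The difference between KFold and StratifiedKFold can be illustrated with a small plain-Python sketch. This is not scikit-learn's implementation: `kfold_indices` and `stratified_kfold_indices` are hypothetical helpers that only show which sample indices end up in each test fold, assuming unshuffled data whose labels are sorted by class.

```python
from collections import defaultdict

def kfold_indices(n_samples, n_folds):
    """Cut range(n_samples) into n_folds consecutive test folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def stratified_kfold_indices(labels, n_folds):
    """Deal the indices of each class round-robin into the folds,
    so every fold keeps roughly the overall class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for class_indices in by_class.values():
        for position, idx in enumerate(class_indices):
            folds[position % n_folds].append(idx)
    return [sorted(fold) for fold in folds]

labels = [0] * 6 + [1] * 3  # made-up labels, sorted by class
plain = kfold_indices(len(labels), 3)
stratified = stratified_kfold_indices(labels, 3)
print(plain)       # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(stratified)  # [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

With the plain split, the third test fold contains only class 1 and the first two contain only class 0; with the stratified split, every fold holds two samples of class 0 and one of class 1. That is the reason stratification is the default choice when class labels are available.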

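Besides the KFold and StratifiedKFold methods just mentioned, there are many other ways to split the raw data. They are invoked slightly differently from those two, but all of them follow the same pattern: instead of passing an int as cv, you construct a splitter object and pass that object as cv. Taking ShuffleSplit as an example: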
>>> n_samples = raw_data.shape[0]
>>> cv = cross_validation.ShuffleSplit(n_samples, n_iter=3,
...     test_size=0.3, random_state=0)

>>> cross_validation.cross_val_score(clf, raw_data, raw_target, cv=cv)
...
array([ 0.97...,  0.97...,  1.        ])
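
What the three ShuffleSplit arguments control can be sketched in plain Python. The generator below is a hypothetical stand-in, not scikit-learn's code; in particular, rounding `test_size * n_samples` to the nearest integer is an assumption of this sketch.

```python
import random

def shuffle_split(n_samples, n_iter=3, test_size=0.3, random_state=0):
    """Yield (train_indices, test_indices) for n_iter independent random splits."""
    rng = random.Random(random_state)           # the seed fixes all the splits
    n_test = int(round(test_size * n_samples))  # size of each test set
    for _ in range(n_iter):
        permutation = list(range(n_samples))
        rng.shuffle(permutation)
        yield sorted(permutation[n_test:]), sorted(permutation[:n_test])

splits = list(shuffle_split(10, n_iter=3, test_size=0.3, random_state=0))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Each iteration reshuffles all samples from scratch, so, unlike the folds of KFold, the test sets of different iterations may overlap, and the number of iterations is independent of the test set size.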

Other splitting methods include:
cross_validation.Bootstrap
cross_validation.LeaveOneLabelOut
cross_validation.LeaveOneOut
cross_validation.LeavePLabelOut
cross_validation.LeavePOut
cross_validation.StratifiedShuffleSplit

They are invoked the same way as ShuffleSplit, but each has its own parameters. For what each of these methods means, see a machine learning textbook.
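
As a rough illustration of two of the methods listed above, the following plain-Python sketch enumerates the splits that leave-one-out and leave-P-out produce. `leave_p_out` and `leave_one_out` are hypothetical helpers written for this note and only mirror the textbook definitions, not the library classes.

```python
from itertools import combinations

def leave_p_out(n_samples, p):
    """Yield (train_indices, test_indices) for every test set of size p."""
    all_indices = range(n_samples)
    for test in combinations(all_indices, p):
        held_out = set(test)
        train = [i for i in all_indices if i not in held_out]
        yield train, list(test)

def leave_one_out(n_samples):
    """Leave-one-out is leave-P-out with p = 1: each sample is the test set once."""
    return leave_p_out(n_samples, 1)

loo_splits = list(leave_one_out(4))
lpo_splits = list(leave_p_out(4, 2))
print(len(loo_splits), len(lpo_splits))  # 4 6
print(lpo_splits[0])                     # ([2, 3], [0, 1])
```

The number of leave-P-out splits is the binomial coefficient C(n, p), which grows very quickly with n, so these exhaustive methods are only practical for small datasets.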

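Another quite useful function is train_test_split.
What it does: randomly selects train data and test data from the samples in a given proportion. It is called as:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(train_data, train_target, test_size=0.4, random_state=0)
test_size is the fraction of the samples assigned to the test set; if it is an integer, it is the absolute number of test samples. random_state is the seed of the random number generator: different seeds produce different random splits, and the same seed always produces the same split.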
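
The behaviour of test_size and random_state described above can be sketched in plain Python. `simple_train_test_split` is a hypothetical helper, not the library function; rounding a fractional test_size to the nearest whole number of samples is an assumption of this sketch.

```python
import random

def simple_train_test_split(data, target, test_size=0.25, random_state=None):
    """Shuffle once, then cut into train and test parts.

    A float test_size is a fraction of the samples; an int is a sample count.
    """
    n_samples = len(data)
    if isinstance(test_size, int):
        n_test = test_size
    else:
        n_test = int(round(test_size * n_samples))
    order = list(range(n_samples))
    random.Random(random_state).shuffle(order)  # same seed, same order
    test_idx, train_idx = order[:n_test], order[n_test:]

    def pick(sequence, indices):
        return [sequence[i] for i in indices]

    return (pick(data, train_idx), pick(data, test_idx),
            pick(target, train_idx), pick(target, test_idx))

# Made-up data: the target is the square of the sample, so pairing is easy to check.
train_data = list(range(10))
train_target = [x * x for x in train_data]

X_train, X_test, y_train, y_test = simple_train_test_split(
    train_data, train_target, test_size=0.4, random_state=0)
print(len(X_train), len(X_test))  # 6 4
```

The data and the target are indexed with the same shuffled order, so each sample stays paired with its own label after the split.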