1. Counting how often each value occurs in a column
Thanks to @waple_0820 for these two approaches:
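df.groupby('colname').size().reset_index(name='cnt')  # approach 1: DataFrame with columns 'colname' and 'cnt' (dict-style renaming inside agg() was removed in pandas 1.0)
df['colname'].value_counts()  # approach 2: Series of counts indexed by value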
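As a quick check that the two approaches agree, here is a minimal sketch on a made-up DataFrame. The data is hypothetical and 'colname' is just the placeholder column name used above.

```python
import pandas as pd

# Hypothetical toy data; 'colname' mirrors the placeholder used above.
df = pd.DataFrame({"colname": ["a", "b", "a", "c", "a", "b"]})

# Approach 1: group sizes as a DataFrame with an explicit 'cnt' column.
by_group = df.groupby("colname").size().reset_index(name="cnt")

# Approach 2: a Series indexed by value, sorted by count in descending order.
by_value = df["colname"].value_counts()

print(by_group)
print(by_value)
```

Approach 1 is handy when the counts need to be merged back into another DataFrame; approach 2 is shorter when you only want to look at the distribution.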
2. Splitting the original data into train and test sets
sklearn provides a function called train_test_split that splits the data for you. It is used like this:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
X = df.drop('class', axis=1)
Y = df['class']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 42)
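X and Y are the features and the label extracted from the original dataset. Calling train_test_split() on them divides the data into a training set and a test set. test_size is the fraction of all samples that goes into the test set. random_state is the seed for the random number generator: if you set it, every run produces exactly the same split; if you leave it as None, each run produces different training and test samples, although the sizes of the two sets stay the same.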
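The effect of random_state can be checked with a small sketch like the one below. The DataFrame is randomly generated toy data and the column names f1 and f2 are hypothetical; only 'class' comes from the example above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 100 rows, two features and a 'class' label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "class": rng.integers(0, 2, size=100),
})
X = df.drop("class", axis=1)
Y = df["class"]

# Same seed twice: both calls should select exactly the same rows.
a_train, a_test, _, _ = train_test_split(X, Y, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(X, Y, test_size=0.33, random_state=42)

print(len(a_train), len(a_test))            # sizes are fixed by test_size
print(a_test.index.equals(b_test.index))    # identical rows for identical seeds
```

Calling train_test_split again without random_state would normally give a different selection of rows, but the train and test set sizes would not change.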
3. Evaluating a model's precision, recall and F1
Again we use functions from sklearn, as follows:
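from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print("ACC", accuracy_score(Y_test, Y_pred))  # accuracy (overall fraction correct; not the same metric as precision)
print("PRE", precision_score(Y_test, Y_pred, average="micro"))  # precision
print("REC", recall_score(Y_test, Y_pred, average="micro"))  # recall
print("F-score", f1_score(Y_test, Y_pred, average="micro"))  # F1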
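The first argument is the true labels of the test data and the second is the predicted labels. There are several more parameters; the docstring in the source code explains what they do: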
y_true : 1d array-like, or label indicator array / sparse matrix
    Ground truth (correct) target values.

y_pred : 1d array-like, or label indicator array / sparse matrix
    Estimated targets as returned by a classifier.

labels : list, optional
    The set of labels to include when ``average != 'binary'``, and their
    order if ``average is None``. Labels present in the data can be
    excluded, for example to calculate a multiclass average ignoring a
    majority negative class, while labels not present in the data will
    result in 0 components in a macro average. For multilabel targets,
    labels are column indices. By default, all labels in ``y_true`` and
    ``y_pred`` are used in sorted order.

    .. versionchanged:: 0.17
       parameter *labels* improved for multiclass problem.

pos_label : str or int, 1 by default
    The class to report if ``average='binary'`` and the data is binary.
    If the data are multiclass or multilabel, this will be ignored;
    setting ``labels=[pos_label]`` and ``average != 'binary'`` will report
    scores for that label only.

average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', \
                   'weighted']
    This parameter is required for multiclass/multilabel targets.
    If ``None``, the scores for each class are returned. Otherwise, this
    determines the type of averaging performed on the data:

    ``'binary'``:
        Only report results for the class specified by ``pos_label``.
        This is applicable only if targets (``y_{true,pred}``) are binary.
    ``'micro'``:
        Calculate metrics globally by counting the total true positives,
        false negatives and false positives.
    ``'macro'``:
        Calculate metrics for each label, and find their unweighted
        mean. This does not take label imbalance into account.
    ``'weighted'``:
        Calculate metrics for each label, and find their average, weighted
        by support (the number of true instances for each label). This
        alters 'macro' to account for label imbalance; it can result in an
        F-score that is not between precision and recall.
    ``'samples'``:
        Calculate metrics for each instance, and find their average (only
        meaningful for multilabel classification where this differs from
        :func:`accuracy_score`).

sample_weight : array-like of shape = [n_samples], optional
    Sample weights.
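The first two parameters need no further explanation.
labels is the set of labels to include when average != 'binary', and it also fixes their order when average is None. Labels that occur in the data can be left out, for example to compute a multiclass average that ignores a majority negative class; labels that do not occur in the data contribute a score of 0 to a macro average. For multilabel targets, the labels are column indices. By default, all labels found in y_true and y_pred are used, in sorted order.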
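pos_label can be a str or an int and defaults to 1. It names the class to report when average='binary' and the data is binary. If the data is multiclass or multilabel, this parameter is ignored.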
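A minimal sketch of how labels and pos_label behave, using recall_score on made-up label lists. All values here are hypothetical and small enough to verify by hand.

```python
from sklearn.metrics import recall_score

# Binary case with string labels: pos_label picks the class that is scored.
y_true = ["spam", "ham", "spam", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham"]
rec_spam = recall_score(y_true, y_pred, pos_label="spam")  # 2 of 3 spam found
rec_ham = recall_score(y_true, y_pred, pos_label="ham")    # 1 of 2 ham found

# Multiclass case: labels restricts the macro average to classes 1 and 2,
# ignoring class 0 (for example, a majority "negative" class).
y_true_mc = [0, 0, 0, 1, 1, 2]
y_pred_mc = [0, 0, 1, 1, 2, 2]
rec_12 = recall_score(y_true_mc, y_pred_mc, labels=[1, 2], average="macro")

print(rec_spam, rec_ham, rec_12)
```

In the multiclass call, class 1 has recall 1/2 and class 2 has recall 1, so the macro average over the two selected labels is 0.75.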
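average
When a binary metric is extended to a multiclass or multilabel problem, the data is treated as a collection of binary problems, one per class. The per-class binary scores can then be averaged across classes in several ways, each useful in some situations. The average parameter selects which one to use.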
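average = binary: for a binary problem, return the result for one label only, the one specified by pos_label.
average = macro: compute the unweighted mean of the per-class binary metrics, so every class gets the same weight. This can highlight performance on infrequent classes when those classes matter. On the other hand, it assumes all classes are equally important, so the (typically low) score of a small class can have a large effect on the result.
average = micro: give every sample-class pair an equal contribution to the overall metric (apart from the effect of sample_weight). Instead of summing the metric per class, it sums the numerators and denominators that make up the per-class metrics and computes one overall quotient. Micro-averaging may be preferred in multilabel settings.
average = weighted: account for class imbalance by averaging the per-class binary metrics with each class's score weighted by its support, i.e. the number of true instances of that class. It can be used instead of macro when the label counts are unbalanced.
average = samples: applies only to multilabel problems. It does not compute a per-class score; instead it computes the metric between the true and predicted classes of each sample in the evaluation data and returns the (sample_weight-weighted) average.
average = None: return an array containing the score of each class.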
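To see how the averaging modes differ in practice, here is a minimal sketch on a made-up three-class example. The label lists are hypothetical and were chosen so that per-class precision is easy to check by hand.

```python
from sklearn.metrics import precision_score

# Hypothetical labels: class 0 has 4 true samples, class 1 has 2, class 2 has 1.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

# Per-class precision: 3/3 for class 0, 1/2 for class 1, 1/2 for class 2.
per_class = precision_score(y_true, y_pred, average=None)

# macro: plain mean of the per-class scores = (1 + 0.5 + 0.5) / 3
macro = precision_score(y_true, y_pred, average="macro")

# micro: pooled counts = (3 + 1 + 1) true positives / 7 predictions
micro = precision_score(y_true, y_pred, average="micro")

# weighted: per-class scores weighted by support (4, 2, 1) = 5.5 / 7
weighted = precision_score(y_true, y_pred, average="weighted")

print(per_class, macro, micro, weighted)
```

Here the large, well-predicted class 0 pulls the micro and weighted scores above the macro score, which treats the two small classes as equally important.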
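4. Working with Series objects
Selecting data with loc or iloc sometimes returns a Series. Accessing a Series directly carries its index along, so you can retrieve the index and the values separately like this: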
obj.values # get the values
obj.index # get the index
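A short sketch of where such a Series comes from and what the two attributes return. The DataFrame and its labels are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}, index=["x", "y", "z"])

row = df.loc["y"]     # one row: a Series indexed by the column names
col = df.iloc[:, 0]   # one column: a Series indexed by the row labels

print(row.index.tolist(), row.values)
print(col.index.tolist(), col.values)
```

values returns a NumPy array without the index. Recent pandas versions also offer obj.to_numpy(), which the pandas documentation recommends over values.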