A Machine Learning Classification Dataset
What is the dataset?
The Poker Hand dataset [Cattral et al., 2007] is publicly available and well documented at the UCI Machine Learning Repository [Dua et al., 2019]. Cattral et al. [2007] describe it as:
Found to be a challenging dataset for classification algorithms
It is an 11-dimensional dataset (10 features plus the label) with 25K samples for training and 1M samples for testing. Each instance is a five-card poker hand described by two features per card (suit and rank), plus a label giving the class of the hand.
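To make the row layout concrete, here is a minimal sketch of how one instance could be decoded. It assumes the encoding given in the UCI documentation: each row is 11 comma-separated integers, alternating suit (1 to 4) and rank (1 to 13) for the five cards, followed by a class label from 0 to 9. The helper name `parse_hand` and the sample row are illustrative, not part of the dataset's tooling.

```python
from typing import List, Tuple

# Assumed encoding, taken from the UCI documentation of the dataset.
SUITS = {1: "Hearts", 2: "Spades", 3: "Diamonds", 4: "Clubs"}
FACE_RANKS = {1: "Ace", 11: "Jack", 12: "Queen", 13: "King"}


def parse_hand(line: str) -> Tuple[List[Tuple[str, str]], int]:
    """Decode one row into five (suit, rank) pairs and the class label.

    Hypothetical helper for illustration only.
    """
    values = [int(v) for v in line.strip().split(",")]
    if len(values) != 11:
        raise ValueError(f"expected 11 values, got {len(values)}")

    features, label = values[:10], values[10]
    cards = []
    # Features come in (suit, rank) pairs: S1, C1, S2, C2, ..., S5, C5.
    for i in range(0, 10, 2):
        suit = SUITS[features[i]]
        rank = FACE_RANKS.get(features[i + 1], str(features[i + 1]))
        cards.append((suit, rank))
    return cards, label


# Illustrative row: 10, Jack, King, Queen, Ace of hearts, labelled class 9.
cards, label = parse_hand("1,10,1,11,1,13,1,12,1,1,9")
print(cards, label)
```

Note that both suit and rank arrive as integers, so a model that treats them as numeric magnitudes would read an ordering into the suit codes that the card game does not have.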
Why is it hard?
It has two properties that make it particularly challenging for classification algorithms: all of its features are categorical, and it is extremely imbalanced. Categorical features are hard because the typical distance (a.k.a. similarity) metrics can't be naturally applied to them. For example, each card in this dataset has two features, rank and suit, and calculating the Euclidean distance between "spades" and "hearts" simply doesn't make sense. Imbalanced datasets are hard because most machine learning algorithms implicitly assume a reasonably balanced class distribution. Jason Brownlee from Machine Learning Mastery describes the problem as: