We need to report test-set performance
Why isn't a validation set enough? Why do we need a separate test set?
We don't strictly need a test set; in some cases it might be okay to have only training and validation sets. The purpose of the test set is to give us an unbiased estimate of generalization performance. Especially when we have a lot of hyper-parameters to tune, there is a risk of overfitting to the validation set: although the model never sees the validation set, we do, and we tune the knobs to reconfigure our model accordingly. When there are a lot of knobs to tune, the model might end up overly tuned to perform well on the validation set, yet not generalize well to truly unseen data. That's why it is beneficial to keep a separate test set and use it only once we are done configuring and training our model.
Partition the data: (training set, validation set, test set)
Partition in an unbiased way: for an unbiased split, randomly shuffling the data before partitioning is usually good enough. If the distribution of labels in the dataset is heavily imbalanced, you might want to use stratified sampling, to make sure each set contains representative samples of every class. Another caveat is that some datasets contain duplicate samples, so we need to make sure the partitions are disjoint and we don't use the same samples for both training and evaluation.
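A stratified split like the one described above can be sketched in plain Python; the function name, fractions, and seed below are illustrative, not from the original text:

```python
import random
from collections import defaultdict

def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Split sample indices into train/val/test sets while preserving
    the label proportions of the full dataset in each partition."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)

    train, val, test = [], [], []
    for y, idxs in by_label.items():
        rng.shuffle(idxs)  # shuffle within each class for an unbiased split
        n_val = int(len(idxs) * val_frac)
        n_test = int(len(idxs) * test_frac)
        val.extend(idxs[:n_val])
        test.extend(idxs[n_val:n_val + n_test])
        train.extend(idxs[n_val + n_test:])

    # sanity check: partitions are disjoint and cover every sample
    assert len(set(train) | set(val) | set(test)) == len(labels)
    return train, val, test
```

Because the shuffle and split happen per class, a rare class keeps roughly the same share in every partition instead of potentially vanishing from the validation or test set, which is the failure mode plain random shuffling risks on heavily imbalanced data.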
It's not uncommon to train a model on one source and test it on another source that usually consists of more challenging samples. It can also be practical in some cases to train a model on a large volume of loosely labelled data, such as images crawled from the web, and test it on a small dataset with higher-quality labels, such as images manually annotated by humans. One thing to keep in mind is to choose validation and test sets that reflect the type of data you expect your model to receive in the future.
Error Summary
| Training loss | Validation loss | Diagnosis |
|---|---|---|
| High | High | Underfitting: the model is not even fitting the training set well, so we want to increase capacity by using a larger model or training for longer. |
| Low | High | Overfitting: the model might be memorizing the training samples without learning anything meaningful. To curb overfitting, we can try shrinking the model or using regularization techniques. |
| Low | Low | Good fit: we can use the test set to get an unbiased estimate of generalization performance (metrics: accuracy, precision, and recall). |
| High | Low | Unlikely: debug (are training and validation drawn from the same dataset?) |
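The quadrants of the error summary above can be captured in a small helper; the function name and the loss cutoff are illustrative assumptions, since what counts as a "high" loss depends on the task and loss function:

```python
def diagnose(train_loss, val_loss, threshold=0.5):
    """Map a (training loss, validation loss) pair to the quadrants of the
    error summary. `threshold` is an illustrative cutoff for a "high" loss."""
    high_train = train_loss > threshold
    high_val = val_loss > threshold
    if high_train and high_val:
        return "underfitting: increase capacity or train longer"
    if not high_train and high_val:
        return "overfitting: shrink the model or add regularization"
    if not high_train and not high_val:
        return "good fit: evaluate once on the test set"
    return "unlikely: debug (same dataset for training and validation?)"
```

In practice one would compare loss curves over training rather than a single threshold, but the branching logic mirrors the table directly.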