Be systematic.
Keep a log of every architecture you've tried, what the hyperparameters (layer sizes, learning rate, etc.) were, and what the resulting performance was. As you try more things, you can start seeing patterns about which parameters matter. If you find a bug in your code, be sure to cross out past results that are invalid due to the bug.
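If you want something lighter than a spreadsheet, a plain CSV log is enough. This is a minimal sketch; the file name, the FIELDS columns, and the log_trial helper are all illustrative choices, not part of the project code.

    import csv
    import os

    FIELDS = ["layer_sizes", "learning_rate", "batch_size", "val_accuracy", "notes"]

    def log_trial(path, trial):
        """Append one trial (a dict with the keys in FIELDS) to a CSV log."""
        new_file = not os.path.exists(path)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if new_file:
                writer.writeheader()   # write the header only once
            writer.writerow(trial)

    log_trial("experiments.csv", {
        "layer_sizes": "200", "learning_rate": 0.05, "batch_size": 100,
        "val_accuracy": 0.972, "notes": "baseline, one hidden layer",
    })

Results invalidated by a bug can get a note such as "buggy loss function" rather than being deleted, so the history stays intact.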
Start with a shallow network (just two layers, i.e. one non-linearity).
Deeper networks have exponentially more hyperparameter combinations, and getting even a single one wrong can ruin your performance. Use the small network to find a good learning rate and layer size; afterwards you can consider adding more layers of similar size.
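For reference, "two layers" here means two weight matrices with a single non-linearity between them. Below is a hedged PyTorch sketch of that shape; the project uses its own neural-network module, so only the structure carries over, and the sizes 784, 200, and 10 are example values.

    import torch
    import torch.nn as nn

    class ShallowNet(nn.Module):
        def __init__(self, input_size, hidden_size, output_size):
            super().__init__()
            self.layer1 = nn.Linear(input_size, hidden_size)   # first weight matrix
            self.layer2 = nn.Linear(hidden_size, output_size)  # second weight matrix

        def forward(self, x):
            return self.layer2(torch.relu(self.layer1(x)))     # one non-linearity

    model = ShallowNet(input_size=784, hidden_size=200, output_size=10)
    print(model(torch.zeros(1, 784)).shape)   # torch.Size([1, 10])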
If your learning rate is wrong, none of your other hyperparameter choices matter.
You can take a state-of-the-art model from a research paper and change the learning rate so that it performs no better than random.
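A sensible first step is a coarse sweep over learning rates with everything else held fixed. In this sketch, train_and_evaluate is a hypothetical stand-in for your own training loop that returns validation accuracy:

    def sweep_learning_rates(train_and_evaluate):
        results = {}
        for lr in [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]:   # roughly log-spaced
            results[lr] = train_and_evaluate(learning_rate=lr)
        best_lr = max(results, key=results.get)
        print(f"best learning rate: {best_lr} (accuracy {results[best_lr]:.3f})")
        return best_lr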
Smaller batches require lower learning rates.
When experimenting with different batch sizes, be aware that the best learning rate may be different depending on the batch size.
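One common heuristic (an assumption here, not a project requirement) is to scale the learning rate roughly in proportion to the batch size, then fine-tune from that starting point:

    def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
        # e.g. if lr = 0.1 worked with batches of 100, start near 0.025 for batches of 25
        return base_lr * (new_batch_size / base_batch_size)

    print(scaled_learning_rate(0.1, 100, 25))   # 0.025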
Making the network too wide generally doesn't hurt accuracy too much.
If you keep making the network wider, accuracy will gradually decline, but computation time will increase quadratically in the layer size; you're likely to give up due to excessive slowness long before the accuracy falls too much. The full autograder for all parts of the project takes 2-12 minutes to run with staff solutions; if your code is taking much longer, you should check it for efficiency.
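The quadratic growth comes from the weight matrices: a layer whose input and output both have the hidden size holds hidden_size * hidden_size weights, so doubling the width roughly quadruples the work for that layer. A quick illustration (the helper name is made up):

    def hidden_layer_params(hidden_size):
        return hidden_size * hidden_size + hidden_size   # weights + biases

    for h in [100, 200, 400]:
        print(h, hidden_layer_params(h))
    # 100 -> 10100, 200 -> 40200, 400 -> 160400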
If your model is returning Infinity or NaN, your learning rate is probably too high for your current architecture.
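A small guard like the following (check_loss is a hypothetical helper, not part of the provided code) catches this early so you can stop the run and retry with a lower learning rate:

    import math

    def check_loss(loss_value, learning_rate):
        if math.isnan(loss_value) or math.isinf(loss_value):
            raise ValueError(
                f"loss diverged ({loss_value}); retry with a learning rate below {learning_rate}"
            )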
Recommended values for your hyperparameters (a sketch for sampling them at random follows this list):
Hidden layer sizes: between 10 and 400
Batch size: between 1 and the size of the dataset. For Q2 and Q3, we require that the total size of the dataset be evenly divisible by the batch size.
Learning rate: between 0.001 and 1.0
Number of hidden layers: between 1 and 3
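If you'd rather not pick values by hand, one option is to sample them at random from the ranges above. This is only a sketch; dataset_size and the helper names are illustrative, and divisors() enforces the Q2/Q3 requirement that the batch size evenly divide the dataset:

    import random

    def divisors(n):
        return [d for d in range(1, n + 1) if n % d == 0]

    def sample_hyperparameters(dataset_size):
        return {
            "hidden_size": random.randint(10, 400),
            "num_hidden_layers": random.randint(1, 3),
            "batch_size": random.choice(divisors(dataset_size)),
            # sample the learning rate on a log scale between 0.001 and 1.0
            "learning_rate": 10 ** random.uniform(-3, 0),
        }

    print(sample_hyperparameters(dataset_size=1000))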