The network is sized so that it can learn the training data reasonably well (e.g. via hyper-parameter optimization). If the training data contains variation that never occurs in the real application (such as the rotated letters mentioned in the article), an appropriately sized network will still learn that variation, but the resulting capacity is overkill for the application at hand.
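As a minimal sketch of what "sizing via hyper-parameter optimization" can look like in practice: the snippet below searches over a few hidden-layer widths and picks the one with the best cross-validated score. The dataset, widths, and all parameter choices are illustrative assumptions, not taken from the article.

```python
# Illustrative sketch: pick a network size by cross-validated search.
# Dataset and candidate sizes are hypothetical examples.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic classification problem standing in for the real data.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Try several capacities; the search keeps whichever width fits the
# data best, rather than defaulting to the largest network.
search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_grid={"hidden_layer_sizes": [(16,), (64,), (256,)]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_["hidden_layer_sizes"])
```

If the training set includes variation the deployed system will never see, the search tends to select a larger width than the deployment task alone would need, which is exactly the overkill described above.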