BFGS is an example of a quasi-Newton method. L-BFGS is the popular "low-memory" ...

davidkretch · on June 21, 2016

> Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I previously hinted above, the answer may change with the hardware landscape.

Not entirely to your point, but related, my understanding is that is no best at everything: methods that work well in one field won't necessarily work in others. See https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...

imurray · on June 21, 2016

I agree. But part of it is sociology: people use what their friends use.

goerz · on June 21, 2016

I always assumed that when people talk about gradient-descent, they actually mean L-BFGS, i.e. a quasi-Hessian. After all, it can be used as a black-box algorithm that only needs to be told the gradient. At least in quantum optimization, the "simple" (non-quasi-Newton) gradient approach almost never works, whereas L-BFGS does just fine. So, I'm confused why in machine learning, the situation would be different, or why one would choose not to use the "for free" enhancement that L-BFGS provides.

imurray · on June 27, 2016

In machine learning "evaluating the gradient" means sweeping over the whole training set. For simple models, stochastic gradient descent will have found a good fit after one or two passes over the dataset. (For neural nets, tens to hundreds of passes over the dataset may be needed.) It's hard for batch L-BFGS to do much in the same number of gradient and function evaluations. No one uses batch steepest gradient descent, which would be the worst of all worlds.

The discussion above was about making stochastic or mini-batch versions of algorithms like L-BFGS. It's possible, but not entirely straightforward, which is why "dumb" stochastic gradient descent is often used.