Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

BFGS is an example of a quasi-Newton method. L-BFGS is the popular "low-memory" variant. Any such method can be used with the two references I gave above. One could also attempt to approximate each of its internal computations to within some tolerance with minibatches. Work in that sort of direction here: https://ei.is.tuebingen.mpg.de/publications/mahhen2015

Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I previously hinted above, the answer may change with the hardware landscape.



> Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I previously hinted above, the answer may change with the hardware landscape.

Not entirely to your point, but related, my understanding is that is no best at everything: methods that work well in one field won't necessarily work in others. See https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...


I agree. But part of it is sociology: people use what their friends use.


I always assumed that when people talk about gradient-descent, they actually mean L-BFGS, i.e. a quasi-Hessian. After all, it can be used as a black-box algorithm that only needs to be told the gradient. At least in quantum optimization, the "simple" (non-quasi-Newton) gradient approach almost never works, whereas L-BFGS does just fine. So, I'm confused why in machine learning, the situation would be different, or why one would choose not to use the "for free" enhancement that L-BFGS provides.


In machine learning "evaluating the gradient" means sweeping over the whole training set. For simple models, stochastic gradient descent will have found a good fit after one or two passes over the dataset. (For neural nets, tens to hundreds of passes over the dataset may be needed.) It's hard for batch L-BFGS to do much in the same number of gradient and function evaluations. No one uses batch steepest gradient descent, which would be the worst of all worlds.

The discussion above was about making stochastic or mini-batch versions of algorithms like L-BFGS. It's possible, but not entirely straightforward, which is why "dumb" stochastic gradient descent is often used.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: