Bayes is guaranteed to overfit (yulingyao.com)
157 points by Ambolia on May 28, 2023 | 46 comments


As the author admits at the end, this is rather misleading. In normal usage, "overfit" is by definition a bad thing (it wouldn't be "over" if it were good). And the argument given does nothing to show that Bayesian inference is doing anything bad.

To take a trivial example, suppose you have a uniform(0,1) prior for the probability of a coin landing heads. Integrating over this gives a probability for heads of 1/2. You flip the coin once, and it lands heads. If you integrate over the posterior given this observation, you'll find that the predictive probability of the observed value, heads, is now 2/3, greater than it was under the prior.
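
A quick sketch of that arithmetic (assuming the standard conjugate Beta-Bernoulli update, where the uniform(0,1) prior is Beta(1,1); the names are mine):

  # Predictive probability of heads under a Beta(a, b) belief,
  # where a and b are pseudo-counts of heads and tails.
  def predictive_heads(a, b):
      return a / (a + b)

  print(predictive_heads(1, 1))      # 0.5    - prior predictive, Beta(1,1)
  print(predictive_heads(1 + 1, 1))  # 0.666... - posterior predictive after one observed head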

And that's OVERFITTING, according to the definition in the blog post.

Not according to any sensible definition, however.


I was writing another comment based on that same example and his leave-one-out calculations (at least as far as I understood them).

Comparing posterior to prior is the extreme case of a leave-one-out procedure: when you leave out the only data point, there is nothing left.

The divergence between the data and the model goes down when we include information about the data in the model. That doesn't seem a controversial opinion. (That's how the blog post is introduced here: https://twitter.com/YulingYao/status/1662284440603619328)

---

If the data consist of two flips, they are either equal or different (the former becomes more likely as the true probability diverges from 0.5).

a) If the two flips are the same, the posterior predictive probability of that outcome is 3/4. The in-sample log score is 2 log(3/4) ≈ -0.6.

When we check the out-of-sample log score for each flip, using the 2/3 posterior predictive obtained from the other, we get log(2/3) ≈ -0.4 in each case.

b) If the two flips are different, the posterior predictive probability of each outcome is still 1/2. The in-sample log score is 2 log(1/2) = 2 × (-0.7) ≈ -1.4.

When we check the out-of-sample log score for each flip, using the 1/3 posterior predictive for that outcome obtained from the other, we get log(1/3) ≈ -1.1 in each case.
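
A small script to check those numbers (assuming the conjugate Beta-Bernoulli update for the uniform-prior coin; variable names are my own):

  import math

  def predictive(a, b, heads):
      # probability of one flip under a Beta(a, b) posterior
      return a / (a + b) if heads else b / (a + b)

  def scores(flips):
      # in-sample and leave-one-out log scores, uniform Beta(1,1) prior
      a = 1 + sum(flips)                   # prior pseudo-count + observed heads
      b = 1 + len(flips) - sum(flips)      # prior pseudo-count + observed tails
      in_sample = sum(math.log(predictive(a, b, f)) for f in flips)
      loo = sum(math.log(predictive(a - f, b - (not f), f)) for f in flips)
      return in_sample, loo

  print(scores([True, True]))    # a) 2*log(3/4) ~ -0.58 in-sample, 2*log(2/3) ~ -0.81 LOO
  print(scores([True, False]))   # b) 2*log(1/2) ~ -1.39 in-sample, 2*log(1/3) ~ -2.20 LOO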


When there is a small amount of information, the variance of any estimate is very big, and this explains what happens in that example. Overfitting implies different behavior in training and in test, and this is related to a big variance in the estimate of the error. So a small amount of information implies that any model suffers from overfitting and big variance; it's a general result not related specifically to Bayes.


I'm on my phone so I haven't tried to work through the math to see where the error is, but the author's conclusion is wrong and the counterexample is simple.

> Unless in degenerating cases (the posterior density is point mass), then the harmonic mean inequality guarantees a strict inequality p ( y i | y − i ) < p ( y i | y ) , for any point i and any model.

Let y_1, ... y_n be iid from a Uniform(0,theta) distribution, with some nice prior on theta (e.g. Exponential(1)). Then the posterior for theta, and hence the predictive density for a new y_i, depends only on max(y_1, ..., y_n). So for all but one of the n observations the author's strict inequality does not hold.


The author mentions he defines overfitting as “Test error is always larger than training error”. Is there an algorithm or model where that’s not the case?


Yeah, that's a crummy definition. You can easily force "Test error is always larger than training error" for any model type.


It's not even about "forcing". This is such common and expected behavior that it's surprising (and suspicious) when it isn't the case.


yes, that's better said.


A pedantic example: a model that ignored the training data would do just as well on the training set as on the test set.


If you do not train on the training set, then there is no training set, and your example is degenerate.


We've trained on the training set, but our fit() implementation was, "accept the input, do nothing with it, and then return." When we then evaluate the model on the test set and training set, assuming there isn't a large covariate shift between the two or something, we'll get approximately equal values.

So it's a degenerate case, but not of the training set. (And presumably that's partly what they meant by "pedantic".)
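
As a toy illustration (hypothetical code, not from the thread):

  import numpy as np

  class DoNothingModel:
      # accept the training data, do nothing with it, and return
      def fit(self, X, y):
          return self

      def predict(self, X):
          return np.zeros(len(X))   # same constant prediction everywhere

  rng = np.random.default_rng(0)
  X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)
  X_test, y_test = rng.normal(size=(100, 3)), rng.normal(size=100)

  model = DoNothingModel().fit(X_train, y_train)
  mse = lambda y, yhat: float(np.mean((y - yhat) ** 2))
  # train and test error come out approximately equal, since the model never used either
  print(mse(y_train, model.predict(X_train)), mse(y_test, model.predict(X_test)))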


You see this regularly in practice when aggressive data augmentation is used, which obviously is only applied to the training data. But, of course, you'd still look 'overfit' if you fed in unaugmented training data.


I think they mean "always (statistically) significantly larger". They're probably imagining that something like cross-validation would make test errors approximately equal to training errors, but if you consistently see significantly larger errors, then you've overfit.


When you find the minimum of your validation curve, rarely if ever is your test loss lower than your training loss. I don't think this necessarily means you're overfitting.


Under the Probably Approximately Correct (PAC) framework, there's an x% chance your error on the empirical distribution deviates more than ε in either direction from your error on the actual distribution, so one can assume similar things between two empirical distributions.
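
The standard single-hypothesis version of that statement (Hoeffding's inequality for a fixed hypothesis h and a loss bounded in [0,1], with empirical error \hat{R}_n on n samples and true error R) is

  P\left(\,\lvert \hat{R}_n(h) - R(h) \rvert > \epsilon\,\right) \le 2 \exp\!\left(-2 n \epsilon^2\right)

and PAC-style generalization guarantees are built by union-bounding this over the hypothesis class (or a cover of it).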


Yes and no. Suppose, for example, you give MNIST to an SVM and fit the model, then test it only on 0 and 1 digits, which are generally well discriminated: you'll get almost 100% accuracy in test versus 97% or less in training. (You'd probably need some preprocessing, like PaCMAP or UMAP or whatever, but the point is the same.)

However, that's just because I chose the right data to test it on. So you can't really say much about a model using that definition.
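
A rough sketch of that experiment (using scikit-learn's small digits dataset as a stand-in for MNIST; the exact numbers will depend on the kernel and preprocessing):

  from sklearn.datasets import load_digits
  from sklearn.model_selection import train_test_split
  from sklearn.svm import SVC

  X, y = load_digits(return_X_y=True)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

  clf = SVC().fit(X_tr, y_tr)          # fit on all ten digit classes
  mask = (y_te == 0) | (y_te == 1)     # but evaluate only on the easy 0/1 test digits
  print("train accuracy, all digits:", clf.score(X_tr, y_tr))
  print("test accuracy, 0/1 only:   ", clf.score(X_te[mask], y_te[mask]))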


It seems that the post is comparing a predictive distribution conditioned on N data points to one conditioned on N-1 data points. The latter is a biased estimate of the former (e.g., https://users.aalto.fi/~ave/publications/VehtariLampinen_NC2...)


This. Leave-one-out doesn't turn an in-sample prediction problem into an equivalent out-of-sample problem, but into a slightly less skilful one.


I don't follow the math. WLOG, for N total datapoints, let y_i = y_N. Then the leave-one-out posterior predictive is

  \int p(y_N|θ)p(θ|{y_1...y_{N-1}}) dθ = p(y_N|{y_1...y_{N-1}})
by the law of total probability.

Expanding the leave-one-out posterior (via Bayes' rule), we have

  p(θ|{y_1...y_{N-1}}) = p({y_1...y_{N-1}}|θ)p(θ)/\int p({y_1...y_{N-1}}|θ')p(θ') dθ'
which when plugged back into the first equation is

  \int p(y_N|θ) p({y_1...y_{N-1}}|θ)p(θ) dθ/(\int p({y_1...y_{N-1}}|θ')p(θ') dθ')
I don't see how this simplifies to the harmonic mean expression in the post.

Regardless, the author is asserting that

  p(y_N|{y_1...y_{N-1}}) ≤ p(y_N|{y_1...y_N})
which seems intuitively plausible for any trained model — given a model trained on data {y_1...y_N}, performing inference on any datapoint y_1...y_N in the training set will generally be more accurate than performing inference on a datapoint y_{N+1} not in the training set.
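
A quick numerical check of that inequality on a toy conjugate model (Normal data with known variance and a Normal prior on the mean; my own example, not the post's):

  import numpy as np
  from scipy.stats import norm

  def log_predictive(y_i, y_cond, prior_var=1.0, noise_var=1.0):
      # log p(y_i | y_cond) for y ~ N(mu, noise_var), prior mu ~ N(0, prior_var)
      n = len(y_cond)
      post_var = 1.0 / (1.0 / prior_var + n / noise_var)
      post_mean = post_var * np.sum(y_cond) / noise_var
      return norm.logpdf(y_i, loc=post_mean, scale=np.sqrt(post_var + noise_var))

  rng = np.random.default_rng(0)
  y = rng.normal(loc=0.3, scale=1.0, size=10)

  for i in range(len(y)):
      loo = log_predictive(y[i], np.delete(y, i))   # conditioned on the other N-1 points
      full = log_predictive(y[i], y)                # conditioned on all N points (in-sample)
      print(i, loo < full)                          # holds for every point here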


It's reassuring to see that I'm not the only one who finds those equations far from obvious. I didn't spend much time trying to understand the derivation though - as you wrote the result doesn't seem interesting anyway.


I got lost in the second equation, when the author says

p(y_i|y_{-i}) = \int p(y_i|\theta) p(\theta|y) \frac{p(y_i|\theta)^{-1}}{\int p(y_i|\theta^\prime)^{-1} p(\theta^\prime|y) \, d\theta^\prime} d\theta

why is that? Can someone explain the rationale behind this?



Here's a good recent paper that looks at this problem and provides remedies in a Bayesian manner. https://arxiv.org/abs/2202.11678


Quote

|How do we compare between hypotheses that are entirely consistent with observations?

|... Occam's razor

The answer is: build an experiment that makes it obvious. According to Occam's razor, the Sun was burning coal, until we found out it doesn't.


Their paper is more for the case of "we can't gather more data, so what to do?" but your solution is in line with optimal experimental design and choosing a utility function to distinguish between models using as little data as possible.


This is a pretty silly blog post. The complaint comes down to "a Bayesian model will always place a lower probability on outcomes that have not been observed than on outcomes which have been observed already" which... of course it would! In what situation where you're trying to understand an exchangeable set of outcomes would you think you should put more probability on things that you haven't seen than on those you have? The only things I can dream up violate exchangeability (e.g. a finite bag of different-colored marbles, where you draw without replacement).


So the argument is essentially that "not only if you pick the best thing fitting your finite data, but even if you take a weighted average over things that fit your finite data, proportionally to how well they fit it, you still almost surely end up with something that fits your finite sample better than the general population (that this sample was drawn from)"?


Typically, we think of overfitting and underfitting as exclusive properties. IMO a large problem here is that the author's definition of overfitting is not inconsistent with underfitting. (Underfitting, in general, means a poor fit on both the training and test sets.)

For example, a simple model might underfit in general, yet still fit the training set better than the test set. If both are poor fits, it is clearly underfitting and not overfitting. Yet by the article's definition, it would be underfitting and overfitting simultaneously. So I suspect this is not an ideal definition.


Am I missing something or is this argument only as strong as the (perfectly reasonable) claim that all models overfit unless they have a regularisation term?


The author is using a slightly different (but not wrong) definition of overfitting that possibly you are not used to.


We can safely say that Bayes is guaranteed to underfit compared to MLE.


Bayesian statistics is dumb. What's the point of using prior assumptions that are based on speculation? It's better to admit your lack of knowledge, not jump to conclusions without sufficient data, and process data in an unbiased manner.


The prior represents your current state of knowledge, whatever that may be. If you have a total lack of knowledge, your prior is uniform over all possible states.

It is never proper to fabricate a biased (e.g. non-uniform) prior without any knowledge/information; that is not in any way part of Bayesian inference. If you do have some extremely weak evidence, you use it accordingly with an extremely weak prior.

To paraphrase E.T. Jaynes, the rules of probability theory (e.g. Bayes Theorem) are the unique logically consistent way to reason about uncertainty.


I'm not sure total lack of knowledge can ever be encoded into a prior distribution, and if it can, I think the Jeffreys prior makes a more compelling case than a uniform prior. After all, to believe that all possible values of the prior parameter are equally likely is to believe that all possible values of its square are not.
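
(To make that last point concrete: if \theta \sim Uniform(0,1), the change-of-variables formula gives the density of \phi = \theta^2 as

  p(\phi) = p(\theta)\left|\frac{d\theta}{d\phi}\right| = \frac{1}{2\sqrt{\phi}}, \qquad 0 < \phi < 1,

which is far from uniform. The Jeffreys prior is constructed to be invariant under exactly this kind of reparameterization.)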


How can your prior be uniform if the hypothesis class is unbounded?


How can the hypothesis class be unbounded?

Anyway, you can define a sequence of solutions with bounded uniform priors and calculate the limiting solution. For any given data set, as the endpoints of the intervals go to +/- infinity, the solution will converge to the uniform-prior one, if it exists.


I discussed this in another reply in this thread, but I can't personally think of any examples where the hypothesis class is unbounded. The laws of physics, computation, the human mind, measuring instruments, etc. all impose bounds on real world problems.

As someone who does Bayesian inference a lot for my work (computational biology), I very often use uniform priors, but the structure of all real world problems I have ever encountered allows me to specify hard bounds on the edges of non-zero probability.


This isn't always possible, but sometimes you can define what's called an improper prior p(µ), such that even if ∫p(µ)dµ is not finite, the posterior distribution p(µ|x) is.

A common example is when p(x|µ,v) is a Gaussian dist. with a prior on the mean set to p(µ)=1.
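
Spelling that example out (known variance v, flat prior p(\mu) = 1 over the whole real line; the standard textbook calculation):

  p(\mu \mid x_1,\dots,x_n) \propto 1 \cdot \prod_{i=1}^n \mathcal{N}(x_i \mid \mu, v) \propto \exp\!\left(-\frac{n(\mu - \bar{x})^2}{2v}\right),

so the posterior is \mathcal{N}(\bar{x}, v/n), which is perfectly proper even though the prior doesn't integrate to a finite value.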


Too bad that the prior is fixed and you cannot change it to represent your lack of knowledge, eh?


The prior is a way to condense knowledge into math. A prior doesn't make sense if there is a lack of knowledge.


What does “lack of knowledge” mean? If it means “any value is as plausible as any other as far as I know” there is a prior for that. Etc.


> “any value is as plausible as any other as far as I know”

That isn't possible to use as a prior, as a uniform distribution from negative to positive infinity would have to be zero everywhere.


Can you give an example of a problem where there is no possible prior information, and states all the way from negative to positive infinity are equally plausible? I don't think the laws of physics (not to mention, the practical limits of computation) allow for real world inference problems where there are no bounds on a prior of any kind.

Failing that, I don't think there are any human usable data sources that could report observations over an infinite interval.


It’s easy to use a uniform prior in a suitably huge interval - say from minus one gazillion to plus one gazillion - and you can look at the limit when the endpoints go to minus/plus infinity if you really have doubts about whether the interval was huge enough for your problem to fit comfortably within it.

If you don’t think that this is possible that says more about you than about the shortcomings of Bayesian statistics.


What would you do if you see an observation outside of the interval?


Use a larger interval?

As I said, you can define the improper uniform prior solution as the limit of a sequence of solutions corresponding to a sequence of increasingly wider intervals with endpoints that go to +/-infinity.

(And as I said, you can start with a suitably huge region. Say that you want to determine the position of something and use a uniform prior that extends to a distance of 10^27m - a perfectly bounded prior from a mathematical point of view that covers the whole observable universe. If you observe something outside it, it’s not with the prior that you have a problem.)



