> A typical approach is to generate synthetic data based on existing ground truth. DoorDash did this via random text augmentation... During model training, they had a ratio of 100 synthetic labels to 1 actual label.
i'm sure this is well understood by practitioners but this isn't intuitive to me. when you synthetically generate data, you are resampling an observed distribution, but you're not materially going to get out something different than what you put in. (just read the linked cloudflare post - ok so we are generating negative examples and there's a lot more negative than positive - but what if we accidentally generate positives?)
what's a math/statistics intuition for why generating synthetic data from existing data, at a 100-to-1 ratio, works so well?
It’s more of a regularization method. It can generate a lot of “additional” data that is mostly known to have the same label, which can prevent the model from overfitting to the 1 instance you actually have.
It’s most successful with images - where it’s really easy to generate significant modifications that are definitely still semantically the same. (Crop, translate, blur, desaturate, …) How many true labels would you need to learn to be invariant to all that?
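For concreteness, here's a minimal sketch of that kind of label-preserving image pipeline using torchvision. The transform names are standard torchvision APIs; the specific parameters are just illustrative, not anything from the post or from DoorDash's setup.

```python
# Label-preserving augmentations: crop, translate, blur, desaturate.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),       # crop
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # translate
    transforms.GaussianBlur(kernel_size=5),                    # blur
    transforms.ColorJitter(saturation=0.5),                    # desaturate-ish
    transforms.ToTensor(),
])

# Each call produces a different image that (almost certainly) keeps the same
# label, so one true label stands in for many "nearby" training examples:
# augmented = augment(pil_image)
```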
> when you synthetically generate data ... you're not materially going to get
> out something different than what you put in
Well, yes and no.
If I give you five points and tell you that they are actually samples from a normal distribution, you could easily generate lots of samples with the same mean and variance.
If you were to then compute the mean and variance of this new and "improved" dataset, you would get results very nearly the same as the mean and variance of the original data. It would be fair to note at this point that the expanded dataset contains no information that the original didn't already have.
On the other hand, if you generate 10,000 synthetic points in the same way and compute the 99th percentile of these points, you will get a more interesting result than if you simply computed the 99th percentile of the original points.
This makes it seem that the augmented dataset has better information in it than the original sample.
In fact, the augmented dataset is only better because our estimation algorithm is deficient. A better algorithm would look at the mean and variance and find the 99th percentile of the corresponding normal distribution. This improved algorithm would get nearly the same result with the original or the augmented data.
In this simple example, we can see and remedy the defect in our estimation algorithm, but with most machine learning methods it is much, much harder to remedy the algorithms' appetite for data. Augmentation is, however, a nice and simple alternative.
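A quick numerical sketch of that argument (the seed and sample sizes are arbitrary, just for illustration):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
original = rng.normal(loc=0.0, scale=1.0, size=5)   # the five observed points

# Naive estimator: empirical 99th percentile of whatever sample you hand it.
naive_on_original = np.percentile(original, 99)      # essentially the max of 5 points

# "Augment": resample 10,000 points from a normal fitted to the five points.
mu, sigma = original.mean(), original.std(ddof=1)
augmented = rng.normal(mu, sigma, size=10_000)
naive_on_augmented = np.percentile(augmented, 99)

# Better estimator: use the fitted model directly, no augmentation needed.
model_based = norm.ppf(0.99, loc=mu, scale=sigma)

print(naive_on_original, naive_on_augmented, model_based)
# The augmented estimate and the model-based estimate land close together;
# the "gain" from augmentation is really just papering over the naive estimator.
```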
A less trivial example is the problem of finding a linear classifier for sets of points in the plane. If we don't have a lot of samples, we will often have the problem that there are an infinite number of solutions that perfectly separate the classes in the training data. We would like a better answer, however, one that is likely to work well on data we haven't seen yet.
We have several choices of learning algorithms. One of the simplest is logistic regression, but that often doesn't converge well if the training set is small. We could add regularization and use fancier algorithms like ridge regression or LASSO or support vector methods to get a better result.
OR
We could just add samples that are "pretty near" each of our training examples. With a sufficiently sloppy definition of pretty near, or with enough added samples, even the simplest algorithm will give us nice results. If the samples we add are normally distributed around the data, we get something like ridge regression. If the distances of the new samples from the originals are exponentially distributed, we get LASSO.
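A minimal sketch of the Gaussian-jitter version of that idea (the tiny dataset, noise scale, and copy count are made up for illustration, not from the post):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A tiny, perfectly separable 2-D training set.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]])
y = np.array([0, 0, 1, 1])

# Augment: add samples "pretty near" each original point (Gaussian jitter,
# i.e. the ridge-like case described above).
n_copies, sigma = 100, 0.3
X_aug = np.repeat(X, n_copies, axis=0) + rng.normal(0, sigma, size=(len(X) * n_copies, 2))
y_aug = np.repeat(y, n_copies)

# A plain, essentially unregularized logistic regression now settles on a
# sensible boundary instead of chasing a perfect separation of four points.
clf = LogisticRegression(C=1e6).fit(X_aug, y_aug)
print(clf.coef_, clf.intercept_)
```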
If I were stuck on a desert island and had to implement regularization for linear classifiers, data augmentation is a pretty straightforward way to do it. For more complicated models or more complicated data, it rapidly becomes a really good approach.
> For example, “Is this nudity?” is more objective than “Is this adult content?”
I know it’s only an example, but it’s annoying that nudity is such a problem on the internet. It’s easy to block all nudity, but doing so also blocks a lot of non-adult content.
There's Ancient Greek-style statuary all over my town, penis, vagina and all. It's out in public and very few people care, because it's art, and it's in appropriate places like the public gardens. Hiding it away would be an overreaction.
Yet, somehow, seeing naked art on a private device is a terrible thing.
What counts as "NSFW" strongly reflects the internet's American origins. Innocent non-sexual nudity of any kind? NSFW. Violence and vile racism? Almost always SFW.
huge, huge fan of Eugene's blog. he is almost singlehandedly documenting the industry SOTA of dozens of ML efforts in practice, doing the hard work of collating, comparing, and contrasting published posts from various companies. Work that you'd have to do yourself if you were tasked with something like this, but now you have an authoritative, well-researched and well-argued source. For free. (I mean I guess you could also pay him for extra consulting)