> The final computation of Hoeffding's D involves a formula that normalizes this sum, taking into account the total number of data points and the expected values under the assumption of independence. The result is a measure that ranges from -0.5 to 1, where values near 0 indicate no association, values closer to 1 indicate a strong positive association, and values near -0.5 suggest a strong negative association.
This is incorrect.
Hoeffding's D was not intended to be used as a descriptive statistic to measure the strength or direction of a relationship, it's just a statistic from a nonparametric test for independence. It comes from Hoeffding's 1948 paper A Non-Parametric Test of Independence [1]. Interpreting it as a measure of the strength of a relationship is questionable. The scale is... mostly meaningless except in the qualitative sense that close to 0 is close to independence (maybe--more on that later) and close to 1 is a strong relationship of some kind. But what does a D of 0.2 or 0.6 or 0.9 mean? Who knows! It's certainly not your traditional correlation scale--don't be tempted to interpret it that way.
Interpreting it as a measure of the direction of a relationship is simply wrong. You can easily check that D for two vectors X and Y is the same as D for X and (-Y). It gives you no information about the direction of the relationship. D sometimes being negative is just an artifact of the way it's calculated and scaled.
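Since scipy doesn't ship Hoeffding's D, here's a minimal sketch of the classic rank formula (no ties assumed, n >= 5, made-up data) that you can use to see the Y vs. -Y point numerically:

```python
# Minimal sketch of Hoeffding's D via the classic rank formula (assumes no ties).
import numpy as np
from scipy.stats import rankdata

def hoeffding_d(x, y):
    """Hoeffding's D for paired, tie-free samples with n >= 5."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = rankdata(x)  # ranks of x: 1..n
    s = rankdata(y)  # ranks of y: 1..n
    # q[i] = number of points strictly below (x[i], y[i]) in both coordinates
    q = np.array([np.sum((x < x[i]) & (y < y[i])) for i in range(n)])
    d1 = np.sum(q * (q - 1))
    d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
    d3 = np.sum((r - 2) * (s - 2) * q)
    return 30.0 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3) / (
        n * (n - 1) * (n - 2) * (n - 3) * (n - 4))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.3 * rng.normal(size=200)            # a clearly "positive" relationship
print(hoeffding_d(x, y), hoeffding_d(x, -y))  # the two values match: D is blind to direction
```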
Back to that "maybe close to independence" thing--Hoeffding's D is blind to certain deviations from independence. You can have a joint distribution with clear dependence but Hoeffding's D will be 0. See Section 4 of [2] for some examples.
If you want to use an esoteric measure of the strength of the relationship between two variables, I'd go with the distance correlation instead. At least that has a clear meaning.
[1] https://projecteuclid.org/journals/annals-of-mathematical-st...
[2] https://arxiv.org/pdf/2010.09712.pdf
Thanks, I changed that sentence to be more accurate. As for the usefulness of the measure, I've found it to be extremely handy for teasing out relationships that get missed using regular correlation measures.
You repeat the same error about negative association in a couple other places:
> The final formula for Hoeffding's D combines D_1, D_2, and D_3, along with normalization factors, to produce a statistic that ranges from -0.5 to 1. This range allows for interpretation of the degree of association between the sequences, with values near 0 indicating no association, values closer to 1 indicating a strong positive association, and values near -0.5 indicating a strong negative association.
> And a score near -0.5 suggests they're moving in opposite directions, perhaps clashing rather than complementing each other.
Mutual information is definitely a good measure, but it can struggle with complex, highly non-linear associations, particularly in high-dimensional spaces (because of the curse of dimensionality). Mutual information estimators can also have serious bias/variance issues, especially when you don't have a huge amount of data to work with. Hoeffding's D basically sidesteps all of these problems. The main downside is that it's computationally intensive, much more so than mutual information.
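For a concrete feel of that small-sample sensitivity, here's a rough sketch (the dataset is made up) using scikit-learn's k-NN based MI estimator; varying n_neighbors and the random seed shifts the estimate:

```python
# Rough sketch: how sensitive a k-NN mutual information estimate is to its
# settings on a small sample. The sine relationship is just an illustration.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
x = rng.normal(size=80)                        # deliberately small sample
y = np.sin(3 * x) + 0.2 * rng.normal(size=80)

for k in (3, 5, 10):
    for seed in (0, 1, 2):
        mi = mutual_info_regression(x.reshape(-1, 1), y,
                                    n_neighbors=k, random_state=seed)[0]
        print(f"n_neighbors={k} seed={seed} MI~{mi:.3f}")
```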
Really depends on the context and what you’re trying to do. If you’re trying to come up with an explanatory or causal theory of the relationship between some sequence and thousands of other sequences, then maybe that starts to turn into excessive “data mining.”
If you’re using it more as a form of search (information retrieval), then I think there’s no harm in using it. For example, for ranking relevant embedding vectors.
Apparently it works quite well for finding similar genes (I guess you replace base pairs with integers or something like that). Sometimes you just need a good place to look and then you can confirm things independently.
Honestly, the outcomes of statistical tests are so correlated that if your question is "is there a relationship between these numbers", it really doesn't matter which test you do.
If there's a decent relationship there, you could run Pearson, Hoeffding, Chatterjee, a simple linear regression: it'd be a weird dataset where you get different results.
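For example, a quick sanity check along those lines might look like this (made-up linear data; Chatterjee's xi written out from its no-ties rank formula):

```python
# Sketch: run several association measures on the same dataset and see whether
# they tell the same story.
import numpy as np
from scipy.stats import pearsonr, rankdata, linregress

def chatterjee_xi(x, y):
    """Chatterjee's xi coefficient (assumes no ties in x or y)."""
    order = np.argsort(x)          # sort the pairs by x
    r = rankdata(y[order])         # ranks of y in that order
    n = len(x)
    return 1.0 - 3.0 * np.sum(np.abs(np.diff(r))) / (n * n - 1)

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(size=300)   # an ordinary linear relationship

print("Pearson:", pearsonr(x, y))
print("Chatterjee xi:", chatterjee_xi(x, y))
print("OLS slope p-value:", linregress(x, y).pvalue)
# Hoeffding's D (sketched earlier in the thread) could be added to the same check.
```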
It's always a challenge to know what level to target. I think it's helpful for many people (myself included) to ground the discussion in very concrete, tangible terms that they can easily imagine before getting into abstract definitions. And if you're not one of those people, it's still not so bad to read a few gratuitous sentences with examples, especially since those examples are extensively referenced later in the article. My real focus here was to make things as simple and easy to understand as possible, while still getting into all the nitty gritty details of how to actually compute the thing and understand something about why it works. I find you can usually get "easy and shallow" content or "difficult and deep" content, but not many technical articles go for "easy but deep."
Maybe a few subsection titles like "Introduction", "Use case", and "Let's dive into the math" would help guide the casual reader to the simple part, and the expert straight to the complex part of your message.
It's totally possible to target both the amateur and the expert audience in a single article, but it needs an appropriate structure to achieve this :)
For example:
(1) Abstract: set the direction of your article. What we want, what is usually used (Pearson), what I like to use (Hoeffding's D), and possibly some outcomes
(2) Introduction: your first three paragraphs can go in there, with two subsections (the example, and the state of the art with Pearson)
(3) Pearson details
(4) Hoeffding's D details
(5) Implementation of Hoeffding's D
(6) Conclusion/comparison
Personally, without a clear introduction stating where we start, where we're going, and what the itinerary is, I tend not to read (which is not good for me or for you :) )
Another approach is to create a synthetic joint distribution by combining the marginal distributions independently, then train a random forest classifier to distinguish the synthetic joint from the real joint. If the classifier can't tell them apart, the variables are (to a good approximation) independent. https://arxiv.org/abs/1611.07526
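A minimal sketch of that idea (the dataset and hyperparameters are just illustrative): shuffle one column to build the synthetic joint, then check whether a random forest can beat chance at telling real from synthetic:

```python
# Classifier two-sample test for independence, in the spirit of arXiv:1611.07526.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = np.sin(2 * x) + 0.3 * rng.normal(size=1000)        # a dependent toy pair

real = np.column_stack([x, y])
synthetic = np.column_stack([x, rng.permutation(y)])   # marginals kept, dependence destroyed

data = np.vstack([real, synthetic])
labels = np.r_[np.ones(len(real)), np.zeros(len(synthetic))]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, data, labels, cv=5).mean()
print("classifier accuracy:", acc)   # ~0.5 suggests independence; well above 0.5 suggests dependence
```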
Very cool, I’ve never heard of this before but it makes a lot of sense. I’ll have to try this out and compare how it does on some challenging data sets.
Is there a version of Hoeffding's D for binary variables? For example, for Pearson correlation, the formula reduces to the Phi coefficient when using two binary variables (two events):
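$$\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\bullet}\, n_{0\bullet}\, n_{\bullet 1}\, n_{\bullet 0}}}$$

(with $n_{ab}$ the 2x2 contingency counts and the dotted subscripts the marginal row/column totals)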
I don’t think it really makes sense in the binary context. If there are only two possible values (0 and 1), then you’ll have so many ties and also you won’t be able to get meaningful quadruples. I think the closest analogue in the binary context is probably mutual information, although this is clearly measuring a different thing.
Although, I guess you could theoretically try taking the values in each sequence in groups of N at a time and then interpreting each grouping of binary digits as an integer, and then compute the Hoeffding’s D of the resulting integers. Maybe doing that for a range of N values, like 3 to 8, and averaging the Hoeffding’s D values you get. Not sure if that really makes sense though, I need to try it with some real data!
Funny you should mention that, since I recently responded to a tweet about Chatterjee correlation to mention Hoeffding’s D, and that’s what got me thinking about it again over the past couple days and wanting to build up more intuition for how and why it works:
This was a great read and I loved the sections on intuition. But I kept wondering when the author would circle back to showing how the D stat would perform against the cases they presented for other correlation stats as evidence that those stats were flawed.
Good idea, I should add that. I need to generate some challenging data sets to illustrate things, which is why I left that out (I wrote this all in one shot and didn’t want to stop to generate data).
It's naive to think you can compare vectors without any sort of prior on what this relationship might look like. Hoeffding's D cannot effectively capture relationships between time-shifted series that the (generalised) cross correlation can. Nor can it identify patterns in conditional heteroskedasticity. You will always need to impose some sort of prior when comparing vectors, and Hoeffding's D is rarely the best choice.
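For instance, a crude version of that lag scan looks like this (toy white-noise series with an assumed shift of 10 steps, just for illustration):

```python
# Sketch: a plain pairwise measure on aligned samples sees little, while
# scanning lags (a crude cross-correlation) recovers the shift.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = np.roll(x, 10) + 0.2 * rng.normal(size=500)    # y is x delayed by 10 steps

print("aligned Pearson r:", pearsonr(x, y)[0])     # near zero despite the dependence

best = max(range(-20, 21), key=lambda lag: abs(pearsonr(np.roll(x, lag), y)[0]))
print("best lag:", best, "r at best lag:", pearsonr(np.roll(x, best), y)[0])
```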
I have to disagree. It repeats the same pseudoscientific presentation of correlation, as having to do with the semantics of the data.
Generically, correlation has nothing to do with whether data are related, dependent or independent.
It only is an indication of this *if you already believe* they are dependent. There has to be a prior semantic model of what the data means (what it is a measure of, how reliable it is as a measure of it, etc.) before correlation measures anything at all.
The only reason we would suppose correlated data to be related is that we design experiments to already contain possible dependencies. We do not, as a habit, measure, say, the fall of rain and the beat of a song playing. But these would, given suitable measurements, be correlated.
This becomes absolutely vital to understand when experiments do take this "hoover up everything, arbitrarily" approach. As often they do in the social and psychological sciences.
I added a link to a picture at least! I was going to put more images in there initially, but decided that I didn't want to risk disrupting the flow of the writing, especially for things that I could explain reasonably well using only words. The best images for understanding are sometimes the ones that you form in your own mind as you read!