> The final computation of Hoeffding's D involves a formula that normalizes this sum, taking into account the total number of data points and the expected values under the assumption of independence. The result is a measure that ranges from -0.5 to 1, where values near 0 indicate no association, values closer to 1 indicate a strong positive association, and values near -0.5 suggest a strong negative association.
This is incorrect.
Hoeffding's D was not intended to be used as a descriptive statistic to measure the strength or direction of a relationship, it's just a statistic from a nonparametric test for independence. It comes from Hoeffding's 1948 paper A Non-Parametric Test of Independence [1]. Interpreting it as a measure of the strength of a relationship is questionable. The scale is... mostly meaningless except in the qualitative sense that close to 0 is close to independence (maybe--more on that later) and close to 1 is a strong relationship of some kind. But what does a D of 0.2 or 0.6 or 0.9 mean? Who knows! It's certainly not your traditional correlation scale--don't be tempted to interpret it that way.
Interpreting it as a measure of the direction of a relationship is simply wrong. You can easily check that D for two vectors X and Y is the same as D for X and (-Y). It gives you no information about the direction of the relationship. D sometimes being negative is just an artifact of the way it's calculated and scaled.
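Since scipy doesn't ship Hoeffding's D, here's a minimal sketch of the classic rank formula (no ties assumed, n >= 5, made-up data) that you can use to see the Y vs. -Y point numerically:

```python
# Minimal sketch of Hoeffding's D via the classic rank formula (assumes no ties).
import numpy as np
from scipy.stats import rankdata

def hoeffding_d(x, y):
    """Hoeffding's D for paired, tie-free samples with n >= 5."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    r = rankdata(x)  # ranks of x: 1..n
    s = rankdata(y)  # ranks of y: 1..n
    # q[i] = number of points strictly below (x[i], y[i]) in both coordinates
    q = np.array([np.sum((x < x[i]) & (y < y[i])) for i in range(n)])
    d1 = np.sum(q * (q - 1))
    d2 = np.sum((r - 1) * (r - 2) * (s - 1) * (s - 2))
    d3 = np.sum((r - 2) * (s - 2) * q)
    return 30.0 * ((n - 2) * (n - 3) * d1 + d2 - 2 * (n - 2) * d3) / (
        n * (n - 1) * (n - 2) * (n - 3) * (n - 4))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.3 * rng.normal(size=200)            # a clearly "positive" relationship
print(hoeffding_d(x, y), hoeffding_d(x, -y))  # the two values match: D is blind to direction
```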
Back to that "maybe close to independence" thing--Hoeffding's D is blind to certain deviations from independence. You can have a joint distribution with clear dependence but Hoeffding's D will be 0. See Section 4 of [2] for some examples.
If you want to use an esoteric measure of the strength of the relationship between two variables, I'd go with the distance correlation instead. At least that has a clear meaning.
[1] https://projecteuclid.org/journals/annals-of-mathematical-st...
[2] https://arxiv.org/pdf/2010.09712.pdf
Thanks, I changed that sentence to be more accurate. As for the usefulness of the measure, I've found it to be extremely handy for teasing out relationships that get missed using regular correlation measures.
You repeat the same error about negative association in a couple other places:
> The final formula for Hoeffding's D combines D_1, D_2, and D_3, along with normalization factors, to produce a statistic that ranges from -0.5 to 1. This range allows for interpretation of the degree of association between the sequences, with values near 0 indicating no association, values closer to 1 indicating a strong positive association, and values near -0.5 indicating a strong negative association.
> And a score near -0.5 suggests they're moving in opposite directions, perhaps clashing rather than complementing each other.
Mutual information is definitely a good measure, but it can struggle with complex, highly non-linear associations, particularly in high-dimensional spaces (because of the curse of dimensionality). Mutual information estimators can also have serious bias/variance issues, especially when you don't have a huge amount of data to work with. Hoeffding's D basically sidesteps all of these problems. The main downside is that it's computationally intensive, much more so than mutual information.
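For a concrete feel of that small-sample sensitivity, here's a rough sketch (the dataset is made up) using scikit-learn's k-NN based MI estimator; varying n_neighbors and the random seed shifts the estimate:

```python
# Rough sketch: how sensitive a k-NN mutual information estimate is to its
# settings on a small sample. The sine relationship is just an illustration.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
x = rng.normal(size=80)                        # deliberately small sample
y = np.sin(3 * x) + 0.2 * rng.normal(size=80)

for k in (3, 5, 10):
    for seed in (0, 1, 2):
        mi = mutual_info_regression(x.reshape(-1, 1), y,
                                    n_neighbors=k, random_state=seed)[0]
        print(f"n_neighbors={k} seed={seed} MI~{mi:.3f}")
```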
Really depends on the context and what you’re trying to do. If you’re trying to come up with an explanatory or causal theory of the relationship between some sequence and thousands of other sequences, then maybe that starts to turn into excessive “data mining.”
If you’re using it more as a form of search (information retrieval), then I think there’s no harm in using it. For example, for ranking relevant embedding vectors.
Apparently it works quite well for finding similar genes (I guess you replace base pairs with integers or something like that). Sometimes you just need a good place to look and then you can confirm things independently.
Honestly, the outcomes of statistical tests are so correlated that if your question is "is there a relationship between these numbers", it really doesn't matter which test you do.
If there's a decent relationship there, you could run Pearson, Hoeffding, Chatterjee, a simple linear regression: it'd be a weird dataset where you get different results.
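For example, a quick sanity check along those lines might look like this (made-up linear data; Chatterjee's xi written out from its no-ties rank formula):

```python
# Sketch: run several association measures on the same dataset and see whether
# they tell the same story.
import numpy as np
from scipy.stats import pearsonr, rankdata, linregress

def chatterjee_xi(x, y):
    """Chatterjee's xi coefficient (assumes no ties in x or y)."""
    order = np.argsort(x)          # sort the pairs by x
    r = rankdata(y[order])         # ranks of y in that order
    n = len(x)
    return 1.0 - 3.0 * np.sum(np.abs(np.diff(r))) / (n * n - 1)

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 2.0 * x + rng.normal(size=300)   # an ordinary linear relationship

print("Pearson:", pearsonr(x, y))
print("Chatterjee xi:", chatterjee_xi(x, y))
print("OLS slope p-value:", linregress(x, y).pvalue)
# Hoeffding's D (sketched earlier in the thread) could be added to the same check.
```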
It's always a challenge to know what level to target. I think it's helpful for many people (myself included) to ground the discussion in very concrete, tangible terms that they can easily imagine before getting into abstract definitions. And if you're not one of those people, it's still not so bad to read a few gratuitous sentences with examples, especially since those examples are extensively referenced later in the article. My real focus here was to make things as simple and easy to understand as possible, while still getting into all the nitty gritty details of how to actually compute the thing and understand something about why it works. I find you can usually get "easy and shallow" content or "difficult and deep" content, but not many technical articles go for "easy but deep."
Maybe a few subsection titles like "Introduction", "Use case", and "Let's dive into the math" would help guide the casual reader to the simple part, and the expert straight to the complex part of your message.
It's totally possible to target both the amateur and the expert audience in a single article, but it needs an appropriate structure to achieve this :)
For example:
(1) Abstract: set the direction of your article. What we want, what is usually used (Pearson), what I like to use (Hoeffding's D), and possibly some outcomes
(2) Introduction: your first three paragraphs can go in there, with two subsections (the example, and the state of the art with Pearson)
(3) Pearson details
(4) Hoeffding's D details
(5) Implementation of Hoeffding's D
(6) Conclusion/comparison
Personally, without a clear introduction stating where we start, where we're going, and what the itinerary is, I tend not to read (which is not good for me or for you :) )
Another approach is to create a synthetic joint distribution by combining the marginal distributions independently, then train a random forest classifier to distinguish the synthetic joint from the real joint. If the classifier can't tell them apart, the variables are (to a good approximation) independent. https://arxiv.org/abs/1611.07526
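A minimal sketch of that idea (the dataset and hyperparameters are just illustrative): shuffle one column to build the synthetic joint, then check whether a random forest can beat chance at telling real from synthetic:

```python
# Classifier two-sample test for independence, in the spirit of arXiv:1611.07526.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
x = rng.normal(size=1000)
y = np.sin(2 * x) + 0.3 * rng.normal(size=1000)        # a dependent toy pair

real = np.column_stack([x, y])
synthetic = np.column_stack([x, rng.permutation(y)])   # marginals kept, dependence destroyed

data = np.vstack([real, synthetic])
labels = np.r_[np.ones(len(real)), np.zeros(len(synthetic))]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc = cross_val_score(clf, data, labels, cv=5).mean()
print("classifier accuracy:", acc)   # ~0.5 suggests independence; well above 0.5 suggests dependence
```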
Very cool, I’ve never heard of this before but it makes a lot of sense. I’ll have to try this out and compare how it does on some challenging data sets.
Is there a version of Hoeffding's D for binary variables? For example, for Pearson correlation, the formula reduces to the Phi coefficient when using two binary variables (two events):
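$$\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}{\sqrt{n_{1\bullet}\, n_{0\bullet}\, n_{\bullet 1}\, n_{\bullet 0}}}$$

(with $n_{ab}$ the 2x2 contingency counts and the dotted subscripts the marginal row/column totals)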
I don’t think it really makes sense in the binary context. If there are only two possible values (0 and 1), then you’ll have so many ties and also you won’t be able to get meaningful quadruples. I think the closest analogue in the binary context is probably mutual information, although this is clearly measuring a different thing.
Although, I guess you could theoretically try taking the values in each sequence in groups of N at a time and then interpreting each grouping of binary digits as an integer, and then compute the Hoeffding’s D of the resulting integers. Maybe doing that for a range of N values, like 3 to 8, and averaging the Hoeffding’s D values you get. Not sure if that really makes sense though, I need to try it with some real data!
Funny you should mention that, since I recently responded to a tweet about Chatterjee correlation to mention Hoeffding’s D, and that’s what got me thinking about it again over the past couple days and wanting to build up more intuition for how and why it works:
This was a great read and I loved the sections on intuition. But I kept wondering when the author would circle back to showing how the D stat would perform against the cases they presented for other correlation stats as evidence that those stats were flawed.
Good idea, I should add that. I need to generate some challenging data sets to illustrate things, which is why I left that out (I wrote this all in one shot and didn’t want to stop to generate data).
It's naive to think you can compare vectors without any sort of prior on what this relationship might look like. Hoeffding's D cannot effectively capture relationships between time-shifted series that the (generalised) cross correlation can. Nor can it identify patterns in conditional heteroskedasticity. You will always need to impose some sort of prior when comparing vectors, and Hoeffding's D is rarely the best choice.
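For instance, a crude version of that lag scan looks like this (toy white-noise series with an assumed shift of 10 steps, just for illustration):

```python
# Sketch: a plain pairwise measure on aligned samples sees little, while
# scanning lags (a crude cross-correlation) recovers the shift.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = np.roll(x, 10) + 0.2 * rng.normal(size=500)    # y is x delayed by 10 steps

print("aligned Pearson r:", pearsonr(x, y)[0])     # near zero despite the dependence

best = max(range(-20, 21), key=lambda lag: abs(pearsonr(np.roll(x, lag), y)[0]))
print("best lag:", best, "r at best lag:", pearsonr(np.roll(x, best), y)[0])
```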
I have to disagree. It repeats the same pseudoscientific presentation of correlation, as having to do with the semantics of the data.
Generically, correlation has nothing to do with whether data are related, dependent or independent.
It only is an indication of this *if you already believe* they are dependent. There has to be a prior semantic model of what the data means (what it is a measure of, how reliable it is as a measure of it, etc.) before correlation measures anything at all.
The only reason we would suppose correlated data to be related is that we design experiments to already contain possible dependencies. We do not, as a habit, measure, say, the fall of rain and the beat of a song playing. But these would, given suitable measurements, be correlated.
This becomes absolutely vital to understand when experiments do take this "hoover up everything, arbitrarily" approach. As often they do in the social and psychological sciences.
I added a link to a picture at least! I was going to put more images in there initially, but decided that I didn't want to risk disrupting the flow of the writing, especially for things that I could explain reasonably well using only words. The best images for understanding are sometimes the ones that you form in your own mind as you read!