Yeah, but in this case it's really the wrong way to think about it.
If you have a DAG based on wrong assumptions, it doesn't matter whether you get a point estimate based on null hypothesis thinking or whether you get a posterior distribution based on some prior. The problem is that the way in which you combine variables is wrong, and bayesian analysis will just be more detailed and precise in being wrong.
Does frequentist/bayesian matter for anything but quasi-religious beliefs?
I mean, that's maths: either approach has to give the same results, as they come from the same theory. Bayes' theorem is just a theorem; use it explicitly or not, the numbers will be the same because the axioms are the same.
No, they are linked to beliefs (like anything else), but the canonical forms do differ a lot. Most importantly:
- bayesian methods give you posterior distributions rather than point estimates and SEs
- bayesian methods natively offer prior and posterior predictive checks
- with bayesian methods it's much easier to combine knowledge from multiple sources, something null-hypothesis testing struggles with (the best frequentist answer is probably still meta-analysis); a small sketch of the point-estimate vs. posterior contrast follows below
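To make the first point concrete, here is a minimal sketch (Python with scipy, made-up coin-flip data): the frequentist side gives a point estimate plus a standard error, the bayesian side gives a whole posterior distribution, and the prior (an arbitrary Beta(2, 2) here) is where outside knowledge can enter.

    # Made-up coin-flip data: contrast a point estimate + SE with a posterior.
    from scipy import stats

    heads, flips = 7, 10

    # Frequentist: maximum-likelihood point estimate and a Wald-style SE.
    p_hat = heads / flips
    se = (p_hat * (1 - p_hat) / flips) ** 0.5
    print(f"MLE: {p_hat:.2f} +/- {se:.2f}")

    # Bayesian: a Beta(2, 2) prior (mild outside knowledge that p is near 0.5)
    # conjugate-updates to a full Beta posterior rather than a single number.
    posterior = stats.beta(2 + heads, 2 + flips - heads)
    lo, hi = posterior.interval(0.95)
    print(f"Posterior mean: {posterior.mean():.2f}, "
          f"95% credible interval: ({lo:.2f}, {hi:.2f})")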
For some areas of research, truly understanding causality is essentially out of reach: well-controlled experiments are impossible, and the list of possible colliders and confounders is unknowable.
The key problem is that any causal relation can be an illusion caused by some other, unobserved relation!
This means that in order to show fully valid causal effect estimates, we need to
- measure precisely
- measure all relevant variables
- actively NOT condition on harmful variables (i.e. colliders, which create spurious correlations when controlled for)
I heartily recommend The Book of Why [1] by Pearl and Mackenzie for deeper reading, and the "haunted DAG" example in McElreath's wonderful Statistical Rethinking.
Pearl's Causality is very high on my "re-read while making flashcards" list. It is depressing how hard it is to establish causality, but also inspiring how causality can be teased out of observational statistics provided one dares assume a model on which variables and correlations are meaningful.
"provided one dares assume ..." - that's a great quote which I'll steal in the future if you allow!
Most things we learn about DAGs and causality are frustrating, but simulating a DAG (e.g. with lavaan in R) is a technique that actually helps in understanding when and how those assumptions make sense. That's (to me) a key part of making causality productive.
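For anyone without R handy, here is a minimal version of the same exercise in Python (numpy only, made-up coefficients): simulate the DAG Z -> X, Z -> Y, X -> Y, then compare the regression of Y on X with and without adjusting for the confounder Z.

    # Toy DAG simulation (made-up coefficients): Z -> X, Z -> Y, X -> Y.
    # Shows how adjusting vs. not adjusting for the confounder Z changes
    # the estimated effect of X on Y.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    z = rng.normal(size=n)                      # confounder
    x = 0.8 * z + rng.normal(size=n)            # exposure, caused by Z
    y = 0.5 * x + 1.2 * z + rng.normal(size=n)  # outcome, caused by X and Z

    # Naive regression of Y on X (plus intercept): biased upward by Z.
    naive, *_ = np.linalg.lstsq(np.column_stack([x, np.ones(n)]), y, rcond=None)

    # Adjusted regression of Y on X and Z: recovers the true effect of ~0.5.
    adjusted, *_ = np.linalg.lstsq(np.column_stack([x, z, np.ones(n)]), y, rcond=None)

    print(f"naive X coefficient:    {naive[0]:.2f}")     # ~1.1 (biased)
    print(f"adjusted X coefficient: {adjusted[0]:.2f}")  # ~0.5 (true effect)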
even if you hit all the assumptions you need to make Pearl/Rubin causality work, and there is no unobserved factor to cause problems, there is still a philosophical problem.
it all assumes you can divide the world cleanly into variables that can be the nodes of your DAG. The philosopher Nancy Cartwright talks about this a lot, but it’s also a practical problem.
And this is even before we get into the philosophical / epistemological questions about "cause."
You can make the argument, from correlative data, that bridges and train tracks cause truck accidents. And more importantly, if you act like they do when designing roadways, you actually will decrease truck accidents. But it's a common-sense-odd meaning of causality to claim a stationary object is acting upon a mobile object...
And even if you do know there's causality (eg: the input variable X is part of software that provides some output Y), the exact nature of the causality can be too complex to analyze due to emergent and chaotic effects. It's seldom as simple as: an increase in X will result in an increase in Y
Can we nevertheless extract causality from correlation?
I would argue that, theoretically, we cannot. Practically speaking, however, we frequently settle for “very, very convincing correlations” as indicative of causation. A correlation may be persuasively described as causation if three conditions are met:
Completeness: The association itself (R²) is 100%. When we observe X, we always observe Y.
No bias: The association between X and Y is not affected by a third, omitted variable, Z.
I feel like you have this backwards. In the assignment Y:=2X, each unit of Y is caused by half a unit of X. In the game where we flip a coin at fair odds, if you have increased your wealth by 8× in 3 tosses, that was caused by you getting heads every toss. Theoretically establishing causality is trivial.
The problem comes when we try to do so practically, because reality is full of surprising detail.
> No bias: The association between X and Y is not affected by a third, omitted variable, Z.
This is, practically speaking, the difficult condition. I'm not so convinced the others are necessary (practically speaking, anyway) but you should read Pearl if you're into this!
You probably also need at least:
- Y does not appear when X does not
- We need an overwhelming sample size containing examples of both X and not X
- The experiment and data collection are trivially repeatable (so that we don't need to rely on trust)
- The experiment, data collection and analysis must be easy to understand and sensible in every way without leaving room for error
And as another commenter already pointed out: you can't really rule out the existence of an unknown Z
In general you assume DAGs, i.e. acyclic causality. Cyclic relations must be resolved through distinct temporal steps, i.e. u_t0 causes v_t1 and v_t1 causes u_t2. When your measurement precision only captures simultaneous effects of both u on v and v on u, you have a problem.
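A minimal sketch of that problem (Python/numpy, made-up coefficients): u and v cause each other at the next time step. With fine-grained, lagged measurements the two directions separate; with coarse measurements all you see is one "simultaneous" association.

    # Toy feedback system: u_t -> v_{t+1} and v_t -> u_{t+1}.
    import numpy as np

    rng = np.random.default_rng(1)
    T = 200_000
    u = np.zeros(T)
    v = np.zeros(T)
    for t in range(1, T):
        u[t] = 0.6 * v[t - 1] + rng.normal()
        v[t] = 0.2 * u[t - 1] + rng.normal()

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    # Fine time resolution: the two causal directions are distinguishable.
    print("corr(v_t, u_{t+1}) =", round(corr(v[:-1], u[1:]), 2))  # roughly 0.5
    print("corr(u_t, v_{t+1}) =", round(corr(u[:-1], v[1:]), 2))  # roughly 0.2

    # Coarse resolution (average over windows of 10 steps): only a single
    # contemporaneous correlation remains, with no direction information.
    U = u.reshape(-1, 10).mean(axis=1)
    V = v.reshape(-1, 10).mean(axis=1)
    print("corr(U_k, V_k)     =", round(corr(U, V), 2))           # roughly 0.6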
Not everyone knows that colliders and confounders have technical definitions:
------------------
Confounders
------------------
A variable that affects both the exposure and the outcome. It is a common cause of both variables.
Role: Confounders can create a spurious association between the exposure and outcome if not properly controlled for. They are typically addressed by controlling for them in statistical models, such as regression analysis, to reduce bias and estimate the true causal effect.
Example: Age is a common confounder in many studies because it can affect both the exposure (e.g., smoking) and the outcome (e.g., lung cancer).
------------------
Colliders
------------------
A variable that is causally influenced by two or more other variables. In graphical models, it is represented as a node where the arrowheads from these variables "collide."
Role: Colliders do not inherently create an association between the variables that influence them. However, conditioning on a collider (e.g., through stratification or regression) can introduce a non-causal association between these variables, leading to collider bias.
Example: If both smoking and lung cancer affect quality of life, quality of life is a collider. Conditioning on quality of life could create a biased association between smoking and lung cancer.
------------------
Differences
------------------
Direction of Causality: Confounders cause both the exposure and the outcome, while colliders are caused by both the exposure and the outcome.
Statistical Handling: Confounders should be controlled for to reduce bias, whereas controlling for colliders can introduce bias.
Graphical Representation: In Directed Acyclic Graphs (DAGs), confounders have arrows pointing away from them to both the exposure and outcome, while colliders have arrows pointing towards them from both the exposure and outcome.
------------------
Managing
------------------
Directed Acyclic Graphs (DAGs): These are useful tools for identifying and distinguishing between confounders and colliders. They help in understanding the causal structure of the variables involved.
Statistical Methods: For confounders, methods like regression analysis are effective for controlling their effects. For colliders, avoiding conditioning on them is crucial to prevent collider bias.
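For the collider case in particular, a minimal simulation (Python/numpy, made-up numbers) shows how conditioning manufactures an association out of nothing:

    # Collider bias in miniature: X and Y are independent by construction,
    # C is caused by both. Conditioning on C (here: including it as a
    # regressor) induces a spurious X-Y association.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000
    x = rng.normal(size=n)          # e.g. "exposure"
    y = rng.normal(size=n)          # e.g. "outcome", independent of x
    c = x + y + rng.normal(size=n)  # collider: caused by both

    def slope_of_x(design, target):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return round(coef[0], 2)

    ones = np.ones(n)
    print("Y on X:        ", slope_of_x(np.column_stack([x, ones]), y))     # ~0.0
    print("Y on X given C:", slope_of_x(np.column_stack([x, c, ones]), y))  # ~-0.5 (spurious)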
Sure, but someone else did this for me once, using AI, and I found it useful to scan in the moment. I appreciated it and upvoted it.
Like that experience, this was meant as a scannable introduction to the topic, not an exact reference. Happy to hear alternative views, or downvote to give herding-style feedback.
Had I done a short AI-generated summary, it would have been a bit less helpful, but there wouldn't have been downvotes.
Had I linked to the same AI explanation instead of posting it, there would have been fewer or no downvotes, because many wouldn't click, and some of those who did would find it helpful.
Had I linked to something else, many would not have clicked and read it without a summary, and both the linked piece and the summary could have been AI-created.
I chose to move on and accept a few downvotes. The votes count less than the helpfulness to me. Votes don't mean it helps or doesn't. Many people accept confusion without seeking clarification, and appreciate a little help.
Although I personally do tend to downvote content-free unhelpful Reddit-style comments, I'm not overly fond of trying to massage things to help people manage their feelings when posts are only information, with no framing or opinion content. I understand that there is value in downvotes as herding-style feedback (as PG has pointed out). Yes, I've read the HN guidelines.
I think beyond herding-style feedback downvotes, AI info has become a bit socially unacceptable—okay to talk about it but not share it. But I find AI particularly useful as an initial look at information about a domain, though not trustworthy as a detailed source. I appreciate the footnotes that Perplexity provides for this kind of usage that let me begin checking for accurate details.
Unfortunately yes, the article is a simplification, in part because the AI Act delegates some regulation to other existing acts. So to know the full picture of AI regulation, one needs to look at the combination of multiple texts.
The precise language on high risk is here [1], but some enumerations are placed in the annex, which (!!!) can be amended by the commission, if I am not completely mistaken. So this is very much a dynamic regulation.
When you expand your array, your existing data will not be stored any more efficiently.
To get the new parity/data ratios, you would have to force copies of the data and delete the old, inefficient versions, e.g. with something like this [1]
My personal take is that it's a much better idea to buy individual complete raid-z configurations and add new ones / replace old ones (disk by disk!) as you go.
It's good to see that they were pretty conservative about the expansion.
Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.
That said, there is one tiny caveat people should be aware of:
> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).
I'm not sure that's really a caveat; it just means old data might be in a suboptimal layout. Even with that, you still get the full benefits of raidzN, where up to N disks can completely fail and the pool will remain functional.
I think it's a huge caveat, because it makes upgrades a lot less efficient than you'd expect.
For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with a 55% storage efficiency rather than 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
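A back-of-the-envelope check of those numbers (assuming the 2-disk pool was completely full at expansion time and ignoring metadata overhead) lands at the roughly 55% figure:

    # Rough efficiency check for expanding a full 2-disk raidz1 to 3 disks
    # (unit = one disk's capacity; ignores metadata and padding).
    old_data = 1.0            # data stored before expansion (1 data : 1 parity)
    old_raw = 2.0             # raw space the old blocks keep occupying
    new_raw = 3.0 - old_raw   # raw space freed up by the new disk
    new_data = new_raw * 2/3  # new blocks written at 2 data : 1 parity

    efficiency = (old_data + new_data) / 3.0
    print(f"post-expansion efficiency: {efficiency:.0%}")  # ~56%
    print(f"ideal 3-wide raidz1:       {2/3:.0%}")         # ~67%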
There are two ways to get the old data onto the new layout (a rough sketch of the first follows below):
1. Delete the snapshots and rewrite the files in place, like people do when they want to rebalance a pool.
2. Use send/receive inside the pool.
Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
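A minimal sketch of option 1 (Python standard library, hypothetical paths): copy each file within the same dataset and atomically rename it over the original, so its blocks get rewritten with the pool's current layout. It deliberately ignores snapshots, hard links, reflinks, xattrs and open files, which a real rebalancing script has to deal with.

    # Naive in-place rewrite, illustration only.
    import os
    import shutil

    def rewrite_in_place(path):
        tmp = path + ".rebalance.tmp"  # hypothetical temp-file suffix
        shutil.copy2(path, tmp)        # copy data + basic metadata
        os.replace(tmp, path)          # atomic rename over the original

    def rewrite_tree(root):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                rewrite_in_place(os.path.join(dirpath, name))

    # rewrite_tree("/tank/data")  # hypothetical dataset mountpoint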
Well, when you start a raidz with 2 devices you've already done goofed. Start with a mirror or at least 3 devices.
Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do anyway: above ~85% utilization you create massive fragmentation), real-world efficiency will be better.
It still seems pretty minor. If you want extreme optimization, feel free to destroy the pool and create it new, or create it with the ideal layout from the beginning.
Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.
Since preexisting blocks are kept at their current parity ratio and not modified (only redistributed among all devices), increasing the parity level of new blocks won't really be useful in practice anyway.
Where the Dd1-3 blocks are just wasted. Meaning that by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of free storage originally; if you have 4TB free before expansion, you would have 5TB free after expansion.
Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.
ZFS RAID-Z does not have parity disks. The parity and data are interleaved to allow data reads to be done from all disks rather than just the data disks.
Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.
I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap their heads around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, allowing any single disk to die.
I've been of two minds on the persistent myth of "parity disks", but I usually ignore it, because it's a convenient lie that at least helps people understand that their data is safe. It's also a little bit the same way that raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6"; the effective benefits are the same, but the implementation is totally different.
Unless I misunderstood you, you're describing more how classical RAID would work. The RAID-Z expansion works the way you note you would logically expect: you added a drive with four blocks of free space, and you end up with four more blocks of free space afterwards.
You can see this in the presentation[1] slides[2].
The reason this is sub-optimal post-expansion is that, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.
Your example is a bit unfortunate in terms of allocated blocks vs layout, but the picture gets clearer if we tweak it slightly.
Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10 disk raid-z2 vdev that is ashift=12 for 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there is no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio. In this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.
There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
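A small helper (Python; a rough model that ignores padding/skip sectors and allocation quirks) reproduces the block counts from the 10-wide raid-z2 example above:

    # Rough parity accounting for a raidz vdev (ignores padding/skip sectors).
    import math

    def raidz_blocks(file_kib, width, nparity, ashift_kib=4):
        data = math.ceil(file_kib / ashift_kib)  # data blocks needed
        data_cols = width - nparity              # data columns per stripe
        stripes = math.ceil(data / data_cols)
        parity = stripes * nparity
        return data, parity

    print(raidz_blocks(4,  width=10, nparity=2))  # (1, 2)   4K file, 10-wide raidz2
    print(raidz_blocks(72, width=10, nparity=2))  # (18, 6)  72K file, pre-expansion
    print(raidz_blocks(72, width=11, nparity=2))  # (18, 4)  72K file, post-expansion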
What are the errors? I tried to show exactly what you talk about.
edit: ok, I didn't consider the exact locations of the parity, I was only concerned with space usage.
The 8 data blocks need three stripes on a 3+2 RAID-Z2 setup both pre and post expansion, the last being a partial stripe, but written in the 4+2 setup they need only 2 full stripes, leading to more total free space.
Ok so the correlation coefficient for vote share is .048, meaning the R squared or explained variance is about .002, i.e. around 0.2 percent. No, monkeys cannot predict vote share.
That said, for a paper with such an obviously bullshit design, the text itself is not that bad; at least they seriously investigate the facial features that seem to be relevant in these cases.
If you must remember this paper (please don’t), remember it as “monkey gaze is a better predictor for election races in the US than a coin toss”.
I have a copy of "New Rules for the New Economy" by Kevin Kelly, signed as part of the Global Business Network that he and Stewart Brand founded a long time ago.
Having read Fred Turner's immensely great book "From Counterculture to Cyberculture", that is a valuable little piece of history to me.
Causality is a largely orthogonal problem to frequentist/bayesian - it makes everything harder, not just one of those!