Yeah, but in this case it's really the wrong way to think about it.
If you have a DAG based on wrong assumptions, it doesn't matter whether you get a point estimate based on null hypothesis thinking or whether you get a posterior distribution based on some prior. The problem is that the way in which you combine variables is wrong, and bayesian analysis will just be more detailed and precise in being wrong.
Does frequentist/bayesian matter for anything but quasi-religious beliefs?
I mean, that's maths: either approach has to give the same results, as they come from the same theory. Bayes' theorem is just a theorem; use it explicitly or not, the numbers will be the same because the axioms are the same.
No, they are linked to beliefs (like anything else), but the canonical forms do differ a lot. Most importantly:
- bayesian methods give you posterior distributions rather than point estimates and SEs
- bayesian methods natively offer prior and posterior predictive checks
- with bayesian methods it's much easier to combine knowledge from multiple sources, something null-hypothesis testing struggles with (the best frequentist answer is probably still meta-analysis); a small sketch of the point-estimate vs. posterior contrast follows below
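To make the first point concrete, here is a minimal sketch (Python with scipy, made-up coin-flip data): the frequentist side gives a point estimate plus a standard error, the bayesian side gives a whole posterior distribution, and the prior (an arbitrary Beta(2, 2) here) is where outside knowledge can enter.

    # Made-up coin-flip data: contrast a point estimate + SE with a posterior.
    from scipy import stats

    heads, flips = 7, 10

    # Frequentist: maximum-likelihood point estimate and a Wald-style SE.
    p_hat = heads / flips
    se = (p_hat * (1 - p_hat) / flips) ** 0.5
    print(f"MLE: {p_hat:.2f} +/- {se:.2f}")

    # Bayesian: a Beta(2, 2) prior (mild outside knowledge that p is near 0.5)
    # conjugate-updates to a full Beta posterior rather than a single number.
    posterior = stats.beta(2 + heads, 2 + flips - heads)
    lo, hi = posterior.interval(0.95)
    print(f"Posterior mean: {posterior.mean():.2f}, "
          f"95% credible interval: ({lo:.2f}, {hi:.2f})")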
For some areas of research, truly understanding causality is essentially out of reach: well-controlled experiments are impossible, and the list of possible colliders and confounders is unknowable.
The key problem is that any causal relation can be an illusion caused by some other, unobserved relation!
This means that in order to show fully valid causal effect estimates, we need to
- measure precisely
- measure all relevant variables
- actively NOT condition on harmful variables (i.e. colliders, which create spurious correlations when controlled for)
I heartily recommend The Book of Why [1] by Pearl and Mackenzie for deeper reading, and the "haunted DAG" example in McElreath's wonderful Statistical Rethinking.
Pearl's Causality is very high on my "re-read while making flashcards" list. It is depressing how hard it is to establish causality, but also inspiring how causality can be teased out of observational statistics provided one dares assume a model on which variables and correlations are meaningful.
"provided one dares assume ..." - that's a great quote which I'll steal in the future if you allow!
Most things we learn about DAGs and causality are frustrating, but simulating a DAG (e.g. with lavaan in R) is a technique that actually helps in understanding when and how those assumptions make sense. That's (to me) a key part of making causality productive.
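For anyone without R handy, here is a minimal version of the same exercise in Python (numpy only, made-up coefficients): simulate the DAG Z -> X, Z -> Y, X -> Y, then compare the regression of Y on X with and without adjusting for the confounder Z.

    # Toy DAG simulation (made-up coefficients): Z -> X, Z -> Y, X -> Y.
    # Shows how adjusting vs. not adjusting for the confounder Z changes
    # the estimated effect of X on Y.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    z = rng.normal(size=n)                      # confounder
    x = 0.8 * z + rng.normal(size=n)            # exposure, caused by Z
    y = 0.5 * x + 1.2 * z + rng.normal(size=n)  # outcome, caused by X and Z

    # Naive regression of Y on X (plus intercept): biased upward by Z.
    naive, *_ = np.linalg.lstsq(np.column_stack([x, np.ones(n)]), y, rcond=None)

    # Adjusted regression of Y on X and Z: recovers the true effect of ~0.5.
    adjusted, *_ = np.linalg.lstsq(np.column_stack([x, z, np.ones(n)]), y, rcond=None)

    print(f"naive X coefficient:    {naive[0]:.2f}")     # ~1.1 (biased)
    print(f"adjusted X coefficient: {adjusted[0]:.2f}")  # ~0.5 (true effect)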
even if you hit all the assumptions you need to make Pearl/Rubin causality work, and there is no unobserved factor to cause problems, there is still a philosophical problem.
it all assumes you can divide the world cleanly into variables that can be the nodes of your DAG. The philosopher Nancy Cartwright talks about this a lot, but it’s also a practical problem.
And this is even before we get into the philosophical / epistemological questions about "cause."
You can make the argument, from correlative data, that bridges and train tracks cause truck accidents. And more importantly, if you act like they do when designing roadways, you actually will decrease truck accidents. But it's a common-sense-odd meaning of causality to claim a stationary object is acting upon a mobile object...
And even if you do know there's causality (eg: the input variable X is part of software that provides some output Y), the exact nature of the causality can be too complex to analyze due to emergent and chaotic effects. It's seldom as simple as: an increase in X will result in an increase in Y
Can we nevertheless extract causality from correlation?
I would argue that, theoretically, we cannot. Practically speaking, however, we frequently settle for “very, very convincing correlations” as indicative of causation. A correlation may be persuasively described as causation if three conditions are met:
Completeness: The association itself (R²) is 100%. When we observe X, we always observe Y.
No bias: The association between X and Y is not affected by a third, omitted variable, Z.
I feel like you have this backwards. In the assignment Y:=2X, each unit of Y is caused by half a unit of X. In the game where we flip a coin at fair odds, if you have increased your wealth by 8× in 3 tosses, that was caused by you getting heads every toss. Theoretically establishing causality is trivial.
The problem comes when we try to do so practically, because reality is full of surprising detail.
> No bias: The association between X and Y is not affected by a third, omitted variable, Z.
This is, practically speaking, the difficult condition. I'm not so convinced the others are necessary (practically speaking, anyway) but you should read Pearl if you're into this!
You probably also need at least:
- Y does not appear when X does not
- We need an overwhelming sample size containing examples of both X and not X
- The experiment and data collection are trivially repeatable (so that we don't need to rely on trust)
- The experiment, data collection and analysis must be easy to understand and sensible in every way without leaving room for error
And as another commenter already pointed out: you can't really rule out the existence of an unknown Z
In general you assume DAGs, i.e. acyclic causality. Cyclic relations must be resolved through distinct temporal steps, i.e. u_t0 causes v_t1 and v_t1 causes u_t2. When your measurement precision only captures simultaneous effects of both u on v and v on u, you have a problem.
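A minimal sketch of that problem (Python/numpy, made-up coefficients): u and v cause each other at the next time step. With fine-grained, lagged measurements the two directions separate; with coarse measurements all you see is one "simultaneous" association.

    # Toy feedback system: u_t -> v_{t+1} and v_t -> u_{t+1}.
    import numpy as np

    rng = np.random.default_rng(1)
    T = 200_000
    u = np.zeros(T)
    v = np.zeros(T)
    for t in range(1, T):
        u[t] = 0.6 * v[t - 1] + rng.normal()
        v[t] = 0.2 * u[t - 1] + rng.normal()

    def corr(a, b):
        return np.corrcoef(a, b)[0, 1]

    # Fine time resolution: the two causal directions are distinguishable.
    print("corr(v_t, u_{t+1}) =", round(corr(v[:-1], u[1:]), 2))  # roughly 0.5
    print("corr(u_t, v_{t+1}) =", round(corr(u[:-1], v[1:]), 2))  # roughly 0.2

    # Coarse resolution (average over windows of 10 steps): only a single
    # contemporaneous correlation remains, with no direction information.
    U = u.reshape(-1, 10).mean(axis=1)
    V = v.reshape(-1, 10).mean(axis=1)
    print("corr(U_k, V_k)     =", round(corr(U, V), 2))           # roughly 0.6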
Not everyone knows that colliders and confounders have technical definitions:
------------------
Confounders
------------------
A variable that affects both the exposure and the outcome. It is a common cause of both variables.
Role: Confounders can create a spurious association between the exposure and outcome if not properly controlled for. They are typically addressed by controlling for them in statistical models, such as regression analysis, to reduce bias and estimate the true causal effect.
Example: Age is a common confounder in many studies because it can affect both the exposure (e.g., smoking) and the outcome (e.g., lung cancer).
------------------
Colliders
------------------
A variable that is causally influenced by two or more other variables. In graphical models, it is represented as a node where the arrowheads from these variables "collide."
Role: Colliders do not inherently create an association between the variables that influence them. However, conditioning on a collider (e.g., through stratification or regression) can introduce a non-causal association between these variables, leading to collider bias.
Example: If both smoking and lung cancer affect quality of life, quality of life is a collider. Conditioning on quality of life could create a biased association between smoking and lung cancer.
------------------
Differences
------------------
Direction of Causality: Confounders cause both the exposure and the outcome, while colliders are caused by both the exposure and the outcome.
Statistical Handling: Confounders should be controlled for to reduce bias, whereas controlling for colliders can introduce bias.
Graphical Representation: In Directed Acyclic Graphs (DAGs), confounders have arrows pointing away from them to both the exposure and outcome, while colliders have arrows pointing towards them from both the exposure and outcome.
------------------
Managing
------------------
Directed Acyclic Graphs (DAGs): These are useful tools for identifying and distinguishing between confounders and colliders. They help in understanding the causal structure of the variables involved.
Statistical Methods: For confounders, methods like regression analysis are effective for controlling their effects. For colliders, avoiding conditioning on them is crucial to prevent collider bias.
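For the collider case in particular, a minimal simulation (Python/numpy, made-up numbers) shows how conditioning manufactures an association out of nothing:

    # Collider bias in miniature: X and Y are independent by construction,
    # C is caused by both. Conditioning on C (here: including it as a
    # regressor) induces a spurious X-Y association.
    import numpy as np

    rng = np.random.default_rng(42)
    n = 100_000
    x = rng.normal(size=n)          # e.g. "exposure"
    y = rng.normal(size=n)          # e.g. "outcome", independent of x
    c = x + y + rng.normal(size=n)  # collider: caused by both

    def slope_of_x(design, target):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        return round(coef[0], 2)

    ones = np.ones(n)
    print("Y on X:        ", slope_of_x(np.column_stack([x, ones]), y))     # ~0.0
    print("Y on X given C:", slope_of_x(np.column_stack([x, c, ones]), y))  # ~-0.5 (spurious)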
Sure, but someone else did this for me once, using AI, and I found it useful to scan in the moment. I appreciated it and upvoted it.
Like that experience, this was meant as a scannable introduction to the topic, not an exact reference. Happy to hear alternative views, or downvote to give herding-style feedback.
Had I done a short AI-generated summary, it would have been a bit less helpful, but there wouldn't have been downvotes.
Had I linked to the same AI explanation instead of posting it, there would have been fewer or no downvotes, because many wouldn't click, and some of those who did would find it helpful.
Had I linked to something else, many would not have clicked and read it without a summary, and both the linked piece and the summary could have been AI-created.
I chose to move on and accept a few downvotes. The votes count less than the helpfulness to me. Votes don't mean it helps or doesn't. Many people accept confusion without seeking clarification, and appreciate a little help.
Although I personally do tend to downvote content-free unhelpful Reddit-style comments, I'm not overly fond of trying to massage things to help people manage their feelings when posts are only information, with no framing or opinion content. I understand that there is value in downvotes as herding-style feedback (as PG has pointed out). Yes, I've read the HN guidelines.
I think beyond herding-style feedback downvotes, AI info has become a bit socially unacceptable—okay to talk about it but not share it. But I find AI particularly useful as an initial look at information about a domain, though not trustworthy as a detailed source. I appreciate the footnotes that Perplexity provides for this kind of usage that let me begin checking for accurate details.
Unfortunately yes, the article is a simplification, in part because the AI Act delegates some regulation to other existing acts. So to know the full picture of AI regulation, one needs to look at the combination of multiple texts.
The precise language on high risk is here [1], but some enumerations are placed in the annex, which (!!!) can be amended by the commission, if I am not completely mistaken. So this is very much a dynamic regulation.
When you expand your array, your existing data will not be stored any more efficiently.
To get the new parity/data ratios, you would have to force copies of the data and delete the old, inefficient versions, e.g. with something like this [1]
My personal take is that it's a much better idea to buy individual complete raid-z configurations and add new ones / replace old ones (disk by disk!) as you go.
It's good to see that they were pretty conservative about the expansion.
Not only is expansion completely transparent and resumable, it also maintains redundancy throughout the process.
That said, there is one tiny caveat people should be aware of:
> After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. 5-wide RAIDZ2, has 3 data to 2 parity), but distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide, has 4 data to 2 parity).
I'm not sure that's really a caveat; it just means old data might be in a suboptimal layout. Even with that, you still get the full benefits of raidzN, where up to N disks can completely fail and the pool will remain functional.
I think it's a huge caveat, because it makes upgrades a lot less efficient than you'd expect.
For example, home users generally don't want to buy all of their storage up front. They want to add additional disks as the array fills up. Being able to start with a 2-disk raidz1 and later upgrade that to a 3-disk and eventually 4-disk array is amazing. It's a lot less amazing if you end up with a 55% storage efficiency rather than 66% you'd ideally get from a 2-disk to 3-disk upgrade. That's 11% of your total disk capacity wasted, without any benefit whatsoever.
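A back-of-the-envelope check of those numbers (assuming the 2-disk pool was completely full at expansion time and ignoring metadata overhead) lands at the roughly 55% figure:

    # Rough efficiency check for expanding a full 2-disk raidz1 to 3 disks
    # (unit = one disk's capacity; ignores metadata and padding).
    old_data = 1.0            # data stored before expansion (1 data : 1 parity)
    old_raw = 2.0             # raw space the old blocks keep occupying
    new_raw = 3.0 - old_raw   # raw space freed up by the new disk
    new_data = new_raw * 2/3  # new blocks written at 2 data : 1 parity

    efficiency = (old_data + new_data) / 3.0
    print(f"post-expansion efficiency: {efficiency:.0%}")  # ~56%
    print(f"ideal 3-wide raidz1:       {2/3:.0%}")         # ~67%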
There are two ways to get the old data onto the new layout (a rough sketch of the first follows below):
1. Delete the snapshots and rewrite the files in place, like people do when they want to rebalance a pool.
2. Use send/receive inside the pool.
Either one will make the data use the new layout. They both carry the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
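A minimal sketch of option 1 (Python standard library, hypothetical paths): copy each file within the same dataset and atomically rename it over the original, so its blocks get rewritten with the pool's current layout. It deliberately ignores snapshots, hard links, reflinks, xattrs and open files, which a real rebalancing script has to deal with.

    # Naive in-place rewrite, illustration only.
    import os
    import shutil

    def rewrite_in_place(path):
        tmp = path + ".rebalance.tmp"  # hypothetical temp-file suffix
        shutil.copy2(path, tmp)        # copy data + basic metadata
        os.replace(tmp, path)          # atomic rename over the original

    def rewrite_tree(root):
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                rewrite_in_place(os.path.join(dirpath, name))

    # rewrite_tree("/tank/data")  # hypothetical dataset mountpoint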
Well, when you start a raidz with 2 devices you've already done goofed. Start with a mirror or at least 3 devices.
Also, if you don't wait to upgrade until the disks are at 100% utilization (which you should never do anyway: above ~85% utilization you create massive fragmentation), real-world efficiency will be better.
It still seems pretty minor. If you want extreme optimization, feel free to destroy the pool and create it new, or create it with the ideal layout from the beginning.
Old data still works fine, the same guarantees RAID-Z provides still hold. New data will be written with the new data layout.
Since preexisting blocks are kept at their current parity ratio and not modified (only redistributed among all devices), increasing the parity level of new blocks won't really be useful in practice anyway.
Where the Dd1-3 blocks are just wasted. Meaning that by adding a new disk to the array you're only expanding free storage by 25%... So say you have 8TB disks for a total of 24TB of free storage originally; if you have 4TB free before expansion, you would have 5TB free after expansion.
Please tell me I've misunderstood this, because to me it is a pretty useless implementation if I haven't.
ZFS RAID-Z does not have parity disks. The parity and data are interleaved to allow data reads to be done from all disks rather than just the data disks.
Anyway, you are not entirely wrong. The old data will have the old parity:data ratio while new data will have the new parity:data ratio. As old data is freed from the vdev, new writes will use the new parity:data ratio. You can speed this up by doing send/receive, or by deleting all snapshots and then rewriting the files in place. This has the caveat that reflinks will not survive the operation, such that if you used reflinks to deduplicate storage, you will find the deduplication effect is gone afterward.
To be fair, RAID5/6 don't have parity disks either. RAID2, RAID3, and RAID4 do, but they're all effectively dead technology for good reason.
I think it's easy for a lot of people to conceptualize RAID5/6 and RAID-Zn as having "data disks" and "parity disks" to wrap their heads around the complicated topic of how it works, but all of them truly interleave and compute parity data across all disks, allowing any single disk to die.
I've been of two minds on the persistent myth of "parity disks", but I usually ignore it, because it's a convenient lie that at least helps people understand that their data is safe. It's also a little bit the same way that raidz1 and raidz2 are sometimes talked about as "RAID5" and "RAID6"; the effective benefits are the same, but the implementation is totally different.
Unless I misunderstood you, you're describing more how classical RAID would work. The RAID-Z expansion works the way you note you would logically expect: you added a drive with four blocks of free space, and you end up with four more blocks of free space afterwards.
You can see this in the presentation[1] slides[2].
The reason this is sub-optimal post-expansion is that, in your example, the old maximal stripe width is lower than the post-expansion maximal stripe width.
Your example is a bit unfortunate in terms of allocated blocks vs layout, but the picture gets clearer if we tweak it slightly.
Your diagrams have some flaws too. ZFS has a variable stripe size. Let’s say you have a 10 disk raid-z2 vdev that is ashift=12 for 4K columns. If you have a 4K file, 1 data block and 2 parity blocks will be written. Even if you expand the raid-z vdev, there is no savings to be had from the new data:parity ratio. Now, let’s assume that you have a 72K file. Here, you have 18 data blocks and 6 parity blocks. You would benefit from rewriting this to use the new data:parity ratio. In this case, you would only need 4 parity blocks. ZFS does not rewrite it as part of the expansion, however.
There are already good diagrams in your links, so I will refrain from drawing my own with ASCII. Also, ZFS will vary which columns get parity, which is why the slides you linked have the parity at pseudo-random locations. It was not a quirk of the slide’s author. The data is really laid out that way.
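A small helper (Python; a rough model that ignores padding/skip sectors and allocation quirks) reproduces the block counts from the 10-wide raid-z2 example above:

    # Rough parity accounting for a raidz vdev (ignores padding/skip sectors).
    import math

    def raidz_blocks(file_kib, width, nparity, ashift_kib=4):
        data = math.ceil(file_kib / ashift_kib)  # data blocks needed
        data_cols = width - nparity              # data columns per stripe
        stripes = math.ceil(data / data_cols)
        parity = stripes * nparity
        return data, parity

    print(raidz_blocks(4,  width=10, nparity=2))  # (1, 2)   4K file, 10-wide raidz2
    print(raidz_blocks(72, width=10, nparity=2))  # (18, 6)  72K file, pre-expansion
    print(raidz_blocks(72, width=11, nparity=2))  # (18, 4)  72K file, post-expansion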
What are the errors? I tried to show exactly what you talk about.
edit: ok, I didn't consider the exact locations of the parity, I was only concerned with space usage.
The 8 data blocks need three stripes on a 3+2 RAID-Z2 setup both pre and post expansion, the last being a partial stripe, but written in the 4+2 setup they need only 2 full stripes, leading to more total free space.
Ok so the correlation coefficient for vote share is .048, meaning the R squared or explained variance is about .002, i.e. around 0.2 percent. No, monkeys cannot predict vote share.
That said, for a paper with such an obviously bullshit design, the text itself is not that bad; at least they seriously investigate the facial features that seem to be relevant in these cases.
If you must remember this paper (please don’t), remember it as “monkey gaze is a better predictor for election races in the US than a coin toss”.
I have a copy of "New Rules for the New Economy" by Kevin Kelly, signed as part of the Global Business Network that he and Stewart Brand founded a long time ago.
Having read Fred Turner's immensely great book "From Counterculture to Cyberculture", that is a valuable little piece of history to me.
Causality is a largely orthogonal problem to frequentist/bayesian - it makes everything harder, not just one of those!