Hopefully in my blogposts thus far you have learned that I love Joan of Arc, cycling, Margaret Beaufort and the Danish Tax Code while simultaneously I hate colonialism & the UK Tax Code. I also love Arsenal and hate Tottenham but that’s for another day.
It’s now time to return to the field that actually gets me paid. Namely, Causal Inference. I advise that, unless you’re already familiar with the field, you read my earlier post before reading further.
In this post, I will argue that the bulk of the potential for improvement in the field comes not from improving the methods of analysis, which is my main area of focus, but from the quality of the datasets we have available and, as a result, the flexibility of the study designs one can construct.
Areas around this theme have already been explored, such as in Donald Rubin’s paper “Design Trumps Analysis”. For this post, however, I shall write without having read Rubin’s paper and will instead offer up my own thoughts, derived from first-principles thinking and my experiences in the field thus far.
Prediction vs Causation
The field of statistics can be divided into two big subcategories: prediction-based statistics and causation-based statistics.
The former is focused on using datasets solely to predict the outcome of events, with no interest in the causal mechanism behind them. Examples might be using certain weather measurements to predict the next day’s weather for a weather forecast, or evaluating the likelihood that someone of a specific profile will have an accident for the purposes of insurance.
In contrast, causation-based statistics involves assessing how something is caused. In other words, whether variable X causes variable Y and, if so, by what magnitude. The way I think of causation philosophically is this: if I could magically alter variable X, without altering anything else aside from what altering X affects downstream, how would Y change in this hypothetical counterfactual? A good real-life example is administering a drug to someone: how will their prognosis compare with the case where I don’t administer the drug? Another example might be assessing the effect of passing a tax cut on the economy.
You’ve probably heard the maxim “correlation does not imply causation”, and this division of statistics largely reflects that. Correlation is more to do with prediction, while causation is what it says on the tin. My favourite example of this is living near powerlines. People who live near powerlines are more likely to get cancer, but this association/correlation is not a causal one. The powerlines aren’t doing something mysterious that increases your probability of getting cancer if you live there. It’s simply that people who live near powerlines are more likely to be from a poorer socioeconomic background, and those from poorer socioeconomic backgrounds are also more likely to get cancer. Socioeconomic background simultaneously has a causal effect on the likelihood of you living near a powerline and a causal effect on the probability that you will get cancer. We call socioeconomic background a confounder of living near a powerline and getting cancer: it produces a correlation/association between living near powerlines and getting cancer, but not a causal one.
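To make the powerline example concrete, here is a minimal simulation sketch in Python, with entirely made-up numbers rather than real epidemiology: socioeconomic status causally drives both powerline proximity and cancer risk, while powerlines have no effect on cancer at all, yet the association still shows up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder: socioeconomic status (higher = better off)
ses = rng.normal(size=n)

# SES causally affects both variables; powerlines have NO effect on cancer here
lives_near_powerline = rng.random(n) < 1 / (1 + np.exp(2.0 * ses))  # poorer -> more likely
gets_cancer = rng.random(n) < 1 / (1 + np.exp(1.0 * ses + 3.0))     # poorer -> more likely

# Despite the absence of any causal link, cancer rates differ by proximity
print(f"cancer rate near powerlines: {gets_cancer[lives_near_powerline].mean():.3f}")
print(f"cancer rate away from them:  {gets_cancer[~lives_near_powerline].mean():.3f}")
```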
A big challenge of statistics is that these two fields don’t really like each other. They interact like matter and antimatter. This is because the best predictive models often involve tonnes of variables, which makes it nigh-on impossible to work out the underlying causal diagram generating the data we observe.
Two-Stage Process of Causal Inference:
The way one approaches causal inference is quite different and usually involves a two-stage process. The first stage involves making some assumptions or postulates (hopefully with some firm reasoning to back them up!) about the causal diagram underlying the process that generates the data. An example of such assumptions might be the instrumental variable assumptions. The second stage involves, taking these assumptions as true, conducting a statistical analysis that allows one to conclude whether, based on the statistical evidence and the assumptions made, there is a causal effect and, if so, of what magnitude. An example here is conducting an instrumental variable analysis. Once the IV assumptions are made, any association between the instrumental variable and the outcome can only be explained by a causal effect of the exposure on the outcome. This reduces our statistical analysis to one of assessing whether two variables are associated with each other.
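As a rough illustration of what stage two can look like once the IV assumptions are granted, here is a minimal sketch on simulated data. The data-generating process, effect sizes and variable names are all invented for illustration; the point is just that the instrument-based ratio estimate recovers the causal effect where a naive regression does not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Invented data-generating process, purely for illustration
u = rng.normal(size=n)                 # unobserved confounder of X and Y
z = rng.normal(size=n)                 # instrument: affects Y only through X, unrelated to u
x = 0.8 * z + u + rng.normal(size=n)   # exposure
y = 0.5 * x + u + rng.normal(size=n)   # outcome; true causal effect of X on Y is 0.5

# Naive regression slope of Y on X is biased by the confounder u
naive = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV (ratio / Wald) estimate: Z-Y association divided by Z-X association
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"naive slope: {naive:.2f}, IV estimate: {iv:.2f}, truth: 0.50")
```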
This type of reasoning in causal inference is called abductive reasoning, and it turns the field into an art as much as a science in many respects. Key to assessing the validity of these stage one assumptions, and therefore to concluding that causality (or the lack of it) is the most plausible explanation for the data we see, is often Occam’s Razor: the simplest explanation is the one most likely to be true. This makes intuitive sense, since the fewer moving parts something has, the easier it is to build and/or for it to exist at all. Without Occam’s Razor, it becomes very hard (unless you’re a far more sophisticated philosopher than I am) to argue that the data we observe, and the stage two findings we draw from it, are best explained by causality.
Of course, the criticality of these stage one assumptions is what makes the design of the study and gathering of data so important, as we will now see.
So while causal inference requires both stages of this process, prediction in some sense is only interested in the second stage of this process, namely the statistical analysis. Prediction just takes the results of stage two to see how data can predict an outcome, rather than using the results of stage two combined with the assumptions or postulates of stage one to make a causal inference.
Importance of Study Design in Causal Inference
The reason those in causal inference love randomised controlled trials and call them the gold standard is that they are designed in such a way as to guarantee that any association between our risk factor (call it X) and outcome (call it Y) must be causal. We need not worry about whether the key assumptions from the first stage of the two-stage process I mentioned earlier hold; in an RCT context, they do by design. We have therefore reduced our causal inference problem down to a standard statistical problem of the kind seen in prediction, where one needs to assess whether an association exists based on the data we have and, if so, what precisely that association is.
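A small simulated sketch of why randomisation does so much work (all numbers invented): because treatment is assigned independently of every prognostic factor, a plain difference in means between arms recovers the causal effect with no adjustment at all.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

u = rng.normal(size=n)                  # prognostic factor we never get to observe
treated = rng.random(n) < 0.5           # randomised, so independent of u by construction
y = 1.0 * treated + 2.0 * u + rng.normal(size=n)   # true treatment effect is 1.0

# A simple difference in means recovers the causal effect without adjusting for u
effect = y[treated].mean() - y[~treated].mean()
print(f"estimated effect: {effect:.2f} (truth: 1.00)")
```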
Unfortunately, most real-world problems do not allow for randomised controlled trials, certainly not unless you’re willing to put ethics aside and turn into Josef Mengele! This means we have to work with alternative methods, such as instrumental variable methods like Mendelian Randomisation, that rely on certain assumptions which one hopes are likely to hold but which are not guaranteed to.
This is where the study design becomes all too important! The study design is how one generates and/or collects the data that we analyse. It is this that determines the likelihood of our first stage assumptions holding and thus of our being able to draw valid causal conclusions. A randomised controlled trial is the ideal study design, but when that isn’t possible, one has to look for alternatives such as an observational study that collects data from the real world, hopefully on variables that are likely to satisfy the instrumental variable assumptions, making causal analysis via that method possible.
This is absolutely foundational to making correct causal conclusions, and it isn’t something that can be overcome to any great extent by improved analytical methods. Methods designed to be somewhat robust to pleiotropy, such as MR-Egger, have been developed, but they can only go so far, and they usually come at a significant cost in power.
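For a flavour of that robustness-versus-power trade-off, here is a rough sketch using simulated per-variant summary statistics (invented numbers, not any real dataset): an IVW-style regression with no intercept is biased when there is directional pleiotropy, while MR-Egger’s extra intercept soaks it up at the cost of a noisier slope.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
m = 50                                    # number of genetic variants

# Simulated per-variant summary statistics (invented, not real GWAS data)
beta_x = rng.uniform(0.05, 0.3, m)        # variant-exposure associations
pleio = rng.normal(0.02, 0.01, m)         # directional pleiotropic effects on the outcome
se_y = np.full(m, 0.02)
beta_y = 0.4 * beta_x + pleio + rng.normal(0, se_y)   # true causal effect is 0.4

# IVW-style fit (no intercept) assumes no directional pleiotropy, so it is biased here
ivw = sm.WLS(beta_y, beta_x, weights=1 / se_y**2).fit()

# MR-Egger adds an intercept to absorb directional pleiotropy, at a cost in precision
egger = sm.WLS(beta_y, sm.add_constant(beta_x), weights=1 / se_y**2).fit()

print("IVW slope:", ivw.params[0].round(2))
print("Egger intercept and slope:", egger.params.round(2))
```

In practice one would reach for a dedicated MR package rather than rolling the regressions by hand, but the shape of the trade-off is the same.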
Design in Prediction vs Causation
We shall now look at where study design matters when focusing on prediction compared to when one is focusing on causation.
Interpolation vs Extrapolation:
Perhaps the most important consideration around study design in prediction is ensuring that any sensible prediction involves interpolation, as opposed to extrapolation.
Having analysed a dataset, one may want to make a prediction, or a causal inference, for a new data entry whose outcome is missing, where that entry is believed to come from the same system that generated the data used in the analysis. Interpolation is when the new data entry lies within the region of the data space covered by the data used in the predictive or causal analysis. Extrapolation is when it does not. The latter is far less reliable because there is no guarantee that the pattern in one region of the data space continues into another region. For example, a relationship that is linear in one part of the data space may become non-linear elsewhere.
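A toy sketch of this (invented numbers): a straight line fitted to data that only happen to look linear over the observed range predicts well inside that range and badly outside it.

```python
import numpy as np

rng = np.random.default_rng(4)

# True relationship is quadratic, but we only ever observe x between 0 and 2,
# where a straight line happens to fit rather well
x_train = rng.uniform(0, 2, 200)
y_train = x_train**2 + rng.normal(0, 0.1, 200)

slope, intercept = np.polyfit(x_train, y_train, 1)   # fit a straight line

print(f"interpolation at x=1.5: {slope * 1.5 + intercept:.2f} (truth {1.5**2})")
print(f"extrapolation at x=5.0: {slope * 5.0 + intercept:.2f} (truth {5.0**2})")
```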
Therefore, for good prediction, it’s important that one designs the study such that one is using interpolation to predict outcomes and not extrapolation. This is an important feature of the study design when making causal inferences as well.
Other Model Assumptions:
Predictive models often come with other assumptions as well, intended to ensure consistency (i.e. that the estimates converge to the truth) and asymptotic efficiency of the model. However, machine learning models, especially those trained on large samples, are usually extremely flexible in practice and quite robust to these assumptions breaking down.
The challenge causal inference has is that it needs a good study design not just to ensure the validity of the stage two analysis (which is what prediction is about) but also to ensure the stage one postulates are valid. In this sense, causal inference is strictly more reliant on good study design than prediction statistics is. Prediction only has to worry about good study design in respect of stage two; causal inference has to worry about it for both stage one and stage two.
Given any dataset, one can usually make predictions from it, even if the variables in the dataset aren’t the best predictors of what we’re interested in and hence the final prediction is quite shoddy or prone to noise. However, even if this prediction isn’t super informative, it is usually still a valid prediction, just not a strong one. In contrast, in causal inference, a poor dataset can make it fundamentally impossible to come up with valid stage one postulates, without which no causal conclusions can be reached. In that sense, assuming one is using interpolation and not extrapolation, poor datasets don’t jeopardise the fundamental integrity of the analysis when making predictions in the same way they do when analysing causation.
Implications of Design Importance for the Field
Having good study designs that make our stage one postulates valid is so foundational to the success of the field that I would genuinely give it an order of magnitude more importance than other areas of the field such as improving methods used in the stage two phase.
If I were given a pot of money to improve the field of causal inference and wasn’t allowed to spend it on myself, I would absolutely devote up to 90% of it to gathering improved and wider datasets.
Broader datasets would likely open up more avenues for research into different medical conditions. Moreover, even in areas of MR research that already have datasets available, improved datasets that make the stage one assumptions (such as the instrumental variable assumptions) more reliable would mean one doesn’t need to deploy methods that trade efficiency for robustness to the breakdown of those assumptions. In other words, improved datasets would allow more efficient (and less robust) analysis methods to be deployed at stage two, improving the power of our analysis. This is likely to yield far bigger gains in power than making marginal improvements to robust methods designed to handle the breakdown of stage one assumptions in lower-quality datasets.
Furthermore, in the Mendelian Randomisation setting specifically (recall that MR uses genetic variants as instrumental variables), challenges arise from the fact that biobanks are often restricted to certain demographics and geographic regions of the world. While restricting to certain demographics might reduce issues with assortative mating, and hence make the IV2 Exchangeability assumption more reliable, this probably comes at a massive loss of power. Moreover, datasets like UK Biobank that focus on specific demographics will inevitably skew research towards the medical challenges faced by the demographics overrepresented in those datasets, at the expense of those who are underrepresented. Investing in datasets that gather data from underrepresented demographics, both at home (in the UK, Europe, North America etc) and, at the global level, from other nationalities, is important to ensure their most pressing medical challenges also get attention.
Conclusion
Research into improving the methods used in Mendelian Randomisation analysis (and causal inference analysis more generally) is always welcome. You never know when an improved method might be decisive in making a key causal finding with a given dataset. However, while this remains important, far more important is improving the quality of the datasets used in causal inference so as to enable better study designs. The gains from this are likely to be far greater than those from research into improved analysis methods, and this should be the priority for the resources available to the field.