I studied Design of Experiments (DOE) in college -- it was part of the core curriculum in my major. The content feels like it ought to be very useful (it's basically techniques for conducting experiments efficiently and reducing the number of runs, thus reducing overall cost)... yet I've never had occasion to apply it.
One area where I feel it might be relevant to my work is in hyperparameter tuning, especially when function evals are expensive. Instead of doing grid search (which is a brute-force exhaustive search), fractional factorial designs can search the multidimensional space with far fewer function evals. Yet I don't do it because it's easier to write code to do brute force and let the computer chug away overnight.
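To make that concrete, here's a minimal sketch (Python, with made-up hyperparameters and levels) of the kind of 2^(4-1) half fraction I have in mind: eight runs instead of sixteen, with the fourth factor deliberately confounded with the three-way interaction.

    import itertools
    import numpy as np

    # Hypothetical hyperparameters, each coded at a low (-1) and high (+1) level
    levels = {
        "learning_rate": (1e-4, 1e-2),
        "batch_size":    (32, 256),
        "dropout":       (0.0, 0.5),
        "weight_decay":  (0.0, 1e-3),   # generated column: D = A*B*C
    }

    # Full factorial in the first three factors...
    base = np.array(list(itertools.product([-1, 1], repeat=3)))
    # ...then the fourth factor is aliased with the three-way interaction
    d_col = base.prod(axis=1, keepdims=True)
    design = np.hstack([base, d_col])          # shape (8, 4), coded -1/+1

    # Map the coded levels back to actual hyperparameter values
    names = list(levels)
    runs = [{n: levels[n][0] if v < 0 else levels[n][1] for n, v in zip(names, row)}
            for row in design]
    for r in runs:
        print(r)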
In the world of web analytics, fractional factorial lets you run multivariate experiments efficiently rather than a full combinatorial grid of A/B tests. From online articles, I'm guessing people do use it there, but A/B tests still dominate.
Fractional factorial is also most effective when the covariates are orthogonal and independent, which is rarely the case. We can get around this by projecting high dimensional covariates into lower dimensional space using PCA, which guarantees orthogonality, but this also seems to not be done so much.
Just wondering if anyone is using fractional factorial designs in real life? (or optimal designs like D-optimal designs)
I've at times made fairly extensive use of d-optimal designs in formulation type research and development, usually in an iterative process, pruning and adding variables as well as refining ranges. Lots of footguns, but the approach can drastically decrease optimization time. In my experience it needs the right combination of: too many factors to make full factorial reasonable, too much (expected) covariance to trust single-variable optimization, and cheap enough to commit to doing 8+ experiments before a major evaluation.
Mainly around bounding variables, especially when coupled with overly ambitious scope/number of tests. The short version is that it's really easy to wind up with, say, an 80-experiment test where half or more of the combinations are invalid or perform poorly to the degree that they don't adequately lend confidence to the prediction in the region of the global maximum. That leads to wasted work, and the poor predictive ability doesn't lead well into focused future work. There are also the usual statistical footguns of p-hacking and such.
For a concrete example, consider a bread baking optimization over time, temperature, and baking soda %. This gives a cubic design space, for which the naively optimal design for 9 experiments (ignoring the expected covariance of temperature*time) is the cube's corners plus the center point. If, for instance, your predetermined t_max always results in a briquette instead of a loaf, ~half of your experimental data is going to be worthless.
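(For concreteness, that naive 9-run design is just the coded cube's corners plus the center point; a quick sketch, with coded -1/0/+1 levels standing in for the real time/temperature/baking-soda ranges:)

    import itertools
    import numpy as np

    # 2^3 corners of the coded design cube plus the center point = 9 runs
    corners = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)
    center = np.zeros((1, 3))
    design = np.vstack([corners, center])   # columns: time, temperature, baking soda %
    print(design)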
Even with more nuance, where temp_max is genuinely the highest reasonable temperature, there are still two problems:
a. covariance is likely to drive combinations into 'invalid' territory (e.g. temp_max + time_max is likely to be invalid, which is 2/9ths of your experimental data; temp_min + time_min is also likely invalid, for another 2/9ths).
b. predictive power / linearity of response over the ranges specified. Even if the combinations aren't invalid, if all the min/max combinations perform poorly (due to overly wide, but valid, boundaries) you can wind up with poor predictive performance.
Covariance can be accounted for, design space trimmed (e.g. to a cube with a corner or two cut off), and bounds set conservatively but that is all tricky manual intervention that relies on knowledge of the problem domain and scaling factors of the underlying physics. When the problem isn't well understood it is easy to make errors in assumptions, those errors have a high cost, and if the problem was well understood a DOE probably wouldn't be necessary.
edit: for a less trivial example of a suitable problem for a d-optimal DOE, but with tricky bounding / underlying physics, consider: a 4 part formulation of fumed silica, cyanoacrylate, isopropanol, and water to make a gap-filling/quick-setting adhesive.
I'm in biotech (as a software engineer/data scientist) and here this stuff is used a lot. Experiments are very hands on and expensive in terms of inputs so you want to be efficient with your use of them.
Bayesian approaches (e.g. optimization with Gaussian processes) tend to be very serial in nature: later sampling points are decided from the information you get earlier on. But these biological experiments can take days or weeks so you want to run them in parallel as much as possible. DoE is excellent for that.
I'm speccing out a D-optimal design as we speak (though I'm not sure this is what we will use for the particular project I'm on).
> Fractional factorial is also most effective when the covariates are orthogonal and independent, which is rarely the case. We can get around this by projecting high dimensional covariates into lower dimensional space using PCA, which guarantees orthogonality, but this also seems to not be done so much.
I guess you're right in that if there were very many interactions between variables, then the aliasing of a fractional design would be an incredible nuisance, but on the other hand if there were no dependence between any of the variables at all then there would be no point to any sort of factorial design as you could just test each variable sequentially.
> Just wondering if anyone is using fractional factorial designs in real life? (or optimal designs like D-optimal designs)
I have to admit, I have never ever seen anyone calculate a D-optimal or G-optimal design for anything -- that said, I only graduated as a statistician 6 or 7 years ago. As you say, there might be some use to it for continuous variables, but for factors, latin squares (etc.) are known to be optimal, so you can just grab those off the shelf.
I used a lot of these techniques as an operations research analyst in the early 2010s.
I also wondered the same thing about DoE for hyperparameter tuning. Always felt like it was a case of grid search/random search/bayes opt being “good enough” and easier. But maybe DoE would be worth it for something like LLMs where training runs for a month on 25k GPUs.
I think in many industrial settings Bayesian optimisation is more practical.
It allows us to refine the design on the fly and put resources into promising areas of the space earlier. Some kind of pre-planned fractional design could be useful for generating an initial set of points to test.
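A rough sketch of what I mean (my own toy example, using sklearn's GP; the objective is just a stand-in for a real experiment): seed with a 2^(3-1) fraction, then let the model pick the next point.

    import itertools
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    # Initial batch: a 2^(3-1) fractional factorial over a coded [-1, 1]^3 space
    a, b = np.array(list(itertools.product([-1.0, 1.0], repeat=2))).T
    init_X = np.column_stack([a, b, a * b])        # generator C = AB -> 4 runs

    def objective(x):                              # stand-in for the real experiment
        return -np.sum((x - 0.3) ** 2, axis=1)

    y = objective(init_X)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(init_X, y)

    # Choose the next run by an upper-confidence-bound rule over random candidates
    cand = np.random.uniform(-1, 1, size=(1000, 3))
    mu, sd = gp.predict(cand, return_std=True)
    print(cand[np.argmax(mu + 1.96 * sd)])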
For coffee aficionados out there: picking the best coffee setup settings is very similar to a physics experiment, and you can totally use the knowledge of experiment design theory there.
Settings:
* Kinds of beans at various levels of roasting available to you.
* Grind size on a grinder with controllable grind size in steps.
* Ratio of ground coffee weight to water weight used for brewing.
* For filter coffee: preinfusion time, brew time or flow rate (see V60 brewer).
* For espresso coffee: preinfusion time & pressure, brew time & pressure or flow rate (see Flair Espresso).
* Water temperature.
* Water mineral composition.
* For milk-based coffee drinks: kinds of milk, milk percentage as ratio of the total drink volume, steaming duration, final milk temperature.
All of these create a huge factorial space of possible configurations, especially if you go into flow or pressure profiling. If you frame it as an experiment, you might isolate some of the variables that make the most impact, tailored to your coffee preferences.
This article is actually in response to that video by NightHawkInLight. The method used in the video is, however, different from the one explained in the article; at least that is what the first paragraphs say, as I did not read it completely.
Something about this doesn't make sense - from the C=A case:
> we wouldn’t be able to tell the difference in results between only doing option B (doubling the butter) and doing all three options: adding an egg, doubling the butter, and adding nuts on top.
However, the experiments listed include exactly those 2 experiments (Test 1 and Test 3 in the table with a column titled C=A), so unless I'm missing something, "wouldn't be able to tell the difference" doesn't mean what I expect it to.
As best I can tell from reading https://www.stat.purdue.edu/~yuzhu/stat514s2006/Lecnot/fracf...
the unstated assumption here is that we're going to do linear regression where (critically) the "-" case for each condition is -1, and the "+" is +1. This has the surprising-to-me effect of making "I", which looks like it might be the control group based on notation, actually a positive recipient of AC interaction (and any even-order interaction). You can think of this as a change in basis in how you parse out the effects, where we're talking about
  [ 1 -1]
  [-1  1]
(like a covariance matrix) instead of
  [0 0]
  [0 1]
for an interaction.
I have a gut feeling it's done this way mostly because the tools being used expect things to be expressed this way rather than any conscious choice by experimenters. Through this lens, if you test
I, B, AC, ABC
every experiment has a positive effect from AC interaction, and taking B-I, which we might think of as the effect from B, is in this paradigm also sensitive to the ABC interaction and the AB and BC interactions. The "real" effect from B would be approximated as (B + ABC - AC - I)/2, which is exactly the same as the effect from ABC interaction (which is positive when an odd number of its constituents are positive...).
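If it helps, a quick numeric check of that aliasing (my own sketch, same four runs, listed letters at +1 and everything else at -1):

    import numpy as np

    runs = {
        "I":   dict(A=-1, B=-1, C=-1),
        "B":   dict(A=-1, B=+1, C=-1),
        "AC":  dict(A=+1, B=-1, C=+1),
        "ABC": dict(A=+1, B=+1, C=+1),
    }
    A = np.array([r["A"] for r in runs.values()])
    B = np.array([r["B"] for r in runs.values()])
    C = np.array([r["C"] for r in runs.values()])

    # The contrast column for B and for the ABC interaction are identical,
    # so "the effect of B" estimated from these runs is really B + ABC.
    print(B)           # [-1  1 -1  1]
    print(A * B * C)   # [-1  1 -1  1]

So the (B + ABC - AC - I)/2 contrast is literally the same number whether you call it the effect of B or the ABC interaction.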
I'm pretty sure this is just a difference in mathematical perspective - you can represent exactly the same data, but the coefficients (i.e. effect values) will change, and there's a different notion of what you know vs don't know. Maybe there's a more convincing reason to do this when you have more than two "levels", but from the presentation in TFA it just feels like overcomplicating things with a confusing prior about how effects work.
It also seems like the given example is just bad. If the parameters are numeric and there's not a reasonable "control", this perspective feels much more natural.
The model being fit is something like taste = f_I + f_A*v_A + f_B*v_B + f_C*v_C + f_AB*v_A*v_B + f_AC*v_A*v_C + f_BC*v_B*v_C + f_ABC*v_A*v_B*v_C, where v_A is 1 or -1 depending on whether A is present or absent in the experiment.
f_AB is being abbreviated to AB, which is causing some confusion, since when heading a column, AB means v_A*v_B. The article should say that we can't tell the difference between the effect associated with B and the effect associated with the 3-way interaction (for this definition of the effect associated with the 3-way interaction).
> Lastly, I = AC means that we can’t tell the difference between doing nothing at all, compared to adding an egg and adding nuts on top.
But test 0 and test 2 are doing nothing and adding an egg & nuts respectively. What are we missing here?
On further reading, it seems like "doing all three options" should be interpreted to mean "the interaction of all three options"? So we aren't able to tell if an improvement of taste comes from B alone or from the interaction of A, B, and C. I'm mostly guessing though.
Wow. As an experiment-design nerd, this may be one of my personal top favorite HN posts of all time.
I don't 100% get the generation part yet, but I believe that's just me needing practice.
A long time ago back in New York, Grace and Trevor from Javelin (Née Lean Startup Machine) taught me the scientific method over the course of a weekend workshop.
Is it just me or does this require background knowledge that isn't widely available? I couldn't get through it as it never seemed to really explain what Aliases or Alias structures are, how determining those aliasing structures works (why multiply by C?), how this primary effect thing works and how any of this actually relates to the experiments in any way. As in, like, ok I do tests 1-4, now how do I turn that into insight with this algebra?
If someone could elaborate how this all fits together that would be great, because it does seem like there's some nuggets of insight to be found here
Let's say you're testing different formulations for cake but because you don't have time to do the full cartesian product of all possible variations over sugar content x fat content x type of flour, you just pick a handful, and it turns out that every time you're testing a high-sugar variation you are also testing spelt flour -- that's an alias because if you really like these particular cakes there's no way to know whether it was the spelt or the sugar that did it.
What experimental design brings to the table is a disciplined way to figure out what variables will be aliased and to make sure that these aliases are mostly harmless, either with the help of subject matter knowledge (you can taste sugar so it's fine if it's aliased with other things) or mathematically (let's try to avoid aliasing sugar content with fat content, but instead alias sugar content with the interaction effect of sugar, fat and flour type because that higher-order interaction is unlikely to matter over and above the first-order effects.)
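To put that in code (factor names purely illustrative): in a half fraction generated by flour = sugar * fat, the sugar column is identical to the fat x flour interaction column, so those two effects can never be separated.

    import itertools
    import numpy as np

    # Toy 2^(3-1) half fraction: sugar and fat at -1/+1, flour set by the generator
    sugar, fat = np.array(list(itertools.product([-1, 1], repeat=2))).T
    flour = sugar * fat

    # Defining relation I = sugar*fat*flour, so sugar is aliased with fat*flour
    print(np.array_equal(sugar, fat * flour))   # True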
Fractional designs in particular are typically used in agriculture and industrial settings, places where you want to try to optimize over very many factors at once but cannot afford to test every variation. It is not common in web analytics because (1) we usually assume that one particular change to a site or app will be independent of another change elsewhere, so there is no need to test them concurrently to see if a particular combo stands out and (2) if we did want to test combinations of variables concurrently, there's usually enough users or visitors to just test the entire grid and not worry about picking a selection of variations.
It provides (in some sense) maximally un-aliased parameter choices when trying to cover a (continuous) parameter space. It generalizes to arbitrary dimensions, doesn't require you to choose the number of samples/experiments beforehand, and actually expands trivially to parameters representing classes (including possibly weighting the classes).
I think it would be interesting to see how using this sequence fares in such a fractional factorial analysis, i.e. how close to optimal the (e.g. binarized) pseudorandom parameter choices are for different numbers of experiments!
I have studied at least a little combinatorics, and the language here is completely impenetrable to me. Neither the question of "what property is this design supposed to have?" nor "why does it have it?" is stated in plain or even mathematical English. The wiki entry is similarly unhelpful.
For whatever reason, much of the community that does DOEs is beset with heavy use of opaque jargon and flagrant use of "canned tests" from software packages like Minitab.
Making statistical inferences from data is nuanced and complicated if the thing you're interested in doesn't just pop-out from simple descriptive statistics treatments (histograms, box-plots, scatterplots, simple linear models).
There are some islands of sanity:
- NIST has a great resource (https://www.itl.nist.gov/div898/handbook/index.htm). Don't be alarmed by the retro HTML appearance. It's actually meticulously maintained and contains reproducible examples with data. The plain language is such a relief compared to other references.
- "Statistics for Experimenters" the book by George Box et al, is a great resource that's fairly comprehensive in my opinion. Good, clear writing. It seems to be widely available online (don't know if that's intentional).
In a full factorial design you’re able to calculate the main effects and all the interactions between everything and everything else: not just pairwise but all combinations of variables.
You may not care for that because you don’t think there are such interactions or perhaps you just don’t have the experimental budget.
In fractional designs you can choose to confound some of the interactions so that you can't tell what's happening beyond a certain level. E.g. just main and pairwise effects, with the rest confounding those effects. You would do this because it requires fewer runs.
You can also use it as a sort of screener when you have many variables by confounding things together in such a way that you can at least tell which variables are having the effects even if you can’t calculate the individual effects.
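As a small illustration (my own sketch): in a 2^(4-1) half fraction with generator D = ABC, the main effects are clear of two-way interactions, but the two-way interactions are confounded with each other in pairs, which is exactly the "can't tell beyond a certain level" trade-off.

    import itertools
    import numpy as np

    # 2^(4-1) half fraction, coded -1/+1, with the generator D = A*B*C
    A, B, C = np.array(list(itertools.product([-1, 1], repeat=3))).T
    D = A * B * C

    # Two-way interactions are confounded in pairs: AB=CD, AC=BD, AD=BC
    print(np.array_equal(A * B, C * D))   # True
    print(np.array_equal(A * C, B * D))   # True
    print(np.array_equal(A * D, B * C))   # True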
JMP is also a criminally underrated and flexible tool to do data analysis with – both exploratory and modelling. It's my preferred tool over R, especially after they added a structural equation modelling platform.
It's a pity they got rid of pricy perpetual licensing quite a few years ago for a no less pricy subscription model.
For biology, Synthace supports designing experimental protocols using DOE, executing them on robotically automated lab equipment, then performing analysis on the results: https://www.synthace.com/
The operator defined is associative and commutative and doing it twice gives you the identity. So to me, it seems a strange choice to notate this as some kind of word algebra; it feels much more natural to use binary vector notation.
In a later example, the post says ABD*BCE = ACDE. Let's translate this to binary vectors by putting a 1 when the letter is present, 0 when absent, and changing * to +. Then it becomes 11010 + 01101 = 10111. Clearly the * operator being discussed is XOR of binary vectors, which is why I choose to use + instead.
Now I'll attempt to translate the post into more standard linear algebra terminology. If you write the table rows as binary vectors, "setting up A and B and leaving C blank" is essentially setting up a 2x3 matrix (in general MxN) and leaving the third column blank, filling in the leftmost 2x2 (MxM) square with an identity matrix. Regardless of how you fill in C, this gives you a rank M matrix in reduced row echelon form.
Then "decide on a formula for how to set C’s value" is basically setting the remaining column(s) as some linear function of the original columns.
Finding the aliases amounts to solving for the nullspace; the large example gives the system (mod 2):
x0 + x3 = 0
x1 + x3 + x4 = 0
x2 + x4 = 0
You can find all solutions by setting x3 and x4 as free variables; x0, x1, x2 are then completely determined:
x0 = x3
x1 = x3 + x4
x2 = x4
(This looks like a sign mistake, but because + represents XOR, we can freely switch between negative and positive, as addition and subtraction are the same modulo 2.) You proceed by assigning all possible values to x3 and x4, obtaining a nullspace of {00000, 01101, 11010, 10111}, which matches the first row of the table (I, BCE, ABD and ACDE). The remaining rows of the table are obtained by adding that nullspace to every element of the span of the original matrix. (Of course this hits all 32 possible combinations, there is a fairly easy-to-prove linear algebra theorem guaranteeing this.)
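Here's that coset computation as a quick Python sketch (my own translation of the above, nothing beyond the standard library):

    import itertools

    factors = "ABCDE"

    def word_to_vec(word):
        return tuple(1 if f in word else 0 for f in factors)

    def vec_to_word(vec):
        return "".join(f for f, bit in zip(factors, vec) if bit) or "I"

    def xor(u, v):
        return tuple(a ^ b for a, b in zip(u, v))

    # Defining words (generators) from the example: I = ABD = BCE
    generators = [word_to_vec("ABD"), word_to_vec("BCE")]

    # The defining contrast subgroup: all XOR combinations of the generators
    subgroup = set()
    for r in range(len(generators) + 1):
        for combo in itertools.combinations(generators, r):
            v = (0,) * len(factors)
            for g in combo:
                v = xor(v, g)
            subgroup.add(v)
    print(sorted(vec_to_word(v) for v in subgroup))   # ['ABD', 'ACDE', 'BCE', 'I']

    # The alias class of any effect is its coset under that subgroup
    def alias_class(word):
        w = word_to_vec(word)
        return sorted(vec_to_word(xor(w, g)) for g in subgroup)
    print(alias_class("A"))   # ['A', 'ABCE', 'BD', 'CDE']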
These are some abstract computations on binary vectors, why are they relevant to the real world? It mystified me for a bit, but I think the answer is that if two vectors x, y are aliases, we've set things up so that y = x+z for some z in the nullspace of our matrix M, which represents some linear function f(). Then f(y) = f(x+z) = f(x)+f(z) = f(x). The function f() should be related to our experiment and the fact f(y) = f(x) should be related to the idea "our setup can't distinguish between x and y" but it's not 100% clear to me how this conclusion follows. Any stats experts care to chime in?
It didn't discuss at all how to pick generators, but I would guess (1) you want all variables to be varied at least once, and (2) you want alias classes to contain preferably at most one "low-entropy" entry (where "low-entropy" combinatorially means a low popcount), because a priori a simpler explanation is likelier than a complex one (Occam's Razor), plus possibly some application-specific context.
Dang, I was hoping the article would be about non-traditional, higher-level fractional factorial designs. Like, for example, you have an experiment with 3 factors at 5, 4, and 3 levels each but can only run 20 experiments.