
Transfer Learning for Estimating Causal Effects using Neural Networks

Sören R. Künzel∗ (UC Berkeley)

Bradly C. Stadie∗ (UC Berkeley)

Nikita Vemuri (UC Berkeley)

Varsha Ramakrishnan (UC Berkeley)

Jasjeet S. Sekhon (UC Berkeley)

Pieter Abbeel (UC Berkeley)

Abstract

We develop new algorithms for estimating heterogeneous treatment effects, combining recent developments in transfer learning for neural networks with insights from the causal inference literature. By taking advantage of transfer learning, we are able to efficiently use different data sources that are related to the same underlying causal mechanisms. We compare our algorithms with those in the extant literature using extensive simulation studies based on large-scale voter persuasion experiments and the MNIST database. Our methods can perform an order of magnitude better than existing benchmarks while using a fraction of the data.

1 Introduction

The rise of massive datasets that provide fine-grained information about human beings and their behavior provides unprecedented opportunities for evaluating the effectiveness of treatments. Researchers want to exploit these large and heterogeneous datasets, and they often seek to estimate how well a given treatment works for individuals conditioning on their observed covariates. This problem is important in medicine (where it is sometimes called personalized medicine) (Henderson et al., 2016; Powers et al., 2018), digital experiments (Taddy et al., 2016), economics (Athey and Imbens, 2016), political science (Green and Kern, 2012), statistics (Tian et al., 2014), and many other fields. A large number of articles are being written on this topic, but many outstanding questions remain. We present the first paper that applies transfer learning to this problem.

In the simplest case, treatment effects are estimated by splitting a training set into a treatment and a control group. The treatment group receives the treatment, while the control group does not. The outcomes in those groups are then used to construct an estimator for the Conditional Average Treatment Effect (CATE), which is defined as the expected outcome under treatment minus the expected outcome under control given a particular feature vector (Athey and Imbens, 2015). This is a challenging task because, for every unit, we either observe its outcome under treatment or control, but never both. Assumptions, such as the random assignment of treatment and additional regularity conditions, are needed to make progress. Even with these assumptions, the resulting estimates are often noisy and unstable because the CATE is a vector parameter. Recent research has shown that it is important to use estimators which consider both treatment groups simultaneously (Künzel et al., 2017; Wager and Athey, 2017; Nie and Wager, 2017; Hill, 2011). Unfortunately, these recent advances are often still insufficient to train robust CATE estimators because of the large sample sizes required when the number of covariates is not small.

∗These authors contributed equally to this work.

Preprint. Work in progress. Code not available. Code will be released with conference version of the paper.

arXiv:1808.07804v1 [stat.ML] 23 Aug 2018

However, researchers usually fail to use ancillary datasets that are available to them in applications. This is surprising, given the need for additional data to estimate CATE reliably. These ancillary datasets are related to the causal mechanism under investigation, but they are also partially distinct so they cannot be pooled naively, which explains why researchers often do not use them. Examples of such ancillary datasets include observations from: experiments in different locations on different populations, different treatment arms, different outcomes, and non-experimental observational studies. The key idea underlying our contributions is that one can substantially improve CATE estimators by transferring information from other data sources.

Our contributions are as follows:

1. We introduce the new problem of transfer learning for estimating heterogeneous treatment effects.

2. We develop the Y-learner for CATE estimation. We consider the problem of CATE estimation with deep neural networks. We propose the Y-learner, a CATE estimator designed from the ground up to take advantage of deep neural networks’ ability to easily share information across layers. The Y-learner often achieves state-of-the-art performance on CATE estimation. The Y-learner does not use transfer learning.

3. MLRW Transfer for CATE Estimation adapts the idea of meta-learning regression weights (MLRW) to CATE estimation. Using these learned weights, regression problems can be optimized much more quickly than with random initializations. Though a variety of MLRW algorithms exist, it is not immediately obvious how one should use these methods for CATE estimation. The principal difficulty is that CATE estimation requires the simultaneous estimation of outcomes under both treatment and control, when we only observe one of the outcomes for any individual unit. However, most MLRW transfer methods optimize on a per-task basis to estimate a single quantity. We show that one can overcome this problem with clever use of the Reptile algorithm (Nichol et al., 2018). While adapting Reptile to work with our problem, we discovered a slight modification to the original algorithm. To distinguish this modification, we refer to it in this paper as SF Reptile, Slow-Fast Reptile.

4. We provide several additional methods for transfer learning for CATE estimation: warm start, frozen-features, multi-head, and joint training.

5. We apply our methods to difficult data problems and show that they perform better than existing benchmarks. We reanalyze a set of large field experiments that evaluate the effect of a mailer on voter turnout in the 2014 U.S. midterm elections (Gerber et al., 2017). This includes 17 experiments with 1.96 million individuals in total. We also simulate several randomized controlled trials using image data of handwritten digits found in the MNIST database (LeCun, 1998). We show that our methods, MLRW in particular, obtain better than state-of-the-art performance in estimating CATE, and that they require far fewer observations than extant methods.

6. We provide open source code for our algorithms.²

2 CATE Estimation

We begin by formally introducing the CATE estimation problem. Following the potential outcomes framework (Rubin, 1974), assume there exists a single experiment wherein we observe $N$ i.i.d. units from some super population, $(Y_i(0), Y_i(1), X_i, W_i) \sim P$. $Y_i(0) \in \mathbb{R}$ denotes the potential outcome of unit $i$ if it is in the control group, $Y_i(1) \in \mathbb{R}$ is the potential outcome of $i$ if it is in the treatment group, $X_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector, and $W_i \in \{0, 1\}$ is the treatment assignment. For each unit in the treatment group ($W_i = 1$), we only observe the outcome under treatment, $Y_i(1)$. For each unit under control ($W_i = 0$), we only observe the outcome under control. Crucially, there cannot exist overlap between the set of units for which $W_i = 1$ and the set for which $W_i = 0$. It is impossible to observe both potential outcomes for any unit. This is commonly referred to as the fundamental problem of causal inference.

² The software will be released once anonymity is no longer needed. We can also provide an anonymized copy to reviewers upon request.


However, not all hope is lost. We can still estimate the Conditional Average Treatment Effect (CATE) of the treatment. Let $x$ be an individual feature vector. Then the CATE of $x$, denoted $\tau(x)$, is defined by

$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x].$$

Estimating $\tau$ is impossible without making further assumptions on the distribution of $(Y_i(0), Y_i(1), X_i, W_i)$. In particular, we need to place two assumptions on our data.

Assumption 1 (Strong Ignorability, Rosenbaum and Rubin (1983))

$$(Y_i(1), Y_i(0)) \perp W \mid X.$$

Assumption 2 (Overlap) Define the propensity score of $x$ as

$$e(x) := \mathbb{P}(W = 1 \mid X = x).$$

Then there exist constants $0 < e_{\min}, e_{\max} < 1$ such that for all $x \in \mathrm{Support}(X)$,

$$0 < e_{\min} < e(x) < e_{\max} < 1.$$

In words, $e(x)$ is bounded away from 0 and 1.

Assumption 1 ensures that there is no unobserved confounder, a random variable which influences both the probability of treatment and the potential outcomes, which would make the CATE unidentifiable. The assumption is particularly strong and difficult to check in applications. Meanwhile, Assumption 2 rectifies the situation wherein a certain part of the population is always treated or always in the control group. If, for example, all women were in the control group, one cannot identify the treatment effect for women. Though both assumptions are strong, they are nevertheless satisfied by design in randomized controlled trials. While the estimators we discuss would be sensible in observational studies when the assumptions are satisfied, we warn practitioners to be cautious in such studies, especially when the number of covariates is large (D’Amour et al., 2017).

Given these two assumptions, there exist many valid CATE estimators. The crux of these methods is to estimate two quantities: the control response function,

$$\mu_0(x) = \mathbb{E}[Y(0) \mid X = x],$$

and the treatment response function,

$$\mu_1(x) = \mathbb{E}[Y(1) \mid X = x].$$

If we denote our learned estimates as $\hat{\mu}_0(x)$ and $\hat{\mu}_1(x)$, then we can form the CATE estimate as the difference between the two,

$$\hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x).$$
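Why the two assumptions suffice can be made explicit with a short derivation. Writing the observed outcome as $Y = W\,Y(1) + (1 - W)\,Y(0)$ (the usual consistency convention of the potential outcomes framework, stated here as an assumption), we have for the treated arm

$$\mathbb{E}[Y \mid X = x, W = 1] = \mathbb{E}[Y(1) \mid X = x, W = 1] = \mathbb{E}[Y(1) \mid X = x] = \mu_1(x),$$

where the second equality uses Assumption 1 and the conditioning event has positive probability by Assumption 2. The same argument with $W = 0$ gives $\mathbb{E}[Y \mid X = x, W = 0] = \mu_0(x)$, so both response functions, and hence $\tau(x)$, can be learned by regressing observed outcomes on features within each arm.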

The astute reader may be wondering why we don’t simply estimate $\mu_0$ and $\mu_1$ with our favorite function approximation algorithm at this point and then all go home. After all, we observe noisy realizations of $\mu_0$ and $\mu_1$ together with the corresponding inputs $x$. In fact, it is commonplace to do exactly that. When people directly estimate $\mu_0$ and $\mu_1$ with their favorite model, we call the procedure a T-learner (Künzel et al., 2017). Common choices of models include linear models and random forests, though neural networks have recently been considered (Nie and Wager, 2017).
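As a rough illustration of the T-learner, the sketch below fits one least-squares line per treatment arm and differences the two fits. The scalar feature, the linear base model, and the simulated response functions are assumptions made for brevity; they are not the models used in the experiments.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a scalar feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def t_learner(xs, ws, ys):
    """Fit one response model per arm and return the function tau_hat(x)."""
    a0, b0 = fit_line([x for x, w in zip(xs, ws) if w == 0],
                      [y for y, w in zip(ys, ws) if w == 0])
    a1, b1 = fit_line([x for x, w in zip(xs, ws) if w == 1],
                      [y for y, w in zip(ys, ws) if w == 1])
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)

# Hypothetical randomized experiment: mu_0(x) = x and mu_1(x) = 1 + 3x,
# so the true CATE is tau(x) = 1 + 2x.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(2000)]
ws = [rng.randint(0, 1) for _ in range(2000)]
ys = [(1 + 3 * x if w else x) + rng.gauss(0, 0.1) for x, w in zip(xs, ws)]

tau_hat = t_learner(xs, ws, ys)
print(round(tau_hat(0.5), 2))  # should be close to the true value 2.0
```

Because each arm is fitted separately, nothing learned from the control units informs the treatment model, which is the inefficiency discussed next.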

While it may seem like we’ve triumphed, the T-learner does have some drawbacks (Athey and Imbens, 2015). It is usually an inefficient estimator. For example, it will often perform poorly when one can borrow information across the treatment conditions. To overcome these deficiencies, a variety of alternative learners have been suggested. Closely related to the T-learner is the idea of estimating the outcome using all of the features and the treatment indicator, without giving the treatment indicator a special role (Hill, 2011). The predicted CATE for an individual unit is then the difference between the predicted values when the treatment assignment indicator is changed from control to treatment, with all other features held fixed. This is called the S-learner, because it uses a single prediction model.
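The S-learner can be sketched in the same toy setting. Here a single k-nearest-neighbour regressor is fitted on the pair (x, w), and the CATE estimate is the difference between its predictions with the indicator set to 1 and to 0. The choice of k-nearest neighbours, the value k = 25, and the simulated data are all illustrative assumptions.

```python
import random

def knn_predict(features, targets, query, k=25):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(
        range(len(features)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(features[i], query)),
    )[:k]
    return sum(targets[i] for i in nearest) / k

def s_learner(xs, ws, ys, k=25):
    """Fit ONE model on (x, w); difference its predictions at w=1 and w=0."""
    features = [(x, float(w)) for x, w in zip(xs, ws)]
    return lambda x: (knn_predict(features, ys, (x, 1.0), k)
                      - knn_predict(features, ys, (x, 0.0), k))

# Same hypothetical experiment as for the T-learner: tau(x) = 1 + 2x.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(2000)]
ws = [rng.randint(0, 1) for _ in range(2000)]
ys = [(1 + 3 * x if w else x) + rng.gauss(0, 0.1) for x, w in zip(xs, ws)]

tau_hat = s_learner(xs, ws, ys)
print(round(tau_hat(0.5), 2))  # should be close to the true value 2.0
```

The treatment indicator enters the model as an ordinary feature, so any sharing of information across arms is left to the base model.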

In this paper, we suggest another new learner called the Y-learner (see Figure 1). This learner has been engineered from the ground up to take advantage of some of the unique capabilities of neural networks. See the appendix for a full description of the Y-learner, and additional learners found in the literature. Below, we will use these learners as base algorithms for transfer learning. That is to say, we will use the knowledge gained by training one learner on one experiment to help a new learner with a new underlying experiment train faster with less data.


Figure 1: Y-learner with Neural Networks. One of many advanced methods for CATE estimation. See Sections B and D of the appendix for a more detailed overview.

3 Transfer Learning

3.1 Background

The key idea in transfer learning is that new experiments should transfer insights from previous experiments rather than starting learning anew. The most straightforward example of transfer comes from computer vision (Welinder et al., 2010; Saenko and Darrell, 2010; Bourdev et al., 2011; Donahue et al., 2014). Here, it is standard practice to train a neural network $\pi_\theta$ for one task and then use the trained network weights $\theta$ as initialization for a new task. The hope is that some basic low-level features of a vision system should be quite general and reusable. Starting optimization from networks that have already learned these general features should be faster than starting from scratch.

Despite its promise, fine-tuning often fails to produce initializations that are uniformly good for solving new tasks (Finn et al., 2017). One potent fix to this problem is a class of algorithms that seek to optimize meta-learning initialization weights (Finn et al., 2017; Nichol et al., 2018). In these algorithms, one meta-optimizes over many experiments to obtain neural network weights that can quickly find solutions to new experiments. We will use the Reptile algorithm to learn initialization weights for CATE estimation.

3.2 Transfer Learning CATE Estimators

In this section, we consider a scenario wherein one has access to many related causal inference experiments. Across all experiments, the input space $X$ is the same. Let $i$ index an experiment. Each experiment has its own distinct outcome when treatment is received, $\mu^i_1(x)$, and when no treatment is received, $\mu^i_0(x)$. Together, these quantities define the CATE $\tau^i = \mu^i_1 - \mu^i_0$, which we want to estimate. We are usually interested in estimating the CATE by using $X$ to predict $\mu^i_0$ and $\mu^i_1$. However, in transfer learning, the hope is that we can transfer knowledge between experiments such that being able to predict $\mu^i_0$, $\mu^i_1$, and $\tau^i$ from experiment $i$ accurately will help us predict $\mu^j_0$, $\mu^j_1$, and $\tau^j$ from experiment $j$.

Below, let $\pi_\theta$ be a generic expression for a neural network parameterized by $\theta$. Sometimes, parameters will have a subscript indicating if their neural network predicts treatment or control (0 for control and 1 for treatment). Parameters may also have a superscript indicating the experiment number whose outcome is being predicted. For example, $\pi_{\theta^2_0}(x)$ predicts $\mu^2_0(x)$, the outcome under control for Experiment 2. We will sometimes drop the superscript $i$ when the meaning is clear. All of the transfer algorithms described here are presented in detail in Appendix D.

[Figure 2 image: three panels titled "Warm start method", "Frozen-features method", and "Multi-head method". Each panel shows the treatment and control networks for Experiment 0 and Experiment 1, with arrows annotated "backprop", "warm start", "random initialization", and "freeze during backprop".]

Figure 2: Warm start, frozen-features, and multi-head methods for CATE transfer learning. For these figures, we use the T-learner as the base learner for simplicity. All three methods attempt to reuse neural network features from previous experiments. See the appendix for an illustration of joint-training.

Warm start (also known as fine-tuning): Experiment 0 predicts $\pi_{\theta^0_0}(x) = \hat{\mu}^0_0(x)$ and $\pi_{\theta^0_1}(x) = \hat{\mu}^0_1(x)$ to form the CATE estimator $\hat{\tau} = \hat{\mu}^0_1(x) - \hat{\mu}^0_0(x)$. Suppose $\theta^0_0$, $\theta^0_1$ are fully trained and produce a good CATE estimate. For experiment 1, the input space $X$ is identical to the input space for experiment 0, but the outcomes $\mu^1_0(x)$ and $\mu^1_1(x)$ are different. However, we suspect the underlying data representations learned by $\pi_{\theta^0_0}$ and $\pi_{\theta^0_1}$ are still useful. Hence, rather than randomly initialize $\theta^1_0$ and $\theta^1_1$ for experiment 1, we set $\theta^1_0 = \theta^0_0$ and $\theta^1_1 = \theta^0_1$. We then train $\pi_{\theta^1_0}(x) = \hat{\mu}^1_0(x)$ and $\pi_{\theta^1_1}(x) = \hat{\mu}^1_1(x)$. See Figure 2 and Algorithm 8 in the appendix.
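The effect of a warm start can be seen in a minimal sketch. A linear model stands in for the network, only one response function is shown (the same recipe would be applied to both the treatment and the control weights), and the response functions, sample sizes, learning rate, and step counts are all hypothetical.

```python
import random

def train(params, xs, ys, steps, lr=0.1):
    """Full-batch gradient descent on squared error for y ~ a + b * x."""
    a, b = params
    n = len(xs)
    for _ in range(steps):
        residuals = [a + b * x - y for x, y in zip(xs, ys)]
        grad_a = 2 * sum(residuals) / n
        grad_b = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b

def mse(params, xs, ys):
    a, b = params
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)

def sample(intercept, slope, n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return xs, [intercept + slope * x + rng.gauss(0, 0.1) for x in xs]

# Experiment 0 is large; experiment 1 is small with a related response.
x0, y0 = sample(2.0, 3.0, 1000)
x1, y1 = sample(2.2, 2.8, 50)
x_test, y_test = sample(2.2, 2.8, 1000)

theta_exp0 = train((0.0, 0.0), x0, y0, steps=500)  # fully trained on experiment 0
warm = train(theta_exp0, x1, y1, steps=5)          # initialized at experiment 0 weights
cold = train((0.0, 0.0), x1, y1, steps=5)          # same budget, fresh initialization

print(mse(warm, x_test, y_test), mse(cold, x_test, y_test))
```

With the same small training budget on experiment 1, the warm-started weights begin close to a good solution, whereas the freshly initialized weights have barely moved.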

Frozen-features: Begin by training $\pi_{\theta^0_0}$ and $\pi_{\theta^0_1}$ to produce good CATE estimates for experiment 0. Assuming $\theta^0_0$ and $\theta^0_1$ have more than $k$ layers, let $\gamma_0$ be the parameters corresponding to the first $k$ layers of $\theta^0_0$. Define $\gamma_1$ analogously. Since we think the features encoded by $\pi_{\gamma_i}(X)$ would make a more informative input than the raw features $X$, we want to use those features as a transformed input space for $\pi_{\theta^1_0}$ and $\pi_{\theta^1_1}$. To wit, set $z_0 = \pi_{\gamma_0}(x)$ and $z_1 = \pi_{\gamma_1}(x)$. Then form the estimates $\pi_{\theta^1_0}(z_0) = \hat{\mu}^1_0$ and $\pi_{\theta^1_1}(z_1) = \hat{\mu}^1_1$. During training of experiment 1, we only backpropagate through $\theta^1_0$, $\theta^1_1$ and not through the features we borrowed from $\theta^0_0$ and $\theta^0_1$. See Figure 2 and Algorithm 9 in the appendix.
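A small sketch may help fix the idea of frozen features. It uses a one-hidden-layer tanh network with a scalar input, written out by hand; the hidden layer plays the role of $\gamma$ and the linear output layer plays the role of the experiment-specific weights. The architecture, the response functions, and the hyperparameters are assumptions chosen only for illustration, and a single response function is shown.

```python
import math
import random

rng = random.Random(0)
H = 8  # hidden units in the borrowed feature layer

def features(gamma, x):
    """Feature layer pi_gamma: x -> tanh(w * x + c) for each hidden unit."""
    return [math.tanh(w * x + c) for w, c in gamma]

def predict(gamma, head, x):
    z = features(gamma, x)
    return head[-1] + sum(v * zh for v, zh in zip(head, z))  # head[-1] is the bias

def sgd_epoch(gamma, head, data, lr, train_gamma):
    """One SGD pass on squared error; gamma is updated only if train_gamma."""
    for x, y in data:
        z = features(gamma, x)
        err = head[-1] + sum(v * zh for v, zh in zip(head, z)) - y
        new_head = [v - lr * err * zh for v, zh in zip(head, z)] + [head[-1] - lr * err]
        if train_gamma:  # backpropagate through tanh into the feature layer
            gamma[:] = [(w - lr * err * v * (1 - zh * zh) * x,
                         c - lr * err * v * (1 - zh * zh))
                        for (w, c), v, zh in zip(gamma, head, z)]
        head[:] = new_head

def mse(gamma, head, data):
    return sum((predict(gamma, head, x) - y) ** 2 for x, y in data) / len(data)

def sample(f, n):
    return [(x, f(x)) for x in (rng.uniform(-2, 2) for _ in range(n))]

# Experiment 0: train the feature layer and its head together.
gamma = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(H)]
head_exp0 = [0.0] * (H + 1)
data_exp0 = sample(lambda x: math.sin(x), 500)
for _ in range(50):
    sgd_epoch(gamma, head_exp0, data_exp0, lr=0.02, train_gamma=True)

# Experiment 1: borrow gamma, keep it frozen, train only a fresh head.
gamma_before = list(gamma)
head_exp1 = [0.0] * (H + 1)
data_exp1 = sample(lambda x: 0.5 * math.sin(x) + 1.0, 100)
mse_before = mse(gamma, head_exp1, data_exp1)
for _ in range(50):
    sgd_epoch(gamma, head_exp1, data_exp1, lr=0.05, train_gamma=False)
mse_after = mse(gamma, head_exp1, data_exp1)
```

Only the head for experiment 1 receives gradient updates, so the borrowed features are identical before and after training on the new experiment.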

Multi-head: In this setup, all experiments share base layers that are followed by experiment-specific layers. The intuition is that the base layers should learn general features, and the experiment-specific layers should transform those features into estimates of $\mu^i_j$. More concretely, let $\gamma_0$ and $\gamma_1$ be shared base layers. Set $z_0 = \pi_{\gamma_0}(x_0)$ and $z_1 = \pi_{\gamma_1}(x_1)$. The base layers are followed by experiment-specific layers $\phi^i_0$ and $\phi^i_1$. Let $\theta^i_j = [\gamma_j, \phi^i_j]$. Then

$$\pi_{\theta^i_j}(x) = \pi_{\phi^i_j}\left(\pi_{\gamma_j}(x)\right) = \pi_{\phi^i_j}(z_j) = \hat{\mu}^i_j.$$

Training alternates between experiments: each $\theta^i_0$ and $\theta^i_1$ is trained for some small number of iterations, and then the experiment and head being trained are switched. Every head is usually trained several times. See Figure 2 and Algorithm 10 in the appendix.
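The alternating schedule can be sketched as follows for a single response function. One shared tanh feature layer is followed by a separate linear head per experiment, and training repeatedly switches experiments after a few steps. The network size, the two toy response functions, and the schedule (100 rounds of 20 steps) are hypothetical choices, not those of Algorithm 10.

```python
import math
import random

rng = random.Random(0)
H = 6  # hidden units in the shared base layer

def forward(gamma, head, x):
    z = [math.tanh(w * x + c) for w, c in gamma]
    return z, head[-1] + sum(v * zh for v, zh in zip(head, z))

def sgd_step(gamma, head, x, y, lr=0.02):
    """One SGD step on squared error, updating the shared base and one head."""
    z, pred = forward(gamma, head, x)
    err = pred - y
    new_gamma = [(w - lr * err * v * (1 - zh * zh) * x,
                  c - lr * err * v * (1 - zh * zh))
                 for (w, c), v, zh in zip(gamma, head, z)]
    new_head = [v - lr * err * zh for v, zh in zip(head, z)] + [head[-1] - lr * err]
    gamma[:] = new_gamma
    head[:] = new_head

def mse(gamma, head, data):
    return sum((forward(gamma, head, x)[1] - y) ** 2 for x, y in data) / len(data)

# Two hypothetical experiments with related response functions.
targets = {0: lambda x: math.sin(x), 1: lambda x: math.sin(x) + 0.5 * x}
data = {i: [(x, f(x)) for x in (rng.uniform(-2, 2) for _ in range(200))]
        for i, f in targets.items()}

gamma = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(H)]  # shared base layers
heads = {i: [0.0] * (H + 1) for i in data}                      # one head per experiment
initial = {i: mse(gamma, heads[i], data[i]) for i in data}

for _ in range(100):                           # every head is revisited many times
    for i in data:                             # switch experiment and head
        for x, y in rng.sample(data[i], 20):   # a few iterations per turn
            sgd_step(gamma, heads[i], x, y)

final = {i: mse(gamma, heads[i], data[i]) for i in data}
```

Each step updates the shared base together with exactly one head, so the base sees data from every experiment while each head sees only its own.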

Joint training: All predictions share base layers $\theta$. From these base layers, there are two heads per experiment $i$: one to predict $\mu^i_0$ and one to predict $\mu^i_1$. Every head and the base features are trained simultaneously by optimizing with respect to the loss function

$$L = \sum_i \left( \left\|\hat{\mu}^i_0 - \mu^i_0\right\| + \left\|\hat{\mu}^i_1 - \mu^i_1\right\| \right)$$

and minimizing over all weights. This will encourage the base layers to learn generally applicable features and the heads to learn features specific to predicting a single $\mu^i_j$. See the illustration in the appendix and Algorithm 6.
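For concreteness, the joint loss can be evaluated as in the sketch below. It assumes a Euclidean norm and uses the observed outcomes in each arm as stand-ins for the unobserved response functions; the dictionary keys and the toy numbers are hypothetical.

```python
import math

def l2_norm(pred, target):
    """Euclidean norm of the difference between predictions and targets."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))

def joint_loss(experiments):
    """Sum over experiments of ||mu0_hat - mu0|| + ||mu1_hat - mu1||."""
    total = 0.0
    for exp in experiments:
        total += l2_norm(exp["mu0_hat"], exp["y_control"])
        total += l2_norm(exp["mu1_hat"], exp["y_treated"])
    return total

# Toy predictions from two experiments; each has a control and a treatment head.
experiments = [
    {"mu0_hat": [1.0, 2.0], "y_control": [1.0, 0.0],
     "mu1_hat": [3.0, 4.0], "y_treated": [0.0, 0.0]},
    {"mu0_hat": [0.5], "y_control": [0.5],
     "mu1_hat": [1.0], "y_treated": [1.0]},
]
loss = joint_loss(experiments)
print(loss)
```

Because every term depends on the shared base layers, a single gradient step on this loss updates the base with information from all experiments and both arms at once.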

SF Reptile transfer for CATE estimators: Similarly to fine-tuning, we no longer provide each experiment with its own weights. Instead, we use data from all experiments to learn weights $\theta_0$ and $\theta_1$, which are good initializers. By good initializers, we mean that starting from $\theta_0$ and $\theta_1$, one can train neural networks $\pi_{\theta_0}$ and $\pi_{\theta_1}$ to estimate $\mu^i_0$ and $\mu^i_1$ for any arbitrary experiment much faster and with less data than starting from random initializations. To learn these good initializations, we use a transfer learning technique called Reptile. The idea is to perform experiment-specific inner updates $U(\theta)$ and then aggregate them into outer updates of the form $\theta_{\text{new}} = \epsilon \cdot U(\theta) + (1 - \epsilon) \cdot \theta$. In this paper, we consider a slight variation of Reptile. In standard Reptile, $\epsilon$ is either a scalar or correlated to per-parameter weights furnished via SGD. For our problem, we would like to encourage our network layers to learn at different rates. The hope is that the lower layers can learn more general, slowly-changing features like in the frozen features method, and the higher layers can learn comparatively faster features that more quickly adapt to new tasks after ingesting the stable lower-level features. To accomplish this, we take the path of least resistance and make $\epsilon$ a vector which assigns a different learning rate to each neural network layer. Because our intuition involves slow and fast weights, we will refer to this modification in this paper as SF Reptile: Slow Fast Reptile. Though this change is seemingly small, we found it boosted performance on our problems. See Figure 7 and Algorithm 11.
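The layer-wise outer update can be written in a few lines. In the sketch below, the weights are a list of layers (each a flat list of floats), `inner_update` stands for the experiment-specific update $U(\theta)$ and is supplied by the caller, and the per-layer values of $\epsilon$ are hypothetical; this is an illustration of the update rule, not a transcription of Algorithm 11.

```python
def sf_reptile_update(theta, adapted, eps):
    """theta_new[l] = eps[l] * U(theta)[l] + (1 - eps[l]) * theta[l] per layer l."""
    return [
        [e * a + (1.0 - e) * t for t, a in zip(layer, layer_adapted)]
        for layer, layer_adapted, e in zip(theta, adapted, eps)
    ]

def sf_reptile(theta, experiments, inner_update, eps, outer_steps):
    """Cycle through experiments: adapt, then interpolate layer by layer."""
    for step in range(outer_steps):
        experiment = experiments[step % len(experiments)]
        theta = sf_reptile_update(theta, inner_update(theta, experiment), eps)
    return theta

# Toy check with two "layers": a slow lower layer and a fast upper layer.
theta = [[0.0, 0.0], [0.0]]
adapted = [[1.0, 2.0], [4.0]]   # stand-in for U(theta) after inner training
eps = [0.1, 0.5]                # small step for the lower layer, large for the upper
theta_new = sf_reptile_update(theta, adapted, eps)
print(theta_new)
```

Setting every entry of `eps` to the same value recovers the scalar form of the outer update given above.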

MLRW transfer for CATE estimation: In this method, there exists one single set of weights $\theta$. There are no experiment-specific weights. Furthermore, we do not use separate networks to estimate $\mu_0$ and $\mu_1$. Instead, $\pi_\theta$ is trained to estimate one $\mu^i_j$ at a time. We train $\theta$ with SF Reptile so that in the future $\pi_\theta$ requires minimal samples to fit $\mu^i_j$ from any experiment. To actually form the CATE estimate, we use a small number of training samples to fit $\pi_\theta$ to $\mu^i_0$ and then a small number of training samples to fit $\pi_\theta$ to $\mu^i_1$. We call $\theta$ meta-learned regression weights (MLRW) because they are meta-learned over many experiments to quickly regress onto any $\mu^i_j$. The full MLRW algorithm is presented as Algorithm 5.
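The two stages of MLRW can be sketched in a toy setting. A linear model stands in for $\pi_\theta$, a scalar Reptile step replaces the layer-wise one for brevity, and each task is one response function from some experiment and arm. The task parameters, sample sizes, and step counts are hypothetical, and this is not a transcription of Algorithm 5.

```python
import random

rng = random.Random(0)

def adapt(theta, data, steps=10, lr=0.1):
    """A few full-batch gradient steps on squared error for y ~ a + b * x."""
    a, b = theta
    for _ in range(steps):
        res = [a + b * x - y for x, y in data]
        grad_a = 2 * sum(res) / len(data)
        grad_b = 2 * sum(r * x for r, (x, _) in zip(res, data)) / len(data)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b

def sample(task, n):
    intercept, slope = task
    return [(x, intercept + slope * x + rng.gauss(0, 0.1))
            for x in (rng.uniform(-1, 1) for _ in range(n))]

# Hypothetical tasks: one (intercept, slope) per experiment and treatment arm.
tasks = [(2.0, 3.0), (2.5, 3.5), (1.8, 2.6), (2.4, 3.2)]

# Stage 1: meta-learn a single initialization theta over all tasks.
theta = (0.0, 0.0)
eps = 0.2
for step in range(200):
    adapted = adapt(theta, sample(tasks[step % len(tasks)], 50))
    theta = tuple(eps * u + (1 - eps) * t for t, u in zip(theta, adapted))

# Stage 2: on a new experiment, fit the same theta to each arm with few samples.
mu0_task, mu1_task = (2.1, 2.9), (2.6, 3.4)   # true tau(x) = 0.5 + 0.5 * x
theta_control = adapt(theta, sample(mu0_task, 20), steps=20)
theta_treated = adapt(theta, sample(mu1_task, 20), steps=20)

def tau_hat(x):
    return ((theta_treated[0] + theta_treated[1] * x)
            - (theta_control[0] + theta_control[1] * x))

print(round(tau_hat(0.0), 2))  # should be near the true value 0.5
```

The same weights serve as the starting point for both arms, so no experiment-specific or arm-specific parameters are stored between experiments.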

4 Evaluation

We evaluate our transfer learning estimators on both real and simulated data. In our data example, we consider the important problem of voter encouragement. Analyzing a large data set of 1.96 million potential voters, we show how transfer learning across elections and geographic regions can dramatically improve our CATE estimators. This example shows that transfer learning can substantially improve the performance of CATE estimators. To the best of our knowledge, this is the first successful demonstration of transfer learning for CATE estimation. The simulated data has been intentionally chosen to be different in character from our real-world example. In particular, the simulated input space is images and the estimated outcome variable is continuous.

4.1 GOTV Experiment

To evaluate transfer learning for CATE estimation on real data, we reanalyze a set of large field experiments with more than 1.96 million potential voters (Gerber et al., 2017). The authors conducted 17 experiments to evaluate the effect of a mailer on voter turnout in the 2014 U.S. Midterm Elections. The mailer informs the targeted individual whether or not they voted in the past four major elections (2006, 2008, 2010, and 2012), and it compares their voting behavior with that of the people in the same state. The mailer finishes with a reminder that their voting behavior will be monitored. The idea is that social pressure (i.e., the social norm of voting) will encourage people to vote. The likelihood of voting increases by about 2.2% (s.e.=0.001) when given the mailer.

Each of the experiments targets a different state. This results in different populations, different ballots, and different electoral environments. In addition to this, the treatment is slightly different in each experiment, as the median voting behavior in each state is different. However, there are still many similarities across the experiments, so there should be gains from transferring information.

In this example, the input $X$ is a voter’s demographic data including age, past voting turnout in 2006, 2008, 2009, 2010, 2011, 2012, and 2013, marital status, race, and gender. The treatment response function $\mu_1(x)$ estimates the voting propensity for a potential voter who receives a mailer encouraging them to vote. The control response function $\mu_0$ estimates the voting propensity if that voter did not receive a mailer. The CATE $\tau$ is thus the change in the probability of voting when a unit receives a mailer. The complete dataset has this data over 17 different states. Treating each state as a separate experiment, we can perform transfer learning across them.

| $x$ | outcome | $\mu_0$ | $\mu_1$ | $\tau$ |
|---|---|---|---|---|
| A voter profile | The voter’s propensity to vote | The voter’s propensity to vote when they do not receive a mailer | The voter’s propensity to vote when they do receive a mailer | Change in the voter’s propensity to vote after receiving a mailer |

[Figure 3 plot: MSE for the CATE (y-axis, 0.00 to 0.15) against the number of units in the training set in thousands (x-axis, 0 to 100), in four panels: S-NN, T-NN, Y-NN, and other estimators. The neural network panels show baseline, warm, frozen, multi head, and SF curves; the other-estimators panel shows joint, S-RF, T-RF, and MLRW.]

Figure 3: Social Pressure and Voter Turnout, Version 1. Our results far exceed the previous state of the art, which are represented here as S-RF, T-RF, and the baseline method for S-NN and T-NN. Our new methods are Y-NN and the transfer learning methods: warm, frozen, multi-head, joint, SF Reptile, and MLRW.

Being able to estimate the treatment effect of sending a mailer is an important problem in elections. We may wish to only treat people whose likelihood of voting would significantly increase when receiving the mailer, to justify the cost for these mailers. Furthermore, we wish to avoid sending mailers to voters who will respond negatively to them. This negative response has been previously observed and is therefore a realistic and relevant concern; e.g., some recipients call their Secretary of State’s office or local election registrar to complain (Mann, 2010; Michelson, 2016).

Evaluating CATE estimators on real data

Evaluating a CATE estimator on real data is difficult since one does not observe the true CATE or the individual treatment effect, $Y_i(1) - Y_i(0)$, for any unit because by definition only one of the two outcomes is observed for any unit. One could use the original features and simulate the outcome features, but this would require us to create a response model. Instead, we estimate the "truth" on the real data using linear models (version 1) or random forests (version 2), and we then draw the data based on these estimates. For a detailed description, we refer to Appendix A.2. We then ask the question: how do the various methods perform when they have less data than the entire sample?
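The evaluation logic can be sketched as follows. Two hypothetical linear functions stand in for the "truth" fitted on the real data, binary outcomes are drawn from them under random assignment (the Bernoulli draw is an assumption of this sketch, not a detail taken from Appendix A.2), and an estimator is scored by its mean squared error against the now-known CATE.

```python
import random

def simulate(xs, mu0, mu1, rng):
    """Randomly assign treatment and draw a binary outcome from the 'truth'."""
    rows = []
    for x in xs:
        w = rng.randint(0, 1)
        p = mu1(x) if w else mu0(x)
        rows.append((x, w, 1 if rng.random() < p else 0))
    return rows

def cate_mse(tau_hat, tau, xs):
    """Mean squared error of a CATE estimate against the known CATE."""
    return sum((tau_hat(x) - tau(x)) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
mu0 = lambda x: 0.30 + 0.20 * x   # hypothetical fitted control response
mu1 = lambda x: 0.35 + 0.25 * x   # hypothetical fitted treatment response
tau = lambda x: mu1(x) - mu0(x)

xs = [rng.random() for _ in range(20000)]
rows = simulate(xs, mu0, mu1, rng)

# A deliberately simple estimator: the difference in means, constant in x.
treated = [y for _, w, y in rows if w == 1]
control = [y for _, w, y in rows if w == 0]
ate_hat = sum(treated) / len(treated) - sum(control) / len(control)
baseline_mse = cate_mse(lambda x: ate_hat, tau, xs)
print(baseline_mse)
```

Any of the learners discussed above could be passed to `cate_mse` in place of the constant estimator, and the training sample could be subsampled to study performance as a function of the number of units.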

We evaluate S-NN, T-NN, and Y-NN using our transfer learning methods. We also added a baseline benchmark which does not use any transfer learning for each of the CATE estimators. In addition to this, we added the S-RF and T-RF as random forest baselines, as well as the Joint estimator and the MLRW estimator, both of which use transfer learning. Figure 3 shows the performance of these estimators when the regression functions were created using a linear model, and Figure 4 shows the same, but the response functions are created using a random forest fitted on the real data.

In previous work, the non-transfer tree-based estimators such as T-RF and S-RF have achieved state of the art results on this problem (Künzel et al., 2017). For CATE estimation, these methods are very competitive baselines (Green and Kern, 2012). Happily for us, even non-transfer neural-network-based learners vastly outperform the prior art. In both examples, non-transfer S-NN, T-NN, and Y-NN learners are better or not much worse than T-RF and S-RF. S-NN and Y-NN perform extremely well in this example. Better still, our transfer learning approaches consistently outperform all classical baselines and non-transfer neural network learners on this benchmark. Positive transfer between experiments is readily apparent.

We find that multi-head, frozen features, and SF are usually the best methods to improve an existing neural network-based CATE estimator. The best estimator is MLRW. This algorithm consistently converges to a very good solution with very few observations.

4.2 MNIST Example

In the previous experiment, we observed that the MLRW estimator performed most favorably and transfer learning significantly improved upon the baseline. To confirm that this conclusion is not specific to voter persuasion studies, in this section we intentionally consider a very different type of data. Recently, Nie and Wager (2017) introduced a simulation study wherein MNIST digits are rotated by some number of degrees $\alpha$, with $\alpha$ furnished via a single data generating process that depends on the value of the depicted digit. They then attempt to do CATE estimation to measure the heterogeneous treatment effect of a digit’s label.

[Figure 4 plot: MSE for the CATE (y-axis, 0.000 to 0.125) against the number of units in the training set in thousands (x-axis, 0 to 100), in four panels: S-NN, T-NN, Y-NN, and other estimators. The neural network panels show baseline, warm, frozen, multi head, and SF curves; the other-estimators panel shows joint, S-RF, T-RF, and MLRW.]

Figure 4: Social Pressure and Voter Turnout, Version 2. Our results exceed the previous state of the art results, which are represented here as S-RF, T-RF, and the baseline method for S-NN and T-NN. Our new methods are Y-NN and the transfer learning methods: warm, frozen, multi-head, joint, SF Reptile, and MLRW.

Motivated by this example, we develop a data generating process using MNIST digits wherein transfer learning for CATE estimation is applicable. In our example, the input $X$ is an MNIST image. We have $k$ data generating processes which return different outcomes for each input when given either treatment or control. Thus, under some fixed data generating process, $\mu_0$ represents the outcome when the input image $X$ is given the control, $\mu_1$ represents the outcome when $X$ is given the treatment, and $\tau$ is the difference in outcomes given the placement of $X$ in the treatment or control group. Each data generating process has different response functions ($\mu_0$ and $\mu_1$) and thus different CATEs ($\tau$), but each of these functions only depends on the image label presented in the image $X$. We thus hope that transfer learning could expedite the process of learning features which are indicative of the label. See Appendix A for full details of the data generation process. In Figure 5 of Appendix A, we confirm that a transfer learning strategy outperforms its non-transfer learning counterpart, even on image data, and also that MLRW performs well.

5 Discussion and Conclusion

In this paper, we proposed the problem of transfer learning for CATE estimation. One immediate question the reader may be left with is why we chose the transfer learning techniques we did. We only considered two common types of transfer: (1) basic fine tuning and weight sharing techniques common in the computer vision literature (Welinder et al., 2010; Saenko and Darrell, 2010; Bourdev et al., 2011; Donahue et al., 2014; Koch, 2015), and (2) techniques for learning an initialization that can be quickly optimized (Finn et al., 2017; Ravi and Larochelle, 2017; Nichol et al., 2018). However, many further techniques exist, since transfer learning is an extensively studied and perennial problem (Schmidhuber, 1992; Bengio et al., 1992; Thrun, 1996; Thrun and Pratt, 1998; Taylor and Stone, 2009; Silver et al., 2013). In Vinyals et al. (2016), the authors attempt to combine feature embeddings that can be utilized with non-parametric methods for transfer. Snell et al. (2017) is an extension of this work that modifies the procedure for sampling examples from the support set during training. Andrychowicz et al. (2016) and related techniques try to meta-learn an optimizer that can more quickly solve new tasks. Rusu et al. (2016) attempts to overcome forgetting during transfer by systematically introducing new network layers with lateral connections to old frozen layers. Munkhdalai and Yu (2017) uses networks with memory to adapt to new tasks. We invite the reader to review Finn et al. (2017) for an excellent overview of the current transfer learning landscape. Though the majority of the discussed techniques could be extended to CATE estimation, our implementations of Rusu et al. (2016) and Andrychowicz et al. (2016) proved difficult to tune and consequently learned very little. Furthermore, we were not able to successfully adapt Snell et al. (2017) to the problem of regression. We decided to instead focus our attention on algorithms for obtaining good initializations, which were easy to adapt to our problem and quickly delivered good results without extensive tuning.


References

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., and de Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. Neural Information Processing Systems (NIPS).

Athey, S. and Imbens, G. W. (2015). Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5).

Athey, S. and Imbens, G. W. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences of the United States of America, 113(27):7353–60.

Bengio, S., Bengio, Y., Cloutier, J., and Gecsei, J. (1992). On the optimization of a synaptic learning rule. Biological Neural Networks.

Bourdev, L., Maji, S., and Malik, J. (2011). Describing people: A poselet-based approach to attribute classification. ICCV.

D’Amour, A., Ding, P., Feller, A., Lei, L., and Sekhon, J. (2017). Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582.

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2014). Decaf: A deep convolutional activation feature for generic visual recognition. International Conference on Machine Learning (ICML).

Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML.

Gerber, A. S., Huber, G. A., Fang, A. H., and Gooch, A. (2017). The generalizability of social pressure effects on turnout across high-salience electoral contexts: Field experimental evidence from 1.96 million citizens in 17 states. American Politics Research, 45(4):533–559.

Green, D. P. and Kern, H. L. (2012). Modeling heterogeneous treatment effects in survey experiments with Bayesian additive regression trees. Public Opinion Quarterly, 76(3):491–511.

Henderson, N. C., Louis, T. A., Wang, C., and Varadhan, R. (2016). Bayesian analysis of heterogeneous treatment effects for patient-centered outcomes research. Health Services and Outcomes Research Methodology, 16(4):213–233.

Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.

Koch, G. (2015). Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop.

Künzel, S., Sekhon, J., Bickel, P., and Yu, B. (2017). Meta-learners for estimating heterogeneous treatment effects using machine learning. arXiv preprint arXiv:1706.03461.

LeCun, Y. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

Mann, C. B. (2010). Is there backlash to social pressure? A large-scale field experiment on voter mobilization. Political Behavior, 32(3):387–407.

Michelson, M. R. (2016). The risk of over-reliance on the institutional review board: An approved project is not always an ethical project. PS: Political Science & Politics, 49(02):299–303.

Munkhdalai, T. and Yu, H. (2017). Meta networks. ICML.

Nichol, A., Achiam, J., and Schulman, J. (2018). On first-order meta-learning algorithms. CoRR, abs/1803.02999.

Nie, X. and Wager, S. (2017). Learning objectives for treatment effect estimation. arXiv preprint arXiv:1712.04912.


Powers, S., Qian, J., Jung, K., Schuler, A., Shah, N. H., Hastie, T., and Tibshirani, R. (2018). Some methods for heterogeneous treatment effect estimation in high dimensions. Statistics in Medicine.

Ravi, S. and Larochelle, H. (2017). Optimization as a model for few-shot learning. International Conference on Learning Representations (ICLR).

Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.

Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. (2016). Progressive neural networks. CoRR, abs/1606.04671.

Saenko, K., Kulis, B., Fritz, M., and Darrell, T. (2010). Adapting visual category models to new domains. ECCV.

Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation.

Silver, D. L., Yang, Q., and Li, L. (2013). Lifelong machine learning systems: Beyond learning algorithms. AAAI Spring Symposium Technical Report.

Snell, J., Swersky, K., and Zemel, R. (2017). Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.

Taddy, M., Gardner, M., Chen, L., and Draper, D. (2016). A nonparametric Bayesian analysis of heterogenous treatment effects in digital experimentation. Journal of Business & Economic Statistics, 34(4):661–672.

Taylor, M. E. and Stone, P. (2009). Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10:1633–1685.

Thrun, S. (1996). Is learning the n-th thing any easier than learning the first? NIPS.

Thrun, S. and Pratt, L. (1998). Learning to learn. Springer Science and Business Media.

Tian, L., Alizadeh, A. A., Gentles, A. J., and Tibshirani, R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109(508):1517–1532.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. (2016). Matching networks for one shot learning. Neural Information Processing Systems (NIPS).

Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. (2010). Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.


A Appendix: Simulation Studies and Application

Figure 5: MNIST task. Panels: S-NN and Y-NN. Legend entries: baseline, frozen, warm, multi head, MLRW. x-axis: number of units in the training set (in 1000 units); y-axis: MSE for the CATE.

A.1 MNIST Simulation

For our MNIST simulation study (Section 4.2), we used the MNIST database (LeCun, 1998), which contains labeled handwritten images. We follow here the notation of Nie and Wager (2017), who introduce a very similar simulation study that does not try to evaluate transfer learning for CATE estimation, but instead emulates an RCT with the goal of evaluating different CATE estimators.

The MNIST data set contains labeled image data (Xi, Ci), where Xi denotes the raw image of unit i and Ci ∈ {0, . . . , 9} denotes its label. We create k Data Generating Processes (DGPs), D1, . . . , Dk, each of which specifies a distribution of (Yi(0), Yi(1), Wi, Xi) and represents a different CATE estimation problem.

In this simulation, we let Wi = 0 if the image Xi is placed in the control group, and Wi = 1 if the image Xi is placed in the treatment group. Yi(Wi) quantifies the outcome of Xi under Wi.

To generate a DGP Dj, we first sample weights in the following way,

$$m_j(0), m_j(1), \ldots, m_j(9) \overset{iid}{\sim} \mathrm{Unif}(-3, 3),$$

$$t_j(0), t_j(1), \ldots, t_j(9) \overset{iid}{\sim} \mathrm{Unif}(-1, 1),$$

$$p_j(0), p_j(1), \ldots, p_j(9) \overset{iid}{\sim} \mathrm{Unif}(0.3, 0.7),$$

and we define the response functions and the propensity score as

$$\mu^j_0(C_i) = m_j(C_i) + 3 C_i,$$

$$\mu^j_1(C_i) = \mu^j_0(C_i) + t_j(C_i),$$

$$e_j(C_i) = p_j(C_i).$$

To generate (Yi(0), Yi(1), Wi, Xi) from Dj, we first sample a (Xi, Ci) from the MNIST data set, and we then generate Yi(0), Yi(1), and Wi in the following way:

$$\varepsilon_i \overset{iid}{\sim} N(0, 1),$$

$$Y_i(0) = \mu^j_0(C_i) + \varepsilon_i, \qquad Y_i(1) = \mu^j_1(C_i) + \varepsilon_i, \qquad W_i \sim \mathrm{Bern}(e_j(C_i)).$$
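The sampling scheme above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `make_dgp` and `draw_unit` are hypothetical, and randomly drawn digit labels stand in for labels read from MNIST.

```python
import random

def make_dgp(rng):
    """Sample the per-digit weights that define one DGP D_j."""
    m = [rng.uniform(-3, 3) for _ in range(10)]      # m_j(c): baseline weights
    t = [rng.uniform(-1, 1) for _ in range(10)]      # t_j(c): treatment effects
    p = [rng.uniform(0.3, 0.7) for _ in range(10)]   # p_j(c): propensity scores
    return m, t, p

def draw_unit(dgp, c, rng):
    """Generate (Y(0), Y(1), W) for an image whose digit label is c."""
    m, t, p = dgp
    mu0 = m[c] + 3 * c                    # control response
    mu1 = mu0 + t[c]                      # treatment response
    eps = rng.gauss(0.0, 1.0)             # one noise draw shared by both outcomes
    w = 1 if rng.random() < p[c] else 0   # W ~ Bern(e_j(c))
    return mu0 + eps, mu1 + eps, w

rng = random.Random(0)
dgp = make_dgp(rng)
labels = [rng.randrange(10) for _ in range(1000)]  # assumption: stand-in for MNIST labels
units = [draw_unit(dgp, c, rng) for c in labels]
```

Because the same noise term enters both potential outcomes, the unit-level effect Y(1) − Y(0) equals t_j(C_i) in this sketch, so the true CATE depends on the image only through its label.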

During training, Xi, Wi, and Yi are made available to the convolutional neural network, which then predicts τ given a test image Xi and a treatment Wi. Here, τ is the difference between the outcome under treatment and the outcome under control.

Having access to multiple DGPs can be interpreted as having access to prior experiments done on a similar population of images, allowing us to explore the effects of different transfer learning methods when predicting the effect of a treatment in a new image.


A.2 GOTV Data Example and Simulation

In this section, we describe how the simulations for the GOTV example in the main paper were done, and we discuss the results of a much bigger simulation study with 51 experiments, which is summarized in Tables 1, 2, and 3.

A.2.1 Data Generating Processes for Our Real World Example

For our data example, we took one of the experiments conducted by Gerber et al. (2017). The study took place in 2014 in Alaska, and 252,576 potential voters were randomly assigned to a control and a treatment group. Subjects in the treatment group were sent a mailer as described in the main text, and their voter turnout was recorded.

To evaluate the performance of different CATE estimators, we need to know the true CATEs, which are unknown due to the fundamental problem of causal inference. To still be able to evaluate CATE estimators, researchers usually estimate the potential outcomes using some machine learning method and then generate the data from this estimate. This is to some extent also a simulation, but unlike classical simulation studies it is not up to the researcher to determine the data generating distribution. The only choice of the researcher lies in the type of estimator she uses to estimate the response functions. To avoid being misled by artifacts created by a particular method, we used a linear model in Real World Data Set 1 and a random forests estimator in Real World Data Set 2.

Specifically, we generate for each experiment a true CATE and we simulate new observed outcomes based on the real data in four steps.

1. We first use the estimator of choice (e.g., a random forests estimator) and train it on the treated units and on the control units separately to get estimates for the response functions, µ0 and µ1.

2. Next, we sample N units from the underlying experiment to get the features and the treatment assignment of our samples, $(X_i, W_i)_{i=1}^{N}$.

3. We then generate the true underlying CATE for each unit using τi = τ(Xi) = µ1(Xi) − µ0(Xi).

4. Finally, we generate the observed outcome by sampling a Bernoulli distributed variable around mean µi:

$$Y^{obs}_i \sim \mathrm{Bern}(\mu_i), \qquad \mu_i = \begin{cases} \mu_0(X_i) & \text{if } W_i = 0, \\ \mu_1(X_i) & \text{if } W_i = 1. \end{cases}$$
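Steps 2 to 4 can be sketched as follows. The response functions `mu0` and `mu1` below are toy stand-ins for the fitted estimators from step 1 (an assumption made only to keep the example self-contained), and `simulate_outcomes` is a hypothetical helper name.

```python
import random

def simulate_outcomes(units, mu0, mu1, rng):
    """units: iterable of (x, w) pairs; mu0 and mu1 map x to a probability."""
    rows = []
    for x, w in units:
        tau = mu1(x) - mu0(x)                     # step 3: true CATE
        mean = mu1(x) if w == 1 else mu0(x)       # mean of the observed outcome
        y_obs = 1 if rng.random() < mean else 0   # step 4: Bernoulli draw
        rows.append((x, w, y_obs, tau))
    return rows

# Assumption: toy response functions of a single covariate (age).
mu0 = lambda age: 0.30 + 0.002 * age
mu1 = lambda age: 0.32 + 0.003 * age

rng = random.Random(0)
# Step 2: sampled features and treatment assignments (synthetic here).
sample = [(rng.randint(18, 90), rng.randint(0, 1)) for _ in range(500)]
data = simulate_outcomes(sample, mu0, mu1, rng)
```

Each row carries the simulated outcome together with the known τ, which is what makes it possible to score a CATE estimator against the truth.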

After this procedure, we have 17 data sets corresponding to the 17 experiments for which we know the true CATE function, which we can now use to evaluate CATE estimators and CATE transfer learners.

A.2.2 Data Generating Processes for Our Simulation Study

Simulations motivated by real-world experiments are important to assess whether our methods work well for voter persuasion data sets, but it is important to also consider other settings to evaluate the generalizability of our conclusions.

To do this, we first specify the control response function, µ0(x) = E[Y(0)|X = x] ∈ [0, 1], and the treatment response function, µ1(x) = E[Y(1)|X = x] ∈ [0, 1].

We then use each of the 17 experiments to generate a simulated experiment in the following way:

1. We sample N units from the underlying experiment to get the features and the treatment assignment of our samples, $(X_i, W_i)_{i=1}^{N}$.

2. We then generate the true underlying CATE for each unit using τi = τ(Xi) = µ1(Xi) − µ0(Xi).

3. Finally, we generate the observed outcome by sampling a Bernoulli distributed variable around mean µi:

$$Y^{obs}_i \sim \mathrm{Bern}(\mu_i), \qquad \mu_i = \begin{cases} \mu_0(X_i) & \text{if } W_i = 0, \\ \mu_1(X_i) & \text{if } W_i = 1. \end{cases}$$


The experiments range in size from 5,000 units to 400,000 units per experiment, and the covariate vector is 11 dimensional and the same as in the main part of the paper. We present here three different setups.

Simulation LM (Table 1): We choose here N to be all units in the corresponding experiment. Sample $\beta^0 = (\beta^0_1, \ldots, \beta^0_d) \overset{iid}{\sim} N(0, 1)$ and $\beta^1 = (\beta^1_1, \ldots, \beta^1_d) \overset{iid}{\sim} N(0, 1)$ and define

$$\mu_0(x) = \mathrm{logistic}(x \beta^0),$$

$$\mu_1(x) = \mathrm{logistic}(x \beta^1).$$
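As a rough sketch of Simulation LM, the two response functions can be built from independently drawn coefficient vectors. The helper name `make_lm_responses` and the synthetic covariate vector are assumptions for illustration; the real study uses the 11 covariates of the GOTV experiments.

```python
import math
import random

def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_lm_responses(d, rng):
    """Draw beta^0 and beta^1 and return the response functions mu_0 and mu_1."""
    beta0 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    beta1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    mu0 = lambda x: logistic(sum(a * b for a, b in zip(x, beta0)))
    mu1 = lambda x: logistic(sum(a * b for a, b in zip(x, beta1)))
    return mu0, mu1

rng = random.Random(0)
mu0, mu1 = make_lm_responses(11, rng)          # 11 covariates, as in the text
x = [rng.gauss(0.0, 1.0) for _ in range(11)]   # assumption: synthetic covariates
tau_x = mu1(x) - mu0(x)                        # true CATE at x
```

Since β0 and β1 are drawn independently, the implied CATE can vary strongly across units.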

Simulation RF (Table 2): We choose here N to be all units in the corresponding experiment.

1. Train a random forests estimator on the real data set and define µ0 to be the resulting estimator,

2. Sample a covariate f (e.g., age),

3. Sample a random value v in the support of f (e.g., 38),

4. Sample a shift s ∼ N(0, 4).

Now define the potential outcomes as follows:

$$\mu_0(x) = \text{trained random forests estimator},$$

$$\mu_1(x) = \mathrm{logistic}\big(\mathrm{logit}(\mu_0(x)) + s \cdot \mathbb{1}_{f \ge v}\big).$$
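A small sketch of how the treatment response can be derived from the control response, assuming the shift s is applied on the logit scale to units whose covariate f is at least v. The clipping constant is an added assumption to keep the logit finite when a forest predicts exactly 0 or 1; `shifted_response` is a hypothetical name.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def shifted_response(mu0_x, f_value, v, s, clip=1e-6):
    """mu_1(x) obtained from mu_0(x) by a logit-scale shift s when f >= v."""
    p = min(max(mu0_x, clip), 1.0 - clip)   # assumption: clip to keep logit finite
    shift = s if f_value >= v else 0.0      # indicator 1{f >= v} times s
    return logistic(logit(p) + shift)

# Two hypothetical units with the same control response but different ages.
mu1_below = shifted_response(0.4, f_value=30, v=38, s=1.5)
mu1_above = shifted_response(0.4, f_value=45, v=38, s=1.5)
```

Units below the threshold keep their control response, so the treatment effect is zero there and nonzero only for f ≥ v.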

Simulation RFt (Table 3): This experiment is the same as Simulation RF, but uses only one percent of the data, N = #units/100.

Results of 42 Simulations

Even though we combine each simulation setup with 17 experiments, we only report the first 14, because the last three don’t add any new insight and don’t fit well on the page. Looking at Tables 1, 2, and 3, we observe that MLRW is the best performing transfer learner. In fact, in Simulation LM it is the best in 8 out of 17 experiments, in Simulation RF it is the best in 11 out of 17 experiments, and in Simulation RFt it is the best in 10 out of 17 experiments. We also notice that in cases where it is not the best performing estimator, it is usually very close to the best, and it does not fail terribly anywhere. For the other transfer methods, we note that frozen features, multi-head, and SF work very well and consistently improve upon the baseline learners, which do not use outside information. Warm Start, however, does not work well and often even leads to worse results than the baseline estimators.

B Y-learner

In this section, we show the favorable behavior of the Y-learner over the X-learner. In order to show this, we implemented the X-learner exactly as it is described in Künzel et al. (2017) and the Y-learner as it is described in Algorithm 4. Figure 6 shows the MSE as a function of the sample size. We can see that the X-learner is consistently outperformed by the Y-learner on all these data sets. We note that all these data sets were intentionally created to be very similar to the GOTV data set we are interested in studying. Therefore these data sets are not extremely different from each other, and it is possible that the X-NN performs much better on different data sets.

The Y-Learner

Figure 6: In this figure, we compare the Y and the X learner on six simulated data sets. A precise description of how the data was created can be found in Section A.2. Panels: GOTV_2_1_lm, GOTV_2_2_rf, GOTV_3_1_lm_01pc, GOTV_3_1_lm_100pc, GOTV_3_1_lm_10pc, GOTV_3_1_lm_1pc. x-axis: number of units in the training set (in 1000); y-axis: MSE for the CATE; estimators: X-NN and Y-NN.

Another important advantage of neural networks is that they can be trained jointly. This enables us to adapt well-performing meta-learners to perform even better. Specifically, we used the idea of X-NN to propose a new CATE estimator, which we call Y-NN.³ The X-learner is essentially a two-step procedure. In the first stage, the outcome functions, µ0 and µ1, are estimated and the individual treatment effects are imputed:

$$D^1_i := Y_i(1) - \mu_0(X_i) \quad \text{and} \quad D^0_i := \mu_1(X_i) - Y_i(0).$$

³ Y is chosen as it is the next letter in the alphabet after X. However, this is not a meta-learner because there is no obvious way to extend it to arbitrary base learners, such as RF or BART.

In the second stage, estimators for the CATE are derived by regressing the imputed treatment effects on the features X. Künzel et al. (2017) provides details. In the X-learner, the estimators of the first stage are held fixed and are not updated in the second stage. This is necessary since, unlike neural networks, many machine learning algorithms, such as RF and BART, cannot be updated in a meaningful way once they have been trained. For neural networks and similar gradient optimization-based algorithms, it is possible to jointly update the estimators in the first and the second stage.

This is exactly the motivation of the Y-learner. Instead of first deriving estimators for the response functions and then an estimator for the CATE function, these functions are optimized jointly. The pseudo-code in Algorithm 4 shows how these two stages are updated simultaneously. In Figure 6, we compare Y-NN with X-NN, and we find that Y-NN outperforms X-NN for our data sets.


C Pseudo Code for CATE Estimators

In this section, we will present pseudo code for the CATE estimators in this paper. We present code for the meta learning algorithms in Section D. We denote by Y^0 and Y^1 the observed outcomes for the control and the treated group. For example, Y^1_i is the observed outcome of the ith unit in the treated group. X^0 and X^1 are the features of the control and treated units, and hence, X^1_i corresponds to the feature vector of the ith unit in the treated group. M_k(Y ∼ X) is the notation for a regression estimator, which estimates x ↦ E[Y | X = x]. It can be any regression/machine learning estimator, but in this paper we only choose it to be a neural network or random forest.

Algorithm 1 T-learner

procedure T-LEARNER(X, Y^obs, W)
    µ_0 = M_0(Y^0 ∼ X^0)
    µ_1 = M_1(Y^1 ∼ X^1)
    τ(x) = µ_1(x) − µ_0(x)
end procedure

M_0 and M_1 are here some, possibly different, machine learning/regression algorithms.
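To make the T-learner concrete, here is a minimal Python sketch. It assumes a deliberately simple base learner (`fit_group_mean`, a hypothetical helper that predicts the mean outcome for each value of a discrete covariate) in place of the neural networks or random forests used in the paper.

```python
from collections import defaultdict

def fit_group_mean(xs, ys):
    """Toy base learner: mean outcome observed for each discrete value of x."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    overall = sum(ys) / len(ys)
    # Fall back to the overall mean for values of x not seen in training.
    return lambda x: sums[x] / counts[x] if x in counts else overall

def t_learner(X, Y_obs, W):
    """Fit separate response models on control and treated units."""
    X0 = [x for x, w in zip(X, W) if w == 0]
    Y0 = [y for y, w in zip(Y_obs, W) if w == 0]
    X1 = [x for x, w in zip(X, W) if w == 1]
    Y1 = [y for y, w in zip(Y_obs, W) if w == 1]
    mu0 = fit_group_mean(X0, Y0)
    mu1 = fit_group_mean(X1, Y1)
    return lambda x: mu1(x) - mu0(x)   # tau(x) = mu1(x) - mu0(x)

# Tiny noiseless data set: the effect is 2 for x = 0 and 3 for x = 1.
X = [0, 0, 1, 1, 0, 0, 1, 1]
W = [0, 1, 0, 1, 0, 1, 0, 1]
Y_obs = [1.0, 3.0, 2.0, 5.0, 1.0, 3.0, 2.0, 5.0]
tau = t_learner(X, Y_obs, W)
```

On this toy data the estimate recovers the group-wise effects exactly because the outcomes are noiseless.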

Algorithm 2 S-learner

procedure S-LEARNER(X, Y^obs, W)
    µ = M(Y^obs ∼ (X, W))
    τ(x) = µ(x, 1) − µ(x, 0)
end procedure

M(Y^obs ∼ (X, W)) is the notation for estimating (x, w) ↦ E[Y | X = x, W = w] while treating W as a 0,1-valued feature.
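The S-learner differs only in that a single model is fit with the treatment indicator as an extra feature. The sketch below again assumes a toy group-mean base learner (`fit_group_mean`, hypothetical) rather than the estimators used in the paper.

```python
from collections import defaultdict

def fit_group_mean(xs, ys):
    """Toy base learner: mean outcome observed for each distinct feature value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    overall = sum(ys) / len(ys)
    return lambda x: sums[x] / counts[x] if x in counts else overall

def s_learner(X, Y_obs, W):
    """Fit one model on (x, w) pairs and contrast its predictions at w = 1 and w = 0."""
    mu = fit_group_mean(list(zip(X, W)), Y_obs)
    return lambda x: mu((x, 1)) - mu((x, 0))

X = [0, 0, 1, 1, 0, 0, 1, 1]
W = [0, 1, 0, 1, 0, 1, 0, 1]
Y_obs = [1.0, 3.0, 2.0, 5.0, 1.0, 3.0, 2.0, 5.0]
tau = s_learner(X, Y_obs, W)
```

With a fully stratified base learner such as this one, the S-learner and T-learner coincide; they differ once the model shares information across treatment arms.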

Algorithm 3 X-learner

procedure X-LEARNER(X, Y^obs, W, g)
    µ_0 = M_1(Y^0 ∼ X^0)                        ▷ Estimate response function
    µ_1 = M_2(Y^1 ∼ X^1)
    D^1_i = Y^1_i − µ_0(X^1_i)                   ▷ Compute imputed treatment effects
    D^0_i = µ_1(X^0_i) − Y^0_i
    τ_1 = M_3(D^1 ∼ X^1)                         ▷ Estimate CATE for treated and control
    τ_0 = M_4(D^0 ∼ X^0)
    τ(x) = g(x) τ_0(x) + (1 − g(x)) τ_1(x)       ▷ Average the estimates
end procedure

g(x) ∈ [0, 1] is a weighting function which is chosen to minimize the variance of τ(x). It is sometimes possible to estimate Cov(τ_0(x), τ_1(x)) and compute the best g based on this estimate. However, we have had good experience choosing g to be an estimate of the propensity score, and choosing it to be constant and equal to the ratio of treated units also usually leads to a good estimator of the CATE.
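The X-learner stages can be sketched as follows, assuming a toy group-mean base learner (`fit_group_mean`, hypothetical) for all four regressions and a constant weighting function g equal to the share of treated units.

```python
from collections import defaultdict

def fit_group_mean(xs, ys):
    """Toy base learner: mean outcome observed for each discrete value of x."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    overall = sum(ys) / len(ys)
    return lambda x: sums[x] / counts[x] if x in counts else overall

def x_learner(X, Y_obs, W):
    X0 = [x for x, w in zip(X, W) if w == 0]
    Y0 = [y for y, w in zip(Y_obs, W) if w == 0]
    X1 = [x for x, w in zip(X, W) if w == 1]
    Y1 = [y for y, w in zip(Y_obs, W) if w == 1]
    mu0, mu1 = fit_group_mean(X0, Y0), fit_group_mean(X1, Y1)   # first stage
    D1 = [y - mu0(x) for x, y in zip(X1, Y1)]   # imputed effects, treated units
    D0 = [mu1(x) - y for x, y in zip(X0, Y0)]   # imputed effects, control units
    tau1, tau0 = fit_group_mean(X1, D1), fit_group_mean(X0, D0)  # second stage
    g = len(X1) / len(X)                        # assumption: constant weight
    return lambda x: g * tau0(x) + (1 - g) * tau1(x)

X = [0, 0, 1, 1, 0, 0, 1, 1]
W = [0, 1, 0, 1, 0, 1, 0, 1]
Y_obs = [1.0, 3.0, 2.0, 5.0, 1.0, 3.0, 2.0, 5.0]
tau = x_learner(X, Y_obs, W)
```

The first-stage models are fit once and then held fixed while the second-stage regressions are computed, which is the property the Y-learner relaxes.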


Algorithm 4 Y-Learner Pseudo Code

if W_i == 0 then
    Update the network π_θ0 to predict Y^obs_i
    Update the network π_θ1 to predict Y^obs_i + π_τ(X_i)
    Update the network π_τ to predict π_θ1(X_i) − Y^obs_i
end if
if W_i == 1 then
    Update the network π_θ0 to predict Y^obs_i − π_τ(X_i)
    Update the network π_θ1 to predict Y^obs_i
    Update the network π_τ to predict Y^obs_i − π_θ0(X_i)
end if

This process describes training the Y-Learner for one step given a data point (Y^obs_i, X_i, W_i).
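The joint update can be illustrated with a small runnable sketch. As a simplifying assumption, each of the three networks is replaced by a lookup table over a discrete covariate, and each "update" is one gradient step on the squared error toward the target named in the pseudo code; `y_learner_step` is a hypothetical name.

```python
def y_learner_step(theta0, theta1, tau, x, w, y, lr=0.2):
    """One Y-learner update for a data point (y, x, w) on lookup-table models."""
    # Targets are computed from the current models before any of them is updated.
    if w == 0:
        target0 = y                  # pi_theta0 predicts Y_obs
        target1 = y + tau[x]         # pi_theta1 predicts Y_obs + pi_tau(x)
        target_tau = theta1[x] - y   # pi_tau predicts pi_theta1(x) - Y_obs
    else:
        target0 = y - tau[x]         # pi_theta0 predicts Y_obs - pi_tau(x)
        target1 = y                  # pi_theta1 predicts Y_obs
        target_tau = y - theta0[x]   # pi_tau predicts Y_obs - pi_theta0(x)
    # Gradient step on 0.5 * (prediction - target)^2 for each model.
    theta0[x] += lr * (target0 - theta0[x])
    theta1[x] += lr * (target1 - theta1[x])
    tau[x] += lr * (target_tau - tau[x])

theta0 = {0: 0.0, 1: 0.0}
theta1 = {0: 0.0, 1: 0.0}
tau = {0: 0.0, 1: 0.0}
# Noiseless toy data (x, w, y): the effect is 2 for x = 0 and 3 for x = 1.
data = [(0, 0, 1.0), (0, 1, 3.0), (1, 0, 2.0), (1, 1, 5.0)]
for _ in range(500):
    for x, w, y in data:
        y_learner_step(theta0, theta1, tau, x, w, y)
```

In this noiseless setting the three tables settle at the control response, the treated response, and their difference, which shows how every observation informs all three estimators at once.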


D Explicit Transfer Learning Algorithms for CATE Estimation

D.1 MLRW Transfer for CATE Estimation

Algorithm 5 MLRW Transfer for CATE Estimation.

Let µ^(i)_0 and µ^(i)_1 be the outcome under control and treatment for experiment i.
Let numexps be the number of experiments.
Let π_θ be an N layer neural network parameterized by θ = [θ_0, . . . , θ_N].
Let ε = [ε_0, . . . , ε_N] be a vector, where N is the number of layers in π_θ.
Let outeriters be the total number of training iterations.
Let inneriters be the number of inner loop training iterations.
for oiter < outeriters do
    for i < numexps do
        Sample X_0 and X_1: control and treatment units from experiment i
        for j in [0, 1] do                          ▷ j iterating over control and treatment
            Let U_0(θ) = θ
            for k < inneriters do
                L = ‖π_{U_k(θ)}(X_j) − µ^(i)_j(X_j)‖
                Compute ∇_θ L.
                Use ADAM with ∇_θ L to obtain U_{k+1}(θ).
                U_k(θ) = U_{k+1}(θ)
            end for
            for p < N do
                θ_p = ε_p · U_k(θ_p) + (1 − ε_p) · θ_p.
            end for
        end for
    end for
end for
To evaluate the CATE estimate, do
C = []
for i < numexps do
    Sample X_0 and X_1: control and treatment units from experiment i
    Sample X: test units from experiment i.
    for j in [0, 1] do                              ▷ j iterating over control and treatment
        for k < inneriters do
            L = ‖π_{U_k(θ)}(X_j) − µ^(i)_j(X_j)‖
            Compute ∇_θ L.
            Use ADAM with ∇_θ L to obtain U_{k+1}(θ).
            U_k(θ) = U_{k+1}(θ)
        end for
        µ̂_j = π_{U_k(θ)}(X)
    end for
    τ_i = µ̂_1 − µ̂_0
    C.append(τ_i)
end for
return C
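The core of the training loop, a few inner optimization steps followed by a per-layer interpolation back toward the adapted weights, can be sketched on a toy problem. Everything below is an illustrative assumption: a two-parameter linear model stands in for the network, each parameter plays the role of one "layer", plain gradient descent replaces ADAM, and `inner_sgd` and `interpolate` are hypothetical names.

```python
def inner_sgd(theta, data, steps, lr):
    """A few gradient steps on the MSE of the model y = w * x + b."""
    w, b = theta
    n = len(data)
    for _ in range(steps):
        gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
        gb = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * gw, b - lr * gb
    return [w, b]

def interpolate(theta, adapted, eps):
    """theta_p <- eps_p * U_k(theta_p) + (1 - eps_p) * theta_p for each group p."""
    return [e * u + (1 - e) * t for t, u, e in zip(theta, adapted, eps)]

xs = [-1.0, 0.0, 1.0]
# Two toy "experiments" whose response functions have slopes 1 and 3.
tasks = [[(x, a * x) for x in xs] for a in (1.0, 3.0)]
theta, eps = [0.0, 0.0], [0.5, 0.5]
for _ in range(50):                       # outer iterations
    for data in tasks:                    # loop over experiments
        adapted = inner_sgd(theta, data, steps=5, lr=0.1)
        theta = interpolate(theta, adapted, eps)
```

The learned initialization ends up between the task-specific solutions, so a handful of inner steps on any one experiment moves it close to that experiment's optimum.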


D.2 Joint Training

Algorithm 6 Joint Training

Let µ^(i)_0 and µ^(i)_1 be the outcome under control and treatment for experiment i.
Let numexps be the number of experiments.
Let π_ρ be a generic expression for a neural network parameterized by ρ.
Let θ be base neural network layers shared by all experiments.
Let φ^(i)_0 be neural network layers predicting µ^(i)_0 in experiment i.
Let φ^(i)_1 be neural network layers predicting µ^(i)_1 in experiment i.
Let ω^(i)_0 = [θ, φ^(i)_0] be the full prediction network for µ_0 in experiment i.
Let ω^(i)_1 = [θ, φ^(i)_1] be the full prediction network for µ_1 in experiment i.
Let Ω = ∪_{j=0}^{1} ∪_{i=1}^{numexps} ω^(i)_j be all trainable parameters.
Let numiters be the total number of training iterations.
for iter < numiters do
    L = 0
    for i < numexps do
        Sample X_0 and X_1: control and treatment units from experiment i
        for j in [0, 1] do                          ▷ j iterating over control and treatment
            L^(i)_j = ‖π_{ω^(i)_j}(X_j) − µ_j(X_j)‖
            L = L + L^(i)_j
        end for
    end for
    Compute ∇_Ω L = ∂L/∂Ω = Σ_i Σ_j ∂L^(i)_j / ∂ω^(i)_j
    Apply ADAM with gradients given by ∇_Ω L.
    for i < numexps do
        Sample X: test units from experiment i
    end for
end for
µ̂_0 = π_{ω^(i)_0}(X)
µ̂_1 = π_{ω^(i)_1}(X)
return CATE estimate τ = µ̂_1 − µ̂_0


Figure 7: Joint Training. Unlike the Multi-head method, which differentiates base layers for treatment and control, the Joint Training method has all observations and experiments (regardless of treatment and control) share the same base network, which extracts general low level features from the data.


D.3 T-learner Transfer CATE Estimators

Here, we present full pseudo code for the algorithms from Section 3 using the T-learner as a base learner. All of these algorithms can be extended to other learners including S, R, X, and Y. See the released code for implementations.

Algorithm 7 Vanilla T-learner (also referred to as Baseline T-learner)

Let µ_0 and µ_1 be the outcome under control and treatment.
Let X be the experimental data. Let X_t be the test data.
Let π_{θ_0} and π_{θ_1} be neural networks parameterized by θ_0 and θ_1.
Let θ = θ_0 ∪ θ_1.
Let numiters be the total number of training iterations.
Let batchsize be the number of units sampled. We use 64.
for i < numiters do
    Sample X_0 and X_1: control and treatment units. Sample batchsize units.
    L_0 = ‖π_{θ_0}(X_0) − µ_0(X_0)‖
    L_1 = ‖π_{θ_1}(X_1) − µ_1(X_1)‖
    L = L_0 + L_1
    Compute ∇_θ L = ∂L/∂θ.
    Apply ADAM with gradients given by ∇_θ L.
end for
µ̂_0 = π_{θ_0}(X_t)
µ̂_1 = π_{θ_1}(X_t)
return CATE estimate τ = µ̂_1 − µ̂_0

Algorithm 8 Warm Start T-learner

Let µ^i_0 and µ^i_1 be the outcome under control and treatment for experiment i.
Let X^i be the data for experiment i. Let X^i_t be the test data for experiment i.
Let π_{θ_0} and π_{θ_1} be neural networks parameterized by θ_0 and θ_1.
Let θ = θ_0 ∪ θ_1.
Let numiters be the total number of training iterations.
Let batchsize be the number of units sampled. We use 64.
for i < numiters do
    Sample X^0_0 and X^0_1: control and treatment units for experiment 0. Sample batchsize units.
    L_0 = ‖π_{θ_0}(X^0_0) − µ_0(X^0_0)‖
    L_1 = ‖π_{θ_1}(X^0_1) − µ_1(X^0_1)‖
    L = L_0 + L_1
    Compute ∇_θ L = ∂L/∂θ.
    Apply ADAM with gradients given by ∇_θ L.
end for
for i < numiters do
    Sample X^1_0 and X^1_1: control and treatment units for experiment 1. Sample batchsize units.
    L_0 = ‖π_{θ_0}(X^1_0) − µ_0(X^1_0)‖
    L_1 = ‖π_{θ_1}(X^1_1) − µ_1(X^1_1)‖
    L = L_0 + L_1
    Compute ∇_θ L = ∂L/∂θ.
    Apply ADAM with gradients given by ∇_θ L.
end for
µ̂_0 = π_{θ_0}(X^1_t)
µ̂_1 = π_{θ_1}(X^1_t)
return CATE estimate τ = µ̂_1 − µ̂_0
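The two training loops of the warm start can be sketched on a toy regression problem. The sketch makes several simplifying assumptions: a one-parameter model y = w · x stands in for one of the two response networks (the T-learner would apply the same procedure to both), plain gradient descent replaces ADAM, and the two experiments are synthetic.

```python
def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train(w, data, steps, lr=0.01):
    """Plain gradient descent on the MSE of the one-parameter model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0]
exp0 = [(x, 2.0 * x) for x in xs]   # hypothetical earlier experiment
exp1 = [(x, 2.1 * x) for x in xs]   # hypothetical new, related experiment

w_pre = train(0.0, exp0, steps=200)    # first loop: fit experiment 0
w_warm = train(w_pre, exp1, steps=5)   # second loop: continue on experiment 1
w_cold = train(0.0, exp1, steps=5)     # baseline: same budget, fresh start
```

In this constructed example the experiments are nearly identical, so the warm-started model has a much lower error after the same number of steps; the simulation results reported above suggest that this advantage does not reliably carry over to the real experiments.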


Algorithm 9 Frozen Features T-learner


Let µ^i_0 and µ^i_1 be the outcome under control and treatment for experiment i.
Let X^i be the data for experiment i. Let X^i_t be the test data for experiment i.
Let π_ρ be a generic expression for a neural network parameterized by ρ.
Let θ^0_0, θ^1_0, θ^0_1, θ^1_1 be neural network parameters. The subscript indicates the outcome that θ is associated with predicting (0 for control and 1 for treatment) and the superscript indexes the experiment.
Let γ_0 be the first k layers of π_{θ^0_0}. Define γ_1 analogously.
Let θ^i = θ^i_0 ∪ θ^i_1.
Let numiters be the total number of training iterations.
Let batchsize be the number of units sampled. We use 64.
for i < numiters do
    Sample X^0_0 and X^0_1: control and treatment units for experiment 0. Sample batchsize units.
    L_0 = ‖π_{θ^0_0}(X^0_0) − µ_0(X^0_0)‖
    L_1 = ‖π_{θ^0_1}(X^0_1) − µ_1(X^0_1)‖
    L = L_0 + L_1
    Compute ∇_{θ^0} L = ∂L/∂θ^0.
    Apply ADAM with gradients given by ∇_{θ^0} L.
end for
for i < numiters do
    Sample X^1_0 and X^1_1: control and treatment units for experiment 1. Sample batchsize units.
    Compute Z^1_0 = π_{γ_0}(X^1_0) and Z^1_1 = π_{γ_1}(X^1_1)
    L_0 = ‖π_{θ^1_0}(Z^1_0) − µ_0(X^1_0)‖
    L_1 = ‖π_{θ^1_1}(Z^1_1) − µ_1(X^1_1)‖
    L = L_0 + L_1
    Compute ∇_{θ^1} L = ∂L/∂θ^1. Do not compute gradients with respect to θ^0 parameters.
    Apply ADAM with gradients given by ∇_{θ^1} L.
end for
Compute Z^1_{t,0} = π_{γ_0}(X^1_t) and Z^1_{t,1} = π_{γ_1}(X^1_t).
µ̂_0 = π_{θ^1_0}(Z^1_{t,0})
µ̂_1 = π_{θ^1_1}(Z^1_{t,1})
return CATE estimate τ = µ̂_1 − µ̂_0


Algorithm 10 Multi-Head T-learner


Let µ^i_0 and µ^i_1 be the outcome under control and treatment for experiment i.
Let X^i be the data for experiment i. Let X^i_t be the test data for experiment i.
Let π_ρ be a generic expression for a neural network parameterized by ρ.
Let θ_0 be base neural network layers shared by all experiments for predicting outcomes under control.
Let θ_1 be base neural network layers shared by all experiments for predicting outcomes under treatment.
Let φ^(i)_0 be neural network layers receiving π_{θ_0}(x^i_0) as input and predicting µ^(i)_0(x^i_0) in experiment i.
Let φ^(i)_1 be neural network layers receiving π_{θ_1}(x^i_1) as input and predicting µ^(i)_1(x^i_1) in experiment i.
Let ω^(i)_0 = [θ_0, φ^(i)_0] be all trainable parameters used to predict µ^i_0.
Let ω^(i)_1 = [θ_1, φ^(i)_1] be all trainable parameters used to predict µ^i_1.
Let Ω^i = ω^(i)_0 ∪ ω^(i)_1.
Let numiters be the total number of training iterations.
Let batchsize be the number of units sampled. We use 64.
Let numexps be the number of experiments.
for i < numiters do
    for j < numexps do
        Sample X^j_0 and X^j_1: control and treatment units for experiment j. Sample batchsize units.
        Compute Z^j_0 = π_{θ_0}(X^j_0) and Z^j_1 = π_{θ_1}(X^j_1)
        Compute µ̂^j_0 = π_{φ^j_0}(Z^j_0) and µ̂^j_1 = π_{φ^j_1}(Z^j_1)
        L_0 = ‖µ̂^j_0 − µ^j_0(X^j_0)‖
        L_1 = ‖µ̂^j_1 − µ^j_1(X^j_1)‖
        L = L_0 + L_1
        Compute ∇_{Ω^j} L = ∂L/∂Ω^j.
        Apply ADAM with gradients given by ∇_{Ω^j} L.
    end for
end for
Let C = []
for j < numexps do
    Compute Z^j_0 = π_{θ_0}(X^j_t) and Z^j_1 = π_{θ_1}(X^j_t)
    Compute µ̂^j_0 = π_{φ^j_0}(Z^j_0) and µ̂^j_1 = π_{φ^j_1}(Z^j_1)
    Estimate CATE τ = µ̂^j_1 − µ̂^j_0.
    C.append(τ)
end for
return C
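The multi-head structure, shared base layers followed by experiment-specific heads, can be illustrated with a deliberately tiny model. As an assumption for the sake of a runnable example, the shared base is a single scalar θ, each head is a scalar φ_j, the prediction for experiment j is φ_j · θ · x, and plain gradient descent on the summed MSE replaces ADAM. Only one response function is modeled; the T-learner version would use one such network for control and one for treatment.

```python
def multi_head_loss(theta, phi, experiments):
    """Sum over experiments of the MSE of the prediction phi_j * theta * x."""
    return sum(sum((phi[j] * theta * x - y) ** 2 for x, y in data) / len(data)
               for j, data in enumerate(experiments))

def multi_head_step(theta, phi, experiments, lr=0.005):
    """One joint gradient step on the shared base theta and all heads phi_j."""
    g_theta, g_phi = 0.0, [0.0] * len(phi)
    for j, data in enumerate(experiments):
        for x, y in data:
            r = 2 * (phi[j] * theta * x - y) * x / len(data)
            g_theta += r * phi[j]   # the shared base sees every experiment
            g_phi[j] += r * theta   # each head sees only its own experiment
    return theta - lr * g_theta, [p - lr * g for p, g in zip(phi, g_phi)]

xs = [1.0, 2.0, 3.0]
# Two toy experiments whose response functions have slopes 2 and 4.
experiments = [[(x, 2.0 * x) for x in xs], [(x, 4.0 * x) for x in xs]]
theta, phi = 1.0, [1.0, 1.0]
loss_before = multi_head_loss(theta, phi, experiments)
for _ in range(2000):
    theta, phi = multi_head_step(theta, phi, experiments)
loss_after = multi_head_loss(theta, phi, experiments)
```

The gradient for θ accumulates contributions from all experiments, which is how information is shared, while each φ_j adapts only to its own experiment.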


Algorithm 11 SF Reptile T-learner


Let µ^i_0 and µ^i_1 be the outcome under control and treatment for experiment i.
Let X^i be the data for experiment i. Let X^i_t be the test data for experiment i.
Let π_{θ_0} and π_{θ_1} be neural networks parameterized by θ_0 and θ_1.
Let θ = θ_0 ∪ θ_1.
Let ε = [ε_0, . . . , ε_N] be a vector, where N is the number of layers in π_{θ_i}.
Let numouteriters be the total number of outer training iterations.
Let numinneriters be the total number of inner training iterations.
Let numexps be the number of experiments.
Let batchsize be the number of units sampled. We use 64.
for iouter < numouteriters do
    for i < numexps do
        U_0(θ_0) = θ_0
        U_0(θ_1) = θ_1
        for k < numinneriters do
            Sample X^i_0 and X^i_1: control and treatment units. Sample batchsize units.
            L_0 = ‖π_{U_k(θ_0)}(X^i_0) − µ_0(X^i_0)‖
            L_1 = ‖π_{U_k(θ_1)}(X^i_1) − µ_1(X^i_1)‖
            L = L_0 + L_1
            Compute ∇_θ L = ∂L/∂θ.
            Use ADAM with gradients given by ∇_θ L to obtain U_{k+1}(θ_0) and U_{k+1}(θ_1).
            Set U_k(θ_0) = U_{k+1}(θ_0) and U_k(θ_1) = U_{k+1}(θ_1)
        end for
        for p < N do
            θ_p = ε_p · U_k(θ_p) + (1 − ε_p) · θ_p.
        end for
    end for
end for
To evaluate the CATE estimate, do
C = []
for i < numexps do
    U_0(θ_0) = θ_0
    U_0(θ_1) = θ_1
    for k < numinneriters do
        Sample X^i_0 and X^i_1: control and treatment units. Sample batchsize units.
        L_0 = ‖π_{U_k(θ_0)}(X^i_0) − µ_0(X^i_0)‖
        L_1 = ‖π_{U_k(θ_1)}(X^i_1) − µ_1(X^i_1)‖
        L = L_0 + L_1
        Compute ∇_θ L = ∂L/∂θ.
        Use ADAM with gradients given by ∇_θ L to obtain U_{k+1}(θ_0) and U_{k+1}(θ_1).
        Set U_k(θ_0) = U_{k+1}(θ_0) and U_k(θ_1) = U_{k+1}(θ_1)
    end for
    µ̂^i_0 = π_{U_k(θ_0)}(X^i_t)
    µ̂^i_1 = π_{U_k(θ_1)}(X^i_t)
    τ^i = µ̂^i_1 − µ̂^i_0
    C.append(τ^i)
end for
return C


Table 1: MSE in percent for different CATE estimators. Columns: Method, LM-1 through LM-14. Rows: t-lm, s-rf, t-rf, R-NN; S-NN, T-NN, X-NN, and Y-NN, each with the variants frozen, multihead, SF, baseline, and warm; joint; MLRW.


Table 2: MSE in percent for different CATE estimators. Columns: Method, RF-1 through RF-14. Rows: t-lm, s-rf, t-rf, R-NN; S-NN, T-NN, X-NN, and Y-NN, each with the variants frozen, multihead, SF, baseline, and warm; joint; MLRW.


Table 3: MSE in percent for different CATE estimators.

[Table body: columns are Method and RFt-1 through RFt-14. Rows are t-lm, s-rf, t-rf, R-NN, then the S-NN, T-NN, X-NN, and Y-NN estimators, each in frozen, multihead, SF, baseline, and warm variants, followed by joint and MLRW. Asterisks mark selected entries. The cell values were fused across column boundaries during extraction and cannot be separated unambiguously.]

