Estimators
In many instances, economists are interested in using machine learning models for purposes that go beyond predicting variables with good accuracy. For example:
- understanding the relationship between covariates and the outcome, usually to demonstrate that a non-trivial effect of one variable on another exists;
- identifying which covariates are related or not to a certain outcome, often to demonstrate the relevance of a certain theory;
- estimating a measure with desirable statistical and econometric properties, as in causal inference, where the object of interest is the predicted outcome of an adapted algorithm; and
- processing non-traditional data (eg, text) for inclusion in a traditional econometric regression, which is especially useful in settings where measurable quantitative data is complemented by this other type of data.
The `gingado.estimators` module contains machine learning algorithms adapted to enable the types of analyses described above. More estimators can be expected over time.
For a more academic discussion of machine learning methods in economics, covering a broad range of topics, see Athey and Imbens (2019).
Covariate selection
Clustering
The clustering algorithms used below are not themselves adaptations of general-use methods. Rather, the functions offer convenience functionality to find and retain the other variables in the same cluster as a variable of interest. These variables are usually entities (individuals, countries, stocks, etc) in a larger population.
The `gingado` clustering routines are designed for standalone use or for seamless integration as part of a pipeline.
There are three levels of sophistication that users can choose from:
- using the off-the-shelf clustering routines provided by `gingado`, which were selected to apply across various use cases;
- selecting an existing clustering routine from the `sklearn.cluster` module; or
- designing their own clustering algorithm.
FindCluster
FindCluster (cluster_alg:[BaseEstimator,ClusterMixin]=AffinityPropagation(), auto_document:ggdModelDocumentation=<class 'gingado.model_documentation.ModelCard'>, random_state:int|None=None)
Retain only the columns of `X` that are in the same cluster as `y`.
| | Type | Default | Details |
|---|---|---|---|
| cluster_alg | [BaseEstimator, ClusterMixin] | AffinityPropagation() | An instance of the clustering algorithm to use |
| auto_document | ggdModelDocumentation | ModelCard | gingado Documenter template to facilitate model documentation |
| random_state | int \| None | None | The random seed to be used by the algorithm, if relevant |
fit
fit (X, y)
Fit FindCluster
| | Details |
|---|---|
| X | The population of entities, organised in columns |
| y | The entity of interest |
transform
transform (X)
Keep only the entities in `X` that belong to the same cluster as `y`.
| | Type | Details |
|---|---|---|
| X | | The population of entities, organised in columns |
| Returns | np.array | Columns of X that are in the same cluster as y |
fit_transform
fit_transform (X, y)
Fit a `FindCluster` object and keep only the entities in `X` that belong to the same cluster as `y`.
| | Type | Details |
|---|---|---|
| X | | The population of entities, organised in columns |
| y | | The entity of interest |
| Returns | np.array | Columns of X that are in the same cluster as y |
document
document (documenter:Optional[gingado.model_documentation.ggdModelDocumentation]=None)
Document the `FindCluster` model using the template in `documenter`.
| | Type | Default | Details |
|---|---|---|---|
| documenter | ggdModelDocumentation \| None | None | A gingado Documenter or the documenter set in auto_document if None |
Example: finding similar countries
The Barro and Lee (1994) dataset, a country-level dataset, is used to illustrate the use of `FindCluster`. Let's use it to answer the following question: for a specific country, which other countries are closest to it, given the available data?
First, we import the data:
The data is organised by rows: each row is a different country, and the variables are organised in columns.
The dataset is originally organised for a regression of GDP growth (here denoted `y`) on the covariates (`X`). This is not what we want to do in this case, so instead of keeping GDP as a separate variable, the next step is to include it in the `X` DataFrame.
```python
from gingado.datasets import load_BarroLee_1994

X, y = load_BarroLee_1994()
X['gdp'] = y
X.head()
```
Unnamed: 0 | gdpsh465 | bmp1l | freeop | freetar | h65 | hm65 | hf65 | p65 | pm65 | ... | syr65 | syrm65 | syrf65 | teapri65 | teasec65 | ex1 | im1 | xr65 | tot1 | gdp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 6.591674 | 0.2837 | 0.153491 | 0.043888 | 0.007 | 0.013 | 0.001 | 0.29 | 0.37 | ... | 0.033 | 0.057 | 0.010 | 47.6 | 17.3 | 0.0729 | 0.0667 | 0.348 | -0.014727 | -0.024336 |
1 | 1 | 6.829794 | 0.6141 | 0.313509 | 0.061827 | 0.019 | 0.032 | 0.007 | 0.91 | 1.00 | ... | 0.173 | 0.274 | 0.067 | 57.1 | 18.0 | 0.0940 | 0.1438 | 0.525 | 0.005750 | 0.100473 |
2 | 2 | 8.895082 | 0.0000 | 0.204244 | 0.009186 | 0.260 | 0.325 | 0.201 | 1.00 | 1.00 | ... | 2.573 | 2.478 | 2.667 | 26.5 | 20.7 | 0.1741 | 0.1750 | 1.082 | -0.010040 | 0.067051 |
3 | 3 | 7.565275 | 0.1997 | 0.248714 | 0.036270 | 0.061 | 0.070 | 0.051 | 1.00 | 1.00 | ... | 0.438 | 0.453 | 0.424 | 27.8 | 22.7 | 0.1265 | 0.1496 | 6.625 | -0.002195 | 0.064089 |
4 | 4 | 7.162397 | 0.1740 | 0.299252 | 0.037367 | 0.017 | 0.027 | 0.007 | 0.82 | 0.85 | ... | 0.257 | 0.287 | 0.229 | 34.5 | 17.6 | 0.1211 | 0.1308 | 2.500 | 0.003283 | 0.027930 |
5 rows × 63 columns
Now we remove the first column (an identifier) and transpose the DataFrame, so that countries are organised in columns.
Each country is identified by a number: 0, 1, …
```python
X = X.iloc[:, 1:]
countries = X.T
countries.columns = ['country_' + str(c) for c in countries.columns]
countries.head()
```
country_0 | country_1 | country_2 | country_3 | country_4 | country_5 | country_6 | country_7 | country_8 | country_9 | ... | country_80 | country_81 | country_82 | country_83 | country_84 | country_85 | country_86 | country_87 | country_88 | country_89 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gdpsh465 | 6.591674 | 6.829794 | 8.895082 | 7.565275 | 7.162397 | 7.218910 | 7.853605 | 7.703910 | 9.063463 | 8.151910 | ... | 9.030974 | 8.995537 | 8.234830 | 8.332549 | 8.645586 | 8.991064 | 8.025189 | 9.030137 | 8.865312 | 8.912339 |
bmp1l | 0.283700 | 0.614100 | 0.000000 | 0.199700 | 0.174000 | 0.000000 | 0.000000 | 0.277600 | 0.000000 | 0.148400 | ... | 0.000000 | 0.000000 | 0.036300 | 0.000000 | 0.000000 | 0.000000 | 0.005000 | 0.000000 | 0.000000 | 0.000000 |
freeop | 0.153491 | 0.313509 | 0.204244 | 0.248714 | 0.299252 | 0.258865 | 0.182525 | 0.215275 | 0.109614 | 0.110885 | ... | 0.293138 | 0.304720 | 0.288405 | 0.345485 | 0.288440 | 0.371898 | 0.296437 | 0.265778 | 0.282939 | 0.150366 |
freetar | 0.043888 | 0.061827 | 0.009186 | 0.036270 | 0.037367 | 0.020880 | 0.014385 | 0.029713 | 0.002171 | 0.028579 | ... | 0.005517 | 0.011658 | 0.011589 | 0.006503 | 0.005995 | 0.014586 | 0.013615 | 0.008629 | 0.005048 | 0.024377 |
h65 | 0.007000 | 0.019000 | 0.260000 | 0.061000 | 0.017000 | 0.023000 | 0.039000 | 0.024000 | 0.402000 | 0.145000 | ... | 0.245000 | 0.246000 | 0.183000 | 0.188000 | 0.256000 | 0.255000 | 0.108000 | 0.288000 | 0.188000 | 0.257000 |
5 rows × 90 columns
Suppose we are interested in country No 13. What other countries are similar to it?
First, country No 13 needs to be carved out of the DataFrame with the other countries.
Second, we can pass the larger DataFrame and country 13's data separately to an instance of `FindCluster`.
```python
country_of_interest = countries.pop('country_13')
```
```python
from sklearn.cluster import AffinityPropagation

from gingado.estimators import FindCluster  # assumed import path

similar = FindCluster(AffinityPropagation(convergence_iter=5000))
similar
```
FindCluster(cluster_alg=AffinityPropagation(convergence_iter=5000))
```python
same_cluster = similar.fit_transform(X=countries, y=country_of_interest)
assert same_cluster.equals(similar.fit(X=countries, y=country_of_interest).transform(X=countries))
same_cluster
```
/Users/douglasaraujo/Coding/.venv_gingado/lib/python3.10/site-packages/sklearn/cluster/_affinity_propagation.py:236: ConvergenceWarning: Affinity propagation did not converge, this model may return degenerate cluster centers and labels.
warnings.warn(
/Users/douglasaraujo/Coding/.venv_gingado/lib/python3.10/site-packages/sklearn/cluster/_affinity_propagation.py:236: ConvergenceWarning: Affinity propagation did not converge, this model may return degenerate cluster centers and labels.
warnings.warn(
country_2 | country_9 | country_41 | country_48 | country_49 | country_52 | country_60 | country_64 | country_66 | |
---|---|---|---|---|---|---|---|---|---|
gdpsh465 | 8.895082 | 8.151910 | 7.360740 | 6.469250 | 5.762051 | 9.224933 | 8.346168 | 7.655864 | 7.830028 |
bmp1l | 0.000000 | 0.148400 | 0.418100 | 0.538800 | 0.600500 | 0.000000 | 0.319900 | 0.134500 | 0.488000 |
freeop | 0.204244 | 0.110885 | 0.218471 | 0.153491 | 0.151848 | 0.204244 | 0.110885 | 0.164598 | 0.136287 |
freetar | 0.009186 | 0.028579 | 0.027087 | 0.043888 | 0.024100 | 0.009186 | 0.028579 | 0.044446 | 0.046730 |
h65 | 0.260000 | 0.145000 | 0.032000 | 0.015000 | 0.002000 | 0.393000 | 0.272000 | 0.080000 | 0.146000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ex1 | 0.174100 | 0.052400 | 0.190500 | 0.069200 | 0.148400 | 0.255800 | 0.062500 | 0.052500 | 0.076400 |
im1 | 0.175000 | 0.052300 | 0.225700 | 0.074800 | 0.186400 | 0.241200 | 0.057800 | 0.057200 | 0.086600 |
xr65 | 1.082000 | 2.119000 | 3.949000 | 0.348000 | 7.367000 | 1.017000 | 36.603000 | 30.929000 | 40.500000 |
tot1 | -0.010040 | 0.007584 | 0.205768 | 0.035226 | 0.007548 | 0.018636 | 0.014286 | -0.004592 | -0.007018 |
gdp | 0.067051 | 0.039147 | 0.016775 | -0.048712 | 0.024477 | 0.050757 | -0.034045 | 0.046010 | -0.011384 |
62 rows × 9 columns
The default clustering algorithm used by `FindCluster` is affinity propagation (Frey and Dueck 2007). It is the algorithm of choice because it combines several desirable characteristics, in particular:
- the number of clusters is data-driven instead of set by the user;
- the number of entities in each cluster is also chosen by the model;
- all entities are part of a cluster; and
- each cluster might have a different number of entities.
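For instance, the data-driven number of clusters can be inspected after fitting. This is a sketch that assumes the fitted clustering algorithm remains accessible through the `cluster_alg` attribute:

```python
import numpy as np

# Assumption: `similar.cluster_alg` holds the fitted AffinityPropagation instance
labels = similar.cluster_alg.labels_
print(f"{len(np.unique(labels))} clusters across {len(labels)} countries")
```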
However, we may want to try different clustering algorithms. Let's compare the result above with the same analysis using DBSCAN (Ester et al. 1996).
```python
from sklearn.cluster import DBSCAN

similar_dbscan = FindCluster(cluster_alg=DBSCAN())
similar_dbscan
```
FindCluster(cluster_alg=DBSCAN())
```python
same_cluster_dbscan = similar_dbscan.fit_transform(X=countries, y=country_of_interest)
assert same_cluster_dbscan.equals(similar_dbscan.fit(X=countries, y=country_of_interest).transform(X=countries))
same_cluster_dbscan
```
country_0 | country_1 | country_2 | country_3 | country_4 | country_5 | country_6 | country_7 | country_8 | country_9 | ... | country_80 | country_81 | country_82 | country_83 | country_84 | country_85 | country_86 | country_87 | country_88 | country_89 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gdpsh465 | 6.591674 | 6.829794 | 8.895082 | 7.565275 | 7.162397 | 7.218910 | 7.853605 | 7.703910 | 9.063463 | 8.151910 | ... | 9.030974 | 8.995537 | 8.234830 | 8.332549 | 8.645586 | 8.991064 | 8.025189 | 9.030137 | 8.865312 | 8.912339 |
bmp1l | 0.283700 | 0.614100 | 0.000000 | 0.199700 | 0.174000 | 0.000000 | 0.000000 | 0.277600 | 0.000000 | 0.148400 | ... | 0.000000 | 0.000000 | 0.036300 | 0.000000 | 0.000000 | 0.000000 | 0.005000 | 0.000000 | 0.000000 | 0.000000 |
freeop | 0.153491 | 0.313509 | 0.204244 | 0.248714 | 0.299252 | 0.258865 | 0.182525 | 0.215275 | 0.109614 | 0.110885 | ... | 0.293138 | 0.304720 | 0.288405 | 0.345485 | 0.288440 | 0.371898 | 0.296437 | 0.265778 | 0.282939 | 0.150366 |
freetar | 0.043888 | 0.061827 | 0.009186 | 0.036270 | 0.037367 | 0.020880 | 0.014385 | 0.029713 | 0.002171 | 0.028579 | ... | 0.005517 | 0.011658 | 0.011589 | 0.006503 | 0.005995 | 0.014586 | 0.013615 | 0.008629 | 0.005048 | 0.024377 |
h65 | 0.007000 | 0.019000 | 0.260000 | 0.061000 | 0.017000 | 0.023000 | 0.039000 | 0.024000 | 0.402000 | 0.145000 | ... | 0.245000 | 0.246000 | 0.183000 | 0.188000 | 0.256000 | 0.255000 | 0.108000 | 0.288000 | 0.188000 | 0.257000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ex1 | 0.072900 | 0.094000 | 0.174100 | 0.126500 | 0.121100 | 0.063400 | 0.034200 | 0.086400 | 0.059400 | 0.052400 | ... | 0.166200 | 0.259700 | 0.104400 | 0.286600 | 0.129600 | 0.440700 | 0.166900 | 0.323800 | 0.184500 | 0.187600 |
im1 | 0.066700 | 0.143800 | 0.175000 | 0.149600 | 0.130800 | 0.076200 | 0.042800 | 0.093100 | 0.046000 | 0.052300 | ... | 0.161700 | 0.228800 | 0.179600 | 0.350000 | 0.145800 | 0.425700 | 0.220100 | 0.313400 | 0.194000 | 0.200700 |
xr65 | 0.348000 | 0.525000 | 1.082000 | 6.625000 | 2.500000 | 1.000000 | 12.499000 | 7.000000 | 1.000000 | 2.119000 | ... | 4.286000 | 2.460000 | 32.051000 | 0.452000 | 652.850000 | 2.529000 | 25.553000 | 4.152000 | 0.452000 | 0.886000 |
tot1 | -0.014727 | 0.005750 | -0.010040 | -0.002195 | 0.003283 | -0.001747 | 0.009092 | 0.011630 | 0.008169 | 0.007584 | ... | -0.006642 | -0.003241 | -0.034352 | -0.001660 | -0.046278 | -0.011883 | -0.039080 | 0.005175 | -0.029551 | -0.036482 |
gdp | -0.024336 | 0.100473 | 0.067051 | 0.064089 | 0.027930 | 0.046407 | 0.067332 | 0.020978 | 0.033551 | 0.039147 | ... | 0.038095 | 0.034213 | 0.052759 | 0.038416 | 0.031895 | 0.031196 | 0.034096 | 0.046900 | 0.039773 | 0.040642 |
62 rows × 89 columns
As illustrated above, the results can be quite different. In this case, affinity propagation converged to more tightly defined clusters, while DBSCAN selected a cluster containing almost all other countries (and therefore not useful in this particular case).
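The difference is easy to quantify by comparing how many peer countries each algorithm retains for country_13:

```python
print(f"Affinity propagation: {same_cluster.shape[1]} peer countries")
print(f"DBSCAN: {same_cluster_dbscan.shape[1]} peer countries")
```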
Note that model documentation is already jumpstarted when the cluster is fit. A glimpse of the current template, including the questions that have been automatically filled, is shown below.
```python
similar.model_documentation.show_json()
```
{'model_details': {'developer': 'Person or organisation developing the model',
'datetime': '2023-09-16 01:43:45 ',
'version': 'Model version',
'type': 'Model type',
'info': {'_estimator_type': 'clusterer',
'affinity_matrix_': array([[-0.00000000e+00, -5.97375771e+07, -5.35974361e+07, ...,
-1.92434215e+09, -8.60822083e+07, -3.77976931e+07],
[-5.97375771e+07, -0.00000000e+00, -2.26471602e+08, ...,
-2.66217555e+09, -2.43057326e+06, -1.92555486e+08],
[-5.35974361e+07, -2.26471602e+08, -0.00000000e+00, ...,
-1.33575671e+09, -2.75395788e+08, -1.37934978e+06],
...,
[-1.92434215e+09, -2.66217555e+09, -1.33575671e+09, ...,
-0.00000000e+00, -2.82418157e+09, -1.42280304e+09],
[-8.60822083e+07, -2.43057326e+06, -2.75395788e+08, ...,
-2.82418157e+09, -0.00000000e+00, -2.37881124e+08],
[-3.77976931e+07, -1.92555486e+08, -1.37934978e+06, ...,
-1.42280304e+09, -2.37881124e+08, -0.00000000e+00]]),
'cluster_centers_': array([[ 6.82979374e+00, 6.14100000e-01, 3.13509000e-01, ...,
5.25000000e-01, 5.75000000e-03, 1.00472567e-01],
[ 8.89508153e+00, 0.00000000e+00, 2.04244000e-01, ...,
1.08200000e+00, -1.00400000e-02, 6.70514822e-02],
[ 7.56527528e+00, 1.99700000e-01, 2.48714000e-01, ...,
6.62500000e+00, -2.19500000e-03, 6.40891662e-02],
...,
[ 8.33254894e+00, 0.00000000e+00, 3.45485000e-01, ...,
4.52000000e-01, -1.66000000e-03, 3.84156381e-02],
[ 8.86531163e+00, 0.00000000e+00, 2.82939000e-01, ...,
4.52000000e-01, -2.95510000e-02, 3.97733722e-02],
[ 8.91233857e+00, 0.00000000e+00, 1.50366000e-01, ...,
8.86000000e-01, -3.64820000e-02, 4.06415381e-02]]),
'cluster_centers_indices_': array([ 1, 2, 3, 4, 5, 7, 8, 10, 13, 14, 16, 18, 19, 25, 27, 32, 35,
39, 42, 45, 46, 49, 50, 52, 53, 55, 57, 58, 60, 62, 67, 68, 69, 71,
76, 82, 87, 88]),
'feature_names_in_': array(['gdpsh465', 'bmp1l', 'freeop', 'freetar', 'h65', 'hm65', 'hf65',
'p65', 'pm65', 'pf65', 's65', 'sm65', 'sf65', 'fert65', 'mort65',
'lifee065', 'gpop1', 'fert1', 'mort1', 'invsh41', 'geetot1',
'geerec1', 'gde1', 'govwb1', 'govsh41', 'gvxdxe41', 'high65',
'highm65', 'highf65', 'highc65', 'highcm65', 'highcf65', 'human65',
'humanm65', 'humanf65', 'hyr65', 'hyrm65', 'hyrf65', 'no65',
'nom65', 'nof65', 'pinstab1', 'pop65', 'worker65', 'pop1565',
'pop6565', 'sec65', 'secm65', 'secf65', 'secc65', 'seccm65',
'seccf65', 'syr65', 'syrm65', 'syrf65', 'teapri65', 'teasec65',
'ex1', 'im1', 'xr65', 'tot1', 'gdp'], dtype=object),
'labels_': array([29, 0, 1, 2, 3, 4, 18, 5, 6, 1, 7, 30, 14, 8, 9, 29, 10,
29, 11, 12, 12, 18, 29, 36, 18, 13, 18, 14, 29, 36, 36, 14, 15, 36,
29, 16, 18, 14, 36, 17, 1, 14, 18, 29, 29, 19, 20, 1, 1, 21, 22,
1, 23, 24, 21, 25, 36, 26, 27, 1, 28, 12, 29, 1, 14, 1, 29, 30,
31, 32, 12, 33, 18, 29, 30, 18, 34, 14, 18, 36, 36, 29, 35, 36, 29,
29, 14, 36, 37, 1]),
'n_features_in_': 62,
'n_iter_': 200},
'paper': 'Paper or other resource for more information',
'citation': 'Citation details',
'license': 'License',
'contact': 'Where to send questions or comments about the model'},
'intended_use': {'primary_uses': 'Primary intended uses',
'primary_users': 'Primary intended users',
'out_of_scope': 'Out-of-scope use cases'},
'factors': {'relevant': 'Relevant factors',
'evaluation': 'Evaluation factors'},
'metrics': {'performance_measures': 'Model performance measures',
'thresholds': 'Decision thresholds',
'variation_approaches': 'Variation approaches'},
'evaluation_data': {'datasets': 'Datasets',
'motivation': 'Motivation',
'preprocessing': 'Preprocessing'},
'training_data': {'training_data': 'Information on training data'},
'quant_analyses': {'unitary': 'Unitary results',
'intersectional': 'Intersectional results'},
'ethical_considerations': {'sensitive_data': 'Does the model use any sensitive data (e.g., protected classes)?',
'human_life': 'Is the model intended to inform decisions about matters central to human life or flourishing - e.g., health or safety? Or could it be used in such a way?',
'mitigations': 'What risk mitigation strategies were used during model development?',
'risks_and_harms': 'What risks may be present in model usage? Try to identify the potential recipients,likelihood, and magnitude of harms. If these cannot be determined, note that they were considered but remain unknown',
'use_cases': 'Are there any known model use cases that are especially fraught?',
'additional_information': 'If possible, this section should also include any additional ethical considerations that went into model development, for example, review by an external board, or testing with a specific community.'},
'caveats_recommendations': {'caveats': 'For example, did the results suggest any further testing? Were there any relevant groups that were not represented in the evaluation dataset?',
'recommendations': 'Are there additional recommendations for model use? What are the ideal characteristics of an evaluation dataset for this model?'}}
`FindCluster` can also be used as part of a `Pipeline`. In this case, only the entities in the same cluster as the entity of interest will continue on to the next steps of the estimation.
```python
from sklearn.pipeline import Pipeline

from gingado.benchmark import RegressionBenchmark

pipe = Pipeline([
    ('cluster', FindCluster(AffinityPropagation(convergence_iter=5000))),
    ('rf', RegressionBenchmark())
])
pipe.fit(X=countries, y=country_of_interest)
```
/Users/douglasaraujo/Coding/.venv_gingado/lib/python3.10/site-packages/sklearn/cluster/_affinity_propagation.py:236: ConvergenceWarning: Affinity propagation did not converge, this model may return degenerate cluster centers and labels.
warnings.warn(
Pipeline(steps=[('cluster',
FindCluster(cluster_alg=AffinityPropagation(convergence_iter=5000))),
('rf',
RegressionBenchmark(cv=ShuffleSplit(n_splits=10, random_state=None, test_size=None, train_size=None)))])
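Once fitted, the pipeline behaves like any other scikit-learn estimator. For instance, assuming `RegressionBenchmark` follows the standard scikit-learn `predict` API:

```python
# Predict the entity of interest's values; the `cluster` step first filters
# `countries` down to the peers found during `fit`.
y_hat = pipe.predict(countries)
```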
Causal inference
Comparative case studies
MachineControl
MachineControl (cluster_alg:[BaseEstimator,ClusterMixin]|None=AffinityPropagation(), estimator:BaseEstimator=RegressionBenchmark(), manifold:BaseEstimator=TSNE(), with_placebo:bool=True, auto_document:ggdModelDocumentation=<class 'gingado.model_documentation.ModelCard'>, random_state:int|None=None)
Synthetic controls with machine learning methods
| | Type | Default | Details |
|---|---|---|---|
| cluster_alg | [BaseEstimator, ClusterMixin] \| None | AffinityPropagation() | An instance of the clustering algorithm to use, or None to retain all entities |
| estimator | BaseEstimator | RegressionBenchmark() | Method to weight the control entities |
| manifold | BaseEstimator | TSNE() | Algorithm for manifold learning |
| with_placebo | bool | True | Include placebo estimations during prediction? |
| auto_document | ggdModelDocumentation | ModelCard | gingado Documenter template to facilitate model documentation |
| random_state | int \| None | None | The random seed to be used by the algorithm, if relevant |
fit
fit (X:pandas.core.frame.DataFrame, y:Union[pandas.core.frame.DataFrame,pandas.core.series.Series])
Fit the `MachineControl` model.
| | Type | Details |
|---|---|---|
| X | pd.DataFrame | A pandas DataFrame with pre-intervention data of shape (n_samples, n_control_entities) |
| y | pd.DataFrame \| pd.Series | A pandas DataFrame or Series with pre-intervention data of shape (n_samples,) |
predict
predict (X:pandas.core.frame.DataFrame, y:Union[pandas.core.frame.DataFrame,pandas.core.series.Series])
Calculate the model predictions before and after the intervention
| | Type | Details |
|---|---|---|
| X | pd.DataFrame | A pandas DataFrame with complete time series (pre- and post-intervention) of shape (n_samples, n_control_entities) |
| y | pd.DataFrame \| pd.Series | A pandas DataFrame or Series with complete time series of shape (n_samples,) |
get_controls
get_controls ()
Get the list of control entities
document
document (documenter:Optional[gingado.model_documentation.ggdModelDocumentation]=None)
Document the `MachineControl` model using the template in `documenter`.
| | Type | Default | Details |
|---|---|---|---|
| documenter | ggdModelDocumentation \| None | None | A gingado Documenter or the documenter set in auto_document if None |
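Putting the pieces together, here is a hypothetical construction sketch based only on the signatures above. The import path is assumed, and `pre_X`, `pre_y`, `full_X`, and `full_y` are placeholder objects:

```python
from sklearn.cluster import AffinityPropagation

from gingado.estimators import MachineControl  # assumed import path

mc = MachineControl(
    cluster_alg=AffinityPropagation(),  # or None to retain all control entities
    with_placebo=True,
    random_state=42,
)
# pre_X: DataFrame of control entities in columns, pre-intervention rows only
# pre_y: Series with the treated entity's pre-intervention outcomes
# mc.fit(X=pre_X, y=pre_y)
# results = mc.predict(X=full_X, y=full_y)  # complete pre- and post-intervention series
# peers = mc.get_controls()                 # the control entities actually used
```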
Brief econometric description
The goal of `MachineControl` is to estimate:
\[ \tau_t = Y_{1, t}^{I} - Y_{1, t}^{N}, \quad t > T_0 \]
where:
- \(\tau_t\) is the effect on entity \(i=1\) of the intervention of interest;
- without loss of generality, \(i=1\) is an entity that has undergone the intervention of interest, amongst \(N\) total entities;
- time period \(T_0\) is the date on which the intervention occurred;
- superscript \(I\) on an outcome variable denotes the occurrence of the intervention, whereas superscript \(N\) denotes its absence; and
- for \(t > T_0\), \(Y_{1, t}^{I}\) is observed while \(Y_{1, t}^{N}\) must be estimated because it is a counterfactual.
\(Y_{1, t}^{N}\) is calculated from the values of the other entities, \(i \neq 1\). Collect these data in a vector \(\mathbb{Y}_{-1, t}^{N}\). Then, following Doudchenko and Imbens (2016):
\[ \hat{Y}_{1, t}^{N} = f^*(\mathbb{Y}_{-1, t}^{N}), \]
with the star (\(*\)) superscript on the function \(f(\cdot)\) indicating that it was trained only with data up until the intervention date. The exact form of \(f(\cdot)\) depends on the argument `estimator`. A general-use estimator is the random forest (Breiman 2001).
The panel data itself might be the whole population in the data, or a subset when using the whole population would be too cumbersome to analyse (eg, if the data contain too many entities). One way to select this subsample of control units without introducing subjective judgment is quantitative: the control units are selected through a clustering algorithm (argument `cluster_alg`). One clustering algorithm that can be used is affinity propagation (Frey and Dueck 2007).
Finally, the quality of the synthetic control can be assessed in many ways. One fully data-driven way to achieve this is manifold learning: lower-dimensional embeddings of higher-dimensional data. A preferred manifold learning algorithm is t-SNE (Van der Maaten and Hinton 2008).
The relative distances between the embeddings of the other entities and the target, compared with the distance between the synthetic control and the target, indicate the chance that a better feasible control (real or combined) exists. The intuition behind this test, sketched in the code below, is:
- let \(d_{i,j}\) be the Euclidean distance between the embeddings (2-D points) of entities \(i\) and \(j\);
- if only a very small percentage of \(d_{1, j}\), \(j \in \{2, ..., N\}\), are lower than \(d_{1, \text{synthetic control}}\), then the synthetic control produced with \(f(\cdot)\) is indeed one of the best feasible alternatives.
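Continuing the simulated sketch above (reusing `target`, `counterfactual`, and `controls`; the perplexity value is an arbitrary choice for this small sample):

```python
import numpy as np
from sklearn.manifold import TSNE

# Rows: the target entity, the synthetic control, then the N control entities
points = np.vstack([target.to_numpy(), counterfactual, controls.T.to_numpy()])
emb = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(points)

d_to_target = np.linalg.norm(emb - emb[0], axis=1)        # d_{1,j} for every j
share_closer = (d_to_target[2:] < d_to_target[1]).mean()  # controls closer than the synthetic one
print(f"{share_closer:.0%} of the controls embed closer to the target")
```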
Example: impact of labour reform on productivity
See Machine controls: Synthetic controls with machine learning.