Data augmentation

Functions to augment the user’s dataset with information from official sources.

gingado provides data augmentation functionality that helps users augment their datasets with a time series dimension. This can be done either on a stand-alone basis, as the user incorporates new data on top of the original dataset, or as part of a scikit-learn Pipeline that also includes other steps such as data transformation and model estimation.

Data augmentation with SDMX

The Statistical Data and Metadata eXchange (SDMX) is an ISO standard comprising:

technical standards
statistical guidelines, including cross-domain concepts and codelists
an IT architecture and tools

SDMX is sponsored by the Bank for International Settlements, European Central Bank, Eurostat, International Monetary Fund, Organisation for Economic Co-operation and Development, United Nations, and World Bank Group.

More information about SDMX is available on its website.

gingado uses SDMX to augment user datasets through the transformer AugmentSDMX.

For example, the code below illustrates AugmentSDMX under two scenarios: without a variance threshold (ie, all data are included, even series that are constant) and with a relatively high variance threshold (such that no data is actually added).

In both cases, the object uses the default dataflow, which is the daily series of monetary policy rates set by central banks.

These AugmentSDMX objects are used to augment a data frame with simulated data for illustrative purposes. In real life, this data would be the user’s original data.

import numpy as np
import pandas as pd
from gingado.augmentation import AugmentSDMX

rng = np.random.default_rng(seed=42)

periods = 15
idx = pd.date_range(freq='d', start='2020-01-01', periods=periods)
orig_data = pd.DataFrame({'orig_col': rng.standard_normal(periods)}, index=idx)
orig_data.head()

	orig_col
2020-01-01	0.304717
2020-01-02	-1.039984
2020-01-03	0.750451
2020-01-04	0.940565
2020-01-05	-1.951035

aug_NoVarThresh = AugmentSDMX(variance_threshold=None)
aug_data = aug_NoVarThresh.fit_transform(orig_data)
aug_data

Querying data from BIS's dataflow 'WS_CBPOL_D' - Policy rates daily...

	orig_col	BIS__WS_CBPOL_D_D__AR	BIS__WS_CBPOL_D_D__AU	BIS__WS_CBPOL_D_D__BR	BIS__WS_CBPOL_D_D__CA	BIS__WS_CBPOL_D_D__CH	BIS__WS_CBPOL_D_D__CL	BIS__WS_CBPOL_D_D__CN	BIS__WS_CBPOL_D_D__CO	BIS__WS_CBPOL_D_D__CZ	...	BIS__WS_CBPOL_D_D__RO	BIS__WS_CBPOL_D_D__RS	BIS__WS_CBPOL_D_D__RU	BIS__WS_CBPOL_D_D__SA	BIS__WS_CBPOL_D_D__SE	BIS__WS_CBPOL_D_D__TH	BIS__WS_CBPOL_D_D__TR	BIS__WS_CBPOL_D_D__US	BIS__WS_CBPOL_D_D__ZA
2020-01-01	0.304717	55.0	NaN	NaN	1.75	-0.75	NaN	4.15	4.25	NaN	...	NaN	2.25	NaN	2.25	NaN	NaN	NaN	1.625	NaN
2020-01-02	-1.039984	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	NaN	2.25	NaN	2.25	-0.25	1.25	12.0	1.625	6.5
2020-01-03	0.750451	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	NaN	2.25	-0.25	1.25	12.0	1.625	6.5
2020-01-04	0.940565	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	NaN	2.25	-0.25	1.25	12.0	1.625	6.5
2020-01-05	-1.951035	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	NaN	2.25	-0.25	1.25	12.0	1.625	6.5
2020-01-06	-1.302180	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	-0.25	1.25	12.0	1.625	6.5
2020-01-07	0.127840	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	-0.25	1.25	12.0	1.625	6.5
2020-01-08	-0.316243	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5
2020-01-09	-0.016801	55.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5
2020-01-10	-0.853044	52.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5
2020-01-11	0.879398	52.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5
2020-01-12	0.777792	52.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5
2020-01-13	0.066031	52.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5
2020-01-14	1.127241	52.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5
2020-01-15	0.467509	52.0	0.75	4.5	1.75	-0.75	1.75	4.15	4.25	2.0	...	2.5	2.25	6.25	2.25	0.00	1.25	12.0	1.625	6.5

15 rows × 39 columns

aug_StrictVarThresh = AugmentSDMX(variance_threshold=10)
aug_data = aug_StrictVarThresh.fit_transform(orig_data)
aug_data

Querying data from BIS's dataflow 'WS_CBPOL_D' - Policy rates daily...
No columns added to original data because no feature in x meets the variance threshold 10.00000

/Users/douglasaraujo/Coding/.venv_gingado/lib/python3.10/site-packages/sklearn/feature_selection/_variance_threshold.py:104: RuntimeWarning: Degrees of freedom <= 0 for slice.
  self.variances_ = np.nanvar(X, axis=0)

	orig_col
2020-01-01	0.304717
2020-01-02	-1.039984
2020-01-03	0.750451
2020-01-04	0.940565
2020-01-05	-1.951035
2020-01-06	-1.302180
2020-01-07	0.127840
2020-01-08	-0.316243
2020-01-09	-0.016801
2020-01-10	-0.853044
2020-01-11	0.879398
2020-01-12	0.777792
2020-01-13	0.066031
2020-01-14	1.127241
2020-01-15	0.467509
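The warning in the output above points to scikit-learn's variance threshold module, so the effect of variance_threshold can be pictured offline with VarianceThreshold. The sketch below is only an illustration under that assumption: the column names and values are made up, and it does not reproduce gingado's internal code.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up candidate series: one constant, one that changes over time
idx = pd.date_range("2020-01-01", periods=5, freq="D")
candidates = pd.DataFrame(
    {
        "constant_rate": [1.75, 1.75, 1.75, 1.75, 1.75],
        "moving_rate": [55.0, 55.0, 52.0, 52.0, 50.0],
    },
    index=idx,
)

# Keep only the columns whose variance over time exceeds the threshold
selector = VarianceThreshold(threshold=0.5).fit(candidates)
kept = list(candidates.columns[selector.get_support()])
print(kept)

# A threshold above every column's variance leaves nothing to keep;
# scikit-learn signals this with a ValueError
try:
    VarianceThreshold(threshold=10).fit(candidates)
    too_strict = False
except ValueError:
    too_strict = True
print(too_strict)
```

With the lower threshold only the series that moves is retained, which mirrors the second scenario above where a threshold of 10 results in no columns being added.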

AugmentSDMX

 AugmentSDMX (sources:dict={'BIS': 'WS_CBPOL_D'},
              variance_threshold:float|None=None,
              propagate_last_known_value:bool=True, fillna:float|int=0,
              verbose:bool=True)

A transformer that augments a dataset using SDMX

	Type	Default	Details
sources	dict	{'BIS': 'WS_CBPOL_D'}	A dictionary with sources as keys and dataflows as values
variance_threshold	float | None	None	If None (default), all variables are kept. Otherwise, variables whose variance over time is below this threshold are removed
propagate_last_known_value	bool	True	Whether the last value that is not NA should be propagated to the following dates
fillna	float | int	0	Value to use to fill missing data
verbose	bool	True	Whether to inform the user as the process progresses
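One way to picture propagate_last_known_value is a pandas forward fill. The sketch below is an assumption about the behaviour described in the table, not a description of gingado's internals, and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up policy rate series with gaps between announcements
idx = pd.date_range("2020-01-01", periods=5, freq="D")
rates = pd.DataFrame({"rate": [np.nan, 4.5, np.nan, np.nan, 4.25]}, index=idx)

# Carry the last known value forward to the following dates
propagated = rates.ffill()
print(propagated)
```

The first observation stays missing because there is no earlier value to propagate, which is similar to the leading NaN entries in the augmented data frame shown earlier.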

fit

 fit (X:Union[pandas.core.series.Series,pandas.core.frame.DataFrame],
      y:NoneType=None)

Fits instance of AugmentSDMX to X, learning its time series frequency

	Type	Default	Details
X	pd.Series | pd.DataFrame		Data having an index of `datetime` type
y	None	None	`y` is kept as argument for API consistency only
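As a rough illustration of what learning the time series frequency from a datetime index can look like, pandas offers infer_freq. Whether AugmentSDMX uses this exact function is an assumption made for this sketch.

```python
import pandas as pd

# Same kind of daily index as in the simulated data above
idx = pd.date_range(freq="D", start="2020-01-01", periods=15)

# Infer the frequency string from the spacing of the timestamps
freq = pd.infer_freq(idx)
print(freq)
```

A frequency inferred this way could then be used to request series of a matching frequency from an SDMX source.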

transform

 transform (X:Union[pandas.core.series.Series,pandas.core.frame.DataFrame],
            y:NoneType=None, training:bool=False)

Transforms input dataset X by adding the requested data using SDMX

	Type	Default	Details
X	pd.Series | pd.DataFrame		Data having an index of `datetime` type
y	None	None	`y` is kept as argument for API consistency only
training	bool	False	`True` if `transform` is called during training, `False` (default) if called during testing
Returns	np.ndarray		`X` augmented with data from SDMX with the same number of samples but more columns

fit_transform

 fit_transform (X:Union[pandas.core.series.Series,pandas.core.frame.DataFrame],
                y:NoneType=None)

Fit to data, then transform it.

	Type	Default	Details
X	pd.Series | pd.DataFrame		Data having an index of `datetime` type
y	None	None	`y` is kept as argument for API consistency only
Returns	np.ndarray		`X` augmented with data from SDMX with the same number of samples but more columns

Compatibility with `scikit-learn`

As mentioned above, gingado’s transformers are built to be compatible with scikit-learn. The code below demonstrates this compatibility.

First, we create the example dataset. It comprises the daily foreign exchange rates of selected currencies against the euro. The Brazilian real (BRL) is chosen as the dependent variable in this example.

from gingado.utils import load_SDMX_data, Lag
from sklearn.model_selection import TimeSeriesSplit

X = load_SDMX_data(
    sources={'ECB': 'EXR'}, 
    keys={'FREQ': 'D', 'CURRENCY': ['EUR', 'AUD', 'BRL', 'CAD', 'CHF', 'GBP', 'JPY', 'SGD', 'USD']},
    params={"startPeriod": 2003}
    )
# drop rows with empty values
X.dropna(inplace=True)
# adjust column names in this simple example for ease of understanding:
# remove parts related to source and dataflow names
X.columns = X.columns.str.replace("ECB__EXR_D__", "").str.replace("__EUR__SP00__A", "")
X = Lag(lags=1, jump=0, keep_contemporaneous_X=True).fit_transform(X)
y = X.pop('BRL')
# retain only the lagged variables in the X variable
X = X[X.columns[X.columns.str.contains('_lag_')]]

Querying data from ECB's dataflow 'EXR' - Exchange Rates...

2023-06-18 11:29:19,756 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message
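The lagging and selection steps in the code above can be pictured offline with plain pandas. This is a minimal sketch with made-up exchange rates; it assumes that the lagged columns carry a `_lag_1` suffix, as in the column names shown later on this page, and it uses shift as a stand-in for Lag rather than reproducing it.

```python
import pandas as pd

# Made-up exchange rates standing in for the ECB data
idx = pd.date_range("2023-06-12", periods=4, freq="D")
fx = pd.DataFrame(
    {"BRL": [5.29, 5.25, 5.24, 5.22], "USD": [1.0780, 1.0765, 1.0793, 1.0809]},
    index=idx,
)

# Stand-in for Lag(lags=1, keep_contemporaneous_X=True): add one-period lags
lagged = fx.shift(1).add_suffix("_lag_1")
combined = pd.concat([fx, lagged], axis=1).dropna()

# The contemporaneous BRL rate is the target; only lagged columns are features
y_demo = combined.pop("BRL")
X_demo = combined.filter(like="_lag_")
print(X_demo)
```

Keeping only lagged columns as features means that each prediction relies solely on information available before the date being predicted.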

X_train, X_test = X.iloc[:-1], X.tail(1)
y_train, y_test = y.iloc[:-1], y.tail(1)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((5239, 8), (5239,), (1, 8), (1,))

Next, the data augmentation object provided by gingado adds more data. For brevity, only one dataflow per source is listed here. To add more SDMX sources, add more keys to the dictionary. To request all dataflows from a given source (provided that the keys and parameters, such as frequency and dates, match), set the value to 'all', as in {'ECB': ['CISS'], 'BIS': 'all'}.

test_src = {'ECB': ['CISS'], 'BIS': ['WS_CBPOL_D']}

X_train__fit_transform = AugmentSDMX(sources=test_src).fit_transform(X=X_train)
X_train__fit_then_transform = AugmentSDMX(sources=test_src).fit(X=X_train).transform(X=X_train, training=True)

assert X_train__fit_transform.shape == X_train__fit_then_transform.shape

Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
Querying data from BIS's dataflow 'WS_CBPOL_D' - Policy rates daily...
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
Querying data from BIS's dataflow 'WS_CBPOL_D' - Policy rates daily...

2023-06-18 11:31:53,806 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message
2023-06-18 11:33:30,655 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message

This is the dataset after this particular augmentation:

print(f"No of columns: {len(X_train__fit_transform.columns)} {X_train__fit_transform.columns}")
X_train__fit_transform

No of columns: 69 Index(['AUD_lag_1', 'BRL_lag_1', 'CAD_lag_1', 'CHF_lag_1', 'GBP_lag_1',
       'JPY_lag_1', 'SGD_lag_1', 'USD_lag_1',
       'ECB__CISS_D__AT__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__BE__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__CN__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__DE__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__ES__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__FI__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__FR__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__GB__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__IE__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__IT__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__NL__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__PT__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_BM__CON',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_CI__IDX',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_CIN__IDX',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_CO__CON',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_EM__CON',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_FI__CON',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_FX__CON',
       'ECB__CISS_D__U2__Z0Z__4F__EC__SS_MM__CON',
       'ECB__CISS_D__US__Z0Z__4F__EC__SS_CI__IDX',
       'ECB__CISS_D__US__Z0Z__4F__EC__SS_CIN__IDX', 'BIS__WS_CBPOL_D_D__AR',
       'BIS__WS_CBPOL_D_D__AU', 'BIS__WS_CBPOL_D_D__BR',
       'BIS__WS_CBPOL_D_D__CA', 'BIS__WS_CBPOL_D_D__CH',
       'BIS__WS_CBPOL_D_D__CL', 'BIS__WS_CBPOL_D_D__CN',
       'BIS__WS_CBPOL_D_D__CO', 'BIS__WS_CBPOL_D_D__CZ',
       'BIS__WS_CBPOL_D_D__DK', 'BIS__WS_CBPOL_D_D__GB',
       'BIS__WS_CBPOL_D_D__HK', 'BIS__WS_CBPOL_D_D__HR',
       'BIS__WS_CBPOL_D_D__HU', 'BIS__WS_CBPOL_D_D__ID',
       'BIS__WS_CBPOL_D_D__IL', 'BIS__WS_CBPOL_D_D__IN',
       'BIS__WS_CBPOL_D_D__IS', 'BIS__WS_CBPOL_D_D__JP',
       'BIS__WS_CBPOL_D_D__KR', 'BIS__WS_CBPOL_D_D__MA',
       'BIS__WS_CBPOL_D_D__MK', 'BIS__WS_CBPOL_D_D__MX',
       'BIS__WS_CBPOL_D_D__MY', 'BIS__WS_CBPOL_D_D__NO',
       'BIS__WS_CBPOL_D_D__NZ', 'BIS__WS_CBPOL_D_D__PE',
       'BIS__WS_CBPOL_D_D__PH', 'BIS__WS_CBPOL_D_D__PL',
       'BIS__WS_CBPOL_D_D__RO', 'BIS__WS_CBPOL_D_D__RS',
       'BIS__WS_CBPOL_D_D__RU', 'BIS__WS_CBPOL_D_D__SA',
       'BIS__WS_CBPOL_D_D__SE', 'BIS__WS_CBPOL_D_D__TH',
       'BIS__WS_CBPOL_D_D__TR', 'BIS__WS_CBPOL_D_D__US',
       'BIS__WS_CBPOL_D_D__XM', 'BIS__WS_CBPOL_D_D__ZA'],
      dtype='object')

	AUD_lag_1	BRL_lag_1	CAD_lag_1	CHF_lag_1	GBP_lag_1	JPY_lag_1	SGD_lag_1	USD_lag_1	ECB__CISS_D__AT__Z0Z__4F__EC__SS_CIN__IDX	ECB__CISS_D__BE__Z0Z__4F__EC__SS_CIN__IDX	...	BIS__WS_CBPOL_D_D__RO	BIS__WS_CBPOL_D_D__RS	BIS__WS_CBPOL_D_D__RU	BIS__WS_CBPOL_D_D__SA	BIS__WS_CBPOL_D_D__SE	BIS__WS_CBPOL_D_D__TH	BIS__WS_CBPOL_D_D__TR	BIS__WS_CBPOL_D_D__US	BIS__WS_CBPOL_D_D__XM	BIS__WS_CBPOL_D_D__ZA
TIME_PERIOD
2003-01-03	1.8554	3.6770	1.6422	1.4528	0.65200	124.40	1.8188	1.0446	0.021899	0.043292	...	NaN	9.5	NaN	NaN	3.75	1.75	44.0	1.250	2.75	13.50
2003-01-06	1.8440	3.6112	1.6264	1.4555	0.65000	124.56	1.8132	1.0392	0.020801	0.039924	...	19.75	9.5	NaN	2.00	3.75	1.75	44.0	1.250	2.75	13.50
2003-01-07	1.8281	3.5145	1.6383	1.4563	0.64950	124.40	1.8210	1.0488	0.019738	0.038084	...	19.75	9.5	NaN	2.00	3.75	1.75	44.0	1.250	2.75	13.50
2003-01-08	1.8160	3.5139	1.6257	1.4565	0.64960	124.82	1.8155	1.0425	0.019947	0.040338	...	19.75	9.5	21.0	2.00	3.75	1.75	44.0	1.250	2.75	13.50
2003-01-09	1.8132	3.4405	1.6231	1.4586	0.64950	124.90	1.8102	1.0377	0.017026	0.040535	...	19.75	9.5	21.0	2.00	3.75	1.75	44.0	1.250	2.75	13.50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2023-06-09	1.6061	5.2866	1.4342	0.9751	0.86113	149.98	1.4460	1.0737	0.167185	0.124553	...	7.00	6.0	7.5	5.75	3.50	2.00	8.5	5.125	3.75	8.25
2023-06-12	1.6023	5.2965	1.4362	0.9716	0.85795	150.24	1.4480	1.0780	0.171139	0.123640	...	7.00	6.0	7.5	5.75	3.50	2.00	8.5	5.125	3.75	8.25
2023-06-13	1.5920	5.2549	1.4357	0.9751	0.85678	150.03	1.4457	1.0765	0.164665	0.118228	...	7.00	6.0	7.5	5.75	3.50	2.00	8.5	5.125	3.75	8.25
2023-06-14	1.5922	5.2469	1.4403	0.9784	0.85850	150.62	1.4467	1.0793	0.151799	0.111484	...	7.00	6.0	7.5	5.75	3.50	2.00	8.5	5.125	3.75	8.25
2023-06-15	1.5915	5.2489	1.4378	0.9751	0.85455	151.21	1.4499	1.0809	0.137266	0.094035	...	7.00	6.0	7.5	5.75	3.50	2.00	8.5	5.125	3.75	8.25

5239 rows × 69 columns
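Because the added columns are prefixed with the source code, it is straightforward to check how many columns each source contributed. The sketch below works on a handful of column names copied from the output above; treating the text before the first double underscore as the source code is an assumption based on those names.

```python
import pandas as pd

# A few column names copied from the augmented dataset above
columns = pd.Index(
    [
        "AUD_lag_1",
        "USD_lag_1",
        "ECB__CISS_D__U2__Z0Z__4F__EC__SS_CI__IDX",
        "BIS__WS_CBPOL_D_D__BR",
        "BIS__WS_CBPOL_D_D__US",
    ]
)

# Columns added by the augmentation contain a double underscore
added = columns[columns.str.contains("__")]

# Count the added columns by the prefix that identifies the source
by_source = added.to_series().str.split("__").str[0].value_counts().to_dict()
print(by_source)
```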

Pipeline

AugmentSDMX can also be part of a Pipeline object, which minimises operational errors during modelling and avoids using testing data during training:

from sklearn.pipeline import Pipeline
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('augmentation', AugmentSDMX(sources={'BIS': 'WS_CBPOL_D'})),
    ('imp', IterativeImputer(max_iter=10)),
    ('forest', RandomForestRegressor())
], verbose=True)

Tuning the data augmentation to enhance model performance

Because AugmentSDMX can be included in a Pipeline, it can also be tuned with parameter search techniques (such as grid search), which further helps users make the best of the available data to enhance the performance of their models.

Tip

Users can cache the data augmentation step to avoid repeating potentially lengthy data downloads. See the memory argument in the sklearn.pipeline.Pipeline documentation.
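A minimal sketch of the memory argument is shown below. To keep it runnable offline, StandardScaler stands in for the data augmentation step and the data is simulated; the names cache_dir and cached_pipeline are illustrative only.

```python
from shutil import rmtree
from tempfile import mkdtemp

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Temporary directory where fitted transformers are cached
cache_dir = mkdtemp()

# StandardScaler is a stand-in for a slow step such as AugmentSDMX
cached_pipeline = Pipeline(
    [("scale", StandardScaler()), ("model", LinearRegression())],
    memory=cache_dir,
)

rng = np.random.default_rng(0)
X_sim = rng.standard_normal((20, 3))
y_sim = X_sim @ np.array([1.0, -2.0, 0.5])

# On the second fit, the cached transformer can be reused
# because the step and its input data are unchanged
cached_pipeline.fit(X_sim, y_sim)
cached_pipeline.fit(X_sim, y_sim)
preds = cached_pipeline.predict(X_sim)

# Remove the cache directory once it is no longer needed
rmtree(cache_dir)
```

Caching is most useful when the same step is fitted repeatedly on identical inputs, as happens across the candidates of a parameter search.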

grid = GridSearchCV(
    estimator=pipeline,
    param_grid={
        'augmentation': ['passthrough', AugmentSDMX(sources={'ECB': 'CISS'})]
    },
    verbose=2,
    cv=TimeSeriesSplit(n_splits=2)
    )

y_pred_grid = grid.fit(X_train, y_train).predict(X_test)

Fitting 2 folds for each of 2 candidates, totalling 4 fits
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   0.7s
[CV] END ...........................augmentation=passthrough; total time=   0.7s
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   1.3s
[CV] END ...........................augmentation=passthrough; total time=   1.3s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   9.7s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.2s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   1.8s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV] END ..augmentation=AugmentSDMX(sources={'ECB': 'CISS'}); total time=  21.1s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=  18.3s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.4s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   4.4s
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
[CV] END ..augmentation=AugmentSDMX(sources={'ECB': 'CISS'}); total time=  41.7s
[Pipeline] ...... (step 1 of 3) Processing augmentation, total=   0.0s
[Pipeline] ............... (step 2 of 3) Processing imp, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing forest, total=   2.0s

2023-06-18 11:37:46,240 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message
2023-06-18 11:37:58,171 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message
2023-06-18 11:38:07,471 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message
2023-06-18 11:38:30,335 pandasdmx.reader.sdmxml - INFO: Use supplied dsd=… argument for non–structure-specific message

grid.best_params_

{'augmentation': 'passthrough'}

print(f"In this particular case, the best model was achieved by {'not ' if grid.best_params_['augmentation'] == 'passthrough' else ''}using the data augmentation.")

In this particular case, the best model was achieved by not using the data augmentation.
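The 'passthrough' mechanism used in the grid above is a general scikit-learn feature: any pipeline step can be switched off by listing 'passthrough' among its candidate values. The sketch below tries this offline with simulated data, using StandardScaler as a stand-in for the data augmentation step; all names in it are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data, ordered in time
rng = np.random.default_rng(0)
X_sim = rng.standard_normal((60, 3))
y_sim = X_sim[:, 0] + 0.1 * rng.standard_normal(60)

demo_pipeline = Pipeline([("optional_step", StandardScaler()), ("model", Ridge())])

# Each candidate either skips the step or replaces it with the given estimator
demo_grid = GridSearchCV(
    estimator=demo_pipeline,
    param_grid={"optional_step": ["passthrough", StandardScaler()]},
    cv=TimeSeriesSplit(n_splits=2),
)
demo_grid.fit(X_sim, y_sim)

n_candidates = len(demo_grid.cv_results_["params"])
print(n_candidates, demo_grid.best_params_)
```

The same pattern extends to comparing several AugmentSDMX configurations, for example different sources, by listing them as candidates for the step.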

print(f"The last value in the training dataset was {y_train.tail(1).to_numpy()}. The predicted value was {y_pred_grid}, and the actual value was {y_test.to_numpy()}.")

The last value in the training dataset was [5.2244]. The predicted value was [5.236705], and the actual value was [5.279].

Sources of data

By choice, gingado seeks to list only reliable data sources, with a focus on official sources. This is meant to give users confidence that their dataset will be complemented by reliable data. Unfortunately, it is not possible at this stage to include all official sources, given the substantial manual and maintenance work involved. gingado leverages the Statistical Data and Metadata eXchange (SDMX), an initiative of official data providers that establishes common data and metadata formats, to download data that is relevant (and hopefully also useful) to users.

The function list_SDMX_sources returns a list of codes corresponding to the data sources available to provide gingado users with data through SDMX.

from gingado.utils import list_SDMX_sources

list_SDMX_sources()

['ABS',
 'ABS_XML',
 'BBK',
 'BIS',
 'CD2030',
 'ECB',
 'ESTAT',
 'ILO',
 'IMF',
 'INEGI',
 'INSEE',
 'ISTAT',
 'LSD',
 'NB',
 'NBB',
 'OECD',
 'SGR',
 'SPC',
 'STAT_EE',
 'UNICEF',
 'UNSD',
 'WB',
 'WB_WDI']

You can also see which dataflows are available. The code below lists them for all sources. As the output shows, the result is indexed by the code of the SDMX source and the code of the dataflow, and each value is the name of the respective dataflow.

from gingado.utils import list_all_dataflows

dflows = list_all_dataflows()
dflows

2023-06-18 11:43:54,715 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>
2023-06-18 11:43:58,219 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>
2023-06-18 11:44:19,840 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>
2023-06-18 11:44:20,874 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>
2023-06-18 11:44:25,735 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>
2023-06-18 11:44:26,799 pandasdmx.reader.sdmxml - DEBUG: Truncate sub-microsecond time in <Prepared>

ABS_XML  ABORIGINAL_POP_PROJ                 Projected population, Aboriginal and Torres St...
         ABORIGINAL_POP_PROJ_REMOTE          Projected population, Aboriginal and Torres St...
         ABS_ABORIGINAL_POPPROJ_INDREGION    Projected population, Aboriginal and Torres St...
         ABS_ACLD_LFSTATUS                   Australian Census Longitudinal Dataset (ACLD):...
         ABS_ACLD_TENURE                     Australian Census Longitudinal Dataset (ACLD):...
                                                                   ...                        
UNSD     DF_UNData_UNFCC                                                       SDMX_GHG_UNDATA
WB       DF_WITS_Tariff_TRAINS                                WITS - UNCTAD TRAINS Tariff Data
         DF_WITS_TradeStats_Development                             WITS TradeStats Devlopment
         DF_WITS_TradeStats_Tariff                                      WITS TradeStats Tariff
         DF_WITS_TradeStats_Trade                                        WITS TradeStats Trade
Name: dataflow, Length: 3354, dtype: object

For example, the dataflows from the World Bank are:

dflows['WB']

DF_WITS_Tariff_TRAINS             WITS - UNCTAD TRAINS Tariff Data
DF_WITS_TradeStats_Development          WITS TradeStats Devlopment
DF_WITS_TradeStats_Tariff                   WITS TradeStats Tariff
DF_WITS_TradeStats_Trade                     WITS TradeStats Trade
Name: dataflow, dtype: object
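The selection by source shown above relies on pandas indexing with a two-level index. The sketch below rebuilds a small object of the same shape from a few entries copied from the output, so that the indexing and a simple keyword search can be tried without downloading anything; the name toy_dflows is illustrative.

```python
import pandas as pd

# A few entries copied from the output above: (source, dataflow code) -> name
toy_dflows = pd.Series(
    {
        ("UNSD", "DF_UNData_UNFCC"): "SDMX_GHG_UNDATA",
        ("WB", "DF_WITS_Tariff_TRAINS"): "WITS - UNCTAD TRAINS Tariff Data",
        ("WB", "DF_WITS_TradeStats_Trade"): "WITS TradeStats Trade",
    },
    name="dataflow",
)

# Select all dataflows of one source using the first index level
wb_flows = toy_dflows["WB"]

# List the sources that have at least one dataflow
sources_found = toy_dflows.index.get_level_values(0).unique().tolist()

# Search dataflow names for a keyword
tariff_flows = toy_dflows[toy_dflows.str.contains("Tariff")]
print(wb_flows, sources_found, tariff_flows, sep="\n")
```

A source and dataflow code found this way can then be passed to AugmentSDMX through its sources argument, for example as a dictionary of the form {'WB': 'DF_WITS_TradeStats_Trade'}.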
