Utils

Functions to support the use of gingado

Support for model documentation


get_datetime

 get_datetime ()

Returns the current time as a string

d = get_datetime()
assert isinstance(d, str)
assert len(d) > 0

read_attr

 read_attr (obj)

Read object type and values of attributes from fitted object

     Details
obj  Object from which attributes will be read

The function read_attr helps gingado Documenters read the object behind the scenes.

It collects the estimator type and any attributes that result from fitting the object (i.e., those that end in “_” but not in a double underscore).

For example, the attributes of an untrained and a trained random forest are, in sequence:

from sklearn.ensemble import RandomForestRegressor
from gingado.utils import read_attr

rf_unfit = RandomForestRegressor(n_estimators=3)
# fit on a tiny toy dataset; the target is 1-D, as scikit-learn expects
rf_fit = RandomForestRegressor(n_estimators=3)\
    .fit([[1, 0], [0, 1]], [0.5, 0.5])
list(read_attr(rf_unfit)), list(read_attr(rf_fit))
([{'_estimator_type': 'regressor'}],
 [{'_estimator_type': 'regressor'},
  {'base_estimator_': DecisionTreeRegressor()},
  {'estimators_': [DecisionTreeRegressor(max_features=1.0, random_state=1632148864),
    DecisionTreeRegressor(max_features=1.0, random_state=1616501356),
    DecisionTreeRegressor(max_features=1.0, random_state=2109419996)]},
  {'feature_importances_': array([0., 0.])},
  {'n_features_': 2},
  {'n_features_in_': 2},
  {'n_outputs_': 1}])
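
The selection rule described above can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not gingado's implementation: ToyEstimator and collect_fitted_attrs are hypothetical names, and the code assumes only the naming rule stated above (attributes that end in a single trailing underscore). The estimator type is left out for brevity.

```python
class ToyEstimator:
    """Hypothetical estimator that mimics scikit-learn's naming convention."""

    def __init__(self, n_estimators=3):
        self.n_estimators = n_estimators  # hyperparameter, set before fitting

    def fit(self, X, y):
        # Attributes learned from data get a trailing underscore.
        self.n_features_in_ = len(X[0])
        self.mean_ = sum(y) / len(y)
        return self


def collect_fitted_attrs(obj):
    """Yield {name: value} for instance attributes ending in '_' but not '__'."""
    for name in sorted(vars(obj)):
        if name.endswith("_") and not name.endswith("__"):
            yield {name: getattr(obj, name)}


unfit = ToyEstimator()
fit = ToyEstimator().fit([[1, 0], [0, 1]], [0.5, 0.5])

print(list(collect_fitted_attrs(unfit)))  # nothing to report before fitting
print(list(collect_fitted_attrs(fit)))    # only the fitted attributes
```

As with the random forest above, the unfitted object yields no fitted attributes, while the fitted one exposes only the names created by fit.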

Support for time series

Objects of the class Lag are similar to scikit-learn’s transformers.


Lag

 Lag (lags=1, jump=0, keep_contemporaneous_X=False)

A transformer that lags variables


Lag.fit

 Lag.fit (X:numpy.ndarray, y=None)

Fit the Lag transformer

   Type      Default  Details
X  ndarray            Array-like data of shape (n_samples, n_features)
y  NoneType  None     Array-like data of shape (n_samples,) or (n_samples, n_targets) or None

Lag.transform

 Lag.transform (X:numpy.ndarray)

Lag the dataset X

   Type     Details
X  ndarray  Array-like data of shape (n_samples, n_features)

TransformerMixin.fit_transform

 TransformerMixin.fit_transform (X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

            Type                                                Default  Details
X           array-like of shape (n_samples, n_features)                  Input samples.
y           NoneType                                            None     Target values (None for unsupervised transformations).
fit_params
Returns     ndarray array of shape (n_samples, n_features_new)           Transformed array.

The code below demonstrates how Lag works in practice. Note in particular that, because Lag is a transformer, it can be used as part of a scikit-learn Pipeline.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from gingado.utils import Lag

randomX = np.random.rand(15, 2)
randomY = np.random.rand(15)

lags = 3
jump = 2

# `pipe` holds the transformed array, not the Pipeline object itself
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('lagger', Lag(lags=lags, jump=jump, keep_contemporaneous_X=False))
]).fit_transform(randomX, randomY)

Below we confirm that the lagger removes the correct number of rows corresponding to the lagged observations:

assert randomX.shape[0] - lags - jump == pipe.shape[0]
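
The row arithmetic checked above can be reproduced with a small stand-in written in plain Python. This is a sketch under stated assumptions rather than gingado's implementation: lag_rows is a hypothetical helper, and it assumes that the lagged copies at offsets jump + 1 through jump + lags are placed side by side, so that the first lags + jump rows have no complete history and are dropped.

```python
def lag_rows(X, lags=1, jump=0):
    """Hypothetical helper: build lagged rows from a list of rows X."""
    offset = lags + jump  # number of leading rows without a full history
    lagged = []
    for t in range(offset, len(X)):
        row = []
        # Assumed layout: lags jump+1, ..., jump+lags, side by side.
        for k in range(jump + 1, offset + 1):
            row.extend(X[t - k])
        lagged.append(row)
    return lagged


# 15 observations of 2 features, mirroring the shape used above
X = [[t, 10 * t] for t in range(15)]
lagged = lag_rows(X, lags=3, jump=2)

print(len(lagged))     # 15 - 3 - 2 = 10 rows remain
print(len(lagged[0]))  # 3 lags x 2 features = 6 columns
print(lagged[0])       # built from observations 2, 1 and 0
```

With lags=3 and jump=2, the first output row corresponds to observation 5 and contains observations 2, 1 and 0, which is why five rows are lost.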

Because Lag is a transformer, its parameters (lags and jump) can be tuned as hyperparameters to achieve the best performance for a model.
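
As a sketch of what such tuning could look like, the candidate values for lags and jump can be laid out as a parameter grid. The example assumes the step name 'lagger' from the pipeline above and scikit-learn's step__parameter naming convention for pipeline parameters. The candidate values are arbitrary, and the enumeration uses only the standard library instead of scikit-learn's own grid utilities.

```python
from itertools import product

# Candidate values are arbitrary; keys follow the '<step>__<param>' convention
param_grid = {
    "lagger__lags": [1, 2, 3],
    "lagger__jump": [0, 1, 2],
}

# Enumerate every combination a grid search would evaluate
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    # Each candidate removes lags + jump rows from the start of the sample
    rows_lost = candidate["lagger__lags"] + candidate["lagger__jump"]
    print(candidate, "rows lost:", rows_lost)
```

A dictionary of this shape is what scikit-learn's GridSearchCV accepts as its param_grid argument. Larger values of lags and jump leave fewer rows for estimation, which is worth keeping in mind when choosing the candidates.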

Support for data augmentation with SDMX

Note

Working with SDMX may take several minutes, depending on the amount of data being downloaded.


list_SDMX_sources

 list_SDMX_sources ()

Fetch the list of SDMX sources

sources = list_SDMX_sources()
print(sources)

assert len(sources) > 0
# all elements are of type 'str'
assert all(isinstance(src, str) for src in sources)
['ABS', 'ABS_XML', 'BBK', 'BIS', 'CD2030', 'ECB', 'ESTAT', 'ILO', 'IMF', 'INEGI', 'INSEE', 'ISTAT', 'LSD', 'NB', 'NBB', 'OECD', 'SGR', 'SPC', 'STAT_EE', 'UNICEF', 'UNSD', 'WB', 'WB_WDI']

list_all_dataflows

 list_all_dataflows (codes_only:bool=False, return_pandas:bool=True)

List all SDMX dataflows. Note: when using the result as a parameter to an AugmentSDMX object or to the load_SDMX_data function, set codes_only=True.

               Type  Default  Details
codes_only     bool  False    Whether to return only the dataflow codes
return_pandas  bool  True     Whether to return the result in a pandas DataFrame format

dflows = list_all_dataflows(return_pandas=False)

assert isinstance(dflows, dict)
all_sources = list_SDMX_sources()
# every key of the dictionary is a known SDMX source
assert all(s in all_sources for s in dflows)

list_all_dataflows returns by default a pandas Series, facilitating data discovery by users like so:

dflows = list_all_dataflows(return_pandas=True)
assert isinstance(dflows, pd.Series)

dflows
ABS_XML  ABORIGINAL_POP_PROJ                 Projected population, Aboriginal and Torres St...
         ABORIGINAL_POP_PROJ_REMOTE          Projected population, Aboriginal and Torres St...
         ABS_ABORIGINAL_POPPROJ_INDREGION    Projected population, Aboriginal and Torres St...
         ABS_ACLD_LFSTATUS                   Australian Census Longitudinal Dataset (ACLD):...
         ABS_ACLD_TENURE                     Australian Census Longitudinal Dataset (ACLD):...
                                                                   ...                        
UNSD     DF_UNData_UNFCC                                                       SDMX_GHG_UNDATA
WB       DF_WITS_Tariff_TRAINS                                WITS - UNCTAD TRAINS Tariff Data
         DF_WITS_TradeStats_Development                             WITS TradeStats Devlopment
         DF_WITS_TradeStats_Tariff                                      WITS TradeStats Tariff
         DF_WITS_TradeStats_Trade                                        WITS TradeStats Trade
Name: dataflow, Length: 3290, dtype: object

This format allows for more easily searching dflows by source:

list_all_dataflows(codes_only=True, return_pandas=True)
ABS_XML  0                 ABORIGINAL_POP_PROJ
         1          ABORIGINAL_POP_PROJ_REMOTE
         2    ABS_ABORIGINAL_POPPROJ_INDREGION
         3                   ABS_ACLD_LFSTATUS
         4                     ABS_ACLD_TENURE
                            ...               
UNSD     5                     DF_UNData_UNFCC
WB       0               DF_WITS_Tariff_TRAINS
         1      DF_WITS_TradeStats_Development
         2           DF_WITS_TradeStats_Tariff
         3            DF_WITS_TradeStats_Trade
Name: dataflow, Length: 3290, dtype: object
dflows['BIS']
WS_CBPOL_D                                    Policy rates daily
WS_CBPOL_M                                  Policy rates monthly
WS_CBS_PUB                              BIS consolidated banking
WS_CPMI_CASHLESS                   CPMI cashless payments (T5-6)
WS_CPMI_CT1                       CPMI comparative tables type 1
WS_CPMI_CT2                       CPMI comparative tables type 2
WS_CPMI_DEVICES                             CPMI payment devices
WS_CPMI_INSTITUTIONS                           CPMI institutions
WS_CPMI_MACRO                                         CPMI Macro
WS_CPMI_PARTICIPANTS                           CPMI participants
WS_CPMI_SYSTEMS         CPMI systems (T8-9-11-13-14-16-17-18-19)
WS_CREDIT_GAP                             BIS credit-to-GDP gaps
WS_DEBT_SEC2_PUB                             BIS debt securities
WS_DER_OTC_TOV                          OTC derivatives turnover
WS_DSR                                    BIS debt service ratio
WS_EER_D                      BIS effective exchange rates daily
WS_EER_M                    BIS effective exchange rates monthly
WS_GLI                               Global liquidity indicators
WS_LBS_D_PUB                              BIS locational banking
WS_LONG_CPI                             BIS long consumer prices
WS_OTC_DERIV2                        OTC derivatives outstanding
WS_SPP                      BIS property prices: selected series
WS_TC                            BIS long series on total credit
WS_XRU                           US dollar exchange rates, m,q,a
WS_XRU_D                         US dollar exchange rates, daily
WS_XTD_DERIV                         Exchange traded derivatives
Name: dataflow, dtype: object

Or the user can search dataflows by their human-readable name instead of their code. For example, this is one way to see if any dataflow has information on interest rates:

dflows[dflows.str.contains('Interest rates', case=False)]
BBK  BBSDI       Discount interest rates pursuant to section 25...
ECB  RIR                                     Retail Interest Rates
IMF  6SR         M&B: Interest Rates and Share Prices (6SR) for...
     INR                                            Interest rates
     INR_NSTD                          Interest rates_Non-Standard
Name: dataflow, dtype: object
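
The same case-insensitive name search can be sketched for the dictionary returned with return_pandas=False. The layout of that dictionary is an assumption here (each source code mapped to a dictionary of dataflow codes and names), the entries are copied from the outputs above, and search_dataflows is a hypothetical helper rather than part of gingado.

```python
# Assumed layout: {source: {dataflow_code: dataflow_name}}
dflows_dict = {
    "BIS": {"WS_CBPOL_D": "Policy rates daily", "WS_DSR": "BIS debt service ratio"},
    "ECB": {"RIR": "Retail Interest Rates"},
    "IMF": {"INR": "Interest rates", "INR_NSTD": "Interest rates_Non-Standard"},
}


def search_dataflows(dflows, term):
    """Hypothetical helper: return (source, code, name) matches, ignoring case."""
    term = term.lower()
    return [
        (source, code, name)
        for source, flows in dflows.items()
        for code, name in flows.items()
        if term in name.lower()
    ]


for match in search_dataflows(dflows_dict, "interest rates"):
    print(match)
```

Unlike the pandas call, this version matches a plain substring, not a regular expression, which is usually sufficient for simple keyword searches.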

The function load_SDMX_data is a convenience function that downloads data from SDMX sources (and any specific dataflows passed as arguments) if they match the key and parameters set by the user.


load_SDMX_data

 load_SDMX_data (sources:dict, keys:dict, params:dict, verbose:bool=True)

Loads datasets from SDMX.

         Type  Default  Details
sources  dict           A dictionary with the sources and dataflows per source
keys     dict           The keys to be used in the SDMX query
params   dict           The parameters to be used in the SDMX query
verbose  bool  True     Whether to communicate download steps to the user

df = load_SDMX_data(sources={'ECB': 'CISS', 'BIS': 'WS_CBPOL_D'}, keys={'FREQ': 'D'}, params={'startPeriod': 2003})

assert isinstance(df, pd.DataFrame)
assert df.shape[0] > 0
assert df.shape[1] > 0
Querying data from ECB's dataflow 'CISS' - Composite Indicator of Systemic Stress...
Querying data from BIS's dataflow 'WS_CBPOL_D' - Policy rates daily...