syndat.metrics

Metrics for the evaluation of synthetic data fidelity.

syndat.metrics.discriminator_auc(real, synthetic, n_folds=5, drop_na_threshold=0.9)

Computes the ROC AUC score of a classifier trained to differentiate between real and synthetic data.

Parameters:

real (DataFrame) – The real data.
synthetic (DataFrame) – The synthetic data.
n_folds (int) – Number of folds used for k-fold cross-validation.
drop_na_threshold (float) – Minimum fraction of non-missing values required of each column (the default of 0.9 corresponds to 90%).

Return type:

float

Returns:

The ROC AUC score of the classifier.
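The idea behind this metric can be sketched with scikit-learn. The following is a minimal illustration, not syndat's implementation: the function name discriminator_auc_sketch, the random forest classifier, the restriction to numeric columns, and the median imputation are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def discriminator_auc_sketch(real, synthetic, n_folds=5, drop_na_threshold=0.9):
    """Illustrative only: cross-validated ROC AUC of a real-vs-synthetic classifier."""
    # Stack both tables and label the rows: 0 = real, 1 = synthetic.
    data = pd.concat([real, synthetic], ignore_index=True)
    labels = np.concatenate([
        np.zeros(len(real), dtype=int),
        np.ones(len(synthetic), dtype=int),
    ])

    # Assumption: keep columns whose fraction of non-missing values
    # reaches the threshold.
    data = data.loc[:, data.notna().mean() >= drop_na_threshold]
    # Assumption: numeric columns only, remaining gaps filled with the median.
    features = data.select_dtypes("number")
    features = features.fillna(features.median())

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = cross_val_score(clf, features, labels, cv=cv, scoring="roc_auc")
    return float(scores.mean())


rng = np.random.default_rng(0)
cols = ["a", "b", "c"]
real = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 3)), columns=cols)
close = pd.DataFrame(rng.normal(0.0, 1.0, size=(200, 3)), columns=cols)
far = pd.DataFrame(rng.normal(3.0, 1.0, size=(200, 3)), columns=cols)

auc_close = discriminator_auc_sketch(real, close)  # same distribution as real
auc_far = discriminator_auc_sketch(real, far)      # clearly shifted distribution
```

A score near 0.5 means the classifier cannot tell real from synthetic rows, which indicates high fidelity. A score near 1 means the two datasets are easy to separate.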

syndat.metrics.jensen_shannon_distance(real, synthetic, n_unique_threshold=10)

Computes the Jensen-Shannon distance (JSD) between the distributions of each feature in the real and synthetic datasets.

The Jensen-Shannon distance is a symmetric and finite measure of dissimilarity between two probability distributions. It ranges from 0 (identical distributions) to 1 (maximally different distributions). This function computes the JSD per column, handling categorical, ordinal, and numerical data differently:

For categorical columns, JSD is computed from the frequency counts of each category.
For ordinal columns (integer dtype with a number of unique values below a threshold), JSD is computed from binned counts.
For numerical columns, histograms with automatically determined bins are used to compute JSD.
If distributions are disjoint, the JSD is set to 1.

Parameters:

real (DataFrame) – DataFrame containing the real data.
synthetic (DataFrame) – DataFrame containing the synthetic data.
n_unique_threshold (int) – Maximum number of unique integer values for a column to be treated as ordinal.

Return type:

Dict[str, float]

Returns:

Dictionary mapping each column name to its Jensen-Shannon distance.
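The per-column computation described above can be sketched with NumPy and pandas. This is a minimal illustration under stated assumptions, not syndat's code: the helper names js_distance, categorical_jsd and numerical_jsd are hypothetical, base-2 logarithms are assumed so that the distance lies in [0, 1], and shared bin edges chosen by NumPy's "auto" rule stand in for the automatically determined bins.

```python
import numpy as np
import pandas as pd


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two count vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(divergence, 0.0)))


def categorical_jsd(real_col, synth_col):
    # Align frequency counts over the union of observed categories.
    categories = sorted(set(real_col.dropna()) | set(synth_col.dropna()))
    p = real_col.value_counts().reindex(categories, fill_value=0)
    q = synth_col.value_counts().reindex(categories, fill_value=0)
    return js_distance(p, q)


def numerical_jsd(real_col, synth_col):
    # Use shared bin edges so that both histograms are comparable.
    pooled = pd.concat([real_col, synth_col]).dropna()
    edges = np.histogram_bin_edges(pooled, bins="auto")
    p, _ = np.histogram(real_col.dropna(), bins=edges)
    q, _ = np.histogram(synth_col.dropna(), bins=edges)
    return js_distance(p, q)


real_cat = pd.Series(["a", "a", "b", "c"])
jsd_same = categorical_jsd(real_cat, pd.Series(["a", "b", "a", "c"]))
jsd_disjoint = categorical_jsd(real_cat, pd.Series(["x", "y", "y", "z"]))

rng = np.random.default_rng(0)
real_num = pd.Series(rng.normal(0.0, 1.0, 1000))
jsd_num_close = numerical_jsd(real_num, pd.Series(rng.normal(0.0, 1.0, 1000)))
jsd_num_shift = numerical_jsd(real_num, pd.Series(rng.normal(5.0, 1.0, 1000)))
```

In this sketch, identical category frequencies give a distance of 0, and distributions with no shared categories give a distance of 1, which matches the disjoint case described above.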

syndat.metrics.normalized_correlation_difference(real, synthetic, method='spearman', cat_cols=None, ignore_cat=False)

Measures how far the correlation structure of the synthetic data deviates from that of the real data by comparing the correlation matrices of both datasets. The score is calculated as the norm quotient of the difference between the correlation matrices of real and synthetic data. This quotient is 0 for equal correlation matrices and 1 for completely different correlation matrices. The quotient can in theory exceed 1 when the correlation structures oppose each other (e.g. only negative correlations in the real data, but only positive correlations in the synthetic data).

Parameters:

real (DataFrame) – The real data.
synthetic (DataFrame) – The synthetic data.
method (Literal['pearson', 'kendall', 'spearman']) – The method to use for correlation computation. One of “pearson”, “kendall” or “spearman”.
cat_cols (Optional[List[str]]) – List of categorical column names.
ignore_cat (bool) – Whether to ignore categorical columns.

Return type:

float

Returns:

The normalized correlation difference (norm quotient).
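A norm quotient of this kind can be sketched as follows. This reference does not state the exact normalization, so the example assumes one plausible choice: the Frobenius norm of the difference divided by the Frobenius norm of the real correlation matrix. The function name correlation_difference_sketch is hypothetical, and categorical columns are not handled.

```python
import numpy as np
import pandas as pd


def correlation_difference_sketch(real, synthetic, method="spearman"):
    """Illustrative only: ||C_real - C_synth||_F / ||C_real||_F (assumed form)."""
    corr_real = real.corr(method=method).to_numpy()
    # Reorder synthetic columns to match the real data before correlating.
    corr_synth = synthetic[real.columns].corr(method=method).to_numpy()
    diff_norm = np.linalg.norm(corr_real - corr_synth)  # Frobenius norm
    return float(diff_norm / np.linalg.norm(corr_real))


rng = np.random.default_rng(0)
x = rng.normal(size=300)
noise = rng.normal(scale=0.1, size=300)

real = pd.DataFrame({"x": x, "y": -x + noise})    # strong negative correlation
same = real.copy()                                # identical structure
flipped = pd.DataFrame({"x": x, "y": x + noise})  # strong positive correlation

score_same = correlation_difference_sketch(real, same)
score_flipped = correlation_difference_sketch(real, flipped)
```

Under this assumed normalization, identical data yields 0, while the flipped dataset yields a value above 1, illustrating the opposing-structure case mentioned above.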